What does gplex really do?

Oct 25, 2012 at 9:29 PM



I have an issue with the workflow of gplex. If I am not mistaken, the input is (for example) a string (like "foo bar 1597") and the output is a single int (see the yylex() method).


If anything, the output should be a sequence of ints (the ids of the tokens), so for the above example I would get 1 0 1 0 2 (meaning: word, ws, word, ws, number).


But even better, I should be able to specify a return type other than int, and work with some class MySymbol. Of course the output should then be a sequence of such objects as well.

Currently, as I understand it, the only way to achieve this is to manually add a mechanism that stores such a sequence, and in each rule, instead of something like "return new MySymbol(...);", write "sequence.add(new MySymbol(...)); return int_does_not_matter;". However, that is a lot of added work, and it is completely artificial.


So, please either tell me I am wrong and clarify the workflow, or, if I understood the workflow correctly, allow the user to specify the return type and return the sequence rather than a single value.

Oct 28, 2012 at 10:08 PM

Hi macias

GPLEX is a tool that supplies low-level code for scanners of various kinds.  The functionality of yylex() is not up for debate.  If yylex didn't return an int, then it wouldn't be equivalent to the traditional LEX.  Similarly, the overall architecture of the API belongs to a rather old tradition.

The essential core of GPLEX and all similar tools is to recognize sub-sequences of input symbols that match regular expressions, and perform the semantic actions specified for those patterns.  Some applications do not even call yylex, or only call it once.  See, for example, the wordcount example in the distribution.  This calls yylex only once per input file, and no semantic action even mentions "return XXX".
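To illustrate the "called only once" style, here is a rough sketch. It is written in Java rather than the C# that gplex generates, and it is not the wordcount example from the distribution: the class WordCountScanner, its hand-written matching loop, and the EOF value of -1 are all assumptions made up for illustration. The point is only that the actions update counters and never return, so a single call consumes the whole input.

```java
// Hypothetical stand-in for a generated scanner; NOT gplex output.
final class WordCountScanner {
    static final int EOF = -1; // assumed end-of-input code for this sketch

    private final String input;
    private int pos = 0;
    int words = 0;
    int lines = 0;

    WordCountScanner(String input) {
        this.input = input;
    }

    // Scans the whole input in one call. The "semantic actions" below only
    // update counters; none of them returns, so control comes back to the
    // caller only at end of input.
    int yylex() {
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isLetterOrDigit(c)) {
                // pattern: a run of letters/digits, action: count one word
                while (pos < input.length()
                        && Character.isLetterOrDigit(input.charAt(pos))) {
                    pos++;
                }
                words++;
            } else {
                if (c == '\n') {
                    lines++; // pattern: newline, action: count one line
                }
                pos++; // any other character: no action
            }
        }
        return EOF;
    }
}

final class WordCountDemo {
    public static void main(String[] args) {
        WordCountScanner s = new WordCountScanner("foo bar 1597\nbaz\n");
        s.yylex(); // a single call consumes the entire input
        System.out.println(s.words + " words, " + s.lines + " lines");
        // prints "4 words, 2 lines"
    }
}
```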

For the traditional use of lex-style scanners yylex is a bit primitive, but the framework is flexible enough to cater to even difficult cases like C# itself.  If you really want a sequence of token-information there are several ways of doing it.  My particular favorite is to define a semantic value type that encapsulates everything that the client needs.  This may include elements of the yylex return value, and the yylval and yylloc variables of the traditional API.  The definition and initialization of the sequence variable go in the definitions section; the creation of the instance value and the add statement go in the user-supplied epilog (see the documentation, Figure 11 on page 18).  Finally, the wrapper method is declared in the usercode section and returns the completed sequence.

Oct 29, 2012 at 12:24 PM

Thank you for your answer. Could you please give just a minimal outline how to acquire a sequence of "CustomToken" where CustomToken is a class (custom)?


So far, the only way I see this could work would be adding something like "LinkedList<CustomToken> sequence" in the header of the lex file, and then for each rule adding the code "sequence.Add(new CustomToken(...))" plus a dummy "return 0" to satisfy gplex. After calling yylex(), the sequence variable should be filled with CustomToken objects.


A little background: I am asking about this because I previously worked with JLex, and it allowed me to obtain a sequence of custom objects like CustomToken (the name is not important here) without any workarounds, so I could write rules like this:


<YYINITIAL>if                             { return new CustomToken(TokenConstants.IF); }

Oct 29, 2012 at 12:48 PM

A little more clarification. I have already written a lex file and a parser (for JLex and Java CUP), and now I would like to migrate to gplex and gppg. To be sure everything works, I would like to write just the lex part, lex my files, and compare the result with the existing lexer (built with JLex).


In other words, my lexer built with gplex has to work in standalone mode AND also with the parser. For now, I am working on the standalone mode.
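The comparison step could look roughly like the sketch below. It assumes that both lexers can dump their tokens as one string per token (the "WORD:foo" format, the class name TokenStreamDiff, and the sample data are invented for illustration, not anything JLex or gplex produces by itself), and it simply reports the first position where the two dumps disagree.

```java
import java.util.Arrays;
import java.util.List;

final class TokenStreamDiff {
    // Returns the index of the first position where the two streams differ,
    // or -1 if they are identical (same length and same tokens).
    static int firstMismatch(List<String> expected, List<String> actual) {
        int common = Math.min(expected.size(), actual.size());
        for (int i = 0; i < common; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i;
            }
        }
        // One stream is a strict prefix of the other: they differ at 'common'.
        return expected.size() == actual.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Hypothetical dumps for the input "foo bar 1597".
        List<String> fromOldLexer =
            Arrays.asList("WORD:foo", "WS", "WORD:bar", "WS", "NUMBER:1597");
        List<String> fromNewLexer =
            Arrays.asList("WORD:foo", "WS", "WORD:bar", "WS", "WORD:1597");
        System.out.println(firstMismatch(fromOldLexer, fromNewLexer)); // prints 4
    }
}
```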

Nov 2, 2012 at 8:49 AM

Hi Macias

I have put up a working example on the download page that creates a list of scanner result objects.  

It is not hard, but it is a little awkward.  If we could vary the type returned by Scan() or even yylex, then it would be much easier.  Unfortunately the return of an integer is deeply embedded in the design of the LEX - YACC (or in this case GPLEX, GPPG) interface.  I did something very similar in the early days of gplex when I needed to interface a gplex scanner to a COCO/R parser.  The parser needed to traverse the input multiple times, so building a sequence of token-information was the best way to go.

In my example there is a parser stub which supplies the definitions of the scan object type ScanObj, the token enumeration TokEnum, and the base type ScanBase that gplex scanners expect to sub-class.  This stub also supplies the Main method to test the scanner.

The scanner specification declares an instance field to hold the result list in the header.  In the User Code section it declares the initialization method and the wrapper that calls Scan() until it hits end of file.
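For readers without access to the download, the overall shape described above can be sketched as follows. This is not the posted example: it is a Java approximation (gplex generates C#), the names ScanObj, TokEnum and ScanBase are borrowed from the description, lowercase scan() stands in for Scan(), and ListScanner, scanAll(), init() and the hand-written matching code are assumptions replacing what a generated scanner would do.

```java
import java.util.ArrayList;
import java.util.List;

// Token kinds; the integer returned by scan() is the ordinal of one of these.
enum TokEnum { EOF, WS, WORD, NUMBER }

// The per-token result object handed to the client.
final class ScanObj {
    final TokEnum kind;
    final String text;
    final int start;

    ScanObj(TokEnum kind, String text, int start) {
        this.kind = kind;
        this.text = text;
        this.start = start;
    }

    @Override
    public String toString() {
        return kind + "(\"" + text + "\")@" + start;
    }
}

// Base type the scanner sub-classes; the int-returning method stays as it is.
abstract class ScanBase {
    abstract int scan();
}

// Hypothetical hand-written stand-in for a generated scanner.
final class ListScanner extends ScanBase {
    private final String input;
    private int pos;
    private int tokStart;
    private List<ScanObj> results; // instance field holding the result list

    ListScanner(String input) {
        this.input = input;
    }

    // Simplification for this sketch: anything that is not a letter or digit is WS.
    private static TokEnum classify(char c) {
        if (Character.isDigit(c)) return TokEnum.NUMBER;
        if (Character.isLetter(c)) return TokEnum.WORD;
        return TokEnum.WS;
    }

    // Matches one maximal run of same-class characters and returns its int code.
    @Override
    int scan() {
        tokStart = pos;
        if (pos >= input.length()) return TokEnum.EOF.ordinal();
        TokEnum kind = classify(input.charAt(pos));
        while (pos < input.length() && classify(input.charAt(pos)) == kind) {
            pos++;
        }
        return kind.ordinal();
    }

    // Initialization method: fresh list, scanning restarts at the beginning.
    private void init() {
        results = new ArrayList<>();
        pos = 0;
    }

    // Wrapper: calls scan() until end of file and returns the completed list.
    List<ScanObj> scanAll() {
        init();
        int code;
        while ((code = scan()) != TokEnum.EOF.ordinal()) {
            String text = input.substring(tokStart, pos);
            results.add(new ScanObj(TokEnum.values()[code], text, tokStart));
        }
        return results;
    }
}

final class ListScannerDemo {
    public static void main(String[] args) {
        for (ScanObj t : new ListScanner("foo bar 1597").scanAll()) {
            System.out.println(t);
        }
    }
}
```

The objects are built in one place (the wrapper loop) rather than in every rule, so the individual rules can keep returning plain integer codes.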

Hope this makes it clear.  Let me know if you have any other issues.


Nov 2, 2012 at 12:30 PM

Thank you very much for the example.


Btw, as for lex/yacc, I understand the need for backward compatibility, but nothing prevents you from adding new features (if you like, of course), such as an option to choose whether scanning and parsing are int-based or object-based. In the second mode, the user would work with a class such as Symbol (for example), with fields like ID (int), a placeholder for a custom object, and left and right (for example) for parsing needs.


Thus moving from the lex/yacc world would remain straightforward, but anyone building something from scratch could use more advanced (and more comfortable) means to do so.

Feb 11, 2015 at 5:19 PM
I am in the same situation as Macias, but I couldn't find the example mentioned above ("I have put up a working example on the download page that creates a list of scanner result objects.").
Could you provide a link?