I need to parse a "token-level" language, i.e. the input is already tokenized, with a semicolon as the delimiter. Sample input: `A;B;A;D0;ASSIGN;X;`. Here's also my grammar file.
I'd like to track location columns per token. For the previous example, here's how I'd like the columns to be defined:
```
Input: A;B;A;D0;ASSIGN;X;\n
Col:   1122334445555555666
```
So basically I'd like to increment the column every time a semicolon is hit. I made a function that increments a column count whenever a semicolon is hit, and in every action I set the column in `yylloc` to my custom column count. However, with this approach I have to copy-paste a function call into every action. Do you know of a cleaner way? Also, there will be no lexical errors in the input, since it's autogenerated.
Edit: Never mind, my solution actually doesn't work, so I'll be happy for any suggestions :)
```
%lex

%{
var delimit = (terminal) => { this.begin('delimit'); return terminal }

var columnInc = () => {
    if (yy.lastLine === undefined) yy.lastLine = -1
    if (yylloc.first_line !== yy.lastLine) {
        yy.lastLine = yylloc.first_line
        yy.columnCount = 0
    }
    yy.columnCount++
}

var setColumn = () => {
    yylloc.first_column = yylloc.last_column = yy.columnCount
}
%}

%x delimit

%%

// note: the setColumn() calls placed after `return` are never reached
"ASSIGN"              { return delimit('ASSIGN'); setColumn() }
"A"                   { return delimit('A'); setColumn() }
<delimit>{DELIMITER}  { columnInc(); this.popState(); setColumn() }
\n                    { setColumn() }
...
```
There are a few ways to accomplish this in jison-gho. Since you're looking to implement a token counter that is tracked by the parser, we invariably need to find a way to 'hook' into the code path where the lexer passes tokens to the parser.
Before we look at a few implementations, here are a few thoughts that may help others who are facing similar, yet slightly different, problems:
- **completely custom lexer for prepared token streams**: as your input is a set of tokens already, one might consider using a custom lexer which would just take the input stream as-is and do as little as possible while passing the tokens on to the parser. This is doable in jison-gho, and a fairly minimal example of such is demonstrated here:

  https://github.com/GerHobbelt/jison/blob/0.6.1-215/examples/documentation--custom-lexer-ULcase.jison

  while another way to integrate that same custom lexer is demonstrated here:

  https://github.com/GerHobbelt/jison/blob/0.6.1-215/examples/documentation--custom-lexer-ULcase-alt.jison

  Or you might want to include the code from an external file via a `%include "documentation--custom-lexer-ULcase.js"` statement. Anyway, I digress.

- **augmenting the jison lexer**: of course, another approach could be to augment all (or a limited set of) lexer rules' action code, where you modify `yytext` to pass this data to the parser. This is the classic approach from the days of yacc/bison. Indeed, `yytext` doesn't have to be a string, but can be anything, e.g. an object carrying the extra data (see the sketch right after this list). For this problem, however, that means a lot of code duplication and thus a maintenance horror.

- **hooking into the flow between parser and lexer**: this is new and facilitated by the jison-gho tool via the `pre_lex` and `post_lex` callbacks. (The same mechanism is available around the `parse()` call, so that you can initialize and postprocess a parser run any way you want: `pre_parse` and `post_parse` are for that.) Here, since we want to count tokens, the simplest approach is to use the `post_lex` hook, which is invoked only when the lexer has completely parsed yet another token and passes it to the parser. In other words: `post_lex` is executed at the very end of the `lex()` call in the parser.
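To illustrate the second option, here is a sketch (mine, not from the original gists) of a single augmented lexer rule; `yy.columnCount` mirrors the question's code and `'ASSIGN'` is one of its tokens:

```
// classic yacc/bison style: smuggle extra data through yytext by making it
// an object instead of a plain string. Every token-returning rule needs the
// same boilerplate -- hence the maintenance horror.
"ASSIGN"    %{
                yytext = {
                    text: yytext,            // the matched text itself
                    counter: yy.columnCount  // the custom token index
                };
                return 'ASSIGN';
            %}
```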
Do note that options 1 and 3 are not available in vanilla jison, with one remark about option 1: vanilla `jison` does not accept a custom lexer as part of the jison parser/lexer spec file as demonstrated in the example links above. Of course, you can always go around that and wrap the generated parser, injecting a custom lexer and doing other things from the outside, like this:
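(A minimal sketch, assuming `myCustomLexer` is your hand-rolled lexer object implementing at least `setInput()` and `lex()`:)

```
// wrap a generated vanilla-jison parser and inject a custom lexer:
var parser = require('./generated-parser').parser;

parser.lexer = myCustomLexer;   // hypothetical custom lexer object

var result = parser.parse('A;B;A;D0;ASSIGN;X;');
```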
## Implementing the token counter using `post_lex`

Now how does it look in actual practice?
## Solution 1: Let's do it nicely
We are going to 'abuse'/use (depending on your POV about riding on undocumented features) the `yylloc` info object and augment it with a counter member. We choose to do this so that we never risk interfering with (or getting interference from) the default text/line-oriented `yylloc` position tracking system in the lexer and parser.

Hooking into the lexer token output means we'll have to augment the lexer first, which we can easily do in the `%%` section before the `/lex` end-of-lexer-spec marker:
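(The original code block was lost in transit; below is a minimal reconstruction from the surrounding prose. The names `token_counter` and `reset_token_counter` are assumptions of mine.)

```
%%

// trailing lexer code: runs inside the generated lexer module,
// where the `lexer` instance is in scope.

var token_counter = 0;

lexer.post_lex = function () {
    // invoked at the very end of every lex() call, once per delivered token:
    token_counter++;
    this.yylloc.counter = token_counter;
};

// assumed helper so the parser can restart the count on every parse run:
lexer.reset_token_counter = function () {
    token_counter = 0;
};

/lex
```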
where the magic bit is this statement: `this.yylloc.counter = token_counter`.

We hook a `post_lex` callback into the flow by directly injecting it into the lexer definition via `lexer.post_lex = function () {...}`.

Now all we have to do is complete this with a tiny bit of `pre_parse` code to ensure that multiple `parser.parse(input)` invocations each restart with the token counter reset to zero:
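(Again a reconstruction, assuming the `pre_parse` callback receives the shared `yy` context and that the lexer is reachable as `yy.lexer`, as in generated jison parsers; `reset_token_counter` is the assumed helper from the lexer sketch above.)

```
// parser trailing code:
parser.pre_parse = function (yy) {
    // make every parse() run start counting tokens from zero again:
    yy.lexer.reset_token_counter();
};
```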
Of course, that bit has to be added to the parser's final code block, after the `%%` in the grammar-spec part of the jison file.

The full jison source file is available as a gist here.
How to compile and test:
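(The original commands were stripped; this is a sketch. The `.jison` filename is a placeholder, and the CLI may be named `jison` or `jison-gho` depending on how the fork is installed.)

```
# install the jison-gho fork, which provides the CLI:
npm install -g jison-gho

# generate the parser; --main also emits a main() entry point for quick testing:
jison --main token-counter-solution-1.jison

# run the generated module:
node token-counter-solution-1.js
```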
## Solution 2: Be a little nasty and re-use the `yylloc` column location info and tracking

Instead of using the line-info part of `yylloc`, I chose to use the column part, as to me that's about the same granularity level as a token sequence index. It doesn't matter which one you use, line or column, as long as you follow the same pattern.

When we do this right, we get the location tracking features of jison-gho added in for free, which is: column and line ranges for a grammar rule are automatically calculated from the individual token `yylloc` info, in such a way that the first/last members of `yylloc` will show the first and last column (pardon, token index) of the token sequence matched by the given grammar rule. This is the `classic,merge` jison-gho behaviour mentioned in the `--default-action` CLI option documentation.
fist_columnandlast_columnmembers ofyyllocinstead of adding a newcountermember, the magic bits that do the work remain nearly the same as in Solution 1:augmenting the lexer in its
%%section:We hook a
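(A reconstruction under the same assumptions as in Solution 1:)

```
%%

var token_counter = 0;

lexer.post_lex = function () {
    // store the token index in the column members of yylloc instead of
    // in a custom `counter` member:
    token_counter++;
    this.yylloc.first_column = token_counter;
    this.yylloc.last_column = token_counter;
};

lexer.reset_token_counter = function () {
    token_counter = 0;
};

/lex
```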
We hook a `post_lex` callback into the flow by directly injecting it into the lexer definition via `lexer.post_lex = function () {...}`.

Same as in Solution 1, all we have to do now is complete this with a tiny bit of `pre_parse` code to ensure that multiple `parser.parse(input)` invocations each restart with the token counter reset to zero:
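(Identical to the Solution 1 sketch:)

```
parser.pre_parse = function (yy) {
    yy.lexer.reset_token_counter();
};
```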
Of course, that bit has to be added to the parser's final code block, after the `%%` in the grammar-spec part of the jison file.

The full jison source file is available as a gist here.
How to compile and test:
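(Same sketch as for Solution 1, with another placeholder filename:)

```
jison --main token-counter-solution-2.jison
node token-counter-solution-2.js
```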
## Aftermath / Observations about the solutions provided
Observe the test verification data at the end of both those jison files provided for how the token index shows up in the parser output:
Solution 1 (stripped, partial) output:
Note here that the `counter` index is not really accurate for compound elements, i.e. elements which were constructed from multiple tokens matching one or more grammar rules: only the first token's index is kept.

Solution 2 fares much better in that regard:
Solution 2 (stripped, partial) output:
As you can see the
first_columnpluslast_columnmembers nicely track the set of tokens which constitute each part. (Note that the counter increment code implied we start counting with ONE(1), not ZERO(0)!)Parting thought
Given the input `A;B;A;D0;ASSIGN;X;SEMICOLON;`, the current grammar parses this like `ABA0 = X;`, and I wonder if that's what you really intend to get: constructing the identifier `ABA0` like that seems a little odd to me.

Alas, that's not relevant to your question. It's just me encountering something quite out of the ordinary here, that's all. No matter.
Cheers and hope this long blurb is helpful to more of us. :-)
Source files: