I'm trying to write a grammar for content in the RIS format with nearley.
Example file:
TY - JOUR
KW - foo
KW - bar
ER -
A *.ris file always starts with the tag TY and ends with the tag ER. In between there can be many other tags like KW (keyword).
The spec says that a single KW statement can span across multiple lines.
So this:
TY - JOUR
KW - foo
bar
baz
KW - bat
ER -
Is equivalent to:
TY - JOUR
KW - foo bar baz
KW - bat
ER -
I'm struggling to come up with a grammar that says something like:
A keyword starts with `KW` followed by `-` followed by either:
- letters until the end of the line
- letters until the end of the line and any other lines until the next keyword
Whatever I try ends up "swallowing" all other statements, e.g. the first multi-line keyword captures everything else after it.
How would you write this rule? I'm not necessarily interested in a nearley-specific answer. Anything that triggers my "Aha" moment will do!
I am definitely not very good at designing grammars (you probably figured that out) but this triggered my Aha moment:
See https://nearley.js.org/docs/how-to-grammar-good
And:
See https://nearley.js.org/docs/tokenizers
I know that nearley recommends using moo-lexer:
See https://nearley.js.org/docs/tokenizers
So I googled around and found this amazing tutorial on YouTube which definitely unblocked me. Thank you so much @airportyh!
At first I thought this was way too complicated for my use case but it turned out that using a lexer actually made things both possible and simpler!
For the sake of simplicity I will provide a solution with a truncated RIS file:
sample.ris
This file should yield `['foo bar baz', 'bat']` after parsing.
First let's install some stuff
Now let's define our lexer
lexer.js
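The lexer listing itself is not reproduced here. As a rough stand-in, the following dependency-free sketch shows what a lexer for this truncated input does; it is plain JavaScript rather than moo, and the patterns, the `tokenize` helper and the `sample` string are assumptions made for illustration, not the original `lexer.js`.

```javascript
// Illustrative stand-in for a lexer (assumption: not the original moo-based lexer.js).
// Each rule is [token type, sticky regex]; sticky regexes only match at lastIndex.
const rules = [
  ['NL', /\n/y],
  ['KW', /KW/y],
  ['SEPARATOR', / +- +/y], // assumed: spaces, a dash, spaces
  ['CONTENT', /[a-z]+/y],  // assumed: lowercase words only, as in the sample
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, regex] of rules) {
      regex.lastIndex = pos; // try this rule exactly at the current offset
      const m = regex.exec(input);
      if (m) {
        tokens.push({ type, value: m[0] });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Unexpected character at offset ${pos}`);
  }
  return tokens;
}

// Hypothetical truncated RIS input with one multi-line keyword.
const sample = 'KW - foo\nbar\nbaz\nKW - bat';
console.log(tokenize(sample).map((t) => t.type).join(' '));
```

The point of the exercise is that a continuation line and a new statement now look different in the token stream: one starts with `CONTENT`, the other with `KW`.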
We have defined four tokens:
- `NL`
- `KW` keyword ... keyword!
- `SEPARATOR` between a tag and its content
- `CONTENT` of the tag
Next let's define our grammar
grammar.ne
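The grammar listing is not reproduced here either. The rule shape it needs can be sketched as a tiny hand-written parser over the token stream. This is plain JavaScript, not nearley syntax, and `parseKeywords` and the hard-coded `tokens` array are hypothetical names used only for this illustration. In a nearley grammar the same tokens would be referenced as `%KW`, `%SEPARATOR`, `%CONTENT` and `%NL`.

```javascript
// Hypothetical token stream, shaped like lexer output ({ type, value }).
const tokens = [
  { type: 'KW', value: 'KW' },
  { type: 'SEPARATOR', value: ' - ' },
  { type: 'CONTENT', value: 'foo' },
  { type: 'NL', value: '\n' },
  { type: 'CONTENT', value: 'bar' },
  { type: 'NL', value: '\n' },
  { type: 'CONTENT', value: 'baz' },
  { type: 'NL', value: '\n' },
  { type: 'KW', value: 'KW' },
  { type: 'SEPARATOR', value: ' - ' },
  { type: 'CONTENT', value: 'bat' },
];

// Rules being mirrored (informal notation):
//   keywords -> keyword (NL keyword)*
//   keyword  -> KW SEPARATOR CONTENT (NL CONTENT)*
function parseKeywords(toks) {
  let i = 0;
  const peek = (offset) => (toks[i + offset] ? toks[i + offset].type : null);
  const expect = (type) => {
    const tok = toks[i];
    if (!tok || tok.type !== type) throw new Error(`Expected ${type} at token ${i}`);
    i += 1;
    return tok.value;
  };

  const parseKeyword = () => {
    expect('KW');
    expect('SEPARATOR');
    const parts = [expect('CONTENT')];
    // A continuation is NL followed by CONTENT; NL followed by KW ends this keyword.
    while (peek(0) === 'NL' && peek(1) === 'CONTENT') {
      expect('NL');
      parts.push(expect('CONTENT'));
    }
    return parts.join(' ');
  };

  const result = [parseKeyword()];
  while (peek(0) === 'NL') {
    expect('NL');
    result.push(parseKeyword());
  }
  if (i !== toks.length) throw new Error(`Unexpected token at ${i}`);
  return result;
}

console.log(parseKeywords(tokens));
```

Because a continuation is only ever `NL` followed by `CONTENT`, the first keyword can no longer swallow the statements that follow it.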
Note: see how we can refer to the tokens defined in the lexer by prefixing them with `%`!
Now we need to compile our grammar
Nearley ships with a compiler:
You can also define a `compile` script in your `package.json`:
Finally let's build a parser and use it!
Note: this requires the compiled grammar, i.e. `grammar.js`.
Let's throw some text at it:
Final tip: you can also test your grammar with `nearley-test`: