I have a (Lark) grammar that I think should be unambiguous, but it fails to parse one input or another depending on the version of Lark:
import lark
syntax = r"""
stmt: mov_stmt
| special_stmt
mov_stmt: reg ASSIGN (reg | const)
special_stmt: ("RS" SPECIAL_ASSIGN const)
reg: REG
const: DEC_NUM
REG.2: /R[0-7]|RS/
DEC_NUM: /0|[1-9]\d*/i
ASSIGN: "="
SPECIAL_ASSIGN: "&="
WS: /[ \t]+/
%ignore WS
"""
parser = lark.Lark(syntax, start="stmt", parser="lalr")
print(parser.parse("R3 = 7")) # 1. ok
print(parser.parse("R3 = R7")) # 2. ok
print(parser.parse("RS &= 1")) # 3. Fails on lark==1.1.9; expected special_stmt
print(parser.parse("RS = R7")) # 4. Fails on lark-parser==0.12.0; expected mov_stmt
With lark-parser==0.12.0, call 4 fails. I expect a mov_stmt, but the parser is expecting a SPECIAL_ASSIGN token, meaning it is matching special_stmt.
lark.exceptions.UnexpectedToken: Unexpected token Token('ASSIGN', '=') at line 1, column 4.
Expected one of:
* SPECIAL_ASSIGN
Previous tokens: [Token('RS', 'RS')]
With lark==1.1.9, the opposite happens and call 3 fails. I expect a special_stmt, but the parser is expecting an ASSIGN token, meaning it is matching mov_stmt.
lark.exceptions.UnexpectedToken: Unexpected token Token('SPECIAL_ASSIGN', '&=') at line 1, column 4.
Expected one of:
* ASSIGN
Previous tokens: [Token('REG', 'RS')]
In my mind, the grammar should be unambiguous. An = always means mov_stmt, and &= always means special_stmt (which only works for reg=RS).
How do I disambiguate this?
I tried assigning priorities to different terminals, to no effect.
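Giving RS its own terminal fixes the grammar.
It seems that when "RS" was part of the REG terminal regex and special_stmt used an "RS" literal, there was ambiguity: both terminals could match the same text, so the lexer had to pick one before the parser saw the operator. By defining a new SPECIAL_REG terminal (and including it in the reg rule), the lexer is able to properly disambiguate the statements.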