I'm attempting to write a grammar to parse numbers in English sentences, and I can successfully parse up to 999. Once I add in the logic to support the thousands place, I get a reduce parsing conflict, and I'm having a hard time understanding what's causing it.
I've attached a portion of the parser.out file generated by lemon, and I'm hoping someone can shed some light on the issue. I've also included a large portion of the grammar, where everything below the line works on it's own, but once I add in the logic for thousands above the line I begin running into issues.
My thought is that I'm running into an issue similar to the "dangling else" but with my separator. However, that normally manifests as a shift-reduce error, whereas it looks like I have just a reduce error. Lemon documentation is a bit sparse, and I'm not exactly sure how to read the parser.out file's contents. For example, in the line HYPHEN reduce 15 ** Parsing conflict **, what does the 15 even refer to?
Any help would be greatly appreciated!
Portion of my grammar file:
final_number(A) ::= one_to_999999(B).
final_number(A) ::= ZERO.
one_to_999999(A) ::= thousands(B) separator one_to_999(C).
one_to_999999(A) ::= thousands(B).
one_to_999999(A) ::= one_to_999(B).
thousands(A) ::= one_to_999(B) separator THOUSAND.
thousands(A) ::= THOUSAND.
/* -------------------------------------- */
one_to_999(A) ::= hundreds(B) separator one_to_99(C).
one_to_999(A) ::= hundreds(B).
one_to_999(A) ::= one_to_99(B).
one_to_99(A) ::= tens(B) separator one_to_9(C).
one_to_99(A) ::= tens(B).
one_to_99(A) ::= ten_to_19(B).
one_to_99(A) ::= one_to_9(B).
hundreds(A) ::= one_to_9(B) separator HUNDRED.
hundreds(A) ::= HUNDRED.
separator ::= WHITESPACE.
separator ::= HYPHEN.
separator ::= .
Portion of parser.out with an error:
State 5:
one_to_99 ::= tens * separator one_to_9
(15) one_to_99 ::= tens *
separator ::= * WHITESPACE
separator ::= * HYPHEN
(65) separator ::= *
$ reduce 15 one_to_99 ::= tens
THOUSAND reduce 15 one_to_99 ::= tens
WHITESPACE shift-reduce 63 separator ::= WHITESPACE
WHITESPACE reduce 15 ** Parsing conflict **
HYPHEN shift-reduce 64 separator ::= HYPHEN
HYPHEN reduce 15 ** Parsing conflict **
separator shift 4
{default} reduce 65 separator ::=
There isn't actually quite enough information here to diagnose the complete problem, but I think I can fill in the blanks.
What's indicated is that the problem is a state where the parser has recognized
tens(that would be "twenty", "thirty", ..., "ninety", right?) and it now needs aseparator(which might be optional). If the lookahead token is an actual separator, it has to decide whether to reducetensimmediately toone_to_99(as a prelude to completingone_to_999without a trailing digit) or shift theWHITESPACEorHYPHENcharacter in order to extendtenswith aseparatorand a single digit (one_to_9).The parser really cannot make that decision just looking at the separator token. It needs to know what follows (which might be, for example,
THOUSANDorONE, amongst other possibilities).This doesn't happen before you add thousands to the grammar, because without the possibility of
THOUSAND, if there is no single digit at the end of the number there is no separator following thetenstoken either. So if there is an explicit separator, then there must be a digit and thus the shift is required. Once you add theTHOUSANDoption, the existence of the separator token is no longer an adequate guide.Trying to explicitly match whitespace in a parser is similar to what is commonly known as "scannerless parsing", although that's not strictly the case here since you probably do actually have a scanner. However, the scanner is not doing its job properly; it is failing to drop tokens which have no syntactic value. While there are people who like scannerless parsing, it is generally agreed that it tends to increase lookahead requirements. [Note 1] Since you cannot increase the lookahead for a lemon parser (nor for many other yacc-based parser generators), scannerless parsing is problematic with such tools.
In this case, it is hard to see what you might gain by forcing the parser to deal with separators, and it is apparent what you have lost (LALR(1) parsability), so I'd suggest that you just drop whitespace and hyphens on the floor in your scanner, and remove them from your parser. You might argue that doing that will allow incorrect sentences such as
three hundred forty---two. That's true, but your current grammar allowsthree hundred-forty two(which is not correct in any style guide I've ever seen), and may forbidforty - two, depending on what pattern your scanner uses to recognize hyphens.If you want to be "hyphen-correct", by all means return hyphens (but not whitespace) from your scanner, and then accept them only where useful:
which will not produce any shift/reduce conflicts.
Notes