Why does Lark split names into separate characters?


I'm trying to parse a simple text like this:

test abc

Lark grammar is here:

start: test

test: "test" _WSI name _NL
name: (LETTER | DIGIT | "_")+

%import common.WS_INLINE -> _WSI
%import common.NEWLINE -> _NL
%import common.LETTER
%import common.DIGIT

Now if I print and pretty_print it, 'name' is split into separate tokens:

Tree(Token('RULE', 'start'), [Tree(Token('RULE', 'test'), [Tree(Token('RULE', 'name'), [Token('LETTER', 'a'), Token('LETTER', 'b'), Token('LETTER', 'c')])])])
start
  test
    name
      a
      b
      c

Why? I want to have that name as a string, not separate characters...

Deru On

What is happening

Lark uses different naming conventions for nonterminals (rules) and terminals. Rule names are written in lower-case, while terminal names are written in UPPER-CASE.

Because of this distinction, Lark reads name as a nonterminal (a rule) with the production:

(LETTER | DIGIT | "_")+

This will result in a tree structure as you have seen in your output:

    name
      a
      b
      c

How to solve

Since you are not interested in the tree structure and want a single token, what you need is a terminal rather than a rule. That means renaming name to NAME. Lark then matches the whole pattern as one terminal and returns the matched input as a single token. Applying this change gives the grammar:

start: test

test: "test" _WSI NAME _NL
NAME: (LETTER | DIGIT | "_")+

%import common.WS_INLINE -> _WSI
%import common.NEWLINE -> _NL
%import common.LETTER
%import common.DIGIT

Running this will result in the following output:

Tree(Token('RULE', 'start'), [Tree(Token('RULE', 'test'), [Token('NAME', 'abc')])])
start
  test    abc

This solves the issue you encountered.