Parsing formulas using Lark / EBNF

Question

171 Views Asked by VicVic At 30 May 2023 at 15:14

I am working on parsing formulas written in an internal syntax, using Lark. It's the first time I'm doing this, so please bear with me.

The formulas look something like this:

MEAN(1,SUM({T(F_01.01)R(0100)C(0100)S(AT)[T-1Y]},{T(F_01.01)R(0100,0120)C(0100)S(AT)[T-1Y]}))

In a first step I would like to convert the above into something like this:

MEAN(1,SUM(F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0120_c0100_sAT[T-1Y]))

Here is an example of the code:

from lark import Lark,Transformer

grammar = r"""

?start:
    | NUMBER
    | [symbols] datapoints ([symbols]+ datapoints)* [symbols]

?symbols.1:
    | /\+/
    | /\-/
    | /\//
    | /\*/
    | /\*\*/
    | /\,/
    | /\(/
    | /\)/
    | /\w+/

?datapoints.2:
           | "{" "T" "(" TABLE ")" [ "R" "(" ROW ")"] ["C" "(" COLUMN ")"] ["S" "(" SHEETS  ")"] [TIME_SHIFT] "}"   -> its_data_point
           | "{" "SPE.DPI" "(" CNAME ")" [TIME_SHIFT] "}"    -> ste_data_point

TIME_UNIT: "M" | "Q" | "Y"
TIME_SHIFT: /\[T\-/ INT TIME_UNIT /\]/ | /\[PYE\]/

TABLE: /[A-Z]{1}/ "_" (/\d{3}/ | /\d{2}/) "." /\d{2}/ ["." /[a-z]/]
ROW:  /\d{4}/ (/\,\d{4}/)*
COLUMN: /\d{4}/ (/\,\d{4}/)*
SHEETS: /[a-zA-T0-9_]+/ ("," /a-zA-T0-9_/)*

OTHER: /[a-zA-Z]+/

%import common.WS_INLINE
%import common.INT
%import common.CNAME
%import common.NUMBER

%ignore WS_INLINE

"""

sp = Lark(grammar)

class MyTransFormer(Transformer):

    def __init__(self):
        self.its_data_points = []

    def its_data_point(self,items):
        t,r,c,s,ts=items
        res = []
        for row in r.split(','):
            res.append(str(t)+'_r'+ str(row)+'_c'+str(c)+'_s'+str(s)+str(ts))
        self.its_data_points += res
        return ','.join(res)

    def __default_token__(self, token):
        return str(token.value)

    def __default__(self, data, children, meta):
        return ''.join(children)

teststr="MEAN(1,SUM({T(F_01.01)R(0100,0120)C(0100)S(AT)[T-1Y]},{T(F_01.01)R(0100)C(0100)S(AT)[T-1Y]}))"
tree = sp.parse(teststr)
mt = MyTransFormer()
print(mt.transform(tree))

But with this I get:

MEANMEAN(1,SUM(F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0120_c0100_sAT[T-1Y],F_01.01_r0100_c0100_sAT[T-1Y]))

Why do I get 'MEAN' twice?

**MegaIng** · Answer 1 · 2023-06-01T11:08:44.107000

The problem is that your grammar is written in such an ambiguous way that the default Lark ambiguity resolver gets messed up and duplicates terminals. That shouldn't happen from the library's point of view, and I think there is already an issue open for something like that.

However, there is a really simple fix: rewrite the grammar to be far less ambiguous:

?start: NUMBER
      | (symbols|datapoints)*

?!symbols: "+" | "-" | "*" | "**" | "," | "(" | ")" | /\w+/

?datapoints: "{" "T" "(" TABLE ")" [ "R" "(" ROW ")"] ["C" "(" COLUMN ")"] ["S" "(" SHEETS  ")"] [TIME_SHIFT] "}"   -> its_data_point
           | "{" "SPE.DPI" "(" CNAME ")" [TIME_SHIFT] "}"    -> ste_data_point

I removed the option for symbols and datapoints to be empty. For rules that otherwise have a fixed size, optionality is better expressed in the rule above, at the point of use, with an optional marker, i.e. ? or []. In addition, the combination of symbols and datapoints you had in the second line of start boils down to any combination of symbols and datapoints in any order. I'm not sure that is what you wanted, but simplified like this it gets parsed correctly.

You can see that the ambiguity is the problem by passing ambiguity="explicit" to the Lark constructor. The parse then doesn't complete, because it can't generate the millions of possibilities the original grammar allows.

I would suggest always aiming to write a grammar in such a way that parser='lalr' works. For the original grammar, that raises complaints about various ambiguities that you would then fix. That isn't always possible, but here it probably is.
