I am working on parsing formulas written in an internal syntax, using Lark. It's the first time I'm doing this, so please bear with me.
The formulas look something like this:
MEAN(1,SUM({T(F_01.01)R(0100)C(0100)S(AT)[T-1Y]},{T(F_01.01)R(0100,0120)C(0100)S(AT)[T-1Y]})))
In a first step I would like to convert the above into something like this:
MEAN(1,SUM(F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0120_c0100_sAT[T-1Y])))
Here is an example of the code:
from lark import Lark,Transformer
grammar = """
?start:
| NUMBER
| [symbols] datapoints ([symbols]+ datapoints)* [symbols]
?symbols.1:
| /\+/
| /\-/
| /\//
| /\*/
| /\*\*/
| /\,/
| /\(/
| /\)/
| /\w+/
?datapoints.2:
| "{" "T" "(" TABLE ")" [ "R" "(" ROW ")"] ["C" "(" COLUMN ")"] ["S" "(" SHEETS ")"] [TIME_SHIFT] "}" -> its_data_point
| "{" "SPE.DPI" "(" CNAME ")" [TIME_SHIFT] "}" -> ste_data_point
TIME_UNIT: "M" | "Q" | "Y"
TIME_SHIFT: /\[T\-/ INT TIME_UNIT /\]/ | /\[PYE\]/
TABLE: /[A-Z]{1}/ "_" (/\d{3}/ | /\d{2}/) "." /\d{2}/ ["." /[a-z]/]
ROW: /\d{4}/ (/\,\d{4}/)*
COLUMN: /\d{4}/ (/\,\d{4}/)*
SHEETS: /[a-zA-T0-9_]+/ ("," /a-zA-T0-9_/)*
OTHER: /[a-zA-Z]+/
%import common.WS_INLINE
%import common.INT
%import common.CNAME
%import common.NUMBER
%ignore WS_INLINE
"""
sp = Lark(grammar)
class MyTransFormer(Transformer):
    def __init__(self):
        self.its_data_points = []

    def its_data_point(self, items):
        t, r, c, s, ts = items
        res = []
        # one output name per row in the R(...) list
        for row in r.split(','):
            res.append(str(t) + '_r' + str(row) + '_c' + str(c) + '_s' + str(s) + str(ts))
        self.its_data_points += res
        return ','.join(res)

    def __default_token__(self, token):
        return str(token.value)

    def __default__(self, data, children, meta):
        return ''.join(children)
teststr="MEAN(1,SUM({T(F_01.01)R(0100,0120)C(0100)S(AT)[T-1Y]},{T(F_01.01)R(0100)C(0100)S(AT)[T-1Y]}))"
tree = sp.parse(teststr)
mt = MyTransFormer()
print(mt.transform(tree))
But with this I get:
MEANMEAN(1,SUM(F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0120_c0100_sAT[T-1Y],F_01.01_r0100_c0100_sAT[T-1Y]))
Why do I get 'MEAN' twice?
The problem is that your grammar is written in such an ambiguous way that the default Lark ambiguity resolver gets confused and duplicates terminals. That shouldn't happen from the library's point of view, and I think there is already an issue open for something like that.
However, there is a really simple fix: rewrite the grammar so that it is far less ambiguous.
I removed the possibility that `symbols` and `datapoints` could be empty. For rules that are otherwise of fixed size, this is better expressed in the rule above with an optional marker, i.e. `?` or `[]`. In addition, the combination of `symbols` and `datapoints` you had in the second line of `start` boils down to any combination of `symbols` and `datapoints` in any order. Not sure if that is what you wanted, but simplified like this it gets parsed correctly.
You can see that the ambiguity is the problem by passing `ambiguity="explicit"` to the `Lark` constructor. Then the parsing doesn't complete, because it can't generate the millions of possibilities the original grammar has.
I would suggest always aiming to write a grammar in such a way that `parser='lalr'` works. For the original one, that raises complaints about various ambiguities that you would have to fix. That isn't always possible, but here it probably is.