I have been trying to get my head around Scala's parser combinators. They seem pretty powerful, but the only tutorial examples I can find deal with mathematical expressions; there are very few real-world examples of parsing a DSL and mapping the result to different entities.
For the sake of this example, let's say my BNF has an entity named Model, which is written as a string like this: [model [name <name> ]]. This is a simplified example of a much larger BNF; in reality there are more entities.
So I defined my own class Model, which takes the name as a constructor argument, and then defined my own ModelParser object, which extends JavaTokenParsers. I then defined the following parsers, following the BNF (I know a simpler regex matcher would work, but I preferred to follow the BNF exactly for other reasons):
def model : Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ ( Model(_) )
def name : Parser[String] = (letter ~ (anyChar*)) ^^ {case text => text.toString()}
def anyChar = letter | digit | "_".r | "-".r
def letter = """[a-zA-Z]""".r
def digit = """\d""".r
The toString of Model looks like this:
override def toString : String = "[model " + name + "]"
When I run it on a string like [model [name helloWorld]], I get
[model [h~List(e, l, l, o, W, o, r, l, d)]]
instead of the [model helloWorld] I am expecting.
How do I join those individual characters back into the string they originally formed?
I am also confused about the individual parsers and the use of .r. Sometimes I saw examples that had just the following as a parser (to parse "hello"):
def hello = "hello"
Isn't that just a String? How on Earth did it suddenly become a parser that can be combined with other parsers? And what is the .r actually doing? I have read at least 3 tutorials but I am still totally lost about what is actually happening.
The problem is that anyChar* parses a List[String] (where in this case each string is a single character), and the result of calling toString on a list of strings is "List(...)", not the string you'd get by concatenating the contents. In addition, the case text => pattern is matching on the entire letter ~ (anyChar*), not just the anyChar* part.
It's possible to address both of these issues pretty straightforwardly:
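A minimal sketch of such a fix, assuming Model is a case class with the toString shown in the question and that the scala-parser-combinators module is on the classpath. rep(anyChar) is used here as the non-postfix spelling of anyChar*, and the explicit Parser[String] annotations are added only for readability:

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Assumed shape of the asker's Model class; only its toString appears in the question.
case class Model(name: String) {
  override def toString: String = "[model " + name + "]"
}

object ModelParser extends JavaTokenParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  // Destructure the result of `letter ~ rep(anyChar)` into its two halves,
  // then put the first character in front of the rest and join them into one string.
  def name: Parser[String] = letter ~ rep(anyChar) ^^ {
    case first ~ rest => (first :: rest).mkString
  }

  def anyChar: Parser[String] = letter | digit | "_".r | "-".r
  def letter: Parser[String] = """[a-zA-Z]""".r
  def digit: Parser[String] = """\d""".r
}
```

With these definitions, ModelParser.parseAll(ModelParser.model, "[model [name helloWorld]]") should succeed and produce a Model whose toString is [model helloWorld].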
We just prepend the first character string to the list of the rest, and then call mkString on the entire list, which will concatenate the contents. This works as expected: parsing [model [name helloWorld]] now yields [model helloWorld].
As you note, it would be possible (and possibly clearer and more performant) to let the regular expressions do more of the work:
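One way this could look, as a sketch rather than the only option: the whole name is matched by a single regex (a letter followed by any number of letters, digits, underscores, or hyphens). The object name RegexModelParser is made up for illustration, and Model is again assumed to be a case class with the toString from the question:

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Assumed shape of the asker's Model class; only its toString appears in the question.
case class Model(name: String) {
  override def toString: String = "[model " + name + "]"
}

// Hypothetical object name, used only for this example.
object RegexModelParser extends JavaTokenParsers {
  def model: Parser[Model] = "[model" ~> "[name" ~> name <~ "]]" ^^ (Model(_))

  // A single regex replaces letter, digit, anyChar and the mkString step:
  // the regex parser already returns the matched text as one String.
  def name: Parser[String] = """[a-zA-Z][a-zA-Z0-9_-]*""".r
}
```

One behavioural difference is worth keeping in mind: RegexParsers skips whitespace before every literal and regex parser by default, so a name built from rep(anyChar) can silently swallow spaces between characters, whereas the single regex has to match the name as one contiguous run.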
This example also illustrates the way that the parsing combinator library uses implicit conversions to cut down on some of the verbosity of writing parsers. As you say, def hello = "hello" defines a string, and "[a-zA-Z]+".r defines a Regex (via the r method on StringOps), but either can be used as a parser because RegexParsers defines implicit conversions from String (this one's named literal) and Regex (regex) to Parser[String].
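To make the conversions concrete, here is a small sketch that writes each parser both implicitly and explicitly. The object name ImplicitDemo and the parser names are hypothetical; literal and regex are the conversions from RegexParsers mentioned above:

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical object, used only to illustrate the implicit conversions.
object ImplicitDemo extends RegexParsers {
  // Implicit: the expected type Parser[String] triggers the `literal` conversion.
  def hello: Parser[String] = "hello"
  // The same parser with the conversion written out by hand.
  def helloExplicit: Parser[String] = literal("hello")

  // Implicit: `.r` builds a Regex, which `regex` then converts to a parser.
  def word: Parser[String] = "[a-zA-Z]+".r
  // The same parser with the conversion written out by hand.
  def wordExplicit: Parser[String] = regex("[a-zA-Z]+".r)

  // Without a type annotation this really is just a String...
  def plainHello = "hello"
  // ...but calling a combinator such as ~ on it converts it to a parser on the spot.
  def greeting: Parser[String ~ String] = plainHello ~ word
}
```

So a bare def hello = "hello" stays a String until it is used somewhere a parser is required, for example as the left operand of ~ inside an object that extends RegexParsers.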