I'm parsing SQL (Sqlite) using FParsec, and I'm running into issues with backtracking when parsing constructs like START_KEYWORD element_0, …, element_N END_KEYWORD. Since the parser for recognizing identifiers also recognizes keywords, I have to make sure that I try to parse the keyword first and only if that fails then I continue with the identifier. For that, I'm using the parser combinator many1Till, but it still seems like I cannot escape backtracking.
Here's an example of how I'm currently parsing an ORDER BY clause:
let sep1CommaEnd p endP =
let commaOrEnd = symbol S.Comma <|> followedBy endP
many1Till (p .>> commaOrEnd) (followedBy endP)
let orderBy =
parse {
do! keyword K.Order >>. keyword K.By
let! xs = sep1CommaEnd ident (keyword K.Asc <|> keyword K.Desc)
let! asc = (keyword K.Asc >>. preturn true <|> (keyword K.Desc >>. preturn false) <|> preturn true )
return { columns = xs; asc = asc }
}
This implementation has a few issues:
- It uses
followedBytwice insep1CommaEnd, which can cause backtracking. - It repeats parsing the keyword used for ending the construct, which is inefficient.
To solve these issues, I came up with the idea of creating a parser combinator called sepBy1Cont that allows me to:
- End right after the keyword used for ending is detected.
- Accumulate all results, including the one returned by the parser of the keyword used for ending.
Here's my implementation of
sepBy1Cont:
type Either<'a, 'b> =
| Left of 'a
| Right of 'b
let sepBy1Cont (p: Parser<'a, unit>) (sepEnd: Parser<Either<'b, 'c>, unit>) =
let rec loop (acc: 'a list) =
parse {
let! x = p
let! d = sepEnd
match d with
| Left _ -> return! loop (x :: acc)
| Right next ->
let xs = List.rev (x :: acc)
return (xs, next)
}
loop []
My questions are:
- Is there a better way to solve this problem in FParsec that I'm missing?
- If
sepBy1Contis a good solution, is there a better way to implement it? Should I learn about the internals of FParsec to improve it?
Answering your 1st question (in general): There are many builtin string parsers in FParsec that can / should be used for clarity and performance.
Given a pseudo-grammar that should be parsed:
START_KEYWORD element_0, …, element_N END_KEYWORDHere's an idea: