I am writing a parser for a markdown-like document format. I want to be able to match something like ^[some *formatted* text] as a footnote in my syntax definition. Here's a minimal example:
{- cabal:
build-depends: base, text, megaparsec, parser-combinators, hspec, hspec-megaparsec
-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Void (Void)
import Test.Hspec
import Test.Hspec.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer qualified as L
type Parser = Parsec Void Text
data Words
= PlainText Text
| BoldText Text
| MagicText [Words]
deriving (Show, Eq)
text_ :: Parser Words
text_ =
choice
[
MagicText <$> between (string "^[") (char ']') (manyTill (text_ <* optional space) (char ']')),
BoldText <$> between (char '*') (char '*') (takeWhile1P (Just "bold text") (/= '*')),
PlainText <$> takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n')
]
main :: IO ()
main = hspec $ do
context "for basic one-word-at-a-time input" $ do
it "parses plain text" $ parse text_ "" "hello" `shouldParse` PlainText "hello"
it "parses bold text" $ parse text_ "" "*hello*" `shouldParse` BoldText "hello"
context "parses nested \"MagicText\"" $ do
it "on it's own with just one word inside" $
parse text_ "" "^[hello]" `shouldParse` MagicText [PlainText "hello"]
it "on it's own with bold text inside" $
parse text_ "" "^[*hello*]" `shouldParse` MagicText [BoldText "hello"]
The last two test cases fail with the following errors:
~/sandbox > cabal run ParseBetween.hs
for basic one-word-at-a-time input
parses plain text [✔]
parses bold text [✔]
parses nested "MagicText"
on it's own with just one word inside [✘]
on it's own with bold text inside [✘]
Failures:
/home/gideon/sandbox/ParseBetween.hs:43:33:
1) parses nested "MagicText" on it's own with just one word inside
expected: MagicText [PlainText "hello"]
but parsing failed with error:
1:9:
|
1 | ^[hello]
| ^
unexpected end of input
expecting "^[", '*', ']', plain text, or white space
To rerun use: --match "/parses nested \"MagicText\"/on it's own with just one word inside/" --seed 100639639
/home/gideon/sandbox/ParseBetween.hs:46:35:
2) parses nested "MagicText" on it's own with bold text inside
expected: MagicText [BoldText "hello"]
but parsing failed with error:
1:11:
|
1 | ^[*hello*]
| ^
unexpected end of input
expecting ']'
To rerun use: --match "/parses nested \"MagicText\"/on it's own with bold text inside/" --seed 100639639
From the definition of manyTill_ I would expect it to match the ending ] first, and therefore not run into this "unexpected end-of-input" error, but I can't work out how to have this nested parsing behaviour in a way which works.
I can't see by inspection what's wrong with your bold-text example. But the problem with
"[hello]"is simple enough. You start parsingMagicText, which consumes the[and delegates totext_again, planning to consume a]afterwards. But the parser insidePlainTextdoesn't know it's supposed to leave behind a]character. It happily consumes all the way to the end of the string, because it never encounters one of its stop characters,' 'or'\n'. Then it completes, and theMagicTextabove it is upset it can't find its closing].A common way to handle problems like this is to have a grammar with more explicit separations of its concepts, encoded in a hierarchy. A
MagicTextdoesn't contain "any text, including magic, bold, or plain text": it includes "bold text or plain text". ABoldTextdoesn't contain "any text, including magic, bold, or plain text": it contains only plain text. AndPlainTextexplicitly rejects characters that would be treated as delimiters/metacharacters for the levels above it. Roughly like this: