Parsing inside `between` with Megaparsec

Question

Asked by GTF on 14 February 2024 at 17:18

I am writing a parser for a markdown-like document format. I want to be able to match something like ^[some *formatted* text] as a footnote in my syntax definition. Here's a minimal example:

{- cabal:
build-depends: base, text, megaparsec, parser-combinators, hspec, hspec-megaparsec
-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Test.Hspec
import Test.Hspec.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer qualified as L

type Parser = Parsec Void Text

data Words
  = PlainText Text
  | BoldText Text
  | MagicText [Words]
  deriving (Show, Eq)

text_ :: Parser Words
text_ =
  choice
    [
      MagicText <$> between (string "^[") (char ']') (manyTill (text_ <* optional space) (char ']')),
      BoldText <$> between (char '*') (char '*') (takeWhile1P (Just "bold text") (/= '*')),
      PlainText <$> takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n')
    ]

main :: IO ()
main = hspec $ do
  context "for basic one-word-at-a-time input" $ do
    it "parses plain text" $ parse text_ "" "hello" `shouldParse` PlainText "hello"
    it "parses bold text" $ parse text_ "" "*hello*" `shouldParse` BoldText "hello"

  context "parses nested \"MagicText\"" $ do
    it "on it's own with just one word inside" $
      parse text_ "" "^[hello]" `shouldParse` MagicText [PlainText "hello"]

    it "on it's own with bold text inside" $
      parse text_ "" "^[*hello*]" `shouldParse` MagicText [BoldText "hello"]

The last two test cases fail with the following errors:

~/sandbox > cabal run ParseBetween.hs

for basic one-word-at-a-time input
  parses plain text [✔]
  parses bold text [✔]
parses nested "MagicText"
  on it's own with just one word inside [✘]
  on it's own with bold text inside [✘]

Failures:

  /home/gideon/sandbox/ParseBetween.hs:43:33: 
  1) parses nested "MagicText" on it's own with just one word inside
       expected: MagicText [PlainText "hello"]
       but parsing failed with error:
         1:9:
           |
         1 | ^[hello]
           |         ^
         unexpected end of input
         expecting "^[", '*', ']', plain text, or white space

  To rerun use: --match "/parses nested \"MagicText\"/on it's own with just one word inside/" --seed 100639639

  /home/gideon/sandbox/ParseBetween.hs:46:35: 
  2) parses nested "MagicText" on it's own with bold text inside
       expected: MagicText [BoldText "hello"]
       but parsing failed with error:
         1:11:
           |
         1 | ^[*hello*]
           |           ^
         unexpected end of input
         expecting ']'

  To rerun use: --match "/parses nested \"MagicText\"/on it's own with bold text inside/" --seed 100639639

From the definition of `manyTill` I would expect it to match the ending `]` first and therefore never run into this "unexpected end of input" error, but I can't work out how to get this nested parsing behaviour to work.

There are 2 best solutions below

lsmor On 16 February 2024 at 09:29

Notice that this line can't work as expected:

MagicText <$> between (string "^[") (char ']') (manyTill (text_ <* optional space) (char ']'))

The reason is that `manyTill` already consumes the closing bracket, so nothing is left for `between` to match. Let me walk through it:

-- Below, when we say "consume" we mean that the token disappears from the input string.
-- This is how megaparsec (and most parsing libraries) use the word "consume". You can
-- think of each step in the parsing algorithm as returning a tuple of the result and the
-- rest of the string to parse.

-- We have 4 parsers. Let's name them p1, p2, p3 and p4
--                    |- p1: This parser consumes the string "^["
--                    |            |- p2: This parser consumes the string "]" 
--                    |            |         |- p3: Consumes recursively many times
--                    |            |         |                                   |- p4: This parser consumes the string "]"
myparser = between (string "^[") (char ']') (manyTill (text_ <* optional space) (char ']'))

Let's see how this works in practice, using the bold-text test case:

>>> parse myparser "" "^[*hello*]"
step 1 ->
  p1 consumes "^["
  result = "^["   -- discarded: `between a b c` only returns the result of `c`
  rest of string = "*hello*]"

step 2 ->
  p3 consumes whatever `text_` consumes until p4 succeeds
  result = BoldText "hello"
  rest of string = "]"

  step 2.1 ->  -- This step is executed within p3 and consumes(!!) the end token
    p4 consumes "]"
    result = [BoldText "hello"]
    rest of string = ""

step 3 ->
  p2 tries to consume "]"
  result = error!! The string left by step 2.1 is the empty string, hence p2 can't consume "]"

-- With "^[hello]" the failure comes even earlier: the plain-text rule only stops at ' ' or '\n',
-- so it swallows "hello]" whole and p4 never gets to see a "]".
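The double consumption can be reproduced in isolation. The following is a minimal sketch, not the asker's parser: `word`, `doubleClose` and `singleClose` are hypothetical names, and `word` is a simplified stand-in for `text_` that only accepts lowercase letters.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void Text

-- Hypothetical stand-in for `text_`: one or more lowercase letters.
word :: Parser Text
word = takeWhile1P (Just "letter") (`elem` ['a' .. 'z'])

-- `manyTill` consumes the ']' itself, so `between` never finds a second one.
doubleClose :: Parser [Text]
doubleClose = between (string "^[") (char ']') (manyTill (word <* space) (char ']'))

-- Dropping `between` lets `manyTill` own the closing bracket.
singleClose :: Parser [Text]
singleClose = string "^[" *> manyTill (word <* space) (char ']')

main :: IO ()
main = do
  parseTest doubleClose "^[hello world]" -- reports an error at the end of the input
  parseTest singleClose "^[hello world]" -- prints the two parsed words
```

Either combinator can own the closing bracket, but only one of them should.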

I haven't tested this, but I think you can drop `manyTill` and let `between` alone handle the closing bracket. Two adjustments are needed for that: `MagicText` takes a list, so `text_` has to be repeated with `many`, and the plain-text rule has to stop at `]` so that `between` can still find it.

-- Maybe this works.. haven't tested
text_ :: Parser Words
text_ =
  choice
    [
      MagicText <$> between (string "^[") (char ']') (many (text_ <* optional space)), -- only `between` consumes the closing ']'
      BoldText <$> between (char '*') (char '*') (takeWhile1P (Just "bold text") (/= '*')),
      PlainText <$> takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n' && c /= ']') -- also stop at ']'
    ]

amalloy (accepted answer) On 16 February 2024 at 01:16

I can't see by inspection what's wrong with your bold-text example. But the problem with "^[hello]" is simple enough. You start parsing MagicText, which consumes the ^[ and delegates to text_ again, planning to consume a ] afterwards. But the parser inside PlainText doesn't know it's supposed to leave behind a ] character. It happily consumes all the way to the end of the string, because it never encounters one of its stop characters, ' ' or '\n'. Then it completes, and the MagicText above it is upset that it can't find its closing ].
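The greedy behaviour is easy to observe by pairing the plain-text rule with whatever input it leaves behind. This is a small sketch under assumptions: `greedyPlain` copies the predicate from the question, while `bracketAwarePlain` and `withRest` are hypothetical helpers added for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec

type Parser = Parsec Void Text

-- The question's plain-text rule: stop only at a space or a newline.
greedyPlain :: Parser Text
greedyPlain = takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n')

-- Hypothetical variant that also stops at the closing bracket.
bracketAwarePlain :: Parser Text
bracketAwarePlain =
  takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n' && c /= ']')

-- Pair a parser's result with whatever input it left unconsumed.
withRest :: Parser Text -> Parser (Text, Text)
withRest p = (,) <$> p <*> takeRest

main :: IO ()
main = do
  parseTest (withRest greedyPlain) "hello]"       -- the ']' ends up in the first component
  parseTest (withRest bracketAwarePlain) "hello]" -- the ']' is left in the second component
```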

A common way to handle problems like this is to have a grammar with more explicit separations of its concepts, encoded in a hierarchy. A MagicText doesn't contain "any text, including magic, bold, or plain text": it includes "bold text or plain text". A BoldText doesn't contain "any text, including magic, bold, or plain text": it contains only plain text. And PlainText explicitly rejects characters that would be treated as delimiters/metacharacters for the levels above it. Roughly like this:

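text_ :: Parser Words
text_ =
  choice
    [
      MagicText <$> between (string "^[") (char ']') (nonMagicText `sepBy1` space),
      nonMagicText
    ]

nonMagicText :: Parser Words
nonMagicText =
  choice
    [
      BoldText <$> between (char '*') (char '*') plainText,
      PlainText <$> plainText
    ]

-- Plain text stops at every delimiter used by the levels above it.
-- The String annotation keeps the literal unambiguous under OverloadedStrings.
plainText :: Parser Text
plainText =
  takeWhile1P (Just "plaintext") (`notElem` ("*^[] \n" :: String))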
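To check the layered grammar against the footnote from the question, here is a self-contained sketch that runs it on `^[some *formatted* text]`. The `Words` type is copied from the question; the type signatures, the `String` annotation on the delimiter list, and the trailing `eof` are additions made for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void Text

data Words
  = PlainText Text
  | BoldText Text
  | MagicText [Words]
  deriving (Show, Eq)

-- Top level: a footnote, or anything that may appear inside one.
text_ :: Parser Words
text_ =
  choice
    [ MagicText <$> between (string "^[") (char ']') (nonMagicText `sepBy1` space)
    , nonMagicText
    ]

-- Middle level: bold or plain, never a nested footnote.
nonMagicText :: Parser Words
nonMagicText =
  choice
    [ BoldText <$> between (char '*') (char '*') plainText
    , PlainText <$> plainText
    ]

-- Bottom level: a run of characters that are not delimiters for the levels above.
plainText :: Parser Text
plainText = takeWhile1P (Just "plaintext") (`notElem` ("*^[] \n" :: String))

main :: IO ()
main = parseTest (text_ <* eof) "^[some *formatted* text]"
```

One consequence of this design is that bold text can no longer contain spaces, because `plainText` treats a space as a delimiter. If multi-word bold spans are needed, `BoldText` would need its own inner rule.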
