stream parsing xml without knowing the ordering of child tags

Question

stream parsing xml without knowing the ordering of child tags

212 Views Asked by epsilonhalbe At 10 November 2015 at 13:05

I have to parse some xml, i decided to use xml-conduit for that task and use the streaming part of it.

The structure of the xml is given by an xsd-file that contains elements and how often they may occur. But not in which order they are expected.

How do I parse all possible reorderings of children of an xml-structure using `Text.XML.Stream.Parse` ?

The problem

Say we have an xml description like

    Root 
    /  \
   A    B

then both <Root><A>atext</A><B>btext</B></Root> and <Root><B>btext</B><A>atext</A></Root> are valid instances of this xml-structure. But parsing in the streaming setup needs an ordering to succeed.

I thought of using something like parseRoot1 <|> parseRoot2 but then I have to implement the Alternative instance and write all the possibilities by hand, which I really don't want to.

Here is a minimal sample haskell program.

Example.hs

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE Rank2Types #-}

module Main where

import           Control.Exception
import           Control.Monad.Trans.Resource (MonadThrow)
import           Text.XML.Stream.Parse
import           Data.Monoid ((<>))
import           Data.Maybe
import           Data.Text (Text)
import           Data.XML.Types (Event)
import           Data.Conduit (ConduitM, Consumer, yield, ($=), ($$))

data Root = Root {a :: A, b :: B} deriving (Show, Eq)
data A = A Text deriving (Show, Eq)
data B = B Text deriving (Show, Eq)

ex1, ex2 :: Text
ex1 = "<Root>"<>
        "<A>Atest</A>"<>
        "<B>Btest</B>"<>
      "</Root>"
ex2 = "<Root>"<>
        "<B>Btest</B>"<>
        "<A>Atest</A>"<>
      "</Root>"

ex :: Root
ex = Root {a = A "Atest", b = B "Btest"}

parseA :: MonadThrow m => ConduitM Event o m (Maybe A)
parseA = tagIgnoreAttrs "A"
            $ do result <- content
                 return (A $ result)

parseB :: MonadThrow m => ConduitM Event o m (Maybe B)
parseB = tagIgnoreAttrs "B"
            $ do result <- content
                 return (B result)

parseRoot1 :: MonadThrow m => ConduitM Event o m (Maybe Root)
parseRoot1 = tagIgnoreAttrs "Root" $ do
                 a' <- fromMaybe (error "error parsing A") <$> parseA
                 b' <- fromMaybe (error "error parsing B") <$> parseB
                 return $ Root{a = a', b = b'}

parseRoot2 :: MonadThrow m => ConduitM Event o m (Maybe Root)
parseRoot2 = tagIgnoreAttrs "Root" $ do
                 b' <- fromMaybe (error "error parsing B") <$> parseB
                 a' <- fromMaybe (error "error parsing A") <$> parseA
                 return $ Root{a = a', b = b'}

parseTxt :: Consumer Event (Either SomeException) (Maybe a)
                          -> Text 
                          -> Either SomeException (Maybe a)
parseTxt p inTxt = yield inTxt
                  $= parseText' def
                  $$ p

main :: IO ()
main = do putStrLn "Poor Mans Test Suite"
          putStrLn "===================="
          putStrLn "test1 Root -> A - B " -- works
          print $ parseTxt parseRoot1 ex1
          putStrLn "test1 Root -> B - A " -- fails
          print $ parseTxt parseRoot1 ex2
          putStrLn "test2 Root -> A - B " -- fails
          print $ parseTxt parseRoot2 ex1
          putStrLn "test2 Root -> B - A " -- works again
          print $ parseTxt parseRoot2 ex2

note

example.cabal

[...]
  build-depends: base >=4.8 && <4.9
               , conduit
               , resourcet
               , text
               , xml-conduit
               , xml-types
[...]

Original Q&A

There are 1 best solutions below

**ErikR** · Accepted Answer · 2015-11-10T18:47:42.017000

Here's my idea...

First some definitions:

 {-# LANGUAGE OverloadedStrings, MultiWayIf #-}

 import Control.Monad.Trans.Resource
 import Data.Conduit
 import Data.Text (Text, unpack)
 import Data.XML.Types
 import Text.XML.Stream.Parse

 data SumType = A Text | B Text | C Text

We start with a conduit which accepts either an A or B tag, ignores the attributes and returns the name and content:

 parseAorB :: MonadThrow m => ConduitM Event o m (Maybe (Name, Text))
 parseAorB =
   tag (\n -> if (n == "A" || n == "B") then Just n else Nothing) -- accept either A or B
       (\n -> return n)                                           -- ignore attributes
       (\n -> do c <- content; return (n,c))                      -- extract content

Then we use that to write a conduit which parses two tags, makes sure that one is an A and the other is a B:

 parseAB :: MonadThrow m => ConduitM Event o m (Maybe (SumType, SumType))
 parseAB = do
   t1 <- parseAorB
   case t1 of
     Nothing      -> return Nothing
     Just (n1,c1) -> do
       t2 <- parseAorB
       case t2 of
         Nothing -> return Nothing
         Just (n2,c2) -> do
           if | "A" == n1 && "B" == n2 -> return $ Just (A c1, B c2)
              | "A" == n2 && "B" == n1 -> return $ Just (A c2, B c1)
              | otherwise              -> return Nothing

Update

You can reduce the boilerplate in parseAB by using the MaybeT transformer:

import Control.Monad.Trans.Maybe
import Control.Monad.Trans

parseAB' :: MonadThrow m => MaybeT (ConduitM Event o m) (SumType, SumType)
parseAB' = do
 (n1, c1) <- MaybeT parseAorB
 (n2, c2) <- MaybeT parseAorB
 if | "A" == n1 && "B" == n2 -> return (A c1, B c2)
    | "A" == n2 && "B" == n1 -> return (A c2, B c1)
    | otherwise              -> MaybeT $ return Nothing

And if you have several constructors, I'd consider doing something like this:

allkids = do
  kids <- many parseAorB
  let sorted = sort kids -- automatically sorts by name
  if map fst kids == [ "A", "B", "C", "D", "E", "F", "G", "H"]
    then let [ca, cb, cc, cd, ce, cf, cg, ch] = map snd kids
         in return (A ca, B cb, C cc, D cd, E ce, F cf, G cg, H ch)
    else ...error...

The many combinator comes from Tet.XML.Stream.Parse.

stream parsing xml without knowing the ordering of child tags

How do I parse all possible reorderings of children of an xml-structure using `Text.XML.Stream.Parse` ?

The problem

There are 1 best solutions below

Related Questions in XML

Related Questions in HASKELL

Related Questions in STREAM-PROCESSING

Related Questions in XML-CONDUIT

Trending Questions

Popular # Hahtags

Popular Questions

stream parsing xml without knowing the ordering of child tags

How do I parse all possible reorderings of children of an xml-structure using Text.XML.Stream.Parse ?

The problem

There are 1 best solutions below

Related Questions in XML

Related Questions in HASKELL

Related Questions in STREAM-PROCESSING

Related Questions in XML-CONDUIT

Trending Questions

Popular # Hahtags

Popular Questions

How do I parse all possible reorderings of children of an xml-structure using `Text.XML.Stream.Parse` ?