How can I extract the tags and content of children using xml-conduit?

122 Views Asked by At

UPDATE: After many different attempts I have for now concluded that the behaviour seen below is expected and that I am only running into difficulties because I am using ToJson later on. Will update if I solve it completely.

==============================================================

I am trying to use xml-conduit to achieve this:

  • go down the cursor of a document to the element of an html document.
  • treat any descendants of this node as text. i.e. don't only get the content as text, but also the tags themselves.

so for example, I edited this example as per my comments below

<ol> 
   <li class="c1"> 
      this is really "good work" <i>John</i>.
   </li>
</ol>

should return

"<li class="c1"> this is really "good work" <i>John</i> </li>" 

Whereas I'm getting

" "\u003cli class=\"c1\"\u003e this is really "good work"\u003ci\u003eJohn\u003c/i\u003e\u003c/li\u003e

So there is something going on within the tags that is different from what is happening to the content. I think perhaps the ToHtml instance isnt quite what i want really. I just want to collapse all descendants, turn their tags and content into text. I can't find a way though.

import Text.Blaze.Html                           (toHtml, preEscapedToHtml)
import Text.Blaze.Html.Renderer.Utf8             (renderHtml)
--import Text.Blaze.Html.Renderer.String           (renderHtml)
--import Text.Blaze.Html.Renderer.Text             (renderHtml)
import           Data.Text.Encoding              (decodeUtf8)
import           Data.ByteString.Lazy            (toStrict)


extractAllParas :: Cursor -> [ Text ]
extractAllParas c = do
  let mt' = c $/ laxElement "body"  &// laxElement "ol" &/ anyElement
      mt =  (map (decodeUtf8 . toStrict . renderHtml . toHtml . node) $ mt')
  case mt of
    [ "" ] -> []
    paras -> formatParas paras

formatParas :: [ Text ] -> [ Text ]

0

There are 0 best solutions below