UPDATE: After many different attempts, I have concluded for now that the behaviour seen below is expected, and that I am only running into difficulties because I am using ToJSON later on. I will update if I solve it completely.
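A minimal sketch of what I mean, assuming aeson's `encode` and `decode` (the helper name `roundTrip` is made up for illustration). The escaped form only seems to exist in the JSON encoding; whether `<` and `>` are printed as `\u003c` and `\u003e` may depend on the aeson version, which is an assumption on my part. Decoding gives back the original text either way.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Aeson (decode, encode)
import qualified Data.ByteString.Lazy.Char8 as BL
import           Data.Text (Text)

-- Hypothetical helper: encode a list of Text to JSON and decode it again.
roundTrip :: [Text] -> Maybe [Text]
roundTrip = decode . encode

-- Sample input taken from the example in this question.
sample :: [Text]
sample = ["<li class=\"c1\"> this is really \"good work\" <i>John</i>. </li>"]

main :: IO ()
main = do
  -- The JSON form; quotes are escaped here, and tags may be as well.
  BL.putStrLn (encode sample)
  -- Decoding restores the original text, so this prints True.
  print (roundTrip sample == Just sample)
```

So the text itself is not changed by the escaping; only its JSON representation is.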
==============================================================
I am trying to use xml-conduit to achieve this:
- walk a cursor down an HTML document to a particular element.
- treat all descendants of this node as text, i.e. get not only the content as text but also the tags themselves.
So, for example (I edited this example as per my comments below):
<ol> <li class="c1"> this is really "good work" <i>John</i>. </li> </ol>
should return
"<li class="c1"> this is really "good work" <i>John</i> </li>"
Whereas I'm getting
" "\u003cli class=\"c1\"\u003e this is really "good work"\u003ci\u003eJohn\u003c/i\u003e\u003c/li\u003e
So something is happening to the tags that is not happening to the content. I think the ToHtml instance may not be quite what I want. I just want to collapse all descendants and turn their tags and content into text, but I can't find a way to do it.
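For reference, here is a self-contained sketch of the rendering step on its own, without any JSON involved. It assumes the `ToMarkup` instance for `Node` that xml-conduit provides, and it uses the blaze `Text` renderer instead of the `Utf8` one, which avoids the `decodeUtf8 . toStrict` round trip. The names `renderNode`, `sampleDoc` and `rendered` are made up for illustration, and the input is parsed as XML with `parseText_`, which only works because this sample happens to be well-formed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import           Text.Blaze.Html (toHtml)
import           Text.Blaze.Html.Renderer.Text (renderHtml)
import           Text.XML (Document, def, parseText_)
import           Text.XML.Cursor

-- Hypothetical helper: render the node under a cursor, tags included, as strict Text.
renderNode :: Cursor -> Text
renderNode = TL.toStrict . renderHtml . toHtml . node

-- Sample input based on the example in this question (assumed to be well-formed XML).
sampleDoc :: Document
sampleDoc = parseText_ def
  "<html><body><ol> <li class=\"c1\"> this is really \"good work\" <i>John</i>. </li> </ol></body></html>"

-- Same traversal as in extractAllParas: every element child of an <ol> under <body>.
rendered :: [Text]
rendered = map renderNode
  (fromDocument sampleDoc $/ laxElement "body" &// laxElement "ol" &/ anyElement)

main :: IO ()
main = mapM_ (putStrLn . T.unpack) rendered
```

Printing the result directly like this, rather than through ToJSON, is a quick way to check what the rendered text actually contains.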
{-# LANGUAGE OverloadedStrings #-}
-- Text, Cursor and the axis combinators used below
import Data.Text (Text)
import Text.XML.Cursor
import Text.Blaze.Html (toHtml, preEscapedToHtml)
import Text.Blaze.Html.Renderer.Utf8 (renderHtml)
--import Text.Blaze.Html.Renderer.String (renderHtml)
--import Text.Blaze.Html.Renderer.Text (renderHtml)
import Data.Text.Encoding (decodeUtf8)
import Data.ByteString.Lazy (toStrict)
extractAllParas :: Cursor -> [ Text ]
extractAllParas c = do
    let mt' = c $/ laxElement "body" &// laxElement "ol" &/ anyElement
        mt = map (decodeUtf8 . toStrict . renderHtml . toHtml . node) mt'
    case mt of
        [ "" ] -> []
        paras -> formatParas paras
formatParas :: [ Text ] -> [ Text ]