spaCy displacy.render produces </br> tags which are not XML compliant


I am using spaCy for NER on various texts. The dataframe is being parsed into XML for storage and analysis in eXist-DB, and I want to take the visualizer results as HTML to store and show alongside it. So far so good. However, the generated HTML contains </br> tags, which are not well-formed XML and so are rejected by eXist-DB:

<!DOCTYPE html>
<html lang="xx">
    <head>
        <title>displaCy</title>
    </head>

    <body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: ltr">
<figure style="margin-bottom: 6rem">
<div class="entities" style="line-height: 2.5; direction: ltr"></br></br>Some text here </br> some more text  
<mark class="entity" style="background: #33ff82; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    more text
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">LOC</span>
</mark>
more text </div>
</figure>
</body>
</html>

I can write something to go through each HTML document and change the tags, but I wondered if there is any way to make displacy.render produce XML-compliant HTML from the start?


There are 1 best solutions below

Ralph Corrigan On

I've applied a simple (if inelegant) fix by running

import re
html = re.sub(r"</br>", "<br/>", html)

against my html before saving it. This works, but I would still like to know if there is anything I can apply to stop the </br> tag being created in the first place.
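For anyone wanting to reuse the workaround, it could be wrapped in a small helper along with a well-formedness check. This is only a sketch: the names fix_br_tags and is_well_formed are hypothetical (not part of spaCy), the sample string is a shortened version of the output shown in the question, and the standard-library XML parser stands in for whatever validation eXist-DB actually performs.

```python
import re
import xml.etree.ElementTree as ET

# Matches the malformed closing-style break tag, tolerating stray whitespace.
_BAD_BR = re.compile(r"</\s*br\s*>", re.IGNORECASE)


def fix_br_tags(html: str) -> str:
    """Replace every </br> with the self-closing <br/> form (hypothetical helper)."""
    return _BAD_BR.sub("<br/>", html)


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML with the standard-library parser."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Shortened sample modelled on the displaCy output in the question.
sample = '<div class="entities"></br></br>Some text here </br> some more text</div>'
fixed = fix_br_tags(sample)
```

In practice the string passed to fix_br_tags would be the return value of displacy.render(doc, style="ent", page=True), and is_well_formed could be called on the result before it is stored. This still only patches the output after the fact; it does not stop the tag from being generated.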