So I have a relatively complex XML encoding where the text can contain an arbitrary number of nested elements. Let's take this simplified example:
<div>
<p>-I like James <stage><hi>he said to her </hi></stage>, but I am not sure James understands <hi>Peter</hi>'s problems.</p>
</div>
I want to enclose all named entities in the sentence (the two instances of James and the one of Peter) with an rs element:
<div>
<p>-I like <rs>James</rs> <stage><hi>he said to her </hi></stage>, but I am not sure <rs>James</rs> understands <hi><rs>Peter</rs></hi>'s problems.</p>
</div>
To simplify this, let's say I have a list of names I could find in the text, such as:
names = ["James", "Peter", "Mary"]
I want to use lxml for this. I know I could use the etree.SubElement() and append a new element at the end of the p element, but I don't know how to deal with the tails and the other possible elements.
I understand that I need to handle the three references in my example differently.
- The first James is in the text of the p element. I could just do this:
p = etree.SubElement(div, "p")
p.text = "-I like <rs>James</rs>"
Right?
- The second James is in the tail of the stage element (inside p). I don't know how to deal with that.
- The reference to Peter is in the text of the hi element. I guess I have to iterate through all possible elements, look at both the text and the tail of each element, and search them for the named entities in my list.
rs = etree.SubElement(hi, "rs")
rs.text = "<rs>Peter</rs>"
My guess is that there is a much better way to handle all of this. Any help? Thanks in advance!
It's a little convoluted, but can be done.
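Most of the convolution comes from how lxml stores mixed content: each element keeps the text before its first child in .text and the text after its own closing tag in .tail. A quick way to see this on the sample from the question is sketched below. The fallback import is an assumption added so the snippet also runs without lxml installed (the standard library's ElementTree uses the same text/tail model), and the probe variable is just an illustrative name.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, same text/tail model
    import xml.etree.ElementTree as etree

xml = (
    "<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>"
)
div = etree.fromstring(xml)

# .text is what precedes the first child; .tail is what follows the
# element's own closing tag, up to the next sibling or the parent's end.
for el in div.iter():
    print(el.tag, repr(el.text), repr(el.tail))

# Assigning markup to .text does not create an element: the string is
# stored as plain text and escaped on serialization.
probe = etree.Element("p")
probe.text = "-I like <rs>James</rs>"
print(etree.tostring(probe, encoding="unicode"))
```

So the first James sits in p.text, the second in the tail of stage, and Peter in the text of the second hi. Writing tags into .text only produces escaped text, which is why new rs elements have to be created and spliced in.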
Let's say your XML looks like this:
I inserted another div, and added formatting for clarity. Note that this assumes that each <div> contains only one <p>; if that's not the case, it will have to be refined more.
In this case, the output should be:
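For the question's own one-line sample, the general idea (look at the text and the tail of every element and splice rs elements in where a name occurs) can be sketched as follows. This is only an illustration under stated assumptions, not necessarily the code this answer was built around: wrap_names, split_names and make_rs are hypothetical helper names, names are matched as whole words with a regular expression, and the import falls back to the standard library if lxml is missing. Because it walks every element, it should not depend on how many <p> elements a <div> holds.

```python
import re

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API subset used here
    import xml.etree.ElementTree as etree

names = ["James", "Peter", "Mary"]
# Whole-word match on any listed name (names are escaped for the regex).
NAME_RE = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")

def split_names(s):
    """Return (leading_text, [(name, text_after_name), ...]) for s."""
    parts = NAME_RE.split(s)  # [text, name, text, name, text, ...]
    return parts[0], list(zip(parts[1::2], parts[2::2]))

def make_rs(name, tail):
    """Build <rs>name</rs> followed by the given tail text."""
    rs = etree.Element("rs")
    rs.text = name
    rs.tail = tail or None
    return rs

def wrap_names(root):
    # Snapshot the elements first so freshly inserted <rs> are not revisited.
    for parent in list(root.iter("*")):
        children = list(parent)  # original children, before any insertion
        # Names in the element's own text become its first children.
        if parent.text:
            lead, hits = split_names(parent.text)
            if hits:
                parent.text = lead or None
                for i, (name, after) in enumerate(hits):
                    parent.insert(i, make_rs(name, after))
        # Names in a child's tail become siblings right after that child.
        for child in children:
            if not child.tail:
                continue
            lead, hits = split_names(child.tail)
            if hits:
                child.tail = lead or None
                pos = list(parent).index(child) + 1
                for offset, (name, after) in enumerate(hits):
                    parent.insert(pos + offset, make_rs(name, after))

xml = (
    "<div><p>-I like James <stage><hi>he said to her </hi></stage>, "
    "but I am not sure James understands <hi>Peter</hi>'s problems.</p></div>"
)
root = etree.fromstring(xml)
wrap_names(root)
out = etree.tostring(root, encoding="unicode")
print(out)
```

The key design point is that an inserted rs element carries the text that followed the name as its own tail, so the surrounding sentence is preserved. Text in the tail of the root element itself is not handled, since the root has no parent to insert into.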