Having an XML like this:
<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
<us-biblio>
<inv-title> Name </inv-title>
</us-biblio>
<drawings>
<figure id="01"> Alog
</figure>
</drawings>
<p id="p-0037"> </p>
</us-application>
<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
<us-biblio>
<inv-title> AnotherName </inv-title>
</us-biblio>
<drawings>
<figure id="01">SomeImg Sources</figure>
</drawings>
<p id="p-0037"> </p>
</us-application>
With the following code, I'm trying to retrieve "inv-title" of the document:
import pandas as pd
from lxml import etree
import xml.etree.ElementTree as ET
MiFile = 'C:\\Users\\TestFilePat.json'
parser = etree.XMLParser(recover=True, ns_clean=True, remove_blank_text=True, dtd_validation=True)
doc = etree.parse(MiFile, parser=parser)
root = doc.getroot()
for elm in root.findall('.//us-biblio'):
    for i in elm.getchildren():
        print(i.text)
I also tried with:
for child in root.findall('.//us-biblio'):
    print(child.find('inv-title').text)
But I'm only getting the first "inv-title" (Name) and I am missing the other value under the "inv-title" tag ("AnotherName"). According to the documentation, "findall" is supposed to find all matching elements in the document, but I can only get the first one. Any suggestions on how to retrieve all the elements under the tag?
Edit: It is worth mentioning that the file contains two separate XML documents, each with its own declaration, and that is the challenge I'm facing. I tried, unsuccessfully, to split it by reading line by line and skipping those declaration lines:
file1 = open(MiFile)
accumulate_xml = []
while True:
    line = file1.readline()
    if line:
        if line.startswith(('<?xml', '<!DOCTYPE')):
            if accumulate_xml:
                tree = ET.fromstring(''.join(accumulate_xml))
                for invent in tree.findall('.//us-biblio'):
                    csv_line = [invent.find('inv-title').text]
                accumulate_xml = []
        else:
            accumulate_xml.append(line.strip())
    else:
        break
file1.close()
I came up with the answer after struggling a lot. Basically what I did was some preprocessing:
These answers helped me preprocess my file: https://stackoverflow.com/a/28057753/6506593 and https://stackoverflow.com/a/5917395/6506593
The code, in case anybody wants to know my approach:
The preprocessing part in particular can be improved, but for now it works for me. Big thanks to @jack-fleeting. The problem is that my XML file contains thousands of <?xml version="1.0"> declarations, which leads to thousands of files.
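For what it's worth, the split can probably also be done in memory, so that no intermediate files are written. The following is only a minimal sketch and not the code referred to above. It assumes that every document starts with a line beginning with `<?xml`, that each `<!DOCTYPE ...>` fits on a single line, and that each document body is otherwise well-formed (the standard-library parser has no `recover` mode like lxml's). The names `iter_documents` and `extract_titles` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield the body of each concatenated XML document as one string.

    A line starting with '<?xml' marks the start of a new document.
    Declaration and '<!DOCTYPE' lines are dropped rather than parsed.
    """
    chunk = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('<?xml'):
            if chunk:
                yield '\n'.join(chunk)
                chunk = []
        elif stripped and not stripped.startswith('<!DOCTYPE'):
            chunk.append(stripped)
    if chunk:  # the last document has no declaration after it
        yield '\n'.join(chunk)


def extract_titles(lines, tag='inv-title'):
    """Parse every document in turn and collect the text of each `tag`."""
    titles = []
    for xml_text in iter_documents(lines):
        root = ET.fromstring(xml_text)
        for elem in root.iter(tag):
            titles.append((elem.text or '').strip())
    return titles


# Shortened version of the sample from the question.
sample = """<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
<us-biblio>
<inv-title> Name </inv-title>
</us-biblio>
</us-application>
<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
<us-biblio>
<inv-title> AnotherName </inv-title>
</us-biblio>
</us-application>
"""

titles = extract_titles(io.StringIO(sample))
print(titles)  # ['Name', 'AnotherName']
```

With a real file, an open file object can be passed instead of the `io.StringIO`, for example `with open(MiFile, encoding='utf-8') as fh: extract_titles(fh)`. Only one document is held in memory at a time, so thousands of declarations should not be a problem.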