Get data from multiple tags from XML file


Having an XML like this:

<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
    <us-biblio>
        <inv-title> Name </inv-title>
    </us-biblio>
    <drawings>
        <figure id="01"> Alog
        </figure>
    </drawings>
    <p id="p-0037"> </p>
</us-application>
<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
    <us-biblio>
        <inv-title> AnotherName </inv-title>
    </us-biblio>
    <drawings>
        <figure id="01">SomeImg Sources</figure>
    </drawings>
    <p id="p-0037"> </p>
</us-application>

With the following code, I'm trying to retrieve "inv-title" of the document:

import pandas as pd
from lxml import etree
import xml.etree.ElementTree as ET


MiFile = 'C:\\Users\\TestFilePat.json'

 
parser = etree.XMLParser(recover=True,ns_clean=True,remove_blank_text =True,dtd_validation=True)
doc = etree.parse(MiFile, parser=parser)

root = doc.getroot()

for elm in root.findall('.//us-biblio'):
    for i in elm.getchildren():
        print (i.text)

I also tried with:

for child in root.findall('.//us-biblio'):
    print(child.find('inv-title').text)

But I only get the first "inv-title" (Name) and miss the other value under the "inv-title" tag ("AnotherName"). According to the documentation, "findall" is supposed to find all matching elements in the document, but I only get the first one. Any suggestions on how to retrieve all the elements under the tag?

Edit: It is worth mentioning that the file contains two XML documents, each with its own declaration, and that is the challenge I'm facing. I tried unsuccessfully to split it by reading it line by line and skipping those declaration lines:

file1 = open(MiFile)
accumulate_xml=[]

while True:
    line = file1.readline()
    if line:
        if line.startswith(('<?xml','<!DOCTYPE')):
            if accumulate_xml:
                tree = ET.fromstring(''.join(accumulate_xml))
                for invent in tree.findall('.//us-biblio'):
                    csv_line = [invent.find('inv-title').text]
                accumulate_xml = []
        else:
            accumulate_xml.append(line.strip())
    else:
        break
file1.close() 

There are 2 best solutions below

0
Juliana Rivera On BEST ANSWER

I came up with the answer after struggling a lot. Basically what I did was some preprocessing:

  1. Delete junk lines
  2. Add a tag at the beginning and the end of the file
  3. Use the code with ET (I just made a small change)

Thanks to these answers, which helped me preprocess my file: https://stackoverflow.com/a/28057753/6506593 and https://stackoverflow.com/a/5917395/6506593

The code, in case anybody wants to know my approach:

import pandas as pd
from lxml import etree
import xml.etree.ElementTree as ET
import json 

MiFile = 'C:\\Users\\demelo\\Desktop\\TestFilePat.json'

##First preprocess: delete junk lines in xml

with open(MiFile, "r+") as f:
    d = f.readlines()
    f.seek(0)
    for i in d:
        if not i.startswith(("<!DOCTYPE us-application SYSTEM","<?xml version")):
            f.write(i)
    f.truncate()


##Second preprocess: wrap the content in a single <data>...</data> root
begTag = "<data>"
endTag = "</data>"

###Add </data> at the end of the file
with open(MiFile, 'a') as f:
    f.write('\n' + endTag)

###Add <data> at the beginning of the file
with open(MiFile, 'r+') as f:
    content = f.read()
    f.seek(0, 0)
    f.write(begTag + '\n' + content)

parser = etree.XMLParser(recover=True,ns_clean=True,remove_blank_text =True,dtd_validation=True)
doc = etree.parse(MiFile, parser=parser)


root = doc.getroot()



#Used when it has <data>...</data>
for child in root.findall('.//us-application/us-biblio'):
    print(child.tag)
    print(child.find('inv-title').text)

The preprocessing part in particular can be improved, but for now it works for me. Big thanks to @jack-fleeting. The problem with splitting is that my real file contains thousands of <?xml version="1.0"> declarations, which would mean ending up with thousands of files.
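
One possible way to improve the preprocessing is to do the same steps in memory, so the original file is never rewritten. The following is only a minimal sketch using the standard library's xml.etree.ElementTree instead of lxml; the function name extract_titles and the inline sample are made up for illustration, and it assumes every declaration and DOCTYPE sits on its own line, as in the question.

```python
import xml.etree.ElementTree as ET


def extract_titles(text):
    """Return every inv-title found in a string of concatenated XML documents."""
    # Drop the per-document declaration and DOCTYPE lines (the "junk lines").
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith(("<?xml", "<!DOCTYPE"))
    ]
    # Wrap what is left in one root element so the parser sees a single document.
    wrapped = "<data>\n" + "\n".join(kept) + "\n</data>"
    root = ET.fromstring(wrapped)
    # Collect the text of each title, tolerating empty elements.
    return [
        (el.text or "").strip()
        for el in root.findall("./us-application/us-biblio/inv-title")
    ]


# Hypothetical sample mirroring the structure shown in the question.
sample = """<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
    <us-biblio>
        <inv-title> Name </inv-title>
    </us-biblio>
</us-application>
<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
    <us-biblio>
        <inv-title> AnotherName </inv-title>
    </us-biblio>
</us-application>
"""

titles = extract_titles(sample)
print(titles)
```

For a real file, the text would come from reading the file once (for example with open(MiFile) as f: text = f.read()), which leaves the file on disk untouched.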

0
Jack Fleeting On

Try something along these lines:

from lxml import etree

# your_xml_file is assumed to hold the whole file content as a string
your_xml_file_2 = your_xml_file.replace('<?xml version="1.0">', 'xxx').split('xxx')
# this splits the content into 3 pieces, including an initial empty one
for targ_file in your_xml_file_2[1:]:
    # the "[1:]" skips the empty piece
    doc = etree.XML(targ_file)
    title = doc.xpath('//inv-title/text()')[0]
    print(title)

Output of your example:

 Name 
 AnotherName
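
If the file holds thousands of documents, the same splitting idea can be applied lazily, one document at a time, without creating separate files. This is only a sketch under a few assumptions: it uses the standard library parser rather than lxml, the helper name iter_documents is hypothetical, and each declaration and DOCTYPE is expected on its own line. Unlike the line-by-line attempt in the question, it also emits the last document once the input ends.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield the text of each XML document from an iterable of lines."""
    chunk = []
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("<?xml"):
            # A new declaration starts: emit the document collected so far.
            if chunk:
                yield "".join(chunk)
            chunk = []
        elif stripped and not stripped.startswith("<!DOCTYPE"):
            # Keep content lines; skip blank lines and the DOCTYPE line.
            chunk.append(line)
    # Emit the final document, which has no declaration following it.
    if chunk:
        yield "".join(chunk)


# Hypothetical sample standing in for an open file object.
sample = io.StringIO("""<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
    <us-biblio>
        <inv-title> Name </inv-title>
    </us-biblio>
</us-application>
<?xml version="1.0">
<!DOCTYPE us-application SYSTEM "us" []>
<us-application lang="ENG" dtd-version="v4">
    <us-biblio>
        <inv-title> AnotherName </inv-title>
    </us-biblio>
</us-application>
""")

titles = []
for doc in iter_documents(sample):
    root = ET.fromstring(doc)  # root is the <us-application> element
    titles.append(root.findtext("us-biblio/inv-title", default="").strip())

print(titles)
```

Because iter_documents accepts any iterable of lines, an open file object can be passed in directly, so only one document is held in memory at a time.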