Filter XML file using python minidom


I am trying to filter an XML file using python's minidom. I want to return a list of email addresses (<wd:Email_Address>) based on the criteria that the address is a WORK email address. I need to use the element <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID> to filter the email addresses. Below is the file:

<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schema.xmlsoap.org/soap/envelope/">
<env:Body>
    <wd:Get_Working_Response xmlns:wd="urn:com.workway/bsvc"
                             wd:version="v40.1">
        <wd:Request_Criteria>
            <wd:Transaction_Log_Criteria_Data>
            </wd:Transaction_Log_Criteria_Data>
            <wd:Field_And_Parameter_Criteria_Data>
            </wd:Field_And_Parameter_Criteria_Data>
            <wd:Eligibility_Criteria_Data>
            </wd:Eligibility_Criteria_Data>
        </wd:Request_Criteria>
        <wd:Response_Filter>
        </wd:Response_Filter>
        <wd:Response_Group>
        </wd:Response_Group>
        <wd:Response_Results>
        </wd:Response_Results>
        <wd:Response_Data>
            <wd:Worker>
                <wd:Worker_Reference>
                    <wd:ID wd:type="WID">787878787878787</wd:ID>
                    <wd:ID wd:type="Employee_ID">123456</wd:ID>
                </wd:Worker_Reference>
                <wd:Worker_Descriptor>John Smith</wd:Worker_Descriptor>
                <wd:Worker_Data>
                    <wd:Worker_ID>123456</wd:Worker_ID>
                    <wd:User_ID>jsmith</wd:User_ID>
                    <wd:Personal_Data>
                            <wd:Email_Address_Data>
                                <wd:Email_Address>[email protected]</wd:Email_Address>
                                <wd:Usage_Data wd:Public="0">
                                    <wd:Type_Data wd:Primary="1">
                                        <wd:Type_Reference>
                                            <wd:ID wd:type="WID">000000000000000</wd:ID>
                                            <wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
                                        </wd:Type_Reference>
                                    </wd:Type_Data>
                                </wd:Usage_Data>
                                <wd:Email_Reference>
                                    <wd:ID wd:type="WID">99999999999999999999999</wd:ID>
                                    <wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-3960</wd:ID>
                                </wd:Email_Reference>
                                <wd:ID>EMAIL_REFERENCE-3-3960</wd:ID>
                            </wd:Email_Address_Data>
                            <wd:Email_Address_Data>
                                <wd:Email_Address>[email protected]</wd:Email_Address>
                                <wd:Usage_Data wd:Public="1">
                                    <wd:Type_Data wd:Primary="1">
                                        <wd:Type_Reference>
                                            <wd:ID wd:type="WID">999999999999999999999999999</wd:ID>
                                            <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
                                        </wd:Type_Reference>
                                    </wd:Type_Data>
                                </wd:Usage_Data>
                                <wd:Email_Reference>
                                    <wd:ID wd:type="WID">999999999999999999999999</wd:ID>
                                    <wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-4017</wd:ID>
                                </wd:Email_Reference>
                                <wd:ID>EMAIL_REFERENCE-3-4017</wd:ID>
                            </wd:Email_Address_Data>
                            </wd:Personal_Data>
                    </wd:Worker_Data>   
            </wd:Worker>
        </wd:Response_Data>
    </wd:Get_Working_Response>
</env:Body>
</env:Envelope>

So far, I have been able to get a list (workelements) that contains the filtered DOM Elements for WORK. I think I need to use this to somehow filter the file and place results in a list (lNodesWithLevel2) that has only the Email_Address_Data elements for work emails. Once I have only these elements, I should be able to get the values of Email_Address. Any help would be much appreciated. I am open to using other libraries if that is easier. Here is what I have so far:

from xml.dom import minidom

xmlDoc = minidom.parse('XML_Example.xml')

workelements =[]
lNodesWithLevel1 = xmlDoc.getElementsByTagName('wd:ID')
for mynodes in lNodesWithLevel1:
    if mynodes.firstChild.nodeValue == 'WORK':
        workelements.append(mynodes)

lNodesWithLevel2 = [lNode for lNode in xmlDoc.getElementsByTagName('wd:Email_Address_Data')
                 if lNode.getElementsByTagName('wd:ID') == li]

Hermann12 On BEST ANSWER

With xml.dom.minidom you can do:

import xml.dom.minidom

xmlDoc = xml.dom.minidom.parse('XML_Example.xml')

business = []
for email in xmlDoc.getElementsByTagName("wd:Email_Address_Data"):
    # Reset for every block so a usage type from the previous block is never reused
    usage_type = None
    for t in email.getElementsByTagName("wd:ID"):
        if t.getAttribute("wd:type") == "Communication_Usage_Type_ID":
            usage_type = t.firstChild.nodeValue
    # Collect the address(es) only when this block is tagged as WORK
    if usage_type == "WORK":
        for m in email.getElementsByTagName("wd:Email_Address"):
            business.append(m.firstChild.nodeValue)

print("WORK EMAILs:", business) 

Output:

WORK EMAILs: ['[email protected]']
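
The question already collects the wd:ID nodes whose text is WORK, so another option is to climb from each of those nodes to its enclosing wd:Email_Address_Data via parentNode. This is only a sketch under assumptions: enclosing() and work_emails() are hypothetical helper names, SAMPLE is a trimmed-down stand-in for the question's file, and the addresses in it are placeholders.

```python
from xml.dom import minidom

# Trimmed-down stand-in for the question's XML; the addresses are placeholders.
SAMPLE = """<wd:Personal_Data xmlns:wd="urn:com.workway/bsvc">
  <wd:Email_Address_Data>
    <wd:Email_Address>home@example.com</wd:Email_Address>
    <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
      <wd:ID wd:type="WID">000</wd:ID>
      <wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
    </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
  </wd:Email_Address_Data>
  <wd:Email_Address_Data>
    <wd:Email_Address>work@example.com</wd:Email_Address>
    <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
      <wd:ID wd:type="WID">999</wd:ID>
      <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
    </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
  </wd:Email_Address_Data>
</wd:Personal_Data>"""


def enclosing(node, tag_name):
    """Return the nearest ancestor element named tag_name, or None."""
    parent = node.parentNode
    while parent is not None:
        if parent.nodeType == parent.ELEMENT_NODE and parent.tagName == tag_name:
            return parent
        parent = parent.parentNode
    return None


def work_emails(doc):
    """Collect addresses whose usage-type wd:ID has the text WORK."""
    emails = []
    for id_node in doc.getElementsByTagName("wd:ID"):
        # Only the usage-type ID matters; skip WID, Email_ID and untyped IDs.
        if id_node.getAttribute("wd:type") != "Communication_Usage_Type_ID":
            continue
        if id_node.firstChild is None or id_node.firstChild.nodeValue.strip() != "WORK":
            continue
        block = enclosing(id_node, "wd:Email_Address_Data")
        if block is None:
            continue
        for addr in block.getElementsByTagName("wd:Email_Address"):
            if addr.firstChild is not None:
                emails.append(addr.firstChild.nodeValue.strip())
    return emails


print(work_emails(minidom.parseString(SAMPLE)))
```

Checking the wd:type attribute as well as the text avoids matching some unrelated wd:ID that happens to contain WORK. For the real file, minidom.parse('XML_Example.xml') would replace minidom.parseString(SAMPLE).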

balderman On

I want to return a list of email addresses (<wd:Email_Address>) based on the criteria that the address is a WORK email address. I need to use the element <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID> to filter the email addresses.

Using ElementTree from the Python standard library:

import xml.etree.ElementTree as ET

xml_data = '''<env:Envelope xmlns:env="http://schema.xmlsoap.org/soap/envelope/">
<env:Body>
    <wd:Get_Working_Response xmlns:wd="urn:com.workway/bsvc"
                             wd:version="v40.1">
        <wd:Request_Criteria>
            <wd:Transaction_Log_Criteria_Data>
            </wd:Transaction_Log_Criteria_Data>
            <wd:Field_And_Parameter_Criteria_Data>
            </wd:Field_And_Parameter_Criteria_Data>
            <wd:Eligibility_Criteria_Data>
            </wd:Eligibility_Criteria_Data>
        </wd:Request_Criteria>
        <wd:Response_Filter>
        </wd:Response_Filter>
        <wd:Response_Group>
        </wd:Response_Group>
        <wd:Response_Results>
        </wd:Response_Results>
        <wd:Response_Data>
            <wd:Worker>
                <wd:Worker_Reference>
                    <wd:ID wd:type="WID">787878787878787</wd:ID>
                    <wd:ID wd:type="Employee_ID">123456</wd:ID>
                </wd:Worker_Reference>
                <wd:Worker_Descriptor>John Smith</wd:Worker_Descriptor>
                <wd:Worker_Data>
                    <wd:Worker_ID>123456</wd:Worker_ID>
                    <wd:User_ID>jsmith</wd:User_ID>
                    <wd:Personal_Data>
                            <wd:Email_Address_Data>
                                <wd:Email_Address>[email protected]</wd:Email_Address>
                                <wd:Usage_Data wd:Public="0">
                                    <wd:Type_Data wd:Primary="1">
                                        <wd:Type_Reference>
                                            <wd:ID wd:type="WID">000000000000000</wd:ID>
                                            <wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
                                        </wd:Type_Reference>
                                    </wd:Type_Data>
                                </wd:Usage_Data>
                                <wd:Email_Reference>
                                    <wd:ID wd:type="WID">99999999999999999999999</wd:ID>
                                    <wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-3960</wd:ID>
                                </wd:Email_Reference>
                                <wd:ID>EMAIL_REFERENCE-3-3960</wd:ID>
                            </wd:Email_Address_Data>
                            <wd:Email_Address_Data>
                                <wd:Email_Address>[email protected]</wd:Email_Address>
                                <wd:Usage_Data wd:Public="1">
                                    <wd:Type_Data wd:Primary="1">
                                        <wd:Type_Reference>
                                            <wd:ID wd:type="WID">999999999999999999999999999</wd:ID>
                                            <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
                                        </wd:Type_Reference>
                                    </wd:Type_Data>
                                </wd:Usage_Data>
                                <wd:Email_Reference>
                                    <wd:ID wd:type="WID">999999999999999999999999</wd:ID>
                                    <wd:ID wd:type="Email_ID">EMAIL_REFERENCE-3-4017</wd:ID>
                                </wd:Email_Reference>
                                <wd:ID>EMAIL_REFERENCE-3-4017</wd:ID>
                            </wd:Email_Address_Data>
                            </wd:Personal_Data>
                    </wd:Worker_Data>   
            </wd:Worker>
        </wd:Response_Data>
    </wd:Get_Working_Response>
</env:Body>
</env:Envelope>
'''

# Parse the XML data
root = ET.fromstring(xml_data)

# Namespace dictionary
ns = {'wd': 'urn:com.workway/bsvc'}

for email_elem_root in root.findall('.//wd:Email_Address_Data', ns):
    email = email_elem_root.find('./wd:Email_Address', ns).text
    should_collect: bool = email_elem_root.find('.//wd:ID[@wd:type="Communication_Usage_Type_ID"]', ns).text == 'WORK'
    if should_collect:
        print("Collecting Email:", email)
    else:
        print("Ignoring Email:", email)

Output:

Ignoring Email: [email protected]
Collecting Email: [email protected]
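
Since the question asks for a list, the same ElementTree lookups can be wrapped in a function that returns the addresses instead of printing them. This is a sketch, not a drop-in replacement: collect_emails() is a hypothetical name, SAMPLE is a shortened stand-in with placeholder addresses, and findtext() with a default is used so that a block missing either element is skipped rather than raising AttributeError.

```python
import xml.etree.ElementTree as ET

NS = {"wd": "urn:com.workway/bsvc"}
USAGE_PATH = './/wd:ID[@wd:type="Communication_Usage_Type_ID"]'

# Shortened stand-in for the question's XML; the addresses are placeholders.
SAMPLE = """<wd:Personal_Data xmlns:wd="urn:com.workway/bsvc">
  <wd:Email_Address_Data>
    <wd:Email_Address>home@example.com</wd:Email_Address>
    <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
      <wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
    </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
  </wd:Email_Address_Data>
  <wd:Email_Address_Data>
    <wd:Email_Address>work@example.com</wd:Email_Address>
    <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
      <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
    </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
  </wd:Email_Address_Data>
  <wd:Email_Address_Data>
    <wd:Email_Address>untyped@example.com</wd:Email_Address>
  </wd:Email_Address_Data>
</wd:Personal_Data>"""


def collect_emails(root, usage="WORK"):
    """Return the addresses whose usage type equals `usage`."""
    emails = []
    for block in root.findall(".//wd:Email_Address_Data", NS):
        # findtext() yields the default when the element is absent.
        address = block.findtext("wd:Email_Address", default="", namespaces=NS).strip()
        kind = block.findtext(USAGE_PATH, default="", namespaces=NS).strip()
        if address and kind == usage:
            emails.append(address)
    return emails


print(collect_emails(ET.fromstring(SAMPLE)))
```

The third block in SAMPLE has no usage type at all and is silently left out; whether that is the right policy for real data is an assumption to revisit.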

Daniel Haley On

Here's an example using lxml. If you're going to be working with XML, XPath is well worth the time it takes to learn.

from lxml import etree

tree = etree.parse("input.xml")
xpath = "//wd:Email_Address_Data[wd:Usage_Data//wd:ID[@wd:type='Communication_Usage_Type_ID']='WORK']/wd:Email_Address"
work_emails = [email_elem.text for email_elem in tree.xpath(xpath, namespaces={"wd": "urn:com.workway/bsvc"})]

print(work_emails)

This outputs:

['[email protected]']

Parfait On

If you prefer to stay with minidom only, consider running list/dict comprehensions over getElementsByTagName() calls to extract the email data of each worker:

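from pprint import pprint
from xml.dom import minidom

xmlDoc = minidom.parse('XML_Example.xml')

def get_nodes(el):
   # Map "<tagName><position>" to the stripped leading text of every descendant;
   # elements with no text child get an empty string instead of raising
   return {
      f"{t.tagName}{i}": (t.firstChild.nodeValue or "").strip() if t.firstChild else ""
      for i, t in enumerate(el.getElementsByTagName('*'))
   }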

worker_data = [
   {i: [get_nodes(e) for e in w.getElementsByTagName('wd:Email_Address_Data')]
   for i, w in enumerate(xmlDoc.getElementsByTagName('wd:Worker'))}
]


pprint(worker_data)
# [{0: [{'wd:Email_Address0': '[email protected]',
#       'wd:Email_Reference6': '',
#       'wd:ID4': '000000000000000',
#       'wd:ID5': 'HOME',
#       'wd:ID7': '99999999999999999999999',
#       'wd:ID8': 'EMAIL_REFERENCE-3-3960',
#       'wd:ID9': 'EMAIL_REFERENCE-3-3960',
#       'wd:Type_Data2': '',
#       'wd:Type_Reference3': '',
#       'wd:Usage_Data1': ''},
#      {'wd:Email_Address0': '[email protected]',
#       'wd:Email_Reference6': '',
#       'wd:ID4': '999999999999999999999999999',
#       'wd:ID5': 'WORK',
#       'wd:ID7': '999999999999999999999999',
#       'wd:ID8': 'EMAIL_REFERENCE-3-4017',
#       'wd:ID9': 'EMAIL_REFERENCE-3-4017',
#       'wd:Type_Data2': '',
#       'wd:Type_Reference3': '',
#       'wd:Usage_Data1': ''}]}]

And to retrieve every worker's work emails:

for wd in worker_data:
   for k, v in wd.items():
      for e in v:
          if 'WORK' in e.values():
              print("Worker: " + e["wd:Email_Address0"])

# Worker Email:[email protected]
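
The positional keys above (wd:ID5, wd:Email_Address0) depend on the exact element order, and 'WORK' in e.values() matches any element whose text is WORK. A variant that keys the result by wd:Worker_ID and checks the wd:type attribute explicitly could look like the sketch below. It assumes the document is parsed with minidom's default namespace handling so the *NS methods work; text_of(), has_usage() and work_emails_by_worker() are hypothetical names, and SAMPLE is a cut-down stand-in with placeholder addresses.

```python
from xml.dom import minidom

WD = "urn:com.workway/bsvc"  # namespace URI bound to the wd: prefix in the question

# Cut-down stand-in for the question's XML; the addresses are placeholders.
SAMPLE = """<wd:Response_Data xmlns:wd="urn:com.workway/bsvc">
  <wd:Worker>
    <wd:Worker_Data>
      <wd:Worker_ID>123456</wd:Worker_ID>
      <wd:Personal_Data>
        <wd:Email_Address_Data>
          <wd:Email_Address>home@example.com</wd:Email_Address>
          <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
            <wd:ID wd:type="Communication_Usage_Type_ID">HOME</wd:ID>
          </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
        </wd:Email_Address_Data>
        <wd:Email_Address_Data>
          <wd:Email_Address>work@example.com</wd:Email_Address>
          <wd:Usage_Data><wd:Type_Data><wd:Type_Reference>
            <wd:ID wd:type="Communication_Usage_Type_ID">WORK</wd:ID>
          </wd:Type_Reference></wd:Type_Data></wd:Usage_Data>
        </wd:Email_Address_Data>
      </wd:Personal_Data>
    </wd:Worker_Data>
  </wd:Worker>
</wd:Response_Data>"""


def text_of(node):
    """Join the direct text children of an element and strip whitespace."""
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE).strip()


def has_usage(block, usage):
    """True if the block holds a Communication_Usage_Type_ID equal to usage."""
    return any(
        i.getAttributeNS(WD, "type") == "Communication_Usage_Type_ID" and text_of(i) == usage
        for i in block.getElementsByTagNameNS(WD, "ID")
    )


def work_emails_by_worker(doc, usage="WORK"):
    """Map each Worker_ID to the list of its addresses with the given usage."""
    result = {}
    for worker in doc.getElementsByTagNameNS(WD, "Worker"):
        ids = worker.getElementsByTagNameNS(WD, "Worker_ID")
        key = text_of(ids[0]) if ids else None
        result[key] = [
            text_of(addr)
            for block in worker.getElementsByTagNameNS(WD, "Email_Address_Data")
            if has_usage(block, usage)
            for addr in block.getElementsByTagNameNS(WD, "Email_Address")
        ]
    return result


print(work_emails_by_worker(minidom.parseString(SAMPLE)))
```

Looking elements up by namespace URI rather than by the literal wd: prefix keeps the code working if a sender binds the same namespace to a different prefix.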
