BeautifulSoup and lxml modifying tags unexpectedly

44 Views Asked by At

I'm currently manipulating vendor-proprietary xml files with Python 3.11. I'm noticing that when processing the files, some of the tags are being modified for seemingly no reason. Here's a couple of examples:

<check_type:"Windows" version:"2"> is becoming <check_type: version:="">
<user:"Administrators"> is becoming <user:>
<file_acl:"ProgramFiles"> is becoming <file_acl:>

I've tried looking through the documentation but I'm not seeing anything that matches this behavior in either of these packages.

My original intent is to remove certain items tagged with <custom_item>. This is working as intended. Here's how I'm processing the data:

def build_audit_file(items, file):
    """
    Removes specified item from audit file.
    Params:
    items - list - passed-in items to remove
    file - str - passed-in filename of the audit file we're presently working
    """
    # Open specified audit file
    file = open(file, 'r')
    # Read data in from audit file
    file_data = file.read()
    # Close audit file gracefully to prevent corruption
    file.close()
    # Parse the data read from the file as HTML. XML doesn't work on the
    # audit files because of their formatting, but the data is actually XML.
    html_data = BeautifulSoup(file_data, "lxml")
    for item in items:
        # Remove check that matches the FP or exception name
        for tag in html_data.find_all(lambda t: t.name == "custom_item" and 
                                     item in t.text):
            tag.extract()
    # Convert html_data into a string so we can manipulate it while printing to 
    # a new file if needed in the future.
    new_audit_file_data = str(html_data)
    # Split the data apart at newlines so we can better parse it.
    new_audit_file_data = new_audit_file_data.split("\n")
    # Derive output file name from the provided base file we're using
    output_file_name = root_directory + "\\output\\XFP-" + file.name.split(
        "\\")[-1]
    # Create and open the output file
    output_file = open(output_file_name, "a")
    # Write lines to file, while removing injected html, p, and body tags
    for line in new_audit_file_data:
            if "<html><body><p>" in line:
                output_file.write(line.replace("<html><body><p>", "") + "\n")
            elif "</p></body></html>" in line:
                output_file.write(line.replace("</p></body></html>", "") + 
                                  "\n")
            else:
                output_file.write(line + "\n")
    # Close the output file gracefully
    output_file.close()

Any ideas how to ensure the tags other than <custom_item> are left alone?

0

There are 0 best solutions below