I'm currently manipulating vendor-proprietary xml files with Python 3.11. I'm noticing that when processing the files, some of the tags are being modified for seemingly no reason. Here's a couple of examples:
<check_type:"Windows" version:"2"> is becoming <check_type: version:="">
<user:"Administrators"> is becoming <user:>
<file_acl:"ProgramFiles"> is becoming <file_acl:>
I've tried looking through the documentation but I'm not seeing anything that matches this behavior in either of these packages.
My original intent is to remove certain items tagged with <custom_item>. This is working as intended. Here's how I'm processing the data:
def build_audit_file(items, file):
"""
Removes specified item from audit file.
Params:
items - list - passed-in items to remove
file - str - passed-in filename of the audit file we're presently working
"""
# Open specified audit file
file = open(file, 'r')
# Read data in from audit file
file_data = file.read()
# Close audit file gracefully to prevent corruption
file.close()
# Parse the data read from the file as HTML. XML doesn't work on the
# audit files because of their formatting, but the data is actually XML.
html_data = BeautifulSoup(file_data, "lxml")
for item in items:
# Remove check that matches the FP or exception name
for tag in html_data.find_all(lambda t: t.name == "custom_item" and
item in t.text):
tag.extract()
# Convert html_data into a string so we can manipulate it while printing to
# a new file if needed in the future.
new_audit_file_data = str(html_data)
# Split the data apart at newlines so we can better parse it.
new_audit_file_data = new_audit_file_data.split("\n")
# Derive output file name from the provided base file we're using
output_file_name = root_directory + "\\output\\XFP-" + file.name.split(
"\\")[-1]
# Create and open the output file
output_file = open(output_file_name, "a")
# Write lines to file, while removing injected html, p, and body tags
for line in new_audit_file_data:
if "<html><body><p>" in line:
output_file.write(line.replace("<html><body><p>", "") + "\n")
elif "</p></body></html>" in line:
output_file.write(line.replace("</p></body></html>", "") +
"\n")
else:
output_file.write(line + "\n")
# Close the output file gracefully
output_file.close()
Any ideas how to ensure the tags other than <custom_item> are left alone?