I am parsing a very large XML file, about 9 GB in size. I have tried the .iterparse method, which, from what I have gathered, is the recommended way to go about this task.
However, it seems to take too long, so I am now trying a multiprocessing approach where the elements of interest are parsed in separate processes.
I believe it used to be possible to call .iterparse('path_to_file.xml', events=("start", "end"), tag='some_tag') (possibly that was lxml's iterparse rather than the standard library's), but it does not look like the tag argument is supported anymore.
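For reference, what I mean is roughly the following, where the tag filter is emulated with an explicit check on element.tag (the file path and tag name are placeholders; a small in-memory document stands in for the real file here):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real 9 GB file.
xml_data = io.BytesIO(
    b"<root><other/><some_tag>a</some_tag><some_tag>b</some_tag></root>"
)

found = []
# The tag= filter can be emulated with a check on element.tag.
for event, element in ET.iterparse(xml_data, events=("start", "end")):
    if event == "end" and element.tag == "some_tag":
        found.append(element.text)

print(found)  # -> ['a', 'b']
```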
So, the approach I have come up with is this:

import xml.etree.ElementTree as ET

root = ET.parse('path_to_file.xml').getroot()
for element in root.iter('some_tag'):
    # do something with element
Is there a better way to go about this? From what I understand, this is a memory-intensive operation.
If there is no other way to do this, is there a way to free memory when using this approach, in the same way that we do element.clear() when using .iterparse?
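By element.clear() I mean the usual streaming idiom of freeing each element once it has been processed, roughly like this (again with a small in-memory document standing in for the real file):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real 9 GB file.
xml_data = io.BytesIO(
    b"<root><some_tag>a</some_tag><some_tag>b</some_tag></root>"
)

texts = []
# "end" fires once an element has been fully parsed.
for event, element in ET.iterparse(xml_data, events=("end",)):
    if element.tag == "some_tag":
        texts.append(element.text)
    element.clear()  # drop parsed children so memory stays bounded

print(texts)  # -> ['a', 'b']
```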