OOM when using iterparse on huge XML dump file

61 Views Asked by Celso França At 21 July 2023 at 16:56

I am reading the large Stack Overflow XML dump (Posts.xml, ~90 GB) with the following approach:

from xml.etree.cElementTree import iterparse

for evt, elem in iterparse("Posts.xml", events=('end',)):
    if elem.tag == 'row':
        user_fields = elem.attrib

This runs out of memory (OOM) just by iterating over the XML elements, without my code storing anything, even on a machine with 128 GB of RAM.

I could not find anything about this in the documentation or in other examples on Stack Overflow. How can I work around it?


There is 1 best solution below

Aldebaran On 21 July 2023 at 18:58 BEST ANSWER

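Based on Daniel Haley's comments: iterparse still builds the element tree in the background, so every parsed element stays in memory until you release it. You could try lxml, filter on the row tag, and clear each element once you are done with it: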

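from lxml.etree import iterparse  # use lxml instead of xml.etree

# tag="row" makes lxml emit events only for <row> elements
for evt, elem in iterparse("Posts.xml", events=('end',), tag="row"):
    user_fields = elem.attrib
    ...
    elem.clear()  # release the element's own contents
    # also drop already-processed siblings that the root still references
    while elem.getprevious() is not None:
        del elem.getparent()[0]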
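
If installing lxml is not an option, the same idea can probably be applied with the standard library alone. The following is only a minimal sketch, not part of the original answer: it assumes that every row is a direct child of the document root (as in Posts.xml), and the helper name iter_row_attributes and the in-memory sample data are made up for illustration. It captures the root from the first 'start' event and clears it after each row, so processed rows do not pile up under the root.

```python
import io
import xml.etree.ElementTree as ET


def iter_row_attributes(source, tag="row"):
    """Yield a dict of attributes for each <tag> element (hypothetical helper).

    Assumes the <tag> elements are direct children of the document root.
    """
    # Request 'start' events as well, so the root element can be captured.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield dict(elem.attrib)  # copy the attributes for the caller
            root.clear()  # detach the rows parsed so far from the root


# Tiny in-memory stand-in for Posts.xml (made-up sample data).
sample = io.BytesIO(
    b'<posts><row Id="1" PostTypeId="1"/><row Id="2" PostTypeId="2"/></posts>'
)
ids = [attrs["Id"] for attrs in iter_row_attributes(sample)]
print(ids)
```

For the real dump you would pass "Posts.xml" instead of the sample object. Clearing the root rather than only the current element matters because the root otherwise keeps a reference to every row it has seen.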
