
Parsing large XML file and inserting the data to MongoDB in the fastest way


I have a 5.5 GB XML file that needs to be parsed and inserted into MongoDB. The XML contains different classes/groups. A sample group:

<parent class="class1">
    <p key="value"/>        
</parent>
<parent class="class2">
    <p key="value"/>
    <p key1="value1"/>
    <p key2="value2"/>
</parent>

The classes are scattered throughout the file (not sequential or grouped). Each group should be inserted into a separate Mongo collection.

Solution:

  1. I used the lxml library in Python to parse the XML file, since it is memory-efficient.
  2. I loop through the entire ~5 GB file, grouping the elements by their unique class.
  3. The consolidated data for each class is stored in a separate collection in MongoDB.
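The steps above can be sketched with a streaming parse. This is a minimal sketch, not the asker's actual code: it uses the standard library's `xml.etree.ElementTree.iterparse` (lxml's `etree.iterparse` has the same interface and is typically faster), and it assumes the file has a single enclosing root element around the `<parent>` groups. The `batch_size` parameter and the `group_by_class` name are my own choices.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def group_by_class(xml_stream, batch_size=1000):
    """Stream <parent> elements and group their documents by class.

    Yields (class_name, batch_of_docs) tuples, so each batch can be
    handed to insert_many() on the matching collection.
    """
    batches = defaultdict(list)
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "parent":
            continue  # only act once a whole group is complete
        cls = elem.get("class")
        # One document per group: parent attributes + child <p> attributes.
        doc = {"class": cls,
               "children": [dict(p.attrib) for p in elem.findall("p")]}
        batches[cls].append(doc)
        elem.clear()  # release the parsed subtree; essential for multi-GB files
        if len(batches[cls]) >= batch_size:
            yield cls, batches.pop(cls)
    # flush any partially filled batches
    for cls, docs in batches.items():
        yield cls, docs
```

The key memory trick is `elem.clear()` after each completed group, so the tree never holds more than one group at a time; the batching keeps the MongoDB side to a few large `insert_many` calls per class instead of millions of single inserts.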

The above process takes approximately 1.5 hours: parsing ~1 hour 23 minutes, Mongo insertion ~7 minutes.

I want each class to have its own collection in MongoDB. For each group, the document should combine the parent attributes (e.g. class="class1") with the child elements (p), i.e. one document per parent+children group.
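For the first sample group, the target document might look like the sketch below. The field names (`class`, `children`) and the database/collection names in the comment are my assumptions, not something stated in the question:

```python
# Hypothetical MongoDB document for the <parent class="class1"> group:
doc_class1 = {
    "class": "class1",                 # parent attributes
    "children": [{"key": "value"}],    # one dict per child <p> element
}

# Each class maps to its own collection; with pymongo the insert would be
# roughly (connection string and db name are placeholders):
#   client = pymongo.MongoClient("mongodb://localhost:27017")
#   client["mydb"]["class1"].insert_many(batch_of_docs, ordered=False)
```

Using `ordered=False` lets the server insert documents in parallel and continue past individual failures, which usually speeds up large bulk loads.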

Are there any in-memory libraries I can use to speed up the overall processing time?
