I have some extremely large XML files that I need to parse, extracting the relevant data into a CSV file - essentially performing a partial flatten of the XML document. Each XML file will have a "record tag" under which all the records are stored. It will look a lot like this, for example:
<persons>
<person id="1">
<firstname>James</firstname>
<lastname>Smith</lastname>
<middlename></middlename>
<dob_year>1980</dob_year>
<dob_month>1</dob_month>
<gender>M</gender>
<salary currency="Euro">10000</salary>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname></lastname>
<middlename>Rose</middlename>
<dob_year>1990</dob_year>
<dob_month>6</dob_month>
<gender>M</gender>
<salary currency="Dollor">10000</salary>
</persons>
The record tag here is 'person', and the resulting transformation into CSV would look like this:
_id,dob_month,dob_year,firstname,gender,lastname,middlename,salary
1,1,1980,James,M,Smith,,{"_VALUE":10000,"_currency":"Euro"}
2,6,1990,Michael,M,,Rose,{"_VALUE":10000,"_currency":"Dollar"}
This might be incorrect - I quickly typed it - but you get the idea.
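For what it's worth, the attribute flattening itself seems straightforward; a helper along these lines would produce those cell values (a sketch only - toCell and the underscore-prefix convention are taken from the example output above, not from any library, and CSV escaping of the embedded commas and quotes is left out):

import java.util.Map;

// Flatten one element into a CSV cell: an element with attributes becomes
// a small JSON-ish object, a plain element stays a bare value.
static String toCell(String text, Map<String, String> attributes) {
    if (attributes.isEmpty()) {
        return text;
    }
    StringBuilder sb = new StringBuilder("{\"_VALUE\":").append(text);
    for (Map.Entry<String, String> attr : attributes.entrySet()) {
        sb.append(",\"_").append(attr.getKey()).append("\":\"")
          .append(attr.getValue()).append('"');
    }
    return sb.append('}').toString();
}

For example, toCell("10000", Map.of("currency", "Euro")) gives {"_VALUE":10000,"_currency":"Euro"}.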
There are some constraints to keep in mind:
- The file is very large - 1GB+ - so I can't load this into memory.
- I should only read this once.
- (Obviously) data cannot be lost or incorrect.
Okay, so I currently have a parser that will transform a simple XML file into a CSV given a "ROWTAG", which is the tag that contains the records (person in this example). You can see it here.
But it has some limitations. I know how to fix/address most of them, but there are two that I can't figure out without breaking the constraints.
- Based on how the parser is implemented, the order of the tags matters. Let us say I had an XML file that looks like this:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Jordan</lastname>
<middlename>Rose</middlename>
</person>
The middlename in the first record comes second, but in the second record it comes third. This would result in a CSV file that looks like this:
firstname,middlename,lastname
James,,Smith
Michael,Jordan,Rose
Michael's name was recorded as Michael Jordan Rose, when it should be Michael Rose Jordan.
- If tags are added, removed, or changed in later records, the program does not reflect that. This is because the program only takes its headers from the tags of the first record element, and does not care about the tags in subsequent elements (as was illustrated in example one).
Let us take this example:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Jordan</lastname>
<middlename>Rose</middlename>
<dob>1/10/11</dob>
</person>
The resulting CSV would look like this:
firstname,middlename,lastname
James,,Smith
Michael,Jordan,Rose,1/10/11
This is a big problem, of course, and it must be solved.
My sort of solution
Before I get to the solution, I'm going to quickly summarize how exactly the program works. The parser moves through the XML document and, every time it encounters a tag, it hands that tag back to my program. My program has a "rowTag" that, as I've explained, it looks for. Once that tag is encountered, my program starts collecting all the tags and values inside of this rowTag into a StringBuilder. It will then dump that information when it encounters the closing rowTag. During the first iteration it will also save all the headers it encounters; then, before it dumps the values of the first record once the end tag is reached, it will first dump the headers.
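To make that concrete, it's roughly a StAX-style pull loop like the sketch below. This is a simplification and not my exact code - the names are made up, and attribute handling is left out - but it shows the shape of it:

import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Simplified sketch of the current approach: stream the XML once and,
// inside each rowTag, collect tag names (first record only) and values
// into StringBuilders, flushing one CSV line at each closing rowTag.
static void xmlToCsv(InputStream in, String rowTag, Appendable out)
        throws XMLStreamException, IOException {
    XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
    StringBuilder headers = new StringBuilder();
    StringBuilder row = new StringBuilder();
    boolean inRow = false, firstRecord = true;
    String currentTag = null;
    while (reader.hasNext()) {
        switch (reader.next()) {
            case XMLStreamConstants.START_ELEMENT:
                if (reader.getLocalName().equals(rowTag)) {
                    inRow = true;
                } else if (inRow) {
                    currentTag = reader.getLocalName();
                    if (firstRecord) headers.append(currentTag).append(',');
                }
                break;
            case XMLStreamConstants.CHARACTERS:
                if (inRow && currentTag != null) row.append(reader.getText());
                break;
            case XMLStreamConstants.END_ELEMENT:
                if (reader.getLocalName().equals(rowTag)) {
                    // headers are dumped once, right before the first record
                    if (firstRecord) {
                        out.append(headers).append('\n');
                        firstRecord = false;
                    }
                    out.append(row).append('\n');
                    row.setLength(0);
                    inRow = false;
                } else if (inRow) {
                    row.append(','); // close this column
                    currentTag = null;
                }
                break;
        }
    }
}

As written, this emits values in whatever order they are encountered and takes its headers only from record one - which is exactly where the two limitations above come from.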
Now... as I mentioned, this creates a problem with preserving the order and with any changed, removed, or added tags. I have a solution that fixes the ordering, and it should also fix the headers not being updated, but I'm not sure if it's feasible for my use case (I'll explain why in a minute).
My idea is to have something like a hashmap that collects the tags and the order in which they are encountered over time. The key would be the tag's name, and the value would be the position of the tag's first appearance.
As we collect each record while moving through the document, we would place its values into an array as large as the hashmap, each value in the slot the hashmap dictates. If we encounter a new tag, we would simply resize the array and add the tag to the hashmap - NOT with the position at which it was currently encountered (because that is likely to overwrite something), but rather with the last element's position + 1 (this is going to be an ordered hashmap, so I would know what the last element is).
Once we are completely done with the program, we would dump the headers collected to the first line of the file.
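In Java terms, I'm picturing something like the sketch below (assumptions: a LinkedHashMap so the final header dump preserves first-seen order, and each record reduced to a name-to-value map; none of these names are from my actual code):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the ordered-map idea: headerIndex maps a tag name to its
// column, assigned on first sight and never changed afterwards.
static final Map<String, Integer> headerIndex = new LinkedHashMap<>();

static int columnFor(String tag) {
    Integer col = headerIndex.get(tag);
    if (col == null) {
        // A new tag gets index = current size, i.e. last column + 1,
        // so existing columns are never overwritten.
        col = headerIndex.size();
        headerIndex.put(tag, col);
    }
    return col;
}

// Per record: place each value at its column, growing the row as needed.
static String toCsvLine(Map<String, String> record) {
    List<String> row = new ArrayList<>();
    for (Map.Entry<String, String> e : record.entrySet()) {
        int col = columnFor(e.getKey());
        while (row.size() <= col) row.add(""); // absent tags stay empty
        row.set(col, e.getValue());
    }
    return String.join(",", row);
}

At the very end, the header line would just be String.join(",", headerIndex.keySet()).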
So, let us look at the first example:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Jordan</lastname>
<middlename>Rose</middlename>
</person>
The hashmap, after running the first time, would look like this:
{firstname: 0, middlename: 1, lastname: 2}
When we get to the second record, where middlename and lastname's positions are switched, we would simply place lastname in the array[2] slot, and middlename in the array[1] slot.
If a tag is missing from a record, for example like this:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Rose</lastname>
</person>
During the second record, the array would hold null in the second slot (because middlename isn't there), which we would just convert to an empty string, and a comma would be appended there like normal.
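In code, that flush step might look something like this (assuming row is the resizable array just described; just a sketch):

// Flush one record: nulls (tags absent from this record) become empty
// cells so the column positions stay aligned.
static String flushRow(String[] row) {
    StringBuilder line = new StringBuilder();
    for (int i = 0; i < row.length; i++) {
        if (i > 0) line.append(',');
        line.append(row[i] == null ? "" : row[i]);
    }
    return line.toString();
}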
The interesting part is when something is added:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Rose</lastname>
<dob>1/10/11</dob>
</person>
This would result in a CSV that looks like this:
firstname,middlename,lastname,dob
James,,Smith
Michael,,Rose,1/10/11
The first row doesn't have that extra , after Smith, but it seems that even though that's not strictly valid CSV, it's fine? Cool.
Anyways - now comes the real issue with this, I think. I'm not actually using BufferedReader/BufferedWriter in Java. We're using the stream reader/writer that comes with the Azure SDK, because of course all these files are in the cloud, and under the hood it's basically just REST API calls. So I don't think I'd be able to go back and dump the collected headers to the first line of the file once parsing is done. I'm not even sure that would have been possible regardless.
So. Any geniuses out there that have any ideas?