I have some extremely large XML files that I need to parse, extracting the relevant data into a CSV file - essentially performing a partial flatten of the XML document. Each XML file will have a "record tag" under which all the records are stored. It will look a lot like this, for example:
<persons>
<person id="1">
<firstname>James</firstname>
<lastname>Smith</lastname>
<middlename></middlename>
<dob_year>1980</dob_year>
<dob_month>1</dob_month>
<gender>M</gender>
<salary currency="Euro">10000</salary>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname></lastname>
<middlename>Rose</middlename>
<dob_year>1990</dob_year>
<dob_month>6</dob_month>
<gender>M</gender>
<salary currency="Dollor">10000</salary>
</persons>
The record tag here is 'person', and the resulting transformation into CSV would look like this:
_id,dob_month,dob_year,firstname,gender,lastname,middlename,salary
1,1,1980,James,M,Smith,,{"_VALUE":10000,"_currency":"Euro"}
2,6,1990,Michael,M,,Rose,{"_VALUE":10000,"_currency":"Dollar"}
This might be incorrect - I quickly typed it - but you get the idea.
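For what it's worth, the attribute flattening itself seems straightforward; a helper along these lines would produce those cell values (a sketch only - toCell and the underscore-prefix convention are taken from the example output above, not from any library, and CSV escaping of the embedded commas and quotes is left out):

import java.util.Map;

// Flatten one element into a CSV cell: an element with attributes becomes
// a small JSON-ish object, a plain element stays a bare value.
static String toCell(String text, Map<String, String> attributes) {
    if (attributes.isEmpty()) {
        return text;
    }
    StringBuilder sb = new StringBuilder("{\"_VALUE\":").append(text);
    for (Map.Entry<String, String> attr : attributes.entrySet()) {
        sb.append(",\"_").append(attr.getKey()).append("\":\"")
          .append(attr.getValue()).append('"');
    }
    return sb.append('}').toString();
}

For example, toCell("10000", Map.of("currency", "Euro")) gives {"_VALUE":10000,"_currency":"Euro"}.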
There are some constraints to keep in mind:
- The file is very large - 1GB+ - so I can't load this into memory.
- I should only read this once.
- (Obviously) data cannot be lost or incorrect.
Okay, so I currently have a parser that will transform a simple XML file into a CSV given a "ROWTAG", which is the tag that contains the records (person in this example). You can see it here.
But it has some limitations. I know how to fix/address most of them, but there are two that I can't figure out without breaking the constraints.
- Based on how the parser is implemented, the order of the tags matters. Let us say I had an XML file that looks like this:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Jordan</lastname>
<middlename>Rose</middlename>
</person>
The middlename in the first record comes second, but in the second record it comes third. This would result in a CSV file that looks like this:
firstname,middlename,lastname
James,,Smith
Michael,Jordan,Rose
Michael's name was recorded as Michael Jordan Rose, when it should be Michael Rose Jordan.
- If tags are added, removed, or changed in later records, the program does not reflect that. This is because the program only takes its headers from the tags of the first record element, and does not care about the tags in subsequent elements (as was illustrated in example one).
Let us take this example:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Jordan</lastname>
<middlename>Rose</middlename>
<dob>1/10/11</dob>
</person>
The resulting CSV would look like this:
firstname,middlename,lastname
James,,Smith
Michael,Jordan,Rose,1/10/11
This is a big problem, of course, and it must be solved.
My sort of solution
Before I get to the solution, I'm going to quickly summarize how exactly the program works. The parser moves through the XML document and, every time it encounters a tag, it hands that tag back to my program. My program has a "rowTag" that, as I've explained, it looks for. Once that tag is encountered, my program starts collecting all the tags and values inside of this rowTag into a StringBuilder. It will then dump that information when it encounters the closing rowTag. During the first iteration it will also save all the headers it encounters; then, before it dumps the values of the first record once the end tag is reached, it will first dump the headers.
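To make that concrete, it's roughly a StAX-style pull loop like the sketch below. This is a simplification and not my exact code - the names are made up, and attribute handling is left out - but it shows the shape of it:

import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Simplified sketch of the current approach: stream the XML once and,
// inside each rowTag, collect tag names (first record only) and values
// into StringBuilders, flushing one CSV line at each closing rowTag.
static void xmlToCsv(InputStream in, String rowTag, Appendable out)
        throws XMLStreamException, IOException {
    XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
    StringBuilder headers = new StringBuilder();
    StringBuilder row = new StringBuilder();
    boolean inRow = false, firstRecord = true;
    String currentTag = null;
    while (reader.hasNext()) {
        switch (reader.next()) {
            case XMLStreamConstants.START_ELEMENT:
                if (reader.getLocalName().equals(rowTag)) {
                    inRow = true;
                } else if (inRow) {
                    currentTag = reader.getLocalName();
                    if (firstRecord) headers.append(currentTag).append(',');
                }
                break;
            case XMLStreamConstants.CHARACTERS:
                if (inRow && currentTag != null) row.append(reader.getText());
                break;
            case XMLStreamConstants.END_ELEMENT:
                if (reader.getLocalName().equals(rowTag)) {
                    // headers are dumped once, right before the first record
                    if (firstRecord) {
                        out.append(headers).append('\n');
                        firstRecord = false;
                    }
                    out.append(row).append('\n');
                    row.setLength(0);
                    inRow = false;
                } else if (inRow) {
                    row.append(','); // close this column
                    currentTag = null;
                }
                break;
        }
    }
}

As written, this emits values in whatever order they are encountered and takes its headers only from record one - which is exactly where the two limitations above come from.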
Now... as I mentioned, this creates a problem with preserving the order and with any changed, removed, or added tags. I have a solution that fixes the ordering, and it should also fix the headers not being updated, but I'm not sure if it's feasible for my use case (I'll explain why in a minute).
My idea is to have something like a hashmap that collects the tags and the order in which they are encountered over time. The key would be the tag's name, and the value would be the position of the tag's first appearance.
As we collect each record while moving through the document, we would place its values into an array as large as the hashmap, each value in the slot the hashmap dictates. If we encounter a new tag, we would simply resize the array and add the tag to the hashmap - NOT with the position at which it was currently encountered (because that is likely to overwrite something), but rather with the last element's position + 1 (this is going to be an ordered hashmap, so I would know what the last element is).
Once we are completely done with the program, we would dump the headers collected to the first line of the file.
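In Java terms, I'm picturing something like the sketch below (assumptions: a LinkedHashMap so the final header dump preserves first-seen order, and each record reduced to a name-to-value map; none of these names are from my actual code):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the ordered-map idea: headerIndex maps a tag name to its
// column, assigned on first sight and never changed afterwards.
static final Map<String, Integer> headerIndex = new LinkedHashMap<>();

static int columnFor(String tag) {
    Integer col = headerIndex.get(tag);
    if (col == null) {
        // A new tag gets index = current size, i.e. last column + 1,
        // so existing columns are never overwritten.
        col = headerIndex.size();
        headerIndex.put(tag, col);
    }
    return col;
}

// Per record: place each value at its column, growing the row as needed.
static String toCsvLine(Map<String, String> record) {
    List<String> row = new ArrayList<>();
    for (Map.Entry<String, String> e : record.entrySet()) {
        int col = columnFor(e.getKey());
        while (row.size() <= col) row.add(""); // absent tags stay empty
        row.set(col, e.getValue());
    }
    return String.join(",", row);
}

At the very end, the header line would just be String.join(",", headerIndex.keySet()).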
So, let us look at the first example:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Jordan</lastname>
<middlename>Rose</middlename>
</person>
The hashmap, after running the first time, would look like this:
{firstname: 0, middlename: 1, lastname: 2}
When we get to the second record, where middlename and lastname's positions are switched, we would simply place lastname in the array[2] slot, and middlename in the array[1] slot.
If a tag is missing from a record, for example like this:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Rose</lastname>
</person>
During the second record, the array would hold null in the second slot (because middlename isn't there), which we would just convert to an empty string, and a comma would be appended there like normal.
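In code, that flush step might look something like this (assuming row is the resizable array just described; just a sketch):

// Flush one record: nulls (tags absent from this record) become empty
// cells so the column positions stay aligned.
static String flushRow(String[] row) {
    StringBuilder line = new StringBuilder();
    for (int i = 0; i < row.length; i++) {
        if (i > 0) line.append(',');
        line.append(row[i] == null ? "" : row[i]);
    }
    return line.toString();
}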
The interesting part is when something is added:
<person id="1">
<firstname>James</firstname>
<middlename></middlename>
<lastname>Smith</lastname>
</person>
<person id="2">
<firstname>Michael</firstname>
<lastname>Rose</lastname>
<dob>1/10/11</dob>
</person>
This would result in a CSV that looks like this:
firstname,middlename,lastname,dob
James,,Smith
Michael,,Rose,1/10/11
The first row doesn't have that extra , after Smith, but it seems that even though that's not strictly valid CSV, it's fine? Cool.
Anyways - now comes the real issue with this, I think. I'm not actually using BufferedReader/BufferedWriter in Java. We're using the stream reader/writer that comes with the Azure SDK, because of course all these files are in the cloud, and under the hood it's basically just REST API calls. So I don't think I'd be able to go back and dump the collected headers to the first line of the file once parsing is done. I'm not even sure that would have been possible regardless.
So. Any geniuses out there that have any ideas?