I have a use case where I am parsing a very large fixed-length file with a BeanIO-based approach in my Spark code, and I keep hitting timeouts or OOM errors. Is there any non-BeanIO approach to read a complex, very large fixed-length file with Spark? My approach for a single-record-type flat file was so quick that I could process a 10 GB file within a minute, but that approach does not work with a multi-record-type flat file.

Existing approach:
- Create a CSV metadata file that describes the schema:
col_name,size
product_id,2
first_name,10
last_name,10
- Read the flat file, using the CSV metadata file to build the flat file's schema, as below:
Dataset<Row> metadata = session.read()
        .option("header", "true")
        .csv("/path_to_csv_metadata");

List<String> header = metadata.select("col_name").toJavaRDD()
        .map(row -> row.getString(0).trim()).collect();
List<Integer> sizeOfColumn = metadata.select("size").toJavaRDD()
        .map(row -> Integer.parseInt(row.getString(0).trim())).collect();

List<StructField> fields = new ArrayList<>();
for (String fieldName : header) {
    fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = session.createDataFrame(rdd.map(row -> lsplit(sizeOfColumn, row)), schema);
df.show(5);
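For completeness, `rdd` above is the flat file read in as plain text lines, and `lsplit` is a small helper that cuts each line at the configured widths and returns a Row. Roughly like this (simplified sketch; the path and the exact padding/null handling in my real code may differ):

JavaRDD<String> rdd = session.read().textFile("/path_to_flatfile").toJavaRDD();

// Cuts one fixed-length line into substrings according to the column sizes
// from the metadata CSV and wraps them in a Row (all columns as strings).
private static Row lsplit(List<Integer> sizes, String line) {
    Object[] values = new Object[sizes.size()];
    int pos = 0;
    for (int i = 0; i < sizes.size(); i++) {
        int end = Math.min(pos + sizes.get(i), line.length());
        values[i] = pos < line.length() ? line.substring(pos, end).trim() : null;
        pos += sizes.get(i);
    }
    return RowFactory.create(values);
}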
The above approach works only with a single record type. Can anyone suggest a good approach to load a fixed-length file containing multiple record types? With the CSV metadata file we cannot have multiple headers/layouts to parse such a file. Please provide inputs.
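To illustrate what I mean by multiple record types (sample layout only, not my real file): each line starts with a record-type indicator at a fixed position, and the rest of the line's layout depends on that indicator. Reusing the columns from the metadata above, the file could mix lines like:

01JOHN      DOE
02XY

where type "01" lines carry first_name(10) and last_name(10), and type "02" lines carry product_id(2), so each record type needs its own column list and sizes.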