I have a use case where I am parsing a very large fixed-length file with a BeanIO-based approach in my Spark code, and I keep hitting timeouts or OOM errors. Is there any non-BeanIO approach to read a complex, very large fixed-length file with Spark? My approach for a single-record-type flat file was so quick that I could process a 10 GB file within a minute, but that approach does not work with a multi-record-type flat file.

Existing approach:
- Create a CSV metadata file that describes the schema:
col_name,size
product_id,2
first_name,10
last_name,10
- Read the flat file, using the CSV metadata file to build the flat file's schema, as below:
Dataset<Row> metadata = session.read()
        .option("header", "true")
        .csv("/path_to_csv_metadata");

List<String> header = metadata.select("col_name").toJavaRDD()
        .map(row -> row.getString(0).trim()).collect();
List<Integer> sizeOfColumn = metadata.select("size").toJavaRDD()
        .map(row -> Integer.parseInt(row.getString(0).trim())).collect();

List<StructField> fields = new ArrayList<>();
for (String fieldName : header) {
    fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = session.createDataFrame(rdd.map(row -> lsplit(sizeOfColumn, row)), schema);
df.show(5);
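For completeness, `rdd` above is the flat file read in as plain text lines, and `lsplit` is a small helper that cuts each line at the configured widths and returns a Row. Roughly like this (simplified sketch; the path and the exact padding/null handling in my real code may differ):

JavaRDD<String> rdd = session.read().textFile("/path_to_flatfile").toJavaRDD();

// Cuts one fixed-length line into substrings according to the column sizes
// from the metadata CSV and wraps them in a Row (all columns as strings).
private static Row lsplit(List<Integer> sizes, String line) {
    Object[] values = new Object[sizes.size()];
    int pos = 0;
    for (int i = 0; i < sizes.size(); i++) {
        int end = Math.min(pos + sizes.get(i), line.length());
        values[i] = pos < line.length() ? line.substring(pos, end).trim() : null;
        pos += sizes.get(i);
    }
    return RowFactory.create(values);
}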
The above approach works only with a single record type. Can anyone suggest a good approach to load a fixed-length file containing multiple record types? With the CSV metadata file we cannot have multiple headers/layouts to parse such a file. Please provide inputs.
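To illustrate what I mean by multiple record types (sample layout only, not my real file): each line starts with a record-type indicator at a fixed position, and the rest of the line's layout depends on that indicator. Reusing the columns from the metadata above, the file could mix lines like:

01JOHN      DOE
02XY

where type "01" lines carry first_name(10) and last_name(10), and type "02" lines carry product_id(2), so each record type needs its own column list and sizes.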