I'm working with some JSON files that were converted from .avro. Some of the JSON fields have embedded JSON documents that must be parsed out before I can query them. E.g.:
[
  {
    "RecordNum": 1,
    "Record": "[{\u0022val\u0022:1}]"
  },
  {
    "RecordNum": 2,
    "Record": "[{\u0022val\u0022:\u0022hello\u0022}]"
  }
]
If you look closely at the embedded JSON in the Record field, you'll see that the first record uses a numeric value for the val field, while the second uses a string value:
Record 1: [{ "val":1 }]
Record 2: [{ "val":"hello"}]
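For reference, a standard JSON parser has no trouble with this mixed typing; it's only Drill's schema inference that objects. A minimal Python check (the records below mirror the sample data above, with the \u0022 escapes written as plain quotes):

```python
import json

# Sample rows shaped like the converted .avro output above
records = [
    {"RecordNum": 1, "Record": '[{"val":1}]'},
    {"RecordNum": 2, "Record": '[{"val":"hello"}]'},
]

for rec in records:
    embedded = json.loads(rec["Record"])  # parse the embedded JSON string
    # val comes back as int for record 1 and str for record 2
    print(rec["RecordNum"], type(embedded[0]["val"]).__name__)
```

Both records parse cleanly; the val field simply comes back with a different Python type for each.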
When I tell Drill to parse the embedded JSON using convert_fromJSON(Record), it throws an exception:
apache drill> select RecordNum, convert_fromJSON(Record) from fs.`\mydir\somefile.json`;
Error: DATA_READ ERROR: You tried to write a VarChar type when you are using a ValueWriter of type NullableBigIntWriterImpl.
My guess is that Drill infers a numeric type for val from the first record, then tries to apply that same schema to the second record and fails. If you reverse the order of the records, the error flips accordingly:
Error: DATA_READ ERROR: You tried to write a BigInt type when you are using a ValueWriter of type NullableVarCharWriterImpl.
I have the store.json.all_text_mode option set to TRUE but this doesn't seem to affect convert_fromJSON.
For various reasons, I cannot change the contents of the file and I cannot use the original .avro files.
I was hoping that Drill would either respect the store.json.all_text_mode value for convert_fromJSON or simply not care about the difference. I've tried a number of things to narrow this down to this simple use case; the actual data I'm dealing with is extremely large.
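As a last resort I could pre-process the files outside Drill, e.g. coercing every val in the embedded JSON to a string so the inferred schema is uniform. A hypothetical sketch (the stringify_vals helper is my own naming, and given the data volume I'd much rather Drill handle this itself):

```python
import json

def stringify_vals(row):
    """Coerce every 'val' in the embedded Record JSON to a string."""
    embedded = json.loads(row["Record"])
    for item in embedded:
        item["val"] = str(item["val"])
    row["Record"] = json.dumps(embedded)
    return row

# Rows shaped like the sample file above
rows = [
    {"RecordNum": 1, "Record": '[{"val":1}]'},
    {"RecordNum": 2, "Record": '[{"val":"hello"}]'},
]
rows = [stringify_vals(r) for r in rows]
```

After this pass every embedded val is a VarChar-compatible string, so Drill's inferred schema would be consistent across records, at the cost of rewriting files I'd prefer not to touch.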