Inconsistency between the from_avro and from_json functions


Spark's from_avro function does not accept the schema as a DataFrame column; it only takes the schema as a String:

def from_avro(data: Column, jsonFormatSchema: String): Column

This makes it impossible to deserialize rows of Avro records that have different schemas, since only one schema string can be passed in from outside.
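For reference, this is how the function is used today: one literal schema string shared by the whole column. A minimal sketch (assuming Spark 3.x and a SparkSession named spark in scope; the byte payload is illustrative):

import org.apache.spark.sql.avro.functions.from_avro
import spark.implicits._

// One schema string for every row -- there is no per-row variant.
val schema = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"}]}"""

// 12 is the zigzag-encoded length (6), followed by the UTF-8 bytes of "apple1".
val singleSchemaDf = Seq(Array[Byte](12, 97, 112, 112, 108, 101, 49)).toDF("binaryData")

singleSchemaDf.select(from_avro($"binaryData", schema).as("parsedData")).show()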

Here is what I would expect:

def from_avro(col: Column, jsonFormatSchema: Column): Column  
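For comparison, from_json already has an overload that takes the schema as a Column:

def from_json(e: Column, schema: Column): Column

A minimal sketch of that overload in use (assuming a SparkSession named spark is in scope):

import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}
import spark.implicits._

val jsonDf = Seq("""{"str1":"apple1"}""").toDF("json")

// The schema argument is a Column here, derived from an example document.
jsonDf.select(from_json($"json", schema_of_json(lit("""{"str1":"apple1"}"""))).as("parsed")).show()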

Code example of what I would like to write (this does not compile against the current API):

import org.apache.spark.sql.avro.functions.from_avro
import spark.implicits._

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"double"}]}"""


// Each row carries its own schema string; the byte payloads are illustrative.
val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


// Desired behaviour: the schema is read per row from the "schema" column.
val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))


parsed.show()

// Expected output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+

Any suggestions on how to deserialize Avro records with different schemas in a Spark DataFrame?
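One workaround I can think of (a rough sketch, not tested): if the set of possible schemas is small and known up front, split the DataFrame by schema value, deserialize each subset with its own literal schema, and union the results, provided the parsed struct types are union-compatible (otherwise convert each struct to a common representation, e.g. with to_json, before the union):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.avro.functions.from_avro

// All schema strings that may appear in the "schema" column.
val knownSchemas = Seq(avroSchema1, avroSchema2)

// Parse each subset with its own literal schema, then stitch the pieces back together.
val parsedBySchema: DataFrame = knownSchemas
  .map { s =>
    df.filter($"schema" === s)
      .select(from_avro($"binaryData", s).as("parsedData"))
  }
  .reduce(_ union _)

This only scales to a handful of known schemas, though, so I am still looking for a proper per-row solution.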
