Creating DataFrame in Java based on ByteArrayInputStream

350 Views Asked by At

I need to convert following to Spark DataFrame in Java with the saving of the structure according to the avro schema. And then I'm going to write it to s3 based on this avro structure.

GenericRecord r = new GenericData.Record(inAvroSchema);
r.put("id", "1");
r.put("cnt", 111);

Schema enumTest =
        SchemaBuilder.enumeration("name1")
                .namespace("com.name")
                .symbols("s1", "s2");

GenericData.EnumSymbol symbol = new GenericData.EnumSymbol(enumTest, "s1");

r.put("type", symbol);

ByteArrayOutputStream bao = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<>(inAvroSchema);

Encoder e = EncoderFactory.get().jsonEncoder(inAvroSchema, bao);
w.write(r, e);
e.flush();

I can create the object based on JSON structure

  Object o = reader.read(null, DecoderFactory.get().jsonDecoder(inAvroSchema, new ByteArrayInputStream(bao.toByteArray())));

But maybe there is any way to create DataFrame based on ByteArrayInputStream(bao.toByteArray())?

Thanks

1

There are 1 best solutions below

2
andreoss On

No, you have to use a Data Source to read Avro data. And it's crutial for Spark to read Avro as files from filesystem, because many optimizations and features depend on it (such as compression and partitioning). You have to add spark-avro (unless you are above 2.4). Note that EnumType you are using will be String in Spark's Dataset

Also see this: Spark: Read an inputStream instead of File

Alternatively you can consider deploying a bunch of tasks with SparkContext#parallelize and reading/writing the files explicitly by DatumReader/DatumWriter.