I am trying to read some protocol buffer files that are, as far as I can tell (I am not 100% sure), compressed with snappy. The files are in a binary format.
I am running a notebook on Databricks with runtime version 10.4 LTS and Spark 3. This is the code I am using:
sc.sequenceFile[NullWritable, BytesWritable](concatUris)
  .map(b => {
    val msg: Array[Byte] = b._2.copyBytes()
    val feed: a_feed = a_feed.parseFrom(msg)
    val properties = feed.toPMessage.value
      .map { case (key, value) =>
        key.name -> (value match {
          case i: PInt => i.value
          case l: PLong => l.value
          case s: PString => s.value
          case d: PDouble => d.value
          case f: PFloat => f.value
          case b: PByteString => b.value
          case c: PBoolean => c.value
          case e: PEnum => e.value.name
          case other => other.toString()
        })
      }
    (uid, properties + ("user_id" -> uid))
  })
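Since I am not certain the files are actually snappy-compressed, one way to confirm the codec would be to read a SequenceFile header directly. A minimal sketch (the path below is a placeholder, not one of my real files):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Sketch: open one SequenceFile and print the compression type and codec
// recorded in its header. "/mnt/data/part-00000" is a placeholder path.
val conf = sc.hadoopConfiguration
val reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(new Path("/mnt/data/part-00000")))
try {
  println(reader.getCompressionType)                                   // NONE, RECORD or BLOCK
  println(Option(reader.getCompressionCodec).map(_.getClass.getName))  // e.g. SnappyCodec
} finally {
  reader.close()
}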
A longer excerpt of the stack trace:
Caused by: UnsatisfiedLinkError: org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawUncompress(Ljava/nio/ByteBuffer;IILjava/nio/ByteBuffer;I)I
at org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.apache.hadoop.shaded.org.xerial.snappy.Snappy.uncompress(Snappy.java:551)
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressDirectBuf(SnappyDecompressor.java:267)
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:217)
at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:92)
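The failure comes from the copy of snappy that is shaded inside Hadoop (org.apache.hadoop.shaded.org.xerial.snappy), so it looks like the native library behind that copy is what fails to load. As a sanity check, a round trip through the unshaded snappy-java on the driver should at least show whether that copy's native library loads, something like:

import org.xerial.snappy.Snappy

// Sketch: round-trip a small byte array through the unshaded snappy-java
// to see whether its native library loads at all on the driver.
val original = "just a snappy sanity check".getBytes("UTF-8")
val roundTripped = Snappy.uncompress(Snappy.compress(original))
println(new String(roundTripped, "UTF-8"))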
I have tried installing different versions of org.xerial.snappy:snappy-java:<version>:jar, but to no avail.
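To see which jar is actually being picked up on the driver after installing a library, something along these lines should show where each snappy class is loaded from (the shaded Hadoop copy checked the same way):

// Sketch: print which jar each snappy class is actually loaded from on the driver.
println(classOf[org.xerial.snappy.Snappy]
  .getProtectionDomain.getCodeSource.getLocation)
println(Class.forName("org.apache.hadoop.shaded.org.xerial.snappy.Snappy")
  .getProtectionDomain.getCodeSource.getLocation)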
I can only install libraries on the Databricks cluster through the Compute > Libraries tab (by uploading them), and I am not sure whether they get installed throughout the cluster (driver + executors) or only on one of the two.
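For what it is worth, the same kind of check could be pushed to an executor to see whether it resolves a different jar than the driver, e.g.:

// Sketch: run the classpath check on an executor instead of the driver.
val executorLocation = sc.parallelize(Seq(1), 1).map { _ =>
  classOf[org.xerial.snappy.Snappy]
    .getProtectionDomain.getCodeSource.getLocation.toString
}.collect().head
println(executorLocation)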