Creating a SQLContext Dataset from an RDD containing arrays of Strings in Spark

683 Views Asked by osk At 30 June 2025 at 20:31

So I have a variable data which is a RDD[Array[String]]. I want to iterate over it and compare adjacent elements. To do this I must create a dataset from the RDD.

I try the following, sc is my SparkContext:

import org.apache.spark.sql.SQLContext

val sqc = new SQLContext(sc)
val lines = sqc.createDataset(data)

And I get the two following errors:

Error:(12, 34) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases. val lines = sqc.createDataset(data)

Error:(12, 34) not enough arguments for method createDataset: (implicit evidence$4: org.apache.spark.sql.Encoder[Array[String]])org.apache.spark.sql.Dataset[Array[String]]. Unspecified value parameter evidence$4. val lines = sqc.createDataset(data)

Sure, I understand I need to pass an Encoder argument, however, what would it be in this case and how do I import Encoders? When I try myself it says that createDataset does not take that as argument.

There are similar questions, but they do not answer how to use the encoder argument. If my RDD is a RDD[String] it works perfectly fine, however in this case it is RDD[Array[String]].

Original Q&A

There are 1 best solutions below

Ramesh Maharjan On 07 December 2017 at 15:57 BEST ANSWER

All of the comments in the question are trying to tell you the following things

You say you have RDD[Array[String]], which I create by doing the following

val rdd = sc.parallelize(Seq(Array("a", "b"), Array("d", "e"), Array("g", "f"), Array("e", "r")))   //rdd: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[0] at parallelize at worksheetTest.sc4592:13

Now converting the rdd to dataframe is to call .toDF but before that you need to import implicits._ of sqlContext as below

val sqc = new SQLContext(sc)
import sqc.implicits._
rdd.toDF().show(false)

You should have dataframe as

+------+
|value |
+------+
|[a, b]|
|[d, e]|
|[g, f]|
|[e, r]|
+------+

Isn't this all simple?

Creating a SQLContext Dataset from an RDD containing arrays of Strings in Spark

There are 1 best solutions below

Related Questions in SCALA

Related Questions in APACHE-SPARK

Related Questions in DATASET

Related Questions in RDD

Related Questions in APACHE-SPARK-1.6

Trending Questions

Popular # Hahtags

Popular Questions