How to store dataframe, view in tuple in spark-scala


I am trying to get the data from MongoDB in parallel and store all the dataframes and view names in a collection so that I can refer back to them later.

For this, I created a collection in which I am trying to store the dataframes and views. However, I am getting an error when appending an element to the collection. I tried using Vector, List, and Seq, but nothing seems to work for me.

Is there a way to handle such problems?

var mongoFrames = Nil

for(c <- collections) {
    var connectionString = connectionInt.setCollection(c);
    var dframe = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", connectionString).load()
    var view = dframe.createOrReplaceTempView(c);
    var mongoQuery = s"select * from $c where tuid in (${tuidIn.mkString(",")})";

    var tup = (c, dframe, view, mongoQuery)
    mongoFrames += tup
}

for(v <- mongoFrames) yield spark.sql(v._4).collect() // load data from source into spark

Update

When trying to use +:, I am getting the following error:

error: value +: is not a member of (String, org.apache.spark.sql.DataFrame, Unit, String) mongoFrames +: tup
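The error happens because Scala operators ending in a colon are right-associative: `mongoFrames +: tup` desugars to `tup.+:(mongoFrames)`, and a tuple has no `+:` method. A minimal sketch (using plain `(String, Int)` tuples in place of the DataFrame tuples) of the correct forms:

```scala
// Operators ending in ':' bind to the operand on their RIGHT,
// so `mongoFrames +: tup` desugars to `tup.+:(mongoFrames)`,
// and Tuple4 has no such method. The collection must be on the colon side:
var frames: List[(String, Int)] = List.empty

val tup = ("users", 1)
frames = tup +: frames            // prepend: element goes on the left of +:
frames = frames :+ ("orders", 2)  // append: element goes on the right of :+

// frames is now List(("users", 1), ("orders", 2))
```

Note that immutable collections return a new collection, so the result must be reassigned to the `var`.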

There are 2 best solutions below

zmerr On BEST ANSWER

You can write it as:

import org.apache.spark.sql.DataFrame

var mongoFrames: Seq[(String, DataFrame, String)] = Seq.empty

and

val tup: (String, DataFrame, String) = (c, dframe, mongoQuery)

mongoFrames = mongoFrames :+ tup

then

iterate over it

for(v <- mongoFrames) yield spark.sql(v._3).collect() 

Edit 1:

a more idiomatic way of iterating over the collection in this case is to use foreach with an anonymous function:

mongoFrames.foreach(mongoFrame => spark.sql(mongoFrame._3).collect())

Note that the underscore placeholder cannot be used here as spark.sql(_._3).collect(): the placeholder expands to the innermost enclosing expression, producing spark.sql(v => v._3), which does not compile.
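Since the elements are plain `Tuple3` values, the slots are accessed positionally with `._1`, `._2`, `._3`. A small sketch of the build-then-iterate pattern described above, in plain Scala with hypothetical collection names and a `String` standing in for the `DataFrame` slot:

```scala
// Hypothetical names standing in for the MongoDB collections.
val collections = Seq("users", "orders")

// Build the Seq immutably inside the loop, reassigning the var each time.
var mongoFrames: Seq[(String, String, String)] = Seq.empty
for (c <- collections) {
  // (collection name, frame placeholder, query)
  val tup = (c, s"frame-$c", s"select * from $c")
  mongoFrames = mongoFrames :+ tup
}

// Positional accessors pull each slot back out.
mongoFrames.foreach(v => println(v._3))
```

The same shape applies with a real `DataFrame` in the second slot; only the element type of the `Seq` changes.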

Mohana B C On

This should work for you:

import org.apache.spark.sql.DataFrame

var mongoFrames = List.empty[(String, DataFrame, Unit, String)]

for(c <- collections) {
//...
mongoFrames = mongoFrames :+ tup
}

Don't add the return value of createOrReplaceTempView to the tuple; it's of no use since the method returns Unit. You can access the temp view by its name within the SparkSession.