sequence files from sqoop import

397 Views Asked by Jaya At 21 January 2020 at 14:10

I have imported a table using sqoop and saved it as a sequence file.

How do I read this file into an RDD or Dataframe?

I have tried sc.sequenceFile(), but I'm not sure what to pass as keyClass and valueClass. I tried using org.apache.hadoop.io.Text and org.apache.hadoop.io.LongWritable for keyClass and valueClass, but it did not work. I am using PySpark to read the files.


There are 2 best solutions below

Karthik On 29 January 2020 at 14:11

This does not work in Python, but it does work in Scala.

Follow these steps:

step1: When you import a table as a sequence file with Sqoop, it generates a jar file for that table. You need to use the class from that jar as the valueClass when reading the sequence file. The jar is generally placed under /tmp, but you can redirect it to a specific folder (a local folder, not HDFS) using the --bindir option.

example: sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export --username retail_user --password itversity --table customers -m 1 --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/

step2: You also need to download the Sqoop jar from the link below: http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm

step3: Suppose the customers table was imported with Sqoop as a sequence file. Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar

example:

spark-shell --master yarn --jars /home/srikarthik/sqoopjars/customers.jar,/home/srikarthik/tejdata/kjar/sqoop-1.4.4-hadoop200.jar

step4: Now run the commands below inside the spark-shell:

scala> import org.apache.hadoop.io.LongWritable

scala> val data = sc.sequenceFile[LongWritable,customers]("/user/srikarthik/udemy/practice4/problem2/outputseq")

scala> data.map(tup => (tup._1.get(), tup._2.toString())).collect.foreach(println)
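
If you are unsure which key and value classes a sequence file expects, the file header records both class names, so you can check before picking types for sc.sequenceFile. The following is a minimal Python sketch, not part of the steps above. It assumes a current-format SequenceFile header (the bytes "SEQ", one version byte, then the key and value class names as length-prefixed strings) and class names shorter than 128 bytes. The function names read_text and sequence_file_classes are made up for illustration.

```python
import io
import struct


def read_text(stream):
    """Read one length-prefixed string (single-byte length, then UTF-8 bytes)."""
    first = stream.read(1)
    if len(first) != 1:
        raise ValueError("unexpected end of header")
    (length,) = struct.unpack("b", first)  # interpret as a signed byte
    if length < 0:
        # Assumption: names of 128+ bytes use a multi-byte length,
        # which this sketch deliberately does not decode.
        raise ValueError("class name too long for this sketch")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of header")
    return data.decode("utf-8")


def sequence_file_classes(stream):
    """Return (version, key_class, value_class) from a SequenceFile header."""
    magic = stream.read(4)
    if len(magic) != 4 or magic[:3] != b"SEQ":
        raise ValueError("not a Hadoop SequenceFile")
    version = magic[3]
    key_class = read_text(stream)
    value_class = read_text(stream)
    return version, key_class, value_class


# Demo on an in-memory header shaped like the one expected for the
# customers import above (the value class name is an assumption here).
key = b"org.apache.hadoop.io.LongWritable"
value = b"customers"
header = b"SEQ\x06" + bytes([len(key)]) + key + bytes([len(value)]) + value
print(sequence_file_classes(io.BytesIO(header)))
```

To use it on real data, copy one part file out of HDFS to a local folder, open it in binary mode ("rb"), and pass the file object to sequence_file_classes. The value class it reports is the one that has to be on the classpath via --jars.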

Gara Walid On 22 March 2021 at 20:32

You can use the SeqDataSourceV2 package to read the sequence file with the DataFrame API without any prior knowledge of the schema (i.e. keyClass and valueClass).
Please note that the current version is only compatible with Spark 2.4.

$ pyspark --packages seq-datasource-v2-0.2.0.jar

df = spark.read.format("seq").load("data.seq")
df.show()
