I have imported a table using sqoop and saved it as a sequence file.
How do I read this file into an RDD or Dataframe?
I have tried sc.sequenceFile() but I'm not sure what to pass as keyClass and valueClass. I tried using org.apache.hadoop.io.Text and org.apache.hadoop.io.LongWritable for keyClass and valueClass,
but it did not work. I am using pyspark to read the files.
In Python it does not work for me; however, it works in Scala.
You need to follow these steps:
step1: When you import a table as a sequence file with sqoop, a jar file is generated, and you need to use the class in that jar as the valueClass when reading the sequence file. The jar is generally placed in the /tmp folder, but you can redirect it to a specific folder (a local folder, not HDFS) using the --bindir option.
example: sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export --username retail_user --password itversity --table customers -m 1 --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/
step2: You also need to download the sqoop jar file from the link below: http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm
step3: Suppose the customers table was imported with sqoop as a sequence file. Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar
example:
step4: Now run the commands below inside spark-shell
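As a rough sketch of step4, something like the following should work in spark-shell. It makes two assumptions: the generated class is named customers (sqoop names it after the table by default), and the data sits in the --target-dir used in the step1 example. Sqoop writes the key as a LongWritable and the value as the generated record class, so those are the two classes passed to sc.sequenceFile. Adjust the path and class name to your own import.

```scala
// Run inside spark-shell, where `sc` (the SparkContext) is already defined.
import org.apache.hadoop.io.LongWritable

// Assumption: this is the --target-dir from the sqoop import example.
val path = "/user/srikarthik/udemy/practice4/problem2/outputseq"

// Assumption: `customers` is the sqoop-generated class inside the jar
// that was passed to spark-shell with --jars.
val data = sc.sequenceFile(path, classOf[LongWritable], classOf[customers])

// Hadoop Writables are not serializable and are reused between records,
// so convert each pair to plain Scala values before bringing rows to the driver.
val rows = data.map { case (key, value) => (key.get, value.toString) }

// Print a few records to check that the file was read correctly.
rows.take(5).foreach(println)
```

Each value prints as one delimited record via the generated class's toString, so it can be split into columns from there if you want to build a DataFrame.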