I use CentOS on the Cloudera QuickStart VM. I created an sbt-managed Spark application following the other question How to save DataFrame directly to Hive?
build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.2"
I'd like to use a DataFrame as a Hive table as follows:
recordDF.registerTempTable("mytempTable")
hiveContext.sql("create table productstore as select * from mytempTable");
I noticed that I am getting the error:
The root scratch dir: /tmp/hive should be writable. Current permissions are: rwx------
I followed other questions and set chmod 777 on /tmp/hive in HDFS.
It then occurred to me that Spark was using /tmp/hive on the local filesystem, so I did a chmod on the local directory as well.
Now I am getting the error:
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/productstore is not a directory or unable to create one)
I'd like to store a DataFrame in the HDFS Hive warehouse.
There are two issues in play here.
Issue #1 - Permissions
On CentOS (or any other Unix flavour, e.g. Linux or macOS), just make /tmp/hive on the local filesystem writable; it should have been writable for anyone on the OS anyway, since it is a temporary directory.
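For example, a minimal sketch (assuming the local /tmp/hive already exists and you have sudo rights):
sudo chmod -R 777 /tmp/hive  # open the local Hive scratch dir to all users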
Issue #2 - Hive Metastore Directory
Since you work with the Cloudera QuickStart VM, you use a pre-2.0 version of Spark (1.6.3 perhaps?).
The issue is due to no path specified when persisting a DataFrame to a Hive table. The directory /user/hive/warehouse is used by default, and to avoid writing to that directory, define the path option when saving to a Hive table, using the option method (or save with the path option).
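A minimal sketch of what that could look like with the DataFrameWriter API, reusing recordDF from the question (the HDFS location is only an illustrative assumption):
// write the DataFrame as a Hive table backed by an explicit HDFS location
recordDF.write
  .option("path", "hdfs:///user/hive/warehouse/productstore")
  .saveAsTable("productstore")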
As of Spark 2.0, a write without the path option will go to a local Hive metastore (using Derby), which is in the current directory as spark-warehouse, as specified by the spark.sql.warehouse.dir Spark property.
To upgrade, please define your Spark dependencies using 2.0.2 (not 1.5.2). Just one single line is all you really need to have Spark SQL with Hive support.
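A sketch of what that single line could look like; the spark-hive artifact and the Scala version are assumptions here (spark-hive transitively pulls in spark-sql and spark-core):
scalaVersion := "2.11.8"  // Spark 2.0.x is built for Scala 2.11
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.2"  // Spark SQL with Hive support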
You can then use hive.metastore.warehouse.dir or spark.sql.warehouse.dir to set the Spark warehouse and point to HDFS where the other Hive tables live.
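For example, a minimal sketch using the Spark 2.x SparkSession API (the app name and HDFS URI are assumptions, adjust to your cluster):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("productstore")
  .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")  // point the Spark warehouse at HDFS
  .enableHiveSupport()  // use the Hive metastore and warehouse
  .getOrCreate()

With that in place, recordDF.write.saveAsTable("productstore") should end up under the HDFS warehouse instead of a local spark-warehouse directory.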