After running the following:

```r
devtools::install_github('apache/spark@v3.3.0', subdir = 'R/pkg', force = TRUE)
library(SparkR)
```
I ran this to convert my data into a Spark DataFrame:

```r
as.DataFrame(value1)
```
However, I got the following error message:

```
Error in getSparkSession() : SparkSession not initialized
```
So, I ran this:

```r
sparkR.session()
```
It gives the following prompt:

```
Will you download and install (or reuse if it exists) Spark package under the cache [/home/analytics/.cache/spark]? (y/n):
```
If I answer "n", I get this:

```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
  - If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
  - If not, you may run install.spark function to do the job.
```
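For reference, the first remedy that message names can be followed directly if a Spark distribution is already unpacked somewhere on the machine; the path below is purely a hypothetical example, not one from my system:

```r
library(SparkR)

# Point sparkR.session at an existing Spark installation instead of the
# cache (hypothetical path -- use wherever your Spark tarball is unpacked)
sparkR.session(sparkHome = "/opt/spark-3.3.0-bin-hadoop3")
```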
However, if I answer "y", I get a long message, as follows:
```
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): cannot open URL 'https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz': HTTP status was '404 Not Found'
To use backup site...
Downloading spark-3.3.0 for Hadoop 2.7 from:
- http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): URL 'http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz': status was 'Couldn't resolve host name'
- Unable to download from default mirror site: http://www-us.apache.org/dist/spark
Error in robustDownloadTar(mirrorUrl, version, hadoopVersion, packageName, :
  Unable to download Spark spark-3.3.0 for Hadoop 2.7. Please check network connection, Hadoop version, or provide other mirror sites.
```
How can I eliminate these errors?
As per my understanding, you would also need the Spark package installed on your system.
Spark can be installed using these links: download Spark 3.3.0, download Hadoop 3.0.0, and Java OpenJDK 11.0.13 LTS.
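Alternatively, note that the 404 in the log above is most likely because the Spark 3.3.0 release no longer publishes a Hadoop 2.7 binary, while SparkR's installer defaults to `hadoopVersion = "2.7"`. Asking `install.spark` for the Hadoop 3 build should let the cached install succeed (a sketch, assuming the Apache mirror is reachable from your machine):

```r
library(SparkR)

# Request the Hadoop 3 build explicitly; Spark 3.3.0 does not ship a
# "-bin-hadoop2.7" tarball, which is what produced the 404 earlier.
install.spark(hadoopVersion = "3", overwrite = TRUE)

# With the package now in the cache, the session can start
sparkR.session()
```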
Set the system environment variable `SPARK_HOME` to the Spark 3.3.0 directory downloaded previously, and similarly set `HADOOP_HOME` and `JAVA_HOME`. Then run the R script below to load the `SparkR` library, updating `<spark-lib-path>` to the unpacked Spark installation directory downloaded earlier. These steps worked for me when I tried earlier; I used Spark 3.1.2 with Hadoop 2.7.4.
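The original script was not preserved above, so here is a minimal sketch of the usual pattern for loading `SparkR` from an unpacked installation; `<spark-lib-path>` is the placeholder from the answer, and the other paths are assumptions to adjust for your machine:

```r
# <spark-lib-path> is a placeholder: replace it with your unpacked Spark
# directory. HADOOP_HOME / JAVA_HOME values below are example paths only.
Sys.setenv(SPARK_HOME  = "<spark-lib-path>",
           HADOOP_HOME = "/opt/hadoop-3.0.0",
           JAVA_HOME   = "/usr/lib/jvm/java-11-openjdk")

# Load SparkR from the installation itself rather than from a CRAN/GitHub
# install, then start a local session against that SPARK_HOME
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkHome = Sys.getenv("SPARK_HOME"))
```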