Error in sparkCheckInstall(sparkHome, master, deployMode):


After doing the following

devtools::install_github('apache/[email protected]', subdir='R/pkg', force = TRUE)
library(SparkR)

I ran this to convert my data into a Spark DataFrame:

as.DataFrame(value1)

However, I got the following error message

Error in getSparkSession() : SparkSession not initialized

So, I ran this...

sparkR.session()

It gives the following prompt:

Will you download and install (or reuse if it exists) Spark package under the cache [/home/analytics/.cache/spark]? (y/n):

If I answer no, I get this...

 Error in sparkCheckInstall(sparkHome, master, deployMode) : 
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
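
The error message itself points at two ways out: either tell `sparkR.session()` where an existing Spark installation lives, or run `install.spark()` explicitly. A rough sketch of both, assuming SparkR is already loaded (the `/opt/spark` path is only a placeholder for wherever Spark is unpacked):

```r
library(SparkR)

# Option 1: point SparkR at an existing Spark installation.
# "/opt/spark" is a hypothetical path -- use your actual unpack location.
Sys.setenv(SPARK_HOME = "/opt/spark")
sparkR.session(master = "local[*]", sparkHome = Sys.getenv("SPARK_HOME"))

# Option 2: let SparkR download Spark itself, but pick the Hadoop build
# explicitly instead of relying on the default (see below for why).
install.spark(hadoopVersion = "3")
sparkR.session()
```

Either way, once the session exists, `as.DataFrame(value1)` should no longer fail with "SparkSession not initialized".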

However, if I answer yes, I get a long message, as follows:

Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://dlcdn.apache.org/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): cannot open URL 'https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz': HTTP status was '404 Not Found'


To use backup site...
Downloading spark-3.3.0 for Hadoop 2.7 from:
- http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
simpleWarning in download.file(remotePath, localPath): URL 'http://www-us.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz': status was 'Couldn't resolve host name'


- Unable to download from default mirror site: http://www-us.apache.org/dist/spark
Error in robustDownloadTar(mirrorUrl, version, hadoopVersion, packageName,  : 
  Unable to download Spark spark-3.3.0 for Hadoop 2.7. Please check network connection, Hadoop version, or provide other mirror sites.
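
The 404 in the log is a hint: `install.spark()` defaults to `hadoopVersion = "2.7"`, but as far as I can tell no Hadoop 2.7 binary build was published for Spark 3.3.0 (the 3.3.x tarballs are built for Hadoop 3), so the constructed URL simply does not exist. A hedged workaround is to pass the Hadoop version, and optionally a mirror, explicitly:

```r
library(SparkR)

# hadoopVersion = "3" makes install.spark() request
# spark-3.3.0-bin-hadoop3.tgz, which is a build that actually exists.
# The mirrorUrl shown is the Apache archive; any working mirror base
# URL should do.
install.spark(hadoopVersion = "3",
              mirrorUrl = "https://archive.apache.org/dist/spark")
```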

How do I eliminate these errors?


1 answer below

Answer by Vivek Atal:

As far as I understand, you also need the Spark distribution itself installed on your machine.

The required components can be installed from the official download pages: Spark 3.3.0, Hadoop 3.0.0, and Java OpenJDK 11.0.13 LTS.

Set the system environment variable SPARK_HOME to the Spark 3.3.0 directory downloaded above, and similarly set HADOOP_HOME and JAVA_HOME.
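
On Linux, that typically means something like the following in `~/.bashrc` or `~/.profile` (the directory names below are placeholders for wherever you unpacked the archives):

```shell
# Hypothetical install locations -- adjust to your actual unpack paths.
export SPARK_HOME="$HOME/spark-3.3.0-bin-hadoop3"
export HADOOP_HOME="$HOME/hadoop-3.0.0"
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk"
export PATH="$SPARK_HOME/bin:$JAVA_HOME/bin:$PATH"

# In the Spark binary distribution, the SparkR package lives under
# $SPARK_HOME/R/lib -- this is the path used as lib.loc below.
echo "$SPARK_HOME/R/lib"
```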

Then run the R script below, replacing <spark-lib-path> with the directory of the unpacked Spark installation downloaded earlier.

library(SparkR, lib.loc = c(file.path('<spark-lib-path>', 'R', 'lib')))

These steps worked for me when I tried them earlier, using Spark 3.1.2 with Hadoop 2.7.4.
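
With SPARK_HOME set, the library path can also be derived from the environment variable instead of hard-coding it, which avoids the placeholder entirely. A minimal sketch (assuming the steps above put a valid Spark installation at `$SPARK_HOME`):

```r
# Load SparkR from the installation pointed at by SPARK_HOME.
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# Start a local session against that same installation.
sparkR.session(master = "local[*]", sparkHome = Sys.getenv("SPARK_HOME"))

# The original conversion from the question should now succeed:
# df <- as.DataFrame(value1)
```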