koalas for PySpark: why does it try to run Python on the driver machine?


I'm trying to use the koalas library with PySpark 2.4.5.

I have set the HADOOP_CONF_DIR and PYSPARK_PYTHON environment variables and created a Spark session in client deploy mode:

spark = SparkSession.builder.enableHiveSupport().master("yarn").getOrCreate()
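
For context, the full setup looks roughly like this; I set the variables in the shell before launching, but exporting them from Python before the session is created should be equivalent (the HADOOP_CONF_DIR value below is a placeholder):

import os
from pyspark.sql import SparkSession

os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"          # placeholder; points at the cluster config dir
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.7"   # interpreter path on the executor hosts

spark = SparkSession.builder.enableHiveSupport().master("yarn").getOrCreate()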

With this Spark session I created a couple of PySpark DataFrames and converted them to koalas DataFrames (via the .to_koalas() method).
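
A minimal sketch of that step (the table and column names are made up for illustration):

import databricks.koalas as ks  # importing koalas also patches .to_koalas() onto PySpark DataFrames

sdf = spark.sql("SELECT id, amount FROM sales")  # hypothetical Hive table
kdf = sdf.to_koalas()                            # pandas-like DataFrame backed by the same Spark data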

But when I tried to use the .loc[] and .filter(...) operations on my koalas DataFrames, I got the following error:

java.io.IOException: Cannot run program "/usr/local/bin/python3.7": error=2, No such file or directory
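
The calls that triggered it looked roughly like this (column names are again hypothetical):

subset = kdf.loc[kdf["amount"] > 0]        # pandas-style boolean indexing
cols = kdf.filter(items=["id", "amount"])  # pandas-style label filtering
subset.head()                              # an action like this forces evaluation and surfaces the error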

The path "/usr/local/bin/python3.7" is exactly the path to the Python interpreter on my Spark executor hosts (it is the value I set in the PYSPARK_PYTHON environment variable).

After two hours of crying, I tried adding a symlink to the Python interpreter on my driver machine at the same path (/usr/local/bin/python3.7), and it helped: my script stopped failing with this error.
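
For reference, the workaround amounts to a single symlink on the driver host; a sketch in Python (the source path is wherever the driver's real interpreter lives, and writing to /usr/local/bin requires sufficient permissions):

import os

os.symlink("/usr/bin/python3.7", "/usr/local/bin/python3.7")  # hypothetical source path; adjust to your host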

But in the production environment I won't be able to create such symlinks so easily. And I still have a question: why does koalas search for the Python interpreter on the driver using PYSPARK_PYTHON instead of PYSPARK_DRIVER_PYTHON?
