I'm trying to use the Koalas library with PySpark 2.4.5.
I have set the HADOOP_CONF_DIR and PYSPARK_PYTHON environment variables and created a Spark session in client deploy mode:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().master("yarn").getOrCreate()
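For context, the environment was prepared before building the session, roughly like this (the HADOOP_CONF_DIR value is a placeholder; the PYSPARK_PYTHON value is the actual executor interpreter path mentioned below):

import os

# Point Spark at the cluster's Hadoop configuration (placeholder path).
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"
# Interpreter to be used by the Python workers on the executors.
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.7"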
With this Spark session I created a couple of PySpark DataFrames and converted them into Koalas DataFrames via the .to_koalas() method.
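A minimal sketch of that step, with a made-up table name:

import databricks.koalas as ks  # importing koalas attaches .to_koalas() to PySpark DataFrames

sdf = spark.sql("SELECT * FROM some_table")  # hypothetical table
kdf = sdf.to_koalas()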
But when I tried to use the .loc[] and .filter(...) operations on my Koalas DataFrames (sketched after the error below), I got the following error:
java.io.IOException: Cannot run program "/usr/local/bin/python3.7": error=2, No such file or directory
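The failing calls were nothing exotic, roughly like this (the column name and value are made up):

kdf_filtered = kdf.loc[kdf["some_column"] == "some_value"]  # hypothetical column and value
kdf_subset = kdf.filter(items=["some_column"])              # pandas-style column filter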
The path "/usr/local/bin/python3.7" was exactly the path to Python interpreter on my Spark Executor hosts (and it was set in PYSPARK_PYTHON environment variable).
After a couple of hours of debugging I tried adding a symlink to the Python interpreter on my driver machine at the same path, /usr/local/bin/python3.7, and it helped: my script stopped failing with this error.
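The workaround, expressed as code (the symlink source /usr/bin/python3.7 is only a guess at where the real driver interpreter lives; creating the link may also require root privileges):

import os

# Make the executor interpreter path resolve on the driver machine as well.
os.symlink("/usr/bin/python3.7", "/usr/local/bin/python3.7")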
But in the production environment I won't have the option to set up such symlinks so easily. And I still have a question: why does Koalas look for the Python interpreter on the driver using PYSPARK_PYTHON instead of PYSPARK_DRIVER_PYTHON?
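For reference, this is the separation I expected the two variables to provide (the driver path is hypothetical):

import os

os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.7"  # interpreter on the executors
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"   # interpreter on the driver (hypothetical path)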