I am new to PySpark and am trying to create a simple DataFrame from a list of tuples and from a list of dictionaries; both attempts throw the same exception. I have tried creating DataFrames from .csv files using spark.sql and that worked just fine.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("example").getOrCreate()
simpleData1 = [
    (1, 'firstname1', 'lastname1', 'address1', -97),
    (2, 'firstname1', 'lastname1', 'address1', -23),
    (3, 'firstname2', 'lastname2', 'address2', -23),
    (4, 'firstname2', 'lastname2', 'address2', -97)
]

columns = ["id", "name", "last", "address", "bookID"]
books = [-23, -44, -97, -32, -57, -76]

simpleData2 = [
    {"id": 1, "name": 'firstname1', "last": 'lastname1', "address": 'address1', "bookID": -97},
    {"id": 2, "name": 'firstname1', "last": 'lastname1', "address": 'address1', "bookID": -23},
    {"id": 3, "name": 'firstname2', "last": 'lastname2', "address": 'address2', "bookID": -23},
    {"id": 4, "name": 'firstname2', "last": 'lastname2', "address": 'address2', "bookID": -97}
]
# Trying with a list of tuples:
df1 = spark.createDataFrame(simpleData1).toDF(*columns)

# Trying with a list of dictionaries:
df2 = spark.createDataFrame(simpleData2)
df1.show()
df2.show()
Running this code, I get the same error in both cases. Here is the start of the traceback:
Py4JJavaError Traceback (most recent call last)
Cell In[2], line 36
34 df = spark.createDataFrame(simpleData).toDF(*columns)
35 #df = spark.createDataFrame(simpleData2)
---> 36 df.show()
File C:\devTools\Anaconda\Lib\site-packages\pyspark\sql\dataframe.py:899, in DataFrame.show(self, n, truncate, vertical)
893 raise PySparkTypeError(
894 error_class="NOT_BOOL",
895 message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
896 )
898 if isinstance(truncate, bool) and truncate:
--> 899 print(self._jdf.showString(n, 20, vertical))
900 else:
901 try:
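One thing worth noting: the traceback points at df.show() rather than at createDataFrame itself. The DataFrame object is assembled on the driver, and Python worker processes are only started once an action runs, which would also explain why the .csv-based DataFrames (read and displayed entirely by the JVM) work while these fail. A minimal way to see that split, reusing the session above (the comments describe the behavior I would expect on a broken worker setup, not guaranteed output):

df1 = spark.createDataFrame(simpleData1).toDF(*columns)

# Driver-only: the schema was already inferred in the Python driver,
# so this succeeds even when Spark cannot start Python workers.
df1.printSchema()

# show() is an action and launches Python worker processes; on a machine
# where Spark cannot find python.exe, this is where the Py4JJavaError surfaces.
df1.show()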
What I have tried so far:

- I tried installing findspark (wired in as sketched after this list).
- I tried running the same code in the PySpark shell and got the error "Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases". After disabling Python in Manage App Execution Aliases, the error changed to `Cannot run program "python": CreateProcess error=2, The system cannot find the file specified`.
- I tried creating a DataFrame from .csv files in both the PySpark shell and a Jupyter notebook, and the DataFrame gets created in both environments.
- I also tried `df1 = spark.createDataFrame(simpleData1, columns)`, which fails in exactly the same way.
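For completeness, findspark was wired in the usual way before anything else; findspark.init() is the library's real entry point, though where it locates SPARK_HOME is specific to each machine:

import findspark
findspark.init()  # locates SPARK_HOME and puts pyspark on sys.path; must run before importing pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()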
I was finally able to fix my issue. First, I realized that even though I had Anaconda installed, I had also installed Python and Spyder myself. Once I uninstalled my own Spyder installation and removed the Python and Python\Scripts paths from my system PATH, the error message changed to 'Python worker failed to connect back when execute spark action'. Then I added the PYSPARK environment variables mentioned in this link: SparkException: Python worker failed to connect back when execute spark action
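For anyone hitting the same wall: the linked fix amounts to pointing both the Spark driver and its Python workers at one and the same interpreter before the session is created. A minimal sketch of what that looks like, assuming sys.executable is the Anaconda Python that has pyspark installed (the two variable names are the ones Spark actually reads; choosing sys.executable is my setup, not a universal value):

import os
import sys

# Make Spark launch its Python workers with the same interpreter that runs the driver.
# Assumption: sys.executable is the (Anaconda) Python with pyspark installed.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()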