I have a simplified reproduction of an error that sometimes occurs when converting a pandas DataFrame to a pandas-on-Spark DataFrame.
The notebook runs fine when I execute it interactively with the "Run All" button, but the conversion raises an AttributeError whenever it runs as a scheduled job.
In both cases the notebook runs on Databricks 13.3 LTS ML with similar cluster configurations, and the pyspark package version is 3.5.1 in both.
The notebook contains this code:
import pyspark
import pandas as pd
data_dict = {"A": [1,3,2], "B": [3,5,4]}
pandas_df = pd.DataFrame(data=data_dict)
output_pdf = pyspark.pandas.from_pandas(pandas_df)
display(output_pdf)
The error occurs on the `from_pandas` line, but only when the notebook runs as a scheduled workflow job; the same code runs without error when I run the notebook manually with "Run All".
Here is the error I get when running as a scheduled job:
AttributeError: module 'pyspark' has no attribute 'pandas'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File <command-2134319411417476>, line 1
----> 1 output_pdf = pyspark.pandas.from_pandas(pandas_df)
2 display(output_pdf)
AttributeError: module 'pyspark' has no attribute 'pandas'
I tried restarting the cluster to see whether the manually run notebook would then fail too, but it still succeeded.
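My current hypothesis (an assumption on my part, not something I have confirmed on Databricks) is that `import pyspark` alone does not import the `pyspark.pandas` submodule, and that the interactive notebook environment pre-imports it while the job environment does not. In plain Python, importing a package does not bind its submodules as attributes unless the package's `__init__` imports them, which this stdlib-only sketch illustrates with the `xml` package:

```python
import sys

# Simulate a fresh interpreter: drop any cached xml modules so the
# package import below starts from a clean state.
for name in [m for m in list(sys.modules) if m == "xml" or m.startswith("xml.")]:
    del sys.modules[name]

import xml  # package import only; submodules are not imported

before = hasattr(xml, "etree")  # 'etree' is not bound yet

import xml.etree.ElementTree  # explicit submodule import binds xml.etree

after = hasattr(xml, "etree")  # now the attribute exists
print(before, after)
```

If that is what is happening with pyspark, an explicit `import pyspark.pandas as ps` at the top of the notebook should make the scheduled run succeed as well, but I would still like to understand why the two environments behave differently.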
What am I missing?