I have the following code:
dbutils.fs.cp('dbfs:/mnt/loc/PyM.cpython-310-x86_64-linux-gnu.so', 'dbfs:/tmp/simple')
import PyM
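(For context, the driver-side import can only resolve if the directory containing the .so is on sys.path or is the working directory; a minimal sketch of that setup, where the /dbfs/tmp/simple path is an assumption based on the copy destination above:)

import sys

# Assumption: dbfs:/tmp/simple is a directory, visible on the driver via the /dbfs FUSE mount,
# and it contains PyM.cpython-310-x86_64-linux-gnu.so. Putting that directory on sys.path
# (or an equivalent step) is needed before "import PyM" can succeed on the driver.
sys.path.append('/dbfs/tmp/simple')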
and a function:
import pandas as pd

def test(df):
    # build a one-row result from the first row of the incoming pandas DataFrame
    data = {'c': [], 'name': []}
    data['c'].append(df['_c0'].iat[0])
    mc = PyM.PyM(Ln=30, dy=2)
    data['name'].append(f"Module version: {mc.get_build_info()}")
    return pd.DataFrame.from_dict(data, orient='index').transpose()
I have a PySpark DataFrame lines. The following
df=lines.limit(2).toPandas()
df1 = test(df)
works correctly.
However,
dResultAll = lines.groupby('_c0').applyInPandas(test, schema=tSchema)
produces "ModuleNotFoundError: No module named 'PyM'"
I believe this is due to the absence of the binary file PyM.cpython-310-x86_64-linux-gnu.so on the worker nodes.
How do I get this to work?
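(A minimal sketch of how I would confirm that diagnosis, by asking the executors whether they can locate the module at all; the partition count is arbitrary:)

import importlib.util

def can_find_pym(_):
    # runs on an executor: True if Python can locate the PyM module there
    return importlib.util.find_spec("PyM") is not None

# expect a list of False values if the module is indeed missing on the workers
print(sc.parallelize(range(4), 4).map(can_find_pym).collect())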
I have tried the approach from https://docs.databricks.com/en/_extras/notebooks/source/kb/python/run-c-plus-plus-python.html:
import os
import shutil
num_worker_nodes = 1

def copyFile(filepath):
    # copy the file out of DBFS onto the node's local disk and make it executable
    shutil.copyfile("/dbfs%s" % filepath, filepath)
    os.system("chmod u+x %s" % filepath)

sc.parallelize(range(0, 2 * (1 + num_worker_nodes))).map(lambda s: copyFile("/tmp/simple")).count()
in the hope of getting the binary file onto the worker nodes, but that did not fix the issue.
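I suspect that even with this copy in place the Python workers still cannot resolve the module, since copyFile("/tmp/simple") writes the bytes to /tmp/simple rather than to the original .so file name, and /tmp is not on sys.path inside the executors anyway. A minimal sketch of the kind of worker-side setup I would expect to be needed (the helper name, the /tmp location, and keeping the original file name are all assumptions):

import sys

def ensure_pym_on_worker():
    # Assumption: the extension has been copied to every worker as
    # /tmp/PyM.cpython-310-x86_64-linux-gnu.so, i.e. keeping its original file name
    # so that Python can map "import PyM" onto it once /tmp is on sys.path.
    if '/tmp' not in sys.path:
        sys.path.append('/tmp')
    import PyM
    return PyM

test() would then call ensure_pym_on_worker() at the top instead of relying on the driver-side import.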