Spark: Provide input data as file to external executable


Using (py)Spark, is there an efficient way to provide input data to an external executable as a file?

What I currently do is:

import subprocess

listOfPaths = ["path1.bin", "path2.bin", ...]
rdd_paths = spark_context.parallelize(listOfPaths)

def downloadAndRun(path):
    # Download the remote file to the worker's local disk, then run the executable on it.
    localFilePath = downloadFile(path)
    subprocess.run(["executable.exe", "--input", localFilePath], check=True)

# foreach is an action, so the work is actually executed (map alone is lazy)
rdd_paths.foreach(downloadAndRun)

But this downloads the file from an accessible location to the worker node each time, stores it locally, and then runs the application on it. This works, but it is inefficient.

Since I need to run many iterations of this kind, I would rather keep the input data inside an RDD itself and hand it to the third-party executable in the map stage, ideally directly out of memory. Sadly, I cannot change the executable itself; it only accepts data via its --input argument.
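
For the first part, getting the raw content into an RDD seems doable with PySpark's binaryFiles, which yields (path, bytes) pairs; the path pattern below is just a placeholder:

# Read whole files as (path, content) records so the bytes live in the RDD
# instead of being re-downloaded on every iteration.
rdd_content = spark_context.binaryFiles("hdfs:///some/input/dir/*.bin")
# each record: (file_path, file_content_as_bytes)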

Storing the input data inside an RDD might be possible using a byte stream or something similar, but how do I then provide it to the executable? Is this possible without dumping the byte stream to a local file each time? I imagine something like a RAM disk for exposing the data as a "virtual" file.
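
As a rough sketch of what I have in mind (assuming Linux workers with tmpfs mounted at /dev/shm; executable.exe and --input are from above, everything else is made up for illustration):

import subprocess
import tempfile

def runOnBytes(record):
    path, content = record  # (file_path, file_content_as_bytes) from binaryFiles
    # /dev/shm is RAM-backed on most Linux systems, so the "file" never
    # touches disk; it is deleted again when the context manager exits.
    with tempfile.NamedTemporaryFile(dir="/dev/shm", suffix=".bin") as f:
        f.write(content)
        f.flush()
        subprocess.run(["executable.exe", "--input", f.name], check=True)

rdd_content.foreach(runOnBytes)

But maybe there is a cleaner way than manually managing temp files like this?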

Any hints or suggestions are highly appreciated. Thanks!
