PySpark: running embarrassingly parallel jobs


So I am trying to run 1000+ embarrassingly parallel jobs using PySpark on a cluster. I instantiate 5 executors, each with 20 cores, so as far as I understand the cluster should be able to execute 100 jobs concurrently.
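For reference, here is a minimal sketch of how that executor layout might be requested when building the context. The config keys are standard Spark settings, but whether spark.executor.instances is honoured depends on your cluster manager, and the app name is made up:

from pyspark import SparkConf, SparkContext

# Sketch only: request 5 executors with 20 cores each, i.e. 100 task slots.
conf = (
    SparkConf()
    .setAppName("parallel-jobs")            # hypothetical app name
    .set("spark.executor.instances", "5")   # 5 executors
    .set("spark.executor.cores", "20")      # 20 cores per executor
)
sc = SparkContext(conf=conf)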

This is what I have so far, with "values" containing tuples, each tuple being one job:

values = [(something, something_else),..., (howdy, partner)]  # one tuple per job
rdd_values = sc.parallelize(values)           # distribute the job list
results = rdd_values.map(my_func).collect()   # run my_func on each tuple, gather results

A few questions:

  1. Is this really the recommended way of doing this in PySpark?
  2. The Spark UI is really unhelpful when I run more jobs than I have cores available. What am I missing?
  3. Why do people use PySpark? It seems so cumbersome.

I know this is probably not what PySpark is best suited for, but it is the tool I have available.

Rant

Furthermore, it's really frustrating that there is no obvious way of getting a simple progress bar out of this (or is there?). The Spark UI only helps so much.
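One possible workaround, sketched here under the assumption that sc, rdd_values and my_func are the names from the snippet above: run the blocking collect() in a background thread and poll PySpark's status tracker from the driver, which reports per-stage task counts.

import threading
import time

# Run the blocking collect() in a background thread so the driver can poll progress.
result_holder = {}

def run_job():
    result_holder["results"] = rdd_values.map(my_func).collect()

worker = threading.Thread(target=run_job)
worker.start()

while worker.is_alive():
    for stage_id in sc.statusTracker().getActiveStageIds():
        info = sc.statusTracker().getStageInfo(stage_id)
        if info is not None:
            print("stage %d: %d/%d tasks done"
                  % (stage_id, info.numCompletedTasks, info.numTasks))
    time.sleep(5)

worker.join()
results = result_holder["results"]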

There is 1 answer below

Marcus Therkildsen

Turns out that, with my expected behaviour being "I want to run 1000 jobs, 100 at a time", this seems to work:

values = [(something, something_else),..., (howdy, partner)]  # one tuple per job
rdd_values = sc.parallelize(values, numSlices=len(values))    # one partition per job
results = rdd_values.map(my_func).collect()

The difference is the added "numSlices" argument, which creates one partition per job so the scheduler hands tasks to the 100 available cores as they free up.
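A quick way to sanity-check the partitioning (a sketch, reusing the same names as above):

rdd_values = sc.parallelize(values, numSlices=len(values))
print(rdd_values.getNumPartitions())  # one partition per job, e.g. 1000
print(sc.defaultParallelism)          # the partition count you would get without numSlices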

Thanks for your suggestions.