PySpark: running embarrassingly parallel jobs


So I am trying to run 1000+ embarrassingly parallel jobs using PySpark on a cluster. I instantiate 5 executors, each with 20 cores, so as far as I understand the cluster should be able to execute 100 jobs concurrently.
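For reference, here is a minimal sketch of how that executor layout might be requested when building the context. The config keys are standard Spark settings, but whether spark.executor.instances is honoured depends on your cluster manager, and the app name is made up:

from pyspark import SparkConf, SparkContext

# Sketch only: request 5 executors with 20 cores each, i.e. 100 task slots.
conf = (
    SparkConf()
    .setAppName("parallel-jobs")            # hypothetical app name
    .set("spark.executor.instances", "5")   # 5 executors
    .set("spark.executor.cores", "20")      # 20 cores per executor
)
sc = SparkContext(conf=conf)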

This is what I have so far, with "values" containing tuples, each tuple being one job:

values = [(something, something_else),..., (howdy, partner)]  # one tuple per job
rdd_values = sc.parallelize(values)           # distribute the job list
results = rdd_values.map(my_func).collect()   # run my_func on each tuple, gather results

A few questions:

  1. Is this really the recommended way of doing this in PySpark?
  2. The Spark UI is really unhelpful when I run more jobs than I have cores available. What am I missing?
  3. Why do people use PySpark? It seems so cumbersome.

I know this is probably not what PySpark is best suited for, but it is the tool I have available.

Rant

Furthermore, it's really frustrating that there is no obvious way of getting a simple progress bar out of this (or is there?). The Spark UI only helps so much.
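One possible workaround, sketched here under the assumption that sc, rdd_values and my_func are the names from the snippet above: run the blocking collect() in a background thread and poll PySpark's status tracker from the driver, which reports per-stage task counts.

import threading
import time

# Run the blocking collect() in a background thread so the driver can poll progress.
result_holder = {}

def run_job():
    result_holder["results"] = rdd_values.map(my_func).collect()

worker = threading.Thread(target=run_job)
worker.start()

while worker.is_alive():
    for stage_id in sc.statusTracker().getActiveStageIds():
        info = sc.statusTracker().getStageInfo(stage_id)
        if info is not None:
            print("stage %d: %d/%d tasks done"
                  % (stage_id, info.numCompletedTasks, info.numTasks))
    time.sleep(5)

worker.join()
results = result_holder["results"]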

There is 1 answer below

Marcus Therkildsen

Turns out that, with my expected behaviour being "I want to run 1000 jobs, 100 at a time", this seems to work:

values = [(something, something_else),..., (howdy, partner)]  # one tuple per job
rdd_values = sc.parallelize(values, numSlices=len(values))    # one partition per job
results = rdd_values.map(my_func).collect()

The difference is the added "numSlices" argument, which creates one partition per job so the scheduler hands tasks to the 100 available cores as they free up.
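A quick way to sanity-check the partitioning (a sketch, reusing the same names as above):

rdd_values = sc.parallelize(values, numSlices=len(values))
print(rdd_values.getNumPartitions())  # one partition per job, e.g. 1000
print(sc.defaultParallelism)          # the partition count you would get without numSlices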

Thanks for your suggestions.