EMR on EC2: How does Spark decide on the number of executors it needs to spawn?


I am trying to reason about Spark's default behavior here. Here's my scenario:

  • I am submitting a Spark job on a cluster with one master and 10 core/slave nodes.
  • These core nodes have been randomly selected based on their spot availability and price.
  • Altogether, they might have 10 or 1000 cores and an arbitrary amount of memory.
  • The default setting for cores per executor (4 cores per executor) is untouched, and --num-executors is not set on the spark-submit command.
  • Once I submit the job and it starts running I can see that a number of executors are spawned.
  • This number might be equal to the number of slave instances but it's usually larger.

I want to understand the default logic behind the number of executors. If the user has not explicitly defined the number of executors, what formula does Spark use to come up with this number?
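For concreteness, a minimal sketch of the kind of submission I mean (my_job.py is a placeholder, not an actual file from my setup):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      my_job.py
    # note: no --num-executors and no spark.executor.instances anywhere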

There is an abundance of literature out there on how to come up with an optimal number of executors yourself and provide it as a Spark config, but I cannot find anything explaining the default behavior.

Thanks a lot.

1 Answer

Answered by Sajjan Bhattarai:

The number of executors in a Spark application depends on whether dynamic allocation is enabled. With static allocation, it is controlled by spark.executor.instances (default 2) or the equivalent --num-executors flag.
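As a sketch, this is what a purely static submission looks like (my_job.py is a placeholder; spark.dynamicAllocation.enabled is pinned to false here because EMR turns it on by default):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.dynamicAllocation.enabled=false \
      --num-executors 10 \
      my_job.py
    # --num-executors 10 is shorthand for --conf spark.executor.instances=10;
    # if neither is given, YARN mode falls back to the default of 2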

With dynamic allocation enabled (spark.dynamicAllocation.enabled=true), the initial number of executors is determined by spark.dynamicAllocation.initialExecutors, which defaults to spark.dynamicAllocation.minExecutors (itself defaulting to 0). If --num-executors (or spark.executor.instances) is set and is larger than this value, it is used as the initial number of executors instead [1]. From there, the executor count scales between spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors based on the workload. Note that EMR enables dynamic allocation by default in its Spark configuration, which is why the number of executors you observe can exceed the number of core instances.
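And a sketch with dynamic allocation and explicit bounds (the numbers here are arbitrary examples, not recommendations):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.initialExecutors=4 \
      --conf spark.dynamicAllocation.maxExecutors=40 \
      my_job.py
    # starts with 4 executors (or spark.executor.instances, if set and larger),
    # then scales between 2 and 40 based on the backlog of pending tasks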

[1] https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation