I am trying to run several EMR steps in parallel. I have seen other questions on this topic on SO, and I have also googled the options. Things I have tried so far:
- Configuring the CapacityScheduler with a set of queues
- Configuring the FairScheduler
- Using AWS Data Pipeline with PARALLEL_FAIR_SCHEDULING and PARALLEL_CAPACITY_SCHEDULING
None of this worked for me: YARN created all of the queues properly and the jobs were submitted to different queues, but EMR still ran only a single step at a time (one step was RUNNING while the rest stayed PENDING). A sketch of the kind of queue configuration I mean is shown below.
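To illustrate the queue setup, here is a minimal sketch of a capacity-scheduler configuration classification; the queue names and capacities are placeholders, not my exact values:

```python
# Illustrative sketch: an EMR "capacity-scheduler" configuration classification
# that defines two YARN queues. Queue names and capacities are placeholders.
capacity_scheduler_config = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            "yarn.scheduler.capacity.root.queues": "default,parallel",
            "yarn.scheduler.capacity.root.default.capacity": "50",
            "yarn.scheduler.capacity.root.parallel.capacity": "50",
        },
    }
]
# This would be passed as the Configurations parameter when the cluster is
# created, e.g. boto3's emr.run_job_flow(..., Configurations=capacity_scheduler_config).
```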
I also saw in one of the answers that steps are meant to run sequentially, but that you can put several jobs inside a single step. I did not manage to find a way to do this, and there is no option for it in the UI.
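For illustration, the kind of "several jobs in one step" I was hoping for would presumably be a single step whose command launches several spark-submit processes itself. This is an untested sketch; the S3 paths are placeholders:

```python
# Untested sketch: one EMR step that launches two Spark applications from a
# single shell command via command-runner.jar. S3 paths are placeholders.
multi_job_step = {
    "Name": "two-spark-jobs-in-one-step",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "bash", "-c",
            "spark-submit --deploy-mode cluster s3://my-bucket/job_a.py & "
            "spark-submit --deploy-mode cluster s3://my-bucket/job_b.py & "
            "wait",
        ],
    },
}
```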
I have not tried submitting jobs to the YARN cluster directly (Submit Hadoop Jobs Interactively); I wanted to submit the jobs through the AWS API, and I have not found a way to do this from the API.
This is my configuration for the CapacityScheduler: [CapacityScheduler configuration screenshot]
This is my steps configuration: [StepsConfiguration screenshot]
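In case the screenshots do not render, the kind of API call I have in mind is boto3's add_job_flow_steps. This is a sketch with placeholder values (cluster id, S3 path, queue name), not my exact configuration:

```python
import boto3

# Sketch: submitting a Spark step to a specific YARN queue through the AWS API.
# The cluster id, S3 path and queue name are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "spark-job-on-parallel-queue",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--queue", "parallel",  # target YARN queue from the scheduler config
                    "s3://my-bucket/job.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])
```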
This might be late, but I hope it is helpful.
Spark provides an option that specifies whether the caller (the step) waits for the Spark application to complete after submission. If you set this value to false, the EMR step submits the application and returns immediately:
spark.yarn.submit.waitAppCompletion: "false"
Because each step finishes as soon as its application has been submitted, the next step can start while the previous applications are still running on YARN.
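Assuming the steps are Spark steps launched with spark-submit, here is a sketch of two ways to set this property; the application path is a placeholder:

```python
# Sketch: two ways to set spark.yarn.submit.waitAppCompletion to "false".

# 1) Cluster-wide, via the "spark-defaults" EMR configuration classification
#    supplied when the cluster is created.
spark_defaults = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.yarn.submit.waitAppCompletion": "false"},
    }
]

# 2) Per step, by adding --conf to the step's spark-submit arguments.
step_args = [
    "spark-submit",
    "--deploy-mode", "cluster",
    "--conf", "spark.yarn.submit.waitAppCompletion=false",
    "s3://my-bucket/job.py",  # placeholder application path
]
```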