I'm working on a research project where I've deployed a Kubernetes Job designed to generate specific CPU and memory loads. Each pod requests 0.5 CPU and 500Mi of memory, and I run 20 copies of the job in parallel by setting parallelism to 20. Since my cluster can only accommodate roughly 15 of these jobs concurrently, I anticipate that 15 jobs will complete successfully, while the remaining 5 should fail or remain in a pending state due to resource limitations.
The issue is that the scheduler puts some of these pods into the Pending state and launches them whenever other jobs finish. That doesn't match my project requirements: I need the scheduler to attempt an initial scheduling pass over all jobs and directly fail the ones that cannot be accommodated immediately due to resource constraints, rather than delaying them, so that I can report the number of successful and failed jobs.
Below is the Job YAML file:
apiVersion: batch/v1
kind: Job
metadata:
  name: stress-job
spec:
  parallelism: 20
  template:
    metadata:
      name: stress-job
    spec:
      containers:
      - name: stress-app
        image: annis99/stress-app:v1.1
        imagePullPolicy: Always
        ports:
        - containerPort: 8081
        resources:
          requests:
            cpu: 500m
            memory: 500Mi
          limits:
            cpu: 500m
            memory: 600Mi
      restartPolicy: Never
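For reference, this is roughly how I observe the behavior and count outcomes today (standard kubectl commands; the job-name label is added to the pods automatically by the Job controller):

  # List the pods created by this Job and their phases (Running / Pending / Failed)
  kubectl get pods -l job-name=stress-job

  # Read the succeeded and failed counters from the Job status
  kubectl get job stress-job -o jsonpath='{.status.succeeded} {.status.failed}'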
By setting parallelism to the total number of jobs (20 in your case), the scheduler will attempt to schedule all of them at once. If there are insufficient resources available to accommodate all jobs, the scheduler will immediately fail the jobs that cannot be scheduled, allowing you to report the number of successful and failed jobs. You can also set spec.completions to the overall number of tasks and spec.completionMode in your YAML file to retrieve the per-index status (completedIndexes and failedIndexes).
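As a minimal sketch of that suggestion (not a tested configuration): an Indexed Job with the same resource requests and completions set to the overall task count. Note that failedIndexes only appears in the Job status when backoffLimitPerIndex is also set, which assumes a reasonably recent Kubernetes version (1.28+); that field is my addition, not part of the original answer.

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: stress-job
  spec:
    parallelism: 20
    completions: 20            # overall number of tasks
    completionMode: Indexed    # gives each pod a completion index
    backoffLimitPerIndex: 0    # assumption: needed for .status.failedIndexes (Kubernetes 1.28+)
    template:
      spec:
        containers:
        - name: stress-app
          image: annis99/stress-app:v1.1
          resources:
            requests:
              cpu: 500m
              memory: 500Mi
            limits:
              cpu: 500m
              memory: 600Mi
        restartPolicy: Never

The per-index status can then be read back with, for example:

  kubectl get job stress-job -o jsonpath='{.status.completedIndexes} {.status.failedIndexes}'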