One of the SGE job was running slow and killed by qmaster to enforce the h_rt=1200.
Is that possible SGE admin dynamically change the setting to make the job(id=2771780) running slow? If yes, what could be the setting to do so? If not, what could cause this?
qname test.q
hostname abc
group domain
owner jenkins
project NONE
department defaultdepartment
jobname top
jobnumber 2771780
taskid undefined
account sge
priority 0
qsub_time Mon Dec 20 11:46:06 2021
start_time Mon Dec 20 11:46:07 2021
end_time Mon Dec 20 12:06:08 2021
granted_pe NONE
slots 1
failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status 137 (Killed)
ru_wallclock 1201s
ru_utime 0.088s
ru_stime 8.797s
ru_maxrss 5.559KB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 23574
ru_majflt 0
ru_nswap 0
ru_inblock 128
ru_oublock 240
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 24156
ru_nivcsw 66
cpu 1454.650s
mem 54.658GBs
io 495.010GB
iow 0.000s
maxvmem 1014.082MB
arid undefined
ar_sub_time undefined
category -U arusers,digital -q test.q -l h_rt=1200
If you are saying that usually the job finishes in 1200s, but ran slowly on this particular occasion, this could be for various external factors such as contention for storage or network bandwidth. You may have also landed on a different compute node type that had slower CPU. An SGE admin can change various resource settings before the job starts executing such as the number of cores, but the more likely issue is contention for storage/io or even throttled cpu for thermal reasons.