SGE Setting to Slow Down Specific Job


One of our SGE jobs ran slowly and was killed by qmaster to enforce h_rt=1200.

Is it possible for an SGE admin to dynamically change a setting so that a job (id=2771780) runs slowly? If so, which setting would do that? If not, what else could cause this?

qname        test.q        
hostname     abc     
group        domain              
owner        jenkins             
project      NONE                
department   defaultdepartment   
jobname      top                 
jobnumber    2771780             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Mon Dec 20 11:46:06 2021
start_time   Mon Dec 20 11:46:07 2021
end_time     Mon Dec 20 12:06:08 2021
granted_pe   NONE                
slots        1                   
failed       37  : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status  137                  (Killed)
ru_wallclock 1201s
ru_utime     0.088s
ru_stime     8.797s
ru_maxrss    5.559KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    23574               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   128                 
ru_oublock   240                 
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     24156               
ru_nivcsw    66                  
cpu          1454.650s
mem          54.658GBs
io           495.010GB
iow          0.000s
maxvmem      1014.082MB
arid         undefined
ar_sub_time  undefined
category     -U arusers,digital -q test.q -l h_rt=1200

Answer by Simon B

If you mean that the job normally finishes within 1200 s but ran slowly on this occasion, external factors are the likely cause, such as contention for storage or network bandwidth. The job may also have landed on a different type of compute node with a slower CPU. An SGE admin can change various resource settings before a job starts executing, such as the number of cores, but the more likely culprit is contention for storage or I/O, or a CPU throttled for thermal reasons.
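One way to narrow this down is to compare the accounting records of a slow run and a normal run of the same job. The following is a minimal Python sketch, not an SGE tool: the helper names (`parse_qacct`, `seconds`, `summarize`) are hypothetical, and `SAMPLE` is just a few lines copied from the record in the question. It assumes the input is the plain "key value" text printed by `qacct -j <job_id>`.

```python
def parse_qacct(text):
    """Parse `qacct -j` style 'key value' lines into a dict of strings."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def seconds(value):
    """Convert a value such as '1201s' (or '1201') to float seconds."""
    return float(value.rstrip("s"))


def summarize(record, h_rt):
    """Report whether the wallclock limit was reached and how busy the CPU was."""
    wall = seconds(record["ru_wallclock"])
    cpu = seconds(record["cpu"])
    slots = int(record["slots"])
    return {
        "hit_h_rt": wall >= h_rt,
        # CPU seconds consumed per slot per wallclock second
        "cpu_per_slot_ratio": cpu / (wall * slots),
        "failed": record.get("failed", ""),
    }


# A few fields copied from the accounting record in the question.
SAMPLE = """\
jobnumber    2771780
slots        1
failed       37  : qmaster enforced h_rt, h_cpu, or h_vmem limit
ru_wallclock 1201s
cpu          1454.650s
io           495.010GB
"""

if __name__ == "__main__":
    rec = parse_qacct(SAMPLE)
    print(summarize(rec, h_rt=1200))
```

As a rough reading, a ratio well below 1 would suggest the job spent much of its time waiting (for example on storage or network) rather than computing, while a ratio near or above 1 suggests it was busy on the CPU for the whole run. Running the same summary on a run that finished in time, and comparing the `hostname`, `cpu` and `io` fields, may show whether the node or the I/O volume differed.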