OGS/SGE np_load_avg Not Decaying

I'm using Open Grid Scheduler (OGS), which was forked from SGE in the early 2010s. The np_load_avg value is failing to decay, even after jobs have completed or been killed. It builds up until it exceeds the suspend_threshold, which puts my queue in state 'a' and prevents pending jobs from starting. I can work around this by modifying the queue to raise the threshold, or by deleting the threshold entirely. However, raising the threshold was only a temporary fix, because the load caught up again, and I am hesitant to delete the threshold since that may invite disaster down the road. I've never seen the load build up like this before, so I think something has gone wrong.

I'm also seeing the same number as np_load_avg under CQLOAD, so I think the two are related, but I'm not familiar enough with the scheduler to know why.
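For reference, here is a minimal sketch of how I understand these numbers to be related. It assumes that np_load_avg is the host's load average divided by its processor count, and that CQLOAD is the mean of np_load_avg over the hosts in the cluster queue that report a value. The function names, the 1.75 threshold, and the `>=` comparison are my own illustrative assumptions, not taken from the OGS source.

```python
import os
from typing import Iterable, Optional


def np_load_avg(load_avg: float, num_proc: int) -> float:
    """Normalize a load average by processor count (assumed definition)."""
    if num_proc <= 0:
        raise ValueError("num_proc must be positive")
    return load_avg / num_proc


def exceeds_threshold(np_load: float, threshold: float) -> bool:
    """Return True if the normalized load is at or above the threshold.

    The >= comparison is an assumption for illustration only.
    """
    return np_load >= threshold


def cqload(np_loads: Iterable[Optional[float]]) -> Optional[float]:
    """Average np_load_avg over the hosts that report a value (assumed definition)."""
    values = [v for v in np_loads if v is not None]
    if not values:
        return None
    return sum(values) / len(values)


if __name__ == "__main__":
    # Compare the OS view of the load on this host with a hypothetical threshold.
    # os.getloadavg() returns the 1, 5 and 15 minute load averages on Unix.
    five_min = os.getloadavg()[1]
    local = np_load_avg(five_min, os.cpu_count() or 1)
    print(f"local normalized 5-minute load: {local:.2f}")
    print("over hypothetical threshold 1.75:", exceeds_threshold(local, 1.75))
```

If the value computed on the execution host also stays high with no jobs running, the operating system's own load average is what is stuck, which would point at the host rather than at the scheduler. If the local value has dropped but the scheduler still reports the old number, the reported value may simply be stale.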

How do I manually clear the np_load_avg / CQLOAD when the decay stops working?
