I wrote a Python script to gather the cumulative memory-seconds and vcore-seconds allocated to a (Spark) application.
The script polls yarn application -status <application_id> every n seconds, parses the output, and graphs the results.
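To make the setup concrete, here is a simplified sketch of the polling loop (not my exact script; the application id is a placeholder, and the regex assumes the "Aggregate Resource Allocation : N MB-seconds, M vcore-seconds" line that recent Hadoop versions print, which may differ on your version):

```python
import re
import subprocess
import time

APP_ID = "application_1234567890123_0001"  # placeholder application id
POLL_INTERVAL = 30                          # seconds between polls

# Matches the allocation line printed by `yarn application -status`,
# e.g. "Aggregate Resource Allocation : 12345 MB-seconds, 67 vcore-seconds"
# (label/format assumed; may vary by Hadoop version)
ALLOC_RE = re.compile(
    r"Aggregate Resource Allocation\s*:\s*(\d+)\s*MB-seconds,\s*(\d+)\s*vcore-seconds"
)

samples = []  # (timestamp, mb_seconds, vcore_seconds)

while True:
    out = subprocess.run(
        ["yarn", "application", "-status", APP_ID],
        capture_output=True, text=True, check=True,
    ).stdout

    match = ALLOC_RE.search(out)
    if match:
        mb_seconds, vcore_seconds = int(match.group(1)), int(match.group(2))
        samples.append((time.time(), mb_seconds, vcore_seconds))
        print(f"{mb_seconds} MB-seconds, {vcore_seconds} vcore-seconds")

    # Stop polling once the application has reached a terminal state.
    if any(s in out for s in ("State : FINISHED", "State : FAILED", "State : KILLED")):
        break

    time.sleep(POLL_INTERVAL)
```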
In a lower environment with few other jobs running, I see the memory-seconds and vcore-seconds increase steadily for the duration of the application.
In a higher environment with many more jobs running, I see the cumulative memory-seconds and vcore-seconds regularly drop back to zero. When graphed, both metrics hover around zero with a few large spikes.
The job takes about the same time to run in both environments.
I can only assume that either:
- Polling the YARN CLI is not a reliable way to gather application stats.
- Something else is going on, such as the cumulative stats being reset when YARN withdraws resources or pauses execution. However, if YARN were withdrawing resources or pausing execution, I would expect the job to take longer, which was not the case.
Has anyone seen this before? Any help would be appreciated.