I have employee data like below. I want to group the below data by EMP_ID and if 'Status' of this grouped EMP_ID has the value 'Not Done' then entire 'overall_status' for the EMP_ID should be marked as 'Not Done'. How can I achieve this in Dataframe or SparkSql?
Input:
| EMP_ID | Status |
|---|---|
| 1 | Done |
| 1 | Not Done |
| 1 | Done |
| 2 | Done |
| 2 | Done |
Expected Output:
| EMP_ID | Status | overall_status |
|---|---|---|
| 1 | Done | Not Done |
| 1 | Not Done | Not Done |
| 1 | Done | Not Done |
| 2 | Done | Done |
| 2 | Done | Done |
I have tried to solve it using a simple strategy.
I have done a groupby on
EMP_IDand collected distinct values of theStatuscolumns.Then I created a
overall_statuscolumn based on the fact if the distinct values containNot Donethen that column will have valueNot DoneelseDone.Then joined this created dataframe with the original on
EMP_IDOutput: