PySpark: Find specific value in a grouped data and mark entire group as different value

56 Views Asked by At

I have employee data like below. I want to group the below data by EMP_ID and if 'Status' of this grouped EMP_ID has the value 'Not Done' then entire 'overall_status' for the EMP_ID should be marked as 'Not Done'. How can I achieve this in Dataframe or SparkSql?

Input:

EMP_ID Status
1 Done
1 Not Done
1 Done
2 Done
2 Done

Expected Output:

EMP_ID Status overall_status
1 Done Not Done
1 Not Done Not Done
1 Done Not Done
2 Done Done
2 Done Done
1

There are 1 best solutions below

0
user238607 On BEST ANSWER

I have tried to solve it using a simple strategy.

I have done a groupby on EMP_ID and collected distinct values of the Status columns.

Then I created a overall_status column based on the fact if the distinct values contain Not Done then that column will have value Not Done else Done.

Then joined this created dataframe with the original on EMP_ID

import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.functions import udf


sc = SparkContext('local')
sqlContext = SQLContext(sc)


data1 = [

["1",   "Done"],
["1",   "Not Done"],
["1",   "Done"],
["2",   "Done"],
["2",   "Done"],
      ]

df1Columns = ["EMP_ID", "Status"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)


df1.show(n=100, truncate=False)


df1_unique_values = df1.groupby("EMP_ID").agg(F.collect_set( F.col("Status")).alias("distinct_status")) \
                        .withColumn("overall_status", F.when(  F.array_contains( F.col("distinct_status"), "Not Done"), "Not Done").otherwise("Done")  ).drop("distinct_status")

df1_unique_values.show(n=100, truncate=False)


df1_final =  df1.join(df1_unique_values, on=["EMP_ID"])


df1_final.show(n=100, truncate=False)

Output:

+------+--------+
|EMP_ID|Status  |
+------+--------+
|1     |Done    |
|1     |Not Done|
|1     |Done    |
|2     |Done    |
|2     |Done    |
+------+--------+

+------+--------------+
|EMP_ID|overall_status|
+------+--------------+
|1     |Not Done      |
|2     |Done          |
+------+--------------+

+------+--------+--------------+
|EMP_ID|Status  |overall_status|
+------+--------+--------------+
|1     |Done    |Not Done      |
|1     |Not Done|Not Done      |
|1     |Done    |Not Done      |
|2     |Done    |Done          |
|2     |Done    |Done          |
+------+--------+--------------+