PySpark: Find specific value in a grouped data and mark entire group as different value

56 Views Asked by sanju At 14 September 2023 at 17:54

I have employee data like below. I want to group the below data by EMP_ID and if 'Status' of this grouped EMP_ID has the value 'Not Done' then entire 'overall_status' for the EMP_ID should be marked as 'Not Done'. How can I achieve this in Dataframe or SparkSql?

Input:

EMP_ID	Status
1	Done
1	Not Done
1	Done
2	Done
2	Done

Expected Output:

EMP_ID	Status	overall_status
1	Done	Not Done
1	Not Done	Not Done
1	Done	Not Done
2	Done	Done
2	Done	Done

Original Q&A

There are 1 best solutions below

user238607 On 14 September 2023 at 18:25 BEST ANSWER

I have tried to solve it using a simple strategy.

I have done a groupby on EMP_ID and collected distinct values of the Status columns.

Then I created a overall_status column based on the fact if the distinct values contain Not Done then that column will have value Not Done else Done.

Then joined this created dataframe with the original on EMP_ID

import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.functions import udf


sc = SparkContext('local')
sqlContext = SQLContext(sc)


data1 = [

["1",   "Done"],
["1",   "Not Done"],
["1",   "Done"],
["2",   "Done"],
["2",   "Done"],
      ]

df1Columns = ["EMP_ID", "Status"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)


df1.show(n=100, truncate=False)


df1_unique_values = df1.groupby("EMP_ID").agg(F.collect_set( F.col("Status")).alias("distinct_status")) \
                        .withColumn("overall_status", F.when(  F.array_contains( F.col("distinct_status"), "Not Done"), "Not Done").otherwise("Done")  ).drop("distinct_status")

df1_unique_values.show(n=100, truncate=False)


df1_final =  df1.join(df1_unique_values, on=["EMP_ID"])


df1_final.show(n=100, truncate=False)

Output:

+------+--------+
|EMP_ID|Status  |
+------+--------+
|1     |Done    |
|1     |Not Done|
|1     |Done    |
|2     |Done    |
|2     |Done    |
+------+--------+

+------+--------------+
|EMP_ID|overall_status|
+------+--------------+
|1     |Not Done      |
|2     |Done          |
+------+--------------+

+------+--------+--------------+
|EMP_ID|Status  |overall_status|
+------+--------+--------------+
|1     |Done    |Not Done      |
|1     |Not Done|Not Done      |
|1     |Done    |Not Done      |
|2     |Done    |Done          |
|2     |Done    |Done          |
+------+--------+--------------+

PySpark: Find specific value in a grouped data and mark entire group as different value

There are 1 best solutions below

Related Questions in APACHE-SPARK

Related Questions in PYSPARK

Related Questions in APACHE-SPARK-SQL

Related Questions in AWS-GLUE

Related Questions in APACHE-SPARK-DATASET

Trending Questions

Popular # Hahtags

Popular Questions