Count of rows in a PySpark dataframe over a window


I have a PySpark dataframe with the data below.

The columns #, A, B, and C are my data; the 'Actual' column shows what my code currently produces and 'Expecting' shows the result I want.

My code:

from pyspark.sql import Window, functions as F
from pyspark.sql.functions import col
W = Window.partitionBy("A").orderBy(col("C"))
main_df = main_df.withColumn("cnt", F.count("B").over(W))

Is there something wrong with how I have used the count function? What can I do so the values in column 'Actual' match those in 'Expecting'? I see two issues with my output:

  1. the count starts at 1 when it should start at 0
  2. within each group, the last count is assigned instead of the first

Can anyone help me resolve this issue?


Accepted answer (by notNull):

Try the dense_rank window function and subtract 1 from the dense_rank value.

Example:

from pyspark.sql import Window
from pyspark.sql.functions import col, to_date, dense_rank

df = spark.createDataFrame([('0','king','4/30/2022'),('1','king','5/17/2022'),('2','king','5/26/2022'),('3','king','6/5/2022'),('4','king','7/8/2022'),('5','king','7/9/2022'),('6','king','4/21/2023'),('7','king','4/21/2023'),('8','king','4/22/2023'),('9','king','8/22/2023')],['#','A','C'])
W = Window.partitionBy("A").orderBy(col("C"))

df.withColumn("c", to_date(col("C"), "M/dd/yyyy")).\
  withColumn("Expecting", dense_rank().over(W) - 1).show()
#+---+----+----------+---------+
#|  #|   A|         c|Expecting|
#+---+----+----------+---------+
#|  0|king|2022-04-30|        0|
#|  1|king|2022-05-17|        1|
#|  2|king|2022-05-26|        2|
#|  3|king|2022-06-05|        3|
#|  4|king|2022-07-08|        4|
#|  5|king|2022-07-09|        5|
#|  6|king|2023-04-21|        6|
#|  7|king|2023-04-21|        6|
#|  8|king|2023-04-22|        7|
#|  9|king|2023-08-22|        8|
#+---+----+----------+---------+
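
As a side note, here is a minimal sketch illustrating why the original F.count("B").over(W) behaves the way the question describes (the data below is made up to mirror the question's columns, not taken from the post). With an orderBy and no explicit frame, Spark defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the count is a running count that includes the current row (hence it starts at 1), and rows tied on the order column all receive the count of their whole peer group (hence the "last" count shows up on every tied row):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical data shaped like the question's columns #, A, B, C
df = spark.createDataFrame(
    [('0', 'king', 'x', '2022-04-30'),
     ('1', 'king', 'y', '2023-04-21'),
     ('2', 'king', 'z', '2023-04-21')],
    ['#', 'A', 'B', 'C'])

W = Window.partitionBy("A").orderBy(F.col("C"))

# count(...) over the default growing range frame -> running count per partition
# dense_rank() - 1 -> zero-based index, with tied order values sharing one value
df.withColumn("cnt", F.count("B").over(W)) \
  .withColumn("Expecting", F.dense_rank().over(W) - 1) \
  .show()
#cnt comes out as 1, 3, 3 while Expecting is 0, 1, 1

dense_rank() is used rather than row_number() because tied dates (rows 6 and 7 in the output above) should receive the same value.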