Group by in pandas API on Spark


I have the pandas DataFrame below:

import pandas as pd

data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)

Here df is a Pandas dataframe.

I am trying to convert this DataFrame to the pandas API on Spark:

import pyspark.pandas as ps
pdf = ps.from_pandas(df)
print(type(pdf))

Now the dataframe type is <class 'pyspark.pandas.frame.DataFrame'>. Next I apply a groupby on pdf as below:

for i,j in pdf.groupby("Team"):
    print(i)
    print(j)

I am getting the error below:

KeyError: (0,)

Does this functionality not work with the pandas API on Spark?


There are 2 best solutions below

Azhar Khan (BEST ANSWER)

The pandas API on Spark (pyspark.pandas) does not implement all pandas functionality as-is, because Spark has a distributed architecture. Operations such as row-wise iteration therefore do not translate directly.

If you want to print the groups, then pyspark pandas code:

pdf.groupby("Team").apply(lambda g: print(f"{g.Team.values[0]}\n{g}"))

is equivalent to pandas code:

for name, sub_grp in df.groupby("Team"):
    print(name)
    print(sub_grp)
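
If the whole dataset is small enough to collect onto the driver, another option (a sketch, assuming you can afford to materialize the data locally) is to convert the pandas-on-Spark DataFrame back to pandas with to_pandas() and iterate the groups there:

# Collects every row to the driver, so use this only for small data
local_df = pdf.to_pandas()

for name, sub_grp in local_df.groupby("Team"):
    print(name)
    print(sub_grp)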

Reference to source code

If you scan the source code, you will find that there is no __iter__() implementation for pyspark pandas: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/pandas/groupby.html

but the iterator yields (group_name, sub_group) for pandas: https://github.com/pandas-dev/pandas/blob/v1.5.1/pandas/core/groupby/groupby.py#L816

Documentation references for iterating groups

pyspark pandas : https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html?highlight=groupby#indexing-iteration

pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups

INGl0R1AM0R1

If you want to see the groups, either use pandas and print the items yielded by the groupby generator, or define your PySpark DataFrame correctly and use its groupBy. With pandas:

for i in df.groupby("Team"):
    print(i)

Or

for i in pdf.groupBy("Team"):
    print(i)
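
For completeness, a minimal sketch of the second variant, assuming pdf here refers to a plain PySpark DataFrame rather than a pandas-on-Spark one. The GroupedData object returned by groupBy() is not iterable, so the groups are inspected by aggregating or by filtering on the key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)  # plain PySpark DataFrame built from the pandas df above

# groupBy() returns GroupedData, which cannot be looped over directly;
# aggregate it to get one row per group
sdf.groupBy("Team").count().show()

# To look at the rows of a single group, filter on the key
sdf.filter(sdf.Team == "Riders").show()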