Counting nested categories in pandas/python

72 Views Asked by At

Aim: Get a summary table of counts of categories that are nested within another category.

Example: I have this dataframe:

# initialize data of lists.
data = {'Name': ['Tom', 'Tom', 'Tom', 'jack', 'jack', 'Mary', 'Mary', 'Mary', 'Jim', 'Jim'],
        'CEFR_level': ['A1', 'A2', 'PreA1', 'A2', 'A1','A1','B1','C1', 'A1', 'B1']}
 
# Create DataFrame
df = pd.DataFrame(data)

# I use this code to recode:

Easy = ["PreA1","A1", "A2"]
Medium =["B1", "B2"]
Hard = ["C1", "C2"]

data ['CEFR_categories'] = np.where(data['CEFR_level'].isin(Easy), 'Easy',
                                               np.where(data['CEFR_level'].isin(Medium), 'Medium',
                                                        np.where(data ['CEFR_level'].isin(Hard), 'Hard', 'Other')))


I managed to make the column: data ['CEFR_categories'] and categorize properly the Easy, Medium and Hard.

My problem is with the groupby.

Task: DONE I want to recode the X, Y and Z into easy, medium and hard.

I then I want to groupby combining categories. For example, the new easy category, repeats 2 times (Tom (who has in the CEFR_level 'A1', 'A2', 'PreA1') and Jack (who has A1 and A2 in the CEFR level)). Easy-Medium-Hard (repeats 1 time for Mary (which a different combination of CEFR_level which is then recoded to easy, medium and hard)) and Easy-Medium repeats 1 time for Jim.

I have for hours tried to recode this, I can recode into another column but I only have 1 category (for example) easy for the first row. (With the code above)

My output should look like this:

enter image description here

How can i group by this.

Thanks for your help

EDIT and UPDATE

I used @Timeless answer I got the following output:

enter image description here

Any suggestions? The first 4 lines in my real data has a cat1 of: easy, easy, easy, medium. Which would result in a Easy-Medium.

But the ouput says there is none.

Final answer

This code by timeless also works.

cats = sorted(testlet_item_bank["CEFR_categories"].unique()) 
#status = dict(zip(cats, ["Easy", "Medium", "Hard"])) # this was mixing categories

ps = list(map("-".join, powerset(cats)))[1:]

out = (
      testlet_item_bank # the first chain can be optional
      .astype({"CEFR_categories": pd.CategoricalDtype(cats, ordered=True)})
      .groupby("TestletID")["CEFR_categories"]
      .agg(lambda x: "-".join(pd.unique(x.sort_values())))
      .value_counts()
      .reindex(ps, fill_value=0)
      .rename_axis("Categories")
      .reset_index(name="Counts")
#    .replace(status, regex=True) this mixes categories
)

2

There are 2 best solutions below

6
Timeless On BEST ANSWER

With a value_counts/powerset :

from more_itertools import powerset

mapper = {
    "Easy": ["PreA1", "A1", "A2"],
    "Medium": ["B1", "B2"],
    "Hard": ["C1", "C2"],
}

status = {v: k for k,lst_v in mapper.items() for v in lst_v}

df["CEFR_level"] = (
    df["CEFR_level"].map(status).fillna("Other")
    .astype(pd.CategoricalDtype(list(mapper) + ["Other"], ordered=True))
)

ps = list(map("-".join, powerset(mapper)))[1:]

out = (
    df # the first chain can be optional
    .groupby("Name")["CEFR_level"]
    .agg(lambda x: "-".join(pd.unique(x.sort_values())))
    .value_counts()
    .reindex(ps, fill_value=0)
    .rename_axis("Categories")
    .reset_index(name="Counts")
)

NB : If you can't install more_itertools, you can use this recipe from the docs.

Output :

         Categories  Counts
0              Easy       2
1            Medium       0
2              Hard       0
3       Easy-Medium       1
4         Easy-Hard       0
5       Medium-Hard       0
6  Easy-Medium-Hard       1

[7 rows x 2 columns]
1
esqew On

This can be done in a relatively low complexity way, but be forewarned the performance of this may leave something to be desired at scale:

import pandas as pd

data = {'Name': ['Tom', 'Tom', 'Tom', 'jack', 'jack', 'Mary', 'Mary', 'Mary', 'Jim', 'Jim'],
        'Cat1': ['X', 'X', 'X', 'X', 'X','X','Y','Z', 'X', 'Y']}
df = pd.DataFrame(data)

# First, sort by name then `Cat1` value to maintain the eventual ordering of `Easy`/`Medium`/`Hard`
df.sort_values(['Name', 'Cat1'], ignore_index=True, axis=0)

# De-dupe rows
df = df.drop_duplicates()

# Map X, Y, Z to Easy, Medium, Hard
df['Cat1'] = df['Cat1'].replace(['X', 'Y', 'Z'], ['Easy', 'Medium', 'Hard'])

# Roll up levels grouped by unique Name value
df = df.groupby('Name').agg({'Cat1': '-'.join})

# Rename Cat1 column to 'counts' to match spec
df = df.rename(columns={"Cat1": "counts"})

# Get value_counts() of resulting `counts` column
return_value = df.value_counts('counts')

Result:

counts
Easy                2
Easy-Medium         1
Easy-Medium-Hard    1
Name: count, dtype: int64

The result here doesn't include 0 quantity category combinations, but it'd be trivial to work that aspect in if you absolutely require it.