Can I use np.where to update values on the run based on previous rows?


Is there any way I can implement an on-the-run update like that on a dataframe?

I have an imaginary dataframe "Classroom" as follows (an imaginary setup to illustrate the issue):

Student  Subject  Mark  New Sub.
Mike     English  pass  English
Mike     French   pass  None
Mike     History  pass  None
Mike     Bio      Fail  None
...      ...      ...   None

I want to fill the "New Sub." column such that, for every student, it checks each subject's mark row by row: if the mark is "pass", the subject gets appended to the running list in "New Sub."; if it is "Fail", the subject shouldn't be added. The desired view is:

Student  Subject  Mark  New Sub.
Mike     English  pass  English
Mike     French   pass  English,French
Mike     History  pass  English,French,History
Mike     Bio      Fail  English,French,History
...      ...      ...   None

I tried implementing that using np.where like this:

Classroom["New Sub."] = np.where(Classroom["Mark."] == "pass", Classroom["New Sub."].shift(1) + "," + Classroom["Subject"], Classroom["New Sub."])

The issue is that the "New Sub." column doesn't get updated on the run while np.where executes, so what I get is the following:

Student  Subject  Mark  New Sub.
Mike     English  pass  English
Mike     French   pass  English,French
Mike     History  pass  None,History
Mike     Bio      Fail  None
...      ...      ...   None

It's as if it reads the old values of "New Sub." rather than the values as they stand after each row's modification.
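A minimal demonstration of why this happens (using a hypothetical three-row series): shift(1) is evaluated once against the column as it currently is, so later rows never see the values that the same np.where call writes to earlier rows.

```python
import numpy as np
import pandas as pd

# Hypothetical column: only the first value is filled in, like "New Sub."
s = pd.Series(["English", None, None])

# shift(1) is computed from s in one vectorized pass, before any
# assignment happens: row 2 sees "English", but row 3 sees NaN,
# not whatever row 2 is about to become.
result = np.where([True, True, True], s.shift(1).fillna("?") + "+next", s)
print(result)  # ['?+next' 'English+next' '?+next']
```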

Is there any way I can implement an on-the-run update like that on a dataframe?

There are 2 best solutions below

Namith Nimlaka

The issue with your approach is that np.where isn't designed for operations like this, where each row's result depends on values computed for previous rows in the same call. You're using shift(1) inside np.where, but shift reads the column's existing values, so it won't work as expected.

Instead, you can use a combination of groupby and apply in pandas:

import pandas as pd

# Example DataFrame
data = {
    'Student': ['Mike', 'Mike', 'Mike', 'Mike'],
    'Subject': ['English', 'French', 'History', 'Bio'],
    'Mark': ['pass', 'pass', 'pass', 'fail']}
df = pd.DataFrame(data)

def update_new_sub(group):
    # Join every passed subject for this student into one string.
    pass_subjects = group[group['Mark'] == 'pass']['Subject']
    group['New Sub.'] = ','.join(pass_subjects)
    return group

# group_keys=False keeps the original index so the column aligns on assignment
df['New Sub.'] = df.groupby('Student', group_keys=False).apply(update_new_sub)['New Sub.']
print(df)

e-motta

np.where and df.shift are vectorized operations, so they will not process each row sequentially.

You could in theory do what you want with an iterative approach like df.iterrows, but that's hardly ever efficient when dealing with dataframes.
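For completeness, the iterative version would look something like this (a sketch that reproduces the running-list output from the question, but it loops in Python and will be slow on large frames):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Student": ["Mike", "Mike", "Mike", "Mike"],
        "Subject": ["English", "French", "History", "Bio"],
        "Mark": ["pass", "pass", "pass", "Fail"],
    }
)

passed = {}   # running list of passed subjects, per student
new_sub = []
for _, row in df.iterrows():
    subjects = passed.setdefault(row["Student"], [])
    if row["Mark"] == "pass":
        subjects.append(row["Subject"])
    new_sub.append(",".join(subjects))

df["New Sub."] = new_sub
print(df)
```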

What you can do instead is use a vectorized approach to compute all the values at once, then merge them into the original dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "Student": ["Mike", "Mike", "Mike", "Mike", "Bob"],
        "Subject": ["English", "French", "History", "Bio", "English"],
        "Mark": ["pass", "pass", "pass", "Fail", "pass"],
    }
)

new_subs = (
    df.loc[df["Mark"] == "pass", :]
    .groupby("Student")
    .apply(lambda row: ",".join(row["Subject"]))
    .reset_index(name="New Sub.")
)

df = df.merge(new_subs, on="Student", how="left")

Resulting in:

  Student  Subject  Mark                New Sub.
0    Mike  English  pass  English,French,History
1    Mike   French  pass  English,French,History
2    Mike  History  pass  English,French,History
3    Mike      Bio  Fail  English,French,History
4     Bob  English  pass                 English