Data correction in Pandas / Python

68 Views Asked by At

I need to update Data of a subset of my df with values calculated out of a subset of my dataframe.

PERSNR XYZ DATE VALUE
22222 a Jan 0,8
22222 b Jan 0,2
22222 a Feb 0,8
22222 b Feb 0,2

i have got a Dataframe with like 8000 PERSNR that looks like this, just with many more columns

For a subset, which i already know (i got a list of PERSNR to filter that) i need to add the Values together, ignoring XYZ

i Can do that with pivot_table and can get another df out of it.

The problem i have got is, that i need to update the values afterwards in my original DF.

My goal is to obtain the following without loosing any other data of my DF.

PERSNR XYZ DATE VALUE
22222 a Jan 1,0
22222 b Jan 0,0

in Excel i would do a sumifs on my Value Column on my PERSNR, XYZ and DATE.

and then i would replace my values with vlookup in the Value column and overwrite Values with identifier XYZ "b" with zero. This will always be zero.

Any suggestions on how to do that?

thanks in advance

Mike

1

There are 1 best solutions below

2
e-motta On

Assuming a data frame like this:

data = {
    "PERSNR": [22222, 22222, 22222, 22222, 55555, 55555],
    "XYZ": ["a", "b", "a", "b", "a", "b"],
    "DATE": ["Jan", "Jan", "Feb", "Feb", "Jan", "Jan"],
    "VALUE": [0.8, 0.2, 0.8, 0.2, 0.8, 0.2],
}
   PERSNR XYZ DATE  VALUE
0   22222   a  Jan    0.8
1   22222   b  Jan    0.2
2   22222   a  Feb    0.8
3   22222   b  Feb    0.2
4   55555   a  Jan    0.8
5   55555   b  Jan    0.2
  1. Merge two data frames: (i) the original minus column VALUE and (ii) a data frame where you group on PERSNR (assuming there will be more than one value for this; otherwise, this column isn't necessary) and DATE, then sum the VALUE in each group and reset the index.

  2. Set column VALUE with 0 where df["XYZ"] == "b".

  3. Use .loc to do the previous steps only for the selected PERSNR values.

selected_persnr = [22222]  # add all selected values here

df.loc[df["PERSNR"].isin(selected_persnr)] = pd.merge(
    df.drop(columns="VALUE"),
    df.groupby(["PERSNR", "DATE"])["VALUE"].sum().reset_index(),
)
df.loc[(df["PERSNR"].isin(selected_persnr)) & (df["XYZ"] == "b"), "VALUE"] = 0
   PERSNR XYZ DATE  VALUE
0   22222   a  Jan    1.0
1   22222   b  Jan    0.0
2   22222   a  Feb    1.0
3   22222   b  Feb    0.0
4   55555   a  Jan    0.8
5   55555   b  Jan    0.2