Replacing values in pands df (not Nan) with the column sum except first two cols

54 Views Asked by At

Update: I mixed up the axis. Thats why my question seems weired. This is the new input data:

give following data:

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave','Sara','Liz'],
'A': ['NaN', 1, 1, 1, 'NaN', 'NaN'],
'B': [1, 1, 1, 1, 'NaN', 1],
'C': [1, 1, 1, 1, 1, 1],
'D': ['NaN', 'NaN', 1, 'NaN', 1, 'NaN'],
'E': [1, 1, 1, 1, 'NaN', 1],
'F': ['NaN', 1, 1, 1, 'NaN', 1]}

df = pd.DataFrame(data)

I want to sum the columns except the first two and then replace the values not NaN with the sum the column results:

the result should like this:

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave','Sara','Liz'],
'A': ['NaN', 1, 1, 1, 'NaN', 'NaN'],
'B': [5, 5, 5, 5, 'NaN', 5],
'C': [6, 6, 6, 6, 6, 6],
'D': ['NaN', 'NaN', 2, 'NaN', 2, 'NaN'],
'E': [5, 5, 5, 5, 'Nan', 5],
'F': ['NaN',4, 4, 4, 'NaN', 4]}

df = pd.DataFrame(data)

I tried:

column_sums = df.iloc[:, 2:].sum()
for column in iloc[:, 2:].columns:
     df[column] = column_sums[column]

But that replaces me all values.

Is there a smooth solution for that possible?

Thanks

3

There are 3 best solutions below

4
mozway On BEST ANSWER

Build as mask, sum and modify in place after broadcasting the sum:

import pandas as pd
import numpy as np

# read data, ensure NaNs are valid (not strings)
df = pd.DataFrame(data).replace('NaN', float('nan'))

# number of initial columns to skip
N = 2

# build boolean mask
m = df.notna()
m.iloc[:, :N] = False

# sum, broadcast values, replace in dataframe
df[m] = np.repeat(m.sum(0).reindex(df.columns).to_numpy()[None],
                  df.shape[0], axis=0)

Output:

     Org    A    B  C    D  E    F
0    Tom  NaN  5.0  6  NaN  6  NaN
1  Kelly  1.0  5.0  6  NaN  6  4.0
2   Rick  1.0  5.0  6  2.0  6  4.0
3   Dave  1.0  5.0  6  NaN  6  4.0
4   Sara  NaN  NaN  6  2.0  6  NaN
5    Liz  NaN  5.0  6  NaN  6  4.0

Intermediate m:

     Org      A      B     C      D     E      F
0  False  False   True  True  False  True  False
1  False  False   True  True  False  True   True
2  False  False   True  True   True  True   True
3  False  False   True  True  False  True   True
4  False  False  False  True   True  True  False
5  False  False   True  True  False  True   True
1
jezrael On

Use DataFrame.update with select all columns without first N columns in DataFrame.mask:

import numpy as np

#If NaNs are strings convert them to missing values NaNs
df = df.replace('NaN', np.nan)

N = 2
df.update(df.iloc[:, N:].mask(df.iloc[:,N:].notna(), 
          df.iloc[:, N:].sum(numeric_only=True, axis=1), axis=0))
print (df)
  Org  Tom  Kelly  Rick  Dave  Sara  Liz
0   A  NaN    4.0     4   4.0     4  NaN
1   B  1.0    4.0     4   NaN     4  4.0
2   C  1.0    5.0     5   5.0     5  5.0
3   D  1.0    4.0     4   NaN     4  4.0
4   E  NaN    NaN     2   2.0     2  NaN
5   F  NaN    4.0     4   NaN     4  4.0
1
Panda Kim On

Code

df = pd.DataFrame(data)
df[df.columns[2:]] = (
    df.iloc[:, 2:]
    .replace({'NaN': float('nan')})
    .pipe(lambda x: x.mul(x.sum()))
)

df

     Org    A    B    C    D    E    F
0    Tom  NaN  5.0  6.0  NaN  5.0  NaN
1  Kelly    1  5.0  6.0  NaN  5.0  4.0
2   Rick    1  5.0  6.0  2.0  5.0  4.0
3   Dave    1  5.0  6.0  NaN  5.0  4.0
4   Sara  NaN  NaN  6.0  2.0  NaN  NaN
5    Liz  NaN  5.0  6.0  NaN  5.0  4.0