Replace dataframe values with labels from a second dataframe


I have a dataframe like this (simplified example)

DF1

id q1 q2_1  q2_2 
1  1  1     1  
2  1  2     2
3  2  3     3
4  2  3     3
5  1  1     3

and a dataframe with corresponding values like this

DF2

var  num  val
q1   1    male
q1   2    female
q2   1    one
q2   2    two
q2   3    three

What I would like is to replace the values in DF1 with the corresponding labels from DF2 to get DF3

DF3

id q1      q2_1    q2_2 
1  male    one     one  
2  male    two     two
3  female  three   three
4  female  three   three
5  male    one     three

There are 4 best solutions below

0
Suneesh Jacob On
from pandas import DataFrame as df

DF1=df({'id': [1, 2, 3, 4, 5], 'q1': [1, 1, 2, 2, 1], 'q2_1': [1, 2, 3, 3, 1], 'q2_2': [1, 2, 3, 3, 3]})
DF2=df({'var': ['q1', 'q1', 'q2', 'q2', 'q2'], 'num': [1, 2, 1, 2, 3], 'val': ['male', 'female', 'one', 'two', 'three']})

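# Take an explicit copy: DF1[:] is only a slice, so assigning columns to it can trigger SettingWithCopyWarning
DF3 = DF1.copy()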
mapping_df_q1 = DF2[DF2['var']=='q1'][['num','val']]
mapping_dictionary_q1 = dict(zip(mapping_df_q1['num'],mapping_df_q1['val']))
DF3['q1'] = DF3['q1'].map(mapping_dictionary_q1)

mapping_df_q2 = DF2[DF2['var']=='q2'][['num','val']]
mapping_dictionary_q2 = dict(zip(mapping_df_q2['num'],mapping_df_q2['val']))
DF3['q2_1'] = DF3['q2_1'].map(mapping_dictionary_q2)
DF3['q2_2'] = DF3['q2_2'].map(mapping_dictionary_q2)
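
The same idea can be written once for all columns instead of once per question. This is only a minimal sketch: it assumes each DF1 column is named either exactly like the matching `var` entry or like `var` plus an `_n` suffix (as `q2_1` and `q2_2` are here), and the helper name `label_columns` is made up for illustration.

```python
import pandas as pd

DF1 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q1': [1, 1, 2, 2, 1],
                    'q2_1': [1, 2, 3, 3, 1], 'q2_2': [1, 2, 3, 3, 3]})
DF2 = pd.DataFrame({'var': ['q1', 'q1', 'q2', 'q2', 'q2'],
                    'num': [1, 2, 1, 2, 3],
                    'val': ['male', 'female', 'one', 'two', 'three']})

def label_columns(values, labels):
    """Hypothetical helper: return a copy of `values` with codes replaced by labels."""
    # One {num: val} dictionary per question,
    # e.g. {'q1': {1: 'male', 2: 'female'}, 'q2': {...}}
    lookup = {var: dict(zip(g['num'], g['val']))
              for var, g in labels.groupby('var')}
    out = values.copy()
    for col in out.columns:
        var = col.split('_')[0]  # assumption: 'q2_1' belongs to question 'q2'
        if var in lookup:        # columns such as 'id' are left untouched
            out[col] = out[col].map(lookup[var])
    return out

DF3 = label_columns(DF1, DF2)
print(DF3)
```

If the column names do not follow that naming pattern, an explicit column-to-question dictionary would be needed in place of the `split('_')` step.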
0
Joe Ferndz On

One way to do this is to convert the df2 values into dictionaries, one per question, and use map to replace the df1 values. Here's how I did it.

import pandas as pd
df1 = pd.DataFrame({'q1':[1,1,2,2,1],'q2_1':[1,2,3,3,1],'q2_2':[1,2,3,3,3]})
print (df1)
df2 = pd.DataFrame({'var':['q1','q1','q2','q2','q2'],
                    'num':[1,2,1,2,3],
                    'val':['male','female','one','two','three']})

print (df2)

#create a slice of q1 and q2
q1 = df2.loc[df2['var'] == 'q1'][['num','val']]
q2 = df2.loc[df2['var'] == 'q2'][['num','val']]

#convert q1 and q2 to dicts
x = dict(zip(q1.num,q1.val))
y = dict(zip(q2.num,q2.val))

#use map to convert value using x and y
df1['q1'] = df1['q1'].map(x)
df1['q2_1'] = df1['q2_1'].map(y)
df1['q2_2'] = df1['q2_2'].map(y)
print (df1)

df1:

   q1  q2_1  q2_2
0   1     1     1
1   1     2     2
2   2     3     3
3   2     3     3
4   1     1     3

df2:

  var  num     val
0  q1    1    male
1  q1    2  female
2  q2    1     one
3  q2    2     two
4  q2    3   three

df1 after update:

       q1   q2_1   q2_2
0    male    one    one
1    male    two    two
2  female  three  three
3  female  three  three
4    male    one  three
0
smci On

So df1 = encoded categorical values, and df2 = data dictionary, i.e. the label corresponding to each categorical level, for each question. (Personally I strongly recommend you rename them dv and dd to avoid severe confusion, otherwise the numbers '1' and '2' have three different meanings, but anyway).

So for each of the individual data columns, where col is 'q1', 'q2_1' or 'q2_2', we can do any of:

  1. use df1[col].map(...), passing it a dictionary for that question where keys = categorical levels, values = categorical labels.
  2. use df1[col].replace(...), passing it the subset of df2 corresponding to that question, indexed by 'num'. (This is simpler than using .loc[])
  3. A merge/join approach: left-join with pd.merge(...)
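
One practical difference between options 1 and 2 is what happens to a code that has no entry in the data dictionary. The sketch below uses a made-up code `9` (not part of the question's data) to show it: `.map()` turns the unmatched value into NaN, while `.replace()` leaves it as it was.

```python
import pandas as pd

codes = pd.Series([1, 2, 9], name='q1')  # 9 is an invented code with no label
labels = {1: 'male', 2: 'female'}

mapped = codes.map(labels)        # unmatched code becomes NaN
replaced = codes.replace(labels)  # unmatched code is kept unchanged

print(mapped)
print(replaced)
```

Which behaviour is preferable depends on whether unlabelled codes should stand out as missing or pass through untouched.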

Option 2, the .replace() approach, is simplest. So for example, we'll be using the following Series to map the responses in df1['q1'] to their 'q1' labels:

>>> df2.query("var == 'q1'").set_index('num')['val']
num
1      male
2    female

Looking up a single code in this Series by its label works nice and cleanly:

>>> df2.query("var == 'q1'").set_index('num')['val'] [2]
'female'

But .replace() is simpler than looking up each value individually, because it handles the whole column at once:

>>> df1['q1'].replace( df2.query("var == 'q1'").set_index('num') ['val'] )
0      male
1      male
2    female
3    female
4      male

Then we can pd.concat([...], axis=1) the three subresults:

dout = [df1['q1'].replace(df2.query("var == 'q1'").set_index('num') ['val']),
        df1['q2_1'].replace(df2.query("var == 'q2'").set_index('num') ['val']),
        df1['q2_2'].replace(df2.query("var == 'q2'").set_index('num') ['val']) ]

pd.concat(dout, axis=1)

       q1   q2_1   q2_2
0    male    one    one
1    male    two    two
2  female  three  three
3  female  three  three
4    male    one  three

Finally you can fix up the index to the original 'id' column:

>>> pd.concat(dout, axis=1).set_index(df1['id'])
        q1   q2_1   q2_2
id                      
1     male    one    one
2     male    two    two
3   female  three  three
4   female  three  three
5     male    one  three

(PS you might want to do df1.set_index('id') right from the beginning. It makes things cleaner.)
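
As a sketch of that suggestion combined with the .replace() approach: `DataFrame.replace` also accepts a nested dictionary of the form `{column: {old: new}}`, so all three columns can be handled in one call. The `col_to_var` dictionary below is an assumption written by hand for this example; it is not derived from df2.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q1': [1, 1, 2, 2, 1],
                    'q2_1': [1, 2, 3, 3, 1],
                    'q2_2': [1, 2, 3, 3, 3]}).set_index('id')
df2 = pd.DataFrame({'var': ['q1', 'q1', 'q2', 'q2', 'q2'],
                    'num': [1, 2, 1, 2, 3],
                    'val': ['male', 'female', 'one', 'two', 'three']})

# {question: {code: label}} built from the data dictionary
per_var = {var: dict(zip(g['num'], g['val'])) for var, g in df2.groupby('var')}

# Assumption: which question each data column belongs to (hand-written)
col_to_var = {'q1': 'q1', 'q2_1': 'q2', 'q2_2': 'q2'}

# {column: {code: label}} is the nested form DataFrame.replace understands
nested = {col: per_var[var] for col, var in col_to_var.items()}
labelled = df1.replace(nested)
print(labelled)
```

Because 'id' is already the index, it is carried through and no set_index step is needed at the end.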

  3. The merge/join approach looks like this:

>>> pd.merge( df1['q1'], df2.query("var == 'q1'").set_index('num') ['val'], left_on='q1', right_index=True, how='left', sort=False)

   q1     val
0   1    male
1   1    male
2   2  female
3   2  female
4   1    male

This was more annoying than the .replace() approach because I couldn't figure out how to easily prevent the 'q1' being passed to the output of the join.
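
One possible way around that, offered as a sketch rather than the definitive fix: keep only the 'val' column of the merged result and rename it back to 'q1'. The variable names `labels_q1`, `merged` and `q1_labelled` are just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q1': [1, 1, 2, 2, 1]})
df2 = pd.DataFrame({'var': ['q1', 'q1', 'q2', 'q2', 'q2'],
                    'num': [1, 2, 1, 2, 3],
                    'val': ['male', 'female', 'one', 'two', 'three']})

# Labels for question q1, indexed by their numeric code
labels_q1 = df2.query("var == 'q1'").set_index('num')['val']

# Left join keeps every row of df1; the merged frame still holds both 'q1' and 'val'
merged = pd.merge(df1[['q1']], labels_q1,
                  left_on='q1', right_index=True, how='left')

# Keep only the label column and give it the original column name
q1_labelled = merged['val'].rename('q1')
print(q1_labelled)
```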

0
r-beginners On

I think the most important part of this answer is building a dictionary for each categorical variable from df2. I set the num column as the index, grouped by var so that each group is collected into a dict, and then applied those dicts with map.

# dict create
df2.set_index('num', inplace=True)
df2 = df2.groupby('var').agg(dict)

df2
        val
var     
q1  {1: 'male', 2: 'female'}
q2  {1: 'one', 2: 'two', 3: 'three'}

# df update (look each dict up by its 'var' label rather than by position)
df1['q1'] = df1['q1'].map(df2['val']['q1'])
df1['q2_1'] = df1['q2_1'].map(df2['val']['q2'])
df1['q2_2'] = df1['q2_2'].map(df2['val']['q2'])