How to populate a pandas df based on conditions from another df?

61 Views Asked by At

I have 2 tables in consideration:

Table 1:

Country Age Marks
A 25 7
B 45 8

Table 2:

Age Band From Age Band To Country A Country B
20 30
40 50

The output I want to have:

Age Band From Age Band To Country A Country B
20 30 7 NaN
40 50 NaN 8

Here's what I tried:

for index,row in table2.iterrows():
    table2.loc[index,'Country A'] = table1[(table1['Country']=='A')&
                                           (table1['Age']>=row[0])&
                                           (table1['Age']<=row[1])]['Marks'].values[0]

But this gives the following error: index 0 is out of bounds for axis 0 with size 0

I think I've probably guessed where the error occurs: Whenever the compiler encounters a Age Band from Table2 for which a corresponding Age doesn't exist in Table1.

Any help on this problem is much appreciated! Thank You in advance...

2

There are 2 best solutions below

0
mozway On BEST ANSWER

There is no efficient builtin solution in pandas to perform this operation in a generic way. A cross-merge will work on small dataframes but will have poor to terrible efficiency on large datasets. In fact as it has a quadratic complexity, this will even crash python for anything above a few thousand rows.

A robust option is to pivot then perform a conditional_join using pyjanitor:

# pip install pyjanitor
import janitor

out = (table2[['Age Band From', 'Age Band To']]
       .conditional_join(table1.pivot(index='Age', columns='Country', values='Marks')
                               .add_prefix('Country ').reset_index(),
                         ('Age Band From', 'Age', '<='),
                         ('Age Band To', 'Age', '>='),
                         how='left'
                        )
      )

Output:

   Age Band From  Age Band To  Age  Country A  Country B
0             20           30   25        7.0        NaN
1             40           50   45        NaN        8.0

Comparison of efficiencies

Using a length N for both tables with random data (and deduplication):

enter image description here

Alternatively, if you only have a one to one mapping (i.e. no overlapping intervals, only one match in table1 per row of table2), you can use merge_asof for a pure pandas solution:

tmp = (table1.pivot(index='Age', columns='Country', values='Marks')
             .add_prefix('Country ').reset_index()
             .sort_values(by='Age')
      )

         # merge on left boundary
out = (pd.merge_asof(table2[['Age Band From', 'Age Band To']].reset_index()
                     .sort_values(by='Age Band From'),
                     tmp, direction='forward',
                     left_on='Age Band From', right_on='Age')
         .set_index('index').reindex(table2.index)
         # hide based on right boundary
         .where(lambda d: d.pop('Age').le(d['Age Band To']))
         # restore Band data if needed
         .combine_first(table2)
      )

Output:

   Age Band From  Age Band To  Country A  Country B
0             20           30        7.0        NaN
1             40           50        NaN        8.0
0
jezrael On

Use DataFrame.merge with pivoted df1 by DataFrame.pivot and filter values by Series.between:

out = (df2[['Age Band From','Age Band To']]
            .merge(df1.pivot(index='Age', columns='Country', values='Marks')
                       .add_prefix('Country ').reset_index(), how='cross'))

out = out[out['Age'].between(out['Age Band From'], out['Age Band To'])]
print (out)
   Age Band From  Age Band To  Age  Country A  Country B
0             20           30   25        7.0        NaN
3             40           50   45        NaN        8.0

For more general solution is add DataFrame.join:

print (df2)
   Age Band From  Age Band To  Country A  Country B
0             20           30        NaN        NaN
1              4            5        NaN        NaN
2             20           35        NaN        NaN
3             40           50        NaN        NaN

cols = ['Age Band From','Age Band To']

out = (df2.reset_index()[cols + ['index']]
            .merge(df1.pivot(index='Age', columns='Country', values='Marks')
                       .add_prefix('Country ').reset_index(), how='cross'))

out = df2[cols].join(out[out['Age'].between(out['Age Band From'], out['Age Band To'])]
                       .set_index('index').rename_axis(None).filter(like='Country'))
print (out)
   Age Band From  Age Band To  Country A  Country B
0             20           30        7.0        NaN
1              4            5        NaN        NaN
2             20           35        7.0        NaN
3             40           50        NaN        8.0