Convert SQL query that uses CASE expression as a JOIN key to pandas

130 Views Asked by At
SELECT a.year, a.country, b.amount
FROM table_a a
LEFT JOIN table_b b
ON a.country=b.country
AND (CASE WHEN b.country = 'Europe' THEN b.year = 2022 ELSE b.year = 2023 END)

I am trying to transform this code into python. What I have already tried is using pd.merge() for the left join but unsure how to proceed with using case when in join condition of pandas merge.

Given the following input:

table_a = pd.DataFrame({
    'country': ['Europe', 'Europe', 'USA', 'Africa'],
    'year': [2022, 2020, 2023, 2021]
})
table_b = pd.DataFrame({
    'country': ['Europe', 'USA', 'Africa', 'USA', 'Europe'],
    'year': [2023, 2022, 2022, 2023, 2022],
    'amount': [10, 20, 30, 40, 50]
})

the output should be:

  country  year  amount
0  Europe  2022    50.0
1  Europe  2020    50.0
2     USA  2023    40.0
3  Africa  2021     NaN

How do I make it work?

2

There are 2 best solutions below

2
cottontail On BEST ANSWER

Since the condition is used to filter table_b before merging, we can go ahead and filter table_b first using the condition and merge later on country column.

output = (
    table_a.merge(
        table_b.query("(country == 'Europe' and year == 2022) or (country != 'Europe' and year == 2023)"), 
        on=['country'], how='left', suffixes=('', '_'))
    [['year', 'country', 'amount']]
)

Using Nick's example, the above code produces the following output:

  country  year  amount
0  Europe  2022    50.0
1     USA  2023    40.0
2  Africa  2021     NaN
4
Nick On

You don't need to include that condition in the merge, just use it to filter b before the merge. Something like this:

mask = (df_b['country'] == 'Europe') & (df_b['year'] == 2022) | \
   (df_b['country'] != 'Europe') & (df_b['year'] == 2023)
out = df_a.merge(df_b[mask], on=['country'], how='left', suffixes=['', '_y']).drop(columns='year_y')

For example, with this sample data:

df_a = pd.DataFrame({
    'country': ['Europe', 'Europe', 'USA', 'Africa'],
    'year': [2022, 2020, 2023, 2021]
})
df_b = pd.DataFrame({
    'country': ['Europe', 'USA', 'Africa', 'USA', 'Europe'],
    'year': [2023, 2022, 2022, 2023, 2022],
    'amount': [10, 20, 30, 40, 50]
})

The output would be:

  country  year  amount
0  Europe  2022    50.0
1  Europe  2020    50.0
2     USA  2023    40.0
3  Africa  2021     NaN