Issues with duplicates when merging multiple data frames with pandas

Question

Issues with duplicates when merging multiple data frames with pandas

49 Views Asked by Q.Ask At 14 March 2024 at 16:00

I am trying to make a data frame which shows various economic factors per country and over the a time period as follows:

Country	Economic factor	1980	1981
Argentina	GDP	5000	5400
Argentina	Unemployment Rate	6.7	6.6
Argentina	Inflation Rate	6.7	6.6
Albania	GDP	4000	300
Australia	GDP	4000	300
Australia	Unemployment Rate	4000	300

So for each country, I would like multiple factors (say gdp, unemployment rate and inflation rate), and I would like that over different years.

I have an excel sheet for the GDP per country over the years (which has the above format, but only GDP in the economic factor column), one for unemployment rate (also the same format), and one for inflation rate.

Now, I want to merge them together to get the above format. Note that I want them outer merged so that some countries that don't have a value for inflation but have one for GDP are still in the merged data frame (like Albania above).

I tried the usual merging functions, but that didn't seem to work, so I came up with the following code to merge the GDP data and the unemployment rate data:

gdp = pd.read_excel("...xlsx")
unemployment_rates = pd.read_excel("...xlsx")

def add_unemployment_rate(country):
    if country in unemployment_rates['Country'].values:
        country_unemployment = unemployment_rates[unemployment_rates['Country'] == country]
        country_data = pd.concat([gdp[gdp['Country'] == country], country_unemployment], ignore_index=True)
        return country_data
    else:
        return gdp[gdp['Country'] == country]

# Apply the function to each country and concatenate the results
merged_df = pd.concat([add_unemployment_rate(country) for country in gdp['Country']], ignore_index=True)

# Sort the DataFrame by Country and Year
merged_df.sort_values(by=['Country', 'Socio-Economic Factor'], inplace=True)
merged_df['Socio-Economic Factor'] = merged_df['Socio-Economic Factor'].fillna('Unemployment Rate')

This worked, but when I then want to add another economic factor, like inflation, and I use the same method, but now merge the merged_df and the inflation data frame:

inflation = pd.read_excel("...xlsx")

def add_inflation_rate(country):
    if country in inflation['Country'].values:
        country_inflation = inflation[inflation['Country'] == country]
        country_inflation_data = pd.concat([merged_df[merged_df['Country'] == country], country_inflation], ignore_index=True)
        return country_inflation_data
    else:
        return merged_df[merged_df['Country'] == country]

# Apply the function to each country and concatenate the results
merged_df2 = pd.concat([add_inflation_rate(country) for country in merged_df['Country']], ignore_index=True)

# Sort the DataFrame by Country and Year
merged_df2.sort_values(by=['Country', 'Socio-Economic Factor'], inplace=True)
merged_df2['Socio-Economic Factor'] = merged_df2['Socio-Economic Factor'].fillna('Inflation Rate')

This correctly adds my inflation rate below the unemployment rate, but now I get duplicates, so the countries that have an unemployment rate now get duplicate rows as follows:

Country	Economic factor	1980	1981
Argentina	GDP	5000	5400
Argentina	GDP	5000	5400
Argentina	Unemployment Rate	6.7	6.6
Argentina	Unemployment Rate	6.7	6.6
Argentina	Inflation Rate	6.7	6.6
Albania	GDP	4000	300
Australia	GDP	4000	300
Australia	GDP	4000	300
Australia	Unemployment Rate	4000	300
Australia	Unemployment Rate	4000	300

I am not sure what I am doing wrong, or if this is even the best way to do this?

Sorry about the long text!

Original Q&A

There are 1 best solutions below

**Robert Long** · Answer 1 · 2024-03-14T16:20:22.603000

OK, let's try this:

gdp_long = pd.melt(gdp, id_vars=['Country'], var_name='Year', value_name='GDP')

unemployment_long = pd.melt(unemployment_rates, id_vars=['Country'], var_name='Year', value_name='Unemployment Rate')

inflation_long = pd.melt(inflation, id_vars=['Country'], var_name='Year', value_name='Inflation Rate')

Then we should have three long-format DataFrames with correct 'Year' and respective economic factor values. Then we merge these on 'Country' and 'Year':

merged_df = pd.merge(gdp_long, unemployment_long, on=['Country', 'Year'], how='outer')
merged_df = pd.merge(merged_df, inflation_long, on=['Country', 'Year'], how='outer')

It's possible that the pivot is at the heart of the problem, so you could try playing around a little (eg. leaving it out). It pivots the data so that each economic factor is a separate column.

merged_df = pd.pivot_table(merged_df, index=['Country', 'Year'], values=['GDP', 'Unemployment Rate', 'Inflation Rate'])

Issues with duplicates when merging multiple data frames with pandas

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in MERGING-DATA

Trending Questions

Popular # Hahtags

Popular Questions