I am trying to make a data frame which shows various economic factors per country and over the a time period as follows:
| Country | Economic factor | 1980 | 1981 |
|---|---|---|---|
| Argentina | GDP | 5000 | 5400 |
| Argentina | Unemployment Rate | 6.7 | 6.6 |
| Argentina | Inflation Rate | 6.7 | 6.6 |
| Albania | GDP | 4000 | 300 |
| Australia | GDP | 4000 | 300 |
| Australia | Unemployment Rate | 4000 | 300 |
So for each country, I would like multiple factors (say gdp, unemployment rate and inflation rate), and I would like that over different years.
I have an excel sheet for the GDP per country over the years (which has the above format, but only GDP in the economic factor column), one for unemployment rate (also the same format), and one for inflation rate.
Now, I want to merge them together to get the above format. Note that I want them outer merged so that some countries that don't have a value for inflation but have one for GDP are still in the merged data frame (like Albania above).
I tried the usual merging functions, but that didn't seem to work, so I came up with the following code to merge the GDP data and the unemployment rate data:
gdp = pd.read_excel("...xlsx")
unemployment_rates = pd.read_excel("...xlsx")
def add_unemployment_rate(country):
if country in unemployment_rates['Country'].values:
country_unemployment = unemployment_rates[unemployment_rates['Country'] == country]
country_data = pd.concat([gdp[gdp['Country'] == country], country_unemployment], ignore_index=True)
return country_data
else:
return gdp[gdp['Country'] == country]
# Apply the function to each country and concatenate the results
merged_df = pd.concat([add_unemployment_rate(country) for country in gdp['Country']], ignore_index=True)
# Sort the DataFrame by Country and Year
merged_df.sort_values(by=['Country', 'Socio-Economic Factor'], inplace=True)
merged_df['Socio-Economic Factor'] = merged_df['Socio-Economic Factor'].fillna('Unemployment Rate')
This worked, but when I then want to add another economic factor, like inflation, and I use the same method, but now merge the merged_df and the inflation data frame:
inflation = pd.read_excel("...xlsx")
def add_inflation_rate(country):
if country in inflation['Country'].values:
country_inflation = inflation[inflation['Country'] == country]
country_inflation_data = pd.concat([merged_df[merged_df['Country'] == country], country_inflation], ignore_index=True)
return country_inflation_data
else:
return merged_df[merged_df['Country'] == country]
# Apply the function to each country and concatenate the results
merged_df2 = pd.concat([add_inflation_rate(country) for country in merged_df['Country']], ignore_index=True)
# Sort the DataFrame by Country and Year
merged_df2.sort_values(by=['Country', 'Socio-Economic Factor'], inplace=True)
merged_df2['Socio-Economic Factor'] = merged_df2['Socio-Economic Factor'].fillna('Inflation Rate')
This correctly adds my inflation rate below the unemployment rate, but now I get duplicates, so the countries that have an unemployment rate now get duplicate rows as follows:
| Country | Economic factor | 1980 | 1981 |
|---|---|---|---|
| Argentina | GDP | 5000 | 5400 |
| Argentina | GDP | 5000 | 5400 |
| Argentina | Unemployment Rate | 6.7 | 6.6 |
| Argentina | Unemployment Rate | 6.7 | 6.6 |
| Argentina | Inflation Rate | 6.7 | 6.6 |
| Albania | GDP | 4000 | 300 |
| Australia | GDP | 4000 | 300 |
| Australia | GDP | 4000 | 300 |
| Australia | Unemployment Rate | 4000 | 300 |
| Australia | Unemployment Rate | 4000 | 300 |
I am not sure what I am doing wrong, or if this is even the best way to do this?
Sorry about the long text!
OK, let's try this:
Then we should have three long-format DataFrames with correct 'Year' and respective economic factor values. Then we merge these on 'Country' and 'Year':
It's possible that the pivot is at the heart of the problem, so you could try playing around a little (eg. leaving it out). It pivots the data so that each economic factor is a separate column.