I'm trying to create a data vis which is essentially a scattergraph.
A snip of the data looks like this:
| Location | Year | A | B |
|---|---|---|---|
| Afghanistan | 1990 | 48.3973 | 1377.85863 |
| Afghanistan | 2020 | 65.432 | 620.40751 |
| Albania | 1990 | 76.4334 | 32.87494 |
| Albania | 2020 | 79.6756 | 8.27645 |
| Algeria | 1990 | 68.5167 | 205.81979 |
| Algeria | 2020 | 75.9125 | 77.69479 |
| Angola | 1990 | 82.6756 | 1101.50579 |
| Angola | 2020 | 83.1922 | 221.90649 |
| Antigua and Barbuda | 1990 | 73.6497 | 58.0267 |
| Antigua and Barbuda | 2020 | 80.7452 | 21.24795 |
| Argentina | 1990 | 75.9699 | 87.81427 |
| Argentina | 2020 | 81.3155 | 44.90008 |
That is, each country has a pair of points. I'd like to plot A against B, and then ideally have each country's pair of points (in a perfect world) in a different colour (ideally with labels, showing which country they are and which year).
I then would like to plot a regression line of all the 1990 points, and a regression line on the same axes of the 2020 points.
In Excel would be ideal, but I could potentially use something else if it would be more appropriate. Any help would be very gratefully received.
Since you are open to other options, I thought I can answer your question in python, if that's ok. Here's how you can do it with seaborn, a nice data visualization package that has a lot of helpful functions for statistics and ML.
First, to get you started with python, I would recommend anaconda, as their installation already includes a tonne of nice packages which are all compatible with each other.
After getting acquainted, you can import the following packages for the example:
I defined the data without Afghanistan for now, as it seems to be a bad outlier that negatively affects the regression models, I'll show that at the end. For now, here are the results without AFG:
This leads to the following plot:
Of course, if you have the data in a better way, you can use
pd.read_excel()orpd.read_csv()to get the data from saved files.Check the data with Afghanistan here:
As you see, the regression primarily captures the variance found between AFG and the other groups. This can be solves, for example, by using robust regression instead.
I hope this answer encourages you to start learning python and doing this. The advantage would be that there are a lot of resources these days to get started and to even master python. Also, chatGPT can help you get started (you will come back complaining about it anyway ;) )