I have a data frame "DF" of 2020 observations and 79066 variables. The first column is the year spanning continuously from 1 to 2020, the other columns (variables) are numbers.
For reproducibility, I created a fake data frame with 21 years from 2000 to 2020, and only 100 variables. E.g.:
set.seed(123)
i <- 100
DF <- data.frame(year=c(2000:2020),
setNames(
as.data.frame(lapply(1:i, function(k) c(rnorm(21)))),
paste("Var_", 1:i, sep = "")))
I then computed the mean by row
DF$Aver <- apply(DF[, 2:101], 1, mean, na.rm=TRUE)
I then plotted the average as a line and added the points
plot(DF$year, DF$Aver, type="l", col=1, cex=0.5, las=1, xlab="", ylab="", ylim=c(-4, 4))
for (i in 2:101) {
points(DF$year, DF[, i], pch=20, cex=1, col='gray')
}
However, what I would like to have is a scatterplot where the points close to the mean are dark grey and the colour fades to light grey towards the tail values.
You could normalize the variable in each iteration and use it as the alpha value in the `rgb(red, green, blue, alpha)` function, where values are in [0, 1]. Since the smallest values get zero alpha, we simply draw the `points` a second time with a very light gray.

Update
It is not feasible to plot a huge number of points like 159,713,320 in your case. The human eye cannot resolve this, and a nasty surprise could be waiting for us in the copy store.
A common way of solving the issue is to use a smaller random `sample` of the data, with a fraction of the columns in the `size=` argument. (No need to worry about integers; `sample` accepts a non-integer `size`.) It will represent your data adequately.

We might initialize an empty `plot` using `type="n"`, since the line would be overlaid by the `points` anyway.

Next, in the `for` loop for the `points`, we iterate over exactly this subset. Here, with that many points, we could simplify to a single color `'#RRGGBBaa'` defined in hexadecimal digits 0-F, where R=red, G=green, B=blue and a=alpha (opacity). I commented out the approach from above, but you might want to try both.

Finally, using `lines` we draw the average line as the top layer, where we may use the whole dataset.

See also these older questions dealing with plotting of large data: 1, 2, 3.
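The steps above can be sketched roughly as follows, using the fake data from the question. This is only a minimal illustration under stated assumptions, not necessarily the exact code intended here: the 30% sampling fraction, the names `var_cols`, `smp`, `dev` and `alpha`, the linear distance-to-alpha mapping, and the particular gray values are all choices made for the example.

```r
set.seed(123)

# Fake data shaped like the question's: a year column plus 100 variables.
n_var <- 100
DF <- data.frame(year = 2000:2020,
                 setNames(as.data.frame(lapply(seq_len(n_var), function(k) rnorm(21))),
                          paste0("Var_", seq_len(n_var))))
DF$Aver <- rowMeans(DF[, 2:(n_var + 1)], na.rm = TRUE)  # row mean over all variables

# Random subset of the variable columns (30% is an arbitrary, assumed fraction).
var_cols <- 2:(n_var + 1)
smp <- sample(var_cols, size = round(length(var_cols) * 0.3))

# Assumed normalization: opacity is 1 at the row mean and falls linearly
# to 0 at the largest absolute deviation among the sampled values.
dev   <- abs(as.matrix(DF[, smp]) - DF$Aver)
alpha <- 1 - dev / max(dev)

# Empty canvas first, then the points, then the mean line as the top layer.
plot(DF$year, DF$Aver, type = "n", las = 1, xlab = "", ylab = "", ylim = c(-4, 4))
for (j in seq_along(smp)) {
  ## Alpha approach from above, commented out; try it on smaller data:
  # points(DF$year, DF[, smp[j]], pch = 20, col = "gray90")
  # points(DF$year, DF[, smp[j]], pch = 20, col = rgb(0.2, 0.2, 0.2, alpha[, j]))
  ## Single semi-transparent color in '#RRGGBBaa' form:
  points(DF$year, DF[, smp[j]], pch = 20, col = "#40404040")
}
lines(DF$year, DF$Aver, lwd = 2)
```

With a single semi-transparent color, overlapping points add up to darker areas, so dense regions near the mean tend to look darker without computing an alpha per point.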
Data: