How to extrapolate values over dates using R

562 Views Asked by At

I have data for the first 7 months of a year, and would like to linearly extrapolate the data for the last 5 months. I have tried using the approxExtrap function from the Hmisc package but am not sure how to employ it, particularly what to specify for xout. I'm open to any solution (dplyr-based would be ideal if possible). Thank you for your time.

Here is a sample of the data:


df <- tibble(pop = c(18968, 18956, 18946, 18934, 18923, 18912, 18901, NA, NA, NA, NA, NA),
                       date = c("2020-01-01", "2020-02-01", "2020-03-01", 
                                "2020-04-01", "2020-05-01", "2020-06-01", 
                                "2020-07-01", "2020-08-01", "2020-09-01", 
                                "2020-10-01", "2020-11-01", "2020-12-01"))
df$date <- lubridate::as_date(df$date)

3

There are 3 best solutions below

5
G. Grothendieck On BEST ANSWER

Run lm and predict and then use coalesce to combine the known and predicted values.

library(dplyr)
df %>%
  mutate(pop2 = coalesce(pop, predict(lm(pop ~ date), across(date))))

giving the following where pop2 is pop with the NA's filled in with predicted values.

# A tibble: 12 × 3
     pop date         pop2
   <dbl> <date>      <dbl>
 1 18968 2020-01-01 18968 
 2 18956 2020-02-01 18956 
 3 18946 2020-03-01 18946 
 4 18934 2020-04-01 18934 
 5 18923 2020-05-01 18923 
 6 18912 2020-06-01 18912 
 7 18901 2020-07-01 18901 
 8    NA 2020-08-01 18889.
 9    NA 2020-09-01 18878.
10    NA 2020-10-01 18867.
11    NA 2020-11-01 18856.
12    NA 2020-12-01 18845.
1
Rui Barradas On

If you want linear interpolation, lm will create a fitted object and you can predict from it.

na is the new data passed on to predict.lm as argument newdata.

df <- data.frame(pop = c(18968, 18956, 18946, 18934, 18923, 18912, 18901, NA, NA, NA, NA, NA),
                 date = c("2020-01-01", "2020-02-01", "2020-03-01", 
                          "2020-04-01", "2020-05-01", "2020-06-01", 
                          "2020-07-01", "2020-08-01", "2020-09-01", 
                          "2020-10-01", "2020-11-01", "2020-12-01"))
df$date <- lubridate::as_date(df$date)

fit <- lm(pop ~ date, df)
na <- df[is.na(df$pop), "date", drop = FALSE]
newpop <- predict(fit, newdata = na)
na <- cbind(na, pop = newpop)
na
#>          date      pop
#> 8  2020-08-01 18889.45
#> 9  2020-09-01 18878.06
#> 10 2020-10-01 18867.03
#> 11 2020-11-01 18855.64
#> 12 2020-12-01 18844.61

plot(pop ~ date, df, ylim = c(18800, 19000), pch = 19)
points(pop ~ date, na, col = "red", pch = 19)
abline(fit)

Created on 2022-12-20 with reprex v2.0.2

2
Abdur Rohman On

In case you want to use approxExtrap() function, xout is a numeric vector of x values where you want to put the extrapolated y values. In your case, it is the 8th to 12th values of date column.

extrap_8to12 <- approxExtrap(df$date, df$pop, xout = df$date[8:12]) 
df$pop[8:12] <- extrap_8to12$y

df
# A tibble: 12 × 2
     pop date      
   <dbl> <date>    
 1 18968 2020-01-01
 2 18956 2020-02-01
 3 18946 2020-03-01
 4 18934 2020-04-01
 5 18923 2020-05-01
 6 18912 2020-06-01
 7 18901 2020-07-01
 8 18901 2020-08-01
 9 18901 2020-09-01
10 18901 2020-10-01
11 18901 2020-11-01
12 18901 2020-12-01