R: Time-Series Measurements of Multiple Sensors with Missing Observations (Mice? Sensor-Fusion)


This is probably a very common sensor-fusion problem.

I have a set of (temperature) sensors that are all trying to measure the same thing. They may be offset from one another by a constant, with or without noise. Alas, the sensors failed at random times; moreover, some were installed late and some were removed early. A simple example with three sensors would be:

set.seed(42)  # the answer below assumes this seed
x1 <- rnorm(11); x2 <- x1 + 10; x3 <- x2 + 50
x1[1:4] <- NA; x2[3:8] <- NA; x3[7:10] <- NA
x1[11] <- x2[11] <- x3[11] <- NA

This gives me an availability pattern of:

# 12345678901
# ....YYYYYY.
# YY......YY.
# YYYYYY.....
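(For reference, this pattern can be computed directly from the data; a small base-R check, restating the toy data so it runs standalone:)

```r
set.seed(42)
x1 <- rnorm(11); x2 <- x1 + 10; x3 <- x2 + 50
x1[1:4] <- NA; x2[3:8] <- NA; x3[7:10] <- NA
x1[11] <- x2[11] <- x3[11] <- NA

# One Y/. availability string per sensor (Y = observed, . = NA);
# the actual values don't matter for the pattern, only the NA positions.
X <- cbind(x1, x2, x3)
apply(ifelse(is.na(X), ".", "Y"), 2, paste, collapse = "")
# returns c(x1 = "....YYYYYY.", x2 = "YY......YY.", x3 = "YYYYYY.....")
```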

What I must absolutely avoid are the predictable jumps when a sensor comes on-line or goes off-line, so a plain rowMeans over the available sensors is terrible.

In this example without noise, I could even infer the missing values perfectly: perfect collinearity is my friend. Here, I could use obs 1+2 to infer x1 (from x2 and x3), 5+6 to infer x2, and 9+10 to infer x3. Now I have cases with all three sensors, which I can use to fill in even 3+4 and 7+8, where only one sensor was working. Finally, I can take the rowMeans, and I have everything for observations 1:10 and, importantly, an NA at 11. More generally, I would have to try different subsets of variables and observations and then fill in iteratively.
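A minimal base-R sketch of that iterative scheme, assuming purely constant offsets and no noise as in the toy data (the pairwise offset estimator and the stop-on-no-progress rule are my own choices, not an established method):

```r
set.seed(42)                                   # toy data from above, restated
x1 <- rnorm(11); x2 <- x1 + 10; x3 <- x2 + 50
x1[1:4] <- NA; x2[3:8] <- NA; x3[7:10] <- NA
x1[11] <- x2[11] <- x3[11] <- NA

filled <- cbind(x1, x2, x3)
repeat {
  n_missing <- sum(is.na(filled))
  for (i in 1:3) for (j in 1:3) {
    if (i == j) next
    # Estimate the constant offset i - j from jointly observed rows...
    both <- !is.na(filled[, i]) & !is.na(filled[, j])
    if (!any(both)) next
    offset <- mean(filled[both, i] - filled[both, j])
    # ...and use it to fill sensor i wherever sensor j is available.
    gap <- is.na(filled[, i]) & !is.na(filled[, j])
    filled[gap, i] <- filled[gap, j] + offset
  }
  if (sum(is.na(filled)) == n_missing) break   # no progress -> stop
}
fused <- rowMeans(filled)   # defined for obs 1:10, still NA at obs 11
```

With noise, `offset` becomes an estimate, and using already-imputed entries to estimate later offsets propagates error; ordering the fills by how many sensors are simultaneously observed, as described above, would mitigate that.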

I cannot work with complete.cases (there are no complete cases), and I do not want mice to do time-series interpolations. However, I still want mice-like intelligent iterations to create reasonably consistent cross-sensor fitted values, over which I can later take the rowMeans. I could start with the observations that have the most sensors available, work my way up, and then do a rowMeans or a factor analysis. (mice could help if it has a parameter to suppress its attempts at time-series interpolation. Does it?)
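For what it's worth, mice imputes each column from the other columns of the same observation (chained equations); it does not interpolate along time unless you add lagged variables yourself, so a plain call may already behave the way you want. A hypothetical sketch (my parameter choices, untested against your real data; note I add a little noise, because with exactly collinear columns mice tends to drop predictors as collinear):

```r
library(mice)  # assumes the mice package is installed

set.seed(42)
x1 <- rnorm(11)
x2 <- x1 + 10 + rnorm(11, sd = 0.1)
x3 <- x2 + 50 + rnorm(11, sd = 0.1)
x1[1:4] <- NA; x2[3:8] <- NA; x3[7:10] <- NA
x1[11] <- x2[11] <- x3[11] <- NA

dat <- data.frame(x1, x2, x3)
# method = "norm.predict": deterministic linear-regression predictions.
# Predictors are the other sensors at the same time point, so nothing
# here interpolates along time.
imp <- mice(dat, m = 1, method = "norm.predict", printFlag = FALSE)
fused <- rowMeans(complete(imp))
# Caveat: mice also fills obs 11, where no sensor reported at all;
# re-blank rows with no observed value to keep them honest.
fused[rowSums(!is.na(dat)) == 0] <- NA
```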

However, it would be better to recognize that measured values are more reliable than fitted values.
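One simple way to encode that preference (my own sketch; the 0.25 down-weight for fitted values is an arbitrary choice) is a weighted row mean that trusts measured entries more than imputed ones:

```r
# filled: numeric matrix after imputation; measured: logical matrix that is
# TRUE where the value was actually observed rather than fitted.
weighted_fuse <- function(filled, measured, w_fit = 0.25) {
  w <- ifelse(measured, 1, w_fit)
  w[is.na(filled)] <- 0                  # entries never recovered get no weight
  vals <- replace(filled, is.na(filled), 0)
  rowSums(vals * w) / rowSums(w)         # NaN where a row has no values at all
}

# Toy usage: one measured value (weight 1) and one fitted value (weight 0.25).
weighted_fuse(matrix(c(10, 12), nrow = 1), matrix(c(TRUE, FALSE), nrow = 1))
# -> 10.4
```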

Pointers to solutions would be highly appreciated.

1 Answer

Jon Spring

Not sure if this solves your problem, but it gives me the result I'd expect from the data, identifying the typical change from point to point among the available observations.

library(tidyverse)
# using set.seed(42) before generating data
data.frame(x1, x2, x3) |>
  mutate(time = row_number()) |>
  pivot_longer(-time) |>
  mutate(chg = value - lag(value), .by = name) |>
  summarize(chg_median = median(chg, na.rm = TRUE), .by = time)

Result

    time chg_median
   <int>      <dbl>
 1     1     NA    
 2     2     -1.94 
 3     3      0.928
 4     4      0.270
 5     5     -0.229
 6     6     -0.510
 7     7      1.62 
 8     8     -1.61 
 9     9      2.11 
10    10     -2.08 
11    11     NA 
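If it helps, these per-step medians can be cumulated back into a single level series (my own follow-up, not part of the answer; the level is only defined up to a constant, and steps 1 and 11 carry no information, so the final entry is not trustworthy). A base-R equivalent of the same computation:

```r
set.seed(42)
x1 <- rnorm(11); x2 <- x1 + 10; x3 <- x2 + 50
x1[1:4] <- NA; x2[3:8] <- NA; x3[7:10] <- NA
x1[11] <- x2[11] <- x3[11] <- NA

X <- cbind(x1, x2, x3)
chg <- apply(X, 2, function(v) c(NA, diff(v)))     # per-sensor step changes
chg_median <- apply(chg, 1, median, na.rm = TRUE)  # same values as the table
# Cumulate into a level series relative to time 1; NA medians (steps 1 and
# 11) are treated as "no information", so level[11] just repeats level[10].
level <- cumsum(ifelse(is.na(chg_median), 0, chg_median))
```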

Visual check

df <- data.frame(x1, x2, x3) |>   # same chain as above, stopped after mutate(chg = ...)
  mutate(time = row_number()) |>
  pivot_longer(-time) |>
  mutate(chg = value - lag(value), .by = name)

df |>
  ggplot(aes(time, value)) +
  geom_point(aes(color = name)) +
  geom_line(aes(color = name)) +
  geom_segment(aes(x = time - 0.5, xend = time - 0.5, y = 0, yend = chg_median),
               arrow = arrow(type = "closed", length = unit(5, "points")),
               data = df |>
                 summarize(chg_median = median(chg, na.rm = TRUE), .by = time))

[Plot: the three sensor series over time, with an arrow at each step marking the median change]