Updating a data table in R (with data.table)

80 Views Asked by At

I have two DTs

X = data.table(names = c("a", "a", "b", "b", "c", "c"), years = c("2001", "2002", "2001", "2002", "2001", "2002"), val.1 = 1:6, key = c("names", "years"))

X

| names | years | val.1 |
| -------- | -------- |
| a   | 2001   |  1 |
| a   | 2002   |  2 |
| b   | 2001   |  3 |
| b   | 2002   |  4 |
| c   | 2001   |  5 |
| c   | 2002   |  6 |

and

X.update = data.table(names = c("a", "b", "b", "c", "d", "d", "d"), years = c("2003", "2002", "2003", "2003", "2001", "2002", "2003"), val.1 = 11:17, key = c("names", "years"))

X.update

| names | years | val.1 |
| -------- | -------- |
| a   | 2003   |  11 |
| b   | 2002   |  12 |
| b   | 2003   |  13 |
| c   | 2003   |  14 |
| d   | 2001   |  15 |
| d   | 2002   |  16 |
| d   | 2003   |  17 |

The task looks natural to me. X.update supersedes all old values (val.1) for the same c("names", "year") and adds new entries everywhere else.

Here it means:

  1. A new row for all names for the year 2003
  2. A new row of d for every year (d is basically added)
  3. A correction of b in the year 2002

X.final

| names | year | val.1 |
| -------- | -------- |
| a   | 2001   |  1 |
| a   | 2002   |  2 |
| a   | 2003   |  11 |       <-added for a for 2003
| b   | 2001   |  3 |        # edited, R2Evans and Sam are right
| b   | 2002   |  12 |       <-corrected for b for 2002
| b   | 2003   |  13 |       <-added for b for 2003
| c   | 2001   |  5 |
| c   | 2002   |  6 |
| c   | 2003   |  14 |       <-added for c for 2003
| d   | 2001   |  15 |       <-added
| d   | 2002   |  16 |       <-added
| d   | 2003   |  17 |       <-added

Since I need this for tables with 100,000s of rows, I wanted to ask for an idiomatic (= fast) solution in DT.

2

There are 2 best solutions below

0
r2evans On BEST ANSWER

If you remove from X what exists in X.update and then row-bind, it'll be rather direct.

(FYI, I think your row 4 should have val.1 = 3 instead of 4.)

out <- rbind(X[!X.update,], X.update)
setkeyv(out, key(X))
out
# Key: <names, years>
#      names  years val.1
#     <char> <char> <int>
#  1:      a   2001     1
#  2:      a   2002     2
#  3:      a   2003    11
#  4:      b   2001     3
#  5:      b   2002    12
#  6:      b   2003    13
#  7:      c   2001     5
#  8:      c   2002     6
#  9:      c   2003    14
# 10:      d   2001    15
# 11:      d   2002    16
# 12:      d   2003    17

X[!X.update,] is an anti-join, operating on the keys of the first frame.

0
SamR On

You can do a full outer join with merge.data.table(), specifying all = TRUE. Then fcoalesce() the results, i.e. Fill in missing values in a vector by successively pulling from candidate vectors in order, selecting the value from the updated table first and otherwise using the original.

merge(
    X.update,
    X,
    by = c("names", "years"),
    all = TRUE,
    suffix = c("_update", "_original")
)[, val.1 := fcoalesce(val.1_update, val.1_original)][,
    `:=`(val.1_update = NULL, val.1_original = NULL)
][]

Output:

     names  years val.1
    <char> <char> <int>
 1:      a   2001     1
 2:      a   2002     2
 3:      a   2003    11
 4:      b   2001     3
 5:      b   2002    12
 6:      b   2003    13
 7:      c   2001     5
 8:      c   2002     6
 9:      c   2003    14
10:      d   2001    15
11:      d   2002    16
12:      d   2003    17