Best way for replacing values in a data.frame using the information stored in another one?

70 Views Asked by At

I have two different objects:

AUX <- data.frame(SPANISH = c("Si", "No", "Tal vez"),
                  ENGLISH = c("Yes", "No", "Perhaps"))


TAB <- data.frame(V1 = c(18, 15, 19, 23),
                  V2 = c("Si", "Si", "No", "Si"),
                  V3 = c("Tal vez", "No", "No", "Si"))

I want to replace in TAB the values appearing in AUX; that is, where there is a "Si" in TAB y want to replace it by a "Yes" ad so on.

This is a toy example, what I really have is two data.frames bigger than these two presented here. Which is the fastest way to replace the values in TAB using the information in AUX? I saw some suggestions about using mutate, but is it there any faster alternative?

Thanks in advance

3

There are 3 best solutions below

4
Wimpel On BEST ANSWER

This a a possible data.table approach... I bet it can also be achieved using some clever joining, but this sholuld work fine except for huuuge data.tables

library(data.table)
setDT(AUX); setDT(TAB)
# melt to long
TAB.long <- melt(TAB, id.vars = "V1")
# update
TAB.long[AUX, value := i.ENGLISH, on = .(value = SPANISH)]
# cast to wide again
TAB.new <- dcast(TAB.long, V1 ~ variable, value.var = "value")
#       V1     V2      V3
#    <num> <char>  <char>
# 1:    15    Yes      No
# 2:    18    Yes Perhaps
# 3:    19     No      No
# 4:    23    Yes     Yes
0
G. Grothendieck On

Try match. No packages are used.

cols <- c("V2", "V3")
replace(TAB, cols, lapply(cols,
  \(x) with(AUX, ENGLISH[match(TAB[[x]], SPANISH)])))

or form a dictionary and look up the names.

cols <- c("V2", "V3")
dict <- with(AUX, setNames(ENGLISH, SPANISH))
replace(TAB, cols, lapply(cols,\(x) dict[TAB[[x]]]))

Performance Benchmark

The benchmark below indicates that a and b take about the same time based on the median column and the ratio of run times to c (c takes about 16x longer than a or b) is statistically significant based on the cld column.

library(microbenchmark)
library(data.table)

TAB2 <- TAB; AUX2 <- AUX
setDT(AUX2); setDT(TAB2)

microbenchmark(
  a = {
    cols <- c("V2", "V3")
    replace(TAB, cols, lapply(cols,
      \(x) with(AUX, ENGLISH[match(TAB[[x]], SPANISH)])))
  },
  b = {
    cols <- c("V2", "V3")
    dict <- with(AUX, setNames(ENGLISH, SPANISH))
    replace(TAB, cols, lapply(cols,\(x) dict[TAB[[x]]]))
  },
c = {
    TAB.long <- melt(TAB2, id.vars = "V1")
    TAB.long[AUX2, value := i.ENGLISH, on = .(value = SPANISH)]
    TAB.new <- dcast(TAB.long, V1 ~ variable, value.var = "value")
})

giving

Unit: microseconds
 expr    min      lq      mean   median       uq     max neval cld
    a  518.9  547.90   794.820   626.60   745.20  8093.7   100  a 
    b  496.7  536.95   796.511   626.35   880.65  6591.6   100  a 
    c 7883.7 8677.65 11782.475 10392.95 12110.25 95478.0   100   b
1
Alper Göktuğ Tamer On

Try this code:

library(data.table)
AUX <- data.frame(SPANISH = c("Si", "No", "Tal vez"),
              ENGLISH = c("Yes", "No", "Perhaps"))

TAB <- data.frame(V1 = c(18, 15, 19, 23),
              V2 = c("Si", "Si", "No", "Si"),
              V3 = c("Tal vez", "No", "No", "Si"))
for (i in 1:nrow(TAB)) {


for (j in 1:ncol(TAB)) {
if (is.character(TAB[i, j])) {
  index <- match(TAB[i, j], AUX$SPANISH)
  if (!is.na(index)) {
    TAB[i, j] <- AUX$ENGLISH[index]


}}}} 


print(TAB)