c-/r-bind from a list of data tables w/o recycling and w/ using list names as part of col names

62 Views Asked by At

there many post which shed some light on individual parts of my problem but I can't puzzle it together and need help. I'm trying to bind many data tables with varying row-count but identical column-count and -names from a list of data tables but:

  1. I do not want to recycle rows from the tables with fewer rows
  2. I do want to use the list item names to create individual columns from the "border" columns in the resulting table
  3. I do not want to use the list item names and create a single column from all "value" columns in the resulting table

Here is s small example and various partial solutions taken from other posts:

# Create dummy list of data frames    
df1 <- data.frame(border=c(1,2,3), value=c(4,5,6))
df2 <- data.frame(border=as.factor(c("A","B")), value=c(3,5.7))
df_lst <- list(df1, df2)
names(df_lst) <- c("df1","df2")


t1 <- as.data.table(df_lst)
t2 <- DataVisualizations::CombineCols(df1, df2)
t3 <- do.call(CombineCols,as.data.table(df_lst))
t4 <- rbindlist(df_lst)

Can anyone help me puzzle this together such that I can get something like this using the list (an not the individual data frames which I do not have in the real life example) as sole input ...

   df1 df2 value
1:   1 NA    4
2:   2 NA    5
3:   3 NA    6
4:  NA  A   3.0
5:  NA  B   5.7

Any hints appreciated!

Cheers, Mark

2

There are 2 best solutions below

3
Maël On BEST ANSWER

You can use mapply to rename the first column with the name of the data.frame, and then use dplyr::bind_rows (or if you prefer, data.table::rbindfill(fill = TRUE)):

df_lst |> 
  mapply(FUN = \(x, y){names(x)[1] <- y; x}, 
         x = _, y = names(df_lst), SIMPLIFY = FALSE) |> 
  dplyr::bind_rows()

#   df1 value  df2
# 1   1   4.0 <NA>
# 2   2   5.0 <NA>
# 3   3   6.0 <NA>
# 4  NA   3.0    A
# 5  NA   5.7    B
3
smacks On

I am not sure I have understood what you didn't want, but I managed to get your desired output using base R.

  1. Using a for loop (there is probably a better way out there such as using lapply for example) to rename the column "border" of your dataframe so it takes the name of the dataframe you desired. (not sure if this is the second point that you didn't want)
  2. use reduce to merge the dataframes of the list. It should work with more than two dataframes as well.
df1 <- data.frame(border=c(1,2,3), value=c(4,5,6))
df2 <- data.frame(border=as.factor(c("A","B")), value=c(3,5.7))
df_lst <- list(df1, df2)
list_names <- c("df1","df2")
names(df_lst) <- list_names

for(i in list_names) {
  names(df_lst[[i]])[which(names(df_lst[[i]]) == "border")] <- i
  # works fine if you just want to specify position of column and not name :
  # names(df_lst[[i]])[1] <- i
}

df_lst <- Reduce(
            function(df1,df2) {
               merge(
                 df1,
                 df2,
                 by = c("value"),
                 all = TRUE
               )
             },
             df_lst
            )

> print(df_lst)
  value df1  df2
1   3.0  NA    A
2   4.0   1 <NA>
3   5.0   2 <NA>
4   5.7  NA    B
5   6.0   3 <NA>

Hope this helps !

Edit : realizing there will be a problem if you have same values in the column "value", merge may not be the solution. for example with different values :

df1 <- data.frame(border=c(1,2,5,3), value=c(4,5,5,6))
df2 <- data.frame(border=as.factor(c("A","B","C")), value=c(3,5.7,5))

> df_lst
  value df1  df2
1   3.0  NA    A
2   4.0   1 <NA>
3   5.0   2    C # you get C twice
4   5.0   5    C # you get C twice
5   5.7  NA    B
6   6.0   3 <NA>