Create new column with string containing column names based on condition in R

614 Views Asked by At

I need to clean up a large datafile (±1300 columns) and store it into a logical data frame. I know it should be possible, but I have no idea how.

So my data only contains 'yes', 'no' or 'n/a' values, but it's the columns they are in that have all the needed information. It's about who you spent time with during your day. Each row is 1 participant. Here is an example:

Monday.06.00.adult.1 Monday.06.00.adult.2 Monday.06.00.adult.3 Monday.07.00.adult.1 Monday.07.00.adult.2 Monday.07.00.adult.3 Monday.08.00.adult.1 Monday.08.00.adult.2 Monday.08.00.adult.3
yes yes n/a yes yes n/a no yes n/a
yes no no yes no no yes yes yes

Here you can deduct that participant 1 spends his Monday morning at 6 o'clock with adult 1 and adult 2, and there is no adult 3. Adult 1 leaves at 8, as then his presence is changed to 'no'. For participant 2 we see that his mornings start with only adult 1, and adult 2 and 3 join at 8. I have columns for every hour of every day for a whole week for spending it with 5 adults and 5 children... How I would want my data to look at the end:

Participant Day Hour SpentWith
1 Monday 06 Adult1, Adult2
1 Monday 07 Adult1, Adult2
1 Monday 08 Adult2
2 Monday 06 Adult1
2 Monday 07 Adult1
2 Monday 08 Adult1, Adult2, Adult3

Is there any way that I can select (per row) all the 'yes' answers, and then combine their column names if they overlap on the day and hour? (So for participant 1, the 'yes' answers would be in column Monday.06.00.adult.1 and Monday.06.00.adult.2, and I need to be able to extract and then combine 'Adult1' and 'Adult2' and store it into a separate column.

I've tried to collect all column names from the 'yes' answers with

names(data)[which(data == "yes", arr.ind=T)[, "col"]]

But this provides me with a list where I cannot see which value (column name) belongs to which participant, which makes it impossible to concatenate the right ones.

"monday.06.00.adult.1"      "monday.06.00.adult.1"      "monday.06.00.adult.2"      "monday.06.00.adult.2"      "monday.07.00.adult.1"      "monday.07.00.adult.2"      "monday.07.00.adult.2".     "monday.07.00.child.1"      "monday.07.00.child.1"     

(This is a random example)

I've thought about using dplyr (but I have little experience with it).

data <- data %>% mutate(SpentWith = case_when(
if monday.07.00.adult.1 == "yes", ~ "Adult1",
if monday.07.00.adult.2 == "yes", ~ "Adult2",
if monday.07.00.adult.3 == "yes", ~ "Adult3"
))

Where it would be ideal if it would add on the 'Adult1', 'Adult2', 'Adult3' strings in one character variable if it meets that condition.

Any help in any step of this process would be very helpful!

1

There are 1 best solutions below

4
zephryl On

Using dplyr and tidyr::pivot_longer():

library(dplyr) 
library(tidyr)

data %>% 
  mutate(Participant = row_number()) %>%   # add participant IDs
  pivot_longer(                            # pivot rows to columns:
    !Participant,                            ## use all columns except Participant
    names_sep = "\\.",                       ## break up column names by "."
    names_to = c("Day", "Hour", "Minute", "Adult", "SpentWith")  ## names for new cols
  ) %>% 
  filter(value == "yes") %>%              # keep "yes" values only
  group_by(Participant, Day, Hour) %>%    # within each Participant / Day / Hour:
  summarize(                              # combine SpentWith values into one string
    SpentWith = paste0("Adult", SpentWith, collapse = ", "),
    .groups = "drop"
  )
# A tibble: 6 × 4
  Participant Day    Hour  SpentWith             
        <int> <chr>  <chr> <chr>                 
1           1 Monday 06    Adult1, Adult2        
2           1 Monday 07    Adult1, Adult2        
3           1 Monday 08    Adult2                
4           2 Monday 06    Adult1                
5           2 Monday 07    Adult1                
6           2 Monday 08    Adult1, Adult2, Adult3