I need to clean up a large datafile (±1300 columns) and store it into a logical data frame. I know it should be possible, but I have no idea how.
So my data only contains 'yes', 'no' or 'n/a' values, but it's the columns they are in that have all the needed information. It's about who you spent time with during your day. Each row is 1 participant. Here is an example:
| Monday.06.00.adult.1 | Monday.06.00.adult.2 | Monday.06.00.adult.3 | Monday.07.00.adult.1 | Monday.07.00.adult.2 | Monday.07.00.adult.3 | Monday.08.00.adult.1 | Monday.08.00.adult.2 | Monday.08.00.adult.3 |
|---|---|---|---|---|---|---|---|---|
| yes | yes | n/a | yes | yes | n/a | no | yes | n/a |
| yes | no | no | yes | no | no | yes | yes | yes |
Here you can deduct that participant 1 spends his Monday morning at 6 o'clock with adult 1 and adult 2, and there is no adult 3. Adult 1 leaves at 8, as then his presence is changed to 'no'. For participant 2 we see that his mornings start with only adult 1, and adult 2 and 3 join at 8. I have columns for every hour of every day for a whole week for spending it with 5 adults and 5 children... How I would want my data to look at the end:
| Participant | Day | Hour | SpentWith |
|---|---|---|---|
| 1 | Monday | 06 | Adult1, Adult2 |
| 1 | Monday | 07 | Adult1, Adult2 |
| 1 | Monday | 08 | Adult2 |
| 2 | Monday | 06 | Adult1 |
| 2 | Monday | 07 | Adult1 |
| 2 | Monday | 08 | Adult1, Adult2, Adult3 |
Is there any way that I can select (per row) all the 'yes' answers, and then combine their column names if they overlap on the day and hour? (So for participant 1, the 'yes' answers would be in column Monday.06.00.adult.1 and Monday.06.00.adult.2, and I need to be able to extract and then combine 'Adult1' and 'Adult2' and store it into a separate column.
I've tried to collect all column names from the 'yes' answers with
names(data)[which(data == "yes", arr.ind=T)[, "col"]]
But this provides me with a list where I cannot see which value (column name) belongs to which participant, which makes it impossible to concatenate the right ones.
"monday.06.00.adult.1" "monday.06.00.adult.1" "monday.06.00.adult.2" "monday.06.00.adult.2" "monday.07.00.adult.1" "monday.07.00.adult.2" "monday.07.00.adult.2". "monday.07.00.child.1" "monday.07.00.child.1"
(This is a random example)
I've thought about using dplyr (but I have little experience with it).
data <- data %>% mutate(SpentWith = case_when(
if monday.07.00.adult.1 == "yes", ~ "Adult1",
if monday.07.00.adult.2 == "yes", ~ "Adult2",
if monday.07.00.adult.3 == "yes", ~ "Adult3"
))
Where it would be ideal if it would add on the 'Adult1', 'Adult2', 'Adult3' strings in one character variable if it meets that condition.
Any help in any step of this process would be very helpful!
Using dplyr and
tidyr::pivot_longer():