I have the following data frame:
id = c(1,2,3)
where_home = c(1, 0, NA)
where_work = c(0, 1, NA)
with_alone = c(0,0,0)
with_parents = c(0,1,1)
with_colleagues = c(1,1,0)
gender_male = c(1,0,1)
gender_female = c(0,1,0)
p_affect = c(10,14,20)
n_affect = c(20,30,10)
df = data.frame(id, where_home, where_work,
with_alone, with_parents, with_colleagues,
gender_male, gender_female, p_affect, n_affect)
Where there are 3 IDs, and multiple hot-encoded columns (where, with, gender) along with not hot-encoded columns (p_affect, n_affect).
What I would like is to convert the hot-encoded columns while keeping the not hot-encoded ones as they are.
I did the following:
library(dplyr)
df_transformed <- df %>%
rowwise() %>%
mutate(Gender = case_when(
gender_male == 1 ~ "Male",
gender_female == 1 ~ "Female",
TRUE ~ NA_character_
),
Context = paste(
ifelse(with_alone == 1, "Alone", ""),
ifelse(with_parents == 1, "Parents", ""),
ifelse(with_colleagues == 1, "Colleagues", ""),
collapse = " and "
),
Location = trimws(ifelse(
where_home == 1 & where_work == 1,
'Home and Work',
paste(
ifelse(where_home == 1, 'Home', ''),
ifelse(where_work == 1, 'Work', '')
)
))) %>%
select(-starts_with("gender_"), -starts_with("with_"))
df_transformed <- df_transformed %>%
select(id, Gender, Context, Location, p_affect, n_affect)
And the result:
id Gender Context Location p_affect n_affect
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 Male " Colleagues" Home 10 20
2 2 Female " Parents Colleagues" Work 14 30
3 3 Male " Parents " NA 20 10
This seems to work, but there are a few issues:
- some of the spacing seems strange in the 'Context' column. I would prefer a cleaner format without any spaces separated by 'and' (e.g. "Parents and Colleagues" instead of " Parents Colleagues"
- in this approach, I need to define each column and each case separately, which is tedious, as the original data frame is massive with many columns and possible options. I would like something like:
pseudocode:
vector_of_columns_that_are_hot_encoded = c('where', 'with', 'gender')
for column in vector_of_columns:
# modify the hot-encoded columns and make a new data frame while keeping the columns that are not in the vector_of_columns_that_are_hot_encoded as they are
# mind that some hot-encoded columns are binary (gender), while others have multiple values. If multiple values are present, put them in the data frame using "Value 1 and Value 2 and ..."
I think there has to be a simple way of doing this. As I am a beginner with dplyr, if possible to explain the code and keep it simple.
If the names are that nicely formatted as shown, you could
strsplitat'_'and exploit pre and suffixes inby.Explanation: The
strsplitsplits the selected variables at the"_"and thesapplyaround it selects the1st element, so the (complete) variable names are splitbytheir prefixes. Thesubonly leaves the suffix, so the comparison with1inapplyselects the correct endings, a little nice with a comma usingtoString. Next wecbindfirst the created selections and finally everything to the rest ofdf.Data: