Need to convert entire data frame of character strings to factor for association rules analysis

I have a data frame of character strings and missing values that I need to convert to factors in R in preparation for a market basket analysis. The rows are transactions without transaction IDs. I'm concerned that if I convert individual columns to factors, the same item in two different columns will not be recognized as the same item once I convert the data frame to the transactions class. This is for a class. I met with the instructor, who showed me this line in R 4.1:

newDF <- factor(oldDF)

...but in R 4.2, this fails with the message: "Warning in xtfrm.data.frame(x) : cannot xtfrm data frames"

The message makes sense to me: when I read up on the factor() function, I learned that it sorts the levels alphabetically. For this reason, I'm guessing I don't want to convert the data frame to a single, large vector and then run factor() on it.
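A quick check of what the sorting actually touches may be useful here. This is a minimal sketch with made-up item names (not taken from the real data); it suggests that only the order of the levels is sorted, while the values themselves stay where they were.

```r
# Hypothetical items, for illustration only
x <- c("milk", "bread", "eggs", "bread")
f <- factor(x)

levels(f)        # the levels are sorted
as.character(f)  # the values keep their original order
```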

Maybe the trans() function from the "arules" package automatically deals with factors for the same item in different columns.
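One possible way around per-column factors, sketched under assumptions: arules documents a coercion from a list of item vectors to its transactions class, so each row could be turned into one character vector of items with the missing values dropped. The data frame contents and the names `baskets` and `trans` below are hypothetical, and the arules step only runs if the package is installed.

```r
# Hypothetical data, for illustration only
oldDF <- data.frame(
  V1 = c("milk", "bread"),
  V2 = c("bread", "milk"),
  V3 = c(NA, "eggs"),
  stringsAsFactors = FALSE
)

# One character vector of items per row (transaction), without NA or repeats
baskets <- lapply(seq_len(nrow(oldDF)), function(i) {
  row_items <- unlist(oldDF[i, ], use.names = FALSE)
  unique(row_items[!is.na(row_items)])
})

# Assumption: arules is installed; the list-to-transactions coercion is
# part of that package, not base R
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  trans <- as(baskets, "transactions")
}
```

Because the items are carried as plain labels in this form, "milk" in the first column and "milk" in the second column are the same string regardless of which column they came from.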

I just want an item in one column to be evaluated as the same item in another column, but I don't know how assigning factors on a per-column basis supports this, since there is no guarantee that all items are represented in all of the columns.
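To make the concern concrete, here is a small sketch with hypothetical data. When each column is converted on its own, the same item can end up with a different underlying integer code in each column, even though its label is unchanged.

```r
# Hypothetical two-column basket data, for illustration only
df <- data.frame(
  item1 = c("milk", "bread"),
  item2 = c("milk", "tea"),
  stringsAsFactors = FALSE
)

f1 <- factor(df$item1)  # levels: "bread", "milk"
f2 <- factor(df$item2)  # levels: "milk", "tea"

# "milk" is the first value in both columns, but its integer code differs
code1 <- as.integer(f1)[1]
code2 <- as.integer(f2)[1]

# The labels still match when compared as text
same_label <- as.character(f1)[1] == as.character(f2)[1]
```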

There is 1 best solution below

lhs On

If you provide the levels to factor(), the resulting factor will contain all of the provided levels, even those that never appear in that column. Be careful, though: if you leave a value out of the levels, or you spell it wrong, it will be replaced with NA without warning, so make sure to include all of the possible values.
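A minimal sketch of both behaviours, using hypothetical values: the unused levels are kept, and a value that is not among the levels silently becomes NA.

```r
# Hypothetical values, for illustration only
all_items <- c("item A", "item B", "item C")
col <- c("item B", "item D", NA, "item B")

f <- factor(col, levels = all_items)

levels(f)  # all three levels are kept, although only "item B" occurs
is.na(f)   # "item D" has become NA, with no warning
```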

You can apply this to all of the columns at once using across(), as in the code below. Every column in newDF will then be a factor with the same levels: "item A", "item B", "item C". If there were any "item D" values in any of the columns, they will have been replaced with NA.

library(dplyr)

# Convert every column to a factor that shares one fixed set of levels
newDF <- oldDF %>% 
  mutate(across(everything(), ~ factor(.x, levels = c("item A", "item B", "item C"))))
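If typing out every level by hand is error-prone, one option is to collect the levels from the data itself. The following is a base R sketch under assumptions, not part of the answer's dplyr approach: the data frame contents and the name `all_items` are hypothetical, and it presumes every column is a character column.

```r
# Hypothetical data, for illustration only
oldDF <- data.frame(
  V1 = c("item A", "item B", NA),
  V2 = c("item C", NA, "item A"),
  stringsAsFactors = FALSE
)

# Every distinct item across all columns; sort() drops the NA by default
all_items <- sort(unique(unlist(oldDF, use.names = FALSE)))

# Give every column the same set of levels while keeping the data frame shape
newDF <- oldDF
newDF[] <- lapply(oldDF, factor, levels = all_items)

sapply(newDF, nlevels)  # each column reports the same number of levels
```

Because the levels come from the data, no existing value can be turned into NA by a misspelled level; only the values that were already missing stay NA.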