Clustering or multiple correspondence analysis for a set of check-all-that-apply questions in R


I have a dataset that contains a set of check-all-that-apply questions. Let's say it goes like this:

Q1. "Which sports do you play?"

  • A. Football B. Basketball C. ... etc. etc. etc.

Q2. "Which of the following elements motivated you to play sports?"

  • A. Competitiveness B. Achievement C. ... etc. etc. etc.

There are 4 goals I want to achieve:

  1. Understand which sports are "close" to each other
  2. Understand which elements are "close" to each other
  3. Understand which sports are associated with which elements
  4. Categorize participants by their participation preferences in sports (e.g. stamina sports players vs. agility sports players, etc.)

Is it valid to do K-means clustering on a table like the one below (note: the actual data is much larger; this table is only for demo)? Why or why not?

Is multiple correspondence analysis a better way? Why or why not?

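# Toy indicator data: one row per respondent, 1 = option was checked
football <- c(1,1,1,1,0,0,0,0)
basketball <- c(1,1,0,0,0,0,0,1)
other <- c(1,0,0,0,0,0,0,0)
df <- data.frame(football, basketball, other)
m <- as.matrix(df)
# Co-occurrence counts, with each row divided by that row sport's total:
# entry [i, j] is the share of respondents who checked i that also checked j
t(m) %*% m / colSums(m)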

#             football basketball     other
# football   1.0000000        0.5 0.2500000
# basketball 0.6666667        1.0 0.3333333
# other      1.0000000        1.0 1.0000000
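
The matrix above is not symmetric (0.5 for football given basketball's row vs. 0.667 the other way round), so it cannot be used directly as a distance for goal 1. One possible sketch, assuming a symmetric Jaccard-type measure is acceptable for the real data, uses base R's `dist(..., method = "binary")` on the transposed matrix and then hierarchical clustering. The names `jaccard_dist` and `sport_tree` and the choice of average linkage are illustrative assumptions, not something taken from the question.

```r
# Same toy data as in the question: rows are respondents, 1 = option checked
m <- cbind(
  football   = c(1, 1, 1, 1, 0, 0, 0, 0),
  basketball = c(1, 1, 0, 0, 0, 0, 0, 1),
  other      = c(1, 0, 0, 0, 0, 0, 0, 0)
)

# dist() compares rows, so transpose to compare the sports (columns).
# method = "binary" ignores respondents who checked neither option.
jaccard_dist <- dist(t(m), method = "binary")
print(round(as.matrix(jaccard_dist), 3))

# Average-linkage hierarchical clustering on that distance (assumed choice)
sport_tree <- hclust(jaccard_dist, method = "average")

# Cut the tree into two groups of sports
print(cutree(sport_tree, k = 2))
```

On this toy table, football and basketball have the smallest distance (0.6), so they merge first. With only three options the result is not meaningful beyond showing the mechanics.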