I have a dataset that contains a set of check-all-that-apply questions. Let's say it goes like this:
Q1. "Which sports do you play?"
- A. Football B. Basketball C. ... etc.
Q2. "Which of the following elements motivated you to play sports?"
- A. Competitiveness B. Achievement C. ... etc.
There are 4 goals I want to achieve:
- Understand which sports are "close" to each other
- Understand which elements are "close" to each other
- Understand which sports are associated with which elements
- Categorize participants by the sports they choose to play (e.g. stamina-sport players vs. agility-sport players, and so on)
Is it valid to do K-means clustering on a 0/1 table like the one below (note: the actual data is much larger; this table is only for demo)? Why or why not?
Is multiple correspondence analysis a better way? Why or why not?
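# One row per respondent, one 0/1 column per sport (1 = selected)
football   <- c(1, 1, 1, 1, 0, 0, 0, 0)
basketball <- c(1, 1, 0, 0, 0, 0, 0, 1)
other      <- c(1, 0, 0, 0, 0, 0, 0, 0)
df <- data.frame(football, basketball, other)
m <- as.matrix(df)
# Co-selection counts, with each row divided by that row sport's total
t(m) %*% m / colSums(m)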
#             football basketball     other
# football   1.0000000        0.5 0.2500000
# basketball 0.6666667        1.0 0.3333333
# other      1.0000000        1.0 1.0000000
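
To spell out what the table above contains: as far as I can tell, R recycles `colSums(m)` down the rows, so each row is divided by the number of respondents who selected that row's sport. The symbols $M$ (the 0/1 matrix `m`), $n_i$ and $n_{ij}$ are my own notation, introduced only for this illustration:

```latex
\[
\left[\frac{M^{\top} M}{\operatorname{colSums}(M)}\right]_{ij}
  = \frac{n_{ij}}{n_i}
  = \hat{P}(\text{selected } j \mid \text{selected } i),
\qquad
n_{ij} = \sum_{r} M_{ri} M_{rj},
\quad
n_i = n_{ii}
\]
```

For example, 2 of the 4 football players also play basketball (0.5), while 2 of the 3 basketball players also play football (0.667). So the table is not symmetric, and it is not a distance matrix as it stands.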