How to decide on the clustering method for categorical data in R?


I'm trying to perform a cluster analysis on mixed data (demographic variables plus preference ratings on Likert scales from 1 to 10). I am applying hierarchical clustering to dissimilarities computed with the function daisy() for mixed data, but when I compute the goodness of fit (the cophenetic correlation), the score is 0.60, which is not very high.

How can I improve the goodness of fit? Is a hierarchical method suitable for this data? Should the Likert scale data be treated as factors or as numeric? Also, is the linkage used in hclust(seg.dist, method="complete") suitable for my data?

I also tried Latent Class Analysis, but the results were not interesting (unless I was doing it wrong).

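library(cluster)  # provides daisy()
# Gower's coefficient is used automatically when not all columns are numeric
seg.dist <- daisy(EUR_data)
as.matrix(seg.dist)  # inspect the dissimilarities as a full matrix
seg.hc <- hclust(seg.dist, method="complete")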

To calculate the cophenetic correlation:

cor(cophenetic(seg.hc), seg.dist)


There is 1 answer below

Answered by Has QUIT--Anony-Mousse

Improve preprocessing of your data.

Some attributes will be more important than others.
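As a rough sketch of what weighting attributes could look like: daisy() has a weights argument that is applied when Gower's coefficient is used. The toy data frame and the particular weights below are made up for illustration and are not taken from the question; which attributes deserve more weight is a judgment call about your own data.

```r
library(cluster)

# Hypothetical toy data standing in for EUR_data (not from the question).
set.seed(1)
toy <- data.frame(
  gender = factor(sample(c("f", "m"), 20, replace = TRUE)),
  age    = round(runif(20, 18, 70)),
  pref_a = sample(1:10, 20, replace = TRUE),
  pref_b = sample(1:10, 20, replace = TRUE)
)

# Default: every column contributes equally to the Gower dissimilarity.
d_equal <- daisy(toy, metric = "gower")

# Assumed weights: demographics count half as much as the preference items.
d_weighted <- daisy(toy, metric = "gower", weights = c(0.5, 0.5, 1, 1))

# Compare the hierarchies built from the two dissimilarities.
hc_equal    <- hclust(d_equal, method = "average")
hc_weighted <- hclust(d_weighted, method = "average")
```

Plotting both dendrograms side by side is a quick way to see how sensitive the result is to the chosen weights.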

Likert attributes also often cannot be treated as interval-scaled: for cultural reasons, people may be less likely to give a 7 than a 6 or an 8, for example where 7 is considered bad luck.
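One way to explore this, sketched under assumptions: encode the Likert columns as ordered factors or as plain (nominal) factors and see how the dissimilarities change. According to the daisy() documentation, ordered factors are replaced by their integer codes and then range-scaled, so that treatment stays close to the numeric one, while unordered factors contribute only a match or mismatch. The toy data and column names below are hypothetical.

```r
library(cluster)

# Hypothetical toy data (not from the question).
set.seed(1)
toy <- data.frame(
  gender = factor(sample(c("f", "m"), 20, replace = TRUE)),
  pref_a = sample(1:10, 20, replace = TRUE),
  pref_b = sample(1:10, 20, replace = TRUE)
)
likert <- c("pref_a", "pref_b")

# Treatment 1: Likert items as ordered factors (ordinal).
as_ord <- toy
as_ord[likert] <- lapply(toy[likert], factor, levels = 1:10, ordered = TRUE)

# Treatment 2: Likert items as unordered factors (nominal, order ignored).
as_nom <- toy
as_nom[likert] <- lapply(toy[likert], factor, levels = 1:10)

d_num <- daisy(toy,    metric = "gower")
d_ord <- daisy(as_ord, metric = "gower")
d_nom <- daisy(as_nom, metric = "gower")

# How strongly do the alternative encodings agree with the numeric one?
cor(d_num, d_ord)
cor(d_num, d_nom)
```

If the nominal encoding gives a very different picture from the numeric one, that suggests the ordering of the scale carries much of the structure, and the choice of encoding deserves care.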

Clustering will only be as good as your distance, so improve your preprocessing and distance computations!
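To check whether a change in preprocessing, distance, or linkage actually helps, the cophenetic correlation from the question can be recomputed for each variant. A minimal sketch on hypothetical toy data, comparing a few hclust() linkage methods on the same Gower dissimilarities:

```r
library(cluster)

# Hypothetical toy data (not from the question).
set.seed(1)
toy <- data.frame(
  gender = factor(sample(c("f", "m"), 20, replace = TRUE)),
  age    = round(runif(20, 18, 70)),
  pref_a = sample(1:10, 20, replace = TRUE)
)

d <- daisy(toy, metric = "gower")

# Cophenetic correlation for several linkage methods on the same distances.
methods <- c("complete", "average", "single")
coph <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(cophenetic(hc), d)
})
coph
```

The same loop can be run over alternative dissimilarities (different weights or encodings) instead of linkage methods. A higher cophenetic correlation only means the dendrogram preserves the input distances better; it does not by itself show that the clusters are meaningful.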