Calculating relative entropy for a latent class analysis - two methods give different results


I'm using the R package poLCA to run a latent class model with 4 categorical indicators (3 levels, 3 levels, 9 levels and 5 levels). As poLCA doesn't compute relative entropy, I have found two formulas for calculating it manually from the results, both presented in this answer here. However, I get slightly different results from these formulas on my own data but not on the poLCA package example data (carcinoma), and I'm wondering why this might be.

Here's an excerpt of my data for a reproducible example:

var1<-c(1,1,1,1,1,1,1,3,1,1,3,1,2,2,1,1,1,1,1,1,1,3,2,1,1,1,1,1,1,1,1,1,1,1,3,2,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,2,2,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,3,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,2,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,3,3,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,2,1,1,3,1,1,1,3,1,1,3,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
var2<-c(3,2,2,1,2,1,2,2,1,1,1,3,2,3,1,3,2,1,1,3,1,2,1,3,1,1,3,3,3,3,1,2,1,3,2,3,1,1,1,3,3,3,1,3,3,2,3,3,3,3,2,2,2,3,3,1,3,1,1,1,2,3,1,3,2,1,1,1,1,1,3,1,3,2,1,1,1,1,1,1,2,1,1,1,1,2,1,2,3,3,3,2,1,1,2,3,3,2,1,3,3,3,3,3,3,1,2,3,3,1,3,3,3,3,3,2,1,3,2,3,1,1,1,2,2,2,1,2,1,2,2,1,1,3,1,3,1,2,3,2,1,2,1,3,1,1,1,1,2,2,2,2,1,2,3,1,3,1,1,1,2,2,1,2,2,3,2,3,1,2,3,3,3,3,3,3,3,3,2,3,3,3,3,3,1,3,1,3,3,1,1,2,1,1,1,3,2,3,3,1,3)
var3<-c(3,8,2,3,1,8,1,1,8,8,1,8,2,8,6,6,8,9,8,4,2,2,8,6,6,6,5,6,2,6,8,2,2,9,2,9,2,8,8,4,4,2,5,8,6,2,2,2,3,2,8,8,2,4,5,9,1,1,1,8,5,3,8,3,4,3,6,1,1,2,8,1,6,5,8,4,8,8,8,8,9,8,4,3,4,1,9,1,4,3,1,2,1,2,5,8,8,4,9,4,8,8,8,4,8,8,2,8,5,2,3,6,4,9,8,2,2,1,1,3,8,1,1,4,2,5,8,1,2,8,4,1,8,8,8,4,9,4,8,5,8,4,8,4,3,8,9,8,4,9,4,4,9,9,3,8,8,8,8,8,4,3,8,4,9,4,4,4,8,4,9,4,5,8,6,8,4,4,1,2,3,3,8,4,3,3,2,6,9,2,8,4,4,8,9,8,9,2,4,1,6)
var4<-c(1,2,1,1,1,1,1,2,1,1,2,2,1,2,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,2,1,4,1,1,2,2,1,1,2,1,3,2,1,1,1,1,1,1,1,4,1,1,4,1,1,1,1,1,2,1,1,3,1,1,1,2,1,1,1,1,1,1,1,3,2,1,1,2,2,1,1,1,1,1,1,1,1,1,2,2,1,3,1,1,1,1,1,1,2,1,1,2,1,2,1,1,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,2,1,1,3,1,2,1,1,1,1,1,1,2,2,1,1,1,4,1,1,2,1,2,1,1,1,1,3,1,2,1,1,1,1,1,1,1,1,1,1,1,1,4,2,1,1,1,1)

ex.data<-data.frame(var1, var2, var3, var4)

library(poLCA)

f <- cbind(var1, var2, var3, var4)~1

set.seed(123) #poLCA uses random starting values, so set a seed for reproducibility
lc.ex<-poLCA(f, ex.data, nclass=3) #I run a 3-class model for the example

#First I tried Israel Souza's formula:

nume.E<- -sum(lc.ex$posterior * log(lc.ex$posterior), na.rm=T) #na.rm drops 0*log(0) = NaN terms
deno.E<-201*log(3) #N * log(K): N = 201 observations here (nrow(ex.data)), K = 3 classes
ent.ex<-1-(nume.E/deno.E)
ent.ex
[1] 0.7379364
##
#Then, I tried Daniel Oberski's formula
#(originally from here: http://daob.nl/wp-content/uploads/2015/07/ESRA-course-slides.pdf)

entropy<-function (p) sum(-p*log(p))

error_prior <- entropy(lc.ex$P) #entropy of the estimated class proportions
error_post <- mean(apply(lc.ex$posterior, 1, entropy), na.rm=T) #mean per-observation posterior entropy
ent.ex2 <- (error_prior - error_post) / error_prior
ent.ex2
[1] 0.7254486

Of course, these values are very close, but with my full data (N > 6000) the gap is larger: frustratingly, the same model gives an entropy of .72 with the first formula and .68 with the second. With the carcinoma data (as in Israel's example in the linked reply), the two formulas give identical values. Can anyone explain what the difference between the two formulas is, if any? Or am I applying them incorrectly? I have removed all observations with missing values on any of the variables, so that shouldn't be an issue.
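For what it's worth, the only structural difference I can see is the denominator: the first formula normalises by N*log(K), i.e. log(K) per observation, while the second normalises by entropy(lc.ex$P), the entropy of the estimated class proportions. These coincide only when the classes are (nearly) equal-sized, which might be why carcinoma matches and my data doesn't. A standalone sketch of that reasoning (my own, not from either source):

```r
# Both formulas share the same numerator (average posterior entropy);
# they differ only in the denominator used to normalise it:
#   formula 1: log(K)            -- maximum possible entropy over K classes
#   formula 2: entropy(lc.ex$P)  -- entropy of the estimated class proportions
entropy <- function(p) sum(-p * log(p))

log(3)                       # 1.098612
entropy(rep(1/3, 3))         # 1.098612 -- equals log(3) when classes are equal-sized
entropy(c(0.6, 0.3, 0.1))    # 0.8979457 -- strictly smaller when they are not
```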

Thanks in advance!
