Let's say that I have 2 variables: A (as the input) and C (as the output), so it's A -> C.
There's also another variable B, and
corr(A, B) > corr(A, C)
corr(C, B) > corr(A, C)
Would A -> B -> C get better performance with the existing model?
In other words, does this B have any information gain?
Does this middle variable have any information gain?
108 Views Asked by Chas At
The information gained about C, given A, is log(1/P(A,C)). The information gained about C, given both A and B, is log(1/P(A,B,C)). So as long as P(A,C) > P(A,B,C), there will be more information gained by including B. For discrete variables, P(A,C) is the sum of P(A,B,C) over all values of B, so P(A,C) >= P(A,B,C) always holds, with equality only when the observed value of B has conditional probability 1 given A and C.

Now, whether or not that's the case in practice depends on what you're using for the corr metric. But if A and C are dependent on B, there will be at least some values of B that give information gain. In general, I'd always include dependent variables in a model, unless the dependence is too strong, in which case some models may not work as well (e.g. neural networks).
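To make this concrete, here is a minimal sketch that compares how much A alone tells you about C with how much A and B together tell you. Note that it measures this with mutual information, I(A;C) versus I(A,B;C), rather than the log(1/P) expression above, and that everything in it is an assumption for illustration: a toy chain A -> B -> C of binary variables, a 10% flip probability on each link, and the hypothetical helper names `joint_abc`, `marginal` and `mutual_information`.

```python
import math
from itertools import product

FLIP = 0.1  # assumed noise level on each link of the toy chain A -> B -> C


def joint_abc(flip=FLIP):
    """Joint P(A, B, C): A is a fair coin, B is a noisy copy of A, C is a noisy copy of B."""
    p = {}
    for a, b, c in product((0, 1), repeat=3):
        p_b_given_a = 1 - flip if b == a else flip
        p_c_given_b = 1 - flip if c == b else flip
        p[(a, b, c)] = 0.5 * p_b_given_a * p_c_given_b
    return p


def marginal(p, keep):
    """Sum the joint distribution down to the variable positions listed in `keep`."""
    out = {}
    for key, prob in p.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out


def mutual_information(p, x_idx, y_idx):
    """I(X;Y) in bits, where X and Y are groups of variable positions in the joint."""
    p_xy = marginal(p, x_idx + y_idx)
    p_x = marginal(p, x_idx)
    p_y = marginal(p, y_idx)
    n = len(x_idx)
    total = 0.0
    for key, prob in p_xy.items():
        if prob > 0:
            total += prob * math.log2(prob / (p_x[key[:n]] * p_y[key[n:]]))
    return total


p = joint_abc()
i_a_c = mutual_information(p, (0,), (2,))     # what A alone says about C
i_ab_c = mutual_information(p, (0, 1), (2,))  # what A and B together say about C
print(f"I(A;C)   = {i_a_c:.3f} bits")
print(f"I(A,B;C) = {i_ab_c:.3f} bits")
```

With these assumed noise levels, A alone carries roughly 0.32 bits about C, while A and B together carry roughly 0.53 bits, so B does add information in this toy setting. Whether a particular model can exploit that extra information is a separate question.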