Is weightCol of Spark's RandomForestClassifier used directly in the impurity calculation?


To my knowledge, sklearn incorporates sample weights directly into the impurity formula. Take binary classification with Gini impurity as an example:

$$G = 1 - p_0^2 - p_1^2$$

With sample weights, p_0 will be calculated as:

$$p_0 = \frac{\sum_{i:\, y_i = 0} w_i}{\sum_i w_i}$$
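To make the weighted formulation concrete, here is a minimal plain-Python sketch of binary Gini impurity where the class probability is a ratio of summed sample weights. The function name `weighted_gini` is made up for illustration; this is not sklearn's actual implementation, only the arithmetic described above.

```python
def weighted_gini(labels, weights):
    """Binary Gini impurity with class probabilities taken from summed weights.

    Hypothetical helper for illustration; labels are 0 or 1.
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    # p_0 is the share of total weight carried by class-0 samples.
    w0 = sum(w for y, w in zip(labels, weights) if y == 0)
    p0 = w0 / total
    p1 = 1.0 - p0
    return 1.0 - p0 ** 2 - p1 ** 2


# Two negatives and one positive; doubling the positive's weight
# moves p_0 from 2/3 to 1/2 and so changes the impurity.
unweighted = weighted_gini([0, 0, 1], [1.0, 1.0, 1.0])
reweighted = weighted_gini([0, 0, 1], [1.0, 1.0, 2.0])
print(unweighted, reweighted)
```

With uniform weights this reduces to the usual count-based Gini impurity.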

However, looking into the source code of Spark ML, I found that the sample weights do not seem to be used when calculating the class probabilities. They appear to be used only after the split, to weight the impurities of the left and right child nodes when combining them into the total impurity. As a result, a highly weighted positive example would not increase the positive-class probability; it would only add to the total weight of its node. I'm not sure whether my reading of the code is correct, so I'm hoping someone with more expertise can clarify.
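The difference between the two behaviours can be shown on a toy split. The sketch below is plain Python and does not reproduce Spark's or sklearn's code; the function names and the toy data are assumptions made purely for illustration. `split_impurity_weighted_probs` uses weights inside the class probabilities, while `split_impurity_count_probs` models the behaviour suspected in the question: probabilities from raw counts, with weights used only to combine the child impurities.

```python
def gini(p0):
    """Binary Gini impurity for a given class-0 probability."""
    return 1.0 - p0 ** 2 - (1.0 - p0) ** 2


def split_impurity_weighted_probs(left, right):
    """Weights enter the class probabilities AND the child averaging.

    Each child is a non-empty list of (label, weight) pairs.
    """
    total, acc = 0.0, 0.0
    for node in (left, right):
        node_weight = sum(w for _, w in node)
        p0 = sum(w for y, w in node if y == 0) / node_weight
        acc += node_weight * gini(p0)
        total += node_weight
    return acc / total


def split_impurity_count_probs(left, right):
    """Hypothesised behaviour: probabilities from raw counts,
    weights used only when averaging the child impurities."""
    total, acc = 0.0, 0.0
    for node in (left, right):
        node_weight = sum(w for _, w in node)
        p0 = sum(1 for y, _ in node if y == 0) / len(node)
        acc += node_weight * gini(p0)
        total += node_weight
    return acc / total


# Toy split: the left child holds one heavily weighted positive example.
left = [(0, 1.0), (1, 5.0)]
right = [(0, 1.0), (0, 1.0)]
print(split_impurity_weighted_probs(left, right))
print(split_impurity_count_probs(left, right))
```

If the two functions give different values for the same split, as they do here, then which one an implementation follows affects which split is chosen, so comparing trees fitted with and without a large weight on a single row is one way to check the behaviour empirically.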
