I'm trying to build a decision tree with the C4.5 algorithm for a school project. The decision tree is for Haberman's Survival Data Set; the attribute information is as follows.
Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
We need to implement a decision tree where each leaf has exactly one distinct result (meaning the entropy of that leaf should be 0). However, there are six instances that have the same attribute values but different results.
For example:
66,58,0,2
66,58,0,1
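Just to make the problem concrete, here is a rough sketch of the entropy calculation for a leaf that holds only these two instances (this is not my actual project code, and the function name leaf_entropy is just something I made up for the illustration). With one case of each class the entropy is 1 bit, so it can never be reduced to 0 no matter how I split on the attributes:

#include <math.h>
#include <stdio.h>

/* Shannon entropy of a two-class leaf: -sum(p_i * log2(p_i)). */
static double leaf_entropy(int class1_count, int class2_count)
{
    int total = class1_count + class2_count;
    double entropy = 0.0;
    double p;

    if (class1_count > 0) {
        p = (double)class1_count / total;
        entropy -= p * log2(p);
    }
    if (class2_count > 0) {
        p = (double)class2_count / total;
        entropy -= p * log2(p);
    }
    return entropy;
}

int main(void)
{
    /* The contradictory pair 66,58,0,1 and 66,58,0,2:
       one case of each class, so the entropy is 1 bit. */
    printf("entropy = %f\n", leaf_entropy(1, 1));
    return 0;
}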
What does the C4.5 algorithm do in this type of situation? I've searched everywhere but couldn't find any information.
Thanks.
Read Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993. (It is a good resource for studying C4.5, especially if you have a college assignment.)
From what I studied, it seems that on page 137, in the source code listing of build.c, there is a comment saying, roughly, that if all cases are of the same class, or there are not enough cases to divide (like in your question), it will just return Node. That Node comes from this line:

Node = Leaf(ClassFreq, BestClass, Cases, Cases - NoBestClass);

So in this situation C4.5 does not keep splitting: the node becomes a leaf labelled with the most frequent (best) class, and the cases belonging to the other class are kept as the leaf's error count. All of this information is referenced from Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
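To make that idea concrete, here is a simplified sketch of what that leaf construction amounts to. This is my own paraphrase, not the actual code from the book: the structure and the function make_leaf are names I invented, and the class counts in main are made-up numbers, not the real distribution of your six contradictory instances.

#include <stdio.h>

#define NCLASSES 2  /* survival status: index 0 = survived, index 1 = died */

/* A simplified stand-in for a C4.5 leaf node. */
typedef struct {
    int best_class;           /* class label assigned to the leaf           */
    int cases;                /* number of training cases reaching the leaf */
    int errors;               /* cases whose class differs from best_class  */
    int class_freq[NCLASSES]; /* class distribution at the leaf             */
} LeafNode;

/* Build a leaf for a group of cases that cannot be split further:
   label it with the majority class and record the rest as errors. */
static LeafNode make_leaf(const int class_freq[NCLASSES], int cases)
{
    LeafNode leaf;
    int c, best = 0;

    for (c = 1; c < NCLASSES; c++) {
        if (class_freq[c] > class_freq[best]) {
            best = c;
        }
    }
    for (c = 0; c < NCLASSES; c++) {
        leaf.class_freq[c] = class_freq[c];
    }
    leaf.best_class = best;
    leaf.cases = cases;
    leaf.errors = cases - class_freq[best];  /* cf. Cases - NoBestClass */
    return leaf;
}

int main(void)
{
    /* Hypothetical group of 6 identical-attribute cases: 4 of one class,
       2 of the other. The leaf gets the majority class and 2 errors. */
    int freq[NCLASSES] = { 4, 2 };
    LeafNode leaf = make_leaf(freq, 6);

    printf("leaf class = %d, cases = %d, errors = %d\n",
           leaf.best_class, leaf.cases, leaf.errors);
    return 0;
}

So the contradictory instances do not break the tree; they just end up in a leaf whose entropy is not 0, and the minority cases are counted as training errors.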
If anyone knows more about this, please comment if I have gotten something wrong. Thanks!