C4.5 algorithm missing values

How does the C4.5 algorithm deal with missing values and with attributes whose values lie on a continuous interval? Also, how is a decision tree pruned? Could someone please explain with the help of an example?

Say we built a decision tree from the canonical example of whether one should play golf based on the weather conditions. We may have a training dataset like this:

OUTLOOK | TEMPERATURE | HUMIDITY | WINDY | PLAY
=====================================================
sunny   |      85     |    85    | false | Don't Play
sunny   |      80     |    90    | true  | Don't Play
overcast|      83     |    78    | false | Play
rain    |      70     |    96    | false | Play
rain    |      68     |    80    | false | Play
rain    |      65     |    70    | true  | Don't Play
overcast|      64     |    65    | true  | Play
sunny   |      72     |    95    | false | Don't Play
sunny   |      69     |    70    | false | Play
rain    |      75     |    80    | false | Play
sunny   |      75     |    70    | true  | Play
overcast|      72     |    90    | true  | Play
overcast|      81     |    75    | false | Play
rain    |      71     |    80    | true  | Don't Play

And use it to build a decision tree that may look something like this:

              Outlook
            /  |      \
  overcast /   |sunny  \rain
          /    |        \
     Play   Humidity   Windy
           /   |         |  \
          /    |         |   \
    <=75 /  >75|     true|    \false
        /      |         |     \
     Play  Don'tPlay Don'tPlay  Play
  1. The C4.5 algorithm deals with a missing value by returning the probability distribution of the labels found under the branches of the attribute whose value is missing (see the first sketch after this list). Suppose we had an instance in our test data whose outlook was sunny but which had no value for the Humidity attribute. Also, suppose our training data had 2 instances for which the outlook was sunny, Humidity was at most 75, and the label was Play, and 3 instances for which the outlook was sunny, Humidity was above 75, and the label was Don't Play. For the test instance with the missing Humidity value, the C4.5 algorithm would then return a probability distribution of [0.4, 0.6] corresponding to [Play, Don't Play].
  2. Assuming you already understand how decision trees use information gain over a set of features to choose which feature to branch on at each level, the C4.5 algorithm performs the same procedure on a continuous attribute by evaluating the information gain of every candidate split point and choosing the best one (see the second sketch after this list). An example of this can be seen in the Humidity attribute above: the C4.5 algorithm tested the information gain provided by splitting Humidity at 65, 70, 75, 78...90 and found that splitting at 75 provided the most information gain.
  3. C4.5 performs pruning by replacing a subtree of the decision tree with a single leaf node when that leaf either makes the same decisions as the whole subtree or yields no more estimated error than the subtree does (see the third sketch after this list).
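
To make point 1 concrete, here is a minimal Python sketch of classification with a missing attribute value, hard-coding the sunny subtree from the example above. The Leaf/Node classes and the classify function are my own illustrative structures, not C4.5's actual internals; the branch weights are the fractions of training instances that followed each branch.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Leaf:
    dist: Dict[str, float]                          # label -> probability

@dataclass
class Node:
    attr: str                                       # attribute tested at this node
    branches: List[Tuple[Callable, float, object]]  # (test, training weight, subtree)

def classify(node, instance):
    """Return a label distribution for `instance` (a dict of attribute values)."""
    if isinstance(node, Leaf):
        return node.dist
    value = instance.get(node.attr)
    if value is None:
        # Missing value: blend every branch's distribution, weighted by
        # how much of the training data took that branch.
        blended = {}
        for _test, weight, child in node.branches:
            for label, p in classify(child, instance).items():
                blended[label] = blended.get(label, 0.0) + weight * p
        return blended
    for test, _weight, child in node.branches:
        if test(value):
            return classify(child, instance)

# The sunny subtree from the tree above: 2 of the 5 sunny training rows had
# Humidity <= 75 (Play), the other 3 had Humidity > 75 (Don't Play).
sunny = Node("Humidity", [
    (lambda h: h <= 75, 2 / 5, Leaf({"Play": 1.0})),
    (lambda h: h > 75,  3 / 5, Leaf({"Don't Play": 1.0})),
])

print(classify(sunny, {"Outlook": "sunny"}))  # {'Play': 0.4, "Don't Play": 0.6}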
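
For point 2, here is a sketch of threshold selection on a continuous attribute using plain information gain. C4.5 proper ranks splits by the gain ratio and considers thresholds between adjacent sorted examples; this simplified version just tries each observed value.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, threshold):
    left  = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder

def best_split(values, labels):
    candidates = sorted(set(values))[:-1]  # splitting at the maximum is useless
    return max(candidates, key=lambda t: info_gain(values, labels, t))

# Humidity restricted to the five sunny rows of the training table:
humidity = [85, 90, 95, 70, 70]
play     = ["Don't Play", "Don't Play", "Don't Play", "Play", "Play"]
print(best_split(humidity, play))  # 70 -- on these rows, any threshold in
                                   # [70, 85), 75 included, separates the
                                   # two classes equally well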
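
And for point 3, a sketch of pruning by subtree replacement, reusing Leaf, Node, and classify from the first sketch. It collapses a subtree into a majority-label leaf whenever the leaf misclassifies no more of the rows reaching that node than the subtree does; real C4.5 compares pessimistic (upper-confidence) error estimates rather than raw training error, but the replacement step itself is the same.

from collections import Counter

def errors(node, rows):
    """Misclassifications made by `node` on (instance, label) pairs."""
    wrong = 0
    for instance, label in rows:
        dist = classify(node, instance)
        if max(dist, key=dist.get) != label:
            wrong += 1
    return wrong

def prune(node, rows):
    """Bottom-up subtree replacement on the rows that reach `node`."""
    if isinstance(node, Leaf) or not rows:
        return node
    # Prune the children first, passing each one the rows its test accepts
    # (this sketch assumes the training rows have no missing values).
    node.branches = [
        (test, weight, prune(child, [(i, l) for i, l in rows if test(i[node.attr])]))
        for test, weight, child in node.branches
    ]
    # Candidate replacement: a leaf predicting the majority label here.
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    leaf = Leaf({majority: 1.0})
    return leaf if errors(leaf, rows) <= errors(node, rows) else node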

For more information, I would suggest this excellent resource, which I used to write my own decision tree and random forest implementations: https://cis.temple.edu/~giorgio/cis587/readings/id3-c45.html