Confusion about the code for choosing "stumps" in the AdaBoost algorithm

This question refers to the following step in the classical procedure of AdaBoost classification. [Image: the step of the algorithm that chooses the weak classifier c_b.]
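As I understand it, this step chooses the weak classifier (stump) c_b that minimizes the weighted training error under the current weights w_i, roughly:

$$
c_b \;=\; \arg\min_{c} \; \frac{\sum_{i=1}^{n} w_i \,\mathbf{1}\{c(x_i) \neq y_i\}}{\sum_{i=1}^{n} w_i}.
$$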

Suppose that we assign a weight array W and generate training points x with labels y (taking only the values -1 and 1) as follows:

W = [0.05, 0.032732683535398856, 0.05, 0.05, 0.032732683535398856, 
0.05, 0.05, 0.05, 0.032732683535398856, 0.05, 
0.05, 0.05, 0.05, 0.05, 0.05, 
0.05, 0.05, 0.032732683535398856, 0.032732683535398856, 0.032732683535398856]

from sklearn.datasets import make_blobs
x,y = make_blobs(n_samples = 20, n_features = 5, centers = 2, cluster_std = 20.0, random_state = 100) 
y[y==0] = -1

Then my textbook uses the following code A to generate c_b.

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=1)
clf.fit(x, y, sample_weight = W)  # Here clf is the weak classifier c_b. 
training_pred = clf.predict(x)
print(training_pred)
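(To see what code A actually fits, one can inspect the split of the fitted stump; the snippet below is just a diagnostic sketch using scikit-learn's tree_ attributes, not part of the textbook code.)

# Diagnostic sketch (not from the textbook): show the split chosen by code A's stump.
# For a depth-1 tree, node 0 is the single split node.
print(clf.tree_.feature[0])    # index of the feature used for the split
print(clf.tree_.threshold[0])  # threshold value of the split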

However, the following code B, which I wrote based on the definition of c_b, gives a different result:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

error_rate = 100000  # start with a large value so the first stump is always kept

for k in range(5):
    # Fit a depth-1 tree (a stump) on feature k alone
    clf = DecisionTreeClassifier(max_depth=1)
    clf.fit(x[:, [k]], y)

    local_training_pred = clf.predict(x[:, [k]])

    # Weighted error rate of this stump under the weights W
    local_error_rate = 0
    for i in range(len(x)):
        if local_training_pred[i] != y[i]:
            local_error_rate += W[i] / np.sum(W)

    # Keep the stump with the lowest weighted error seen so far
    if local_error_rate < error_rate:
        error_rate = local_error_rate
        training_pred = local_training_pred

print(training_pred)

Here the code fits one stump per feature, compares their weighted error rates, selects the stump with the lowest error rate, and then computes the predictions of that stump on the training set x.
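(For what it's worth, the weighted error computed by the inner loop in code B can also be written in one vectorized line; W_arr below is just a helper name I introduce for the array version of W.)

W_arr = np.asarray(W)
# Same quantity as the inner loop above: weighted fraction of misclassified points
local_error_rate = np.sum(W_arr[local_training_pred != y]) / np.sum(W_arr)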

Nonetheless, codes A and B do not return the same predictions for our choice of W. Does anyone know the reason for this? Have I misunderstood the definition of a stump?
