Can scipy.stats.bootstrap be used to compute confidence intervals for feature weights in regression or classification tasks?


I am interested in computing confidence intervals for my feature weights using a bootstrap approach. Is scipy.stats.bootstrap able to do this? Consider this classification task as an example (the same idea applies to regression tasks). We can access clf.coef_, which returns a vector of feature weights.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)
coefficients = clf.coef_

The idea would be to draw samples (with replacement) n times from X and y, fit the classifier on each resample, collect the coefficients, and finally compute confidence intervals from the coefficients across all resampling trials.

Answer by Matt Haberland:

Yes, in the sense that bootstrap supports vector-valued statistics. For instance, this is valid code:

import numpy as np
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X = np.array([[-1.1, -1.1], [-2.2, -1.2], [-3.3, -2.3],
              [1.4, 1.4], [2.5, 1.5], [3.6, 2.6]])
y = np.array([1, 1, 1, 2, 2, 2])

def f(*samples):
    # each feature and the target is passed as a separate array,
    # so split them up again
    samples = np.asarray(samples)
    X = samples[:-1].T
    y = samples[-1]
    # confirm that observations stayed together/resamples make sense
    # print(X, y)
    clf = LinearDiscriminantAnalysis()
    clf.fit(X, y)
    return clf.coef_  # returning multiple values is OK
    
# pass the features and target as three separate samples
samples = (X[:, 0], X[:, 1], y)
res = stats.bootstrap(samples, statistic=f, paired=True)

LinearDiscriminantAnalysis seems to have trouble with some of the resamples, but you can see that the code is valid by replacing the clf lines with something like return X[:, 0].mean(), X[:, 1].var(); that is, computing bootstrap confidence intervals of the mean of the first feature and the variance of the second feature at the same time. Importantly, because paired=True, different features of the same observations stay paired, and of course the statistic can depend on the first and second features at the same time, as in your example.
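To illustrate, here is a runnable variant using that mean/variance statistic, showing how to read the resulting intervals from res.confidence_interval (its low and high attributes have one entry per value the statistic returns); the reduced n_resamples is just an illustrative choice to keep it quick:

```python
import numpy as np
from scipy import stats

X = np.array([[-1.1, -1.1], [-2.2, -1.2], [-3.3, -2.3],
              [1.4, 1.4], [2.5, 1.5], [3.6, 2.6]])
y = np.array([1, 1, 1, 2, 2, 2])

def f(*samples):
    # each feature and the target arrives as a separate 1-D array
    samples = np.asarray(samples)
    X = samples[:-1].T
    # a vector-valued statistic that never fails on resamples:
    # mean of the first feature and variance of the second
    return X[:, 0].mean(), X[:, 1].var()

samples = (X[:, 0], X[:, 1], y)
res = stats.bootstrap(samples, statistic=f, paired=True, n_resamples=999)

ci = res.confidence_interval
# ci.low and ci.high are arrays: element 0 bounds the mean of the
# first feature, element 1 bounds the variance of the second
```

The same access pattern applies when the statistic returns clf.coef_: each coefficient gets its own entry in ci.low and ci.high.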