Why does GridSearchCV on a multiclass SVM model run for almost a day when the model alone takes 1 minute?

When I run my multiclass SVM model without GridSearchCV, it takes about 1 minute. I only have 3 classes and 24 samples per class. When I add GridSearchCV to improve accuracy, it has now been running for almost a day. I don't think it is supposed to behave like this. It doesn't show any error; it just runs for too long. Is this normal?

This is my code

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd  # needed for pd.read_csv and pd.DataFrame below
from sklearn import svm
import os

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, confusion_matrix  # accuracy_score is used near the end
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve,roc_auc_score,auc
from sklearn.pipeline import Pipeline

data_list = []
df_list = []
dirname = r"C:\Users\punnut\Downloads\final_dataset"
# AD=0 MCI=1 CN=2

for sub_root, sub_dirs, sub_files in os.walk(dirname):
    for tfile in sub_files:
        if tfile.startswith('AD+MCI+CN_1'):
            data = os.path.join(sub_root, tfile)
            #print(data)
            df = pd.read_csv(data)
            df_list.append(df)
            print(df)
            #print(df.shape)
            # print(df.head())
            col_names = df.columns
            # print(col_names)
            info = df.info()
            # print(info)
            miss = df.isnull().sum()
            print('miss ',miss)
            print(df['Class'].value_counts())
            # Class proportions; np.cfloat was a complex type and is removed in NumPy 2.0
            print(df['Class'].value_counts() / float(len(df)))
            print(round(df.describe(), 2))
            # AD=0 MCI=1 CN=2

import warnings
warnings.filterwarnings('ignore')

X = df.drop(['Class'], axis=1)
y = df['Class'] # AD=0 MCI=1 CN=2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train.shape, X_test.shape

cols = X_train.columns
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Pass cols directly; wrapping it in a list creates MultiIndex columns
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
X_train.describe()

print(X_train)
print(X_test)

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeClassifier

# FEATURE_SELECTION

selector = SelectFromModel(estimator=RidgeClassifier()).fit(X_train, y_train)

print(X_train.shape)

X_train_2 = selector.transform(X_train)
X_test_2 = selector.transform(X_test)


print("step_1111111")
params = {'C':[0.001,0.01,0.1, 1, 10, 100, 1000], 'kernel':['linear','rbf','poly','sigmoid'],'gamma': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    ,'degree':[0,1,2,3,4,5,6,7,8,9],'decision_function_shape':['ovo','ovr'],'verbose':[1]}
grid_search = GridSearchCV(SVC(), params, cv=4,error_score="raise",verbose=1).fit(X_train, y_train)
print(grid_search.best_params_,"\n")
print("step_2222222")
clf = SVC(**grid_search.best_params_).fit(X_train, y_train)
print("step_3333333")

pipeline = Pipeline([('feature_sele',selector),('clf_cv',clf)]).fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_pred2 = pipeline.decision_function(X_test)
print("step_4444444")

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,index = ['0','1','2'],columns = ['0','1','2'])

# Rows of cm are true classes, columns are predicted classes.
# For each class: TP = diagonal cell, FN = rest of its row,
# FP = rest of its column, TN = every cell outside its row and column.
AD_TP = cm_df.iloc[0, 0]
AD_FN = cm_df.iloc[0, 1] + cm_df.iloc[0, 2]
AD_FP = cm_df.iloc[1, 0] + cm_df.iloc[2, 0]
AD_TN = cm_df.iloc[1, 1] + cm_df.iloc[1, 2] + cm_df.iloc[2, 1] + cm_df.iloc[2, 2]

MCI_TP = cm_df.iloc[1, 1]
MCI_FN = cm_df.iloc[1, 0] + cm_df.iloc[1, 2]
MCI_FP = cm_df.iloc[0, 1] + cm_df.iloc[2, 1]
MCI_TN = cm_df.iloc[0, 0] + cm_df.iloc[0, 2] + cm_df.iloc[2, 0] + cm_df.iloc[2, 2]

CN_TP = cm_df.iloc[2, 2]
CN_FN = cm_df.iloc[2, 0] + cm_df.iloc[2, 1]
CN_FP = cm_df.iloc[0, 2] + cm_df.iloc[1, 2]
CN_TN = cm_df.iloc[0, 0] + cm_df.iloc[0, 1] + cm_df.iloc[1, 0] + cm_df.iloc[1, 1]

print('\nAD_True Positives(TP) = ', AD_TP)
print('AD_True Negatives(TN) = ', AD_TN)
print('AD_False Positives(FP) = ', AD_FP)
print('AD_False Negatives(FN) = ', AD_FN,'\n')

print('MCI_True Positives(TP) = ', MCI_TP)
print('MCI_True Negatives(TN) = ', MCI_TN)
print('MCI_False Positives(FP) = ', MCI_FP)
print('MCI_False Negatives(FN) = ', MCI_FN,'\n')

print('CN_True Positives(TP) = ', CN_TP)
print('CN_True Negatives(TN) = ', CN_TN)
print('CN_False Positives(FP) = ', CN_FP)
print('CN_False Negatives(FN) = ', CN_FN,'\n')
print(classification_report(y_test, y_pred))
print('Confusion matrix\n', cm_df)

AD_classification_accuracy = (AD_TP + AD_TN) / float(AD_TP + AD_TN + AD_FP + AD_FN)
print('AD Confusion Matrix Classification accuracy : {0:0.4f}'.format(AD_classification_accuracy))
MCI_classification_accuracy = (MCI_TP + MCI_TN) / float(MCI_TP + MCI_TN + MCI_FP + MCI_FN)
print('MCI Confusion Matrix Classification accuracy : {0:0.4f}'.format(MCI_classification_accuracy))
CN_classification_accuracy = (CN_TP + CN_TN) / float(CN_TP + CN_TN + CN_FP + CN_FN)
print('CN Confusion Matrix Classification accuracy : {0:0.4f}'.format(CN_classification_accuracy))
print('Model accuracy score with default hyperparameters: {0:0.4f}\n'. format(accuracy_score(y_test, y_pred)))

from itertools import cycle
fpr = dict()
tpr = dict()
roc_auc = dict()
y_test_dummies = pd.get_dummies(y_test, drop_first=False).values
for i in range(3):
    # Column i of the dummies is 1 for class i, so the positive label is the default (1), not 0
    fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_pred2[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(3), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Roc Curve AD=0 MCI=1 CN=2')
plt.legend(loc="lower right")
plt.show()

When I use verbose=1, the GridSearchCV step shows "Fitting 4 folds for each of 6160 candidates, totalling 24640 fits", and when I take GridSearchCV out, the script runs in one minute. How do I fix the GridSearchCV step, or do you have a better solution?

There are 1 best solutions below

Muhammed Yunus On

I think the main issue is that the grid search has to go through a very large number of combinations: 6160 candidates times 4 folds is 24,640 separate SVC fits.
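As a quick check of that arithmetic, here is a small sketch that counts the candidates in the grid from the question with sklearn's ParameterGrid, without fitting anything. The cv=4 value is taken from the question.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the question, copied as-is.
params = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "gamma": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "degree": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "decision_function_shape": ["ovo", "ovr"],
    "verbose": [1],
}

# ParameterGrid only enumerates combinations; no model is trained here.
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 4  # cv=4 folds per candidate

print(n_candidates, n_fits)
```

The count is the product of the list lengths (7 x 4 x 11 x 10 x 2 x 1), so every extra value in any list multiplies the total work.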

In general, use RandomizedSearchCV for tuning. It samples from the same hyperparameter space as a grid search, but it only fits a fixed number of candidates (n_iter), so it takes a lot less time. Grid search is usually used when you are exploring a very limited set of important parameters.
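A minimal sketch of what that could look like. The data here is synthetic (make_classification with 72 samples and 3 classes, chosen only to mimic the size described in the question), and n_iter=20 is an arbitrary starting budget, not a recommendation.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 3 classes, 72 samples in total.
X, y = make_classification(
    n_samples=72, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Small starting space: two kernels and a continuous range for C.
param_distributions = {
    "C": loguniform(0.1, 10),
    "kernel": ["linear", "rbf"],
}

search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=20,       # 20 candidates x 4 folds = 80 fits in total
    cv=4,
    random_state=0,  # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The total cost is fixed by n_iter and cv, regardless of how wide the distributions are, which is what keeps the run time predictable.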

Start with a smaller parameter space. When you start, if a search takes more than a minute or a couple of minutes, think about stopping it and finding a faster way to iterate on your model. Some tips based on your search space:

  • Get rid of decision_function_shape - it doesn't affect the SVC accuracy at all. It only changes the shape of the decision_function output. When you keep it in there, you double the number of fits to test a setting that has no impact on the fitted SVC.
  • degree only affects the poly SVC and is ignored by every other kernel.
  • Start off without degree. That will make the poly SVC default to a degree of 3. If you find that the poly SVC does well, run a separate search that looks at different values of degree - but start off small like degree=[1, 2, 3].
  • Larger values of C can make the fit take longer. Start off with a smaller range like params={"C": scipy.stats.loguniform(0.1, 10), ...}.
  • Start off with the default value of gamma. You can expand the search later based on your findings.
  • Increasing tol= is useful in some cases - increasing it means you are willing to accept a more approximate SVC for a shorter optimisation time.
  • With a randomized search, it's better to use the appropriate sampling distribution rather than a list of numbers. Some examples: use scipy.stats.uniform(0, 1) to uniformly sample between 0.0 and 1.0. Use scipy.stats.loguniform(0.01, 100) to uniformly sample across orders of magnitude. Use scipy.stats.randint(-2, 4) to randomly sample integers from -2 to 3 (the upper bound is exclusive). Use scipy.stats.expon(scale=0.4) if you mainly want small positive values with a mean of 0.4; note that the first positional argument of expon is loc, so expon(0.4) would instead shift the distribution to start at 0.4.
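The tips above can be combined into a search space given as a list of dicts, so that degree is only sampled when the kernel is poly. This is a sketch under the assumption of a reasonably recent scikit-learn (list-of-dict distributions are accepted by ParameterSampler and RandomizedSearchCV); the ranges are illustrative. ParameterSampler is used here only to preview what would be sampled, without fitting anything.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# One dict per kernel family; gamma is left at its default in all of them.
param_distributions = [
    {"kernel": ["linear", "rbf", "sigmoid"], "C": loguniform(0.1, 10)},
    # degree is only meaningful for the poly kernel; randint(1, 4) yields 1, 2 or 3.
    {"kernel": ["poly"], "C": loguniform(0.1, 10), "degree": randint(1, 4)},
]

# Draw 10 candidate settings, as RandomizedSearchCV(n_iter=10) would.
samples = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))

for candidate in samples:
    print(candidate)
```

The same param_distributions list can be passed directly to RandomizedSearchCV in place of a single dict.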

It sounds like you have relatively few data points per class (24). This makes the SVC very prone to overfitting. It may help to search at smaller values of C, like {"C": scipy.stats.loguniform(1e-3, 1), ...}.
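One way to see whether overfitting is happening is to compare the training score with the cross-validated score for the selected candidate. The sketch below assumes synthetic data of a similar size to the question's, an rbf kernel, and a scaler placed inside a pipeline so that it is refit on each training fold; all of these are illustrative choices rather than the asker's setup.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 3 classes, 72 samples in total.
X, y = make_classification(
    n_samples=72, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# make_pipeline names the SVC step "svc", hence the "svc__C" key below.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

search = RandomizedSearchCV(
    model,
    {"svc__C": loguniform(1e-3, 1)},  # small C means stronger regularisation
    n_iter=15,
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    return_train_score=True,          # keep training scores for comparison
    random_state=0,
)
search.fit(X, y)

best = search.best_index_
train_score = search.cv_results_["mean_train_score"][best]
cv_score = search.cv_results_["mean_test_score"][best]
print(f"C={search.best_params_['svc__C']:.4f} train={train_score:.3f} cv={cv_score:.3f}")
```

A training score far above the cross-validated score suggests the model is fitting noise, in which case an even smaller C or fewer features may be worth trying.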