Isolation Forest getting accuracy score 0.0


Edit: please share comments as I'm learning to post good questions

I'm trying to train a model on this dataset with IsolationForest(). I then need to apply the trained model to another dataset with altered qualities, predict the quality values, and fetch all wines with quality 8 and 9.

However, I'm having a problem: the accuracy score in the classification report is 0.0:

print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       0.0
           3       0.00      0.00      0.00     866.0
           4       0.00      0.00      0.00     829.0
           5       0.00      0.00      0.00     841.0
           6       0.00      0.00      0.00     861.0
           7       0.00      0.00      0.00     822.0
           8       0.00      0.00      0.00     886.0
           9       0.00      0.00      0.00     851.0

    accuracy                           0.00    5956.0
   macro avg       0.00      0.00      0.00    5956.0
weighted avg       0.00      0.00      0.00    5956.0

I don't know whether it's a hyperparameter issue, whether I'm cleaning the data incorrectly, or whether I'm passing the wrong parameters. I have already tried with and without SMOTE. I want to reach an accuracy of at least 90%.

I'll leave the shared drive link public for dataset verification:

https://drive.google.com/drive/folders/18_sOSIZZw9DCW7ftEKuOG4aIzGXoasFe?usp=sharing

Here's my code:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report,confusion_matrix

df = pd.read_csv('wines.csv')

df.head(5)

ordinalEncoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-99).fit(df[['color']])
df[['color']] = ordinalEncoder.transform(df[['color']])

df.info()

df['color'] = df['color'].astype(int)

df.head(3)

stm = SMOTE(k_neighbors=4)
x_smote = df.drop('quality',axis=1)
y_smote = df['quality']
x_smote,y_smote = stm.fit_resample(x_smote,y_smote)

print(x_smote.shape,y_smote.shape)

x_smote.columns

scaler = StandardScaler()
X = scaler.fit_transform(x_smote)
y = y_smote

X.shape, y.shape

x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

from sklearn.ensemble import IsolationForest
from sklearn.metrics import hamming_loss

iforest = IsolationForest(n_estimators=200, max_samples=0.1, contamination=0.10, max_features=1.0, bootstrap=False, n_jobs=-1, 
                            random_state=None, verbose=0, warm_start=False)

iforest_fit = iforest.fit(x_train,y_train)

prediction = iforest_fit.predict(x_test)

print (prediction.shape, y_test.shape)

y.value_counts()

prediction

print(confusion_matrix(y_test, prediction))
hamming_loss(y_test, prediction)

from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))

Answered by blackraven

May I ask why you chose Isolation Forest as your model? This article says that Isolation Forest is an unsupervised learning algorithm for anomaly detection.

When I print some samples of the Isolation Forest predictions next to the actual labels, I get the following. It shows why the accuracy score is 0.0: the model only outputs 1 (inlier) or -1 (outlier), never a quality value.

print(list(prediction[0:15]))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

print(list(y_test[0:15]))
[9, 4, 4, 7, 9, 3, 6, 7, 4, 8, 8, 7, 3, 8, 5]
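The same effect can be reproduced without wines.csv. This is a minimal sketch on synthetic data (the feature matrix and labels below are made up for illustration): IsolationForest.predict() returns only 1 or -1, so no prediction can ever equal a quality label between 3 and 9.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the wine features and quality labels
# (assumption: the real data is not needed to show the effect).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = rng.integers(3, 10, size=500)   # quality labels 3..9

# fit() accepts y only for API consistency and ignores it.
iso = IsolationForest(n_estimators=50, contamination=0.10, random_state=0)
iso.fit(X_demo, y_demo)
pred_demo = iso.predict(X_demo)

print(np.unique(pred_demo))               # contains only -1 and/or 1
print(accuracy_score(y_demo, pred_demo))  # no prediction can match a label
```

This also explains the extra -1 and 1 rows with support 0.0 in the original classification report: they are classes that appear only in the predictions.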

The wines.csv dataset and your code both point to a multi-class classification problem, so I have used RandomForestClassifier() in place of the second part of your code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss

model = RandomForestClassifier()
model.fit(x_train,y_train)
prediction = model.predict(x_test)

print(prediction[0:15])    #see 15 samples of prediction
[3, 9, 5, 5, 7, 9, 7, 6, 9, 8, 5, 9, 8, 3, 3]

print(list(y_test[0:15]))    #see 15 samples of actual truth
[3, 9, 5, 6, 6, 9, 7, 5, 9, 8, 5, 9, 8, 3, 3]

print(confusion_matrix(y_test, prediction))
[[842   0   0   0   0   0   0]
 [  2 815  17   8   1   1   0]
 [  8  50 690 130  26   2   0]
 [  2  28 152 531 128  16   0]
 [  4   1  15  66 716  32   3]
 [  0   1   0   4  12 833   0]
 [  0   0   0   0   0   0 820]]

print('hamming_loss =', hamming_loss(y_test, prediction))
hamming_loss = 0.11903962390866353

print(classification_report(y_test, prediction))
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       842
           4       0.91      0.97      0.94       844
           5       0.79      0.76      0.78       906
           6       0.72      0.62      0.67       857
           7       0.81      0.86      0.83       837
           8       0.94      0.98      0.96       850
           9       1.00      1.00      1.00       820

    accuracy                           0.88      5956
   macro avg       0.88      0.88      0.88      5956
weighted avg       0.88      0.88      0.88      5956
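For single-label multi-class predictions like these, the Hamming loss is simply the fraction of misclassified samples, that is, 1 minus the accuracy. As a quick check, both numbers can be recomputed from the confusion matrix printed above (the variable names here are only for illustration):

```python
import numpy as np

# Confusion matrix copied from the RandomForestClassifier output above
# (rows = actual quality 3..9, columns = predicted quality 3..9).
cm = np.array([
    [842,   0,   0,   0,   0,   0,   0],
    [  2, 815,  17,   8,   1,   1,   0],
    [  8,  50, 690, 130,  26,   2,   0],
    [  2,  28, 152, 531, 128,  16,   0],
    [  4,   1,  15,  66, 716,  32,   3],
    [  0,   1,   0,   4,  12, 833,   0],
    [  0,   0,   0,   0,   0,   0, 820],
])

total = cm.sum()              # number of test samples
correct = np.trace(cm)        # diagonal = correctly classified samples
accuracy = correct / total
hamming = (total - correct) / total   # misclassified fraction = 1 - accuracy

print('accuracy =', accuracy)
print('hamming_loss =', hamming)
```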

The accuracy is already 0.88 even before tuning hyperparameters.
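To get to the stated goal of fetching the wines with quality 8 and 9 from a second dataset, one possible approach is to predict on that dataset and filter the rows by predicted label. The sketch below uses made-up data: the column names alcohol and pH, the frames train_df and new_df, and the column predicted_quality are all hypothetical stand-ins, not taken from wines.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training frame: two made-up features plus a quality label.
train_df = pd.DataFrame({
    "alcohol": rng.normal(10, 1, 300),
    "pH": rng.normal(3.2, 0.2, 300),
})
train_df["quality"] = rng.integers(3, 10, size=300)

# Hypothetical second dataset with the same feature columns and no label.
new_df = pd.DataFrame({
    "alcohol": rng.normal(10, 1, 50),
    "pH": rng.normal(3.2, 0.2, 50),
})

features = ["alcohol", "pH"]
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(train_df[features], train_df["quality"])

# Predict quality for the new wines and keep only those predicted 8 or 9.
new_df["predicted_quality"] = clf.predict(new_df[features])
top_wines = new_df[new_df["predicted_quality"].isin([8, 9])]
print(top_wines.shape)
```

With the real data, any encoder or scaler fitted on the training set would need to be applied to the second dataset with transform() (not refitted) before calling predict().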