Why is my multi-label text classification accuracy so low?


So I'm using this dataset: https://www.kaggle.com/datasets/madhavmalhotra/journal-entries-with-labelled-emotions

And this video as guidance: https://www.youtube.com/watch?v=YyOuDi-zSiI&t=1077s

I converted the label values from True to 1 and False to 0, and removed label combinations with fewer than 30 instances (a sketch of this preprocessing follows the counts below). Now I only have text for these classes:

happy                        182
satisfied                    133
calm                          99
calm, happy, satisfied        77
happy, satisfied              73
proud                         62
happy, proud, satisfied       54
excited, happy, satisfied     46
calm, satisfied               42
calm, happy                   41
excited, happy, proud         37
proud, satisfied              33
frustrated                    32
excited, happy                31
excited                       31
Name: Emotions Felt, dtype: int64
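
For reference, a minimal sketch of that preprocessing step, assuming pandas. The filename and the per-emotion boolean column names are my guesses about the dataset schema, not something confirmed in the question:

import pandas as pd

df = pd.read_csv("journal_entries.csv")  # hypothetical filename

# The per-emotion columns are assumed to hold booleans; map True/False to 1/0.
emotion_cols = ["calm", "excited", "frustrated", "happy", "proud", "satisfied"]
df[emotion_cols] = df[emotion_cols].astype(int)

# Build one label-combination string per row, then drop combinations
# with fewer than 30 rows, which yields counts like those shown above.
df["Emotions Felt"] = df[emotion_cols].apply(
    lambda row: ", ".join(c for c in emotion_cols if row[c] == 1), axis=1
)
counts = df["Emotions Felt"].value_counts()
df = df[df["Emotions Felt"].isin(counts[counts >= 30].index)].reset_index(drop=True)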

I'm using this helper to swap between base models and multi-label problem-transformation methods:

from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.naive_bayes import MultinomialNB
from skmultilearn.problem_transform import ClassifierChain  # assuming scikit-multilearn, as in the video

def build_model(model, mlb_estimator, xtrain, ytrain, xtest, ytest):
    # Wrap the base model in a problem-transformation method
    # (e.g. BinaryRelevance, ClassifierChain, LabelPowerset).
    clf = mlb_estimator(model)
    clf.fit(xtrain, ytrain)
    clf_predictions = clf.predict(xtest)
    acc = accuracy_score(ytest, clf_predictions)  # strict subset accuracy
    ham = hamming_loss(ytest, clf_predictions)    # bug fix: was y_test (a global), should be ytest
    result = {"accuracy": acc, "hamming_score": ham}  # "hamming_score" actually holds the Hamming loss
    return result

clf_chain_model = build_model(MultinomialNB(),ClassifierChain,X_train,y_train,X_test,y_test)
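
(For context, X_train, y_train, X_test and y_test above come from a train/test split of vectorized text. A minimal sketch of that step, assuming TF-IDF features as in the video and the df/emotion_cols from the preprocessing sketch; the "Answer" text column is another guess about the schema:)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(df["Answer"])  # journal text column (assumed name)
y = df[emotion_cols].values            # multi-label 0/1 indicator matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)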

Running the ClassifierChain model, I got:

{'accuracy': 0.1815068493150685, 'hamming_score': 0.2054794520547945}

So my questions are:

  1. Why is my accuracy so low?

  2. How can I get higher accuracy?

I have tried swapping the base model among LogisticRegression, KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, MultinomialNB, and RandomForestClassifier. I also swapped the problem-transformation method among BinaryRelevance, ClassifierChain, and LabelPowerset for each model (a sketch of that sweep is below). I have not tried neural network models or BERT yet.
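
For reference, the sweep can be scripted with the build_model helper above. GaussianNB is omitted in this sketch because it needs dense input while the TF-IDF matrix is sparse; the other models and methods are taken from the lists in the question:

from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain, LabelPowerset

models = [LogisticRegression(max_iter=1000), KNeighborsClassifier(),
          DecisionTreeClassifier(), MultinomialNB(), RandomForestClassifier()]
methods = [BinaryRelevance, ClassifierChain, LabelPowerset]

# Fit every model/method combination and print both metrics.
for model, method in product(models, methods):
    scores = build_model(model, method, X_train, y_train, X_test, y_test)
    print(type(model).__name__, method.__name__, scores)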


1 Answer

Answered by Marcello Zago:

Some of the methods you describe have hyperparameters that can change model performance significantly. For KNeighborsClassifier, the number of neighbors (n_neighbors, the "k" in k-NN) is especially important. Usually one performs some kind of hyperparameter optimisation, validated with k-fold cross-validation, to find a good parameter set for your data.

You can use scikit-learn's GridSearchCV for this; its documentation also includes a worked example with a support-vector machine.
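
For instance, a minimal sketch tuning n_neighbors for the KNeighborsClassifier from the question. KNeighborsClassifier handles a multi-label 0/1 indicator y natively, and "accuracy" here is the same strict subset accuracy as accuracy_score; X_train/y_train are from the question's setup:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": [3, 5, 7, 9, 11]}

# 5-fold cross-validated grid search over the number of neighbors.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)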