I am currently doing my final year project and I need your opinion. My dataset consists of 4 classes:
Mild demented - 896 images
Moderate demented - 64 images
Non demented - 3200 images
Very Mild demented - 2240 images
As you can see, my Moderate demented and Mild demented classes are highly imbalanced. Therefore, I am currently exploring what to do about imbalanced data. I am considering data augmentation or SMOTE to oversample the minority classes. However, I found that data augmentation should be done on the training set only. In my case, I want to rebalance my data before splitting it, to ensure the classes are balanced. What should I do? Can anyone help me?
I have tried data augmentation after splitting, applied to the training set only (see the sketch below). However, my supervisor advised that maybe I should use SMOTE to oversample the images instead.
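For reference, this is roughly what I tried: a minimal sketch assuming Keras's ImageDataGenerator and a stratified split. The array shapes, labels, and augmentation settings are placeholders, not my real pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder data standing in for the MRI images (shapes and labels are examples only)
X = np.random.rand(640, 128, 128, 3).astype("float32")
y = np.random.randint(0, 4, size=640)

# Split first, with stratification so every class appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Augmentation is applied to the training set only; the test set stays untouched
train_datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)
train_gen = train_datagen.flow(X_train, y_train, batch_size=32)
```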
The issue of data imbalance frequently arises in "long-tail learning," which focuses on addressing datasets with a long-tail distribution.
There are several methods available to handle the data imbalance problem. One of the simplest and most effective is cost-sensitive learning, which weights each class's importance according to the number of samples it has.
For instance, your dataset contains 6,400 images in total and the "Moderate demented" class has 64 of them, so its class importance weight is 6,400/64 = 100. The "Non demented" class has 3,200 images, so its weight is 6,400/3,200 = 2.
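As a rough sketch, here is how those weights could be computed and passed to a Keras model via the `class_weight` argument of `fit`. The model architecture and the dummy training arrays are placeholders; only the weighting logic matters here.

```python
import numpy as np
import tensorflow as tf

# Per-class sample counts from the question
counts = {0: 896, 1: 64, 2: 3200, 3: 2240}   # Mild, Moderate, Non, Very Mild
total = sum(counts.values())                  # 6,400 images in total

# Class importance weight = total / per-class count
class_weight = {cls: total / n for cls, n in counts.items()}
# -> {0: 7.14..., 1: 100.0, 2: 2.0, 3: 2.86...}

# Placeholder model; any classifier with a 4-way softmax output works the same way
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 128, 3)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data standing in for the real images
X_train = np.random.rand(64, 128, 128, 3).astype("float32")
y_train = np.random.randint(0, 4, size=64)

# The class_weight dict scales each sample's loss by its class's weight,
# so errors on "Moderate demented" count 50x more than errors on "Non demented"
model.fit(X_train, y_train, epochs=1, class_weight=class_weight)
```

With this approach you keep the original data split and simply let the loss function pay more attention to the rare classes, so no synthetic images are needed.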
Reference: https://samer-baslan.medium.com/an-introduction-to-deep-long-tailed-learning-414881a2519