Random Forest Classifier: Removing Features Using the Top-N Features Method


I am trying to predict the winner of an NBA game using a random forest classifier. I have sought to remove and modify my list of features so that I can increase accuracy and decrease noise.

I implemented the solution found here: https://datascience.stackexchange.com/questions/57697/decision-trees-should-we-discard-low-importance-features, looping over the top-N most important features and plotting the resulting accuracy for each N. After running all my features through that loop, I'm left with a plot that looks like this: [plot: model accuracy vs. number of top-N features retained]
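In outline, the top-N loop looks something like this (a sketch with synthetic data, since my actual dataset isn't shown; the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; in the real problem X would be the NBA game features
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Rank features by importance from a forest fit on all features
base = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
order = np.argsort(base.feature_importances_)[::-1]

# Cross-validated accuracy using only the top-N features, for each N
scores = []
for n in range(1, X.shape[1] + 1):
    cols = order[:n]
    score = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=0),
        X[:, cols], y, cv=5).mean()
    scores.append(score)
# scores[n-1] is the accuracy with the n most important features retained
```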

As you can see, the resulting graph is all over the place. Should I remove the features whose segments have a negative slope? What threshold should I use for removing features? Is there a better way to measure noise? How do I get the most accurate model, given that I have so many features with such a variable impact on training accuracy?


2 Answers

Notepad On

In ML/DL, some features improve model accuracy and performance while others hurt it. Features can also be correlated with one another.

sklearn's random forest provides many hyperparameters, such as max_depth, max_features, and max_leaf_nodes.

So you can use grid search in sklearn, which tunes the random forest's hyperparameters. If you search for the best hyperparameters for your model, it should perform better than before.
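A minimal sketch of that tuning with sklearn's GridSearchCV, shown on synthetic data (the grid values here are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data for the example
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate values for the hyperparameters mentioned above
param_grid = {
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", 0.5],
    "max_leaf_nodes": [10, 50, None],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid, cv=3)
search.fit(X, y)

# Best hyperparameters and their cross-validated accuracy
print(search.best_params_, search.best_score_)
```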

khubull On

As a starting point, you could try some feature selection techniques that are easier to interpret. This is what I would try, based on the small subset of techniques I am familiar and comfortable with:

If you have continuous variables, plot a correlation matrix and remove highly correlated features to eliminate multicollinearity. If your features are categorical, you could try ANOVA. If you have a large number of features, a small sample size, and nonlinear relationships between features, you could investigate dimensionality reduction techniques like PCA.
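A minimal sketch of the correlation-matrix pruning step, assuming a pandas DataFrame of continuous features (the 0.9 threshold and column names are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in data: four independent features plus one near-duplicate of "a"
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
df["e"] = df["a"] * 0.95 + rng.normal(scale=0.05, size=100)

# Absolute pairwise correlations; keep only the upper triangle so each
# pair is considered once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
```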