Can I prevent the J48 classifier from splitting on the same field more than x times?


Using a dataset, Weka, and the J48 classifier, I got the following tree: [image: J48 tree]

It splits on 'NumTweets' many times on the right side. Can I prevent J48 from making more than a specified number of splits on one field? This looks like overfitting on that specific field. Ideally I'd want it to reuse the same field at most 3-4 times within a branch. Is there any way to do this?

Thanks in advance!


There are 2 best solutions below

0
Percolator On BEST ANSWER

To answer your first question: No, the Weka Explorer does not offer a limit on the number of splits per attribute. That would have to be implemented manually in code.

With that said, there are several things you can try here to limit the tree size/reduce overfitting.

  1. You could try REPTree instead of J48. It is a similar entropy-based tree learner (it splits on information gain, whereas J48 uses gain ratio), but it uses reduced-error pruning and has an option to limit the depth of the tree (maxDepth, -L).

  2. Decreasing the J48 pruning confidence (confidenceFactor, the -C parameter, default 0.25) results in more pruning and thus a smaller tree.

  3. You can increase the minNumObj parameter (-M, the minimum number of instances reaching each leaf). Larger values rule out splits that isolate only a handful of instances.
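As for the manual route mentioned at the start of this answer: the core of it would be a per-branch counter that checks how often an attribute has already been used on the path from the root before allowing another split on it. The following is a minimal plain-Java sketch of that check only. It is not part of Weka's API; the class name `SplitBudget`, the method `mayUse`, and the attribute name `Followers` are hypothetical, and wiring such a check into J48's split selection would mean modifying or subclassing the tree-building code yourself.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT a Weka class: illustrates a per-branch split limit.
public class SplitBudget {

    // Returns true if `attribute` appears fewer than `maxUsesPerBranch` times
    // among the attributes already split on along the current root-to-node path.
    public static boolean mayUse(List<String> pathAttributes, String attribute, int maxUsesPerBranch) {
        long used = pathAttributes.stream().filter(attribute::equals).count();
        return used < maxUsesPerBranch;
    }

    public static void main(String[] args) {
        // Attributes split on so far in one branch ("Followers" is a made-up example).
        List<String> path = Arrays.asList("NumTweets", "Followers", "NumTweets", "NumTweets");
        System.out.println(mayUse(path, "NumTweets", 3)); // false: already used 3 times
        System.out.println(mayUse(path, "Followers", 3)); // true: used only once
    }
}
```

A tree learner applying this rule would simply skip any candidate attribute for which `mayUse` returns false when choosing the best split at a node.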

1
knb On

No, but you could set J48's minNumObj parameter higher (the default value is 2). It constrains the minimum number of instances that each leaf node has to contain.

This way (by trial and error) you can balance and/or simplify the decision tree to some extent.
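To see why a larger minimum leaf size removes fine-grained splits on a numeric attribute, here is a simplified plain-Java illustration. It is a sketch of the general idea only, under the assumption of a binary split at a threshold; it is not Weka's implementation (J48's actual check is more involved), and the class name `MinLeafCheck`, the method `splitAllowed`, and the sample values are hypothetical.

```java
// Hypothetical illustration, NOT Weka code: a minimum-leaf-size check
// for a binary split "value <= threshold" versus "value > threshold".
public class MinLeafCheck {

    // Returns true only if both sides of the split receive at least minNumObj values.
    public static boolean splitAllowed(double[] values, double threshold, int minNumObj) {
        int left = 0;
        for (double v : values) {
            if (v <= threshold) {
                left++;
            }
        }
        int right = values.length - left;
        return left >= minNumObj && right >= minNumObj;
    }

    public static void main(String[] args) {
        double[] numTweets = {1, 2, 3, 50, 60, 70, 80, 900}; // made-up sample values
        System.out.println(splitAllowed(numTweets, 100, 2)); // false: only one value above 100
        System.out.println(splitAllowed(numTweets, 10, 2));  // true: 3 values below, 5 above
    }
}
```

The higher the minimum, the fewer thresholds qualify, which is why raising minNumObj tends to cut off long chains of splits that peel off a few instances at a time.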

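You could also drop or ignore the troublesome attribute. Discretizing NumTweets into bins (e.g. <1 tweet/day, <10 tweets/day, 10 or more tweets/day) may help as well: once the attribute is nominal, J48 by default makes a single multi-way split on it, so it should not be reused further down the same branch. This can be done with the Discretize filter on the Preprocess tab.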
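The binning idea can be sketched in plain Java as follows. This only illustrates the mapping from a numeric tweets-per-day value to a category; it is not the Weka filter itself. The class name `TweetBins`, the labels `low`/`medium`/`high`, and the decision to put exactly 10 tweets/day into the top bin are assumptions made for the example.

```java
// Hypothetical helper, NOT a Weka class: maps tweets per day to one of three bins.
public class TweetBins {

    // Bin boundaries follow the example above; where exactly 10 belongs is an assumption.
    public static String bin(double tweetsPerDay) {
        if (tweetsPerDay < 1.0) {
            return "low";      // fewer than 1 tweet/day
        }
        if (tweetsPerDay < 10.0) {
            return "medium";   // at least 1 but fewer than 10 tweets/day
        }
        return "high";         // 10 or more tweets/day
    }

    public static void main(String[] args) {
        System.out.println(bin(0.5));  // low
        System.out.println(bin(4.0));  // medium
        System.out.println(bin(25.0)); // high
    }
}
```

With hand-picked boundaries like these, you would write the binned value into a new nominal attribute before loading the data into Weka.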