I have a data set with multiple features. One of the features can take 10 possible discrete values. When generating a regression tree with sklearn, how can I get the tree to split a node on one of those discrete values rather than on a continuous range? For example, suppose the feature X can take the values 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. Currently, the graph of the generated regression tree shows a split at X < 0.25. Is it possible to modify my code so that a split can only be made on the discrete values above?
I thought turning the numerical data into categorical data would help the tree split discretely, but apparently sklearn's trees cannot use categorical data directly.
Thank you for reading this question.
This SO question has some answers that look useful: sklearn tree treats categorical variable as float during splits, how should I solve this?
I think the basic idea is that you either one-hot encode the categorical variable (that post has some example code), or you use an algorithm that natively supports categorical features, such as sklearn.ensemble.HistGradientBoostingRegressor.
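Here is a minimal sketch of both approaches. The toy data, column layout, and parameters are assumptions for illustration (not from your post): column 0 is the discrete feature with the 10 levels 0.0 ... 0.9, column 1 is an ordinary continuous feature.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data: column 0 takes the 10 discrete levels, column 1 is noise.
rng = np.random.default_rng(0)
levels = np.arange(10) / 10                      # 0.0, 0.1, ..., 0.9
X = np.column_stack([rng.choice(levels, size=200),
                     rng.normal(size=200)])
y = rng.normal(size=200)

# Approach 1: one-hot encode the discrete column, then fit a plain regression
# tree. Each split on the encoded columns is effectively an "X == 0.3" test,
# so the tree can no longer cut the feature at an arbitrary threshold.
one_hot = ColumnTransformer(
    [("levels", OneHotEncoder(), [0])],
    remainder="passthrough",
)
tree = make_pipeline(one_hot, DecisionTreeRegressor(max_depth=4))
tree.fit(X, y)

# Approach 2: HistGradientBoostingRegressor supports categorical features
# natively in recent sklearn versions. Its categorical columns are expected to
# be non-negative integer codes, so map the 10 levels to 0..9 first.
ordinal = ColumnTransformer(
    [("levels", OrdinalEncoder(), [0])],
    remainder="passthrough",
)
hgb = make_pipeline(
    ordinal,
    HistGradientBoostingRegressor(categorical_features=[0]),
)
hgb.fit(X, y)
```

The trade-off, as I understand it: one-hot encoding keeps you on a single interpretable DecisionTreeRegressor but each split can only test one level at a time, while HistGradientBoostingRegressor can group several categories into one split but gives you an ensemble rather than a single tree you can plot.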