Why does DecisionTreeClassifier split the data differently from the criterion shown in the plot?


When I fit a DecisionTreeClassifier, the root split produces two subtrees with 192 and 346 samples. But when I apply the same split condition to the same file with my own counter script, I get 171 and 367 samples. What is the reason for this difference?

DecisionTreeClassifier code block:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Load the data and separate the features from the target column.
data = pd.read_csv(r"PCOS.csv")
X = data.drop("PCOS (Y/N)", axis=1)
y = data["PCOS (Y/N)"]

# Fit a shallow tree so the root split is easy to read.
model = DecisionTreeClassifier(max_depth=2, criterion="gini")
model.fit(X, y)

# Feature names passed to the plot (data.columns still contains the target column).
fn = data.columns
label = ["0", "1"]

# Draw the tree once on a dedicated figure and save it.
fig, axes = plt.subplots()
tree.plot_tree(model, feature_names=fn, class_names=label, filled=True, ax=axes)
fig.savefig('imagenae.png')
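
One thing that may be worth checking, offered as a guess since the CSV is not shown: `fn = data.columns` still contains "PCOS (Y/N)", while the model was trained on `X`, which does not. If the target is not the last column of the file, every feature name after it is shifted by one position in the plot, so the node labelled 'Follicle No. (L)' might actually be splitting on a neighbouring column. The sketch below reproduces that effect on a made-up toy frame (the columns `label`, `noise` and `signal` are hypothetical, not taken from PCOS.csv) and reads the root split directly from `model.tree_` instead of from the plot.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the target "label" comes BEFORE the feature columns,
# which is the situation where data.columns and X.columns stop lining up.
data = pd.DataFrame({
    "label":  [0, 0, 0, 1, 1, 1, 0, 1],
    "noise":  [3, 1, 2, 3, 1, 2, 2, 2],    # carries no information about label
    "signal": [1, 2, 3, 8, 9, 10, 2, 9],   # separates the labels perfectly
})
X = data.drop("label", axis=1)
y = data["label"]

model = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0)
model.fit(X, y)

# Index 0 of the tree_ arrays is the root node.
root_feature = model.tree_.feature[0]
root_threshold = model.tree_.threshold[0]

true_name = X.columns[root_feature]        # the column the tree really uses
shifted_name = data.columns[root_feature]  # what a plot fed with data.columns would show

# Samples sent to the left child, according to the fitted tree...
left_count = model.tree_.n_node_samples[model.tree_.children_left[0]]
# ...and the same count recomputed by hand on the correctly named column.
manual_left = int((X[true_name] <= root_threshold).sum())

print("root really splits on:", true_name, "<=", root_threshold)
print("name taken from data.columns:", shifted_name)
print("left samples (tree / manual):", left_count, "/", manual_left)
```

On the real data, printing `X.columns[model.tree_.feature[0]]` and `model.tree_.threshold[0]` would show which column and threshold the root node actually uses, and passing `feature_names=list(X.columns)` to `plot_tree` should keep the labels aligned with the training columns.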

counter code block:

import pandas as pd


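def subtree(data, col):
    # Split the rows on data[col] <= 7.5 and report the label counts of each side.
    first_list = []
    sec_list = []
    for i in range(len(data)):
        if data.loc[i, col] <= 7.5:
            first_list.append(data.loc[i, :].values)
        else:
            sec_list.append(data.loc[i, :].values)
    gini(first_list)
    gini(sec_list)


def gini(data):
    # Despite the name, this only counts the labels; it does not compute Gini impurity.
    # Assumes the label is the last value of each row.
    a, b = 0, 0
    for i in data:
        if i[-1] == 0:
            a += 1
        else:
            b += 1
    print("label 0 :", a)
    print("label 1 :", b)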


col = ['Skin darkening (Y/N)', 'hair growth(Y/N)', 'Weight gain(Y/N)', 'Cycle(R/I)', 'Follicle No. (R)',
       'Fast food (Y/N)', 'Follicle No. (L)', 'PCOS (Y/N)']

data = pd.read_csv("PCOS.csv")[col]

X = data.drop("PCOS (Y/N)", axis=1)
y = data[["PCOS (Y/N)"]]

subtree(data, 'Follicle No. (L)')
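
For comparison, the same count can be sketched without the explicit loop. This is only an illustration on a small made-up frame, not the real PCOS.csv; `split_counts` is a hypothetical helper name, and the threshold 7.5 is the one used in the counter above.

```python
import pandas as pd


def split_counts(data, col, target, threshold):
    """Cross-tabulate the side of the split against the target label."""
    # Name each row by the side of `data[col] <= threshold` it falls on.
    side = (data[col] <= threshold).map({True: "left", False: "right"})
    return pd.crosstab(side, data[target])


# Hypothetical toy rows standing in for the real file.
toy = pd.DataFrame({
    "Follicle No. (L)": [3, 7, 8, 12, 5, 9],
    "PCOS (Y/N)":       [0, 0, 1, 1, 1, 1],
})

counts = split_counts(toy, "Follicle No. (L)", "PCOS (Y/N)", 7.5)
print(counts)              # label counts per side
print(counts.sum(axis=1))  # total samples per side, comparable to "samples" in the tree plot
```

Summing each row gives the per-side sample totals, which are the numbers to compare against the `samples` value shown in each child node of the plotted tree.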

Result from DecisionTreeClassifier: 192 and 346

Result from the counter: 171 and 367

Attachments: database; decision tree visualization
