predict_proba() giving probabilities as 0s and 1s but few intermediate values

69 Views Asked by At

I am working on a breast cancer detection classification problem. I have downloaded the dataset from Kaggle: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset

I want to predict:

a) Whether the tumour is Benign or Malignant and

b) What is the probability(0-1) that a tumour is malignant.

I am implementing a Random Forest Classifier.

The issue I am facing is that when I use the rf_classifier.predict_proba() method, the probabilities that I obtain contain a lot of 1s and 0s but few intermediate values. Ideally, I was expecting all values to be decimals between 0 and 1 in the Probability column.

Is this method the correct approach for achieving the objective? If yes, how to solve the issue?

The classifier performs very well.

Here's the relevant portion of my code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_test)

y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1]

results = np.column_stack((y_test[:200], y_pred[:200], y_pred_proba[:200]))
np.set_printoptions(precision=2, suppress=True)
print("Actual | Predicted | Probability")
print(results)

Output:

output

Classification report:

Classification report

1

There are 1 best solutions below

3
Reza On

The probability is between 0 and 1 only when the samples in a leaf contain 0 and 1. For example, when a leaf contains nine benign and one malignant sample, the probability of being malignant is 10 percent and vice. versa.

Now, when you traverse through the random forest and reach a leaf with impure samples (containing both benign and malignant samples), the output will be in decimals.

However, in your model, most leaves are pure that results in 0 and 1 predictins.