predict_proba() giving probabilities as 0s and 1s but few intermediate values

69 Views Asked by Avijit Jaiswal At 11 March 2024 at 09:13

I am working on a breast cancer detection classification problem. I have downloaded the dataset from Kaggle: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset

I want to predict:

a) Whether the tumour is Benign or Malignant and

b) What is the probability(0-1) that a tumour is malignant.

I am implementing a Random Forest Classifier.

The issue I am facing is that when I use the rf_classifier.predict_proba() method, the probabilities that I obtain contain a lot of 1s and 0s but few intermediate values. Ideally, I was expecting all values to be decimals between 0 and 1 in the Probability column.

Is this method the correct approach for achieving the objective? If yes, how to solve the issue?

The classifier performs very well.

Here's the relevant portion of my code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_test)

y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1]

results = np.column_stack((y_test[:200], y_pred[:200], y_pred_proba[:200]))
np.set_printoptions(precision=2, suppress=True)
print("Actual | Predicted | Probability")
print(results)

Output:

output

Classification report:

Classification report

Original Q&A

There are 1 best solutions below

Reza On 11 March 2024 at 09:52

The probability is between 0 and 1 only when the samples in a leaf contain 0 and 1. For example, when a leaf contains nine benign and one malignant sample, the probability of being malignant is 10 percent and vice. versa.

Now, when you traverse through the random forest and reach a leaf with impure samples (containing both benign and malignant samples), the output will be in decimals.

However, in your model, most leaves are pure that results in 0 and 1 predictins.

predict_proba() giving probabilities as 0s and 1s but few intermediate values

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in MACHINE-LEARNING

Related Questions in SCIKIT-LEARN

Related Questions in RANDOM-FOREST

Trending Questions

Popular # Hahtags

Popular Questions