How to perform one hot encoding without converting a data frame into an array?


I have a data frame df with the categorical feature columns 'temp_of_extremities', 'peripheral_pulse', and 'mucous_membrane'. I want to one-hot encode these features like this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
df = ct.fit_transform(df)

However, I want the result to stay a data frame instead of being converted to an array.

I've tried this approach:

categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
for feature in categorical_features:
    df = pd.concat([df, pd.get_dummies(df[feature], prefix=feature, dtype='int')], axis=1)
    df = df.drop([feature], axis=1)

However, this isn't a correct solution: when I apply it to another data frame with the same features, the encoding can come out different.

Answer by Ian Thompson

If you have scikit-learn version 1.2.0 or later, you can call the set_output method so that the transformer returns a pandas.DataFrame instead of an array.


Example

from functools import partial

import numpy as np  # 1.26.2
import pandas as pd  # 2.1.4
import sklearn  # 1.3.2
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


# For repeatability
np.random.seed(0)

# Setting some temporary defaults so I don't have to type out the parameters each time.
choice = partial(np.random.choice, size=(10,), replace=True)

# Make some fake data.
df = pd.DataFrame(
    data={
        "temp_of_extremities": choice(a=["high", "low", "neutral"]),
        "peripheral_pulse": choice(a=[True, False]),
        "mucous_membrane": choice(a=[True, False]),
    }
)

# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # `sparse_output=False` is needed here: pandas output does not support sparse data, so the default raises a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
    # Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)

print(out)
   encoder__temp_of_extremities_high  ...  encoder__mucous_membrane_True
0                                1.0  ...                            0.0
1                                0.0  ...                            0.0
2                                1.0  ...                            1.0
3                                0.0  ...                            0.0
4                                0.0  ...                            1.0
5                                0.0  ...                            0.0
6                                1.0  ...                            1.0
7                                0.0  ...                            0.0
8                                1.0  ...                            0.0
9                                1.0  ...                            1.0

To clean up the column names a bit, we can set verbose_feature_names_out=False in the ColumnTransformer.

# Initial setup from your question.
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(
    # `sparse_output=False` is needed here: pandas output does not support sparse data, so the default raises a ValueError.
    transformers=[('encoder', OneHotEncoder(sparse_output=False), categorical_features)],
    remainder='passthrough',
    # Set `verbose_feature_names_out=False` to drop the `encoder__` prefix, leaving the original column name plus the encoded value.
    verbose_feature_names_out=False,
    # Use the `set_output` method here to return a `pd.DataFrame` instead of a `np.ndarray`.
).set_output(transform="pandas")
out = ct.fit_transform(df)

print(out)
   temp_of_extremities_high  ...  mucous_membrane_True
0                       1.0  ...                   0.0
1                       0.0  ...                   0.0
2                       1.0  ...                   1.0
3                       0.0  ...                   0.0
4                       0.0  ...                   1.0
5                       0.0  ...                   0.0
6                       1.0  ...                   1.0
7                       0.0  ...                   0.0
8                       1.0  ...                   0.0
9                       1.0  ...                   1.0
