I have a data frame df with the categorical feature columns 'temp_of_extremities', 'peripheral_pulse', and 'mucous_membrane'.
I want to encode these categorical features like this:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), categorical_features)], remainder='passthrough')
df = ct.fit_transform(df)
However, I want the result to stay a data frame rather than being converted to an array.
I've tried this approach instead:
categorical_features = ['temp_of_extremities', 'peripheral_pulse', 'mucous_membrane']
for feature in categorical_features:
    df = pd.concat([df, pd.get_dummies(df[feature], prefix=feature, dtype='int')], axis=1)
    df = df.drop([feature], axis=1)
However, this is not a correct solution: when I apply it to another data frame with the same features, the resulting encoding is different.
If you have scikit-learn version 1.2.0 or later, you can use the set_output method to return a pandas.DataFrame instead of an array.
Example
To clean up the column names a bit, we can set verbose_feature_names_out=False in the ColumnTransformer.
Resources
set_output API