Unexpected NaN values after applying One Hot Encoder to train and test datasets


I am working in Google Colab and want to apply a Naive Bayes algorithm to a dataset that includes a categorical column. I was advised to split the data into training and testing sets before applying the One Hot Encoder. So I listed all the categories in the categorical column, split the data into training and testing subsets, and checked that each subset contains every category present in the complete dataset. Then I applied the One Hot Encoder. However, the columns created by the One Hot Encoder contain NaN (Not-a-Number) values in the testing dataset, and I am not sure what I did wrong. Could you tell me what is wrong with the following Python code?

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Extract the categorical column
categorica = 'Modalidad'
valores_c = ent1_df[categorica].values.reshape(-1, 1)

# Create a OneHotEncoder object
encoder = OneHotEncoder(sparse=False)

# Fit and transform the training data
encoded_categorica = encoder.fit_transform(valores_c)

# Transform the evaluation data
encoded_categorica_eval = encoder.transform(ev1_df[categorica].values.reshape(-1, 1))
print(encoded_categorica[:5])
print(encoded_categorica_eval[:50])

X_train_encoded = pd.concat([ent1_df.drop(['Modalidad', 'indice_creencia_norm'], axis=1), pd.DataFrame(encoded_categorica)], axis=1)
X_train_encoded.reset_index(drop=True, inplace=True)  # Reset the index and drop the old index column

X_eval_encoded = pd.concat([ev1_df.drop(['Modalidad', 'indice_creencia_norm'], axis=1), pd.DataFrame(encoded_categorica_eval)], axis=1)
X_eval_encoded.reset_index(drop=True, inplace=True)  # Reset the index and drop the old index column

y_train = ent1_df['indice_creencia_norm']
y_eval = ev1_df['indice_creencia_norm']
print(X_train_encoded.head())
print(X_eval_encoded[:50])
print(y_train.head())
print(y_eval.head())
# END

I can share the dataset and the notebook if that would help.

Cheers

I want to learn the correct way to apply the One Hot Encoder so that I can then apply Naive Bayes. This is part of my practice in data science.
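
A likely cause, sketched here under assumptions since the dataset is not shown: after a train/test split each subset keeps its original row labels, while `pd.DataFrame(encoded_array)` gets a fresh 0..n-1 index. `pd.concat(..., axis=1)` aligns rows by index label, so labels that do not match are padded with NaN, and calling `reset_index` after the concat is too late. The toy frame, the column `x`, and the helper `encode` below are hypothetical stand-ins, not the real data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the real dataset.
full_df = pd.DataFrame({
    "Modalidad": ["A", "B", "C", "A", "B", "C"],
    "x": [1, 2, 3, 4, 5, 6],
    "indice_creencia_norm": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
ent1_df = full_df.iloc[[0, 1, 2, 4]]  # training rows keep labels 0, 1, 2, 4
ev1_df = full_df.iloc[[3, 5]]         # evaluation rows keep labels 3, 5

# Fit on the training data only; unseen categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(ent1_df[["Modalidad"]])

# Reproduce the problem: the encoded frame has index 0, 1 but ev1_df has 3, 5,
# so the column-wise concat produces extra rows filled with NaN.
misaligned = pd.concat(
    [
        ev1_df.drop(columns=["Modalidad", "indice_creencia_norm"]),
        pd.DataFrame(encoder.transform(ev1_df[["Modalidad"]]).toarray()),
    ],
    axis=1,
)

def encode(df):
    """One-hot encode 'Modalidad' and keep the rows aligned with df."""
    dense = encoder.transform(df[["Modalidad"]]).toarray()
    cols = [f"Modalidad_{c}" for c in encoder.categories_[0]]
    # Reusing df.index is the key step: both frames now share row labels.
    onehot = pd.DataFrame(dense, columns=cols, index=df.index)
    features = df.drop(columns=["Modalidad", "indice_creencia_norm"])
    return pd.concat([features, onehot], axis=1)

X_train = encode(ent1_df)
X_eval = encode(ev1_df)
print(misaligned)
print(X_eval)
```

An equivalent fix is to call `reset_index(drop=True)` on `ent1_df` and `ev1_df` before the concat rather than after it. Separately, recent scikit-learn releases renamed the `sparse` argument of `OneHotEncoder` to `sparse_output`, so `sparse=False` may fail depending on the installed version; calling `.toarray()` on the default sparse output, as above, avoids that difference.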
