I am trying to find the correct way to run a PCA, or at least to make sure that I have retained the target class while doing so. I tried scaling the data both before and after splitting it, but the issue is still the same.
I am sorry that I can't use seaborn.load_dataset(name, cache=True, data_home=None, **kws) to load the dataset, so here we go.
Loading the data
# loading the dataframe
import numpy as np
import pandas as pd
auto = pd.read_csv('auto.csv')
Make a target class: any mileage above the median is 1, and anything else is 0
med=np.median(auto["mpg"])
auto["mpg01"]=auto["mpg"].apply(lambda x: 1 if x>med else 0)
Splitting the data
from sklearn.model_selection import train_test_split
X = auto[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']]
y = auto["mpg01"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101, test_size=0.3, shuffle=True)
Start the PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
pca2 = PCA(n_components=2)
X_train_reduced2 = pca2.fit_transform(scale(X_train))
Make a DataFrame that joins the PCs and the target class
pca_df2 = pd.DataFrame(X_train_reduced2, columns =["PC1", "PC2"])
pca_df2["mpg01"]=y_train
pca_df2
I noticed that there are some NaNs in this new dataframe. The length of the dataframe makes sense. The only thing I can think of is that the index no longer matches, but it should, and I have no way to verify it.
The 2D plot of the PCA shows this. There is no separation between the target classes. I am just wondering if I got all the steps right.
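As you said, the indexes no longer match. You need to modify the line:
pca_df2 = pd.DataFrame(X_train_reduced2, columns=["PC1", "PC2"], index=X_train.index)
Note that PCA does not return a pd.DataFrame but a plain NumPy array, so the new dataframe gets a default 0..n-1 index. When you assign y_train, pandas aligns on index labels, and every row whose label is missing from y_train becomes NaN. You need to set the index of the new dataframe so that it matches that of the labels y_train.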
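To see the misalignment without the auto dataset, here is a minimal sketch on made-up data. The names y_train and X_reduced below are hypothetical stand-ins for the shuffled labels and the PCA output, not the actual values from the question. It reproduces the NaNs and then shows two possible fixes: reusing the original index (as above), or assigning the labels positionally.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for y_train: labels whose index was shuffled by the split.
y_train = pd.Series([1, 0, 1, 0], index=[7, 2, 9, 4], name="mpg01")
# Hypothetical stand-in for the PCA output: a plain NumPy array with no index.
X_reduced = np.arange(8, dtype=float).reshape(4, 2)

# Default index 0..3: assignment aligns on labels, so only label 2 finds a match.
broken = pd.DataFrame(X_reduced, columns=["PC1", "PC2"])
broken["mpg01"] = y_train
n_missing = int(broken["mpg01"].isna().sum())

# Fix 1: reuse the original index so rows and labels line up.
fixed = pd.DataFrame(X_reduced, columns=["PC1", "PC2"], index=y_train.index)
fixed["mpg01"] = y_train

# Fix 2: strip the index from the labels and assign by position instead.
positional = pd.DataFrame(X_reduced, columns=["PC1", "PC2"])
positional["mpg01"] = y_train.to_numpy()

print(n_missing)
print(fixed)
```

Both fixes rely on the rows of the PCA output being in the same order as X_train, which should hold here because fit_transform does not reorder rows. In the question's code, y_train.index and X_train.index are the same, so either can be passed as index.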