I have a dataset where I transformed the categorical variables into numerical ones with dummies, and I ran a simple linear regression model to predict the dependent variable. I got an adjusted R-squared of 0.66.
Now I want to cross-validate my model with the leave-one-out method and see whether the LOOCV adjusted R-squared is similar to that of my pre-cross-validation model.
import pandas as pd
from numpy import mean
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

cv = LeaveOneOut()
data = pd.read_excel(r'C:/Users/LENOVO/Documents/Diwali_Impact_coding/Modelling/Model_Data.xlsx', usecols=['PMlog', 'Temp', 'RH', 'WSlog', 'Type', 'Popu', 'FRPlog', 'Region'], sheet_name='City_cook2')
data.dropna(subset=['PMlog', 'Temp', 'RH', 'WSlog'], inplace=True)
data_log1 = pd.get_dummies(data, columns=['Type', 'Region', 'Popu'])  # all numerical features
X = data_log1.loc[:, data_log1.columns != 'PMlog']  # independent/predictor variables
y = data_log1.loc[:, 'PMlog']  # dependent variable

model_LR = LinearRegression()
model_LR.fit(X, y)

def adj_Rsqr(model_LR, X, y):
    xx = 1 - (1 - model_LR.score(X, y)) * (len(y) - 1) / (len(y) - X.shape[1] - 1)
    return xx

adj_Rsqr(model_LR, X, y)  # 0.66

scores = cross_val_score(model_LR, X, y, scoring=adj_Rsqr, cv=cv, n_jobs=-1)
mean(scores)
My scores are all coming out as NaN.
Can anybody help me understand why they are NaN? If I use r2 as the scoring it also comes out as NaN, but not with other scorings such as absolute error etc.
Thank you for any help.
Cross validation is the process of splitting your data into a training and a test split for the purpose of validating the model on data it was not fitted to. When you apply LeaveOneOut cross validation, the test split is just one sample and the train split is all the other samples. An R-squared does not make a lot of sense for a single sample in the test split.

When I coded LOOCV for sklearn's diabetes dataset to reproduce the behavior you got, I got the following warning:

UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
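(A minimal sketch of such a reproduction, assuming sklearn's load_diabetes; the exact code here is my reconstruction:)

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X_d, y_d = load_diabetes(return_X_y=True)
# every test split holds exactly one sample, so r2 is undefined on it
scores = cross_val_score(LinearRegression(), X_d, y_d, scoring='r2', cv=LeaveOneOut())
print(scores)  # all nan, with the warning above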
Consequently, a possible solution could be using KFold cross validation and choosing a high k, where k < n/2, so that you have at least two samples in each test split.

Your scoring function can be improved: you could use sklearn.metrics.make_scorer. make_scorer accepts a scoring function; according to the documentation, the scoring function must have the signature score_func(y, y_pred, **kwargs). So, a scoring function and a scorer could look like this in your case:
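(A sketch of what that could be; the names adj_r2_score and adj_r2_scorer are mine, and the number of predictors p is passed in through make_scorer's kwargs, since the scorer only ever sees a test fold:)

from sklearn.metrics import make_scorer, r2_score

def adj_r2_score(y, y_pred, p):
    # p: number of predictors of the full design matrix, passed in via make_scorer
    n = len(y)
    return 1 - (1 - r2_score(y, y_pred)) * (n - 1) / (n - p - 1)

# greater_is_better=True because a higher adjusted r-squared is better
adj_r2_scorer = make_scorer(adj_r2_score, greater_is_better=True, p=X.shape[1])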
But then, your code has to change a little bit:
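(Again a sketch; k = 10 is an arbitrary choice, and each test fold must contain more than p + 1 samples, otherwise the adjusted R-squared denominator becomes zero or negative:)

from sklearn.model_selection import KFold, cross_val_score
from numpy import mean

cv = KFold(n_splits=10, shuffle=True, random_state=1)  # instead of LeaveOneOut()
scores = cross_val_score(model_LR, X, y, scoring=adj_r2_scorer, cv=cv, n_jobs=-1)
mean(scores)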
I liked the question; I had fun thinking about it.