I use Lasso and a linear regression for feature selection. I want a table that includes the coefficients as well as the significance of each variable. I first used scikit-learn, but it does not produce this kind of table, so I tried to build one myself.
Is the way I calculated the values for the lasso regression table correct?
```python
import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from scipy.stats import t

X = data[['actual_total_load_MW', 'DA_total_load_MW', 'DA_onshore_wind_MW',
          'DA_offshore_wind_MW', 'DA_solar_MW', 'actual_offshore_wind_MW',
          'actual_onshore_wind_MW', 'actual_solar_MW', 'DA_price']]
y = data['imbalance_MW']

# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Lasso feature selection
lasso = Lasso(alpha=0.1)  # Adjust the alpha value as per your preference
lasso.fit(X_std, y)

# Get coefficient information
coefficients = lasso.coef_
intercept = lasso.intercept_

# Unstandardized coefficients
B = coefficients / scaler.scale_

# Standard deviations (STD)
STD = np.std(X, 0) * coefficients / scaler.scale_

# Standardized coefficients (Beta)
Beta = coefficients

# Compute t-values
n_samples = X.shape[0]
dof = n_samples - np.count_nonzero(coefficients) - 1  # Degrees of freedom
t_values = coefficients / (np.std(X_std, 0) / np.sqrt(n_samples))

# Compute p-values
p_values = 2 * (1 - t.cdf(np.abs(t_values), df=dof))  # Assuming a two-tailed test

# Compute 95% Confidence Intervals (CI)
SE = np.std(X, 0) * np.std(X_std, 0) / np.sqrt(n_samples)  # Standard Errors
ci_low = B - 1.96 * SE   # 95% CI Lower Limit
ci_high = B + 1.96 * SE  # 95% CI Upper Limit

# Create the coefficient table
table = pd.DataFrame(
    {'Estimate': B, 'SE': SE, '95% CI (LL)': ci_low, '95% CI (UL)': ci_high, 'p-value': p_values},
    index=X.columns,
)

# Print the coefficient table
print(table)
```
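For the plain linear regression part, the textbook OLS formulas may serve as a baseline to compare against. The sketch below is only a reference under stated assumptions: an unpenalized least-squares fit with an intercept, a hypothetical helper name `ols_coefficient_table`, and synthetic data in place of the real `data`. As far as I understand, these standard errors are valid for ordinary least squares only and do not carry over to lasso coefficients, because the penalty shrinks and selects them.

```python
import numpy as np
from scipy import stats


def ols_coefficient_table(X, y):
    """Classical OLS estimates with SE, t-values, p-values and 95% CIs.

    Hypothetical helper for illustration; row 0 of each array is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Design matrix with an intercept column prepended
    X1 = np.column_stack([np.ones(n), X])

    # beta = (X'X)^-1 X'y
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta = XtX_inv @ X1.T @ y

    # Residual variance with n - p - 1 degrees of freedom
    resid = y - X1 @ beta
    dof = n - p - 1
    sigma2 = resid @ resid / dof

    # Standard errors from the diagonal of sigma^2 (X'X)^-1
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    t_values = beta / se
    p_values = 2 * stats.t.sf(np.abs(t_values), df=dof)  # two-tailed

    # 95% CI using the t critical value rather than a fixed 1.96
    t_crit = stats.t.ppf(0.975, df=dof)
    return {
        "estimate": beta,
        "se": se,
        "t": t_values,
        "p": p_values,
        "ci_low": beta - t_crit * se,
        "ci_high": beta + t_crit * se,
    }


# Synthetic data (an assumption, only to exercise the function)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(size=200)
result = ols_coefficient_table(X_demo, y_demo)
```

Here the standard error depends on the residual variance and on `(X'X)^-1`, so it reflects both the noise in `y` and the correlation between the features.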
Thank you in advance!