Creating a DataFrame from a dictionary raises ValueError: Per-column arrays must each be 1-dimensional


I'm trying to create a pandas DataFrame from a dictionary to plot a performance curve. This worked in 2020, but it no longer does.

model = ExtraTreesRegressor()      
feature_selector = RFECV(estimator=model, step=1, cv=5, scoring='r2') 
feature_selector.fit(X_train, np.ravel(y_train))
feature_names = X_train.columns
selected_features = feature_names[feature_selector.support_].tolist()
performance_curve = {"Number of Features": list(range(1, len(feature_names) + 1)),
                     "r2": (feature_selector.grid_scores_)}
performance_curve = pd.DataFrame(performance_curve)

The error:

performance_curve = pd.DataFrame(performance_curve)
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Temp\ipykernel_3436\1638829063.py", line 1, in <module>
    performance_curve = pd.DataFrame(performance_curve)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\frame.py", line 636, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 502, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 120, in arrays_to_mgr
    index = _extract_index(arrays)
  File "C:\Users\user\anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 661, in _extract_index
    raise ValueError("Per-column arrays must each be 1-dimensional")
ValueError: Per-column arrays must each be 1-dimensional

How can I solve this problem? Thank you in advance for your help.

The dictionary:

{'Number of Features': [1, 2, 3, 4, 5, 6, 7, 8, 9],
 'r2': array([[0.897 , 0.8891, 0.9031, 0.8967, 0.8833],
        [0.889 , 0.8822, 0.8906, 0.8828, 0.8801],
        [0.9468, 0.9388, 0.9411, 0.9448, 0.9401],
        [0.9623, 0.9567, 0.9564, 0.9539, 0.9576],
        [0.9674, 0.962 , 0.9612, 0.9643, 0.9634],
        [0.9958, 0.9939, 0.9925, 0.9944, 0.9928],
        [0.9959, 0.9939, 0.9924, 0.9945, 0.993 ],
        [0.9961, 0.9941, 0.9926, 0.9949, 0.9929],
        [0.9963, 0.9943, 0.9926, 0.995 , 0.993 ]])}

Shapes: Number of Features is a list of length 9 (shape (9,)); r2 is an array of shape (9, 5).

It works when I use list(feature_selector.grid_scores_), but then the plotting code fails:

sns.lineplot(x = "Number of Features", y = "r2", data = performance_curve,
             color = line_color, lw = 4, ax = ax)
sns.regplot(x = performance_curve["Number of Features"], y = performance_curve["r2"],
            color = marker_colors, fit_reg = False, scatter_kws = {"s": 200}, ax = ax)

There are 2 best solutions below

1
100tifiko On BEST ANSWER

When you use list(feature_selector.grid_scores_), pandas does build a DataFrame with two columns, Number of Features and r2. The problem is that each cell of r2 then holds an array of 5 values (one per CV fold) instead of a single number, so seaborn cannot plot it.
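To see this concretely, here is a minimal sketch that uses the first three rows of the r2 array from the question as a stand-in for feature_selector.grid_scores_. The variable names scores and df_obj are made up for illustration.

```python
import numpy as np
import pandas as pd

# First three rows of the r2 array from the question:
# one row per number of features, one column per CV fold
scores = np.array([
    [0.897, 0.8891, 0.9031, 0.8967, 0.8833],
    [0.889, 0.8822, 0.8906, 0.8828, 0.8801],
    [0.9468, 0.9388, 0.9411, 0.9448, 0.9401],
])

# list(scores) turns the 2-D array into a list of three 1-D arrays,
# so pandas accepts it but stores each whole array in a single cell
df_obj = pd.DataFrame({
    "Number of Features": [1, 2, 3],
    "r2": list(scores),
})

print(df_obj["r2"].dtype)        # object rather than a numeric dtype
print(len(df_obj.loc[0, "r2"]))  # 5 values packed into one cell
```

A column of dtype object that holds arrays has no single numeric value per row, which is why the plotting call has nothing sensible to draw.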

Instead, average the scores across the CV folds for each number of features, and it will work.

performance_curve = {"Number of Features": list(range(1, len(feature_names) + 1)),
                     "r2": np.mean(feature_selector.grid_scores_, axis=1)}

performance_curve = pd.DataFrame(performance_curve)

This creates a DataFrame with two numeric columns, Number of Features and r2, holding one mean score per row.

Then run your seaborn code unchanged and it will draw the performance curve.
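If you also want to keep some information about the spread between folds, one option is to store the standard deviation next to the mean. This is only a sketch using the first three rows of the array from the question; the names scores, summary and the r2_std column are assumptions for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

# First three rows of the r2 array from the question (rows: feature counts, columns: CV folds)
scores = np.array([
    [0.897, 0.8891, 0.9031, 0.8967, 0.8833],
    [0.889, 0.8822, 0.8906, 0.8828, 0.8801],
    [0.9468, 0.9388, 0.9411, 0.9448, 0.9401],
])

# axis=1 collapses the fold dimension, leaving one value per number of features
summary = pd.DataFrame({
    "Number of Features": np.arange(1, scores.shape[0] + 1),
    "r2": scores.mean(axis=1),
    "r2_std": scores.std(axis=1),  # hypothetical extra column: spread across folds
})

print(summary)
```

Every column is now one-dimensional with the same length, so the DataFrame constructor and the seaborn calls both get plain numeric data.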

0
ripalo On

Every column you pass to the DataFrame must be one-dimensional, and all columns must have the same length. You can get there by flattening the r2 array to one dimension and repeating each feature count once per CV fold.

The code is below:

import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

performance_curve = {
    'Number of Features': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'r2': np.array([
        [0.897, 0.8891, 0.9031, 0.8967, 0.8833],
        [0.889, 0.8822, 0.8906, 0.8828, 0.8801],
        [0.9468, 0.9388, 0.9411, 0.9448, 0.9401],
        [0.9623, 0.9567, 0.9564, 0.9539, 0.9576],
        [0.9674, 0.962, 0.9612, 0.9643, 0.9634],
        [0.9958, 0.9939, 0.9925, 0.9944, 0.9928],
        [0.9959, 0.9939, 0.9924, 0.9945, 0.993],
        [0.9961, 0.9941, 0.9926, 0.9949, 0.9929],
        [0.9963, 0.9943, 0.9926, 0.995, 0.993]
    ])
}
# Flatten the r2 array into a 1-dimensional array
r2_1d = performance_curve['r2'].flatten()

# Create a DataFrame from the flattened data
df = pd.DataFrame({
    'Number of Features': np.repeat(performance_curve['Number of Features'], performance_curve['r2'].shape[1]),
    'r2': r2_1d
})

print(df)

sns.lineplot(x="Number of Features", y='r2', data=df, lw=4)
sns.regplot(x=df["Number of Features"], y=df['r2'], fit_reg=False, scatter_kws={"s": 200})

plt.show()
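The approach above relies on flatten() and np.repeat producing values in the same order: flatten() reads the array row by row, and np.repeat repeats each feature count once per column. A small check with made-up numbers (the values and the names n_features, scores, flat, repeated and long_df are purely illustrative):

```python
import numpy as np
import pandas as pd

n_features = [1, 2, 3]
# Made-up scores: 3 feature counts x 2 folds
scores = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

flat = scores.flatten()                             # row-major: 0.1, 0.2, 0.3, ...
repeated = np.repeat(n_features, scores.shape[1])   # 1, 1, 2, 2, 3, 3

long_df = pd.DataFrame({"Number of Features": repeated, "r2": flat})

# Each feature count is paired with exactly the scores from its own row
for n, row in zip(n_features, scores):
    matched = long_df.loc[long_df["Number of Features"] == n, "r2"].tolist()
    assert matched == row.tolist()

print(long_df)
```

With this long format, sns.lineplot sees several r2 values per feature count; by default it plots their mean and shades an interval around it, while the scatter drawn by sns.regplot shows every individual fold score.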