I want to create two subsets containing the columns whose names start with radius_ and area_. Here is some fake data. Sorry, I modified the code below a bit.
import pandas as pd

data = {'radius_mean': [18, 21, 20, 11, 20],
        'radius_se': [1, 0.5, 0.7, 0.4, 0.8],
        'area_mean': [1001, 1326, 1203, 386, 1200],
        'area_se': [153, 75, 94, 27, 95]}
df = pd.DataFrame(data)
df1 = pd.DataFrame()
df2 = pd.DataFrame()
subsets = [df1, df2]
features = ['radius', 'area']
for subset, feature in zip(subsets, features):
    subcol = [col for col in df.columns if col.startswith(feature + '_')]
    print(subcol)
    subset = df[subcol]
    print(subset.head())
I expect df1 to be:
['radius_mean', 'radius_se']
   radius_mean  radius_se
0           18        1.0
1           21        0.5
2           20        0.7
3           11        0.4
4           20        0.8
I expect df2 to look like the output below. However, df1 and df2 are empty, even though subset is created correctly inside the loop:
['area_mean', 'area_se']
   area_mean  area_se
0       1001      153
1       1326       75
2       1203       94
3        386       27
4       1200       95
You're running into an issue because of how references to DataFrames are handled. Your logic makes sense, but the line subset = df[subcol] does not fill the empty DataFrame you stored in the list. It rebinds the loop variable subset to a brand-new DataFrame, so df1 and df2 still refer to the original empty objects, and what you update is never the originals. You can side-step this issue by creating df1 and df2 AFTER your loop, like I show later in the code.

SOLUTION: create df1 and df2 for the first time in the loop instead of creating them empty beforehand.
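
Here is a minimal sketch of that idea, reusing the fake data from the question. The dict name subsets and the throwaway variable empty are assumptions made for illustration, and the .copy() call is an optional extra that is not in the original code. The first few lines show the rebinding problem on its own; the rest builds each subset inside the loop and only then binds df1 and df2.

```python
import pandas as pd

# Fake data from the question.
data = {
    'radius_mean': [18, 21, 20, 11, 20],
    'radius_se': [1, 0.5, 0.7, 0.4, 0.8],
    'area_mean': [1001, 1326, 1203, 386, 1200],
    'area_se': [153, 75, 94, 27, 95],
}
df = pd.DataFrame(data)

# Why the original loop leaves df1 and df2 empty: assigning to the loop
# variable only rebinds that name; the object in the list is untouched.
empty = pd.DataFrame()  # hypothetical stand-in for df1
for subset in [empty]:
    subset = df[['radius_mean']]
print(empty.empty)  # still True

# Fix: build each subset inside the loop and store it by feature name.
features = ['radius', 'area']
subsets = {}  # hypothetical name: maps feature -> sub-DataFrame
for feature in features:
    subcol = [col for col in df.columns if col.startswith(feature + '_')]
    # .copy() is optional; it keeps later edits from touching df.
    subsets[feature] = df[subcol].copy()

# Bind the names only now, after the subsets exist.
df1 = subsets['radius']
df2 = subsets['area']

print(df1.head())
print(df2.head())
```

Storing the results in a dict also scales to more prefixes without adding one variable per subset. If you prefer something shorter, df.filter(regex='^radius_') should select the same columns, though that is a different approach from the loop in the question.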