How to sequentially name multiple subsets with selected columns whose names start with a given prefix


I want to create two subsets of a DataFrame: one with the columns whose names start with radius_ and one with the columns whose names start with area_. Here is some fake data (sorry, I modified it a bit from my real data):

    import pandas as pd

    data = {'radius_mean': [18, 21, 20, 11, 20],
            'radius_se': [1, 0.5, 0.7, 0.4, 0.8],
            'area_mean': [1001, 1326, 1203, 386, 1200],
            'area_se': [153, 75, 94, 27, 95]}
    df = pd.DataFrame(data)
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    subsets = [df1, df2]
    features = ['radius', 'area']
    for subset, feature in zip(subsets, features):
        subcol = [col for col in df.columns if col.startswith(feature + '_')]
        print(subcol)
        subset = df[subcol]
        print(subset.head())

I expect df1 to be:

    ['radius_mean', 'radius_se']
       radius_mean  radius_se
    0           18        1.0
    1           21        0.5
    2           20        0.7
    3           11        0.4
    4           20        0.8
    

I expect df2 to look like the output below. However, df1 and df2 stay empty, even though subset is created correctly inside the loop:

    ['area_mean', 'area_se']
       area_mean  area_se
    0       1001      153
    1       1326       75
    2       1203       94
    3        386       27
    4       1200       95

mitoRibo On

You're running into an issue because of how Python binds names to objects. Your logic makes sense, but no copies of your tables are involved: the list holds references to the same empty DataFrames, and an assignment such as subset = df[subcol] only rebinds the loop variable to a new DataFrame. It never changes the objects that the original names refer to, so those stay empty. You can side-step this issue by creating data1 and data2 AFTER your loop, like I show later in the code.

    import pandas as pd
    import io  # you don't need this, it's just for me to read in the cancer table

    # again you don't need this, this just lets me get the cancer table
    cancer = pd.read_csv(io.StringIO("""
    radius_mean  radius_se  radius_worst    area_mean  area_se  area_worst
            17.99     1.0950         25.38      1001.0   153.40      2019.0
            20.57     0.5435         24.99     1326.0    74.08      1956.0
            19.69     0.7456         23.57     1203.0    94.03      1709.0
            11.42     0.4956         14.91     386.1    27.23       567.7
            20.29     0.7572         22.54     1297.0    94.44      1575.0
    """), sep=r'\s+')

    data1 = pd.DataFrame()
    data2 = pd.DataFrame()
    dsets = [data1, data2]  # the list holds references to the same two objects, not copies

    # assigning to an entry of dsets only rebinds that list slot; the name data1 still points at the empty DataFrame
    dsets[0] = pd.DataFrame({'a': [1, 2, 3]})  # replaces list slot 0, doesn't update data1
    print(dsets[0])  # changed
    print(data1)  # not changed

    # in your loop the same thing happens: `dset = ...` only rebinds the loop variable, so data1 and data2 don't get updated
    features = ['radius', 'area']
    for dset, feature in zip(dsets, features):
        subcol = [col for col in cancer.columns if col.startswith(feature + '_')]
        dset = cancer[subcol]

    print(data1)  # still not updated
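
The same behaviour can be reproduced without pandas. The sketch below is only an illustration with plain lists (the names first, second and containers are made up for this example): assigning to the loop variable leaves the original objects alone, while changing an object in place is visible through every name that refers to it.

```python
# Plain-Python sketch of name binding; no pandas involved.
first = []
second = []
containers = [first, second]  # the list stores references to the same objects, not copies

print(containers[0] is first)  # True

for container, value in zip(containers, ["a", "b"]):
    container = [value]  # rebinds only the loop variable; first and second are untouched

print(first, second)  # [] []

for container, value in zip(containers, ["a", "b"]):
    container.append(value)  # mutates the shared object in place

print(first, second)  # ['a'] ['b']
```

Selecting columns, as in cancer[subcol], produces a new DataFrame, so it falls into the first case (rebinding) rather than the second.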
    

SOLUTION: collect the subsets in a list inside the loop, then create data1 and data2 for the first time by unpacking that list after the loop

    dsets = []
    features = ['radius', 'area']
    for feature in features:
        subcol = [col for col in cancer.columns if col.startswith(feature + '_')]
        dsets.append(cancer[subcol])

    data1, data2 = dsets

    print(data1)
    print(data2)
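
If the number of prefixes might grow, one possible variation (not part of the answer above, just a sketch) is to keep the subsets in a dictionary keyed by prefix instead of unpacking them into numbered variables. The example uses the fake data from the question; the name subsets_by_feature is an assumption made for illustration.

```python
import pandas as pd

# Fake data taken from the question.
df = pd.DataFrame({
    "radius_mean": [18, 21, 20, 11, 20],
    "radius_se": [1, 0.5, 0.7, 0.4, 0.8],
    "area_mean": [1001, 1326, 1203, 386, 1200],
    "area_se": [153, 75, 94, 27, 95],
})

features = ["radius", "area"]

# Hypothetical name: maps each prefix to the sub-DataFrame of columns starting with it.
subsets_by_feature = {
    feature: df[[col for col in df.columns if col.startswith(feature + "_")]]
    for feature in features
}

print(subsets_by_feature["radius"].columns.tolist())  # ['radius_mean', 'radius_se']
print(subsets_by_feature["area"].columns.tolist())    # ['area_mean', 'area_se']
```

With this layout, adding a third prefix only means extending features; no new variable names are needed.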