Select rows based on data variable in column and remove others

83 Views Asked by At

I have created a DataFrame (df):

    index   compound    e_above_hull         space
    0   CaFeO3  0.052160963499999546    {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}
    1   BaSiO3  0.5612689803333337  {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}
    2   BaGeO3  0.23235826850000052 {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}
    3   CdIrO3  0.46319859100000027 {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}
    4   BaZrO3  4.992200000053515e-05   {'symprec': 0.1, 'source': 'spglib', 'symbol': 'I4/mcm', 'number': 140, 'point_group': '4/mmm', 'crystal_system': 'tetragonal', 'hall': '-I 4 2c'}
    5   CeGaO3  0.12483252989999993 {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pnma', 'number': 62, 'point_group': 'mmm', 'crystal_system': 'orthorhombic', 'hall': '-P 2ac 2n'}

Now I want to select those rows only which contains ('symbol': 'Pm-3m') (or 'hall': '-P 4 2 3' or 'number': 221, ). I tried this way but this does not make any change in Dataframe.

df[df['space'].str.contains("'symbol': 'Pm-3m'")==True]

My list contains 2000 plus items, I have shown few of them only. Final output

index   compound    e_above_hull         space
0   CaFeO3  0.052160963499999546    {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}
1   BaSiO3  0.5612689803333337  {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}
2   BaGeO3  0.23235826850000052 {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}
3   CdIrO3  0.46319859100000027 {'symprec': 0.1, 'source': 'spglib', 'symbol': 'Pm-3m', 'number': 221, 'point_group': 'm-3m', 'crystal_system': 'cubic', 'hall': '-P 4 2 3'}

Further, my desired Dataframe is like:

index   compound    e_above_hull            space
        0   CaFeO3  0.052160963499999546     Pm-3m
        1   BaSiO3  0.5612689803333337       Pm-3m
        2   BaGeO3  0.23235826850000052      Pm-3m
        3   CdIrO3  0.46319859100000027      Pm-3m
1

There are 1 best solutions below

0
itprorh66 On

Based on what I gather from your question you are asking how to:

  1. filter the contents of your dataframe such that it only contains entries satisfying the following criteria: 'symbol': 'Pm-3m') or ('hall': '-P 4 2 3' or ('number': 221).
  2. replace the contents of the column 'space' with 'Pm-3m' for the filtered rows.

While the following approach may not be the most efficient, it does get the job done. I am sure there are ways to use either df.map, df.filter, or df.query to do the job more efficiently but the following is the approach I would use:

Given the dataframe:

    compound e_above_hull   space
Index           
0   CaFeO3  0.052161    {'symprec': 0.1, 'source': 'spglib', 'symbol':...
1   BaSiO3  0.561269    {'symprec': 0.1, 'source': 'spglib', 'symbol':...
2   BaGeO3  0.232358    {'symprec': 0.1, 'source': 'spglib', 'symbol':...
3   CdIrO3  0.463199    {'symprec': 0.1, 'source': 'spglib', 'symbol':...
4   BaZrO3  0.000050    {'symprec': 0.1, 'source': 'spglib', 'symbol':...
5   CeGaO3  0.124833    {'symprec': 0.1, 'source': 'spglib', 'symbol':...   

Add the Check Column to this dataframe which evaluates to True if the criteria are met using:

df['Check'] = [x['symbol'] == 'Pm-3m' or x['hall'] == '-P 4 2 3' or x['number'] == 221 for x in df['space'].to_list()]  

Then filter and modify the dataframe as follows:

dfa = df[ df['Check'] == True][['compound', 'e_above_hull', 'space']]
dfa['space'] = ['Pm-3m' for x in dfa['space'].to_list()]  

This produces the following result:

    compound    e_above_hull    space
Index           
0   CaFeO3  0.052161    Pm-3m
1   BaSiO3  0.561269    Pm-3m
2   BaGeO3  0.232358    Pm-3m
3   CdIrO3  0.463199    Pm-3m