(Python/Pandas) Subset DataFrame based on non-missing values from a column

Question

(Python/Pandas) Subset DataFrame based on non-missing values from a column

42 Views Asked by CAA At 25 March 2024 at 20:05

I have a pd dataframe:

import pandas as pd
column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2,column3]),columns=['column1', 'column2', 'column3'])

print(df)
   column1 column2 column3
0     None    None    None
1     None    None    None
2     None    None    None
3        4    None       3
4        8       5    None
5        9       1       7
6     None    None    None
7     None    None    None
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1
12    None    None    None

I want to subset the rows between the values in column 3, and get rid of all empty rows. My desired outcomes are:

print (df1)   
   column1 column2 column3
0        4    None       3
1        8       5    None
2        9       1       7

print(df2)
   column1 column2 column3
0     None       6       7
1        2       3    None
2        3       3    None
3        5    None       1

I don't care about the actual values column3. Column 3 values are used to indicate "start" and "stop".

Original Q&A

There are 3 best solutions below

Andrej Kesely On 25 March 2024 at 20:26

Try:

x = df.column3.dropna().index
for from_, to_ in zip(x[::2], x[1::2]):
    print(df.loc[from_:to_])

Prints:

  column1 column2 column3
3       4    None       3
4       8       5    None
5       9       1       7

   column1 column2 column3
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1

Nima_Ebr On 28 March 2024 at 08:05

try this:

import pandas as pd
import numpy as np

column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2,column3]), columns=['column1', 'column2', 'column3'])

# Initialize an empty list to store subsets
subsets = []
subset = []

# Iterate over the DataFrame rows
for index, row in df.iterrows():
    if row['column3'] is not None:  # If the current row indicates start or stop
        if subset:  # If the subset is not empty, append it to subsets
            subsets.append(pd.DataFrame(subset))
            subset = []  # Reset subset for next subset
        if row['column3'] == 3:  # If the current row indicates start
            subset.append(row)  # Start collecting rows in subset
    else:
        if subset:  # If subset is collecting rows
            subset.append(row)  # Add the row to the subset

# Add the last subset to subsets if it's not empty
if subset:
    subsets.append(pd.DataFrame(subset))

for i, subset_df in enumerate(subsets, start=1):
    print(f"Subset {i}:")
    print(subset_df)
    print()

**James** · Accepted Answer · 2024-03-25T20:21:58.247000

You can find the non-na value, then perform a cumulative sum, then mod 2 to get the "groups" of start and one-less-than stop positions. Shifting this by 1, adding to the original, and clipping to (0, 1) gets clumps of the start and stop points.
To label the groups, you can take a diff of 1, then clip to (0, 1) again, and cum sum, then multiply those two together.

g_small = (~df.column3.isna()).cumsum().mod(2)
g = (g_small  + g_small .shift(1, fill_value=0)).clip(0,1)

groups = g.diff(1).fillna(0).clip(0,1).cumsum().astype(int) * g

You can then do a groupby operation on the data frame:

dfs = {i: g for i, g in df.groupby(groups) if i > 0}

dfs
# returns:
{1:
   column1 column2 column3
 3       4    None       3
 4       8       5    None
 5       9       1       7,

 2:
    column1 column2 column3
 8     None       6       7
 9        2       3    None
 10       3       3    None
 11       5    None       1}

(Python/Pandas) Subset DataFrame based on non-missing values from a column

There are 3 best solutions below

Related Questions in PYTHON

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in SUBSET

Trending Questions

Popular # Hahtags

Popular Questions