(Python/Pandas) Subset DataFrame based on non-missing values from a column

42 Views Asked by At

I have a pd dataframe:

import pandas as pd
column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2,column3]),columns=['column1', 'column2', 'column3'])

print(df)
   column1 column2 column3
0     None    None    None
1     None    None    None
2     None    None    None
3        4    None       3
4        8       5    None
5        9       1       7
6     None    None    None
7     None    None    None
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1
12    None    None    None

I want to subset the rows between the values in column 3, and get rid of all empty rows. My desired outcomes are:

print (df1)   
   column1 column2 column3
0        4    None       3
1        8       5    None
2        9       1       7

print(df2)
   column1 column2 column3
0     None       6       7
1        2       3    None
2        3       3    None
3        5    None       1

I don't care about the actual values column3. Column 3 values are used to indicate "start" and "stop".

3

There are 3 best solutions below

0
James On BEST ANSWER

You can find the non-na value, then perform a cumulative sum, then mod 2 to get the "groups" of start and one-less-than stop positions. Shifting this by 1, adding to the original, and clipping to (0, 1) gets clumps of the start and stop points.
To label the groups, you can take a diff of 1, then clip to (0, 1) again, and cum sum, then multiply those two together.

g_small = (~df.column3.isna()).cumsum().mod(2)
g = (g_small  + g_small .shift(1, fill_value=0)).clip(0,1)

groups = g.diff(1).fillna(0).clip(0,1).cumsum().astype(int) * g

You can then do a groupby operation on the data frame:

dfs = {i: g for i, g in df.groupby(groups) if i > 0}

dfs
# returns:
{1:
   column1 column2 column3
 3       4    None       3
 4       8       5    None
 5       9       1       7,

 2:
    column1 column2 column3
 8     None       6       7
 9        2       3    None
 10       3       3    None
 11       5    None       1}
0
Andrej Kesely On

Try:

x = df.column3.dropna().index
for from_, to_ in zip(x[::2], x[1::2]):
    print(df.loc[from_:to_])

Prints:

  column1 column2 column3
3       4    None       3
4       8       5    None
5       9       1       7

   column1 column2 column3
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1
0
Nima_Ebr On

try this:

import pandas as pd
import numpy as np

column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2,column3]), columns=['column1', 'column2', 'column3'])

# Initialize an empty list to store subsets
subsets = []
subset = []

# Iterate over the DataFrame rows
for index, row in df.iterrows():
    if row['column3'] is not None:  # If the current row indicates start or stop
        if subset:  # If the subset is not empty, append it to subsets
            subsets.append(pd.DataFrame(subset))
            subset = []  # Reset subset for next subset
        if row['column3'] == 3:  # If the current row indicates start
            subset.append(row)  # Start collecting rows in subset
    else:
        if subset:  # If subset is collecting rows
            subset.append(row)  # Add the row to the subset

# Add the last subset to subsets if it's not empty
if subset:
    subsets.append(pd.DataFrame(subset))

for i, subset_df in enumerate(subsets, start=1):
    print(f"Subset {i}:")
    print(subset_df)
    print()