Comparing two data frames with condition and removing all that not qualified

32 Views Asked by At

I have two data frames. I have tried to generate a short data to explain what I am looking for, any suggestion or help is appreciated.

df = pd.DataFrame({'policy number':[11,22,33,44,55,66,77,88,99], ' policy status':['good', 'good', 'good', 'good', 'good','good', 'good', 'good', 'good']})

df_2 = pd.DataFrame({'policy number':[11,83,63,44,55,66,67,88,99,100], 'policy status':['bad','bad', 'good', 'good', 'bad', 'good','bad', 'good', 'average', 'good']})

I want to compare two data frames by policy number, if the column [policy status] is still good, I want to keep those policies. Else I want to remove them from my first data frame.

Is there any easier way for this? I have tried to iterate each rows of two data frames and compare them, but this takes a lot time, since I have bigger datasets.

Thanks in advance!

2

There are 2 best solutions below

0
Andrej Kesely On BEST ANSWER

If I understand you correctly, you can use pd.Series.isin for the task (for creating the boolean mask):

print(
    df[
        df["policy number"].isin(
            df_2.loc[df_2["policy status"] == "good", "policy number"]
        )
    ]
)

Prints:

   policy number  policy status
3             44           good
5             66           good
7             88           good
1
Ugochukwu Obinna On

Using pandas merge operation with the 'inner' join type.

import pandas as pd

# Sample data
df = pd.DataFrame({'policy number': [11, 22, 33, 44, 55, 66, 77, 88, 99],
                   'policy status': ['good', 'good', 'good', 'good', 'good', 'good', 'good', 'good', 'good']})

df_2 = pd.DataFrame({'policy number': [11, 83, 63, 44, 55, 66, 67, 88, 99, 100],
                     'policy status': ['bad', 'bad', 'good', 'good', 'bad', 'good', 'bad', 'good', 'average', 'good']})

# Merge dataframes based on policy number
merged_df = pd.merge(df, df_2, on='policy number', suffixes=('_original', '_updated'), how='inner')

# Filter rows where policy status is still good
filtered_df = merged_df[merged_df['policy status_original'] == 'good']

# Keep only the relevant columns
filtered_df = filtered_df[['policy number', 'policy status_original']]

print(filtered_df)