Filter df to remove rows where a column has a word/phrase

35 Views Asked by At

I have a list of words and phrases which I want to use to specify which rows to remove when creating a new dataframe.

include = ['word1', 'word2', 'word3'...]
exclude = ['word4', 'word5 word6' ...]

So far I have basically tried:

new_df = []

for word in include:
    valid = df['Message'].str.contains(word)
    count = 0
    for item in valid:
        if item:
            temp.append(df.iloc[count])
        count += 1

Then I remove the extras using temp = pd.DataFrame(temp) and temp = temp.drop_duplicates.

This gives me the included dataframe but then I want to further specify that if they contain any word from excluded that the row should be dropped. I am not very skilled with Pandas and have tried the same concept just with ~ for valid and .drop() instead of .append() but the rows still remain, how could I go about this?

1

There are 1 best solutions below

1
bpbutti On

I assume the words will be embedded in a bigger text, if not, you should consider the Pandas method df['Message'].isin(list_of_words)

For the first case you can either do it with regex patterns or without regex patterns and use reduce to combine all the masks (lists of boolean values)

Example without regex pattern

import pandas as pd
from operator import or_
from functools import reduce

df = pd.DataFrame(
    {
        'Message': [
            'word1 aaa',
            'word1 word4',
            'word4',
            'word3aaa'
        ]
    }
)

include_words_list = ['word1', 'word2', 'word3']
exclude_words_list = ['word4', 'word5', 'word6']

include_rows_mask = reduce(or_, [df['Message'].str.contains(word, regex=False) for word in include_words_list])
exclude_rows_mask = reduce(or_, [df['Message'].str.contains(word, regex=False) for word in exclude_words_list])

selected_rows_mask = (include_rows_mask) & ~(exclude_rows_mask)

df_selected = df.loc[selected_rows_mask, :]

Example with regex pattern

import pandas as pd

df = pd.DataFrame(
    {
        'Message': [
            'word1 aaa',
            'word1 word4',
            'word4',
            'word3aaa'
        ]
    }
)

include_words_pattern= '(word1|word2|word3)'
exclude_words_pattern = '(word4|word5|word6)'

include_rows_mask = df['Message'].str.contains(include_words_pattern, regex=True)
exclude_rows_mask = df['Message'].str.contains(exclude_words_pattern, regex=True)

selected_rows_mask = (include_rows_mask) & ~(exclude_rows_mask)

df_selected = df.loc[selected_rows_mask, :]