using thresholds to delete outliers

93 Views Asked by Noori Muhammed At 16 May 2025 at 17:26

I have a dataset with a few columns. When I wanted to delete the outliers using z-score, I set low and high thresholds as follows:

low = df[columns].quantile(0.003)
high = df[columns].quantile(0.997)

Then I used the following code to delete the outliers:

df = df[(df[columns]>low).any(axis=1)]
df = df[(df[columns]<high).any(axis=1)]

However, it didn't delete anything, so I changed it to this:

df = df[~(df[columns]<low).any(axis=1)]
df = df[~(df[columns]>high).any(axis=1)]

This one worked and deleted the outliers.

I expected both versions to behave the same way, and I don't understand why the first one doesn't work. Can someone please explain why the first version fails while the second works? What is the difference?


There is 1 best solution below

mozway On 25 October 2023 at 08:43

(df[columns]>low).any(axis=1) and ~(df[columns]<low).any(axis=1) are not the same operation.

If you reverse the comparison and negate the result, you must also swap ANY and ALL, per De Morgan's laws (ANY is equivalent to the boolean OR, and ALL to AND).

In your case, since ~(df[columns]<low).any(axis=1) gives the correct output, you should have used ALL:

(df[columns]>low).all(axis=1)
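
To make the ANY/ALL difference concrete, here is a minimal sketch on made-up data. The DataFrame, the column names, and the fixed `high` thresholds are assumptions for illustration only, not the asker's data. Only one column of one row is extreme, so `any` keeps every row while `all` drops the outlier row:

```python
import pandas as pd

# Hypothetical toy data: row 2 holds an extreme value in column "a" only.
df = pd.DataFrame({"a": [10, 11, 1000, 12], "b": [5, 6, 7, 8]})
columns = ["a", "b"]
# Assumed per-column upper thresholds (stand-ins for the quantile result).
high = pd.Series({"a": 100, "b": 100})

# True if at least one column is below its threshold.
any_mask = (df[columns] < high).any(axis=1)
# True only if every column is below its threshold.
all_mask = (df[columns] < high).all(axis=1)
# De Morgan equivalent of all_mask (there are no NaNs in this data).
neg_mask = ~(df[columns] >= high).any(axis=1)

print(any_mask.tolist())  # [True, True, True, True]
print(all_mask.tolist())  # [True, True, False, True]
print(neg_mask.tolist())  # [True, True, False, True]
```

With `any`, the outlier row survives because its "b" value is still below the threshold, which would explain why the first attempt removed nothing.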

Also, keep in mind the role of NaNs. If you have NaNs, both NaN>low and NaN<low evaluate to False, so ~(NaN<low) is True while (NaN>low) is False: the two are not the same.
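
A small sketch of the NaN behaviour, using a made-up Series and an assumed threshold `low = 2.0`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 5.0])  # hypothetical values
low = 2.0                          # assumed threshold

gt = s > low         # the NaN entry compares as False
not_lt = ~(s < low)  # NaN < low is False, so its negation is True

print(gt.tolist())      # [False, False, True]
print(not_lt.tolist())  # [False, True, True]
```

The two masks disagree only on the NaN entry: the negated form keeps it, the direct form drops it.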

Finally, don't forget that the opposite of > is <=.
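
The boundary case can be checked the same way. In this sketch (made-up values, assumed threshold `low = 2`), a value exactly equal to the threshold is treated differently by `>` and by the negation of `<`:

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # hypothetical values; 2 sits exactly on the threshold
low = 2                   # assumed threshold

strict = s > low           # excludes the boundary value
not_less = ~(s < low)      # same as s >= low: includes the boundary value
not_less_eq = ~(s <= low)  # the exact negation counterpart of s > low

print(strict.tolist())       # [False, False, True]
print(not_less.tolist())     # [False, True, True]
print(not_less_eq.tolist())  # [False, False, True]
```

This can matter with quantile-based thresholds, since a threshold may coincide with an actual value in the data.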

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [1, 2, 1, 2],
                   [3, 4, 3, 4]
                  ])

# ANY(x > 2) per row equals NOT ALL(x <= 2) per row
np.array_equal((df>2).any(axis=1), ~(df<=2).all(axis=1))

# True
