Using thresholds to delete outliers

I have a dataset with a few columns. When I wanted to delete the outliers using z-score, I set low and high thresholds as follows:

low = df[columns].quantile(0.003)
high = df[columns].quantile(0.997)

Then I used the following code to delete the outliers:

df = df[(df[columns]>low).any(axis=1)]
df = df[(df[columns]<high).any(axis=1)]

However, it didn't delete anything, so I changed it to this:

df = df[~(df[columns]<low).any(axis=1)]
df = df[~(df[columns]>high).any(axis=1)]

This one worked, and deleted the outliers.

I expected both to work the same way, and I don't understand why the first one doesn't work. Can someone please explain why the first version fails while the second one works? What is the difference between them?

There is 1 best solution below

(df[columns]>low).any(axis=1) and ~(df[columns]<low).any(axis=1) are not the same operation.

If you reverse and negate the condition, you should swap ANY/ALL according to De Morgan's law (ANY is equivalent to the boolean OR, and ALL to AND).

In your case, since ~(df[columns]<low).any(axis=1) is giving the correct output you should have used ALL:

(df[columns]>low).all(axis=1)
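
To make the difference concrete, here is a minimal sketch on made-up data. The DataFrame values and the hand-picked `low` thresholds are assumptions for illustration only (in the question, `low` comes from `quantile`). It shows that `.any` keeps a row as soon as one column is above the threshold, while `.all` matches the negated form.

```python
import pandas as pd

# Hypothetical data: row 1 has a low outlier in "a", row 2 has one in "b".
df = pd.DataFrame({"a": [5, -100, 6, 5], "b": [7, 8, -100, 7]})
columns = ["a", "b"]
# Hand-picked per-column thresholds, standing in for df[columns].quantile(...)
low = pd.Series({"a": 0, "b": 0})

# True if at least one column is above its threshold
keep_any = (df[columns] > low).any(axis=1)
# True only if every column is above its threshold
keep_all = (df[columns] > low).all(axis=1)
# True if no column is below its threshold
keep_not_any = ~(df[columns] < low).any(axis=1)

print(keep_any.tolist())      # [True, True, True, True]
print(keep_all.tolist())      # [True, False, False, True]
print(keep_not_any.tolist())  # [True, False, False, True]
```

With `.any`, the rows containing an outlier survive because their other column is still above the threshold, which is why nothing was deleted.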

Also, keep in mind the role of NaNs. If you have NaNs, both comparisons NaN>low and NaN<low evaluate to False. Thus ~(NaN<low) is True while (NaN>low) is False, so the two are not the same.
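
A small sketch of the NaN behaviour, using a made-up column and a threshold of 0 (both are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one low outlier
df = pd.DataFrame({"a": [5.0, np.nan, -100.0]})
low = 0

gt = df["a"] > low         # NaN > 0 is False
not_lt = ~(df["a"] < low)  # NaN < 0 is False, so its negation is True

print(gt.tolist())      # [True, False, False]
print(not_lt.tolist())  # [True, True, False]
```

So the negated form keeps the NaN row and the direct comparison drops it. If that matters for your data, handling NaNs explicitly first (for example with `dropna` or `fillna`) avoids the ambiguity.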

Finally, don't forget that the opposite of > is <=.
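
The boundary case can be checked quickly on a value that sits exactly on the threshold. The series and the threshold below are made up for illustration:

```python
import pandas as pd

# Hypothetical values; 2 sits exactly on the threshold
s = pd.Series([1, 2, 3])
threshold = 2

strict_gt = s > threshold       # the original condition
not_lt = ~(s < threshold)       # negating "<" also lets the boundary value through
not_le = ~(s <= threshold)      # negating "<=" reproduces ">"

print(strict_gt.tolist())  # [False, False, True]
print(not_lt.tolist())     # [False, True, True]
print(not_le.tolist())     # [False, False, True]
```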

Example:

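import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4],
                   [1, 2, 1, 2],
                   [3, 4, 3, 4]
                  ])

# "any value > 2" is the negation of "all values <= 2" (De Morgan)
np.array_equal((df>2).any(axis=1), ~(df<=2).all(axis=1))

# True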