I have a dataset with a few columns. When I wanted to delete the outliers using z-score, I set low and high thresholds for it as follows:
low = df[columns].quantile(0.003)
high = df[columns].quantile(0.997)
Then I used the following code to delete the outliers:
df = df[(df[columns]>low).any(axis=1)]
df = df[(df[columns]<high).any(axis=1)]
However, it didn't delete anything, so I changed it to this:
df = df[~(df[columns]<low).any(axis=1)]
df = df[~(df[columns]>high).any(axis=1)]
This one worked, and deleted the outliers.
I expected both to work the same way, and I don't understand why the first one doesn't. Can someone please explain why the first version doesn't work and the second one does? What is the difference?
(df[columns]>low).any(axis=1)
and
~(df[columns]<low).any(axis=1)
are not the same operation. If you reverse and negate the condition, you should also swap ANY/ALL, according to De Morgan's law (ANY is equivalent to the boolean OR, and ALL to AND).
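To make the ANY/ALL swap concrete, here is a minimal sketch on a made-up three-row frame. The column names and the scalar threshold are assumptions for illustration only (in the question, low is a per-column Series from quantile, but the logic is the same).

```python
import pandas as pd

# Hypothetical toy data; column names and threshold are made up for illustration.
df = pd.DataFrame({"a": [1, 5, 9], "b": [9, 5, 1]})
low = 3

any_above = (df > low).any(axis=1)      # at least one value in the row is above low
none_below = ~(df < low).any(axis=1)    # no value in the row is below low
all_at_least = (df >= low).all(axis=1)  # every value in the row is at or above low

print(any_above.tolist())     # [True, True, True]
print(none_below.tolist())    # [False, True, False]
print(all_at_least.tolist())  # [False, True, False]
```

The ANY version keeps every row because each row has at least one value above the threshold, which is why nothing was deleted. The negated ANY and the ALL version agree here because the toy data contains no NaNs.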
In your case, since
~(df[columns]<low).any(axis=1)
is giving the correct output, the first version should have used ALL instead of ANY.
Also, keep in mind the role of NaNs. If you have NaNs, then both comparisons
NaN>low
and
NaN<low
give the same result, False. Thus
~(NaN<low)
which is True, is not the same as
(NaN>low)
which is False.
Finally, don't forget that the opposite of
>
is
<=
and not
<
Example:
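A minimal sketch of the NaN and <= points, assuming a small made-up frame with one extreme value and one missing value. The data and the per-column thresholds in high are hypothetical and are not taken from the question's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one extreme value in "x" and one missing value in "y".
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [10.0, np.nan, 30.0, 40.0, 50.0],
})
columns = ["x", "y"]
high = pd.Series({"x": 50.0, "y": 60.0})  # assumed per-column upper thresholds

# Comparisons with NaN are always False, so these two masks differ on the NaN row.
no_value_above = ~(df[columns] > high).any(axis=1)
all_at_most = (df[columns] <= high).all(axis=1)

print(no_value_above.tolist())  # [True, True, True, True, False]
print(all_at_most.tolist())     # [True, False, True, True, False]

# To get the same rows from ALL, let NaN pass the test explicitly.
all_at_most_or_nan = ((df[columns] <= high) | df[columns].isna()).all(axis=1)
print(all_at_most_or_nan.equals(no_value_above))  # True
```

Which mask is appropriate depends on whether rows containing NaN should be kept: the negated ANY form keeps them, while the plain ALL form drops them.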