Python dfply: unable to mask on multiple conditions

I am an R user learning how to use Python's dfply, the Python equivalent to R's dplyr. My problem: in dfply, I am unable to mask on multiple conditions in a pipe. I seek a solution involving dfply pipes rather than multiple lines of subsetting.

My code:

# Import
import pandas as pd
import numpy as np
from dfply import *

# Create data frame and mask it
df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
        mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)

Here is the original data frame, df:

       a    b    c
    0  NaN  6.0  5
    1  2.0  7.0  4
    2  3.0  8.0  3
    3  4.0  9.0  2
    4  5.0  NaN  1

And here is the result of the piped mask, df2:

         a    b    c
      0  NaN  6.0  5
      4  5.0  NaN  1

However, I expect this instead:

         a    b    c
      0  NaN  6.0  5
      1  2.0  7.0  4
      2  3.0  8.0  3
      3  4.0  9.0  2

Why don't the "|" and "~" operators select the rows in which either column "a" is NaN or column "b" is not NaN?
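For comparison, here is a minimal sketch of the same condition in plain pandas, without dfply. The names `keep` and `expected` are only illustrative. It shows which rows `|` and `~` select when they are applied to ordinary boolean Series, which suggests the discrepancy comes from how the expression is evaluated inside the dfply pipe rather than from pandas itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# Boolean Series: True where 'a' is NaN or 'b' is not NaN
keep = df['a'].isnull() | ~df['b'].isnull()

# Plain boolean indexing keeps rows 0 through 3
expected = df[keep]
print(expected)
```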

By the way, I also tried np.logical_or():

df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
        mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
print(df)
print(df2)

But this resulted in an error:

mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__

There are 2 best solutions below

CurlyW On

Edit: Change the second condition to use notnull(), i.e. X.b.notnull() instead of ~(X.b.isnull()). I have no idea why the tilde is ignored after the pipe.

df  = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >> mask((X.a.isnull()) | (X.b.notnull())))

print(df2)

     a    b  c
0  NaN  6.0  5
1  2.0  7.0  4
2  3.0  8.0  3
3  4.0  9.0  2
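As a sanity check on this workaround, the sketch below uses plain pandas only (no dfply) to confirm that `notnull()` is the element-wise negation of `isnull()`, so swapping it in does not change the intended condition. It also shows a pandas-native way to keep the filter inside a method chain by passing a callable to `.loc`. The names `same_mask` and `chained` are illustrative, and this is an alternative to the dfply pipe, not a replacement for it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# ~isnull() and notnull() produce the same boolean Series
same_mask = (~df['b'].isnull()).equals(df['b'].notnull())

# Chain-friendly filter: .loc accepts a callable that receives the frame
chained = df.loc[lambda d: d['a'].isnull() | d['b'].notnull()]
print(same_mask)
print(chained)
```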
loveactualry On

How about filter_by?

df >> filter_by((X.a.isnull()) | (X.b.notnull()))