PySpark filter works incorrectly with one True value built using a lambda function


I was debugging a function and encountered a mysterious thing:

Given a PySpark dataframe with one column (name_id), I build another column (is_number) using a lambda function that checks whether name_id is a string made up entirely of digits.
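For reference, the construction looks roughly like this (a simplified sketch; the actual lambda in my code is equivalent to str.isdigit):

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Build is_number with a lambda UDF that checks whether every
# character of name_id is a digit
is_number_udf = F.udf(lambda s: s is not None and s.isdigit(), BooleanType())
df = df.withColumn("is_number", is_number_udf(F.col("name_id")))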

The resulting dataframe (df) looks like this:

df.show(4, False)

+-------+---------+
|name_id|is_number|
+-------+---------+
|0001   |true     |
|0002   |true     |
|0003   |true     |
|0004   |true     |
+-------+---------+
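The column is a genuine boolean, not a string, as the schema confirms (nullability flags may differ in your setup):

df.printSchema()
# root
#  |-- name_id: string (nullable = true)
#  |-- is_number: boolean (nullable = true)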

I need to count the number of True values, so I do the following:

df.where(F.col("is_number")==True).count()

3

Three?? Really? What is happening here?

It gets stranger:

df.groupBy("is_number").count().show(4, False)

+---------+-----+
|is_number|count|
+---------+-----+
|true     |4    |
+---------+-----+

It looks like all True values are the same, BUT:

df.groupBy("is_number").count().where(F.col("is_number")=="True").collect()[0]["count"]

3

Again, it looks like applying where eliminates one True value. filter behaves the same way.
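That is, this returns the same wrong count:

df.filter(F.col("is_number") == True).count()
# 3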

Additionally, I have identified which True value is the one excluded: it is the first one.

df.where(F.col("is_number")==True).show(4, False)

+-------+---------+
|name_id|is_number|
+-------+---------+
|0002   |true     |
|0003   |true     |
|0004   |true     |
+-------+---------+
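One way to confirm which row is dropped is to subtract the filtered dataframe from the original (a sketch using subtract):

df.subtract(df.where(F.col("is_number") == True)).show()
# +-------+---------+
# |name_id|is_number|
# +-------+---------+
# |0001   |true     |
# +-------+---------+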

Other things I have tried:

- Expressing True as "not False" doesn't work.
- The "true" values shown are boolean True representations, not the string "true".
- Using eqNullSafe() instead of == doesn't work either.

Sketches of these attempts are below.
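Roughly (my actual attempts were equivalent to these):

df.where(F.col("is_number") != F.lit(False)).count()   # True expressed as "not False"; still excludes the first row
df.where(F.col("is_number").eqNullSafe(True)).count()  # null-safe equality; same result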

Any ideas? This is complete nonsense to me!

Thank you in advance!
