PySpark filter works incorrectly with one True value built using a lambda function


I was debugging a function and encountered a mysterious thing:

Given a PySpark dataframe with one column (name_id), I build another column (is_number) using a lambda function that checks whether name_id is a string made up entirely of digits.
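For reference, the construction looks roughly like this (a simplified sketch; the actual lambda in my code is equivalent to str.isdigit):

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Build is_number with a lambda UDF that checks whether every
# character of name_id is a digit
is_number_udf = F.udf(lambda s: s is not None and s.isdigit(), BooleanType())
df = df.withColumn("is_number", is_number_udf(F.col("name_id")))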

The resulting dataframe (df) looks like this:

df.show(4, False)

+-------+---------+
|name_id|is_number|
+-------+---------+
|0001   |true     |
|0002   |true     |
|0003   |true     |
|0004   |true     |
+-------+---------+
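The column is a genuine boolean, not a string, as the schema confirms (nullability flags may differ in your setup):

df.printSchema()
# root
#  |-- name_id: string (nullable = true)
#  |-- is_number: boolean (nullable = true)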

I need to count the number of True values, so I do the following:

df.where(F.col("is_number")==True).count()

3

Three?? Really? What is happening here?

It gets stranger:

df.groupBy("is_number").count().show(4, False)

+---------+-----+
|is_number|count|
+---------+-----+
|true     |4    |
+---------+-----+

It looks like all True values are the same, BUT:

df.groupBy("is_number").count().where(F.col("is_number")=="True").collect()[0]["count"]

3

Again, it looks like applying where eliminates one True value. filter behaves the same way.
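That is, this returns the same wrong count:

df.filter(F.col("is_number") == True).count()
# 3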

Additionally, I have identified which True value is the one excluded: it is the first one.

df.where(F.col("is_number")==True).show(4, False)

+-------+---------+
|name_id|is_number|
+-------+---------+
|0002   |true     |
|0003   |true     |
|0004   |true     |
+-------+---------+
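One way to confirm which row is dropped is to subtract the filtered dataframe from the original (a sketch using subtract):

df.subtract(df.where(F.col("is_number") == True)).show()
# +-------+---------+
# |name_id|is_number|
# +-------+---------+
# |0001   |true     |
# +-------+---------+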

Other things I have tried:

- Expressing True as "not False" doesn't work.
- The "true" values shown are boolean True representations, not the string "true".
- Using eqNullSafe() instead of == doesn't work either.

Sketches of these attempts are below.
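Roughly (my actual attempts were equivalent to these):

df.where(F.col("is_number") != F.lit(False)).count()   # True expressed as "not False"; still excludes the first row
df.where(F.col("is_number").eqNullSafe(True)).count()  # null-safe equality; same result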

Any ideas? This is complete nonsense to me!

Thank you in advance!
