How to use distinct with multiple parameters in a PySpark RDD?


I'm trying to go through my RDD and produce a new one that doesn't include repeated longitude/latitude pairs. However, I can't get distinct() to do this. I've tried .distinct(lambda station: (station[1], station[2])), but it doesn't work. Each record in the RDD holds a station name, longitude, and latitude. Sample input and the desired output are given below.

Input:

[["Station A",11.002,10.22],
["Station B",17.86,13.49],
["Station C",12.52,12.22],
["Station D",11.002,10.22]]

Output (Station D removed because its position is the same as Station A's):

[["Station A",11.002,10.22],
["Station B",17.86,13.49],
["Station C",12.52,12.22]]

As stated, I have tried: .distinct(lambda station: (station[1], station[2]))
