Levenshtein on dataframe column and input list


New to pyspark and I need to do fuzzy matching. I found that levenshtein is a native function which can do that. I have a dataframe like this:

+------+------+
|  col1|  col2|
+------+------+
| apple|   egg|
| mango| bread|
|banana|butter|
+------+------+

And an input list like ["apples","mango"]. I need to get all the (fuzzy) matches between col1 and the input list.

I am using the following code:

from pyspark.sql.functions import array, array_join, levenshtein, lit

input_array = ["apples", "mango"]
keyword_array = array([lit(x) for x in input_array])
keyword_string = array_join(keyword_array, "|")
result_df = df.filter(levenshtein("col1", keyword_string) < 5)

From what I understand, levenshtein can either take both parameters as columns, or the second one can be a string. This code does not give any error, but it also does not give any result (result_df.show() shows an empty dataframe). Can anyone tell me what I am doing wrong here, or what the best approach is to achieve this?

1 Answer (by notNull)

No rows qualify for your filter condition.

Example:

from pyspark.sql.functions import array, array_join, levenshtein, lit

df = spark.createDataFrame([('apple','egg'),('mango','bread'),('banana','butter')],['col1','col2'])
input_array = ["apples","mango"]

keyword_array = array([lit(x) for x in input_array])
keyword_string = array_join(keyword_array, "|")
df.withColumn("keyword_tmp", keyword_string).\
   withColumn('distance', levenshtein("col1", "keyword_tmp")).show(10,False)
#+------+------+------------+--------+
#|col1  |col2  |keyword_tmp |distance|
#+------+------+------------+--------+
#|apple |egg   |apples|mango|7       |
#|mango |bread |apples|mango|7       |
#|banana|butter|apples|mango|10      |
#+------+------+------------+--------+

As shown above, the distance is 7 even for the closest rows, because levenshtein is comparing col1 against the whole joined string apples|mango, while your filter condition requires a distance of <5.
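If you want to sanity-check that number directly, here is a minimal sketch using Spark SQL's built-in levenshtein (assuming an active spark session):

# "apple" is a prefix of "apples|mango", so the distance is the 7 extra characters
spark.sql("SELECT levenshtein('apple', 'apples|mango') AS distance").show()
# +--------+
# |distance|
# +--------+
# |       7|
# +--------+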

Change your filter condition to result_df = df.filter(levenshtein("col1", keyword_string)<8) to get your result.
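Putting it together, a minimal sketch of that fix, reusing the df, input_array, and keyword_string from the example above (based on the distance column shown earlier, the apple and mango rows pass the <8 threshold, while banana, at distance 10, does not):

from pyspark.sql.functions import array, array_join, levenshtein, lit

input_array = ["apples", "mango"]
keyword_string = array_join(array([lit(x) for x in input_array]), "|")

# distances against "apples|mango" are 7, 7 and 10, so <8 keeps the apple and mango rows
result_df = df.filter(levenshtein("col1", keyword_string) < 8)
result_df.show()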