I am new to pyspark and need to do fuzzy matching. I found that levenshtein is a native function which can do that. I have a dataframe like this:
+----------------+----------------+
|            col1|            col2|
+----------------+----------------+
|           apple|             egg|
|           mango|           bread|
|          banana|          butter|
+----------------+----------------+
And an input list like ["apples","mango"]. I need to get all the (fuzzy) matches between col1 and the input list.
I am using the following code:
from pyspark.sql.functions import array, array_join, levenshtein, lit

input_array = ["apples", "mango"]
keyword_array = array([lit(x) for x in input_array])
keyword_string = array_join(keyword_array, "|")
result_df = df.filter(levenshtein("col1", keyword_string) < 5)
levenshtein can only take parameters that are both columns, or the second one can be a string. This code does not give any error, but it does not give any result either (result_df.show() shows an empty dataframe). Can anyone let me know what I am doing wrong here, or what the best approach is to achieve this?
No rows qualify for your filter condition. keyword_string is a single literal string, "apples|mango", so levenshtein compares each col1 value against that entire string. For example, the distance between "apple" and "apples|mango" is 7, but your filter condition requires a distance < 5. Change your filter condition to

result_df = df.filter(levenshtein("col1", keyword_string) < 8)

to get your result.
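To see where the distance of 7 comes from, here is a quick check with a plain-Python edit-distance function. This is only a minimal sketch for verification: the levenshtein helper below is not part of pyspark, but it computes the same metric as Spark's built-in function.

```python
def levenshtein(s, t):
    # Classic dynamic-programming Levenshtein (edit) distance.
    # prev holds the distances for the previous row of the DP table.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

keyword_string = "apples|mango"  # what array_join(keyword_array, "|") produces
print(levenshtein("apple", keyword_string))   # 7
print(levenshtein("mango", keyword_string))   # 7
print(levenshtein("banana", keyword_string))  # 10
```

With a threshold of < 8, the apple and mango rows pass while banana (distance 10) is filtered out. Note that this only works because "apple" and "mango" happen to sit within distance 8 of the whole joined string; comparing col1 against each keyword individually avoids depending on the length of the joined string.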