levenshtein from pyspark.sql.functions is very slow

I need to do fuzzy matching and am using the PySpark levenshtein function. Since it is a built-in PySpark function, I expected it to have a speed advantage over a UDF. However, it is very slow. There are about 341 rows (max) in a DataFrame, and I am using 5 DataFrames. The input to test the words against is only around 7 words, yet the whole processing takes around 24 s. Can someone suggest a better way to do this and improve the response time?

When I do the fuzzy match after calling collect on the DataFrame, it is faster. I did not expect this: after collect, the processing happens only on the master server, whereas levenshtein on a DataFrame uses the data nodes for parallel processing. I am not sure why the DataFrame version is slower.
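
For reference, here is a minimal sketch of the collect-then-match variant described above. It is only an illustration under stated assumptions: the function names (`levenshtein`, `fuzzy_match`), the `max_distance` threshold, and the sample words are hypothetical, and a plain Python list stands in for the values that would come back from collect.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    # Keep the shorter string in the inner loop so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(candidates: List[str], inputs: List[str],
                max_distance: int = 2) -> List[Tuple[str, str, int]]:
    """Return (input, candidate, distance) for every pair within max_distance."""
    matches = []
    for word in inputs:
        for cand in candidates:
            d = levenshtein(word, cand)
            if d <= max_distance:
                matches.append((word, cand, d))
    return matches


# Stand-in for collected rows; with a real DataFrame this list might come from
# something like [row["name"] for row in df.select("name").collect()],
# where "name" is an assumed column name.
collected = ["apple", "maple", "banana"]
print(fuzzy_match(collected, ["appel"], max_distance=2))
```

With roughly 341 rows and 7 input words, this is only a few thousand distance computations per DataFrame. One possible explanation for the timing difference, which I have not verified, is that fixed Spark overhead such as job scheduling outweighs the actual distance computation at this data size.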
