levenshtein from pyspark.sql.functions is very slow

Asked by curios on 15 May 2023 at 06:47

I need to do fuzzy matching and am using the levenshtein function from pyspark.sql.functions. Since it is a built-in PySpark function, I thought it would have a speed advantage over a UDF. However, it is very slow. There are at most about 341 rows in a dataframe, and I am using 5 dataframes. The input to test the words against is only around 7 words, yet the whole processing takes around 24 s. Can someone suggest a better way to do this and improve the response time?

When I do the fuzzy match after calling collect on the dataframe, it is faster. I would not expect this, because after collect the processing happens only on the master server, whereas levenshtein on the dataframe uses the data nodes for parallel processing. I am not sure why the dataframe version is slower.
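For reference, the collect-then-match approach I mean looks roughly like the sketch below. It is a minimal illustration, not my exact code: the function names levenshtein and fuzzy_match, the sample words, and the max_distance threshold are all made up for the example, and the list collected stands in for values pulled out of df.collect().

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(values, targets, max_distance=2):
    """Return (value, target, distance) for every pair within max_distance."""
    matches = []
    for value in values:
        for target in targets:
            distance = levenshtein(value, target)
            if distance <= max_distance:
                matches.append((value, target, distance))
    return matches


# Hypothetical stand-in for values taken from df.collect(); not real data.
collected = ["kitten", "sitting", "spark"]
print(fuzzy_match(collected, ["sitten", "sparks"], max_distance=1))
```

With at most about 341 rows and around 7 input words, this is at most roughly 341 × 7 = 2,387 comparisons per dataframe, all done in a single Python process.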


