I run into problems when calling Spark's MinHashLSH's approxSimilarityJoin on a dataframe of (name_id, name) combinations.
A summary of the problem I try to solve:
I have a dataframe of around 30 million unique (name_id, name) combinations for company names. Some of those names refer to the same company, but are (i) either misspelled, and/or (ii) include additional names. Performing fuzzy string matching for every combination is not possible. To reduce the number of fuzzy string matching combinations, I use MinHashLSH in Spark. My intended approach is to use a approxSimilarityJoin (self-join) with a relatively large Jaccard threshold, such that I am able to run a fuzzy matching algorithm on the matched combinations to further improve the disambiguation.
A summary of the steps I took:
- Used CountVectorizer to create a vector of character counts for every name,
- Used MinHashLSH and its approxSimilarityJoin with the following settings:
- numHashTables=100
- threshold=0.3 (Jaccard threshold for approxSimilarityJoin)
- After the approxSimilarityJoin, I remove duplicate combinations (for which holds that there exists a matched combination (i,j) and (j,i), then I remove (j,i))
- After removing the duplicate combinations, I run a fuzzy string matching algorithm using the FuzzyWuzzy package to reduce the number of records and improve the disambiguation of the names.
- Eventually I run a connectedComponents algorithm on the remaining edges (i,j) to match which company names belong together.
Part of code used:
id_col = 'id'
name_col = 'name'
num_hastables = 100
max_jaccard = 0.3
fuzzy_threshold = 90
fuzzy_method = fuzz.token_set_ratio
# Calculate edges using minhash practices
edges = MinHashLSH(inputCol='vectorized_char_lst', outputCol='hashes', numHashTables=num_hastables).\
fit(data).\
approxSimilarityJoin(data, data, max_jaccard).\
select(col('datasetA.'+id_col).alias('src'),
col('datasetA.clean').alias('src_name'),
col('datasetB.'+id_col).alias('dst'),
col('datasetB.clean').alias('dst_name')).\
withColumn('comb', sort_array(array(*('src', 'dst')))).\
dropDuplicates(['comb']).\
rdd.\
filter(lambda x: fuzzy_method(x['src_name'], x['dst_name']) >= fuzzy_threshold if x['src'] != x['dst'] else False).\
toDF().\
drop(*('src_name', 'dst_name', 'comb'))
Explain plan of edges
== Physical Plan ==
*(5) HashAggregate(keys=[datasetA#232, datasetB#263], functions=[])
+- Exchange hashpartitioning(datasetA#232, datasetB#263, 200)
+- *(4) HashAggregate(keys=[datasetA#232, datasetB#263], functions=[])
+- *(4) Project [datasetA#232, datasetB#263]
+- *(4) BroadcastHashJoin [entry#233, hashValue#234], [entry#264, hashValue#265], Inner, BuildRight, (UDF(datasetA#232.vectorized_char_lst, datasetB#263.vectorized_char_lst) < 0.3)
:- *(4) Project [named_struct(id, id#10, name, name#11, clean, clean#90, char_lst, char_lst#95, vectorized_char_lst, vectorized_char_lst#107, hashes, hashes#225) AS datasetA#232, entry#233, hashValue#234]
: +- *(4) Filter isnotnull(hashValue#234)
: +- Generate posexplode(hashes#225), [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, hashes#225], false, [entry#233, hashValue#234]
: +- *(1) Project [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, UDF(vectorized_char_lst#107) AS hashes#225]
: +- InMemoryTableScan [char_lst#95, clean#90, id#10, name#11, vectorized_char_lst#107]
: +- InMemoryRelation [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *(4) Project [id#10, name#11, pythonUDF0#114 AS clean#90, pythonUDF2#116 AS char_lst#95, UDF(pythonUDF2#116) AS vectorized_char_lst#107]
: +- BatchEvalPython [<lambda>(name#11), <lambda>(<lambda>(name#11)), <lambda>(<lambda>(name#11))], [id#10, name#11, pythonUDF0#114, pythonUDF1#115, pythonUDF2#116]
: +- SortAggregate(key=[name#11], functions=[first(id#10, false)])
: +- *(3) Sort [name#11 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(name#11, 200)
: +- SortAggregate(key=[name#11], functions=[partial_first(id#10, false)])
: +- *(2) Sort [name#11 ASC NULLS FIRST], false, 0
: +- Exchange RoundRobinPartitioning(8)
: +- *(1) Filter AtLeastNNulls(n, id#10,name#11)
: +- *(1) FileScan csv [id#10,name#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:<path>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, int, false], input[2, vector, true]))
+- *(3) Project [named_struct(id, id#10, name, name#11, clean, clean#90, char_lst, char_lst#95, vectorized_char_lst, vectorized_char_lst#107, hashes, hashes#256) AS datasetB#263, entry#264, hashValue#265]
+- *(3) Filter isnotnull(hashValue#265)
+- Generate posexplode(hashes#256), [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, hashes#256], false, [entry#264, hashValue#265]
+- *(2) Project [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107, UDF(vectorized_char_lst#107) AS hashes#256]
+- InMemoryTableScan [char_lst#95, clean#90, id#10, name#11, vectorized_char_lst#107]
+- InMemoryRelation [id#10, name#11, clean#90, char_lst#95, vectorized_char_lst#107], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(4) Project [id#10, name#11, pythonUDF0#114 AS clean#90, pythonUDF2#116 AS char_lst#95, UDF(pythonUDF2#116) AS vectorized_char_lst#107]
+- BatchEvalPython [<lambda>(name#11), <lambda>(<lambda>(name#11)), <lambda>(<lambda>(name#11))], [id#10, name#11, pythonUDF0#114, pythonUDF1#115, pythonUDF2#116]
+- SortAggregate(key=[name#11], functions=[first(id#10, false)])
+- *(3) Sort [name#11 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#11, 200)
+- SortAggregate(key=[name#11], functions=[partial_first(id#10, false)])
+- *(2) Sort [name#11 ASC NULLS FIRST], false, 0
+- Exchange RoundRobinPartitioning(8)
+- *(1) Filter AtLeastNNulls(n, id#10,name#11)
+- *(1) FileScan csv [id#10,name#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:<path>, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
How data looks:
+-------+--------------------+--------------------+--------------------+--------------------+
| id| name| clean| char_lst| vectorized_char_lst|
+-------+--------------------+--------------------+--------------------+--------------------+
|3633038|MURATA MACHINERY LTD| MURATA MACHINERY|[M, U, R, A, T, A...|(33,[0,1,2,3,4,5,...|
|3632811|SOCIETE ANONYME D...|SOCIETE ANONYME D...|[S, O, C, I, E, T...|(33,[0,1,2,3,4,5,...|
|3632655|FUJIFILM CORPORATION| FUJIFILM|[F, U, J, I, F, I...|(33,[3,10,12,13,2...|
|3633318|HEINE OPTOTECHNIK...|HEINE OPTOTECHNIK...|[H, E, I, N, E, ...|(33,[0,1,2,3,4,5,...|
|3633523|SUNBEAM PRODUCTS INC| SUNBEAM PRODUCTS|[S, U, N, B, E, A...|(33,[0,1,2,4,5,6,...|
|3633300| HIVAL LTD| HIVAL| [H, I, V, A, L]|(33,[2,3,10,11,21...|
|3632657| NSK LTD| NSK| [N, S, K]|(33,[5,6,16],[1.0...|
|3633240|REHABILITATION IN...|REHABILITATION IN...|[R, E, H, A, B, I...|(33,[0,1,2,3,4,5,...|
|3632732|STUDIENGESELLSCHA...|STUDIENGESELLSCHA...|[S, T, U, D, I, E...|(33,[0,1,2,3,4,5,...|
|3632866|ENERGY CONVERSION...|ENERGY CONVERSION...|[E, N, E, R, G, Y...|(33,[0,1,3,5,6,7,...|
|3632895|ERGENICS POWER SY...|ERGENICS POWER SY...|[E, R, G, E, N, I...|(33,[0,1,3,4,5,6,...|
|3632897| MOLI ENERGY LIMITED| MOLI ENERGY|[M, O, L, I, , E...|(33,[0,1,3,5,7,8,...|
|3633275| NORDSON CORPORATION| NORDSON|[N, O, R, D, S, O...|(33,[5,6,7,8,14],...|
|3633256| PEROXIDCHEMIE GMBH| PEROXIDCHEMIE|[P, E, R, O, X, I...|(33,[0,3,7,8,9,11...|
|3632695| POWER CELL INC| POWER CELL|[P, O, W, E, R, ...|(33,[0,1,7,8,9,10...|
|3633037| ERGENICS INC| ERGENICS|[E, R, G, E, N, I...|(33,[0,3,5,6,8,9,...|
|3632878| FORD MOTOR COMPANY| FORD MOTOR|[F, O, R, D, , M...|(33,[1,4,7,8,13,1...|
|3632573| SAFT AMERICA INC| SAFT AMERICA|[S, A, F, T, , A...|(33,[0,1,2,3,4,6,...|
|3632852|ALCAN INTERNATION...| ALCAN INTERNATIONAL|[A, L, C, A, N, ...|(33,[0,1,2,3,4,5,...|
|3632698| KRUPPKOPPERS GMBH| KRUPPKOPPERS|[K, R, U, P, P, K...|(33,[0,6,7,8,12,1...|
|3633150|ALCAN INTERNATION...| ALCAN INTERNATIONAL|[A, L, C, A, N, ...|(33,[0,1,2,3,4,5,...|
|3632761|AMERICAN TELEPHON...|AMERICAN TELEPHON...|[A, M, E, R, I, C...|(33,[0,1,2,3,4,5,...|
|3632757|HITACHI KOKI COMP...| HITACHI KOKI|[H, I, T, A, C, H...|(33,[1,2,3,4,7,9,...|
|3632836|HUGHES AIRCRAFT C...| HUGHES AIRCRAFT|[H, U, G, H, E, S...|(33,[0,1,2,3,4,6,...|
|3633152| SOSY INC| SOSY| [S, O, S, Y]|(33,[6,7,18],[2.0...|
|3633052|HAMAMATSU PHOTONI...|HAMAMATSU PHOTONI...|[H, A, M, A, M, A...|(33,[1,2,3,4,5,6,...|
|3633450| AKZO NOBEL NV| AKZO NOBEL|[A, K, Z, O, , N...|(33,[0,1,2,5,7,10...|
|3632713| ELTRON RESEARCH INC| ELTRON RESEARCH|[E, L, T, R, O, N...|(33,[0,1,2,4,5,6,...|
|3632533|NEC ELECTRONICS C...| NEC ELECTRONICS|[N, E, C, , E, L...|(33,[0,1,3,4,5,6,...|
|3632562| TARGETTI SANKEY SPA| TARGETTI SANKEY SPA|[T, A, R, G, E, T...|(33,[0,1,2,3,4,5,...|
+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 30 rows
Hardware used:
- Master node: m5.2xlarge 8 vCore, 32 GiB memory, EBS only storage EBS Storage:128 GiB
- Slave nodes (10x): m5.4xlarge 16 vCore, 64 GiB memory, EBS only storage EBS Storage:500 GiB
Spark-submit settings used:
spark-submit --master yarn --conf "spark.executor.instances=40" --conf "spark.default.parallelism=640" --conf "spark.shuffle.partitions=2000" --conf "spark.executor.cores=4" --conf "spark.executor.memory=14g" --conf "spark.driver.memory=14g" --conf "spark.driver.maxResultSize=14g" --conf "spark.dynamicAllocation.enabled=false" --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 run_disambiguation.py
Task errors from Web UI
ExecutorLostFailure (executor 21 exited caused by one of the running tasks) Reason: Slave lost
ExecutorLostFailure (executor 31 exited unrelated to the running tasks) Reason: Container marked as failed: container_1590592506722_0001_02_000002 on host: ip-172-31-47-180.eu-central-1.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node.
(Part of) executor logs:
20/05/27 16:29:09 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (25 times so far)
20/05/27 16:29:13 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (26 times so far)
20/05/27 16:29:15 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (28 times so far)
20/05/27 16:29:17 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (0 time so far)
20/05/27 16:29:28 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (27 times so far)
20/05/27 16:29:28 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (26 times so far)
20/05/27 16:29:33 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (29 times so far)
20/05/27 16:29:38 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (1 time so far)
20/05/27 16:29:42 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (27 times so far)
20/05/27 16:29:46 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (28 times so far)
20/05/27 16:29:53 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (30 times so far)
20/05/27 16:29:57 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (2 times so far)
20/05/27 16:30:00 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (28 times so far)
20/05/27 16:30:05 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (29 times so far)
20/05/27 16:30:10 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (31 times so far)
20/05/27 16:30:15 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (3 times so far)
20/05/27 16:30:19 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (29 times so far)
20/05/27 16:30:22 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (30 times so far)
20/05/27 16:30:29 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (32 times so far)
20/05/27 16:30:32 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (4 times so far)
20/05/27 16:30:39 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (31 times so far)
20/05/27 16:30:39 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (30 times so far)
20/05/27 16:30:46 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (33 times so far)
20/05/27 16:30:47 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (5 times so far)
20/05/27 16:30:55 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (32 times so far)
20/05/27 16:30:59 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (31 times so far)
20/05/27 16:31:03 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (34 times so far)
20/05/27 16:31:06 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (6 times so far)
20/05/27 16:31:13 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (33 times so far)
20/05/27 16:31:14 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (32 times so far)
20/05/27 16:31:22 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (35 times so far)
20/05/27 16:31:24 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (7 times so far)
20/05/27 16:31:30 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (34 times so far)
20/05/27 16:31:32 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (33 times so far)
20/05/27 16:31:41 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (36 times so far)
20/05/27 16:31:44 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (8 times so far)
20/05/27 16:31:47 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (35 times so far)
20/05/27 16:31:48 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (34 times so far)
20/05/27 16:32:02 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (37 times so far)
20/05/27 16:32:03 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (9 times so far)
20/05/27 16:32:04 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (36 times so far)
20/05/27 16:32:08 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (35 times so far)
20/05/27 16:32:19 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (38 times so far)
20/05/27 16:32:20 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (37 times so far)
20/05/27 16:32:21 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (10 times so far)
20/05/27 16:32:26 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (36 times so far)
20/05/27 16:32:37 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (39 times so far)
20/05/27 16:32:37 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (11 times so far)
20/05/27 16:32:38 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (38 times so far)
20/05/27 16:32:45 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (37 times so far)
20/05/27 16:32:51 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (40 times so far)
20/05/27 16:32:56 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (12 times so far)
20/05/27 16:32:58 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (39 times so far)
20/05/27 16:33:03 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (38 times so far)
20/05/27 16:33:08 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (41 times so far)
20/05/27 16:33:13 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (13 times so far)
20/05/27 16:33:15 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (40 times so far)
20/05/27 16:33:20 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (39 times so far)
20/05/27 16:33:26 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1988.0 MB to disk (42 times so far)
20/05/27 16:33:30 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (41 times so far)
20/05/27 16:33:31 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (14 times so far)
20/05/27 16:33:36 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (40 times so far)
20/05/27 16:33:46 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1992.0 MB to disk (43 times so far)
20/05/27 16:33:47 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1988.0 MB to disk (42 times so far)
20/05/27 16:33:51 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (15 times so far)
20/05/27 16:33:54 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (41 times so far)
20/05/27 16:34:03 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1992.0 MB to disk (43 times so far)
20/05/27 16:34:04 INFO ShuffleExternalSorter: Thread 146 spilling sort data of 1992.0 MB to disk (44 times so far)
20/05/27 16:34:08 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (16 times so far)
20/05/27 16:34:14 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1988.0 MB to disk (42 times so far)
20/05/27 16:34:16 INFO PythonUDFRunner: Times: total = 774701, boot = 3, init = 10, finish = 774688
20/05/27 16:34:21 INFO ShuffleExternalSorter: Thread 147 spilling sort data of 1992.0 MB to disk (44 times so far)
20/05/27 16:34:22 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (17 times so far)
20/05/27 16:34:30 INFO PythonUDFRunner: Times: total = 773372, boot = 2, init = 9, finish = 773361
20/05/27 16:34:32 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1992.0 MB to disk (43 times so far)
20/05/27 16:34:39 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (18 times so far)
20/05/27 16:34:46 INFO ShuffleExternalSorter: Thread 89 spilling sort data of 1992.0 MB to disk (44 times so far)
20/05/27 16:34:52 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (19 times so far)
20/05/27 16:35:01 INFO PythonUDFRunner: Times: total = 776905, boot = 3, init = 11, finish = 776891
20/05/27 16:35:05 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (20 times so far)
20/05/27 16:35:19 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (21 times so far)
20/05/27 16:35:35 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (22 times so far)
20/05/27 16:35:52 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (23 times so far)
20/05/27 16:36:10 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (24 times so far)
20/05/27 16:36:29 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (25 times so far)
20/05/27 16:36:47 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (26 times so far)
20/05/27 16:37:06 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (27 times so far)
20/05/27 16:37:25 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (28 times so far)
20/05/27 16:37:44 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (29 times so far)
20/05/27 16:38:03 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (30 times so far)
20/05/27 16:38:22 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (31 times so far)
20/05/27 16:38:41 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (32 times so far)
20/05/27 16:38:59 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (33 times so far)
20/05/27 16:39:19 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (34 times so far)
20/05/27 16:39:39 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (35 times so far)
20/05/27 16:39:58 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (36 times so far)
20/05/27 16:40:18 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (37 times so far)
20/05/27 16:40:38 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (38 times so far)
20/05/27 16:40:57 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (39 times so far)
20/05/27 16:41:16 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (40 times so far)
20/05/27 16:41:35 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (41 times so far)
20/05/27 16:41:55 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1988.0 MB to disk (42 times so far)
20/05/27 16:42:19 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1992.0 MB to disk (43 times so far)
20/05/27 16:42:41 INFO ShuffleExternalSorter: Thread 145 spilling sort data of 1992.0 MB to disk (44 times so far)
20/05/27 16:42:59 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
20/05/27 16:42:59 INFO DiskBlockManager: Shutdown hook called
20/05/27 16:42:59 INFO ShutdownHookManager: Shutdown hook called
20/05/27 16:42:59 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1590592506722_0001/spark-73af8e3b-f428-47d4-9e13-fed4e19cc2cd
2020-05-27T16:41:16.336+0000: [GC (Allocation Failure) 2020-05-27T16:41:16.336+0000: [ParNew: 272234K->242K(305984K), 0.0094375 secs] 9076907K->8804915K(13188748K), 0.0094895 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
2020-05-27T16:41:34.686+0000: [GC (Allocation Failure) 2020-05-27T16:41:34.686+0000: [ParNew: 272242K->257K(305984K), 0.0084179 secs] 9076915K->8804947K(13188748K), 0.0084840 secs] [Times: user=0.09 sys=0.01, real=0.01 secs]
2020-05-27T16:41:35.145+0000: [GC (Allocation Failure) 2020-05-27T16:41:35.145+0000: [ParNew: 272257K->1382K(305984K), 0.0095541 secs] 9076947K->8806073K(13188748K), 0.0096080 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
2020-05-27T16:41:55.077+0000: [GC (Allocation Failure) 2020-05-27T16:41:55.077+0000: [ParNew: 273382K->2683K(305984K), 0.0097177 secs] 9078073K->8807392K(13188748K), 0.0097754 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
2020-05-27T16:41:55.513+0000: [GC (Allocation Failure) 2020-05-27T16:41:55.513+0000: [ParNew: 274683K->3025K(305984K), 0.0093345 secs] 9079392K->8807734K(13188748K), 0.0093892 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
2020-05-27T16:42:05.481+0000: [GC (Allocation Failure) 2020-05-27T16:42:05.481+0000: [ParNew: 275025K->4102K(305984K), 0.0092950 secs] 9079734K->8808830K(13188748K), 0.0093464 secs] [Times: user=0.12 sys=0.00, real=0.01 secs]
2020-05-27T16:42:18.711+0000: [GC (Allocation Failure) 2020-05-27T16:42:18.711+0000: [ParNew: 276102K->2972K(305984K), 0.0098928 secs] 9080830K->8807700K(13188748K), 0.0099510 secs] [Times: user=0.13 sys=0.00, real=0.01 secs]
2020-05-27T16:42:36.493+0000: [GC (Allocation Failure) 2020-05-27T16:42:36.493+0000: [ParNew: 274972K->3852K(305984K), 0.0094324 secs] 9079700K->8808598K(13188748K), 0.0094897 secs] [Times: user=0.11 sys=0.00, real=0.01 secs]
2020-05-27T16:42:40.880+0000: [GC (Allocation Failure) 2020-05-27T16:42:40.880+0000: [ParNew: 275852K->2568K(305984K), 0.0111794 secs] 9080598K->8807882K(13188748K), 0.0112352 secs] [Times: user=0.13 sys=0.00, real=0.01 secs]
Heap
par new generation total 305984K, used 261139K [0x0000000440000000, 0x0000000454c00000, 0x0000000483990000)
eden space 272000K, 95% used [0x0000000440000000, 0x000000044fc82cf8, 0x00000004509a0000)
from space 33984K, 7% used [0x00000004509a0000, 0x0000000450c220a8, 0x0000000452ad0000)
to space 33984K, 0% used [0x0000000452ad0000, 0x0000000452ad0000, 0x0000000454c00000)
concurrent mark-sweep generation total 12882764K, used 8805314K [0x0000000483990000, 0x0000000795e63000, 0x00000007c0000000)
Metaspace used 77726K, capacity 79553K, committed 79604K, reserved 1118208K
class space used 10289K, capacity 10704K, committed 10740K, reserved 1048576K
What I tried:
- Changing
spark.sql.shuffle.partitions - Changing
spark.default.parallelism - Repartition the dataframe
How can I solve this issue?
Thanks in advance!
Thijs
approxSimilarityJoinwill only parallelize well across workers if the tokens being input into MinHash are sufficiently distinct. Since individual character tokens appear frequently across many records; include anNGramtransformation on your character list to make the appearance of each token less frequent; this will greatly reduce data skew and will resolve memory strain.MinHash simulates the process of creating a random permutation of your token population and selects the token in the sample set that appears first in the permutation. Since you are using individual characters as tokens, let's say you select a MinHash seed that makes the character
ethe first in your random permutation. In this case, every row with the letterein it will have a matching MinHash and will be shuffled to the same worker for set comparison. This will cause extreme data skew and out of memory errors.