I am performing an aggregated array collection using the following code in pyspark:
from pyspark.sql.functions import collect_list
df1 = df.groupBy('key').agg(collect_list('value'))
I know that functions like collect force data onto a single node. Is it possible to achieve the same result while still leveraging the power of distributed cloud computing?
There seems to be a bit of a misunderstanding here.
collect forces the data to be gathered on the driver and is not distributed, whereas
collect_list and collect_set are distributed aggregate functions by default: each executor builds partial collections for its partitions, and these are merged during the shuffle, so no single node has to hold the raw data before aggregation.