I am performing an aggregated array collection using the following code in pyspark:
from pyspark.sql.functions import collect_list
df1 = df.groupBy('key').agg(collect_list('value'))
I know that functions like collect force data onto a single node. Is it possible to achieve the same result while still leveraging the power of distributed cloud computing?
There seems to be a bit of a misunderstanding here.
collect forces the data to be gathered on the driver and is not distributed, whereas
collect_list and collect_set are distributed aggregate functions by default: each executor builds partial collections for its partitions, and these are merged during the shuffle, so no single node has to hold the raw data before aggregation.