I have ~30GB of uncompressed spatial data stored as Parquet, with three columns (id, tags, and coordinates) and a row group size of 64MB. I read it with Dask's read_parquet using blocksize="32MiB", which gives me 118 partitions. I then process it as a Dask DataFrame, calling DataFrame.persist followed by distributed.wait (or progress) after each processing stage to make sure that stage has finished.
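Roughly, the read/persist part looks like this (paths and the scheduler address are placeholders; this assumes the blocksize= argument of read_parquet):
import dask.dataframe as dd
from dask.distributed import Client, wait, progress

client = Client("scheduler-address:8786")   # placeholder address for my existing cluster

features = dd.read_parquet(
    "/data/spatial/*.parquet",              # placeholder path
    blocksize="32MiB",                      # -> 118 partitions for ~30GB of data
)

# ... several processing stages on the Dask DataFrame ...
features = features.persist()
wait(features)                              # or progress(features) to watch interactively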
Everything works fine until I try to save the output as Parquet files; I always get the same exception during the final compute call. Here is the saving code:
# features, layer, self._path and save() come from the surrounding class.
import os
from dask import delayed
import dask.dataframe as dd

dfs = features.to_delayed()                  # one delayed pandas frame per partition
layer_path = os.path.join(self._path, layer)
filenames = [os.path.join(layer_path, f"{layer}-{i}.parquet") for i in range(len(dfs))]
writes = [delayed(save)(df, layer, fn) for df, fn in zip(dfs, filenames)]
dd.compute(*writes)                          # this is where the workers get killed
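For context, save essentially just writes one pandas partition to disk; a simplified sketch (not my exact code) looks like this:
import os
import pandas as pd

def save(df: pd.DataFrame, layer: str, filename: str) -> str:
    # Each delayed partition arrives as a plain pandas DataFrame.
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    df.to_parquet(filename, index=False)
    return filename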
Exception: TerminatedWorkerError('A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.\n\nThe exit codes of the workers are {SIGKILL(-9)}')
This is the DataFrame's partition structure right before the final compute call that saves the output:
                 layer geometry properties
npartitions=118
                object   object     object
                   ...      ...        ...
...                ...      ...        ...
                   ...      ...        ...
Dask Name: apply, 2 graph layers
I tried a worker plugin, but still didn't see any useful logs. So I am wondering: is there a better way to check what happened?
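For reference, the plugin I registered was roughly this (simplified); it logs every task transition together with the worker's resident memory, hoping the last lines before the SIGKILL would show the offending task:
import logging
import psutil
from dask.distributed import Client, WorkerPlugin

class TaskLogger(WorkerPlugin):
    # Logs each task state change plus the worker process's RSS.
    def setup(self, worker):
        self.proc = psutil.Process()
        self.log = logging.getLogger("distributed.worker")

    def transition(self, key, start, finish, **kwargs):
        rss_gib = self.proc.memory_info().rss / 2**30
        self.log.warning("task %s: %s -> %s (rss %.1f GiB)", key, start, finish, rss_gib)

client = Client("scheduler-address:8786")   # placeholder address
client.register_worker_plugin(TaskLogger())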
P.S. The cluster has 10 workers; each worker has 8 cores and 48GB of memory available.