I'm working with disk.frame and it's great so far.
One piece that confuses me is the chunk size. My sense is that small chunks might create too many tasks, and disk.frame might eat up time managing those tasks. On the other hand, big chunks might be too expensive for the workers, reducing the performance benefit from parallelism.
What pieces of information can we use to make a better guess for chunk size?
This is a tough problem and I probably need better tools.
Currently, everything is done on a guess basis. But I have made a presentation on this and will try to bring it into the docs soon.
Ideally, you want:
RAM Used = number of workers * RAM usage per chunk
So, if you have 6 workers (ideal for 6 CPU cores), you would want smaller chunks than someone with 4 workers but the same amount of total RAM.
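As a rough back-of-the-envelope sketch (plain R arithmetic, not disk.frame API; all the numbers are made up for illustration), you can invert that formula to get a per-chunk budget and a chunk count:

```r
total_ram_gb   <- 16   # RAM you are willing to hand to the workers
n_workers      <- 6    # e.g. one worker per CPU core
data_size_gb   <- 48   # rough in-memory size of the whole dataset
ram_multiplier <- 2    # fudge factor: many operations need ~2x the chunk's size

# Invert "RAM used = number of workers * RAM usage per chunk"
ram_per_chunk_gb <- total_ram_gb / n_workers           # budget per worker
chunk_size_gb    <- ram_per_chunk_gb / ram_multiplier  # usable chunk size
n_chunks         <- ceiling(data_size_gb / chunk_size_gb)
n_chunks   # ~36 chunks for this example
```

With 4 workers instead of 6 and the same 16 GB, the per-worker budget grows from about 2.7 GB to 4 GB, so you can afford bigger chunks, which is exactly the trade-off above.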
The difficulty is in estimating "RAM usage per chunk", which is different for different operations like merge, sort, and just vanilla filtering!
This is a hard problem to solve in general, so there is no good solution for now.
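In the meantime, one rough empirical workaround (just a sketch, not disk.frame functionality; it assumes the 'bench' package and uses synthetic data as a stand-in for one chunk) is to load a single representative chunk as an ordinary data.frame and measure how much memory each candidate operation allocates on it:

```r
library(bench)

# Synthetic stand-in for one chunk (~1e6 rows)
chunk <- data.frame(x = runif(1e6), g = sample(letters, 1e6, replace = TRUE))

filter_alloc <- bench::mark(chunk[chunk$x > 0.5, ], iterations = 1)$mem_alloc
sort_alloc   <- bench::mark(chunk[order(chunk$x), ], iterations = 1)$mem_alloc

filter_alloc  # roughly the surviving rows plus the logical index
sort_alloc    # the ordering vector plus a full copy of the chunk

# Scale by the number of workers: if 6 workers each sort a chunk at the same
# time, you need roughly 6 * sort_alloc of free RAM, so size chunks so that
# the worst-case operation still fits.
```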