How should we select the chunk size in disk.frame?

I'm working with disk.frame and it's great so far.

One piece that confuses me is the chunk size. I sense that small chunks might create too many tasks, so disk.frame could spend much of its time managing those tasks. On the other hand, a big chunk might be too expensive for each worker to process, reducing the performance benefit of parallelism.

What pieces of information can we use to make a better guess for chunk size?

xiaodai On BEST ANSWER

This is a tough problem and I probably need better tools.

Currently, it is all guesswork. But I have given a presentation on this and I will try to bring it into the docs soon.

Ideally, you want

RAM Used = number of workers * RAM usage per chunk

So, if you have 6 workers (ideal for 6 CPU cores), you would want smaller chunks than someone with 4 workers but the same amount of total RAM.

The difficulty is in estimating "RAM usage per chunk", which differs between operations such as merge, sort, and plain vanilla filtering!

This is a hard problem to solve in general, so there is no good solution for now.
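As a rough illustration of the relationship above, here is a minimal Python sketch of the arithmetic only. It is not part of disk.frame and does not call its API. The function names, the `bytes_per_row` estimate, and the `overhead_factor` (a stand-in for how much extra memory an operation such as a merge or sort needs compared with a plain filter) are all hypothetical, and you would have to measure realistic values for your own data.

```python
def ram_budget_per_chunk(total_ram_bytes: int, n_workers: int) -> float:
    """Rearrange 'RAM used = number of workers * RAM usage per chunk'
    to get the RAM each worker's chunk may use."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return total_ram_bytes / n_workers


def max_rows_per_chunk(
    total_ram_bytes: int,
    n_workers: int,
    bytes_per_row: float,
    overhead_factor: float = 1.0,
) -> int:
    """Estimate how many rows fit in one chunk.

    overhead_factor is a hypothetical multiplier for the operation being run:
    1.0 means the operation needs no more RAM than the chunk itself.
    It is a placeholder, not a measured value.
    """
    if bytes_per_row <= 0 or overhead_factor <= 0:
        raise ValueError("bytes_per_row and overhead_factor must be positive")
    budget = ram_budget_per_chunk(total_ram_bytes, n_workers)
    return int(budget // (bytes_per_row * overhead_factor))


if __name__ == "__main__":
    total_ram = 16 * 1024**3  # assume 16 GiB is available to the workers
    for workers in (4, 6):
        rows = max_rows_per_chunk(
            total_ram, workers, bytes_per_row=100, overhead_factor=2.0
        )
        print(f"{workers} workers: at most about {rows} rows per chunk")
```

With the same total RAM, the 6-worker case yields a smaller per-chunk budget than the 4-worker case, which matches the reasoning above. The hard part remains choosing a realistic `overhead_factor` for each operation.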