How to load a huge model on Dask with limited RAM?


I want to load a model (an ANNOY model) with Dask. The model is 60 GB, but the Dask workers have only 2 GB of RAM each. Is there a way to load the model in a distributed manner?


1 Answer


If by "load" you mean: "store in memory", then obviously there is no way to do this. If you need access to the whole dataset in memory at once, you'll need a machine that can handle this. However, you very probably meant that you want to do some processing to the data and get a result (prediction, statistical score...) which does fit in memory.

Since I don't know what ANNOY is internally (an array? a dataframe? something else?), I can only give you general rules. For Dask to work, it needs to be able to split a job into tasks. For data IO, this commonly means that the input is spread across multiple files, or that the files have some natural internal structure allowing them to be loaded chunk-wise. For example, zarr (for arrays) stores each chunk of a logical dataset as a separate file, parquet (for dataframes) chunks data into pages within columns within row groups within files, and even CSV can be loaded chunk-wise by looking for newline characters.
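To illustrate the chunk-wise pattern, here is a minimal sketch of out-of-core loading with Dask. The file paths and the column name are placeholders, not your ANNOY data, and it assumes the data is already stored in one of these chunkable formats:

```python
import dask.array as da
import dask.dataframe as dd

# zarr: each chunk of the array lives in its own file, so dask builds
# one small task per chunk instead of reading the whole array at once
arr = da.from_zarr("big_array.zarr")              # hypothetical path

# parquet: each row group becomes one dataframe partition
df = dd.read_parquet("big_table/*.parquet")       # hypothetical path

# CSV: dask splits the file(s) into ~64MB blocks at newline boundaries
csv = dd.read_csv("big_table/*.csv", blocksize="64MB")

# only the small reduced results are pulled back into local memory,
# never the full 60GB dataset
print(arr.mean().compute())
print(df.groupby("key").size().compute())         # "key" is a made-up column
```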

I suspect annoy ( https://github.com/spotify/annoy ?) has a complex internal storage structure, and you may need to raise an issue on their repo asking about dask support.