I am producing very big datasets (>120 GB), which are actually lists of named (100x100x3) matrices: very large lists, with millions of records. They are then fed to a CNN and classified into one of 4 categories. Processing this amount of data at once is tedious and often exhausts my RAM, so I would like to split my dataset into chunks and process the chunks in parallel.
I found a few packages; bigmemory and disk.frame look the most suitable. But do they accept lists? Or are there better solutions for lists?
I had to adjust my data to the data.table format, so I did something like the following.
I need the records to stay named, so I first extracted the names into a vector.
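Roughly like this (`list1` is a small dummy standing in for my real list; the tiny 2x2 matrices are just placeholders for the real 100x100 ones):

```r
# dummy stand-in for the real data: a named list, 3 small matrices per name
list1 <- list(
  a = replicate(3, matrix(runif(4), 2, 2), simplify = FALSE),
  b = replicate(3, matrix(runif(4), 2, 2), simplify = FALSE)
)

vec <- names(list1)  # the names I need to keep
```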
Then I converted my list to a data.table ("chunk" holds my original data from list1, used here as the dummy; it is a nested list of matrices, 3 matrices per name to be specific).
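Something along these lines (a sketch, not my exact production code):

```r
library(data.table)

# "chunk" becomes a list column: each cell holds the nested
# list of 3 matrices belonging to one name
dt <- data.table(name = vec, chunk = list1)
```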
I tried to convert it into a disk.frame.
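Approximately like this (assuming the default disk.frame setup; `outdir` is just an example path):

```r
library(disk.frame)
setup_disk.frame()  # default worker setup

# this is the call that fails for me on the nested list column
df1 <- as.disk.frame(dt, outdir = "dt.df")
```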
Then I encountered the following error:
So there is no way to use this approach for a nested list of matrices.
After that I changed my approach and decided to process a dummy data.table with one column containing the name and another containing the matrix. I created it in a two-step fashion, based on this thread (I used Jonathan Gellar's example; see the sketch below the link):
data.frame with a column containing a matrix in R
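The two-step construction looked roughly like this (the linked answer builds a plain data.frame, which is what the sketch shows; the dimensions are placeholders, my real matrices are much larger):

```r
# step 1: a data.frame holding only the name column
df2 <- data.frame(name = c("a", "b"))

# step 2: assign a matrix as a single column, as in the linked answer
df2$mat <- matrix(runif(4), nrow = 2)

str(df2)  # 'mat' is one matrix column, not two plain columns
```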
Under this scenario, disk.frame threw another type of error:
So, unfortunately, this is not a solution I could use with my datasets either. I am sharing this so other people can save their time.