Does disk.frame allow working with large lists in R?


I am producing very big datasets (>120 GB), each of which is actually a list of named (100x100x3) matrices: very large lists with millions of records. They are then fed to a CNN and classified into one of 4 categories. Processing this amount of data at once is tedious and often exhausts my RAM, so I would like to split the dataset into chunks and process the chunks in parallel.
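
For concreteness, here is a small mock-up of the structure (the names, element count, and contents are illustrative; the real lists hold millions of such matrices):

    set.seed(42)
    list1 <- setNames(
      replicate(5, array(runif(100 * 100 * 3), dim = c(100, 100, 3)),
                simplify = FALSE),
      paste0("record_", 1:5)
    )
    str(list1[[1]])  # num [1:100, 1:100, 1:3] ...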

I found a few packages; bigmemory and disk.frame look the most suitable. But do they accept lists? Or are there better solutions for lists?

2 Answers

Answer by ramen:

I had to adjust my data to a tabular format, so I did something like this:

Since the data need to stay named, I first extracted the names into a vector:

    nameslist <- names(list1)

Then I converted my list to a data frame ("chunk" holds my original data from list1, used as the dummy; it is a nested list of matrices, 3 matrices per name to be specific):

    dummy_dframe <- data.frame(name = nameslist, chunk = I(list1))  # I() preserves the list as a list-column

Then I tried to convert it into a disk.frame:

    dummy_diskframe <- as.disk.frame(dummy_dframe)

Then I encountered the following error:

    Error in `[.data.table`(df, , { :
      The data frame contains these list-columns: 'chunk'. List-columns are not yet supported by disk.frame. Remove these columns to create a disk.frame

So there is no way to use this approach for a nested list of matrices.

After that I changed my approach and decided to process a dummy data.table with one column containing the name and another containing a matrix. I created this in a two-step fashion (see the sketch below), based on Jonathan Gellar's example from this thread:

data.frame with a column containing a matrix in R
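
A minimal sketch of that two-step construction (the object and column names are illustrative; the 2x2 matrix matches the column lengths reported in the error below):

    library(disk.frame)
    df <- data.frame(x = 1:2)        # ordinary column of length 2
    df$mat <- matrix(1:4, nrow = 2)  # matrix column added in a second step
    as.disk.frame(df)                # conversion attempt that triggers the error below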

Under this scenario, disk.frame threw another type of error:

    Error in `[.data.table`(df, , { :
      Column 2 ['mat'] is length 4 but column 1 is length 2; malformed data.table.

So, unfortunately, this is not a solution I could use with my datasets. I am sharing this so other people can spare their time.

Answer by xiaodai:

{disk.frame} only works with tabular data.
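
Given that limitation, one plain base-R workaround for a named list of matrices is to split the list into chunks, save each chunk as an .rds file, and have parallel workers load and process one chunk at a time. A minimal sketch, assuming a list called list1 as above (chunk size, file paths, and the per-matrix work are all illustrative):

    library(parallel)

    # Split the named list into chunks and write each chunk to disk
    chunk_size <- 10000
    idx <- split(seq_along(list1), ceiling(seq_along(list1) / chunk_size))
    paths <- vapply(seq_along(idx), function(i) {
      p <- sprintf("chunk_%03d.rds", i)
      saveRDS(list1[idx[[i]]], p)
      p
    }, character(1))

    # Each worker holds only one chunk in RAM at a time
    results <- mclapply(paths, function(p) {  # forking; use parLapply() on Windows
      chunk <- readRDS(p)
      lapply(chunk, mean)                     # placeholder for the real per-matrix work
    }, mc.cores = 4)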