Opening a CSV file with 4 million columns using R data.table fread()


I am trying to load a .csv file with about 4 million columns and a few hundred rows using R data.table's fread(). I've run it with verbose=TRUE, and here is the output I get, ending in a segfault:
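For reference, a minimal sketch of the call pattern (the exact invocation isn't shown above; header = TRUE is inferred from the "'header' changed by user from 'auto' to true" line in the log, and the small generated file below just stands in for the real 7 GB one):

```r
library(data.table)

# Stand-in for the real file (/q/combined.u.NA.ntwistbd.csv): a CSV with
# many columns and one data row, just to illustrate the call pattern.
tmp <- tempfile(fileext = ".csv")
ncols <- 1000  # the real file has ~3,986,159 columns
writeLines(c(
  paste(c("gene", paste0("col", seq_len(ncols - 1))), collapse = ","),
  paste(c("geneA", rep("0.5", ncols - 1)), collapse = ",")
), tmp)

dt <- fread(tmp, header = TRUE, verbose = TRUE)
dim(dt)  # 1 row, 1000 columns
```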

  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            20
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          20
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 10 threads with throttle==1024. See ?setDTthreads.
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 10 threads (omp_get_max_threads()=20, nth=10)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 0
  0/1 column will be read as integer
[02] Opening the file
  Opening file /q/combined.u.NA.ntwistbd.csv
  File opened, size = 7.265GB (7800965634 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<gene,chr1.10469.10470.cpg_inte>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 3986159 fields using quote rule 0
  Detected 3986159 columns on line 1. This line is either column names or first data row. Line starts as: <<gene,chr1.10469.10470.cpg_inte>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 3986159
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 1 because (7800965633 bytes from row 1 to eof) / (2 * 3658680597 jump0size) == 1
  Type codes (jump 000)    : C7777777777777777777777777777777777775577755777777777777775772777755777777777777...2222222222  Quote rule 0
  Type codes (jump 001)    : C7777777777777777777777777777777777775577777777777777777777772777755777777777777...2222222222  Quote rule 0
  =====
  Sampled 153 rows (handled \n inside quoted fields) at 2 jump points
  Bytes from first data row on line 2 to the end of last row: 7456598287
  Line length: mean=33644907.61 sd=-nan min=18359868 max=54585593
  Estimated number of rows: 7456598287 / 33644907.61 = 222
  Initial alloc = 406 rows (222 + 82%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : C7777777777777777777777777777777777775577777777777777777777772777755777777777777...2222222222
[10] Allocate memory for the datatable
  Allocating 3986159 column slots (3986159 - 0 dropped) with 406 rows
[11] Read the data
  jumps=[0..1), chunk_size=33644907614, total_size=7456598287
  2390 out-of-sample type bumps: C7777777777777777777777777777777777775577777777777777777777772777755777777777777...2222222222

 *** caught segfault ***
address (nil), cause 'unknown'
Segmentation fault (core dumped)

Is this a memory issue? I am running it on a Linux machine with 188 GB of RAM, and the file is only about 7-8 GB in size. Any ideas?
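As a rough sanity check on the memory question, the allocation implied by the verbose log (406 row slots across ~4 million columns, assuming mostly 8-byte numeric storage, and ignoring data.table's per-column overhead for vectors, names, and attributes) is only on the order of 12 GiB, far below 188 GB:

```r
ncols <- 3986159    # "Detected 3986159 columns on line 1"
nrows_alloc <- 406  # "Initial alloc = 406 rows" from the log
bytes <- ncols * nrows_alloc * 8  # assuming 8-byte numeric columns
round(bytes / 1024^3, 1)  # ~12.1 GiB
```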
