I have a very big database (more than 30 million records) with the following structure:
test <- as.data.table(list(v1 = c("150,10001,11,Bien", "151,10002,11,Bien",
"152,10003,11,Bien", "153,10004,11,Mal",
"154,10005,11,Regular")))
I downloaded that database as one string per row because I lose a lot of information when trying to read it with fread directly. For instance, I get errors similar to this: "Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line". Because of that I can't read the whole database as expected.
What I want to achieve is to create four variables from those strings. This code works perfectly on my dummy database:
test <- test[, `:=`(id = lapply(strsplit(test$v1, split=","), "[", 1),
question_id = lapply(strsplit(test$v1, split=","), "[", 2),
answer_id = lapply(strsplit(test$v1, split=","), "[", 3),
answer = lapply(strsplit(test$v1, split=","), "[", 4))]
In my actual database this code takes a very long time: I waited more than two hours and didn't get any result. I guess this approach might be wrong, since I'm not creating the variables themselves but lists, which makes the process slower than it should be:
Classes ‘data.table’ and 'data.frame': 5 obs. of 5 variables:
$ v1 : chr "150,10001,11,Bien" "151,10002,11,Bien" "152,10003,11,Bien" "153,10004,11,Mal" ...
$ id :List of 5
..$ : chr "150"
..$ : chr "151"
..$ : chr "152"
..$ : chr "153"
..$ : chr "154"
$ question_id:List of 5
..$ : chr "10001"
..$ : chr "10002"
..$ : chr "10003"
..$ : chr "10004"
..$ : chr "10005"
$ answer_id :List of 5
..$ : chr "11"
..$ : chr "11"
..$ : chr "11"
..$ : chr "11"
..$ : chr "11"
$ answer :List of 5
..$ : chr "Bien"
..$ : chr "Bien"
..$ : chr "Bien"
..$ : chr "Mal"
..$ : chr "Regular"
- attr(*, ".internal.selfref")=<externalptr>
I read that parLapply (a function from the parallel package) tends to be slower than lapply on Windows, so I discarded that solution.
Any advice will be much appreciated.
First, calling strsplit multiple times on the same strings just to get to specific columns can be avoided: tstrsplit splits each string once and hands back all the pieces for assignment in one step. Just this change alone is 4x faster with this small dataset, and with 1000x as many rows it actually was more efficient in limited testing.
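For example, something like this (a minimal sketch using your dummy data; fixed = TRUE just avoids regex handling of the literal comma):

library(data.table)

test <- as.data.table(list(v1 = c("150,10001,11,Bien", "151,10002,11,Bien",
                                  "152,10003,11,Bien", "153,10004,11,Mal",
                                  "154,10005,11,Regular")))

# split each string once; tstrsplit transposes the pieces so that all four
# columns can be assigned in a single `:=` call
test[, c("id", "question_id", "answer_id", "answer") := tstrsplit(v1, ",", fixed = TRUE)]

(tstrsplit also has a type.convert argument if you want the numeric-looking columns converted from character.)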
(Another comment: using test$v1 while inside a test[...] calculation is inefficient and might be slowing you down a little. Perhaps 1% or so ... not much.)

Since you say you have rows with too many commas, this will still fail on those rows. I'll modify the sample data so that we can see the effect here.
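For example (the extra ",EXTRA" field is made up purely to stand in for your problem rows; I'm calling the modified table test2):

test2 <- as.data.table(list(v1 = c("150,10001,11,Bien", "151,10002,11,Bien,EXTRA",
                                   "152,10003,11,Bien", "153,10004,11,Mal",
                                   "154,10005,11,Regular")))

# the second row now has five fields, so tstrsplit() returns five pieces
# and assigning them to only four names fails
test2[, c("id", "question_id", "answer_id", "answer") := tstrsplit(v1, ",", fixed = TRUE)]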
Your current (too-slow) method suggests that you only want the first four fields and are willing to discard everything after them, so I'll suggest a conditional sub, sketched below.
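Something like this (the regex is my own attempt at "keep the first four fields", so check it against a sample of your real data; sub only rewrites rows that actually have a fifth field, which is what makes it conditional):

# keep only the first four comma-separated fields; rows that already have
# exactly four fields don't match the pattern and are left untouched
test2[, v1 := sub("^([^,]*,[^,]*,[^,]*,[^,]*),.*$", "\\1", v1)]
test2[, c("id", "question_id", "answer_id", "answer") := tstrsplit(v1, ",", fixed = TRUE)]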
However, I think all of that is unnecessary, since you can instruct fread to fill when extra columns are discovered (as the warning/error message suggests). This is certainly the safer path, since it allows you to assess whether the extra columns are okay to discard, or whether perhaps one of the other columns should have had an embedded comma in a quoted string.
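A sketch of that, where "yourfile.csv" is just a placeholder for however you are pulling the raw data:

library(data.table)

# fill = TRUE stops fread from giving up early when a row has a different
# number of fields, which is what your error message itself recommends
dat <- fread("yourfile.csv", fill = TRUE)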