Given the size of data I'm working with I want to do the processing in parallel
I have set up the code as below keeping one core free so whole machine isn't being used
library(DoMC)
library(foreach)
library(itertools)
num_cores <- round(detectCores()*1-1) # num_cores is 7 in this case
registerDoMC(num_cores)
test_prediction <-
data.frame(
foreach(d=isplitRows(test, chunks = num_cores),
.combine=c,
.packages=c("stats")) %dopar% {
predict(cfModel, newdata=d)
}
)
The problem is that test_prediction returned has less rows than test, and I can't figure out why
The rows returned in a few attempts suggest to me that the .combine in the foreach isn't collecting data from some of the cores, although I'm not sure how to confirm this theory
Total number of rows 603,054
Attempt 1: rows returned > 516,903 - 6/7s of data returned
Attempt 2: rows returned > 344,602 - 4/7s of data returned
Attempt 3: rows returned > 430,753 - 5/7s of data returned
This only occurs when run in parallel, if I use the %do% option instead then the correct number of rows is returned - although I'm not sure how to further investigate this theory?
In general, any help on if there is a better approach to running this in parallel? would be appreciated
Session information:
> sessionInfo()
R version 3.3.0 beta (2016-03-30 r70404)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] miniCRAN_0.2.7 markdown_0.7.7 slackr_1.4.2 readr_0.2.2 readxl_0.1.1 testthat_1.0.2 R2HTML_2.3.2 itertools_0.1-3 XML_3.98-1.4
[10] rvest_0.3.2 xml2_1.0.0 devtools_1.12.0 doParallel_1.0.10 rjson_0.2.15 RCurl_1.95-4.8 bitops_1.0-6 bit64_0.9-5 bit_1.1-12
[19] qcc_2.6 optiRum_0.37.3 scales_0.4.0 doMC_1.3.4 iterators_1.0.8 foreach_1.4.3 pryr_0.1.2 party_1.0-25 strucchange_1.5-1
[28] sandwich_2.3-4 zoo_1.7-13 modeltools_0.2-21 mvtnorm_1.0-5 e1071_1.6-7 randomForest_4.6-12 caret_6.0-70 lattice_0.20-29 timeDate_3012.100
[37] Kmisc_0.5.0 reshape2_1.4.1 gridExtra_2.2.1 tidyr_0.5.1 dplyr_0.5.0 plyr_1.8.4 data.table_1.9.6 sendmailR_1.2-1 RPostgreSQL_0.4-1
[46] ggplot2_2.1.0 lubridate_1.5.6 stringr_1.0.0 sqldf_0.4-10 RSQLite_1.0.0 DBI_0.4-1 gsubfn_0.6-6 proto_0.3-10
loaded via a namespace (and not attached):
[1] nlme_3.1-128 pbkrtest_0.4-6 httr_1.2.1 tools_3.3.0 R6_2.1.2 lazyeval_0.2.0 mgcv_1.8-3 colorspace_1.2-6 nnet_7.3-8 withr_1.0.2
[11] compiler_3.3.0 chron_2.3-47 quantreg_5.26 SparseM_1.7 AUC_0.3.0 digest_0.6.9 minqa_1.2.4 base64enc_0.1-3 lme4_1.1-12 jsonlite_1.0
[21] car_2.1-2 magrittr_1.5 Matrix_1.2-6 Rcpp_0.12.6 munsell_0.4.3 stringi_1.1.1 multcomp_1.4-6 MASS_7.3-35 crayon_1.3.2 splines_3.3.0
[31] knitr_1.13 tcltk_3.3.0 codetools_0.2-9 nloptr_1.0.4 MatrixModels_0.4-1 gtable_0.2.0 assertthat_0.1 coin_1.1-2 class_7.3-11 survival_2.39-5
[41] tibble_1.1 memoise_1.0.0 TH.data_1.0-7
This works fine for me: