Issue with foreach R package combining data when run in parallel

130 Views Asked by At

Given the size of data I'm working with I want to do the processing in parallel

I have set up the code as below keeping one core free so whole machine isn't being used

library(DoMC)
library(foreach)
library(itertools)

num_cores <- round(detectCores()*1-1) # num_cores is 7 in this case

registerDoMC(num_cores)

test_prediction <-
    data.frame(
    foreach(d=isplitRows(test, chunks = num_cores),
            .combine=c, 
            .packages=c("stats")) %dopar% {
                predict(cfModel, newdata=d)
            }
    )

The problem is that test_prediction returned has less rows than test, and I can't figure out why

The rows returned in a few attempts suggest to me that the .combine in the foreach isn't collecting data from some of the cores, although I'm not sure how to confirm this theory

Total number of rows 603,054

Attempt 1: rows returned > 516,903 - 6/7s of data returned
Attempt 2: rows returned > 344,602 - 4/7s of data returned
Attempt 3: rows returned > 430,753 - 5/7s of data returned

This only occurs when run in parallel, if I use the %do% option instead then the correct number of rows is returned - although I'm not sure how to further investigate this theory?

In general, any help on if there is a better approach to running this in parallel? would be appreciated

Session information:

> sessionInfo()
R version 3.3.0 beta (2016-03-30 r70404)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] miniCRAN_0.2.7      markdown_0.7.7      slackr_1.4.2        readr_0.2.2         readxl_0.1.1        testthat_1.0.2      R2HTML_2.3.2        itertools_0.1-3     XML_3.98-1.4       
[10] rvest_0.3.2         xml2_1.0.0          devtools_1.12.0     doParallel_1.0.10   rjson_0.2.15        RCurl_1.95-4.8      bitops_1.0-6        bit64_0.9-5         bit_1.1-12         
[19] qcc_2.6             optiRum_0.37.3      scales_0.4.0        doMC_1.3.4          iterators_1.0.8     foreach_1.4.3       pryr_0.1.2          party_1.0-25        strucchange_1.5-1  
[28] sandwich_2.3-4      zoo_1.7-13          modeltools_0.2-21   mvtnorm_1.0-5       e1071_1.6-7         randomForest_4.6-12 caret_6.0-70        lattice_0.20-29     timeDate_3012.100  
[37] Kmisc_0.5.0         reshape2_1.4.1      gridExtra_2.2.1     tidyr_0.5.1         dplyr_0.5.0         plyr_1.8.4          data.table_1.9.6    sendmailR_1.2-1     RPostgreSQL_0.4-1  
[46] ggplot2_2.1.0       lubridate_1.5.6     stringr_1.0.0       sqldf_0.4-10        RSQLite_1.0.0       DBI_0.4-1           gsubfn_0.6-6        proto_0.3-10       

loaded via a namespace (and not attached):
 [1] nlme_3.1-128       pbkrtest_0.4-6     httr_1.2.1         tools_3.3.0        R6_2.1.2           lazyeval_0.2.0     mgcv_1.8-3         colorspace_1.2-6   nnet_7.3-8         withr_1.0.2       
[11] compiler_3.3.0     chron_2.3-47       quantreg_5.26      SparseM_1.7        AUC_0.3.0          digest_0.6.9       minqa_1.2.4        base64enc_0.1-3    lme4_1.1-12        jsonlite_1.0      
[21] car_2.1-2          magrittr_1.5       Matrix_1.2-6       Rcpp_0.12.6        munsell_0.4.3      stringi_1.1.1      multcomp_1.4-6     MASS_7.3-35        crayon_1.3.2       splines_3.3.0     
[31] knitr_1.13         tcltk_3.3.0        codetools_0.2-9    nloptr_1.0.4       MatrixModels_0.4-1 gtable_0.2.0       assertthat_0.1     coin_1.1-2         class_7.3-11       survival_2.39-5   
[41] tibble_1.1         memoise_1.0.0      TH.data_1.0-7   
1

There are 1 best solutions below

0
F. Privé On

This works fine for me:

library(doMC)
library(foreach)
library(itertools)

num_cores <- 2
registerDoMC(num_cores)

predict.fake <- function(object, newdata) {
  rowSums(newdata)
}

cfModel <- structure(NULL, class = "fake")
test <- matrix(rnorm(2000), ncol = 2)
pred <- predict(cfModel, test)

test_prediction <-
  foreach(d=isplitRows(test, chunks = num_cores),
          .combine=c, 
          .packages=c("stats")) %dopar% {
            predict(cfModel, newdata=d)
          }

all.equal(test_prediction, pred)