Issue with foreach R package combining data when run in parallel

Question

Asked by Sam Gilbert on 25 August 2016 at 15:23

Given the size of the data I'm working with, I want to do the processing in parallel.

I have set up the code as below, keeping one core free so that the whole machine isn't used:

library(doMC)
library(foreach)
library(itertools)

num_cores <- detectCores() - 1 # leave one core free; num_cores is 7 in this case

registerDoMC(num_cores)

test_prediction <-
    data.frame(
    foreach(d=isplitRows(test, chunks = num_cores),
            .combine=c, 
            .packages=c("stats")) %dopar% {
                predict(cfModel, newdata=d)
            }
    )

The problem is that the returned test_prediction has fewer rows than test, and I can't figure out why.

The row counts from a few attempts suggest to me that the .combine in foreach isn't collecting results from some of the cores, although I'm not sure how to confirm this theory.

Total number of rows: 603,054

Attempt 1: 516,903 rows returned (6/7 of the data)
Attempt 2: 344,602 rows returned (4/7 of the data)
Attempt 3: 430,753 rows returned (5/7 of the data)
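
As a rough sanity check on those fractions, the sketch below tests whether each returned row count equals the sum of a subset of whole chunks. It assumes isplitRows cuts the 603,054 rows into 7 chunks whose sizes differ by at most one row; the variable names are illustrative only.

```r
total_rows <- 603054
num_chunks <- 7
returned   <- c(516903, 344602, 430753)

# Assumption: chunk sizes are as even as possible (they differ by at most one row).
base_size   <- total_rows %/% num_chunks
n_large     <- total_rows %% num_chunks
chunk_sizes <- c(rep(base_size + 1, n_large),
                 rep(base_size, num_chunks - n_large))

# All row totals that can be reached by keeping exactly k whole chunks.
subset_sums <- function(k) unique(combn(chunk_sizes, k, FUN = sum))

# Estimate how many chunks are missing in each attempt.
chunks_missing <- round((total_rows - returned) / (total_rows / num_chunks))

# Is each returned count reachable by dropping that many whole chunks?
consistent <- mapply(function(rows, k) rows %in% subset_sums(num_chunks - k),
                     returned, chunks_missing)

print(chunks_missing)
print(consistent)
```

If every entry of consistent is TRUE, the counts fit the pattern of whole chunks going missing rather than rows being lost inside a chunk.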

This only occurs when run in parallel; if I use %do% instead, the correct number of rows is returned. I'm not sure how to investigate this further.

More generally, any advice on a better approach to running this in parallel would be appreciated.

Session information:

> sessionInfo()
R version 3.3.0 beta (2016-03-30 r70404)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] miniCRAN_0.2.7      markdown_0.7.7      slackr_1.4.2        readr_0.2.2         readxl_0.1.1        testthat_1.0.2      R2HTML_2.3.2        itertools_0.1-3     XML_3.98-1.4       
[10] rvest_0.3.2         xml2_1.0.0          devtools_1.12.0     doParallel_1.0.10   rjson_0.2.15        RCurl_1.95-4.8      bitops_1.0-6        bit64_0.9-5         bit_1.1-12         
[19] qcc_2.6             optiRum_0.37.3      scales_0.4.0        doMC_1.3.4          iterators_1.0.8     foreach_1.4.3       pryr_0.1.2          party_1.0-25        strucchange_1.5-1  
[28] sandwich_2.3-4      zoo_1.7-13          modeltools_0.2-21   mvtnorm_1.0-5       e1071_1.6-7         randomForest_4.6-12 caret_6.0-70        lattice_0.20-29     timeDate_3012.100  
[37] Kmisc_0.5.0         reshape2_1.4.1      gridExtra_2.2.1     tidyr_0.5.1         dplyr_0.5.0         plyr_1.8.4          data.table_1.9.6    sendmailR_1.2-1     RPostgreSQL_0.4-1  
[46] ggplot2_2.1.0       lubridate_1.5.6     stringr_1.0.0       sqldf_0.4-10        RSQLite_1.0.0       DBI_0.4-1           gsubfn_0.6-6        proto_0.3-10       

loaded via a namespace (and not attached):
 [1] nlme_3.1-128       pbkrtest_0.4-6     httr_1.2.1         tools_3.3.0        R6_2.1.2           lazyeval_0.2.0     mgcv_1.8-3         colorspace_1.2-6   nnet_7.3-8         withr_1.0.2       
[11] compiler_3.3.0     chron_2.3-47       quantreg_5.26      SparseM_1.7        AUC_0.3.0          digest_0.6.9       minqa_1.2.4        base64enc_0.1-3    lme4_1.1-12        jsonlite_1.0      
[21] car_2.1-2          magrittr_1.5       Matrix_1.2-6       Rcpp_0.12.6        munsell_0.4.3      stringi_1.1.1      multcomp_1.4-6     MASS_7.3-35        crayon_1.3.2       splines_3.3.0     
[31] knitr_1.13         tcltk_3.3.0        codetools_0.2-9    nloptr_1.0.4       MatrixModels_0.4-1 gtable_0.2.0       assertthat_0.1     coin_1.1-2         class_7.3-11       survival_2.39-5   
[41] tibble_1.1         memoise_1.0.0      TH.data_1.0-7


**F. Privé** · Answer 1 · 2017-07-13T08:20:02.967000

This works fine for me:

library(doMC)
library(foreach)
library(itertools)

num_cores <- 2
registerDoMC(num_cores)

predict.fake <- function(object, newdata) {
  rowSums(newdata)
}

cfModel <- structure(list(), class = "fake")  # structure(NULL, ...) is deprecated in newer R versions
test <- matrix(rnorm(2000), ncol = 2)
pred <- predict(cfModel, test)

test_prediction <-
  foreach(d=isplitRows(test, chunks = num_cores),
          .combine=c, 
          .packages=c("stats")) %dopar% {
            predict(cfModel, newdata=d)
          }

all.equal(test_prediction, pred)
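
One possible way to test the theory that whole chunks go missing is to drop .combine, so that foreach returns one list element per chunk, and have each task report how many rows it received alongside its predictions. The sketch below reuses the same kind of stand-in model as above; chunk_results and n_pred are illustrative names, and it is an assumption that a failed worker shows up as an error object, a NULL, or a missing element.

```r
library(doMC)
library(foreach)
library(itertools)

num_cores <- 2
registerDoMC(num_cores)

# Stand-in model and data; replace with the real cfModel and test set.
predict.fake <- function(object, newdata) rowSums(newdata)
cfModel <- structure(list(), class = "fake")
test <- matrix(rnorm(2000), ncol = 2)

# No .combine: the result is a list with one element per chunk.
# .errorhandling = "pass" returns error objects instead of stopping.
chunk_results <-
  foreach(d = isplitRows(test, chunks = num_cores),
          .errorhandling = "pass") %dopar% {
            list(n_in = nrow(d), pred = predict(cfModel, newdata = d))
          }

# Number of predictions per chunk; NA marks a chunk that failed or came back empty.
n_pred <- vapply(chunk_results, function(r) {
  if (is.list(r) && !inherits(r, "condition")) length(r$pred) else NA_integer_
}, integer(1))

print(length(chunk_results) == num_cores)  # did every chunk come back?
print(n_pred)                              # predictions returned per chunk
print(sum(n_pred) == nrow(test))           # do the totals match the input?
```

If a chunk is missing or NA on the real data, that would point at the worker processes (for example running out of memory) rather than at .combine itself.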
