I have a minimal reproducible example script below that writes identical plots to two pairs of pdf files: first serially, using a standard for loop, and then in parallel, using R's foreach() %dopar% construct:
library(ggplot2)
library(parallel)
library(doParallel)
library(foreach)
# Print an arbitrary dummy plot (from the standard "cars" data set) to a
# specific integer graphical device number.
makeplot <- function(graph_dev) {
    dev.set(graph_dev)
    plt <- ggplot(cars) + geom_point(aes(x=speed, y=dist))
    # Print the same plot repeatedly 10 times, on 10 sequential pages, in
    # order to purposefully bloat up the file size a bit and convince
    # ourselves that actual plot content is really being saved to the file.
    for(ii in seq(10)) {print(plt)}
}
# A pair of pdf files that we will write serially, on a single processor
fser <- c('test_serial_one.pdf', 'test_serial_two.pdf')
# A pair of pdf files that we will write in parallel, on two processors
fpar <- c('test_parallel_one.pdf', 'test_parallel_two.pdf')
# Open all four pdf files, and generate a key-value pair assigning each
# file name to an integer graphical device number
fnmap <- list()
for(f in c(fser, fpar)) {
    pdf(f)
    fnmap[[f]] <- dev.cur()
}
# Loop over the first two pdf files using a basic serial "for" loop
for(f in fser) {makeplot(fnmap[[f]])}
# Do the same identical loop content as above, but this time using R's
# parallelization framework, and writing to the second pair of pdf files
registerDoParallel(cl=makeCluster(2, type='FORK'))
foreach(f=fpar) %dopar% {makeplot(fnmap[[f]])}
# Close all four of the pdf files
for(f in names(fnmap)) {
    dev.off(fnmap[[f]])
}
The first two output files, test_serial_one.pdf and test_serial_two.pdf, each have a final file size of 38660 bytes and can be opened and displayed correctly using a standard pdf reader such as Adobe Acrobat Reader or similar.
The other two output files, test_parallel_one.pdf and test_parallel_two.pdf, each have a final file size of 34745 bytes, but attempting to open them with the same tools produces a file-corruption error, e.g. "There was an error opening this document. This file cannot be opened because it has no pages."
The fact that the serial and parallel file sizes are roughly equal suggests to me that the reader's error message is misleading: the parallel loop does appear to be dumping page content to the files just as the serial loop does, and instead some kind of footer information is probably missing after the page content in the parallelized output files, presumably because those two files are never closed successfully.
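One way to check this is to look for the pdf end-of-file marker near the end of each file: a properly closed pdf finishes with a cross-reference table, a trailer, and a final %%EOF line. The helper below (has_pdf_eof is just an illustrative name, not part of the script above) tests for that marker in the last kilobyte of each output file:
# Return TRUE if the last kilobyte of the file contains the "%%EOF" marker
# that a properly closed pdf should end with
has_pdf_eof <- function(f) {
    bytes <- readBin(f, what='raw', n=file.info(f)$size)
    length(grepRaw('%%EOF', tail(bytes, 1024), fixed=TRUE)) > 0
}
sapply(c(fser, fpar), has_pdf_eof)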
For various technical reasons, I would like to be able to open and close multiple pdf files outside of a foreach() %dopar% construct, while using dev.set() inside the parallelized loop to choose which file gets written on each iteration.
What is the root cause of the file corruption that occurs in the parallelized loop in this example? And how can I correct it: i.e., how can I modify my code so that the files are closed properly and the necessary pdf footer information is appended after the parallelized loop has finished?
The root cause is that the forked processes share some of the graphics device pipeline despite being assigned different files. Using an MPI backend, or writing the code as SPMD for an HPC cluster, will give you as many independent R sessions (and graphics pipelines) as ranks. Below is your example code translated into SPMD form using the pbdMPI package:
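(The listing below is a minimal sketch of such a translation rather than a verbatim drop-in: it reuses the file names from your script, has each rank open, print to, and close its own pdf device, and assumes comm.chunk() with form='vector', as provided by recent pbdMPI versions.)
library(ggplot2)
library(pbdMPI)

init()

# Each rank writes a complete pdf file on its own: open the device, print the
# pages, and close the device, all within the same R process.
makeplot <- function(fname) {
    pdf(fname)
    plt <- ggplot(cars) + geom_point(aes(x=speed, y=dist))
    # Print the same plot repeatedly 10 times, on 10 sequential pages, as in
    # the original example.
    for(ii in seq(10)) {print(plt)}
    dev.off()
}

# Serial section: only rank 0 writes the two "serial" files
fser <- c('test_serial_one.pdf', 'test_serial_two.pdf')
if(comm.rank() == 0) {
    for(f in fser) {makeplot(f)}
}

# Parallel section: comm.chunk() hands each rank its own share of the file
# indices, so with two ranks each rank writes one of the "parallel" files
fpar <- c('test_parallel_one.pdf', 'test_parallel_two.pdf')
my_idx <- comm.chunk(length(fpar), form='vector')
for(i in my_idx) {makeplot(fpar[i])}

finalize()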
Save this in your_file_name.R and run it with mpirun -np 2 Rscript your_file_name.R.
Note that SPMD is a generalization of a serial code into a form in which several copies of it collaborate. There is no manager code, just collaboration. In that sense, SPMD parallelization is the opposite of the manager-workers code you wrote, where the default is serial and you specify the parallel sections: in SPMD, the default is parallel and you specify the serial sections. The if(comm.rank() == 0) says that only rank 0 runs that part, while comm.chunk() returns different results on each parallel rank. Remove the serial sections to time the parallel speedup.
See the pbdMPI package (the GitHub version at RBigData/pbdMPI is more up to date) for more information, especially on data communication and reduction with the gather/allgather and reduce/allreduce collectives. On HPC clusters, MPI is the overwhelming standard for distributed parallelization. It also works on multicore laptops, but there it can carry a memory penalty compared to the unix fork.
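As a small illustration (not taken from the package documentation) of that collective pattern, each rank can compute a partial result and then combine it with the others:
library(pbdMPI)
init()

# Each rank sums its own chunk of the numbers 1..100 ...
my_part <- sum(comm.chunk(100, form='vector'))

# ... and the partial results are then combined across ranks
total    <- allreduce(my_part, op='sum')   # every rank receives the global sum (5050)
all_sums <- allgather(my_part)             # every rank receives a list of all partial sums

comm.print(total)   # by default, printed by rank 0 only
finalize()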