What is the best practice for transferring objects across R projects?


I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.

Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).

The experiment-specific projects should ideally be:

  1. Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
  2. Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
  3. Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).

At the moment I can see three ways of doing this:

  1. Re-create the data cleaning process
    • But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
  2. Access the relevant scripts/functions by changing the working directory
    • But: even if I used the here package, it seems this would hurt reproducibility.
  3. Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)
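For option 3, a minimal sketch of the workflow, assuming a source project restructured as a package (the package name cleaningproject and the object cleaned_data are illustrative):

```r
# In the source project (now a package), export the cleaned data.
# usethis::use_data() saves the object to data/cleaned_data.rda:
#   usethis::use_data(cleaned_data, overwrite = TRUE)

# In a downstream project, install a tagged version of the package
# and use the exported dataset like any package data:
library(cleaningproject)  # hypothetical package name
head(cleaned_data)        # lazily loaded exported dataset
```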

Is there any other way of doing this?

Edit: I tried @landau's suggestion of using a single drake plan, which worked well for a while, until (similar to @vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. Therefore, I added some clarifications above to my intentions with the question.


3 Answers

Answer by landau (accepted)

My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to drake's decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.

library(drake)
library(readr)

# Upstream project: write the shared dataset and declare the file
# as an output with file_out() so drake tracks it.
upstream_plan <- drake_plan(
  export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
)

# Downstream project: declare the same file as an input with file_in()
# so drake reruns dependents whenever the upstream file changes.
downstream_plan <- drake_plan(
  dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
)
Answer by Konrad Rudolph

You fundamentally misunderstood Miles McBain’s critique. He isn’t saying that you shouldn’t write reusable code, or that you shouldn’t use packages. He’s saying that you shouldn’t use packages for everything. But reusable code (i.e. code that you want to reuse) absolutely belongs in packages (or, better, modules), which can then be used in multiple projects.

That being said, first off, pay attention to Will Landau’s advice.

Secondly, you can make your RStudio projects configurable so that they load data from paths given in a configuration file. Once that’s in place, there is nothing wrong with hard-coding paths to data from different projects inside that config file.
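A minimal sketch of such a setup, assuming the config package and a config.yml at the project root (the file layout and the upstream_data key are illustrative):

```r
# config.yml (illustrative):
# default:
#   upstream_data: "../upstream_project/exported_data/dataset.csv"

library(config)
library(readr)

# config::get() reads config.yml from the working directory
# and returns the active configuration as a list.
paths <- config::get()
dataset <- read_csv(paths$upstream_data)
```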

Answer by vrognas

I am in a similar situation. I have many projects that are spawned from one raw dataset. Previously, when the project was young and small, I had it all in one version-controlled project. This got out of hand as more sub-projects were spawned, and my git history got cluttered from working on projects in parallel. This could be due to my lack of git skills. My folder structure looked something like this:

project/.git  
project/main/  
project/sub-project_1/  
project/sub-project_2/  
project/sub-project_n/

I contemplated having each project in its own git branch, but then I could not access them simultaneously. If I had to change something in the main dataset (e.g., parts I had not yet cleaned), project 1 could become outdated and nonfunctional. Once I had finished project 1, I would have liked it to be isolated and self-contained for reproducibility. This is easier to achieve if the projects are separated. I don't think a drake/targets plan would solve this?

I also looked briefly into having the projects as git submodules but it seemed to add too much complexity. Again, my git ignorance might shine through here.

My current solution is to keep the main data in an R package, with each sub-project as a separate git-versioned folder (they happen to be packages as well, but this is not necessary). This way I can load a specific version of the data (using renv to pin package versions).

My folder structure now looks something like this:

main/.git  
sub-project_1/.git  
sub-project_2/.git  
sub-project_n/.git

And inside each sub-project, I call library(main) to load the cleaned data. Within each sub-project, a drake/targets plan could be used.
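A sketch of how a sub-project could pin a specific version of the main data package with renv (the remote spec "user/main" and the tag "v1.0.0" are illustrative, as is the dataset name):

```r
# Install a specific tagged release of the "main" package
# from its git remote, recorded in the renv lockfile:
renv::install("user/main@v1.0.0")

# Then load the cleaned data exported by that package version:
library(main)
data(cleaned_data)  # hypothetical dataset exported by the package
```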