I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.
Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).
The experiment-specific projects should ideally be:
- Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
- Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I'd ideally like not to need to share a monolithic thesis project.
- Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).
At the moment I can see three ways of doing this:
- Re-create the data cleaning process
  - But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
- Access the relevant scripts/functions by changing the working directory
  - But: even if I used `here`, it seems that this would introduce poor reproducibility.
- Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)
  - But: I'd like to avoid the unnecessary metadata, artefacts, and noise (e.g., see Miles McBain's "Project as an R package: An okay idea") if I can.
Is there any other way of doing this?
Edit: I tried @landau's suggestion of using a single drake plan, which worked well for a while, until (similar to @vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. I have therefore added some clarifications above about what I am trying to achieve.
My first recommendation is to use a single `drake` plan to unite the stages of the overall project that need to share data. `drake` is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to `drake`'s decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a `file_out()` file in one plan and track it with `file_in()` in another plan.
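A minimal sketch of the two-plan approach might look like the following. The file path, target names, cache directories, and the use of the built-in `airquality` dataset are all placeholders for illustration. In a real setup each plan would live in its own project, and the downstream path would point at wherever the upstream project writes its file.

```r
library(drake)

# Hypothetical location of the shared file; both plans must refer to the same path.
dir.create("shared", showWarnings = FALSE)

# Plan for the upstream (experiment) project: clean the data and write it to disk.
# file_out() declares that this target produces the file.
upstream_plan <- drake_plan(
  clean_data_file = saveRDS(
    na.omit(airquality),  # stand-in for the real cleaning step
    file_out("shared/clean_data.rds")
  )
)

# Plan for the downstream (e.g., thesis) project: read the shared file.
# file_in() declares that this target depends on the file's contents.
downstream_plan <- drake_plan(
  clean_data = readRDS(file_in("shared/clean_data.rds")),
  n_rows = nrow(clean_data)
)

# Separate caches stand in for the two projects' own .drake/ directories.
make(upstream_plan, cache = new_cache("upstream_cache"))
make(downstream_plan, cache = new_cache("downstream_cache"))
```

With this arrangement, the downstream plan should only rebuild `clean_data` when the contents of the shared file change. Note that the downstream `make()` does not rerun the upstream plan for you, so the upstream project has to be brought up to date first.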