33 Reproducibility and workflow
Your analysis is reproducible when someone else can run your code on their machine and get the same results. That someone is usually you, six months from now, with no memory of what you did.
This chapter is a guided tour, not a deep dive. Each tool (here, renv, Quarto, targets, Git) gets enough coverage to start using it today, with pointers to comprehensive references for when you need more. The goal is to show how the tools fit together into a reproducibility stack, not to replace their documentation.
33.1 What reproducibility means
Reproducibility: same code + same data = same results. This is referential transparency at the project level: your analysis script is an expression, and if it is reproducible, you can replace that expression with its value (the results) without changing meaning. A non-reproducible analysis is a function with hidden side effects; it depends on state you did not declare.
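The contrast is easy to see in miniature. Below is a sketch (the `mean_mass` function and the `body_mass_g` column are hypothetical names, not from a real project): the first form depends on hidden workspace state; the second declares every input.

```r
# Not reproducible: mean(x) depends on whatever `x` happens to be
# sitting in the workspace -- state the script never declared.
# result <- mean(x)

# Reproducible: every input is an explicit argument.
mean_mass <- function(path) {
  data <- read.csv(path)
  mean(data$body_mass_g, na.rm = TRUE)
}
```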
This is not the same as replicability (same question, new data, similar conclusions). Reproducibility is the lower bar, and most analyses fail it.
A 2016 Nature survey found that 70% of researchers had tried and failed to reproduce another scientist’s experiment. A 2019 study found that only 26% of published R-based analyses could be independently reproduced from the provided code. The problem is not malice or incompetence. It is that reproducing work requires infrastructure, and most analysts do not build that infrastructure until it is too late.
Why it breaks:
- Hardcoded file paths ("C:/Users/me/Desktop/data.csv").
- Unrecorded package versions (dplyr 1.0 behaves differently from dplyr 1.1).
- Manual steps between scripts (“run this, then open Excel, then paste the table…”).
- Random seeds not set.
- Outputs separated from the code that produced them.
The fix is structural, not behavioral. Do not rely on discipline (“I will remember to run script 3 first”). Build a system that makes the wrong thing hard. Each tool in this chapter eliminates a specific failure mode: here eliminates path problems, renv eliminates version problems, Quarto eliminates the code/output sync problem, targets eliminates the run-order problem, Git eliminates the “which version was it?” problem.
If your analysis requires a README that says “run these scripts in this order, but skip step 4 on Tuesdays”, it is not reproducible. It is a ritual.
33.2 Projects and file paths
An RStudio project (.Rproj file) defines a working directory. Everything is relative to it. When you open the project, the working directory is set automatically.
here::here() builds paths relative to the project root:

# Works regardless of where the script lives in the project
data <- read.csv(here::here("data", "penguins.csv"))
here::here("data", "penguins.csv") produces something like "C:/Users/me/projects/analysis/data/penguins.csv", but you never type that path. It works on any OS (forward slashes on Windows too), in any subdirectory, in Quarto documents, in test files, and in interactive sessions. It even works when you source() a script from a different directory, because it finds the project root, not the script’s location.
Don’t use setwd(). It creates a hidden dependency on your computer’s directory structure. Avoid absolute paths too; they break on every other machine. And don’t use source("C:/Users/me/Desktop/functions.R"). It ties your code to one computer.
The here package solves this by detecting the project root (the directory containing .Rproj, .here, or .git) and building every path relative to it.
A clean project layout:
my_analysis/
├── my_analysis.Rproj
├── data/ # raw data (read-only)
├── R/ # functions
├── analysis/ # scripts or Quarto documents
├── output/ # results, figures, tables
└── renv.lock # dependency lockfile
Raw data is sacred: never modify it. Read it, transform it in code, write outputs elsewhere. If someone asks “where did this number come from?”, you can trace it from the output file back through the code to the raw data.
A common mistake is placing data-cleaning code in a different project from the analysis. Then reproducing the analysis requires finding and running a separate project first. Keep everything in one project, or use targets (Section 33.5) to formalize the dependency.
Another common mistake: using rm(list = ls()) at the top of a script to “clean up.” This clears your workspace but does not restart R. Hidden state (loaded packages, modified options, changed working directories) persists. Instead, restart R (Ctrl+Shift+F10 in RStudio) for a true clean slate.
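You can watch the hidden state survive the wipe. A sketch using a harmless option change:

```r
options(digits = 3)     # hidden state: a changed global option
x <- 1
rm(list = ls())         # the workspace is now empty...
exists("x")             # FALSE
getOption("digits")     # ...but the option change survives: still 3
```

Only restarting R resets options, unloads packages, and restores the working directory.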
If you would not email your project folder to a collaborator and expect it to work, it is not organized well enough.
Exercises
- Create an RStudio project with the layout above. Place a CSV file in data/. Write a script in analysis/ that reads it using here::here(). Verify it works by opening the project fresh and running the script.
- Try here::here("data", "penguins.csv") from both the project root and a subdirectory. Does it return the same path?
- Open an old script of yours that uses setwd() or absolute paths. Rewrite it to use here::here(). Does it still work when you move the project to a different folder?
33.3 renv: locking dependencies
install.packages("dplyr") today gives you a different version than six months from now. Your code might break. Your results might change. You will not know why, because nothing in your code changed.
renv isolates your project’s package library:
renv::init() # create a project-local library
renv::snapshot() # record exact versions in renv.lock
renv::restore()  # install exactly the versions in the lockfile

renv::init() creates a private library for this project. Your package versions are isolated from every other project on your machine.
renv::snapshot() writes renv.lock, a JSON file listing every package and its exact version. This is your dependency manifest. Commit it to Git.
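An abridged sketch of what the lockfile looks like (the version numbers here are made up; the real file also records repositories and a hash for each package):

```json
{
  "R": {
    "Version": "4.4.0"
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.1.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```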
renv::restore() reads renv.lock and installs exactly those versions. A collaborator clones your repo, runs renv::restore(), and gets the same packages you used.
The workflow: init() once, snapshot() after installing or updating packages, commit renv.lock to Git. That is all.
renv does not version R itself. If the R version matters, document it. Docker captures it completely (see Section 33.9).
A common question: should you commit renv.lock to Git? Yes, always. It is small (a few KB of JSON), and it is the entire point of renv. Without the lockfile in version control, a collaborator cannot restore your environment. The renv/ directory itself (which contains the actual installed packages) is typically git-ignored.
renv.lock is boring. That is the point. Boring is reproducible.
One more practical detail: when you switch between projects that use renv, you do not need to do anything special. Each project has its own library. When you open a project with renv, it activates automatically. Packages installed in one project are invisible to another. This isolation prevents the classic problem of updating a package for one project and breaking another.
Exercises
- Run renv::init() in a project. Install a package with install.packages(). Run renv::snapshot(). Open renv.lock and find the package and its version number.
- Delete the package from your library (simulate a fresh machine). Run renv::restore(). Verify the package is back.
- Open renv.lock in a text editor. What information does it store besides package names and versions?
33.4 Quarto: code and prose together
Quarto documents (.qmd) combine narrative text, code chunks, and their output in a single document. When you render the document, the code runs and the results appear inline. The document is the analysis.
---
title: "Penguin Analysis"
format: html
---
## Body mass by species

library(palmerpenguins)
#>
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#>
#> penguins, penguins_raw
library(ggplot2)
penguins |>
ggplot(aes(species, body_mass_g)) +
geom_boxplot()
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
There is no separate script that makes the figures and a Word document that describes them. They are one thing. If the data changes, re-render and everything updates.
Quarto is the successor to R Markdown. The core idea, literate programming, dates back to Donald Knuth (1984): write programs as documents where the prose explains the code, not the other way around. The lineage runs from WEB (Pascal/TeX) through Sweave (2002), knitr (2012), and R Markdown to Quarto (2022), each generation adding better tooling and broader language support. If you know R Markdown, you know Quarto. The main differences: Quarto uses `#|` for chunk options (instead of the chunk header), supports Python, Julia, and Observable in addition to R, and has better defaults for academic output (cross-references, citations, callouts).
Output formats: HTML, PDF, Word, presentations, websites, books. This book is written in Quarto.
Chunk options control behavior. Place them at the top of each chunk with the `#|` prefix:
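For example, a chunk that hides its code, suppresses warnings, and adds a figure caption might look like this (a sketch; the label and caption text are placeholders):

```r
#| label: fig-mass
#| echo: false
#| warning: false
#| fig-cap: "Body mass by species"
library(ggplot2)
library(palmerpenguins)
ggplot(penguins, aes(species, body_mass_g)) +
  geom_boxplot()
```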
`echo: false` hides the code (shows only the output). `eval: false` shows the code but does not run it. `fig-cap` adds a caption. `cache: true` caches results so unchanged chunks do not re-run. These options give you fine control over what the reader sees.
The key principle: the document is the analysis. Code, prose, and results live together. When they are separate, they drift apart. When they are together, they stay in sync.
For academic writing, Quarto supports citations natively. Place a `.bib` file in your project and reference entries with `@key`:
---
title: "My Analysis"
bibliography: references.bib
---

As shown by @wickham2019, tidy data principles simplify analysis.
The citation is rendered in the output and a reference list is appended automatically. No more manually typed reference sections that drift from the actual citations.
Exercises
- Create a .qmd file with a title, a code chunk that loads a dataset, and a code chunk that makes a plot. Render it to HTML. Change the data and re-render.
- Add echo: false to a code chunk. What changes in the rendered output?
33.5 targets: pipeline automation
Problem: your analysis has multiple steps (clean data, fit model, make figures, render report). Running them in the right order is manual and error-prone. Changing the data means re-running everything, even steps whose inputs did not change.
targets solves this. You define a pipeline: each step (target) is a function call with declared inputs and outputs. targets tracks what changed and only re-runs what is necessary.
# _targets.R
library(targets)
tar_option_set(packages = c("dplyr", "ggplot2"))
list(
tar_target(raw_data, read.csv(here::here("data", "penguins.csv"))),
tar_target(clean_data, clean_penguins(raw_data)),
tar_target(model, lm(body_mass_g ~ species, data = clean_data)),
tar_target(fig, plot_results(model))
)

tar_make() runs the pipeline. tar_visnetwork() visualizes the dependency graph. tar_read(model) retrieves a cached result.
Each target is a function call. Functions are the unit of computation (Chapter 7); targets are the unit of caching. If raw_data has not changed, clean_data does not re-run. If you change plot_results(), only fig re-runs.
Not every project needs targets. A single Quarto document is fine for simple analyses. Use targets when you have long-running steps, many interdependent outputs, or pipelines that change frequently.
The real power of targets shows when your pipeline is expensive. If a model takes an hour to fit, you do not want to re-fit it every time you tweak a plot. targets caches the model result and only re-fits when the model code or its inputs change. The plot target re-runs in seconds because it reads the cached model.
tar_visnetwork() produces a dependency graph showing which targets are up to date (green), outdated (blue), or errored (red). This makes complex pipelines much easier to debug.
targets is overkill for a homework assignment and essential for a thesis chapter. Know where your project falls.
A key design principle: each target should be defined by a function, not raw code. Instead of writing the transformation inline, write a function in R/ and call it from the target. This makes targets testable (you can test the function independently) and readable (the pipeline reads like a table of contents):
# R/functions.R
clean_penguins <- function(data) {
data[!is.na(data$body_mass_g), ]
}
plot_results <- function(model) {
plot(model, which = 1)
}

# _targets.R
source("R/functions.R")
list(
tar_target(raw, read.csv(here::here("data", "penguins.csv"))),
tar_target(clean, clean_penguins(raw)),
tar_target(model, lm(body_mass_g ~ species, data = clean)),
tar_target(fig, plot_results(model))
)

Exercises
- Define a small targets pipeline with three steps: read data, compute a summary, make a plot. Run it with tar_make(). Change the summary function and run tar_make() again. Observe which targets re-run and which are skipped.
- Run tar_visnetwork() on your pipeline. What do the colors mean?
33.6 Version control with Git
Git tracks changes to files over time. Every change is recorded as a commit: a snapshot of your project at one moment, with a message describing what changed and why. You can go back to any previous snapshot, compare any two, and branch your work into parallel lines that merge back together.
Why Git for data analysis:
- Undo mistakes without losing work. Every commit is a checkpoint you can return to. Accidentally deleted a function? git checkout -- file.R brings it back. Broke your analysis and cannot figure out what changed? git diff shows exactly which lines differ from the last working state.
- Track the evolution of your analysis. Three months from now, a reviewer asks why you removed a covariate. git log shows the commit where you removed it, with a message explaining the decision. Without Git, that reasoning is lost.
- Collaborate without emailing files back and forth. Multiple people can work on the same project, each on their own branch, and Git merges their changes. Conflicts (two people editing the same line) are surfaced explicitly, not silently overwritten.
- Share and cite. GitHub and GitLab give your project a permanent URL. Zenodo can mint a DOI for a specific release, making your code citable in papers.
The daily workflow
Git has many commands. You need six:
git status # what has changed?
git add file.R # stage a file for the next commit
git commit -m "Add bootstrap analysis" # record the snapshot
git log --oneline # see recent history
git diff # see unstaged changes line by line
git push                # send commits to GitHub

The mental model: you make changes to files, git add selects which changes to include in the next snapshot, git commit takes the snapshot with a message, and git push sends it to a remote server. git pull does the reverse: it fetches commits from the remote and incorporates them into your local copy.
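The whole loop fits in a throwaway repository. A sketch (the directory and file names are arbitrary, and it assumes git is installed and on your PATH):

```shell
demo=$(mktemp -d)              # throwaway directory
cd "$demo"
git init -q
git config user.email "you@example.com"
git config user.name  "You"

echo 'x <- 1' > analysis.R
git status --short             # shows '?? analysis.R' (untracked)
git add analysis.R
git commit -q -m "Add first analysis script"
git log --oneline              # one commit in the history
```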
RStudio has a built-in Git pane that shows modified files and lets you stage changes, write commit messages, and push to GitHub without touching the terminal. For setup, usethis::use_git() initializes a repository, and usethis::use_github() creates a remote on GitHub and pushes your code in one step.
Branching
Branches let you try something without risking the main line of work:
git branch try-new-model # create a branch
git checkout try-new-model # switch to it
# ... make changes, commit ...
git checkout main # switch back
git merge try-new-model     # incorporate the branch's changes

If the experiment works, merge it. If it doesn’t, delete the branch. The main branch is untouched either way. This is the functional programming principle applied to project management: branches are like closures that capture a state and let you explore without side effects on the original.
What belongs in Git
Commit: code, documentation, renv.lock, _targets.R, .gitignore, small data files (under a few MB), Quarto source files.
Do not commit: large data files, generated output (figures, HTML reports, _targets/ cache), secrets and passwords, .Rhistory, .RData, .DS_Store.
.gitignore tells Git what to skip. A good starting point for R projects:
.Rhistory
.RData
.Rproj.user
_targets/
docs/
*.html
*.pdf

usethis::use_git_ignore() helps set it up.
Commit messages
Write commit messages that describe why you changed something, not what you changed. Git already records what changed (every added and deleted line); your message adds the reasoning that the diff cannot show.
Good: “Remove species interaction term (AIC worse by 4.2).” Bad: “Update analysis.R.” Good: “Fix off-by-one in bootstrap loop causing n+1 resamples.” Bad: “Bug fix.”
Your future self will search through git log when something breaks. Those messages are the only documentation of your decision-making process.
Going further
This section covers enough Git to track your work, collaborate, and recover from mistakes. It does not cover rebasing, cherry-picking, bisecting, or other advanced operations. Jenny Bryan’s Happy Git and GitHub for the useR (happygitwithr.com) is the comprehensive reference for R users and covers installation, SSH keys, merge conflicts, and common workflows in detail.
33.7 set.seed() and session info
Any analysis involving randomness (simulation, resampling, train/test splits) needs set.seed():
set.seed(42)
sample(1:100, 5)
#> [1] 49 65 25 74 18

Without set.seed(), every run produces different numbers. With it, the same seed always produces the same sequence. Place it at the top of your script or Quarto document. The specific number does not matter (42, 123, 2024, anything), but it must be fixed and documented.
Note: R 3.6.0 changed the default random number generator. Code using set.seed() before and after 3.6.0 may produce different sequences. If exact reproducibility across R versions matters, specify the RNG kind: set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion").
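A quick way to convince yourself that the seed pins down the sequence:

```r
set.seed(123)
a <- sample(1:100, 5)

set.seed(123)            # same seed...
b <- sample(1:100, 5)    # ...same draw

identical(a, b)          # TRUE
```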
At the end of reports, record your environment:
sessionInfo()
# or
sessioninfo::session_info()

This captures R version, OS, and package versions. If results differ on another machine, the session info is the first place to look.
These are small things, but they make the difference between “I can reproduce this” and “I cannot.”
Exercises
- Run the same sample() call twice without set.seed(). Do you get the same result? Now add set.seed(123) before each call. What changes?
- Run sessionInfo() and find: your R version, your operating system, and the version of a package you use frequently.
33.8 The reproducibility stack
A summary of tools and when to use each:
- RStudio projects + here: always. Zero cost, immediate benefit.
- Git: always, for anything beyond a one-off exploration.
- renv: for any project that needs to produce the same results later.
- Quarto: for any analysis that produces a report or document.
- targets: for multi-step pipelines with long-running computations.
- Docker: for complete environment reproducibility (R version, system libraries). Introduced briefly in Section 33.9.
You do not need all of these for every project. Start with projects + here + Git. Add renv when versions matter. Add Quarto when you write reports. Add targets when your pipeline grows.
Here is a decision guide:
| If your project… | You need |
|---|---|
| Has any R code at all | RStudio project + here |
| Will exist for more than a week | Git |
| Uses packages that update | renv |
| Produces a report or paper | Quarto |
| Has steps that take >30 seconds | targets |
| Must run on a different OS | Docker or Nix |
Each layer solves a specific failure mode. Together, they make “works on my machine” a non-issue.
The common thread is automation. here automates path construction. renv automates dependency tracking. Quarto automates the link between code and output. targets automates pipeline execution. Git automates change tracking. None of these tools require heroic discipline. They require a one-time setup cost, and then they work silently in the background, preventing the mistakes you would otherwise make by hand.
33.9 Beyond renv: full-stack reproducibility
renv captures R package versions. It does not capture R itself, system libraries (libcurl, GDAL, libxml2), or the C compiler. When your code depends on any of these, renv alone is not enough.
Docker solves this by bundling everything into a container: OS, R, system libraries, packages, your code. The Rocker project (rocker-project.org) provides pre-built R images:
# Dockerfile
FROM rocker/r-ver:4.4.0
RUN install2.r dplyr ggplot2
COPY . /analysis
CMD ["Rscript", "analysis/main.R"]

A Dockerfile is a recipe that always produces the same environment. Build once, run anywhere. The trade-off is complexity: Docker adds a layer of tooling (images, containers, registries) that takes time to learn.
Nix takes a different approach. Where Docker isolates by containerization, Nix isolates by functional purity. The Nix package manager treats every package as a pure function from its inputs (source code, dependencies, compiler flags) to its output (the built artifact). Fix all inputs, get a deterministic output. This is referential transparency applied to software builds, the same principle you saw applied to R expressions in Section 33.1.
The rix R package makes Nix accessible from R:
library(rix)
rix(
r_ver = "4.4.0",
r_pkgs = c("dplyr", "ggplot2", "palmerpenguins"),
system_pkgs = NULL,
ide = "rstudio",
project_path = "."
)

This generates a default.nix file that pins everything: R version, package versions, system libraries, even the C compiler. Running nix-build on any machine with Nix installed produces an identical environment. Where renv.lock captures one layer (R packages), default.nix captures all layers.
The practical trade-off: Nix has a steep learning curve and limited Windows support (it runs natively on Linux and macOS, or via WSL on Windows). Docker is more widely adopted and better supported on all platforms. Both achieve full-stack reproducibility; they differ in mechanism and philosophy. Docker says “ship the whole machine.” Nix says “describe the machine as a function and let anyone rebuild it.”
For most R users, renv is sufficient. Add Docker or Nix when your results depend on system-level components, when you need to guarantee reproducibility across operating systems, or when a collaborator reports “it doesn’t work on my machine” and the problem is not an R package version.