33  Reproducibility and workflow

You finished an analysis six months ago, and now a reviewer wants you to re-run it with one variable removed. You open the project, hit “Source,” and nothing works: a package updated, a file path points to a folder that no longer exists, and you cannot remember whether you were supposed to run the cleaning script before or after the imputation script. The analysis produced correct results once. It will never produce them again.

This chapter walks through five tools (here, renv, Quarto, targets, Git) that prevent exactly this situation, giving each one enough coverage to start using it today while pointing to deeper references when you need them. Each layer eliminates a specific way that analyses break.

33.1 What reproducibility means

Same code plus same data should equal the same results. That is reproducibility, and if you squint, it looks like referential transparency at the project level: your analysis script is an expression, and if that expression is reproducible, you can replace it with its value (the results) without changing meaning. A non-reproducible analysis is a function with hidden side effects; it depends on state you never declared.

Every tool in this chapter addresses one category of hidden state. here makes file paths explicit, removing the dependency on which directory you happened to open. renv pins package versions to the project, so a colleague installing your code six months from now gets the same library you tested against. Quarto binds code to output in a single document, making it impossible for the prose and the numbers to drift apart. targets declares dependencies between pipeline steps and only reruns what has changed, turning a sequential script into a dependency graph that skips unnecessary work. Git records the reasoning behind each change, not just snapshots. Together these tools enforce the functional principle of referential transparency at the scale of a project: swap one execution for another and get identical results.

This is not replicability, which asks a bigger question: same research question, new data, similar conclusions. Reproducibility is the lower bar. Most analyses fail it anyway.

A 2016 Nature survey found that 70% of researchers had tried and failed to reproduce another scientist’s experiment, and a 2019 study found that only 26% of published R-based analyses could be independently reproduced from the provided code. Reproducing work requires infrastructure, and most analysts do not build infrastructure until something has already gone wrong.

Why it breaks:

  • Hardcoded file paths ("C:/Users/me/Desktop/data.csv").
  • Unrecorded package versions (dplyr 1.0 behaves differently from dplyr 1.1).
  • Manual steps between scripts (“run this, then open Excel, then paste the table…”).
  • Random seeds not set.
  • Outputs separated from the code that produced them.

The fix is structural, not behavioral. Do not rely on discipline (“I will remember to run script 3 first”); build a system that makes the wrong thing hard. Each tool in this chapter eliminates one failure mode: here eliminates path problems, renv eliminates version problems, Quarto eliminates the code/output sync problem, targets eliminates the run-order problem, and Git eliminates the “which version was it?” problem. The first one costs nothing to adopt.

Opinion

If your analysis requires a README that says “run these scripts in this order, but skip step 4 on Tuesdays,” it is not reproducible. It is a ritual.

33.2 Projects and file paths

Every path you hardcode is a bet that your directory structure will never change. You will lose that bet. An RStudio project (.Rproj file) defines a working directory so that everything becomes relative to one root, and when you open the project, the working directory is set automatically.

here::here() builds paths relative to that root:

# Works regardless of where the script lives in the project
data <- read.csv(here::here("data", "penguins.csv"))

The call here::here("data", "penguins.csv") returns an absolute path on the fly — something like "C:/Users/me/analysis/data/penguins.csv" — but you never type that path yourself. It works on any OS (forward slashes on Windows too), in any subdirectory, in Quarto documents, in test files, and in interactive sessions. It even works when you source() a script from a different directory, because it finds the project root, not the script’s location.

Avoid setwd(), which creates a hidden dependency on your computer’s directory structure. Avoid absolute paths, which break on every other machine. And never write source("C:/Users/me/Desktop/functions.R"), because that ties your code to one computer and one human’s folder layout.

The here package detects the project root (the directory containing .Rproj, .here, or .git) and builds every path relative to it. A clean project layout looks like this:

my_analysis/
├── my_analysis.Rproj
├── data/               # raw data (read-only)
├── R/                  # functions
├── analysis/           # scripts or Quarto documents
├── output/             # results, figures, tables
└── renv.lock           # dependency lockfile

Raw data is sacred: never modify it. Read it, transform it in code, write outputs elsewhere. If someone asks “where did this number come from?”, you can trace it from the output file back through the code to the raw data, and that traceability breaks the moment code and data live in different places.

A common mistake is placing data-cleaning code in a different project from the analysis, so that reproducing the analysis requires finding and running a separate project first. Keep everything in one project, or use targets (Section 33.5) to formalize the dependency.

Another common mistake: using rm(list = ls()) at the top of a script to “clean up.” This clears your workspace but does not restart R, so hidden state (loaded packages, modified options, changed working directories) persists. Instead, restart R (Ctrl+Shift+F10 in RStudio) for a true clean slate. Clean paths solve the “where is my file?” problem; but what about the “which version of dplyr was I using?” problem?
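A minimal demonstration of that hidden state (the digits option here stands in for any session-level setting):

```r
options(digits = 3)   # session-level state, not an object in the workspace

rm(list = ls())       # clears every object in the global environment...

getOption("digits")   # ...but the modified option survives
#> [1] 3
```

Restarting R resets the option; rm(list = ls()) never will.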

Opinion

If you would not email your project folder to a collaborator and expect it to work, it is not organized well enough.

Exercises

  1. Create an RStudio project with the layout above. Place a CSV file in data/. Write a script in analysis/ that reads it using here::here(). Verify it works by opening the project fresh and running the script.
  2. Try here::here("data", "penguins.csv") from both the project root and a subdirectory. Does it return the same path?
  3. Open an old script of yours that uses setwd() or absolute paths. Rewrite it to use here::here(). Does it still work when you move the project to a different folder?

33.3 renv: locking dependencies

install.packages("dplyr") today gives you a different version than six months from now, and when your code breaks or your results silently change, nothing in your source files will explain why.

renv isolates your project’s package library:

renv::init()       # create a project-local library
renv::snapshot()   # record exact versions in renv.lock
renv::restore()    # install exactly the versions in the lockfile

renv::init() creates a private library for this project, isolating your package versions from every other project on your machine. renv::snapshot() writes renv.lock, a JSON file listing every package and its exact version: your dependency manifest, which you should commit to Git. renv::restore() reads that lockfile and installs exactly those versions, so a collaborator who clones your repo and runs renv::restore() gets the same packages you used.

The workflow: init() once, snapshot() after installing or updating packages, commit renv.lock to Git. That is all.

renv does not version R itself. If the R version matters, document it; Docker can pin it as well (see Section 33.9). Should you commit renv.lock to Git? Always. It is small (a few KB of JSON), and it is the entire point of renv. Without the lockfile in version control, a collaborator cannot restore your environment. The renv/ directory itself, which contains the actual installed packages, is typically git-ignored.
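For orientation, a trimmed renv.lock looks roughly like this (a hypothetical excerpt: the field names follow the lockfile format, but the versions shown are illustrative):

```json
{
  "R": {
    "Version": "4.4.0",
    "Repositories": [
      { "Name": "CRAN", "URL": "https://cloud.r-project.org" }
    ]
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.1.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
```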

Opinion

renv.lock is boring. That is the point. Boring is reproducible.

One practical detail worth knowing: when you switch between projects that use renv, you do not need to do anything special, because each project has its own library and renv activates automatically when you open the project. Packages installed in one project are invisible to another, which prevents the classic problem of updating a package for one analysis and breaking a different one. Code and dependencies are now pinned; but what about the prose, the figures, the narrative that wraps around your analysis?

Exercises

  1. Run renv::init() in a project. Install a package with install.packages(). Run renv::snapshot(). Open renv.lock and find the package and its version number.
  2. Delete the package from your library (simulate a fresh machine). Run renv::restore(). Verify the package is back.
  3. Open renv.lock in a text editor. What information does it store besides package names and versions?

33.4 Quarto: code and prose together

You have a script that produces three figures. You have a Word document that describes those figures. You update the script, re-run it, forget to update the Word document, and now the text describes figures that no longer exist. The architecture guarantees this failure: code and prose live in separate files, and nothing keeps them in sync.

Quarto documents (.qmd) eliminate the gap by combining narrative text, code chunks, and their output in a single file. When you render the document, the code runs and the results appear inline:

---
title: "Penguin Analysis"
format: html
---

## Body mass by species

```{r}
library(palmerpenguins)
library(ggplot2)
penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_boxplot()
```

No separate script that makes the figures, no separate Word document that describes them. If the data changes, re-render and everything updates. To render the document, run from the terminal:

quarto render analysis.qmd           # defaults to the format in the YAML header
quarto render analysis.qmd --to pdf  # override the output format
quarto render analysis.qmd --to docx

The format field in the YAML header sets the default (html, pdf, docx, revealjs for slides), and the --to flag overrides it. HTML needs no extra software. PDF requires a LaTeX installation; quarto install tinytex handles that. Word output produces a .docx that collaborators who do not use R can read and comment on. In RStudio, the “Render” button (Ctrl+Shift+K) does the same thing without the terminal.

Quarto is the successor to R Markdown, and the core idea (literate programming) dates back to Donald Knuth in 1984: write programs as documents where the prose explains the code, not the other way around. The lineage runs from WEB (Pascal/TeX) through Sweave (2002), knitr (2012), R Markdown, and finally Quarto (2022), each generation bringing better tooling and broader language support. If you know R Markdown, you already know Quarto. The main differences: Quarto uses #| for chunk options instead of the chunk header, supports Python, Julia, and Observable in addition to R, and has better defaults for academic output (cross-references, citations, callouts).

Output formats include HTML, PDF, Word, presentations, websites, and books (this book is written in Quarto). The format is flexible, but the real control comes from chunk options, which you place at the top of each chunk with the #| prefix:
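For example, a chunk that hides its code, captions its figure, and caches its result (the caption text is illustrative; the plot mirrors the earlier penguins example):

```{r}
#| echo: false
#| fig-cap: "Body mass by species"
#| cache: true
penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_boxplot()
```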

echo: false hides the code and shows only the output. eval: false shows the code but does not run it. fig-cap adds a caption. cache: true caches results so unchanged chunks do not re-run. These options give you fine control over what the reader sees versus what runs behind the scenes.

For academic writing, Quarto supports citations natively. Place a .bib file in your project and reference entries with @key:

---
title: "My Analysis"
bibliography: references.bib
---

As shown by @wickham2019, tidy data principles simplify analysis.

The citation is rendered in the output and a reference list is appended automatically, so there are no more manually typed reference sections drifting from the actual citations. Quarto keeps code and prose in sync within a single document; but when your analysis spans multiple scripts with complex dependencies, a different tool keeps the steps themselves in sync.

Exercises

  1. Create a .qmd file with a title, a code chunk that loads a dataset, and a code chunk that makes a plot. Render it to HTML. Change the data and re-render.
  2. Add echo: false to a code chunk. What changes in the rendered output?

33.5 targets: pipeline automation

Your analysis has five steps: clean data, fit model, run diagnostics, make figures, render report. You run them by hand, in order, every time. One day you change the model but forget to re-run the diagnostics. The figures now describe a model that no longer exists. How long before you notice?

targets prevents this by making you declare the pipeline: each step (target) is a function call with explicit inputs, and targets tracks what changed so it only re-runs what needs re-running.

# _targets.R
library(targets)

tar_option_set(packages = c("dplyr", "ggplot2"))

list(
  tar_target(raw_data, read.csv(here::here("data", "penguins.csv"))),
  tar_target(clean_data, clean_penguins(raw_data)),
  tar_target(model, lm(body_mass_g ~ species, data = clean_data)),
  tar_target(fig, plot_results(model))
)

tar_make() runs the pipeline. tar_visnetwork() visualizes the dependency graph. tar_read(model) retrieves a cached result.

Each target is a function call. Functions are the unit of computation (Chapter 7); targets are the unit of caching. If raw_data has not changed, clean_data does not re-run. If you change plot_results(), only fig re-runs.

The caching works precisely because the functions are pure: given the same inputs, clean_penguins() returns the same output, so targets can skip it when the inputs have not changed. A function with side effects (one that reads from a database, modifies a global variable, or depends on the current time) would break the caching, because the same inputs would no longer guarantee the same output. targets makes the functional discipline from Chapter 18 pay off at the project level: pure functions compose into pipelines that targets can reason about, cache, and selectively re-execute.
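A sketch of the distinction, reusing clean_penguins() from the pipeline (stamp_data() is a hypothetical impure counterpart, not part of the pipeline above):

```r
# Pure: the output depends only on the input, so targets can safely
# skip this step whenever the input is unchanged
clean_penguins <- function(data) {
  data[!is.na(data$body_mass_g), ]
}

# Impure: the output also depends on the clock, so identical inputs
# no longer guarantee identical outputs -- caching breaks down
stamp_data <- function(data) {
  data$run_at <- Sys.time()
  data
}

df <- data.frame(species = c("Adelie", "Gentoo", "Adelie"),
                 body_mass_g = c(3750, NA, 3800))
nrow(clean_penguins(df))   # always 2, run after run
#> [1] 2
```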

Not every project needs targets. A single Quarto document handles simple analyses perfectly well. Reach for targets when you have long-running steps, many interdependent outputs, or pipelines that change frequently.

The real power shows when your pipeline is expensive: if a model takes an hour to fit, you do not want to re-fit it every time you tweak a color on a plot. targets caches the model result and only re-fits when the model code or its inputs change, so the plot target re-runs in seconds because it reads the cached model. tar_visnetwork() produces a dependency graph showing which targets are up to date (green), outdated (blue), or errored (red), which makes debugging complex pipelines far easier than staring at a collection of numbered scripts.

Opinion

targets is overkill for a homework assignment and necessary for a thesis chapter. Know where your project falls.

One design principle worth internalizing: each target should be defined by a function, not raw code. Instead of writing the transformation inline, write a function in R/ and call it from the target. This makes targets testable (you can test the function independently) and readable (the pipeline reads like a table of contents):

# R/functions.R
clean_penguins <- function(data) {
  data[!is.na(data$body_mass_g), ]
}

plot_results <- function(model) {
  plot(model, which = 1)
}

# _targets.R
source("R/functions.R")
list(
  tar_target(raw, read.csv(here::here("data", "penguins.csv"))),
  tar_target(clean, clean_penguins(raw)),
  tar_target(model, lm(body_mass_g ~ species, data = clean)),
  tar_target(fig, plot_results(model))
)

Exercises

  1. Define a small targets pipeline with three steps: read data, compute a summary, make a plot. Run it with tar_make(). Change the summary function and run tar_make() again. Observe which targets re-run and which are skipped.
  2. Run tar_visnetwork() on your pipeline. What do the colors mean?

33.6 Version control with Git

You have a working analysis. You try a new approach, and it breaks everything. You hit Ctrl+Z forty times, but the file is not quite back to where it was, and you are not sure which of the six files you changed. If only you had saved a snapshot before experimenting.

Git is that snapshot system. Every change is recorded as a commit: a frozen image of your project at one moment, with a message describing what changed and why. You can go back to any previous snapshot, compare any two, and branch your work into parallel lines that merge back together.

Why Git for data analysis:

  • Undo mistakes without losing work. Every commit is a checkpoint you can return to. Accidentally deleted a function? git checkout -- file.R brings it back. Broke your analysis and cannot figure out what changed? git diff shows exactly which lines differ from the last working state.
  • Track the evolution of your analysis. Three months from now, a reviewer asks why you removed a covariate. git log shows the commit where you removed it, with a message explaining the decision. Without Git, that reasoning is lost.
  • Collaborate without emailing files back and forth. Multiple people can work on the same project, each on their own branch, and Git merges their changes. Conflicts (two people editing the same line) are surfaced explicitly, not silently overwritten.
  • Share and cite. GitHub and GitLab give your project a permanent URL. Zenodo can mint a DOI for a specific release, making your code citable in papers.

The daily workflow

Git has many commands. You need six:

git status                              # what has changed?
git add file.R                          # stage a file for the next commit
git commit -m "Add bootstrap analysis"  # record the snapshot
git log --oneline                       # see recent history
git diff                                # see unstaged changes line by line
git push                                # send commits to GitHub

The mental model: you make changes to files, git add selects which changes to include in the next snapshot, git commit takes the snapshot with a message, and git push sends it to a remote server. git pull does the reverse, fetching commits from the remote and incorporating them into your local copy.

RStudio has a built-in Git pane that shows modified files, lets you stage changes, write commit messages, and push to GitHub without touching the terminal. For initial setup, usethis::use_git() initializes a repository, and usethis::use_github() creates a remote on GitHub and pushes your code in one step.

Branching

Branches let you try something without risking the main line of work:

git branch try-new-model     # create a branch
git checkout try-new-model   # switch to it
# ... make changes, commit ...
git checkout main            # switch back
git merge try-new-model      # incorporate the branch's changes

If the experiment works, merge it. If it does not, delete the branch; the main branch is untouched either way.

What belongs in Git

Commit: code, documentation, renv.lock, _targets.R, .gitignore, small data files (under a few MB), Quarto source files.

Do not commit: large data files, generated output (figures, HTML reports, _targets/ cache), secrets and passwords, .Rhistory, .RData, .DS_Store.

.gitignore tells Git what to skip. A good starting point for R projects:

.Rhistory
.RData
.Rproj.user
_targets/
docs/
*.html
*.pdf

usethis::use_git_ignore() helps set it up.

Commit messages

Write commit messages that describe why you changed something, not what you changed. Git already records every added and deleted line; your message adds the reasoning that the diff cannot show.

  • Good: “Remove species interaction term (AIC worse by 4.2).” Bad: “Update analysis.R.”
  • Good: “Fix off-by-one in bootstrap loop causing n+1 resamples.” Bad: “Bug fix.”

Your future self will search through git log when something breaks, and those messages are the only documentation of your decision-making process.

Going further

This section covers enough Git to track your work, collaborate, and recover from mistakes. It does not cover rebasing, cherry-picking, bisecting, or other advanced operations. Jenny Bryan’s Happy Git and GitHub for the useR (happygitwithr.com) is the best reference for R users and covers installation, SSH keys, merge conflicts, and common workflows in detail.

33.7 set.seed() and session info

Any analysis involving randomness (simulation, resampling, train/test splits) needs set.seed():

set.seed(42)
sample(1:100, 5)
#> [1] 49 65 25 74 18

Without set.seed(), every run produces different numbers. With it, the same seed always produces the same sequence. Place it at the top of your script or Quarto document; the specific number does not matter (42, 123, 2024, anything), but it must be fixed and documented.

One subtlety: R 3.6.0 changed the default algorithm behind sample(), so the same seed can yield different draws before and after that version. If exact reproducibility across R versions matters, pin the generator explicitly with set.seed(42, sample.kind = "Rejection") (the post-3.6 default), or sample.kind = "Rounding" to match pre-3.6 output.
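One way to make the generator configuration explicit and verifiable in code (set.seed() accepts a sample.kind argument from R 3.6.0 onward):

```r
set.seed(42, sample.kind = "Rejection")   # pin the post-3.6 sample() algorithm
RNGkind()   # reports the generators in effect: kind, normal.kind, sample.kind
#> [1] "Mersenne-Twister" "Inversion"        "Rejection"

set.seed(42)
a <- sample(1:100, 5)
set.seed(42)
b <- sample(1:100, 5)
identical(a, b)   # same seed, same sequence
#> [1] TRUE
```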

At the end of reports, record your environment:

sessionInfo()
# or
sessioninfo::session_info()

This captures R version, OS, and package versions. If results differ on another machine, the session info is the first place to look. Small practices, both of them, but they close gaps that the larger tools leave open.

Exercises

  1. Run the same sample() call twice without set.seed(). Do you get the same result? Now add set.seed(123) before each call. What changes?
  2. Run sessionInfo() and find: your R version, your operating system, and the version of a package you use frequently.

33.8 The reproducibility stack

Here is when to reach for each tool:

  1. RStudio projects + here: always. Zero cost, immediate benefit.
  2. Git: always, for anything beyond a one-off exploration.
  3. renv: for any project that needs to produce the same results later.
  4. Quarto: for any analysis that produces a report or document.
  5. targets: for multi-step pipelines with long-running computations.
  6. Docker: for complete environment reproducibility (R version, system libraries). Beyond this book, but worth knowing it exists.

You do not need all of these for every project. Start with projects + here + Git, add renv when versions matter, add Quarto when you write reports, add targets when your pipeline grows.

A decision guide:

If your project…                     You need
Has any R code at all                RStudio project + here
Will exist for more than a week      Git
Uses packages that update            renv
Produces a report or paper           Quarto
Has steps that take >30 seconds      targets
Must run on a different OS           Docker or Nix

Each layer solves one failure mode. Together, they make “works on my machine” a non-issue.

The common thread across all of them is automation: here automates path construction, renv automates dependency tracking, Quarto automates the link between code and output, targets automates pipeline execution, and Git automates change tracking. Each requires a one-time setup cost and then runs automatically.

33.9 Beyond renv: full-stack reproducibility

renv captures R package versions. It does not capture R itself, system libraries (libcurl, GDAL, libxml2), or the C compiler that built them. When your code depends on any of these (and spatial code, for instance, almost always depends on GDAL), renv alone leaves a gap.

Docker fills it by bundling everything into a container: OS, R, system libraries, packages, your code. The Rocker project (rocker-project.org) provides pre-built R images:

# Dockerfile
FROM rocker/r-ver:4.4.0
RUN install2.r dplyr ggplot2
COPY . /analysis
CMD ["Rscript", "analysis/main.R"]

A Dockerfile is a recipe that always produces the same environment. Build once, run anywhere. The trade-off is complexity: Docker adds a layer of tooling (images, containers, registries) that takes time to learn.

Nix takes a different approach. Where Docker isolates by containerization, Nix isolates by functional purity: the Nix package manager treats every package as a pure function from its inputs (source code, dependencies, compiler flags) to its output (the built artifact). Fix all inputs, get a deterministic output. This is referential transparency applied to software builds, the same principle that makes pure functions predictable in R: same inputs, same output, no hidden state.

The rix R package makes Nix accessible from R:

library(rix)

rix(
  r_ver = "4.4.0",
  r_pkgs = c("dplyr", "ggplot2", "palmerpenguins"),
  system_pkgs = NULL,
  ide = "rstudio",
  project_path = "."
)

This generates a default.nix file that pins everything: R version, package versions, system libraries, even the C compiler. Running nix-build on any machine with Nix installed produces an identical environment. Where renv.lock captures one layer (R packages), default.nix captures all of them.

The practical trade-off: Nix has a steep learning curve and limited Windows support (it runs natively on Linux and macOS, or via WSL on Windows), while Docker is more widely adopted and better supported across platforms. Both achieve full-stack reproducibility; they differ in mechanism and philosophy. Docker says “ship the whole machine.” Nix says “describe the machine as a function and let anyone rebuild it.”

For most R users, renv is sufficient. Add Docker or Nix when your results depend on system-level components, when you need to guarantee reproducibility across operating systems, or when a collaborator reports “it doesn’t work on my machine” and the problem turns out to be something no R package can fix. Which layer you pin depends on how far the reproducibility guarantee needs to reach.