26  Metaprogramming

You have been writing lm(y ~ x) since Chapter 17, and filter(penguins, species == "Adelie") since Chapter 14. Both do something that might seem impossible: y ~ x is not evaluated before lm() sees it, and species is found inside a data frame, not in your environment. Normally, 1 + 1 evaluates to 2 and the expression disappears. R can keep it. Code in R is a data structure you can hold, examine, rearrange, and run whenever you choose.

That difference has a name: metaprogramming, code that operates on code. You have been using it since your first formula, your first call to aes(), your first dplyr pipeline. This chapter shows you the mechanism behind all of it.

26.1 Code is data

quote() captures an expression without evaluating it:

quote(x + 1)
#> x + 1

No addition was performed. What you got back is the expression itself: x + 1 was not computed but returned as a data structure (a call object) that you can store, inspect, and eventually evaluate. If that sounds strange, think of it through lambda calculus: (λx. x + 1)(3) reduces to 4, but if you “quote” the term you get the syntactic object (λx. x + 1)(3), something you can pull apart and study without triggering the reduction. quote(1 + 1) in R gives you the expression tree, not the value 2.

e <- quote(x + 1)
typeof(e)
#> [1] "language"
class(e)
#> [1] "call"

The rlang package provides expr(), which does the same thing:

rlang::expr(x + 1)
#> x + 1

To run a captured expression, use eval():

x <- 10
eval(e)
#> [1] 11

eval() takes the frozen expression and evaluates it in the current environment, where x is 10, so the result is 11. The cycle is always the same: capture code, possibly modify it, then evaluate it somewhere. But why would you want to?

In Section 1.2, Church’s lambda calculus treated functions as data that could be passed around and applied. R extends that principle to all code, not only functions. An expression is a value, exactly like a number or a string; you can assign it, put it in a list, pass it to a function, transform it, then run the transformed version. In Section 7.5, you saw that + is a function and 2 + 3 is a function call. That function call is also a data structure you can capture and manipulate before it ever executes.
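To make that concrete, here is a minimal base R sketch: expressions stored in a list like any other values, then evaluated one by one.

```r
# Expressions are ordinary values: collect several, then evaluate each.
es <- list(quote(1 + 2), quote(max(3, 7)), quote(nchar("abc")))
sapply(es, eval)
#> [1] 3 7 3
```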

R inherited this idea from Lisp, where (quote (+ 1 2)) returns the list (+ 1 2) instead of computing 3. A language whose code has the same structure as its data is called homoiconic: Lisp code is lists, Lisp data is lists; R code is call objects (trees), and R data includes call objects. Python, Java, and C++ are not homoiconic — their code is text, and metaprogramming means parsing strings, which is fragile. R’s homoiconicity means it gets this for free.

Python’s inspect.getsource() returns a string you then have to parse; Java’s reflection is a separate API with its own rules and sharp edges. R skips both steps. quote(x + 1) gives you the syntax tree directly — a call object you can subset with [[, modify with replacement, and evaluate in any environment you choose. That tree is an ordinary R object. It lives in memory, responds to R’s standard operations, and you can pass it to functions like any vector or list. Tidy evaluation, formula interfaces, and ggplot2’s aes() all rest on that foundation.

26.2 Abstract syntax trees

Every R expression has a tree structure called an abstract syntax tree (AST). The lobstr package makes them visible:

lobstr::ast(x + y * 2)
#> █─`+` 
#> ├─x 
#> └─█─`*` 
#>   ├─y 
#>   └─2

What looks like a flat sequence of tokens on the page is actually a tree where + sits at the root, x hangs off one branch, and * (y, 2) hangs off the other. The tree encodes precedence: * binds tighter than +, so y * 2 forms a subtree nested inside the + node, and R knows to evaluate it first without needing parentheses from you.
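Parentheses, when you do write them, are recorded too: `(` shows up as a call in its own right. A sketch of the same expression with the grouping reversed (assumes lobstr is installed):

```r
lobstr::ast((x + y) * 2)
#> █─`*` 
#> ├─█─`(` 
#> │ └─█─`+` 
#> │   ├─x 
#> │   └─y 
#> └─2
```

Now `*` is the root and the parenthesized sum is a subtree, which is exactly how R knows to add first.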

Three kinds of nodes make up every AST:

  • Constants: 1, "hello", TRUE. These are leaves. They have no children.
  • Symbols (names): x, y, mean. Also leaves. They represent name lookups: when evaluated, R searches for the value bound to that name.
  • Calls: function applications. These are branches. The first child is the function, the rest are arguments.

That is the entire vocabulary. x + 1 is a call to `+` with arguments x and 1. if (x > 0) "yes" else "no" is a call to `if` with three arguments. There is no special syntax that escapes the tree; even control flow is just function calls wearing different clothes.

lobstr::ast(if (x > 0) "yes" else "no")
#> █─`if` 
#> ├─█─`>` 
#> │ ├─x 
#> │ └─0 
#> ├─"yes" 
#> └─"no"
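You can even call `if` with ordinary function syntax, which makes the point vividly:

```r
# Control flow really is a function call: `if` invoked like any function.
`if`(1 > 0, "yes", "no")
#> [1] "yes"
```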

You can take a call object apart with standard list operations, where the first element is the function and the rest are the arguments:

e <- quote(mean(x, na.rm = TRUE))
e[[1]]
#> mean
e[[2]]
#> x
e[[3]]
#> [1] TRUE

as.list() converts the whole call into a list, which makes the structure easy to see at a glance:

as.list(e)
#> [[1]]
#> mean
#> 
#> [[2]]
#> x
#> 
#> $na.rm
#> [1] TRUE
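Subsetting works for writing as well as reading, so you can rewrite a call in place. A small sketch:

```r
# Swap the function at position 1, then evaluate the modified call.
e <- quote(mean(x, na.rm = TRUE))
e[[1]] <- quote(median)
e
#> median(x, na.rm = TRUE)
x <- c(1, 5, 100)
eval(e)
#> [1] 5
```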

You can convert between text and expressions with parse() and deparse():

deparse(quote(x + y * 2))
#> [1] "x + y * 2"
parse(text = "x + y * 2")[[1]]
#> x + y * 2

The conversion is not perfectly symmetric; comments and whitespace vanish in the round trip. But the tree structure survives, and the tree is what matters. So what happens when you need to capture not your own expression but someone else’s?
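You can verify the asymmetry directly; the comment is gone after one round trip:

```r
# parse() keeps the tree; comments and spacing are not part of it.
e <- parse(text = "x + 1  # add one")[[1]]
deparse(e)
#> [1] "x + 1"
```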

Exercises

  1. Draw the AST for f(a, g(b, c)) on paper. Then check your answer with lobstr::ast().
  2. What does lobstr::ast(1 + 2 + 3) look like? Is + left-associative or right-associative?
  3. Use lobstr::ast() to visualize x[1]. What function is at the root?

26.3 Capturing and evaluating expressions

Capturing your own expression and capturing the caller’s expression require different tools, and confusing the two is a reliable source of bugs.

quote() and expr() capture what you type directly:

quote(a + b)
#> a + b
rlang::expr(a + b)
#> a + b

substitute() captures what the caller passed:

f <- function(x) substitute(x)
f(a + b)
#> a + b

You called f(a + b), and substitute(x) reached back across the function boundary to the call site, grabbing a + b, the expression the caller actually wrote. The same mechanism powers dplyr::filter(): it sees species == "Adelie" as an expression rather than immediately evaluating it and getting an error about a missing variable.
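A side-by-side contrast makes the difference hard to miss (both is a throwaway helper defined only for this demonstration):

```r
# quote(x) captures the literal symbol x; substitute(x) captures
# what the caller actually supplied for x.
both <- function(x) list(quoted = quote(x), substituted = substitute(x))
both(a + b)
#> $quoted
#> x
#> 
#> $substituted
#> a + b
```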

The rlang equivalents are enexpr() (captures an expression) and enquo() (captures an expression plus its environment, forming a quosure). The decision rule is simple: use substitute() in base R code, use enexpr()/enquo() in tidyverse/rlang code. If your function will be called by dplyr or other tidy-evaluation-aware code, reach for the rlang tools because they integrate with !!, {{ }}, and eval_tidy(). If you are writing standalone base R, substitute() and eval() are sufficient and carry zero dependencies.

g <- function(x) rlang::enexpr(x)
g(a + b)
#> a + b

Once you have an expression, you evaluate it with eval(). The second argument controls where:

e <- quote(x + 1)
eval(e, list(x = 10))
#> [1] 11
eval(e, list(x = 100))
#> [1] 101

The same frozen expression, evaluated in different environments, gives different results. Data masking is exactly that trick: eval_tidy() from rlang evaluates an expression against a data frame, so filter(penguins, species == "Adelie") looks up species in the data rather than in the global environment.

library(rlang)
df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
eval_tidy(expr(x + y), data = df)
#> [1] 11 22 33

Tidy evaluation calls this pattern defuse-and-inject: capture, optionally modify, then evaluate in the right context.

One more base R tool worth knowing: match.call(). Inside a function, it returns the entire call as the user typed it, with arguments matched by name:

my_lm <- function(formula, data, subset = NULL) {
  match.call()
}
my_lm(y ~ x, data = mtcars)
#> my_lm(formula = y ~ x, data = mtcars)

Many modeling functions use match.call() to record the call for reproducibility. When you print a fitted model and see Call: lm(formula = y ~ x, data = mtcars), that string came from match.call() stashing the original invocation.
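You can see the stored call on any fitted model, and update() shows why keeping it around is useful. A quick sketch with mtcars:

```r
fit <- lm(mpg ~ wt, data = mtcars)
fit$call
#> lm(formula = mpg ~ wt, data = mtcars)

# update() edits the stored call and re-runs it:
update(fit, . ~ . + hp)$call
#> lm(formula = mpg ~ wt + hp, data = mtcars)
```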

Exercises

  1. Write a function show_code that takes an argument and prints the expression the caller passed (use substitute() and deparse()). Test: show_code(mean(x, na.rm = TRUE)) should print "mean(x, na.rm = TRUE)".
  2. Evaluate quote(x * 2) in an environment where x = 7. Then evaluate it where x = -3.
  3. Write a function that uses match.call() to return its own call. Call it with several arguments and observe the output.

26.4 Building expressions programmatically

What if the function name or the variable is not known until runtime, when your code has to decide at the last moment what expression to build? You need to construct the expression from parts.

rlang::call2() constructs a call object:

rlang::call2("+", 1, 2)
#> 1 + 2
eval(rlang::call2("+", 1, 2))
#> [1] 3
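Arguments to call2() need not be constants; symbols and named arguments work too. A minimal sketch, assuming rlang is installed:

```r
# Build mean(x, na.rm = TRUE) from parts, then evaluate it.
x_sym <- rlang::sym("x")
cl <- rlang::call2("mean", x_sym, na.rm = TRUE)
cl
#> mean(x, na.rm = TRUE)
x <- c(1, 2, NA)
eval(cl)
#> [1] 1.5
```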

The next step is quasiquotation: building an expression template with holes that get filled in at construction time. The !! (bang-bang) operator injects a value into an expression:

my_var <- rlang::expr(body_mass_g)
rlang::expr(mean(!!my_var))
#> mean(body_mass_g)

!! replaced my_var with its value (body_mass_g), producing the expression mean(body_mass_g). Without !!, you would get mean(my_var), which is an entirely different expression and not the one you wanted.

!!! (triple bang, or splice) injects a list of expressions as separate arguments:

vars <- rlang::exprs(species, island)
rlang::expr(group_by(penguins, !!!vars))
#> group_by(penguins, species, island)

This is what {{ }} (embrace) from tidy evaluation does under the hood. When you write a function like:

my_summary <- function(data, var) {
  data |> dplyr::summarise(mean = mean({{ var }}, na.rm = TRUE))
}

my_summary(palmerpenguins::penguins, body_mass_g)
#> # A tibble: 1 × 1
#>    mean
#>   <dbl>
#> 1 4202.
my_summary(palmerpenguins::penguins, flipper_length_mm)
#> # A tibble: 1 × 1
#>    mean
#>   <dbl>
#> 1  201.

the embrace operator defuses var with enquo() and injects it with !!, which means it is syntactic sugar for the defuse-and-inject pattern. The caller writes bare column names, exactly as they would with dplyr directly, and your function forwards them transparently, no quoted strings, no special syntax at the call site.

Base R has bquote() for quasiquotation, using .() instead of !!:

my_var <- quote(body_mass_g)
bquote(mean(.(my_var)))
#> mean(body_mass_g)

It works, but bquote() is less common in practice, and it only gained splicing in R 4.0 (via ..() with splice = TRUE). All of these tools (!!, bquote(), expr()) solve the same problem that Lisp macros have addressed since the 1960s: how to write code that writes code, safely, without collapsing into string manipulation. When Wickham designed tidy evaluation, he was adapting that lineage for R, and the result is that you can generate expressions with the same structural guarantees that quote() gives you by hand.
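For completeness, a sketch of bquote() splicing with ..(), which assumes R 4.0 or later:

```r
# Splice a computed list of symbols into a call template.
vars <- list(quote(species), quote(island))
bquote(group_by(penguins, ..(vars)), splice = TRUE)
#> group_by(penguins, species, island)
```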

Exercises

  1. Use rlang::call2() to build the expression sqrt(16), then evaluate it.
  2. Create a variable col <- rlang::expr(bill_length_mm). Use !! to build the expression mean(bill_length_mm, na.rm = TRUE).
  3. Given fns <- rlang::exprs(mean, sd, median), use lapply() and call2() to build three expressions: mean(x), sd(x), median(x).

26.5 Formulas as expressions

Before anyone used the word “metaprogramming” in an R context, there were formulas. When you write:

lm(body_mass_g ~ bill_length_mm, data = palmerpenguins::penguins)
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm, data = palmerpenguins::penguins)
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm  
#>         362.31           87.42

the expression body_mass_g ~ bill_length_mm is not evaluated in the ordinary sense. R captures it as a formula object: two expressions (the left-hand side and the right-hand side) bundled together with the environment where the formula was created.

f <- y ~ x + z
typeof(f)
#> [1] "language"
length(f)
#> [1] 3
f[[2]]
#> y
f[[3]]
#> x + z

A formula stores its terms as call objects. f[[2]] is the left-hand side (y), f[[3]] is the right-hand side (x + z). The formula also carries an environment attribute:

environment(f)
#> <environment: R_GlobalEnv>

Formulas work across function boundaries because the formula remembers where its variables should be looked up. Quosures in tidy evaluation solve the same problem: an expression bundled with its environment. Wickham borrowed the concept directly from formulas, and their 30-year track record in R’s modeling ecosystem is what made the design credible.
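A quosure behaves just like a formula in this respect: it remembers where it was created. A minimal sketch, assuming rlang is installed (make_q is a throwaway helper):

```r
# The quosure evaluates x in the environment where it was made,
# not where eval_tidy() is called.
make_q <- function() {
  x <- 100
  rlang::quo(x + 1)
}
q <- make_q()
rlang::eval_tidy(q)
#> [1] 101
```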

The formula language (+, *, :, -, I()) is a domain-specific language for specifying models. y ~ x1 * x2 does not mean “multiply x1 by x2.” It means “include x1, x2, and their interaction.” model.matrix() interprets these operators to build the design matrix:

model.matrix(~ species + island, data = palmerpenguins::penguins) |> head()
#>   (Intercept) speciesChinstrap speciesGentoo islandDream islandTorgersen
#> 1           1                0             0           0               1
#> 2           1                0             0           0               1
#> 3           1                0             0           0               1
#> 4           1                0             0           0               1
#> 5           1                0             0           0               1
#> 6           1                0             0           0               1

You can also build formulas dynamically, which becomes useful when the set of predictors is not known in advance:

predictors <- c("bill_length_mm", "flipper_length_mm")
f <- as.formula(paste("body_mass_g ~", paste(predictors, collapse = " + ")))
f
#> body_mass_g ~ bill_length_mm + flipper_length_mm

Constructing code from data, then running it: that is metaprogramming at its most practical. The formula is a piece of code assembled from strings, about to be interpreted by lm() as a model specification, and neither the user nor the modeling function needs to know it was built programmatically.
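Base R also provides reformulate(), which assembles the same formula without manual paste(). A sketch:

```r
# reformulate(termlabels, response) builds a formula from strings.
predictors <- c("bill_length_mm", "flipper_length_mm")
reformulate(predictors, response = "body_mass_g")
#> body_mass_g ~ bill_length_mm + flipper_length_mm
```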

The connection to Chapter 23 is direct. Formulas are non-standard evaluation: the variable names are unquoted, and body_mass_g is looked up in penguins rather than in the calling environment. dplyr uses the same trick, just with newer machinery.

Exercises

  1. Create a formula y ~ x1 + x2 and extract its right-hand side.
  2. Build a formula programmatically: given response <- "mpg" and predictors <- c("wt", "hp"), construct the formula mpg ~ wt + hp using as.formula() and paste(). Pass it to lm() with the mtcars dataset.

26.6 When to use metaprogramming

Metaprogramming lets you generate code, build DSLs, and eliminate boilerplate. That power comes with a cost.

Good uses:

  • Domain-specific languages. Model formulas, ggplot2 aesthetics, dplyr pipelines. These are all DSLs built on metaprogramming. If you are building a package with an interactive interface that benefits from concise, expressive syntax, metaprogramming is the right tool.
  • Code generation. Building model specifications programmatically, creating batches of test cases, generating reports from templates.
  • Debugging and inspection. substitute() to see what was passed, match.call() to record the exact call for reproducibility.

Bad uses:

  • Things that work with regular functions. If f(x) solves the problem, do not complicate it with eval(substitute(...)).
  • Performance. Metaprogramming is not faster. It is more flexible, which is a different axis entirely.
  • Showing off. Code that manipulates code is hard to read and hard to debug. Use it when the benefit (a concise user interface) outweighs the cost (a complex implementation), not before.
Tip: Opinion

Most R users consume metaprogramming (by using dplyr, ggplot2, formulas) and very few need to produce it. Package authors building interactive interfaces may need these tools; analysts writing data pipelines almost certainly do not. The test: does your user-facing API become meaningfully better with non-standard evaluation? If yes, the complexity is worth absorbing. Saving yourself a quoted string is not enough reason.

26.7 The metaprogramming toolkit

For further study:

  • Tidy evaluation (rlang): expr(), enquo(), eval_tidy(), !!, !!!, {{ }}. The system behind dplyr, tidyr, and ggplot2. Covered in depth in Advanced R, chapters 17 through 21.
  • Base R tools: quote(), substitute(), eval(), match.call(), sys.call(), bquote(). These predate rlang and still power much of R’s infrastructure. Every R programmer should know substitute() and eval().
  • Inspection packages: lobstr for AST visualization, pryr for probing environments and function internals.
  • Further reading: Wickham’s Advanced R (2nd edition) devotes five chapters to metaprogramming. Mailund’s Metaprogramming in R (Springer, 2017) is a book-length treatment covering domain-specific languages and code generation.

Exercises

  1. Look at the source of dplyr::filter (type dplyr:::filter.data.frame at the console; the method is not exported, hence the triple colon). Can you spot where it captures the user’s expressions?
  2. Compare quote(), rlang::expr(), substitute(), and rlang::enexpr(). Write one sentence describing when you would use each.