27  Building a DSL

R is a language that lets you build languages. You have seen the pieces: quote() captures code as data (Section 26.1), substitute() captures the caller’s code (Section 26.3), S3 dispatch selects behavior by type (Section 24.1), and operators are just functions (Section 7.5). This chapter puts those pieces together. You will dissect DSLs you already use, then build one from scratch.

A domain-specific language (DSL) is a small language designed for one task. SQL is a DSL for querying data. Regular expressions are a DSL for matching text. HTML is a DSL for structuring documents. These are external DSLs: they have their own parsers, their own syntax, their own tooling.

R excels at internal DSLs: mini-languages embedded inside R itself, using R’s own syntax but giving it new meaning. The formula interface, ggplot2’s grammar of graphics, dplyr’s verb chains, data.table’s [i, j, by]: each is a DSL that lives inside R and plays by R’s rules (mostly). What makes R unusually good at this is the combination of non-standard evaluation (NSE), operator overloading, and S3 dispatch: you can capture user expressions without evaluating them, reinterpret what operators mean, and select behavior by class. Many mainstream languages cannot do all three, or not without painful workarounds.

Peter Landin described this idea in 1966: instead of forcing programmers to think in the language’s terms, let them build “a language appropriate to the problem.” Guy Steele’s 1998 keynote “Growing a Language” made the same case from the implementor’s side: a good language is one that users can extend with new vocabulary, so that the language grows to fit the domain rather than the other way around. R’s formula interface was inherited from the S language and predates Steele’s talk, but the principle is the same.

27.1 R’s existing DSLs

Before building a DSL, study the ones you already use. Each takes a different approach.

Formulas. y ~ x1 + x2 is R’s oldest DSL. The ~ operator does not compute anything; it creates a formula object that stores two expressions (the left- and right-hand sides) plus the environment where it was created. Inside the formula, + does not mean addition; it means “include this term.” * means “include main effects and their interaction.” : means “interaction only.” This is code that means something completely different from its usual meaning, interpreted not by R’s evaluator but by modeling functions such as terms() and model.matrix(). You explored this in Section 26.5.
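
To make this concrete, the following base R sketch takes a formula apart. The variable names are arbitrary, and no data is needed because the formula is never evaluated.

```r
# A formula is an unevaluated call to `~` plus an environment.
f <- y ~ x1 * x2

class(f)        # "formula"
f[[1]]          # the `~` symbol
f[[2]]          # left-hand side: y
f[[3]]          # right-hand side: x1 * x2, still unevaluated
environment(f)  # the environment where the formula was created

# terms() is one place where `*` gets its modeling meaning:
# main effects plus their interaction.
term_labels <- attr(terms(f), "term.labels")
term_labels
#> [1] "x1"    "x2"    "x1:x2"
```

Neither y, x1, nor x2 exists anywhere, yet nothing errors: the operators are only ever read as structure.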

ggplot2. The + operator for ggplot objects does not add numbers. It layers graphical components:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

aes() captures column names as expressions without evaluating them. +.gg (the S3 method for + on ggplot objects) takes the existing plot and attaches the new layer. The plot is built incrementally, one + at a time, accumulating a list of layers, scales, and coordinates. This is an expression builder pattern: each + returns a new (or modified) plot object, and the final object is rendered only when printed.

dplyr. filter(penguins, species == "Adelie") captures species == "Adelie" as an expression and evaluates it against the data frame using data masking (Section 26.3). The pipe operator chains verbs together, and each verb does one thing: filter rows, select columns, create new columns, summarize groups. The verbs are S3 generics, so they dispatch on the data source. filter.data.frame works on data frames; filter.tbl_lazy generates SQL. Same syntax, different backend.

data.table. DT[i, j, by] overloads the [ operator. All three arguments are captured by substitute() and evaluated in the data.table’s scope. There is no pipe, no chain of function calls. Everything happens inside [, which is both concise and dense.
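
The mechanism can be imitated in a few lines of base R. The sketch below is not data.table’s implementation; as_masked() and the masked_df class are hypothetical names, and only the row argument i is handled.

```r
# Hypothetical wrapper class: a data frame whose `[` evaluates its
# first argument inside the data, loosely in the style of data.table.
as_masked <- function(df) {
  class(df) <- c("masked_df", class(df))
  df
}

"[.masked_df" <- function(x, i, ...) {
  # Drop our class first so the subsetting below uses the ordinary
  # data frame method instead of calling this method again.
  plain <- x
  class(plain) <- setdiff(class(x), "masked_df")
  if (missing(i)) return(plain)

  i_expr <- substitute(i)                      # capture, e.g. mpg > 30
  rows <- eval(i_expr, plain, parent.frame())  # columns first, then caller
  plain[rows, , drop = FALSE]
}

cars <- as_masked(mtcars)
fast <- cars[mpg > 30, ]
nrow(fast)
#> [1] 4
```

The call cars[mpg > 30, ] works even though no variable named mpg exists in the calling environment, because [ dispatches on its first argument and leaves the others unevaluated until the method asks for them.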

Each DSL makes a different trade-off. Formulas redefine arithmetic operators inside a special object. ggplot2 overloads + on a class. dplyr uses data masking and generic dispatch. data.table overloads [. The technique varies; the principle is the same: capture code, give it new meaning.

Exercises

  1. Type ggplot2::aes at the console (no parentheses) and read the source. Where does it capture the user’s expressions? What function does it use?
  2. Look at the source of ggplot2:::"+.gg". What does the + operator actually do to a ggplot object?
  3. Compare filter(df, x > 5) to df[df$x > 5, ]. What does dplyr’s version gain from NSE? What does it cost?

27.2 How aes() works

aes() is worth dissecting in detail because it demonstrates the full pattern: capture expressions, store them, evaluate them later in the right context.

mapping <- ggplot2::aes(x = wt, y = mpg, color = cyl)
mapping
#> Aesthetic mapping: 
#> * `x`      -> `wt`
#> * `y`      -> `mpg`
#> * `colour` -> `cyl`

mapping is a list of quosures. Each quosure pairs an expression (wt, mpg, cyl) with the environment where it was written. The column names were never evaluated; they were captured and stored for later.

mapping$x
#> <quosure>
#> expr: ^wt
#> env:  global
class(mapping$x)
#> [1] "quosure" "formula"
rlang::get_expr(mapping$x)
#> wt

When ggplot2 eventually builds the plot, it evaluates these quosures against the data frame. wt is looked up in mtcars, not in the global environment. The separation of capture and evaluation is the mechanism that makes the whole system work.
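
The same capture-then-evaluate split can be sketched in base R without quosures. This is a simplified illustration, not how ggplot2 is implemented; toy_aes() and resolve_mapping() are hypothetical names.

```r
# Phase 1: capture. Store each argument as an expression, plus the
# environment of the caller, without evaluating anything.
toy_aes <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]
  structure(list(exprs = exprs, env = parent.frame()), class = "toy_mapping")
}

# Phase 2: evaluate. Look names up in the data first, then fall back
# to the environment where the mapping was written.
resolve_mapping <- function(mapping, data) {
  lapply(mapping$exprs, function(e) eval(e, data, mapping$env))
}

scale_factor <- 2
m <- toy_aes(x = wt, y = mpg / scale_factor)
is.call(m$exprs$y)   # still an unevaluated call
#> [1] TRUE

vals <- resolve_mapping(m, mtcars)
head(vals$y, 3)
#> [1] 10.5 10.5 11.4
```

Here mpg is found in mtcars while scale_factor is found in the stored environment, which is the reason a mapping needs to remember where it was written.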

The +.gg method is simpler than you might expect. It takes the existing plot object, attaches the new component (a layer, a scale, a theme), and returns the modified plot. The plot is a list that grows with each +. Rendering happens only when you print it.

This two-phase design (build, then execute) appears in SQL query builders, in compiler intermediate representations, and in ggplot2. The user describes what they want; the system decides how to produce it.

27.3 Building a unit-aware DSL

Now build something. The goal: a small DSL for arithmetic with physical units. meters(5) + meters(3) should return 8 meters. meters(5) + seconds(3) should error. meters(100) |> to("km") should return 0.1 km. The DSL is deliberately small, but it exercises every technique from the last three chapters: S3 classes, operator overloading, constructors, validation, and dispatch.

Start with the class. A unit value is a number paired with a unit string:

new_unit <- function(value, unit) {
  structure(list(value = value, unit = unit), class = "unit_val")
}

print.unit_val <- function(x, ...) {
  cat(x$value, x$unit, "\n")
  invisible(x)
}

Add constructor functions for common units:

meters  <- function(x) new_unit(x, "m")
seconds <- function(x) new_unit(x, "s")
kg      <- function(x) new_unit(x, "kg")
meters(5)
#> 5 m
seconds(3)
#> 3 s

These constructors are the DSL’s vocabulary. Each one creates a unit_val object. The user never calls new_unit() directly.
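
Because users only ever go through the vocabulary functions, those functions are a natural place for input validation. One possible arrangement, sketched below, keeps new_unit() as a cheap low-level constructor and adds a checking layer on top; checked_unit() is a hypothetical helper, not part of the chapter’s code.

```r
new_unit <- function(value, unit) {
  structure(list(value = value, unit = unit), class = "unit_val")
}

# Hypothetical validating helper: check inputs once, at the boundary,
# then delegate to the low-level constructor.
checked_unit <- function(value, unit) {
  if (!is.numeric(value)) {
    stop(sprintf("value must be numeric, got %s", class(value)[1]))
  }
  if (!is.character(unit) || length(unit) != 1 || is.na(unit)) {
    stop("unit must be a single non-missing string")
  }
  new_unit(value, unit)
}

meters <- function(x) checked_unit(x, "m")

meters(5)$value
#> [1] 5
result <- tryCatch(meters("five"), error = function(e) conditionMessage(e))
result
#> [1] "value must be numeric, got character"
```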

27.3.1 Operator overloading

Overload + so that it works on unit_val objects, but only when the units match:

"+.unit_val" <- function(a, b) {
  if (a$unit != b$unit) {
    stop(sprintf("cannot add %s and %s", a$unit, b$unit))
  }
  new_unit(a$value + b$value, a$unit)
}
meters(5) + meters(3)
#> 8 m
meters(5) + seconds(3)
#> Error in `+.unit_val`:
#> ! cannot add m and s

The error is immediate and readable. No silent coercion, no mysterious NA. The same pattern works for -, *, and /, though multiplication and division need to combine units (meters times seconds gives meter-seconds, for instance). For brevity, handle just addition and subtraction here:

"-.unit_val" <- function(a, b) {
  if (missing(b)) return(new_unit(-a$value, a$unit))  # unary negation
  if (a$unit != b$unit) {
    stop(sprintf("cannot subtract %s from %s", b$unit, a$unit))
  }
  new_unit(a$value - b$value, a$unit)
}

Note the missing(b) check: - is used both as a binary operator (a - b) and as a unary operator (-a). When called with one argument, R dispatches to the same method but leaves b missing. Without this check, -meters(5) would error because there is no b$unit to compare.

meters(10) - meters(3)
#> 7 m
-meters(5)
#> -5 m

You now have a language for unit-safe arithmetic. meters(5) + meters(3) reads like a sentence and does the right thing. meters(5) + seconds(3) fails clearly. This is what a DSL does: it makes correct code easy to write and incorrect code hard to ignore.
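
One gap is worth knowing about. R dispatches arithmetic operators when either operand has the class, so 5 + meters(3) also reaches +.unit_val, with a plain number as the first argument, and a$unit then fails with an unhelpful message about the $ operator. A more defensive variant, offered as a sketch rather than a required change, checks both operands and also handles unary plus:

```r
new_unit <- function(value, unit) {
  structure(list(value = value, unit = unit), class = "unit_val")
}
meters <- function(x) new_unit(x, "m")

"+.unit_val" <- function(a, b) {
  if (missing(b)) return(a)  # unary plus: +meters(5)
  if (!inherits(a, "unit_val") || !inherits(b, "unit_val")) {
    stop("both operands must have units; wrap plain numbers in a constructor such as meters()")
  }
  if (a$unit != b$unit) {
    stop(sprintf("cannot add %s and %s", a$unit, b$unit))
  }
  new_unit(a$value + b$value, a$unit)
}

# The plain number on the left still triggers the method.
msg <- tryCatch(5 + meters(3), error = function(e) conditionMessage(e))
msg
#> [1] "both operands must have units; wrap plain numbers in a constructor such as meters()"
```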

27.3.2 Unit conversion

Add a conversion function. A lookup table maps unit pairs to conversion factors:

conversions <- list(
  "m_to_km"  = 0.001,
  "km_to_m"  = 1000,
  "s_to_min" = 1 / 60,
  "min_to_s" = 60,
  "kg_to_g"  = 1000,
  "g_to_kg"  = 0.001
)

to <- function(x, target) {
  # The DSL's operators maintain the unit_val invariant internally,
  # but to() is a boundary function: it accepts a string from the
  # user, so we validate here to keep invalid states out of the system.
  if (!inherits(x, "unit_val")) {
    stop(sprintf("x must be a unit_val, got %s", class(x)[1]))
  }
  if (!is.character(target) || length(target) != 1) {
    stop("target must be a single character string")
  }
  key <- paste0(x$unit, "_to_", target)
  # [[ returns NULL when the key is not in the table. The name
  # `multiplier` avoids shadowing base::factor().
  multiplier <- conversions[[key]]
  if (is.null(multiplier)) {
    stop(sprintf("no conversion from %s to %s", x$unit, target))
  }
  new_unit(x$value * multiplier, target)
}
meters(1500) |> to("km")
#> 1.5 km
meters(5) |> to("min")
#> Error in `to()`:
#> ! no conversion from m to min

The pipe (|>) makes the conversion read naturally: “take 1500 meters, convert to km.” The DSL is composable: you can chain arithmetic and conversion.

(meters(500) + meters(1000)) |> to("km")
#> 1.5 km

27.3.3 Comparison operators

Overload comparison so that units are checked there too:

">.unit_val" <- function(a, b) {
  if (a$unit != b$unit) stop(sprintf("cannot compare %s and %s", a$unit, b$unit))
  a$value > b$value
}

"==.unit_val" <- function(a, b) {
  if (a$unit != b$unit) stop(sprintf("cannot compare %s and %s", a$unit, b$unit))
  a$value == b$value
}
meters(10) > meters(5)
#> [1] TRUE
meters(5) == meters(5)
#> [1] TRUE

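Every operator defined so far enforces the invariant: you cannot mix units. The S3 class and the overloaded operators work together so that mixed-unit arithmetic and comparisons fail loudly instead of producing a meaningless number. The coverage is not yet complete, though: operators without a method, such as < or >=, still fall through to R’s default behavior on the underlying list, so a finished DSL would give them methods too.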
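
The two methods above cover only > and ==. One way to cover all six comparison operators at once is an S3 group generic. The following is a minimal sketch under the chapter’s unit_val representation; it deliberately rejects non-comparison operators, and if specific methods such as +.unit_val are also defined, R prefers those over the group method.

```r
new_unit <- function(value, unit) {
  structure(list(value = value, unit = unit), class = "unit_val")
}
meters  <- function(x) new_unit(x, "m")
seconds <- function(x) new_unit(x, "s")

Ops.unit_val <- function(e1, e2) {
  comparisons <- c("==", "!=", "<", "<=", ">", ">=")
  if (!(.Generic %in% comparisons)) {
    stop(sprintf("`%s` is not covered by this sketch", .Generic))
  }
  if (!inherits(e1, "unit_val") || !inherits(e2, "unit_val")) {
    stop("both operands must be unit_val objects")
  }
  if (e1$unit != e2$unit) {
    stop(sprintf("cannot compare %s and %s", e1$unit, e2$unit))
  }
  # .Generic holds the operator's name; look it up and apply it to
  # the bare numbers.
  get(.Generic)(e1$value, e2$value)
}

meters(1) < meters(2)
#> [1] TRUE
meters(2) >= meters(2)
#> [1] TRUE
mixed <- tryCatch(meters(1) < seconds(1), error = function(e) conditionMessage(e))
mixed
#> [1] "cannot compare m and s"
```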

Exercises

  1. Add a *.unit_val method that combines units. meters(5) * seconds(2) should return 10 m*s. (Hint: paste the unit strings together with * as a separator.)
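  2. Add format.unit_val and as.character.unit_val methods so that paste("Distance:", meters(42)) produces "Distance: 42 m". (paste() converts objects with as.character(), not format(), so a format method alone is not enough.)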
  3. Add celsius() and fahrenheit() constructors plus a to() conversion between them. This one is not a simple multiplication; you need an offset. How does that change the design of the conversions table?

27.4 Techniques for DSL construction

The unit DSL used three techniques: S3 classes, operator overloading, and constructor functions. Here is the full toolkit.

Operator overloading. R lets you define methods for +, -, *, /, [, [[, <, >, ==, |, &, and more. You can also create custom infix operators with the %op% syntax:

"%to%" <- function(x, target) to(x, target)
meters(1500) %to% "km"
#> 1.5 km

Custom infixes are useful when the built-in operators do not express the right meaning. The %>% pipe from magrittr is the most famous example; %in% is a base R one.

NSE for expression capture. If your DSL needs to capture column names, formulas, or unevaluated expressions, use substitute() (base R) or enquo() (rlang). The unit DSL did not need NSE because its inputs are plain values, but a query-builder DSL would:

where <- function(.data, expr) {
  e <- substitute(expr)
  rows <- eval(e, .data, parent.frame())
  .data[rows, ]
}
where(mtcars, mpg > 30)

This is a miniature version of what dplyr::filter() does: capture the expression, evaluate it against the data frame.

Formula interfaces. Use formulas when your DSL needs two-sided specifications. The ~ already prevents evaluation, and the formula carries its environment:

specify <- function(formula, data) {
  lhs <- all.vars(formula[[2]])
  rhs <- all.vars(formula[[3]])
  list(response = lhs, predictors = rhs)
}
specify(mpg ~ wt + hp, mtcars)

S3 dispatch as polymorphism. If your DSL should work on multiple data backends (data frames, databases, remote APIs), define your verbs as S3 generics. Each backend gets its own method. This is exactly how dplyr supports data frames and SQL databases with the same syntax.
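
A toy version of the idea, with hypothetical names throughout: keep_above() is a made-up verb and toy_sql_table a stand-in for a database connection. Real SQL generation would also need proper quoting and escaping, which this sketch ignores.

```r
# Hypothetical verb: keep rows where `column` exceeds `min_value`.
keep_above <- function(source, column, min_value) {
  UseMethod("keep_above")
}

# Backend 1: an in-memory data frame. Do the work now.
keep_above.data.frame <- function(source, column, min_value) {
  source[source[[column]] > min_value, , drop = FALSE]
}

# Backend 2: a stand-in for a database table. Describe the work as SQL.
toy_sql_table <- function(name) {
  structure(list(name = name), class = "toy_sql_table")
}

keep_above.toy_sql_table <- function(source, column, min_value) {
  sprintf("SELECT * FROM %s WHERE %s > %s",
          source$name, column, format(min_value))
}

nrow(keep_above(mtcars, "mpg", 30))
#> [1] 4
keep_above(toy_sql_table("cars"), "mpg", 30)
#> [1] "SELECT * FROM cars WHERE mpg > 30"
```

The caller writes the same verb in both cases; only the class of the first argument decides whether rows are filtered in memory or a query string is produced.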

Expression builders. ggplot2’s approach: each operation returns an object, and the objects compose. The user builds up a description; execution is deferred until the end. This separates specification from computation and makes the DSL composable.
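
A small base R sketch of the builder pattern follows. It is only loosely modeled on ggplot2, and toy_pipeline(), toy_step(), and run() are hypothetical names: + accumulates steps into a description, and nothing touches the data until run() is called.

```r
toy_pipeline <- function(data) {
  structure(list(data = data, steps = list()), class = "toy_pipeline")
}

toy_step <- function(label, fn) {
  structure(list(label = label, fn = fn), class = "toy_step")
}

# `+` only records the step; it does not execute it.
"+.toy_pipeline" <- function(e1, e2) {
  if (!inherits(e2, "toy_step")) stop("can only add a toy_step to a pipeline")
  e1$steps <- c(e1$steps, list(e2))
  e1
}

# Execution is a separate, explicit phase.
run <- function(p) {
  out <- p$data
  for (s in p$steps) out <- s$fn(out)
  out
}

p <- toy_pipeline(mtcars) +
  toy_step("fast cars", function(d) d[d$mpg > 30, , drop = FALSE]) +
  toy_step("two columns", function(d) d[, c("mpg", "wt"), drop = FALSE])

length(p$steps)   # built, not yet run
#> [1] 2
dim(run(p))
#> [1] 4 2
```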

27.5 eval(parse(text = ...)): the anti-pattern

You will encounter code like this:

col_name <- "mpg"
eval(parse(text = paste0("mtcars$", col_name)))

It constructs R code as a string, parses it into an expression, and evaluates it. It works. It is also almost always wrong.

The problems are concrete. First, injection attacks: if col_name comes from user input, a malicious string like "mpg; system('rm -rf /')" gets parsed and executed. Second, debugging: when the generated code errors, the error message refers to the parsed text, not to your source file. The traceback is useless. Third, tooling: static analysis, linting, and IDE autocompletion cannot see inside a string.
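
The injection risk can be inspected safely by parsing without evaluating. In this sketch the payload is a harmless message() call standing in for something destructive.

```r
col_name <- "mpg; message('injected code ran')"
code <- paste0("mtcars$", col_name)

# parse() alone runs nothing, so the result can be examined.
parsed <- parse(text = code)
length(parsed)        # two expressions, not one column lookup
#> [1] 2
is.call(parsed[[2]])  # the second one is the smuggled call
#> [1] TRUE

# The same string used as data is just a column name that does not exist.
mtcars[[col_name]]
#> NULL
```

Calling eval() on parsed would evaluate both expressions in turn, which is exactly what the attacker wants. With [[, the string never becomes code.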

The alternative is to work with expressions as data structures, not strings. Everything you learned in Chapter 26 avoids eval(parse(text = ...)):

col_name <- "mpg"
mtcars[[col_name]]
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
#> [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
#> [29] 15.8 19.7 15.0 21.4

For more complex cases, rlang::sym() converts a string to a symbol, and !! injects it:

col <- rlang::sym("mpg")
dplyr::summarise(mtcars, mean_val = mean(!!col))

Tip: Opinion

If you find yourself writing eval(parse(text = ...)), stop and ask whether the problem can be solved with [[, rlang::sym(), rlang::call2(), or do.call(). In nearly every case, it can. The string-based approach is a last resort for code generation tasks where you are producing R scripts to be run later, not for interactive computation.
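
Two of these alternatives are available in base R. The sketch below picks both the function and the column by name; the variable names are illustrative. In real code a function name that comes from user input should still be checked against an allowed list, since do.call() will call whatever function the string names.

```r
fn_name  <- "median"
col_name <- "hp"

# 1. do.call(): the function is looked up by name, the column with `[[`.
#    Neither string is ever parsed as code.
via_do_call <- do.call(fn_name, list(mtcars[[col_name]]))

# 2. call(): build median(mtcars[["hp"]]) as a data structure,
#    inspect it if you like, then evaluate it.
built <- call(fn_name, call("[[", as.name("mtcars"), col_name))
is.call(built)
#> [1] TRUE
via_call <- eval(built)

identical(via_do_call, via_call)
#> [1] TRUE
```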

Exercises

  1. Rewrite eval(parse(text = paste0("mean(mtcars$", col, ")"))) without eval(parse(...)). Use [[ or rlang::sym().
  2. Write a function safe_select(data, col_name) that takes a string column name and returns that column. Do it without eval(parse(text = ...)).

27.6 Design principles

A DSL is an interface. The same principles that make functions usable (Chapter 25) apply here, scaled up.

Keep the surface small. A DSL with fifty special operators is not a DSL; it is a burden. The core of the formula interface is a handful of operators (~, +, *, :, -, ^) plus the I() escape hatch. ggplot2 has many geoms, but they all compose through a single operator, +. Resist adding syntax; each new operator is a concept the user must learn.

Make errors readable. When units do not match, the error says “cannot add m and s.” When a conversion does not exist, it says “no conversion from m to min.” Good error messages name the things the user wrote, not the implementation details. This is Chapter 25 applied to DSL design.

Make it composable. Every expression in the DSL should return something that can be used as input to the next expression. meters(5) + meters(3) returns a unit_val; you can pass it to to() or add more values. ggplot2’s + returns a plot object; you can add more layers. Composability comes from consistent types: operations on unit_val return unit_val.

Do not surprise. If your DSL overloads +, it should still be associative: (a + b) + c should equal a + (b + c). If it redefines *, users will expect it to distribute over +. Violating mathematical expectations makes the DSL confusing, even if the behavior is internally consistent.
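
Expectations like this are cheap to check. The sketch below redefines the chapter’s addition method so that it is self-contained, then compares the two groupings. It uses whole numbers, for which the comparison is exact; with arbitrary floating-point values all.equal() would be the safer test.

```r
new_unit <- function(value, unit) {
  structure(list(value = value, unit = unit), class = "unit_val")
}
meters <- function(x) new_unit(x, "m")

"+.unit_val" <- function(a, b) {
  if (a$unit != b$unit) {
    stop(sprintf("cannot add %s and %s", a$unit, b$unit))
  }
  new_unit(a$value + b$value, a$unit)
}

x <- meters(1)
y <- meters(2)
z <- meters(3)

left  <- (x + y) + z
right <- x + (y + z)
identical(left, right)
#> [1] TRUE
```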

Document the grammar. A DSL has a grammar: what expressions are valid, what they mean, how they compose. Write it down, even if informally. “A unit_val can be combined with + or - if units match. Use to() to convert between units. Constructors: meters(), seconds(), kg().” That is the grammar.

27.7 Putting it all together

The chapter’s DSL used five ingredients:

  1. S3 class (unit_val) to represent domain objects.
  2. Constructor functions (meters(), seconds(), kg()) to provide vocabulary.
  3. Operator overloading (+.unit_val, -.unit_val) to give arithmetic domain-specific meaning.
  4. Validation (unit mismatch errors) to enforce domain rules.
  5. Composability (every operation returns a unit_val) to make expressions chainable.

A more ambitious DSL would add NSE to capture column names or expressions, use S3 generics to dispatch across backends, and build expression objects that defer evaluation. Between them, ggplot2 and dplyr use all of these. The progression from “simple class with overloaded operators” to “full expression-builder DSL with lazy evaluation and multiple backends” is a continuum. You do not need the full machinery for every problem. Start with the simplest technique that makes your interface read well, and add complexity only when the domain demands it.

The broader point is this: R gives you the tools to reshape the language around your problem. Formulas reshape arithmetic into model specification. ggplot2 reshapes + into layer composition. dplyr reshapes function calls into data queries. When you find yourself writing repetitive, mechanical code, consider whether a small DSL, even just a class with a few overloaded operators, would let you express the same ideas more directly. That is what Landin meant by “a language appropriate to the problem,” and it is what R, more than most languages, makes practical.

Exercises

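  1. Extend the unit DSL with a summary function that, given a list of unit_val objects (all with the same unit), returns the min, max, and mean. Validate that all units match. (A plain list does not dispatch to summary.unit_val, so either write an ordinary function or give the list its own class and a summary method for that class.)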
  2. Design (on paper, no code required) a DSL for describing file-processing pipelines: read a CSV, filter rows, rename columns, write output. What would the verbs be? What class would they operate on? How would they compose?
  3. Read the source of htmltools::tag(). How does it build HTML from R function calls? Which of the techniques from this chapter does it use?
  4. Pick a repetitive task from your own work. Sketch a five-function DSL that would make it more concise. What class would you define? What operators would you overload?