15  Pipes and composition

Mathematicians write g(f(x)). R programmers write x |> f() |> g(). Same idea, better reading order. In Section 1.2, function composition was one of the fundamental operations: apply one function, then apply another to the result. Pipes make that composition readable. This chapter explains how they work, why they work, and where they break down.

15.1 The problem with nesting

Real analysis chains multiple operations. Suppose you want to take the penguins data, keep only Adelie penguins, select two columns, and sort by body mass. Without pipes, you have three choices.

Nesting: read inside-out, right-to-left.

arrange(filter(select(penguins, species, body_mass_g), species == "Adelie"), body_mass_g)

This works for two functions. At three, it’s hard to read. At five, it’s unreadable. The first operation you want to happen (select) is buried deepest inside the expression.

Intermediate variables: name every step.

a <- select(penguins, species, body_mass_g)
b <- filter(a, species == "Adelie")
c <- arrange(b, body_mass_g)

Readable, but the environment fills with throwaway names. a, b, c, temp, df2: names that exist only to carry a value from one line to the next.

Overwriting: reuse the same name.

df <- select(penguins, species, body_mass_g)
df <- filter(df, species == "Adelie")
df <- arrange(df, body_mass_g)

Compact, but fragile. You cannot re-run a single line without re-running everything above it, because each line depends on the previous value of df.

All three approaches work, but none compose well. Pipes solve this problem.

15.2 The pipe: |>

library(palmerpenguins)
#> 
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#> 
#>     penguins, penguins_raw
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

penguins |>
  select(species, body_mass_g) |>
  filter(species == "Adelie") |>
  arrange(body_mass_g)
#> # A tibble: 152 × 2
#>    species body_mass_g
#>    <fct>         <int>
#>  1 Adelie         2850
#>  2 Adelie         2850
#>  3 Adelie         2900
#>  4 Adelie         2900
#>  5 Adelie         2900
#>  6 Adelie         2925
#>  7 Adelie         2975
#>  8 Adelie         3000
#>  9 Adelie         3000
#> 10 Adelie         3050
#> # ℹ 142 more rows

Read it top to bottom: take penguins, select two columns, keep Adelie rows, sort by mass. Each line transforms the result of the previous one.

The rule is simple. x |> f() is syntactic sugar for f(x). The pipe takes the value on its left and inserts it as the first argument of the function on its right. That is the entire mechanism. There is no magic, no special evaluation: just rewriting x |> f() as f(x) before the code runs. And x |> f() |> g() is g(f(x)), which is (g ∘ f)(x) in mathematical notation. The pipe is function composition from lambda calculus, written left-to-right instead of inside-out.
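You can watch that rewrite happen by quoting a piped expression: the parser has already turned it into a nested call before anything is evaluated.

```r
# The parser rewrites |> into an ordinary nested call before evaluation
quote(16 |> sqrt() |> log())
#> log(sqrt(16))

# Evaluating the piped form gives the same result as the nested form
identical(16 |> sqrt() |> log(), log(sqrt(16)))
#> [1] TRUE
```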

Function composition is itself a monoid. The operation is composition: combine two functions into one. The identity element is the identity function \(x) x, which returns its argument unchanged: f ∘ id = f and id ∘ f = f, just as x + 0 = x for addition. You saw the same structure with c() and NULL (Section 4.1), with string concatenation and "", and you will see it again with ggplot’s + (Section 17.8) and with Reduce() (Chapter 21). The monoid keeps showing up because composition is the fundamental way to build complex behavior from simple parts.
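A minimal sketch of that structure in base R, using a hypothetical compose2() helper (not a base function):

```r
# compose2() is a hypothetical helper: apply f, then g
compose2 <- function(f, g) \(x) g(f(x))
id <- \(x) x   # the identity element of the monoid

h <- compose2(sqrt, round)
h(17)   # round(sqrt(17))
#> [1] 4

# Composing with the identity changes nothing: f composed with id is f
compose2(id, sqrt)(16) == sqrt(16)
#> [1] TRUE
compose2(sqrt, id)(16) == sqrt(16)
#> [1] TRUE
```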

The native pipe |> was introduced in R 4.1 (2021) and is built into the language, requiring no packages. The idea is older than R: Doug McIlroy proposed the pipe concept for Unix in 1964, writing “we should have some ways of coupling programs like garden hose.” The | operator in the Unix shell chains programs, each reading from standard input and writing to standard output. R’s |> is the same idea applied to data frames: each dplyr verb takes a data frame and returns a data frame, and the pipe connects them.

Tip: Opinion

One verb per line. Put a space before |> and a newline after it. A pipeline formatted as a single long line loses all the readability the pipe was designed to provide.

Exercises

  1. Rewrite the following nested call as a pipeline: head(sort(sqrt(1:20)), 5).
  2. Take penguins, filter for penguins heavier than 4000g, and select only species and island. Write it as a pipeline.
  3. What does 10 |> sqrt() return? What about 10 |> log(base = 2)? Explain why log receives two arguments.

15.3 Building a pipeline

Start with the data. Add one step at a time. Read the result like a recipe:

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarise(mean_mass = mean(body_mass_g)) |>
  arrange(desc(mean_mass))
#> # A tibble: 3 × 2
#>   species   mean_mass
#>   <fct>         <dbl>
#> 1 Gentoo        5076.
#> 2 Chinstrap     3733.
#> 3 Adelie        3701.

“Take penguins, remove NAs, group by species, compute mean mass, sort descending.” The code reads like the sentence.

Build incrementally. Run the first line, check the output. Add the next step, check again. Pipes invite exploration: you grow a pipeline one verb at a time, inspecting the intermediate data frame at each stage. When the result looks right, the pipeline is done.

penguins |>
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarise(
    n = n(),
    mean_flipper = mean(flipper_length_mm),
    mean_mass = mean(body_mass_g),
    .groups = "drop"
  )
#> # A tibble: 5 × 5
#>   species   island        n mean_flipper mean_mass
#>   <fct>     <fct>     <int>        <dbl>     <dbl>
#> 1 Adelie    Biscoe       44         189.     3710.
#> 2 Adelie    Dream        56         190.     3688.
#> 3 Adelie    Torgersen    51         191.     3706.
#> 4 Chinstrap Dream        68         196.     3733.
#> 5 Gentoo    Biscoe      123         217.     5076.

Each verb does one thing. filter removes rows. group_by sets the grouping. summarise collapses groups to summaries. arrange sorts. You compose them into a pipeline the same way you compose sentences into a paragraph: one idea at a time, in order.

Exercises

  1. Build a pipeline that counts how many penguins of each species live on each island. Start with penguins, group by species and island, then summarise with n = n().
  2. Extend your pipeline from Exercise 1: arrange the result by n in descending order.
  3. Write a pipeline that computes the median bill_length_mm per species (removing NAs), then filters to keep only species where the median exceeds 40mm.

15.4 Function composition

In Section 1.2, you saw that lambda calculus is built on applying functions to arguments. Composition is the next step: apply one function, then apply another to the result.

x |> f() |> g() is g(f(x)). In mathematical notation, that is (g ∘ f)(x): the composition of g and f, applied to x. The pipe rewrites composition from inside-out to left-to-right. Nothing else changes.

This is why dplyr verbs all follow the same contract: take a data frame as the first argument, return a data frame. Functions that follow this contract compose via the pipe. Functions that don’t (first argument is a formula, a filename, a model object) require workarounds. The convention exists so that verbs compose via the pipe.
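A sketch of the contract with a hypothetical base-R verb, drop_missing(): data frame in, data frame out, so it slots into a pipeline like any dplyr verb.

```r
# drop_missing() is a hypothetical verb following the dplyr contract:
# first argument is a data frame, and a data frame comes back out
drop_missing <- function(df, col) df[!is.na(df[[col]]), , drop = FALSE]

mtcars |> drop_missing("mpg") |> nrow()
#> [1] 32
```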

In Chapter 7, you saw that functions are values you can pass around. Here, you see the other side: functions are things you compose. Passing and composing are the two fundamental operations on functions, and together they account for most of what makes R code concise.

clean_and_summarise <- function(df) {
  df |>
    filter(!is.na(body_mass_g)) |>
    group_by(species) |>
    summarise(mean_mass = mean(body_mass_g), .groups = "drop")
}

penguins |> clean_and_summarise()
#> # A tibble: 3 × 2
#>   species   mean_mass
#>   <fct>         <dbl>
#> 1 Adelie        3701.
#> 2 Chinstrap     3733.
#> 3 Gentoo        5076.

Wrapping a pipeline in a function gives the composition a name. clean_and_summarise is a single function built from four composed steps. You can pass it around, store it in a list, or use it inside another pipeline, exactly as Chapter 7 described. Notice that the pipeline describes the transformation without naming intermediates:

penguins |> filter(species == "Adelie") |> nrow()

In functional programming circles (Haskell especially), this style is called “point-free” or “tacit” programming, tracing back to combinatory logic (Schönfinkel, 1924): functions are composed without mentioning the data they act on. R’s pipe enables a mild version of this. It isn’t fully point-free (dplyr verbs still reference column names), but the same impulse is at work.
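As a sketch of the tacit style in base R, here is a hypothetical pipeline() helper that composes any number of functions with Reduce(), never naming the data they will act on:

```r
# pipeline() is a hypothetical helper: compose functions left to right
pipeline <- function(...) {
  fs <- list(...)
  function(x) Reduce(\(acc, f) f(acc), fs, x)
}

# Point-free definition: the data argument is never mentioned
root_then_round <- pipeline(sqrt, \(x) round(x, 1))
root_then_round(2)
#> [1] 1.4
```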

15.5 The placeholder _

The pipe inserts the left side as the first argument. Sometimes you need it somewhere else.

penguins |>
  filter(!is.na(body_mass_g), !is.na(bill_length_mm)) |>
  lm(body_mass_g ~ bill_length_mm, data = _)
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm, data = filter(penguins, 
#>     !is.na(body_mass_g), !is.na(bill_length_mm)))
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm  
#>         362.31           87.42

The _ placeholder says “put the piped value here.” It must appear as a named argument (data = _), and it can appear only once. This syntax requires R 4.2+.
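A small base-R example of the same rule: gsub() takes its input string as the third argument, x, so the placeholder must be passed by name.

```r
# _ can go anywhere, as long as it is passed as a named argument (R 4.2+)
"penguin" |> gsub(pattern = "n", replacement = "N", x = _)
#> [1] "peNguiN"
```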

For anything more complex, use an anonymous function:

c(1, 4, 9, 16) |>
  (\(x) x[x > 3])()
#> [1]  4  9 16

The \(x) syntax from Section 7.2 creates a function inline. The trailing () calls it immediately with the piped value. This works for any situation where _ is too restrictive.

Exercises

  1. Pipe mtcars into lm() to fit mpg ~ wt, using the _ placeholder for the data argument.
  2. Rewrite the same call using an anonymous function instead of _.
  3. Why does penguins |> lm(body_mass_g ~ species, _) fail? (Hint: the placeholder must be a named argument.)

15.6 |> vs %>%

The magrittr pipe %>% came first (2014). The native pipe |> arrived in R 4.1 (2021). Both do the same basic thing: insert the left side as the first argument of the right side. The differences are practical.

First, |> requires parentheses on the right side: x |> sqrt(), not x |> sqrt. Second, |> uses _ as its placeholder (named argument only, once), while %>% uses . (anywhere, multiple times). Third, |> is faster because it is a syntax transformation: the parser rewrites x |> f() to f(x) before evaluation, whereas %>% is a function call with its own overhead. Fourth, |> needs no package; %>% needs magrittr (loaded automatically by the tidyverse).

The theoretical distinction runs deeper. The base pipe |> is pure composition: (g ∘ f)(x), with no implicit variable and no branching. The magrittr pipe %>% is loosely reminiscent of what Haskell calls Kleisli composition. In ordinary composition, you chain functions of the form a -> b. In Kleisli composition, you chain functions of the form a -> m b, where m is some context (a list, a maybe, an I/O action). The . pronoun in %>% plays a similar role to that context: it threads the result through, letting you refer to it multiple times, branch on it, or pass it to non-first arguments. This is why the following works with %>% but has no direct equivalent with |>:

x %>% { if (nrow(.) > 0) filter(., cond) else . }

The magrittr pipe is not just “composition with a dot”; the . pronoun gives it a flavor of the monadic bind operator >>=, which chains computations that carry context along with their result. If this sounds abstract, think of it concretely: |> is “apply the next function,” while %>% is “apply the next function, and here is a name for what you are working with.” That extra naming power makes %>% more flexible but also more complex, which is precisely why the base pipe chose to omit it.
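With |>, the closest equivalent wraps the branch in an anonymous function. A sketch using base R’s subset() and a concrete mpg > 25 condition in place of the elided cond:

```r
# Sketch: branching inside a |> pipeline requires an anonymous function;
# subset() and mpg > 25 stand in for the elided filter(., cond)
mtcars |>
  (\(d) if (nrow(d) > 0) subset(d, mpg > 25) else d)() |>
  nrow()
#> [1] 6
```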

# These are equivalent
penguins |> nrow()
#> [1] 344

library(magrittr)
penguins %>% nrow()
#> [1] 344
Tip: Opinion

Use |>. It is in base R, it is faster, and it is the future of the language. The only reason to reach for %>% is legacy code or the rare case where you need . in multiple positions within a single call. For new code, |> is the right default.

15.7 When not to pipe

Pipes are for linear sequences: A then B then C. They are not for everything.

Multiple inputs. If two data frames need to interact, pipe one and pass the other as an argument. A join is fine in a pipeline; a complex merge of three data frames with different preparation steps is not.

Multiple outputs. If you need intermediate results for different purposes later, name them. A pipeline produces one result. If you need the filtered data and the summary, use two separate pipelines from a shared intermediate.

adelie <- penguins |> filter(species == "Adelie")

# Two different analyses from the same intermediate
adelie |> summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   mean_mass
#>       <dbl>
#> 1     3701.
adelie |> count(island)
#> # A tibble: 3 × 2
#>   island        n
#>   <fct>     <int>
#> 1 Biscoe       44
#> 2 Dream        56
#> 3 Torgersen    52

Long chains. Past eight or ten steps, break the pipeline. Name the intermediate result at a meaningful boundary. A pipeline that scrolls off the screen tells a run-on story.

Side effects in the middle. print(), write.csv(), and plot() produce side effects. They belong at the end of a pipeline, not in the middle. Putting a side effect between two transformations interrupts the data flow and makes the pipeline harder to reason about.
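If you genuinely need to look at the data mid-pipeline, one pattern is a pass-through helper that performs the side effect and returns its input unchanged. Here tee() is a hypothetical name borrowed from the Unix tool, not a base R function:

```r
# tee() is a hypothetical helper: run f for its side effect, return x unchanged
tee <- function(x, f) { f(x); x }

c(3, 1, 2) |>
  sort() |>
  tee(print) |>   # side effect: prints the sorted vector, data flows on
  rev()
#> [1] 1 2 3
#> [1] 3 2 1
```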

Debugging. When a pipeline produces unexpected output, break it apart. Assign intermediates, inspect each one, find the step that goes wrong. Once you’ve fixed it, reassemble the pipeline.
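A sketch of that workflow in base R, with aggregate() standing in for a dplyr summary step:

```r
# Break the pipeline apart: name each step, inspect it, then reassemble
step1 <- subset(mtcars, !is.na(mpg))
nrow(step1)   # check: did filtering drop any rows?
#> [1] 32

step2 <- aggregate(mpg ~ cyl, data = step1, FUN = mean)
step2         # check: does the summary look right?
```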

Tip: Opinion

A pipe should tell a story. If it reads like a run-on sentence, break it into paragraphs. Name the paragraph boundaries with meaningful variable names, not temp1 and temp2.

Exercises

  1. The following pipeline tries to do too much. Break it into two or three named steps with meaningful names:
penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(mass_kg = body_mass_g / 1000) |>
  group_by(species) |>
  summarise(mean_kg = mean(mass_kg)) |>
  arrange(desc(mean_kg)) |>
  mutate(label = paste(species, round(mean_kg, 1), "kg"))
  2. Why is putting write.csv() in the middle of a pipeline (between filter and summarise) a bad idea? What could go wrong?
  3. You need both a summary table (mean mass per species) and a filtered dataset (Adelie penguins only) from penguins. Write code that avoids repeating the filter(!is.na(body_mass_g)) step.