15  Pipes and composition

Mathematicians write g(f(x)). R programmers write x |> f() |> g(). Same idea, better reading order. In Section 1.2, function composition was one of the fundamental operations: apply one function, then apply another to the result. Pipes make that composition readable. This chapter explains how they work, why they work, and where they break down.

15.1 The problem with nesting

Real analysis chains multiple operations. Suppose you want to take the penguins data, keep only Adelie penguins, select two columns, and sort by body mass. Without pipes, you have three choices.

Nesting: read inside-out, right-to-left.

arrange(filter(select(penguins, species, body_mass_g), species == "Adelie"), body_mass_g)

This works for two functions. At three, it’s hard to read. At five, it’s unreadable. The first operation you want to happen (select) is buried deepest inside the expression.

Intermediate variables: name every step.

a <- select(penguins, species, body_mass_g)
b <- filter(a, species == "Adelie")
c <- arrange(b, body_mass_g)

Readable, but the environment fills with throwaway names. a, b, c, temp, df2: names that exist only to carry a value from one line to the next.

Overwriting: reuse the same name.

df <- select(penguins, species, body_mass_g)
df <- filter(df, species == "Adelie")
df <- arrange(df, body_mass_g)

Compact, but fragile. You cannot re-run a single line without re-running everything above it, because each line depends on the previous value of df.

All three approaches work, but none compose well. Pipes solve this problem.

15.2 The pipe: |>

library(palmerpenguins)
#> 
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#> 
#>     penguins, penguins_raw
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

penguins |>
  select(species, body_mass_g) |>
  filter(species == "Adelie") |>
  arrange(body_mass_g)
#> # A tibble: 152 × 2
#>    species body_mass_g
#>    <fct>         <int>
#>  1 Adelie         2850
#>  2 Adelie         2850
#>  3 Adelie         2900
#>  4 Adelie         2900
#>  5 Adelie         2900
#>  6 Adelie         2925
#>  7 Adelie         2975
#>  8 Adelie         3000
#>  9 Adelie         3000
#> 10 Adelie         3050
#> # ℹ 142 more rows

Read it top to bottom: take penguins, select two columns, keep Adelie rows, sort by mass. Each line transforms the result of the previous one.

The rule is simple. x |> f() is syntactic sugar for f(x). The pipe takes the value on its left and inserts it as the first argument of the function on its right. That is the entire mechanism. There is no magic, no special evaluation: just rewriting x |> f() as f(x) before the code runs. And x |> f() |> g() is g(f(x)), which is (g ∘ f)(x) in mathematical notation. The pipe is function composition from lambda calculus, written left-to-right instead of inside-out.
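You can watch that rewrite happen by quoting a piped expression: the parser has already turned it into a nested call before anything is evaluated.

```r
# The parser rewrites |> into an ordinary nested call before evaluation
quote(16 |> sqrt() |> log())
#> log(sqrt(16))

# Evaluating the piped form gives the same result as the nested form
identical(16 |> sqrt() |> log(), log(sqrt(16)))
#> [1] TRUE
```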

Function composition is itself a monoid. The operation is composition: combine two functions into one. The identity element is the identity function \(x) x, which returns its argument unchanged: f ∘ id = f and id ∘ f = f, just as x + 0 = x for addition. You saw the same structure with c() and NULL (Section 4.1), with string concatenation and "", and you will see it again with ggplot’s + (Section 17.8) and with Reduce() (Chapter 21). The monoid keeps showing up because composition is the fundamental way to build complex behavior from simple parts.
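A minimal sketch of that structure in base R, using a hypothetical compose2() helper (not a base function):

```r
# compose2() is a hypothetical helper: apply f, then g
compose2 <- function(f, g) \(x) g(f(x))
id <- \(x) x   # the identity element of the monoid

h <- compose2(sqrt, round)
h(17)   # round(sqrt(17))
#> [1] 4

# Composing with the identity changes nothing: f composed with id is f
compose2(id, sqrt)(16) == sqrt(16)
#> [1] TRUE
compose2(sqrt, id)(16) == sqrt(16)
#> [1] TRUE
```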

The native pipe |> was introduced in R 4.1 (2021) and is built into the language, requiring no packages. The idea is older than R: Doug McIlroy proposed the pipe concept for Unix in 1964, writing “we should have some ways of coupling programs like garden hose.” The | operator in the Unix shell chains programs, each reading from standard input and writing to standard output. R’s |> is the same idea applied to data frames: each dplyr verb takes a data frame and returns a data frame, and the pipe connects them.

Tip: Opinion

One verb per line. Put a space before |> and a newline after it. A pipeline formatted as a single long line loses all the readability the pipe was designed to provide.

Exercises

  1. Rewrite the following nested call as a pipeline: head(sort(sqrt(1:20)), 5).
  2. Take penguins, filter for penguins heavier than 4000g, and select only species and island. Write it as a pipeline.
  3. What does 10 |> sqrt() return? What about 10 |> log(base = 2)? Explain why log receives two arguments.

15.3 Building a pipeline

Start with the data. Add one step at a time. Read the result like a recipe:

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarise(mean_mass = mean(body_mass_g)) |>
  arrange(desc(mean_mass))
#> # A tibble: 3 × 2
#>   species   mean_mass
#>   <fct>         <dbl>
#> 1 Gentoo        5076.
#> 2 Chinstrap     3733.
#> 3 Adelie        3701.

“Take penguins, remove NAs, group by species, compute mean mass, sort descending.” The code reads like the sentence.

Build incrementally. Run the first line, check the output. Add the next step, check again. Pipes invite exploration: you grow a pipeline one verb at a time, inspecting the intermediate data frame at each stage. When the result looks right, the pipeline is done.

penguins |>
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarise(
    n = n(),
    mean_flipper = mean(flipper_length_mm),
    mean_mass = mean(body_mass_g),
    .groups = "drop"
  )
#> # A tibble: 5 × 5
#>   species   island        n mean_flipper mean_mass
#>   <fct>     <fct>     <int>        <dbl>     <dbl>
#> 1 Adelie    Biscoe       44         189.     3710.
#> 2 Adelie    Dream        56         190.     3688.
#> 3 Adelie    Torgersen    51         191.     3706.
#> 4 Chinstrap Dream        68         196.     3733.
#> 5 Gentoo    Biscoe      123         217.     5076.

Each verb does one thing. filter removes rows. group_by sets the grouping. summarise collapses groups to summaries. arrange sorts. You compose them into a pipeline the same way you compose sentences into a paragraph: one idea at a time, in order.

Exercises

  1. Build a pipeline that counts how many penguins of each species live on each island. Start with penguins, group by species and island, then summarise with n = n().
  2. Extend your pipeline from Exercise 1: arrange the result by n in descending order.
  3. Write a pipeline that computes the median bill_length_mm per species (removing NAs), then filters to keep only species where the median exceeds 40mm.

15.4 Function composition

In Section 1.2, you saw that lambda calculus is built on applying functions to arguments. Composition is the next step: apply one function, then apply another to the result.

x |> f() |> g() is g(f(x)). In mathematical notation, that is (g ∘ f)(x): the composition of g and f, applied to x. The pipe rewrites composition from inside-out to left-to-right. Nothing else changes.

This is why dplyr verbs all follow the same contract: take a data frame as the first argument, return a data frame. Functions that follow this contract compose via the pipe. Functions that don’t (first argument is a formula, a filename, a model object) require workarounds. The convention exists so that verbs compose via the pipe.
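A sketch of the contract with a hypothetical base-R verb, drop_missing(): data frame in, data frame out, so it slots into a pipeline like any dplyr verb.

```r
# drop_missing() is a hypothetical verb following the dplyr contract:
# first argument is a data frame, and a data frame comes back out
drop_missing <- function(df, col) df[!is.na(df[[col]]), , drop = FALSE]

mtcars |> drop_missing("mpg") |> nrow()
#> [1] 32
```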

In Chapter 7, you saw that functions are values you can pass around. Here, you see the other side: functions are things you compose. Passing and composing are the two fundamental operations on functions, and together they account for most of what makes R code concise.

clean_and_summarise <- function(df) {
  df |>
    filter(!is.na(body_mass_g)) |>
    group_by(species) |>
    summarise(mean_mass = mean(body_mass_g), .groups = "drop")
}

penguins |> clean_and_summarise()
#> # A tibble: 3 × 2
#>   species   mean_mass
#>   <fct>         <dbl>
#> 1 Adelie        3701.
#> 2 Chinstrap     3733.
#> 3 Gentoo        5076.

Wrapping a pipeline in a function gives the composition a name. clean_and_summarise is a single function built from four composed steps. You can pass it around, store it in a list, or use it inside another pipeline, exactly as Chapter 7 described. Notice that the pipeline describes the transformation without naming intermediates:

penguins |> filter(species == "Adelie") |> nrow()

In functional programming circles (Haskell especially), this style is called “point-free” or “tacit” programming, tracing back to combinatory logic (Schönfinkel, 1924): functions are composed without mentioning the data they act on. R’s pipe enables a mild version of this. It isn’t fully point-free (dplyr verbs still reference column names), but the same impulse is at work.
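As a sketch of the tacit style in base R, here is a hypothetical pipeline() helper that composes any number of functions with Reduce(), never naming the data they will act on:

```r
# pipeline() is a hypothetical helper: compose functions left to right
pipeline <- function(...) {
  fs <- list(...)
  function(x) Reduce(\(acc, f) f(acc), fs, x)
}

# Point-free definition: the data argument is never mentioned
root_then_round <- pipeline(sqrt, \(x) round(x, 1))
root_then_round(2)
#> [1] 1.4
```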

15.5 The placeholder _

The pipe inserts the left side as the first argument. Sometimes you need it somewhere else.

penguins |>
  filter(!is.na(body_mass_g), !is.na(bill_length_mm)) |>
  lm(body_mass_g ~ bill_length_mm, data = _)
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm, data = filter(penguins, 
#>     !is.na(body_mass_g), !is.na(bill_length_mm)))
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm  
#>         362.31           87.42

The _ placeholder says “put the piped value here.” It must appear as a named argument (data = _), and it can appear only once. This syntax requires R 4.2+.
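A small base-R example of the same rule: gsub() takes its input string as the third argument, x, so the placeholder must be passed by name.

```r
# _ can go anywhere, as long as it is passed as a named argument (R 4.2+)
"penguin" |> gsub(pattern = "n", replacement = "N", x = _)
#> [1] "peNguiN"
```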

For anything more complex, use an anonymous function:

c(1, 4, 9, 16) |>
  (\(x) x[x > 3])()
#> [1]  4  9 16

The \(x) syntax from Section 7.2 creates a function inline. The trailing () calls it immediately with the piped value. This works for any situation where _ is too restrictive.

Exercises

  1. Pipe mtcars into lm() to fit mpg ~ wt, using the _ placeholder for the data argument.
  2. Rewrite the same call using an anonymous function instead of _.
  3. Why does penguins |> lm(body_mass_g ~ species, _) fail? (Hint: the placeholder must be a named argument.)

15.6 |> vs %>%

The magrittr pipe %>% came first (2014). The native pipe |> arrived in R 4.1 (2021). Both do the same basic thing: insert the left side as the first argument of the right side. The differences are practical.

First, |> requires parentheses on the right side: x |> sqrt(), not x |> sqrt. Second, |> uses _ as its placeholder (named argument only, once), while %>% uses . (anywhere, multiple times). Third, |> is faster because it is a syntax transformation: the parser rewrites x |> f() to f(x) before evaluation, whereas %>% is a function call with its own overhead. Fourth, |> needs no package; %>% needs magrittr (loaded automatically by the tidyverse).

The theoretical distinction runs deeper. The base pipe |> is pure composition: (g ∘ f)(x), with no implicit variable and no branching. The magrittr pipe %>% is loosely reminiscent of what Haskell calls Kleisli composition. In ordinary composition, you chain functions of the form a -> b. In Kleisli composition, you chain functions of the form a -> m b, where m is some context (a list, a maybe, an I/O action). The . pronoun in %>% plays a similar role to that context: it threads the result through, letting you refer to it multiple times, branch on it, or pass it to non-first arguments. This is why the following works with %>% but has no direct equivalent with |>:

x %>% { if (nrow(.) > 0) filter(., cond) else . }

The magrittr pipe is not just “composition with a dot”; the . pronoun gives it a flavor of the monadic bind operator >>=, which chains computations that carry context along with their result. If this sounds abstract, think of it concretely: |> is “apply the next function,” while %>% is “apply the next function, and here is a name for what you are working with.” That extra naming power makes %>% more flexible but also more complex, which is precisely why the base pipe chose to omit it.
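With |>, the closest equivalent wraps the branch in an anonymous function. A sketch using base R’s subset() and a concrete mpg > 25 condition in place of the elided cond:

```r
# Sketch: branching inside a |> pipeline requires an anonymous function;
# subset() and mpg > 25 stand in for the elided filter(., cond)
mtcars |>
  (\(d) if (nrow(d) > 0) subset(d, mpg > 25) else d)() |>
  nrow()
#> [1] 6
```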

# These are equivalent
penguins |> nrow()
#> [1] 344

library(magrittr)
penguins %>% nrow()
#> [1] 344
Tip: Opinion

Use |>. It is in base R, it is faster, and it is the future of the language. The only reason to reach for %>% is legacy code or the rare case where you need . in multiple positions within a single call. For new code, |> is the right default.

15.7 When not to pipe

Pipes are for linear sequences: A then B then C. They are not for everything.

Multiple inputs. If two data frames need to interact, pipe one and pass the other as an argument. A join is fine in a pipeline; a complex merge of three data frames with different preparation steps is not.

Multiple outputs. If you need intermediate results for different purposes later, name them. A pipeline produces one result. If you need the filtered data and the summary, use two separate pipelines from a shared intermediate.

adelie <- penguins |> filter(species == "Adelie")

# Two different analyses from the same intermediate
adelie |> summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   mean_mass
#>       <dbl>
#> 1     3701.
adelie |> count(island)
#> # A tibble: 3 × 2
#>   island        n
#>   <fct>     <int>
#> 1 Biscoe       44
#> 2 Dream        56
#> 3 Torgersen    52

Long chains. Past eight or ten steps, break the pipeline. Name the intermediate result at a meaningful boundary. A pipeline that scrolls off the screen tells a run-on story.

Side effects in the middle. print(), write.csv(), and plot() produce side effects. They belong at the end of a pipeline, not in the middle. Putting a side effect between two transformations interrupts the data flow and makes the pipeline harder to reason about.
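If you genuinely need to look at the data mid-pipeline, one pattern is a pass-through helper that performs the side effect and returns its input unchanged. Here tee() is a hypothetical name borrowed from the Unix tool, not a base R function:

```r
# tee() is a hypothetical helper: run f for its side effect, return x unchanged
tee <- function(x, f) { f(x); x }

c(3, 1, 2) |>
  sort() |>
  tee(print) |>   # side effect: prints the sorted vector, data flows on
  rev()
#> [1] 1 2 3
#> [1] 3 2 1
```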

Debugging. When a pipeline produces unexpected output, break it apart. Assign intermediates, inspect each one, find the step that goes wrong. Once you’ve fixed it, reassemble the pipeline.
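A sketch of that workflow in base R, with aggregate() standing in for a dplyr summary step:

```r
# Break the pipeline apart: name each step, inspect it, then reassemble
step1 <- subset(mtcars, !is.na(mpg))
nrow(step1)   # check: did filtering drop any rows?
#> [1] 32

step2 <- aggregate(mpg ~ cyl, data = step1, FUN = mean)
step2         # check: does the summary look right?
```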

Tip: Opinion

A pipe should tell a story. If it reads like a run-on sentence, break it into paragraphs. Name the paragraph boundaries with meaningful variable names, not temp1 and temp2.

Exercises

  1. The following pipeline tries to do too much. Break it into two or three named steps with meaningful names:
penguins |>
  filter(!is.na(body_mass_g)) |>
  mutate(mass_kg = body_mass_g / 1000) |>
  group_by(species) |>
  summarise(mean_kg = mean(mass_kg)) |>
  arrange(desc(mean_kg)) |>
  mutate(label = paste(species, round(mean_kg, 1), "kg"))
  2. Why is putting write.csv() in the middle of a pipeline (between filter and summarise) a bad idea? What could go wrong?
  3. You need both a summary table (mean mass per species) and a filtered dataset (Adelie penguins only) from penguins. Write code that avoids repeating the filter(!is.na(body_mass_g)) step.