15 Pipes and composition
Pipes give you composition. Section 19.3 asks the next question: how do you apply the same transformation to many things at once?
Consider a line of code that selects columns, filters rows, and sorts the result. Written with nested function calls, you read it inside-out: the first operation you want is buried deepest, and by the third level of parentheses you are counting closing parens by hand. The order you write the code runs backwards from the order you think about the problem.
15.1 The problem with nesting
Real analysis chains multiple operations. Take the penguins data, select two columns, keep only Adelie penguins, sort by body mass. Without pipes, you have three choices.
Nesting: read inside-out, right-to-left.
arrange(filter(select(penguins, species, body_mass_g), species == "Adelie"), body_mass_g)
This works for two functions, but at three it strains, and at five you are matching parentheses like a Lisp programmer on a bad day. The first operation you want to happen (select) is buried deepest inside the expression.
Intermediate variables: name every step.
a <- select(penguins, species, body_mass_g)
b <- filter(a, species == "Adelie")
c <- arrange(b, body_mass_g)
Readable, but the environment fills with throwaway names that exist only to shuttle a value from one line to the next: a, b, c, temp, df2.
Overwriting: reuse the same name.
df <- select(penguins, species, body_mass_g)
df <- filter(df, species == "Adelie")
df <- arrange(df, body_mass_g)
Compact, but fragile. You cannot re-run a single line without re-running everything above it, because each line depends on whatever df was a moment ago.
All three approaches work, yet none compose well. What would it look like if each step simply fed into the next?
15.2 The pipe: |>
library(palmerpenguins)
#>
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#>
#> penguins, penguins_raw
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
penguins |>
select(species, body_mass_g) |>
filter(species == "Adelie") |>
arrange(body_mass_g)
#> # A tibble: 152 × 2
#> species body_mass_g
#> <fct> <int>
#> 1 Adelie 2850
#> 2 Adelie 2850
#> 3 Adelie 2900
#> 4 Adelie 2900
#> 5 Adelie 2900
#> 6 Adelie 2925
#> 7 Adelie 2975
#> 8 Adelie 3000
#> 9 Adelie 3000
#> 10 Adelie 3050
#> # ℹ 142 more rows
Read it top to bottom: take penguins, select two columns, keep Adelie rows, sort by mass. Each line transforms the result of the previous one, and the reading order matches the execution order.
x |> f() is syntactic sugar for f(x). R literally rewrites it before the code runs. Chain two and x |> f() |> g() becomes g(f(x)), which is (g ∘ f)(x) written left-to-right instead of inside-out. That is function composition from Section 1.2, just in friendlier notation.
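You can watch the rewrite happen. Because the parser transforms the pipe before any code runs, quoting a piped expression reveals the ordinary call it becomes (base R 4.1+, no packages needed):

```r
# The rewrite happens at parse time, before evaluation:
# quoting a piped expression shows the call the parser produced.
quote(16 |> sqrt())
#> sqrt(16)

# Two pipes become a nested call: composition, written left-to-right.
quote(16 |> sqrt() |> log())
#> log(sqrt(16))
```

There is no pipe left in the abstract syntax tree; by the time evaluation starts, only the nested call exists.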
This is another monoid (Chapter 14): ∘ is the operation, the identity function \(x) x is the identity element, and chaining any number of verbs always gives you back a data frame. The same structure shows up again with Reduce() in Chapter 21.
The native pipe |> was introduced in R 4.1 (2021) and is built into the language, requiring no packages. The idea, however, is older than R itself: Doug McIlroy proposed the pipe concept for Unix in 1964, writing “we should have some ways of coupling programs like garden hose.” The | operator in the Unix shell chains programs, each reading from standard input and writing to standard output; R’s |> is the same idea applied to data frames, where each dplyr verb takes a data frame and returns one.
One verb per line. Put a space before |> and a newline after it. A pipeline crammed into a single long line throws away all the readability the pipe was designed to give you.
Exercises
- Rewrite the following nested call as a pipeline: head(sort(sqrt(1:20)), 5).
- Take penguins, filter for penguins heavier than 4000g, and select only species and island. Write it as a pipeline.
- What does 10 |> sqrt() return? What about 10 |> log(base = 2)? Explain why log receives two arguments.
15.3 Building a pipeline
The best way to build a pipeline is the way you would give directions: one step at a time, checking the view at each turn.
penguins |>
filter(!is.na(body_mass_g)) |>
group_by(species) |>
summarise(mean_mass = mean(body_mass_g)) |>
arrange(desc(mean_mass))
#> # A tibble: 3 × 2
#> species mean_mass
#> <fct> <dbl>
#> 1 Gentoo 5076.
#> 2 Chinstrap 3733.
#> 3 Adelie 3701.
“Take penguins, remove NAs, group by species, compute mean mass, sort descending.” The code reads like the sentence.
Build incrementally: run the first line, check the output, add the next step, check again. You grow a pipeline one verb at a time, inspecting the intermediate data frame at each stage, and when the result looks right the pipeline is done.
penguins |>
filter(!is.na(flipper_length_mm), !is.na(body_mass_g)) |>
group_by(species, island) |>
summarise(
n = n(),
mean_flipper = mean(flipper_length_mm),
mean_mass = mean(body_mass_g),
.groups = "drop"
)
#> # A tibble: 5 × 5
#> species island n mean_flipper mean_mass
#> <fct> <fct> <int> <dbl> <dbl>
#> 1 Adelie Biscoe 44 189. 3710.
#> 2 Adelie Dream 56 190. 3688.
#> 3 Adelie Torgersen 51 191. 3706.
#> 4 Chinstrap Dream 68 196. 3733.
#> 5 Gentoo Biscoe 123 217. 5076.
Each verb does one thing: filter removes rows, group_by sets the grouping, summarise collapses groups to summaries, arrange sorts. You compose them into a pipeline the same way you compose sentences into a paragraph, one idea at a time, in order. But what happens when you want to reuse a pipeline across different data sets, or embed one inside a larger computation?
Exercises
- Build a pipeline that counts how many penguins of each species live on each island. Start with penguins, group by species and island, then summarise with n = n().
- Extend your pipeline from Exercise 1: arrange the result by n in descending order.
- Write a pipeline that computes the median bill_length_mm per species (removing NAs), then filters to keep only species where the median exceeds 40mm.
15.4 Function composition
Lambda calculus (Section 1.2) is built on applying functions to arguments. Composition is the next step: apply one function, then apply another to the result.
x |> f() |> g() is g(f(x)). In mathematical notation, that is (g ∘ f)(x): the composition of g and f, applied to x. The pipe rewrites composition from inside-out to left-to-right. Nothing else changes.
This is why dplyr verbs all follow the same contract: take a data frame as the first argument, return a data frame. Functions that follow this contract compose via the pipe; functions that don’t (first argument is a formula, a filename, a model object) require workarounds. The convention exists precisely so that verbs snap together like sections of garden hose.
Functions are values you can pass around (Chapter 7), and they are also things you compose. Passing and composing are the two fundamental operations on functions, and together they account for most of what makes R code concise.
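To make "pipes are composition" concrete, here is a minimal base-R sketch. The names compose2 and log_sqrt are illustrative, not part of any package:

```r
# A minimal compose helper: apply f, then g to the result.
compose2 <- function(f, g) function(x) g(f(x))

# sqrt then log, packaged as a single reusable function
log_sqrt <- compose2(sqrt, log)

log_sqrt(16)
#> [1] 1.386294

# The composed function and the pipeline are the same computation.
identical(log_sqrt(16), 16 |> sqrt() |> log())
#> [1] TRUE
```

Because compose2 returns an ordinary function, its results can be passed around, stored in lists, or composed again, exactly like the dplyr wrapper below.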
clean_and_summarise <- function(df) {
df |>
filter(!is.na(body_mass_g)) |>
group_by(species) |>
summarise(mean_mass = mean(body_mass_g), .groups = "drop")
}
penguins |> clean_and_summarise()
#> # A tibble: 3 × 2
#> species mean_mass
#> <fct> <dbl>
#> 1 Adelie 3701.
#> 2 Chinstrap 3733.
#> 3 Gentoo 5076.
Wrapping a pipeline in a function gives the composition a name. clean_and_summarise is a single function built from four composed steps; you can pass it around, store it in a list, or use it inside another pipeline. The pipeline describes the transformation without naming intermediates:
penguins |> filter(species == "Adelie") |> nrow()
In Haskell, this style is called “point-free” or “tacit” programming, tracing back to combinatory logic (Schönfinkel, 1924): functions are composed without mentioning the data they act on. R’s pipe enables a mild version of this. It isn’t fully point-free (dplyr verbs still reference column names), but the same impulse is at work.
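The same move works with plain functions in base R. Name a small pipeline once and it slots into any higher-order code that expects a function (root_mean is an illustrative name, not a standard function):

```r
# Name a pipeline once, then pass it around like any other value.
root_mean <- function(x) x |> sqrt() |> mean()

root_mean(c(1, 4, 9))
#> [1] 2

# Because it is just a function, it composes with other tools:
lapply(list(a = c(1, 4), b = c(9, 16)), root_mean)
#> $a
#> [1] 1.5
#>
#> $b
#> [1] 3.5
```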
15.5 The placeholder _
The pipe inserts the left side as the first argument. But what happens when the data belongs somewhere else?
penguins |>
filter(!is.na(body_mass_g), !is.na(bill_length_mm)) |>
lm(body_mass_g ~ bill_length_mm, data = _)
#>
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm, data = filter(penguins,
#> !is.na(body_mass_g), !is.na(bill_length_mm)))
#>
#> Coefficients:
#> (Intercept) bill_length_mm
#> 362.31 87.42
The _ placeholder says “put the piped value here.” It must appear as a named argument (data = _), and it can appear only once. This syntax requires R 4.2+.
For anything more complex, use an anonymous function:
c(1, 4, 9, 16) |>
(\(x) x[x > 3])()
#> [1] 4 9 16
The \(x) syntax from Section 7.2 creates a function inline, and the trailing () calls it immediately with the piped value. This works for any situation where _ is too restrictive.
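A small base-R comparison of the two techniques, with no data frames involved. When _ is used, the piped value goes only to the named slot, not to the first argument:

```r
# `_` routes the piped value into a named argument (R 4.2+):
3 |> seq(1, 10, by = _)
#> [1]  1  4  7 10

# When `_` is too restrictive (here, the result is also scaled),
# an inline anonymous function takes over:
3 |> (\(k) seq(1, 10, by = k) * 10)()
#> [1]  10  40  70 100
```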
Exercises
- Pipe mtcars into lm() to fit mpg ~ wt, using the _ placeholder for the data argument.
- Rewrite the same call using an anonymous function instead of _.
- Why does penguins |> lm(body_mass_g ~ species, _) fail? (Hint: the placeholder must be a named argument.)
15.6 |> vs %>%
The magrittr pipe %>% came first (2014). The native pipe |> arrived in R 4.1 (2021). Both do the same basic thing: insert the left side as the first argument of the right side. The differences are practical.
|> requires parentheses on the right side: x |> sqrt(), not x |> sqrt. It uses _ as its placeholder (named argument only, once), while %>% uses . (anywhere, multiple times). The native pipe is faster because it is a syntax transformation: the parser rewrites x |> f() to f(x) before evaluation, whereas %>% is a function call carrying its own overhead. And |> needs no package, while %>% needs magrittr (loaded automatically by the tidyverse).
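You can observe the difference without evaluating anything. quote() captures the parsed expression, and neither magrittr nor any data needs to be loaded:

```r
# The native pipe disappears at parse time...
quote(x |> f())
#> f(x)

# ...while %>% survives as an ordinary infix function call,
# to be looked up and executed at run time.
quote(x %>% f())
#> x %>% f()
```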
The theoretical distinction runs deeper. The base pipe |> is pure composition: (g ∘ f)(x), with no implicit variable and no branching. The magrittr pipe %>% is reminiscent of what Haskell calls Kleisli composition, where you chain functions of the form a -> m b and m is some context (a list, a maybe, an I/O action). The analogy is loose — %>% carries no monadic structure, no return, no bind that satisfies the monad laws — but the shape rhymes: the . pronoun threads the result through, allowing you to refer to it multiple times, branch on it, or pass it to non-first arguments. This is why the following works with %>% but has no direct equivalent with |>:
x %>% { if (nrow(.) > 0) filter(., cond) else . }
The magrittr pipe echoes the shape of the monadic bind operator >>= more than it does simple composition — both chain computations while giving a name to the intermediate result — but the resemblance is structural, not formal. If this sounds abstract, think of it concretely: |> is “apply the next function,” while %>% is “apply the next function, and here is a name for what you are working with.” That extra naming power makes %>% more flexible but also more complex, which is precisely why the base pipe chose to omit it.
Either way, pipes make substitution visible. When you write x |> f() |> g() |> h(), you are watching data flow through a chain of function applications, each one receiving the previous result. The same computation written as h(g(f(x))) hides the substitution order inside nested parentheses — you have to read from the innermost call outward, which reverses the sequence. Lambda calculus is agnostic about direction; (h ∘ g ∘ f)(x) and x |> f() |> g() |> h() denote the same reduction. But human readers are not agnostic. Left-to-right pipelines follow the order in which things actually happen, and that alignment between notation and execution is why pipes took over R code within a few years of magrittr’s release. No new computational power was involved; the same reductions were always available. Pipes made them legible.
# These are equivalent
penguins |> nrow()
#> [1] 344
library(magrittr)
penguins %>% nrow()
#> [1] 344
Use |>. It lives in base R, it is faster, and it is the future of the language. The only reason to reach for %>% is legacy code or the rare case where you need . in multiple positions within a single call.
15.7 When not to pipe
Pipes are for linear sequences: A then B then C. Not every problem is linear.
Multiple inputs. If two data frames need to interact, pipe one and pass the other as an argument. A join fits naturally in a pipeline; a complex merge of three data frames, each requiring its own preparation steps, does not.
Multiple outputs. If you need intermediate results for different purposes later, name them. A pipeline produces one result, so if you need both the filtered data and the summary, use two separate pipelines from a shared intermediate.
adelie <- penguins |> filter(species == "Adelie")
# Two different analyses from the same intermediate
adelie |> summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 1 × 1
#> mean_mass
#> <dbl>
#> 1 3701.
adelie |> count(island)
#> # A tibble: 3 × 2
#> island n
#> <fct> <int>
#> 1 Biscoe 44
#> 2 Dream 56
#> 3 Torgersen 52
Long chains. Past eight or ten steps, break the pipeline and name the intermediate result at a meaningful boundary. A pipeline that scrolls off the screen tells a run-on story, and run-on stories lose their audience.
Side effects in the middle. print(), write.csv(), and plot() produce side effects; they belong at the end of a pipeline, not wedged between two transformations where they interrupt the data flow and make the pipeline harder to reason about.
Debugging. When a pipeline produces unexpected output, break it apart. Assign intermediates, inspect each one, find the step that goes wrong, fix it, and reassemble the pipeline.
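As a toy base-R sketch of that process (numbers rather than penguins; step1 through step3 are throwaway names you would delete once the bug is found), suppose c(-4, 1, 9) |> sqrt() |> sum() returns a surprising NaN:

```r
# Break the pipeline apart and inspect each intermediate.
step1 <- c(-4, 1, 9)
step2 <- sqrt(step1)   # warning here: NaNs produced -- the bad step
step2
#> [1] NaN   1   3

step3 <- sum(step2)    # the NaN introduced above propagates to the end
step3
#> [1] NaN
```

Inspecting step2 localizes the problem to sqrt() receiving a negative value; the fix (filter or abs(), depending on intent) goes there, and then the pipeline is reassembled.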
A pipe should tell a story. If it reads like a run-on sentence, break it into paragraphs and name the paragraph boundaries with meaningful variable names, not temp1 and temp2.
The pipe gives you composition in a readable notation, but composition is only the beginning. Once you can chain transformations, the next question is how to apply them to many things at once: to every column, every group, every element of a list. That is the territory of Section 19.3.
Exercises
- The following pipeline tries to do too much. Break it into two or three named steps with meaningful names:
penguins |>
filter(!is.na(body_mass_g)) |>
mutate(mass_kg = body_mass_g / 1000) |>
group_by(species) |>
summarise(mean_kg = mean(mass_kg)) |>
arrange(desc(mean_kg)) |>
mutate(label = paste(species, round(mean_kg, 1), "kg"))
- Why is putting write.csv() in the middle of a pipeline (between filter and summarise) a bad idea? What could go wrong?
- You need both a summary table (mean mass per species) and a filtered dataset (Adelie penguins only) from penguins. Write code that avoids repeating the filter(!is.na(body_mass_g)) step.