21 Reduce and accumulate
You have a list of data frames from an API and you need to merge them all into one. map() will not help: it treats every element in isolation and never looks at what came before. What you need is a way to take two things, combine them, feed the result into the next combination, and keep going until one thing remains. That operation is Reduce().
map() (Chapter 19) replaces loops that apply a function element by element. Reduce() replaces the other kind: loops that carry state from one pass to the next, like a running total or a chain of merges that folds data frames together one by one.
21.1 Left fold
Reduce() takes a binary function and a list, then combines the elements pairwise from left to right:
Reduce(`+`, 1:5)
#> [1] 15
That computes ((((1 + 2) + 3) + 4) + 5). The first call is f(1, 2), which gives 3; then f(3, 3), giving 6; then f(6, 4), giving 10; then f(10, 5), giving 15. Each intermediate result becomes the left argument of the next call, which is why this is called a left fold: the parentheses nest to the left.
The name comes from Haskell’s foldl. APL had the same operation in 1962, spelled +/1 2 3 4 5.
With a named function the folding structure becomes visible:
Reduce(paste, c("the", "cat", "sat", "on", "the", "mat"))
#> [1] "the cat sat on the mat"
Each step takes the accumulated string and pastes the next word onto it. The accumulator grows; the list shrinks. When nothing remains, the accumulator is the answer.
21.2 The init argument
What does Reduce(f, x) do when x has one element? It returns that element without ever calling f. Reasonable enough. What about zero elements?
Reduce(`+`, list())
#> NULL
Not an error, but not useful either: with nothing to combine and no identity to fall back on, Reduce() returns NULL. The init argument fixes this:
Reduce(`+`, list(), init = 0)
#> [1] 0
With init, the fold starts by calling f(init, x[[1]]) instead of f(x[[1]], x[[2]]), which means the fold always has something to work with, even when the list is empty. The value you pass as init should be the identity element for the operation: 0 for addition, 1 for multiplication, "" for paste0, NULL for c(). For intersect you would want the universal set, though R has no such thing, so you typically peel off the first element yourself.
This connection between an operation, its identity element, and associativity has a name in algebra: a monoid. Integers under addition with identity 0 form a monoid. Strings under concatenation with identity "" form a monoid. Lists under c() with identity NULL form a monoid. The pattern shows up everywhere, and recognizing it tells you something immediately practical: if your operation is associative, the fold can be split across chunks, parallelized, and recombined without changing the result. Google’s MapReduce (Dean and Ghemawat, 2004) is literally map() followed by a monoid fold, scaled to warehouses full of machines.
Reduce() is the operation that collapses a monoid. Given a list and an associative binary function, it produces a single value. That description sounds specific to lists, but the same structure appears whenever you compose things sequentially: piping functions is folding over a list of functions with composition as the binary operation and the identity function as init; building a ggplot layer by layer is folding over a list of layers with + as the operation and ggplot() as init. Tallying, concatenating, composing, intersecting: all of them are folds over a monoid.
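The function-composition fold described above can be sketched directly. compose_all() is a hypothetical helper invented for illustration, not a function from the text:

```r
# Fold a list of functions into one pipeline: the identity function is the
# init, and "apply f, then g" is the binary step. (Illustrative sketch.)
compose_all <- function(fns) {
  Reduce(\(f, g) \(x) g(f(x)), fns, init = identity)
}
pipeline <- compose_all(list(\(x) x + 1, \(x) x * 2, \(x) x - 3))
pipeline(10)
#> [1] 19
```

With an empty list of functions, compose_all() returns identity unchanged, which is exactly the monoid identity the section describes.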
Always pass init to Reduce(). Without it, your code breaks on empty input and behaves subtly differently on length-1 input. The identity element makes the fold total: it works for every list, including the empty one, and it still works six months from now when some upstream function returns zero results.
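A small demonstration of the length-1 difference. The message() call is only there to make the calls visible; this is a sketch, not from the text:

```r
# Without init, a length-1 list short-circuits: f is never called.
# With init, f runs once as f(init, x[[1]]).
f <- function(a, b) { message("f called"); a + b }
Reduce(f, list(5))            # returns 5 silently; f never runs
Reduce(f, list(5), init = 0)  # prints "f called" once; returns 5
```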
21.3 Right fold
Reduce() folds from the left by default. Setting right = TRUE reverses the nesting:
Reduce(`-`, 1:4)
#> [1] -8
Reduce(`-`, 1:4, right = TRUE)
#> [1] -2
The left fold computes (((1 - 2) - 3) - 4) = -8. The right fold computes (1 - (2 - (3 - 4))) = -2. For associative operations (addition, multiplication, paste0, c(), intersect, union) the direction does not matter; you get the same answer either way, and left fold is the natural default. The difference only surfaces for non-associative operations: subtraction, division, exponentiation. These give different results depending on which way you nest.
In Haskell, foldl and foldr are separate functions, and choosing between them is a design decision with performance implications (lazy evaluation makes right folds cheaper in some cases). In R, right = TRUE is rarely needed. If you find yourself reaching for a right fold, ask first whether your binary function simply has its arguments in the wrong order.
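One way to see the relationship between the two directions: a right fold is equivalent to a left fold over the reversed list with the arguments flipped (a sketch, not a claim from the text):

```r
Reduce(`-`, 1:4, right = TRUE)       # 1 - (2 - (3 - 4))
#> [1] -2
Reduce(\(acc, x) x - acc, rev(1:4))  # same nesting, built from the left
#> [1] -2
```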
Exercises
- Use Reduce() to compute the product of the numbers 1 through 6 (i.e., 6 factorial). Check your answer against factorial(6).
- What is Reduce(intersect, list(c(1,2,3,4), c(2,3,4,5), c(3,4,5,6)))? Work through the steps by hand, then verify.
- Write a Reduce() call that merges three data frames by a shared "id" column. (Hint: merge is a binary function.)
- Explain why Reduce(`/`, c(120, 2, 3, 4, 5)) gives 1 with a left fold and something different with a right fold.
21.4 accumulate: keeping the intermediates
Sometimes the final result is not enough. You want every intermediate step: the running total after each deposit, the growing sentence after each word, the high-water mark after each measurement. Base R handles this by adding a single argument:
Reduce(`+`, 1:5, accumulate = TRUE)
#> [1] 1 3 6 10 15
Running totals: 1, 3, 6, 10, 15. This is the scan operation (Haskell’s scanl, APL’s \). Where fold collapses a list to a single value, scan collapses it to a list of all the values the fold passed through on its way to the answer. Running maxima, cumulative sums, progressive string construction: they all fall out of the same pattern.
Reduce(max, c(3, 1, 4, 1, 5, 9, 2, 6), accumulate = TRUE)
#> [1] 3 3 4 4 5 9 9 9
A running maximum, where each value is the largest seen so far. The 1s and the 2 cannot drag the maximum back down; it can only rise or hold steady.
Reduce(paste, c("one", "two", "three", "four"), accumulate = TRUE)
#> [1] "one" "one two" "one two three"
#> [4] "one two three four"
A sentence assembling itself one word at a time.
21.5 purrr::reduce() and purrr::accumulate()
The purrr package provides reduce() and accumulate() as tidyverse-flavored alternatives:
reduce(1:5, `+`)
#> [1] 15
accumulate(1:5, `+`)
#> [1] 1 3 6 10 15
Same results. The differences from base Reduce() are syntactic: .x and .f instead of positional arguments, .init instead of init, and .dir = "backward" instead of right = TRUE. The purrr versions also accept lambda-style formulas, though anonymous functions with \() are clearer.
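For comparison, the two lambda styles side by side (both assume purrr is attached):

```r
reduce(1:5, ~ .x + .y)          # formula-style lambda
#> [1] 15
reduce(1:5, \(acc, x) acc + x)  # base anonymous function, clearer
#> [1] 15
```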
Where purrr adds genuine power is reduce2(), which folds over two lists in parallel, handing an extra argument from the second list at each step:
reduce2(
c("a", "b", "c"),
c("-", "+"),
\(acc, val, sep) paste0(acc, sep, val)
)
#> [1] "a-b+c"
The binary function receives three arguments: the accumulator, the current element from the first list, and the corresponding element from the second list. The second list must be one element shorter than the first (or the same length if .init is provided), because the first element of the first list seeds the accumulator.
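A sketch of the .init variant, extending the example above: with .init seeding the accumulator, the two inputs have equal length. The seed value "x" is invented for illustration:

```r
reduce2(
  c("a", "b", "c"),
  c(".", "-", "+"),
  \(acc, val, sep) paste0(acc, sep, val),
  .init = "x"
)
#> [1] "x.a-b+c"
```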
21.6 Practical patterns
Combining data frames. When split() or repeated API calls hand you a list of data frames, Reduce(rbind, dfs) stacks them. For large lists, do.call(rbind, dfs) is faster because it avoids the O(n²) copying that comes from binding two at a time and growing the result frame with every step.
dfs <- list(
data.frame(x = 1:2, y = c("a", "b")),
data.frame(x = 3:4, y = c("c", "d")),
data.frame(x = 5:6, y = c("e", "f"))
)
Reduce(rbind, dfs)
#> x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
Nested list extraction. Walking into a deeply nested structure one key at a time:
nested <- list(a = list(b = list(c = 42)))
Reduce(\(x, i) x[[i]], c("a", "b", "c"), init = nested)
#> [1] 42
Each step indexes one level deeper. The fold replaces a chain of [[ calls. (purrr’s pluck() does the same thing with better error messages.)
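For comparison, the pluck() equivalent of the same walk (assuming purrr is installed):

```r
nested <- list(a = list(b = list(c = 42)))
purrr::pluck(nested, "a", "b", "c")
#> [1] 42
```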
Set operations across many sets:
sets <- list(c(1,2,3,4), c(2,3,4,5), c(3,4,5,6))
Reduce(intersect, sets)
#> [1] 3 4
Reduce(union, sets)
#> [1] 1 2 3 4 5 6
State machines with accumulate. An accumulate() call can model any sequential process where each step depends on the previous state:
# Simple bank account: deposits and withdrawals
transactions <- c(100, -20, -30, 50, -10)
accumulate(transactions, `+`)
#> [1] 100 80 50 100 90
The running balance after each transaction. For more complex state, where the accumulator needs to carry multiple fields rather than a single number, use a list as the accumulator and a function that unpacks, updates, and repacks it.
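A sketch of that list-as-accumulator pattern (the step function and the low-balance threshold of 60 are invented for illustration, and purrr is assumed to be attached):

```r
transactions <- c(100, -20, -30, 50, -10)

# State carries two fields: the balance, and a count of dips below 60.
step <- function(state, txn) {
  bal <- state$balance + txn
  list(balance = bal, dips = state$dips + (bal < 60))
}
states <- accumulate(transactions, step, .init = list(balance = 0, dips = 0))
tail(states, 1)[[1]]
#> $balance
#> [1] 90
#>
#> $dips
#> [1] 1
```

Each intermediate state is preserved, so you can inspect the full history, not just the final balance.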
Exercises
- Given words <- c("R", "is", "a", "functional", "language"), use accumulate() to produce the vector c("R", "R is", "R is a", "R is a functional", "R is a functional language").
- Write a Reduce() call that finds the intersection of the columns present in a list of data frames. (Hint: use names() and intersect().)
- Use accumulate() to compute a running product of c(2, 3, 5, 7). Verify that the last element equals prod(c(2, 3, 5, 7)).
21.7 Row-wise folds with c_across()
Reduce() folds across the elements of a list. In Section 19.8, you saw across() apply a function across the columns of a data frame. But what about folding across columns within a single row? That is a row-wise reduction, and dplyr::c_across() handles it.
The setup is rowwise(), which tells dplyr to treat each row as a group of one:
penguins <- palmerpenguins::penguins
penguins |>
rowwise() |>
mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
ungroup() |>
head(3)
#> # A tibble: 3 × 9
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgers… 39.1 18.7 181 3750
#> 2 Adelie Torgers… 39.5 17.4 186 3800
#> 3 Adelie Torgers… 40.3 18 195 3250
#> # ℹ 3 more variables: sex <fct>, year <int>, bill_ratio <dbl>
c_across() gathers the specified columns into a vector for each row, so you can pass that vector to any summary function:
df <- tibble(a = 1:3, b = 4:6, c = 7:9)
df |>
rowwise() |>
mutate(total = sum(c_across(a:c))) |>
ungroup()
#> # A tibble: 3 × 4
#> a b c total
#> <int> <int> <int> <int>
#> 1 1 4 7 12
#> 2 2 5 8 15
#> 3 3 6 9 18
Each row’s a, b, and c values are collected into a vector, summed, and stored in total. Same fold as Reduce(), same monoid structure (Section 21.2), just across columns instead of down a list.
rowwise() + c_across() processes one row at a time in R, so it scales poorly. On 100,000 rows, rowSums() finishes in milliseconds because it is vectorized in C; the rowwise() version takes seconds. Use c_across() when the operation is genuinely complex or when you need tidyselect helpers to pick columns. Otherwise use rowSums() or pmax().
Exercises
- Using rowwise() and c_across(), compute the maximum of columns a, b, and c for each row of df <- tibble(a = c(3,1,4), b = c(1,5,9), c = c(2,6,5)).
- Rewrite the c_across() sum example using rowSums() instead. Which is clearer? Which is faster? (Try bench::mark() on a larger data frame.)
- Use rowwise() and c_across(where(is.numeric)) to compute the range (max minus min) across the numeric columns for each row of penguins. Why would this be hard to do without c_across()?
21.8 When to fold and when not to
Reduce() is the right tool when you have a binary operation and a list of things to combine: merging data frames, chaining set operations, collapsing nested structures. It is the wrong tool when a vectorized function already exists. sum(x) is faster and clearer than Reduce(`+`, x). paste(x, collapse = " ") beats Reduce(paste, x). The base R cumulative functions (cumsum(), cumprod(), cummax(), cummin()) are compiled C and will always outperform accumulate(x, `+`).
The rule is simple: if a specialized function exists, use it. Reach for Reduce() when your binary operation has no built-in vectorized cousin.
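A quick illustration of the rule, comparing a specialized function with the fold it replaces:

```r
x <- c(2, 3, 4)
cumsum(x)                          # vectorized C: use this
#> [1] 2 5 9
Reduce(`+`, x, accumulate = TRUE)  # same answer, slower
#> [1] 2 5 9
```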
Reduce() also becomes awkward when the accumulator needs complex state. If your fold carries a list with multiple fields and each step updates several of them conditionally, you are no longer folding; you are writing a state machine with the fold syntax as a straitjacket. That is better expressed as an explicit loop or a recursive function (Chapter 22), where the state updates are visible and the control flow is honest.
Graham Hutton proved in 1999 that fold can express any function on finite lists. This is a theoretical universality result, not practical advice. Just because you can write something as a fold does not mean it will be readable, or maintainable, or that the person who inherits your code (including future you) will thank you for it.
Fold is the most underused functional in R. Most programmers who discover map() never learn Reduce(), and they end up writing loops for exactly the problems fold solves in a single line. If you take one thing from this chapter: next time you write a loop that carries an accumulator variable, growing it with each pass, stop and try Reduce() first. You might not need that loop at all.