19  Iteration without loops

You have a list of three data frames and you want to compute the mean of a column in each one. Six lines of for loop, or one line of map(). The map() version is shorter, but the real difference is structural.

Functions are values (Chapter 7) and you can pass them to other functions (Section 7.1). A functional builds on that idea: it is a higher-order function that handles the iteration for you. You give it a function and a collection; it applies the function to each element, manages the indexing and pre-allocation, and returns the results.
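To make "handles the iteration for you" concrete, here is a minimal sketch of what a functional does on the inside. The name my_map is hypothetical, invented for this illustration; it is not how lapply() or map() are implemented, only the same idea written out in plain R.

```r
# my_map() is a hypothetical toy functional, written only to show the idea.
my_map <- function(x, f, ...) {
  out <- vector("list", length(x))      # pre-allocate the container
  for (i in seq_along(x)) {
    out[i] <- list(f(x[[i]], ...))      # apply f; list() keeps NULL results in place
  }
  names(out) <- names(x)                # carry the names over
  out
}

totals <- my_map(list(a = 1:3, b = 4:6), sum)
str(totals)
```

The loop has not gone anywhere. It has moved inside my_map(), where it is written once instead of at every call site.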

This chapter uses three packages. Load them now if you are following along:

library(palmerpenguins)
library(purrr)
library(dplyr)

We will work with the penguins data split by species throughout:

penguins_split <- split(penguins, penguins$species)
names(penguins_split)
#> [1] "Adelie"    "Chinstrap" "Gentoo"

Three data frames, one per species. Almost everything in this chapter amounts to doing something to each of them, and the question is always the same: how much of the work should you spell out yourself, and how much should the language handle?

19.1 The problem with for loops

Suppose you want the mean body mass for each species. Here is the loop version:

results <- vector("list", length(penguins_split))
for (i in seq_along(penguins_split)) {
  results[[i]] <- mean(penguins_split[[i]]$body_mass_g, na.rm = TRUE)
}
names(results) <- names(penguins_split)
results
#> $Adelie
#> [1] 3700.662
#> 
#> $Chinstrap
#> [1] 3733.088
#> 
#> $Gentoo
#> [1] 5076.016

Six lines to say “compute the mean mass for each group.” Look at how much of that is bookkeeping: pre-allocate a container, set up an index, store results at the right position, copy names over at the end. The actual computation, mean(df$body_mass_g, na.rm = TRUE), sits buried in the middle, surrounded by scaffolding that has nothing to do with the question you asked.

A functional says the same thing in one line:

map_dbl(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#>    Adelie Chinstrap    Gentoo 
#>  3700.662  3733.088  5076.016

Same answer. No index variable, no pre-allocation, no name copying. You said what to compute; map_dbl() handled the iteration.

The problem with for loops is not performance; in modern R, a well-written loop is fast. The problem is that loops mix the mechanism of iteration (allocate, index, store) with the content of the computation (compute a mean). Functionals separate the two. The mechanism disappears inside the functional; the content lives in the function you pass to it. And once those two concerns are separated, each becomes easier to read, test, and reuse.
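One way to see the separation in practice is to give the content its own name. The sketch below uses a small made-up list, groups, in place of penguins_split, and a hypothetical helper, mean_mass(), so the numbers can be checked by hand.

```r
library(purrr)

# Hypothetical toy data standing in for penguins_split
groups <- list(
  small = data.frame(body_mass_g = c(10, 20, NA)),
  large = data.frame(body_mass_g = c(100, 200, 300))
)

# The content: one function, testable on a single data frame
mean_mass <- function(df) mean(df$body_mass_g, na.rm = TRUE)
mean_mass(groups$small)

# The mechanism: any functional can reuse the same function
map_dbl(groups, mean_mass)
vapply(groups, mean_mass, numeric(1))
```

Because mean_mass() knows nothing about iteration, you can check it on one element before trusting it on all of them.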

Tip: Opinion

For loops are not bad R. They are verbose R. If you find yourself writing the same loop scaffolding three or four times in a project, a functional is waiting to collapse all that boilerplate into a single expressive line.

Exercises

  1. Write a for loop that computes the number of rows in each element of penguins_split. Then rewrite it using map_int() and nrow.
  2. What happens if you forget to pre-allocate results and instead grow it with results <- c(results, value) inside a loop? Why is this slower?

19.2 lapply() and sapply(): base R functionals

The idea of handing a function to something that applies it to every element in a collection is older than R, older than S, older than most programming languages people still use. John McCarthy’s 1960 paper described mapcar, a Lisp function that takes a function and a list and returns a new list with the function applied to each element. The name map became standard in later languages (Scheme, ML, Haskell), and R inherited the concept twice: once through lapply() in base R, and again through purrr::map() in the tidyverse. The mechanism is always the same. You separate what to do from how to iterate, and let the language handle the second part. lapply() is R’s oldest version of that idea:

lapply(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#> $Adelie
#> [1] 3700.662
#> 
#> $Chinstrap
#> [1] 3733.088
#> 
#> $Gentoo
#> [1] 5076.016

lapply(x, f) applies f to each element of x and returns a list. Always a list. That predictability is its strength.

sapply() does the same thing, but tries to “simplify” the result:

sapply(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#>    Adelie Chinstrap    Gentoo 
#>  3700.662  3733.088  5076.016

When every call to f returns a single value, sapply() hands you a named vector. But if f returns different lengths for different elements, sapply() falls back to a list, silently. You cannot predict the return type without knowing the output of f for every element of x. Fine at the console; unreliable inside a function that other code depends on.
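A small sketch makes the instability visible. The helper pick_big() is hypothetical; it is chosen because the length of its result depends on the data, which is exactly the situation where sapply() changes its return type.

```r
# pick_big() is a hypothetical helper whose result length depends on its input.
pick_big <- function(x) x[x > 2]

one_each <- sapply(list(1:3, 2:3), pick_big)  # every result has length 1
two_each <- sapply(list(3:4, 5:6), pick_big)  # every result has length 2
mixed    <- sapply(list(1:3, 4:6), pick_big)  # results have lengths 1 and 3

class(one_each)  # an integer vector
class(two_each)  # a matrix
class(mixed)     # a list
```

Same function, same call, three different kinds of result, decided entirely by the input values.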

vapply() solves this by making you declare the expected type:

vapply(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE), numeric(1))
#>    Adelie Chinstrap    Gentoo 
#>  3700.662  3733.088  5076.016

The third argument, numeric(1), says “I expect each call to return a single numeric value.” If any call returns something else, you get an error instead of a silent type change.
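To see the failure mode without stopping a script, you can catch the error. This is a minimal sketch on a made-up list x; range() returns two values per element, so it violates the numeric(1) contract.

```r
x <- list(a = c(1, 5), b = c(2, 8))   # hypothetical toy input

# max() returns one number per element: the contract holds
ok <- vapply(x, max, numeric(1))
ok

# range() returns two numbers per element: vapply() refuses
msg <- tryCatch(
  vapply(x, range, numeric(1)),
  error = function(e) conditionMessage(e)
)
msg
```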

Two other base R functionals are worth knowing. tapply(x, group, f) applies f to subsets of x defined by group:

tapply(penguins$body_mass_g, penguins$species, mean, na.rm = TRUE)
#>    Adelie Chinstrap    Gentoo 
#>  3700.662  3733.088  5076.016

And mapply(f, x, y) iterates over multiple inputs in parallel:

mapply(paste, c("one", "two", "three"), c("fish", "cat", "bird"))
#>          one          two        three 
#>   "one fish"    "two cat" "three bird"

Tip: Opinion

Use lapply() when you want a list. Use vapply() when you want a vector and want safety. Never use sapply() in non-interactive code: its return type depends on the data, and that is a bug hiding in your pipeline, waiting for the one input that changes everything.

Exercises

  1. Use lapply() and nrow to get the number of rows in each element of penguins_split.
  2. Use vapply() to compute the median flipper length for each species. Declare the expected output type.
  3. What does sapply(list(), mean) return? What about vapply(list(), mean, numeric(1))? Which is safer?

19.3 purrr::map(): the tidyverse functional

The purrr package provides map(), which does the same thing as lapply():

map(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#> $Adelie
#> [1] 3700.662
#> 
#> $Chinstrap
#> [1] 3733.088
#> 
#> $Gentoo
#> [1] 5076.016

Apply a function to each element, return a list. So why bother with a new package?

Three reasons. First, typed variants: map_dbl() returns a double vector, map_chr() returns a character vector, and if the function returns the wrong type, you get an error. No more guessing, no more silent simplification.

map_dbl(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#>    Adelie Chinstrap    Gentoo 
#>  3700.662  3733.088  5076.016

Second, consistency with the pipe. map() and its variants are designed to sit naturally in a pipeline:

penguins_split |>
  map_dbl(\(df) mean(df$body_mass_g, na.rm = TRUE))
#>    Adelie Chinstrap    Gentoo 
#>  3700.662  3733.088  5076.016

Third, a compact lambda syntax, although this reason has aged. purrr's formula shorthand, ~ mean(.x$body_mass_g, na.rm = TRUE), gave you concise anonymous functions before R 4.1 introduced \(x). You will still see it in existing codebases. Prefer \(x) in new code.
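If you need to read older code, it helps to see the two spellings side by side. The sketch below uses a made-up list x; inside the formula, .x stands for the current element.

```r
library(purrr)

x <- list(a = c(1, 2, NA), b = c(10, 20))   # hypothetical toy input

old_style <- map_dbl(x, ~ mean(.x, na.rm = TRUE))     # formula shorthand
new_style <- map_dbl(x, \(v) mean(v, na.rm = TRUE))   # R 4.1+ lambda

identical(old_style, new_style)
new_style
```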

map() returns a list. Always. map_dbl() returns a double vector. Always. This type stability is the core advantage over sapply(), and it matters more than it sounds: in a long pipeline, knowing the exact shape of every intermediate result is the difference between debugging for five seconds and debugging for an hour.

Note: Functors and the shape of iteration

map() embodies a pattern from category theory: the functor. map(xs, f) applies f to each element of xs, preserving the container structure (a list of three elements in, a list of three elements out). A functor lifts a function between values (f :: A -> B) to a function between containers (map(f) :: List(A) -> List(B)). You do not need the theory to use map(), but recognizing the pattern explains why map() feels the same whether you apply it to a list, a vector, or a set of data frame columns with across(): it is the same structural idea in different containers.
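The two functor laws can be checked directly on a small list. This is an informal sketch on made-up values, not a proof; f and g are arbitrary example functions.

```r
library(purrr)

xs <- list(a = 1, b = 2, c = 3)   # hypothetical toy input
f <- \(x) x + 1
g <- \(x) x * 10

# Identity: mapping a do-nothing function changes nothing
identity_holds <- identical(map(xs, identity), xs)

# Composition: mapping f then g equals mapping "g after f" once
composition_holds <- identical(
  map(map(xs, f), g),
  map(xs, \(x) g(f(x)))
)

identity_holds
composition_holds
```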

A for loop iterates by maintaining a counter variable: i goes from 1 to n, and the body runs once per increment. You declare the counter, increment it, read it. With map(), the counter disappears. The list has three elements, so the function runs three times; the structure itself determines how many applications occur. You stop managing the iteration and let the data’s shape manage it for you.

The typed variants go one step further: they fix not only the shape of the container but also the type and length of every element in it.

Exercises

  1. Use map_dbl() to compute the standard deviation of body_mass_g for each species in penguins_split.
  2. Use map_chr() to extract the first value of species from each element of penguins_split. (Hint: \(df) as.character(df$species[1]).)
  3. Rewrite the for loop from Section 19.1 using map_dbl(). Compare the two versions for readability.

19.4 Typed variants

The typed variants of map() enforce both the type and the length of each result:

  • map_dbl(.x, .f): each call to .f must return a single double. Result is a numeric vector.
  • map_chr(.x, .f): each call must return a single string. Result is a character vector.
  • map_lgl(.x, .f): each call must return a single logical. Result is a logical vector.
  • map_int(.x, .f): each call must return a single integer. Result is an integer vector.

map_int(penguins_split, nrow)
#>    Adelie Chinstrap    Gentoo 
#>       152        68       124

map_lgl(penguins_split, \(df) any(is.na(df$body_mass_g)))
#>    Adelie Chinstrap    Gentoo 
#>      TRUE     FALSE      TRUE

If .f returns the wrong type or a vector of length other than one, you get an error:

map_dbl(penguins_split, \(df) df$body_mass_g)
#> Error in `map_dbl()`:
#> ℹ In index: 1.
#> ℹ With name: Adelie.
#> Caused by error:
#> ! Result must be length 1, not 152.

When the output does not match your expectation, you find out immediately, not three functions downstream when something produces NULL instead of a number.
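The same protection applies to the type of each result, not only its length. Here is a small sketch with a made-up list, readings, in which one element is a string; tryCatch() is used so the error can be inspected instead of halting the script.

```r
library(purrr)

readings <- list(ok = 3.2, bad = "n/a")   # hypothetical input with one bad value

res <- tryCatch(
  map_dbl(readings, identity),
  error = function(e) e
)

inherits(res, "error")   # TRUE: the string was not quietly converted
```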

For results that are data frames, the older map_dfr() is superseded as of purrr 1.0.0. The current pattern is map() followed by list_rbind():

penguins_split |>
  map(\(df) tibble(
    species = df$species[1],
    mean_mass = mean(df$body_mass_g, na.rm = TRUE),
    n = nrow(df)
  )) |>
  list_rbind()
#> # A tibble: 3 × 3
#>   species   mean_mass     n
#>   <fct>         <dbl> <int>
#> 1 Adelie        3701.   152
#> 2 Chinstrap     3733.    68
#> 3 Gentoo        5076.   124

map() returns a list of data frames, and list_rbind() stacks them into one. But what about iterating over more than one thing at a time?
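One detail of list_rbind() is worth a quick sketch: by default it discards the names of the list, and its names_to argument stores them in a column instead. The example assumes purrr 1.0.0 or later and uses two made-up one-row data frames.

```r
library(purrr)

pieces <- list(
  a = data.frame(n = 2L, mean_x = 1.5),
  b = data.frame(n = 3L, mean_x = 4.0)
)

stacked  <- list_rbind(pieces)                      # list names are dropped
labelled <- list_rbind(pieces, names_to = "group")  # list names become a column

names(stacked)
labelled$group
```

This is an alternative to carrying the label inside each data frame, as the species column does in the example above.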

Exercises

  1. Use map_int() and nrow to count the rows in each element of penguins_split.
  2. Use map_lgl() to check which elements of penguins_split have more than 100 rows.
  3. Use map() and list_rbind() to build a summary data frame with one row per species, containing the species name, median bill length, and median bill depth.

19.5 Iterating over multiple inputs

Sometimes a single list is not enough. You need to walk over two inputs in lockstep, pulling one element from each on every step. map2() handles this:

species_names <- names(penguins_split)
species_counts <- map_int(penguins_split, nrow)

map2_chr(species_names, species_counts, \(name, n) paste(name, ":", n, "penguins"))
#> [1] "Adelie : 152 penguins"   "Chinstrap : 68 penguins"
#> [3] "Gentoo : 124 penguins"

map2(.x, .y, .f) calls .f(.x[[1]], .y[[1]]), then .f(.x[[2]], .y[[2]]), and so on. The two inputs are consumed in parallel, not in a nested grid.
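The difference between "in parallel" and "a nested grid" is easy to check on small vectors. In this sketch x and y are made-up inputs, and outer() is shown only as a contrast.

```r
library(purrr)

x <- c(1, 2, 3)
y <- c(4, 5, 6)

paired <- map2_dbl(x, y, \(a, b) a * b)   # three pairs, three results
grid   <- outer(x, y)                     # every combination: a 3 x 3 matrix

paired
dim(grid)

# A length-one input is recycled to match the other
recycled <- map2_dbl(x, 10, \(a, b) a + b)
recycled
```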

For three or more inputs, use pmap(). You supply a list of vectors (or a data frame, since a data frame is a list of columns):

params <- list(
  mean = c(0, 5, -3),
  sd = c(1, 2, 0.5),
  n = c(3, 3, 3)
)

set.seed(42)
pmap(params, \(mean, sd, n) rnorm(n, mean, sd))
#> [[1]]
#> [1]  1.3709584 -0.5646982  0.3631284
#> 
#> [[2]]
#> [1] 6.265725 5.808537 4.787751
#> 
#> [[3]]
#> [1] -2.244239 -3.047330 -1.990788

The i-th element of each vector forms one call: the first call is rnorm(3, 0, 1), the second is rnorm(3, 5, 2), and so on. pmap() is the general case; map2() is the special case for exactly two inputs.
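When the list is named, pmap() matches its elements to the function's arguments by name, so their order in the list does not matter. An unnamed list is matched by position. A minimal sketch with made-up inputs:

```r
library(purrr)

# Named list: b is listed first, but still reaches the argument called b
by_name <- pmap_dbl(list(b = c(1, 2), a = c(10, 20)), \(a, b) a - b)

# Unnamed list: the first vector goes to the first argument
by_position <- pmap_dbl(list(c(1, 2), c(10, 20)), \(a, b) a - b)

by_name
by_position
```

Naming the list is the safer habit: reordering columns in a parameter table then cannot silently swap arguments.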

The base R equivalents are Map(f, x, y) (which returns a list, like map2()) and mapply(f, x, y) (which simplifies, like sapply()). The same warning applies: the return type of mapply() depends on the data.
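A short base R sketch shows both behaviours. The helper add() and the inputs are made up for illustration.

```r
add <- function(x, y) x + y   # hypothetical helper

as_list    <- Map(add, 1:3, 4:6)      # always a list
simplified <- mapply(add, 1:3, 4:6)   # simplified to a vector here

# When results differ in length, mapply() gives up on simplifying
uneven <- mapply(function(x, n) rep(x, n), 1:3, 3:1)

class(as_list)
class(simplified)
class(uneven)
```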

Where pmap() really shines is in configuration-driven workflows, places where you want to fit different models to different subsets with different formulas, all controlled by a single table of parameters. Suppose you want to regress body mass on flipper length for each species, but with a different formula for each:

configs <- tibble(
  data = penguins_split,
  formula = list(
    body_mass_g ~ flipper_length_mm,
    body_mass_g ~ flipper_length_mm + bill_length_mm,
    body_mass_g ~ flipper_length_mm * bill_length_mm
  )
)

models <- pmap(configs, \(data, formula) lm(formula, data = data))
map_dbl(models, \(m) summary(m)$r.squared)
#>    Adelie Chinstrap    Gentoo 
#> 0.2192128 0.4688941 0.5923485

pmap() iterates over the rows of the configuration table, fitting one model per species. The entire experiment lives in one tibble, and adding a fourth species or a fourth formula means adding a row, not rewriting the loop.

Exercises

  1. Use map2_chr() to paste together the vectors c("one", "two", "three") and c("fish", "cat", "bird") with a space in between.
  2. Create a list of three means and three standard deviations. Use map2_dbl() to generate one random normal value for each pair. (Hint: \(m, s) rnorm(1, m, s).)
  3. A data frame is a list of columns. What does pmap_dbl(data.frame(x = 1:3, y = 4:6), \(x, y) x + y) return?

19.6 Side effects: walk()

Everything so far has been about functions that compute something: they take input and produce output. But some functions exist for what they do, not what they return: printing a message, writing a file, drawing a plot. For these, use walk().

walk(names(penguins_split), \(name) cat(name, "\n"))
#> Adelie 
#> Chinstrap 
#> Gentoo

walk(.x, .f) applies .f to each element of .x, but instead of collecting results into a list, it returns .x invisibly. The point is the side effect, not the return value.

The two-input version, walk2(), is especially useful for writing files:

dir.create("data", showWarnings = FALSE)  # write.csv() fails if the folder is missing
file_paths <- paste0("data/", names(penguins_split), ".csv")
walk2(penguins_split, file_paths, \(df, path) write.csv(df, path, row.names = FALSE))

Each data frame gets written to its own file. walk2() iterates over the data frames and paths in parallel, calling write.csv() for each pair.

Use walk() whenever .f does something rather than computes something. If you use map() for side effects, you end up with a list of NULL values cluttering your console, which is the language telling you that you reached for the wrong tool.
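The contrast is easy to verify by capturing what each function returns. This sketch uses a made-up character vector, labels.

```r
library(purrr)

labels <- c("Adelie", "Gentoo")   # hypothetical input

from_walk <- walk(labels, \(s) cat(s, "\n"))   # prints, returns its input
from_map  <- map(labels, \(s) cat(s, "\n"))    # prints, collects cat()'s NULLs

identical(from_walk, labels)
identical(from_map, list(NULL, NULL))
```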

Because walk() returns its input invisibly, it slots cleanly into the middle of a pipeline:

penguins_split |>
  walk(\(df) cat("Species:", as.character(df$species[1]), "- n =", nrow(df), "\n")) |>
  map_dbl(\(df) mean(df$body_mass_g, na.rm = TRUE))
#> Species: Adelie - n = 152 
#> Species: Chinstrap - n = 68 
#> Species: Gentoo - n = 124
#>    Adelie Chinstrap    Gentoo 
#>  3700.662  3733.088  5076.016

The walk() prints a line for each species, then passes penguins_split through unchanged to map_dbl(). Side effects and computation in one pipeline, each handled by the right tool. But there is a category of iteration that neither map() nor walk() handles cleanly.

Exercises

  1. Use walk() to print the column names of each element of penguins_split. (They should all be the same.)
  2. Use walk2() and cat() to print lines like "Adelie has 152 rows" for each species.

19.7 When loops are fine

Not everything should be a functional. Loops earn their place in three situations.

When iterations depend on each other. If the result of iteration i feeds into iteration i + 1, map() cannot help you, because map-style functionals assume that each element is processed independently. Simulation chains, iterative algorithms, and accumulating state all need an explicit loop or a functional built for sequential work. (purrr’s reduce() and accumulate() handle some of these sequential cases, and you will meet them in Chapter 21. In the language of category theory, a fold that tears down a list one element at a time is a catamorphism: any loop that starts with an accumulator, walks a list, and returns the accumulator is a catamorphism in disguise.)

# Random walk: each step depends on the previous
x <- numeric(20)
x[1] <- 0
for (i in 2:20) {
  x[i] <- x[i - 1] + rnorm(1)
}
x
#>  [1]  0.0000000 -0.0627141  1.2421556  3.5288009  2.1399402  1.8611515
#>  [7]  1.7278301  2.3637805  2.0795276 -0.5769278 -3.0173947 -1.6972814
#> [13] -2.0039200 -3.7852284 -3.9571458 -2.7424711 -0.8472776 -1.2777467
#> [19] -1.5350161 -3.2981792
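A running sum like this one is among the sequential cases that accumulate() can express. The sketch below is only a preview under simplifying assumptions: it uses a fixed, made-up vector of steps instead of rnorm() so the result is reproducible, and it compares the loop with accumulate() on that input.

```r
library(purrr)

steps <- c(0, 1, -2, 3)   # hypothetical fixed steps

# Explicit loop: each position depends on the previous one
by_loop <- numeric(length(steps))
by_loop[1] <- steps[1]
for (i in 2:length(steps)) {
  by_loop[i] <- by_loop[i - 1] + steps[i]
}

# accumulate() threads the running total through for you
by_accumulate <- accumulate(steps, `+`)

by_loop
by_accumulate
```

When the update rule is more involved than a single binary operation, the loop is usually the easier version to read.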

When the loop body is complex. If you need next to skip elements, break to stop early, or multiple conditional branches that depend on runtime state, a loop is often clearer than contorting the logic into a function passed to map().
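As an illustration, here is a small loop that skips missing values and stops at the first value above a threshold. The data and the threshold of 10 are made up for the example.

```r
values <- c(3, 8, NA, 12, 5, 20)   # hypothetical input
first_big <- NA_real_

for (v in values) {
  if (is.na(v)) next      # skip missing values
  if (v > 10) {
    first_big <- v        # remember the first hit
    break                 # and stop looking
  }
}

first_big
```

The early exit is the point: the loop never touches the elements after the first match, which map() would still visit.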

When you are still learning. A loop you understand is better than a functional you do not. Use map() when the pattern clicks; until then, write the loop, get the right answer, and refactor later when the shape of the problem becomes obvious.

Tip: Opinion

Replace a loop with a functional when the pattern is “do this to each thing independently.” Keep a loop when iterations talk to each other.

19.8 across() revisited

In Section 21.7, you saw across() for applying a function to multiple columns inside a dplyr verb. So what is across(), exactly? It is map() thinking applied inside a data frame pipeline.

penguins |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 1 × 5
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1           43.9          17.2              201.       4202. 2008.

across(.cols, .fns) takes a column selection and a function (or list of functions) and applies the function to each selected column. It is iteration over columns, the same structural idea as map() iterating over list elements, but specialized for the column-wise case inside dplyr verbs.
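The parallel with map() can be checked on a small data frame. This sketch uses a made-up data frame, df, and computes the same column means two ways: with across() inside summarise(), and with map_dbl() over the selected columns.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  a = c(1, 2, 3),
  b = c(10, NA, 30),
  label = c("x", "y", "z")
)

via_across <- df |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

via_map <- df |>
  select(where(is.numeric)) |>
  map_dbl(\(x) mean(x, na.rm = TRUE))

via_across   # a one-row data frame
via_map      # a named numeric vector
```

The numbers agree; what differs is the container that comes back.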

Multiple functions at once:

penguins |>
  summarise(across(
    where(is.numeric),
    list(mean = \(x) mean(x, na.rm = TRUE), sd = \(x) sd(x, na.rm = TRUE))
  ))
#> # A tibble: 1 × 10
#>   bill_length_mm_mean bill_length_mm_sd bill_depth_mm_mean bill_depth_mm_sd
#>                 <dbl>             <dbl>              <dbl>            <dbl>
#> 1                43.9              5.46               17.2             1.97
#> # ℹ 6 more variables: flipper_length_mm_mean <dbl>,
#> #   flipper_length_mm_sd <dbl>, body_mass_g_mean <dbl>,
#> #   body_mass_g_sd <dbl>, year_mean <dbl>, year_sd <dbl>
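The naming pattern in that output, column name then function name, is easier to see on a small case. The data frame df below is made up for the sketch.

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))   # hypothetical input

out <- df |>
  summarise(across(c(a, b), list(mean = mean, max = max)))

names(out)
out
```

Each selected column is paired with each function in the list, so two columns and two functions give four result columns.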

Combined with group_by():

penguins |>
  group_by(species) |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 3 × 6
#>   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>   <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1 Adelie              38.8          18.3              190.       3701. 2008.
#> 2 Chinstrap           48.8          18.4              196.       3733. 2008.
#> 3 Gentoo              47.5          15.0              217.       5076. 2008.

across() replaces the older summarise_at(), summarise_if(), and summarise_all(), which are now superseded. If you encounter them in older code, they do the same thing with less flexible syntax.

Under the hood, across() iterates over the selected columns, applies the function to each one, and assembles the results. That is the same thing map() does over list elements and sapply() does over vector elements. The for loop at the start of this chapter, lapply(), map_dbl(), and across() all apply a function to each element of a collection; they differ in which container they walk, in what they return, and in how much of the iteration machinery you have to write yourself.

Exercises

  1. Use across() inside summarise() to compute the median() of every numeric column in penguins, removing NA values.
  2. Use across() with a list of two functions to compute both the min() and max() of every numeric column, grouped by species.
  3. What does across(where(is.character), n_distinct) compute when used inside summarise()? Try it on penguins.