19 Iteration without loops
You have a list of three data frames and you want to compute the mean of a column in each one. Six lines of for loop, or one line of map(). The map() version is shorter, but the real difference is structural.
Functions are values (Chapter 7) and you can pass them to other functions (Section 7.1). A functional builds on that idea: it is a higher-order function that handles the iteration for you. You give it a function and a collection; it applies the function to each element, manages the indexing and pre-allocation, and returns the results.
This chapter uses three packages. Load them now if you are following along:
library(palmerpenguins)
library(purrr)
library(dplyr)
We will work with the penguins data split by species throughout:
penguins_split <- split(penguins, penguins$species)
names(penguins_split)
#> [1] "Adelie" "Chinstrap" "Gentoo"
Three data frames, one per species. Almost everything in this chapter amounts to doing something to each of them, and the question is always the same: how much of the work should you spell out yourself, and how much should the language handle?
19.1 The problem with for loops
Suppose you want the mean body mass for each species. Here is the loop version:
results <- vector("list", length(penguins_split))
for (i in seq_along(penguins_split)) {
  results[[i]] <- mean(penguins_split[[i]]$body_mass_g, na.rm = TRUE)
}
names(results) <- names(penguins_split)
results
#> $Adelie
#> [1] 3700.662
#>
#> $Chinstrap
#> [1] 3733.088
#>
#> $Gentoo
#> [1] 5076.016
Six lines to say “compute the mean mass for each group.” Look at how much of that is bookkeeping: pre-allocate a container, set up an index, store results at the right position, copy names over at the end. The actual computation, mean(df$body_mass_g, na.rm = TRUE), sits buried in the middle, surrounded by scaffolding that has nothing to do with the question you asked.
A functional says the same thing in one line:
map_dbl(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#> Adelie Chinstrap Gentoo
#> 3700.662 3733.088 5076.016
Same answer. No index variable, no pre-allocation, no name copying. You said what to compute; map_dbl handled the iteration.
The problem with for loops is not performance; in modern R, a well-written loop is fast. The problem is that loops mix the mechanism of iteration (allocate, index, store) with the content of the computation (compute a mean). Functionals separate the two. The mechanism disappears inside the functional; the content lives in the function you pass to it. And once those two concerns are separated, each becomes easier to read, test, and reuse.
For loops are not bad R. They are verbose R. If you find yourself writing the same loop scaffolding three or four times in a project, a functional is waiting to collapse all that boilerplate into a single expressive line.
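To make the separation concrete, here is a minimal sketch of what a functional does on your behalf. The name my_map is hypothetical and the body is a deliberate simplification, not how lapply() or map() are actually implemented; it only shows that the scaffolding from the loop above can live in one place while the caller supplies the computation.

```r
# Hypothetical, simplified functional: it owns the loop scaffolding,
# the caller only supplies the function to apply.
my_map <- function(x, f, ...) {
  out <- vector("list", length(x))     # pre-allocate the container
  for (i in seq_along(x)) {
    out[i] <- list(f(x[[i]], ...))     # index, call, store (safe even if f returns NULL)
  }
  names(out) <- names(x)               # carry the names over
  out
}

groups <- list(a = c(1, 2, 3), b = c(10, 20))
my_map(groups, mean)
```

The allocation, indexing, and name copying are written once inside my_map; every call site afterwards contains only the part that changes.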
Exercises
- Write a for loop that computes the number of rows in each element of penguins_split. Then rewrite it using map_int() and nrow.
- What happens if you forget to pre-allocate results and instead grow it with results <- c(results, value) inside a loop? Why is this slower?
19.2 lapply() and sapply(): base R functionals
The idea of handing a function to something that applies it to every element in a collection is older than R, older than S, older than most programming languages people still use. John McCarthy’s 1960 paper described mapcar, a Lisp function that takes a function and a list and returns a new list with the function applied to each element. The name map became standard in later languages (Scheme, ML, Haskell), and R inherited the concept twice: once through lapply() in base R, and again through purrr::map() in the tidyverse. The mechanism is always the same. You separate what to do from how to iterate, and let the language handle the second part. lapply() is R’s oldest version of that idea:
lapply(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#> $Adelie
#> [1] 3700.662
#>
#> $Chinstrap
#> [1] 3733.088
#>
#> $Gentoo
#> [1] 5076.016
lapply(x, f) applies f to each element of x and returns a list. Always a list. That predictability is its strength.
sapply() does the same thing, but tries to “simplify” the result:
sapply(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#> Adelie Chinstrap Gentoo
#> 3700.662 3733.088 5076.016
When every call to f returns a single value, sapply() hands you a named vector. But if f returns different lengths for different elements, sapply() falls back to a list, silently. You cannot predict the return type without knowing the output of f for every element of x. Fine at the console; unreliable inside a function that other code depends on.
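A small sketch of that shape-shifting, using toy lists rather than the penguins data (the object names are made up for illustration). The same sapply() call pattern produces three different kinds of result depending only on what the function returns for each element:

```r
same_len <- list(a = 1:3, b = 4:6)
diff_len <- list(a = 1:3, b = 4:5)

r1 <- sapply(same_len, length)  # one value per element     -> named integer vector
r2 <- sapply(same_len, range)   # two values per element    -> 2 x 2 matrix
r3 <- sapply(diff_len, rev)     # lengths 3 and 2 (ragged)  -> plain list

class(r1)
class(r2)
class(r3)
```

Nothing in the calls themselves tells you which of the three you will get; you have to know the data.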
vapply() solves this by making you declare the expected type:
vapply(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE), numeric(1))
#> Adelie Chinstrap Gentoo
#> 3700.662 3733.088 5076.016
The third argument, numeric(1), says “I expect each call to return a single numeric value.” If any call returns something else, you get an error instead of a silent type change.
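Here is a short sketch of that failure mode on a toy list (the names are illustrative). The tryCatch() wrapper is only there so the script keeps running and you can inspect the message; in real code you would normally let the error surface.

```r
ragged <- list(a = 1:3, b = 4:5)

# Each result is a single number, so the declared shape matches.
ok <- vapply(ragged, \(v) mean(v), numeric(1))
ok

# rev() returns vectors of length 3 and 2, which violates numeric(1),
# so vapply() stops instead of quietly returning a list.
result <- tryCatch(
  vapply(ragged, rev, numeric(1)),
  error = function(e) conditionMessage(e)
)
result
```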
Two other base R functionals are worth knowing. tapply(x, group, f) applies f to subsets of x defined by group:
tapply(penguins$body_mass_g, penguins$species, mean, na.rm = TRUE)
#> Adelie Chinstrap Gentoo
#> 3700.662 3733.088 5076.016
And mapply(f, x, y) iterates over multiple inputs in parallel:
mapply(paste, c("one", "two", "three"), c("fish", "cat", "bird"))
#> one two three
#> "one fish" "two cat" "three bird"
Use lapply() when you want a list. Use vapply() when you want a vector and want safety. Never use sapply() in non-interactive code: its return type depends on the data, and that is a bug hiding in your pipeline, waiting for the one input that changes everything.
Exercises
- Use lapply() and nrow to get the number of rows in each element of penguins_split.
- Use vapply() to compute the median flipper length for each species. Declare the expected output type.
- What does sapply(list(), mean) return? What about vapply(list(), mean, numeric(1))? Which is safer?
19.3 purrr::map(): the tidyverse functional
The purrr package provides map(), which does the same thing as lapply():
map(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#> $Adelie
#> [1] 3700.662
#>
#> $Chinstrap
#> [1] 3733.088
#>
#> $Gentoo
#> [1] 5076.016
Apply a function to each element, return a list. So why bother with a new package?
Three reasons. First, typed variants: map_dbl() returns a double vector, map_chr() returns a character vector, and if the function returns the wrong type, you get an error. No more guessing, no more silent simplification.
map_dbl(penguins_split, \(df) mean(df$body_mass_g, na.rm = TRUE))
#> Adelie Chinstrap Gentoo
#> 3700.662 3733.088 5076.016
Second, consistency with the pipe. map() and its variants are designed to sit naturally in a pipeline:
penguins_split |>
  map_dbl(\(df) mean(df$body_mass_g, na.rm = TRUE))
#> Adelie Chinstrap Gentoo
#> 3700.662 3733.088 5076.016
Third, you may encounter the formula shorthand in older code: ~ mean(.x$body_mass_g, na.rm = TRUE). This was purrr’s anonymous function syntax before R 4.1 introduced \(x). You will see it in existing codebases. Prefer \(x) in new code.
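If you want to convince yourself that the two spellings are interchangeable, a quick side-by-side on a toy list (the values are made up) looks like this:

```r
library(purrr)

masses <- list(a = c(1, 2, NA), b = c(10, 20, 30))

# Older purrr formula shorthand: .x stands for the current element.
old_style <- map_dbl(masses, ~ mean(.x, na.rm = TRUE))

# Base R anonymous function (R >= 4.1): the argument name is yours to choose.
new_style <- map_dbl(masses, \(v) mean(v, na.rm = TRUE))

identical(old_style, new_style)
```

The \(v) form has the practical advantage that it works anywhere in R, not only inside functions that know how to interpret a formula.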
map() returns a list. Always. map_dbl() returns a double vector. Always. This type stability is the core advantage over sapply(), and it matters more than it sounds: in a long pipeline, knowing the exact shape of every intermediate result is the difference between debugging for five seconds and debugging for an hour.
map() embodies a pattern from category theory: the functor. map(xs, f) applies f to each element of xs, preserving the container structure (a list of three elements in, a list of three elements out). A functor lifts a function between values (f :: A -> B) to a function between containers (map(f) :: List(A) -> List(B)). You do not need the theory to use map(), but recognizing the pattern explains why map() feels the same whether you apply it to a list, a vector, or a set of data frame columns with across(): it is the same structural idea in different containers.
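For the curious, the functor idea comes with two laws, and you can check them informally on a small list. This is a sketch on one example, not a proof, and the helper functions f and g are arbitrary choices made for illustration:

```r
library(purrr)

xs <- list(a = 1, b = 2, c = 3)
f <- \(x) x + 1
g <- \(x) x * 10

# Identity: mapping a do-nothing function leaves the container unchanged.
identity_law <- identical(map(xs, identity), xs)

# Composition: mapping f and then g equals mapping "g after f" in one pass.
composition_law <- identical(
  map(map(xs, f), g),
  map(xs, \(x) g(f(x)))
)

c(identity = identity_law, composition = composition_law)
```

Both checks rely on the property the paragraph above describes: map() touches the elements and leaves the shape and names of the container alone.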
A for loop iterates by maintaining a counter variable: i goes from 1 to n, and the body runs once per increment. You declare the counter, increment it, read it. With map(), the counter disappears. The list has three elements, so the function runs three times; the structure itself determines how many applications occur. You stop managing the iteration and let the data’s shape manage it for you.
The typed variants enforce something even stricter.
Exercises
- Use map_dbl() to compute the standard deviation of body_mass_g for each species in penguins_split.
- Use map_chr() to extract the first value of species from each element of penguins_split. (Hint: \(df) as.character(df$species[1]).)
- Rewrite the for loop from Section 19.1 using map_dbl(). Compare the two versions for readability.
19.4 Typed variants
The typed variants of map() enforce both the type and the length of each result:
- map_dbl(.x, .f): each call to .f must return a single double. Result is a numeric vector.
- map_chr(.x, .f): each call must return a single string. Result is a character vector.
- map_lgl(.x, .f): each call must return a single logical. Result is a logical vector.
- map_int(.x, .f): each call must return a single integer. Result is an integer vector.
map_int(penguins_split, nrow)
#> Adelie Chinstrap Gentoo
#> 152 68 124
map_lgl(penguins_split, \(df) any(is.na(df$body_mass_g)))
#> Adelie Chinstrap Gentoo
#> TRUE FALSE TRUE
If .f returns the wrong type or a vector of length other than one, you get an error:
map_dbl(penguins_split, \(df) df$body_mass_g)
#> Error in `map_dbl()`:
#> ℹ In index: 1.
#> ℹ With name: Adelie.
#> Caused by error:
#> ! Result must be length 1, not 152.
When the output does not match your expectation, you find out immediately, not three functions downstream when something produces NULL instead of a number.
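The example above trips the length check. A companion sketch for the type check, on a toy list with one deliberately wrong element: the exact wording of the error differs between purrr versions, so the code below only confirms that an error is raised rather than quoting the message.

```r
library(purrr)

# Integers can be widened to doubles without loss, so this is accepted.
widened <- map_dbl(list(a = 1L, b = 2L), \(x) x)
widened

# A string cannot become a double, so map_dbl() refuses.
# tryCatch() is used only to keep the script running.
inputs <- list(a = 1, b = "two")
type_error <- tryCatch(
  map_dbl(inputs, \(x) x),
  error = function(e) e
)
inherits(type_error, "error")
```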
For results that are data frames, the older map_dfr() is superseded. The current pattern is map() followed by list_rbind():
penguins_split |>
  map(\(df) tibble(
    species = df$species[1],
    mean_mass = mean(df$body_mass_g, na.rm = TRUE),
    n = nrow(df)
  )) |>
  list_rbind()
#> # A tibble: 3 × 3
#>   species   mean_mass     n
#>   <fct>         <dbl> <int>
#> 1 Adelie        3701.   152
#> 2 Chinstrap     3733.    68
#> 3 Gentoo        5076.   124
map() returns a list of data frames, and list_rbind() stacks them into one. But what about iterating over more than one thing at a time?
Exercises
- Use map_int() and nrow to count the rows in each element of penguins_split.
- Use map_lgl() to check which elements of penguins_split have more than 100 rows.
- Use map() and list_rbind() to build a summary data frame with one row per species, containing the species name, median bill length, and median bill depth.
19.5 Iterating over multiple inputs
Sometimes a single list is not enough. You need to walk over two inputs in lockstep, pulling one element from each on every step. map2() handles this:
species_names <- names(penguins_split)
species_counts <- map_int(penguins_split, nrow)
map2_chr(species_names, species_counts, \(name, n) paste(name, ":", n, "penguins"))
#> [1] "Adelie : 152 penguins" "Chinstrap : 68 penguins"
#> [3] "Gentoo : 124 penguins"
map2(.x, .y, .f) calls .f(.x[[1]], .y[[1]]), then .f(.x[[2]], .y[[2]]), and so on. The two inputs are consumed in parallel, not in a nested grid.
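The difference between "in parallel" and "a nested grid" is easy to see by counting calls. In this sketch (toy vectors, illustrative names), map2_dbl() makes three calls, while building every combination first with expand.grid() and handing the result to pmap_dbl() makes nine:

```r
library(purrr)

x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Lockstep: (1, 10), (2, 20), (3, 30) -> three calls.
lockstep <- map2_dbl(x, y, \(a, b) a + b)

# Grid: every x paired with every y -> nine rows, nine calls.
grid <- expand.grid(a = x, b = y)
all_pairs <- pmap_dbl(grid, \(a, b) a + b)

length(lockstep)
length(all_pairs)
```

If you ever do want the grid, building it explicitly like this keeps the intent visible; map2() on its own will not do it for you.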
For three or more inputs, use pmap(). You supply a list of vectors (or a data frame, since a data frame is a list of columns):
params <- list(
  mean = c(0, 5, -3),
  sd = c(1, 2, 0.5),
  n = c(3, 3, 3)
)
set.seed(42)
pmap(params, \(mean, sd, n) rnorm(n, mean, sd))
#> [[1]]
#> [1] 1.3709584 -0.5646982 0.3631284
#>
#> [[2]]
#> [1] 6.265725 5.808537 4.787751
#>
#> [[3]]
#> [1] -2.244239 -3.047330 -1.990788
Each row of parameters becomes one call. pmap() is the general case; map2() is the special case for exactly two inputs.
The base R equivalents are Map(f, x, y) (which returns a list, like map2()) and mapply(f, x, y) (which simplifies, like sapply()). The same warning applies: mapply is unpredictable in return type.
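A brief base R sketch of that contrast, with made-up inputs: Map() gives a list every time, while mapply() returns a vector or a list depending on what the function produces.

```r
means <- c(a = 1, b = 2, c = 3)
sds <- c(0.1, 0.2, 0.3)

# Map(): always a list, named after the first input.
as_list <- Map(\(m, s) m + s, means, sds)

# mapply(): simplifies to a named numeric vector here...
simplified <- mapply(\(m, s) m + s, means, sds)

# ...but falls back to a list when result lengths differ (1, 2, 3).
ragged <- mapply(\(m, n) rep(m, n), means, 1:3)

class(as_list)
class(simplified)
class(ragged)
```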
Where pmap() really shines is in configuration-driven workflows, places where you want to fit different models to different subsets with different formulas, all controlled by a single table of parameters. Suppose you want to regress body mass on flipper length for each species, but with a different formula for each:
configs <- tibble(
  data = penguins_split,
  formula = list(
    body_mass_g ~ flipper_length_mm,
    body_mass_g ~ flipper_length_mm + bill_length_mm,
    body_mass_g ~ flipper_length_mm * bill_length_mm
  )
)
models <- pmap(configs, \(data, formula) lm(formula, data = data))
map_dbl(models, \(m) summary(m)$r.squared)
#> Adelie Chinstrap Gentoo
#> 0.2192128 0.4688941 0.5923485
pmap() iterates over the rows of the configuration table, fitting one model per species. The entire experiment lives in one tibble, and adding a fourth species or a fourth formula means adding a row, not rewriting the loop.
Exercises
- Use map2_chr() to paste together the vectors c("one", "two", "three") and c("fish", "cat", "bird") with a space in between.
- Create a list of three means and three standard deviations. Use map2_dbl() to generate one random normal value for each pair. (Hint: \(m, s) rnorm(1, m, s).)
- A data frame is a list of columns. What does pmap_dbl(data.frame(x = 1:3, y = 4:6), \(x, y) x + y) return?
19.6 Side effects: walk()
Everything so far has been about functions that compute something: they take input and produce output. But some functions exist for what they do, not what they return: printing a message, writing a file, drawing a plot. For these, use walk().
walk(names(penguins_split), \(name) cat(name, "\n"))
#> Adelie
#> Chinstrap
#> Gentoo
walk(.x, .f) applies .f to each element of .x, but instead of collecting results into a list, it returns .x invisibly. The point is the side effect, not the return value.
The two-input version, walk2(), is especially useful for writing files:
# Assumes the data/ directory already exists; write.csv() does not create it.
file_paths <- paste0("data/", names(penguins_split), ".csv")
walk2(penguins_split, file_paths, \(df, path) write.csv(df, path, row.names = FALSE))
Each data frame gets written to its own file. walk2() iterates over the data frames and paths in parallel, calling write.csv() for each pair.
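If you want to try this pattern without touching your project folder, one option is to write into a temporary directory. The sketch below is a variation on the example above, not a replacement for it: it uses the built-in mtcars data split by cylinder count, and the folder name by_cyl is a hypothetical choice. The directory is created explicitly because write.csv() fails when the target folder is missing.

```r
library(purrr)

groups <- split(mtcars, mtcars$cyl)           # built-in data, three groups

out_dir <- file.path(tempdir(), "by_cyl")     # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

paths <- file.path(out_dir, paste0("cyl_", names(groups), ".csv"))
walk2(groups, paths, \(df, path) write.csv(df, path, row.names = FALSE))

file.exists(paths)
```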
Use walk() whenever .f does something rather than computes something. If you use map() for side effects, you end up with a list of NULL values cluttering your console, which is the language telling you that you reached for the wrong tool.
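You can see the difference by capturing what each call returns. In this small sketch (toy labels, illustrative names), both versions print the same lines, but only one of them leaves you with something useful:

```r
library(purrr)

labels <- c("a", "b", "c")

# map() dutifully collects cat()'s return value, which is NULL each time.
from_map <- map(labels, \(s) cat(s, "\n"))

# walk() discards those NULLs and hands back its input.
from_walk <- walk(labels, \(s) cat(s, "\n"))

str(from_map)
from_walk
```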
Because walk() returns its input invisibly, it slots cleanly into the middle of a pipeline:
penguins_split |>
  walk(\(df) cat("Species:", as.character(df$species[1]), "- n =", nrow(df), "\n")) |>
  map_dbl(\(df) mean(df$body_mass_g, na.rm = TRUE))
#> Species: Adelie - n = 152
#> Species: Chinstrap - n = 68
#> Species: Gentoo - n = 124
#> Adelie Chinstrap Gentoo
#> 3700.662 3733.088 5076.016
The walk() prints a line for each species, then passes penguins_split through unchanged to map_dbl(). Side effects and computation in one pipeline, each handled by the right tool. But there is a category of iteration that no functional – not map(), not walk() – can handle cleanly.
Exercises
- Use walk() to print the column names of each element of penguins_split. (They should all be the same.)
- Use walk2() and cat() to print lines like "Adelie has 152 rows" for each species.
19.7 When loops are fine
Not everything should be a functional. Loops earn their place in three situations.
When iterations depend on each other. If the result of iteration i feeds into iteration i + 1, a functional cannot help you, because functionals assume independence between elements. Simulation chains, iterative algorithms, and accumulating state all need explicit loops. (purrr’s reduce() and accumulate() handle some of these sequential cases, and you will meet them in Chapter 21. In the language of category theory, a fold that tears down a list one element at a time is a catamorphism, the most general recursion pattern: any loop that starts with an accumulator, walks a list, and returns the accumulator is a catamorphism in disguise.)
# Random walk: each step depends on the previous
x <- numeric(20)
x[1] <- 0
for (i in 2:20) {
  x[i] <- x[i - 1] + rnorm(1)
}
x
#> [1] 0.0000000 -0.0627141 1.2421556 3.5288009 2.1399402 1.8611515
#> [7] 1.7278301 2.3637805 2.0795276 -0.5769278 -3.0173947 -1.6972814
#> [13] -2.0039200 -3.7852284 -3.9571458 -2.7424711 -0.8472776 -1.2777467
#> [19] -1.5350161 -3.2981792
When the loop body is complex. If you need next to skip elements, break to stop early, or multiple conditional branches that depend on runtime state, a loop is often clearer than contorting the logic into a function passed to map().
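As a small illustration of that second case, here is a loop over made-up sensor readings that skips missing values and stops at the first negative one. The data and the rule are invented for the sketch; the point is that next and break express the control flow directly.

```r
readings <- c(4.2, NA, 5.1, -1, 6.3, 7.0)

total <- 0
used <- 0
for (r in readings) {
  if (is.na(r)) next    # skip missing readings
  if (r < 0) break      # a negative value means the sensor failed: stop here
  total <- total + r
  used <- used + 1
}

total
used
```

The loop never looks at 6.3 or 7.0. Getting the same early exit out of map() would take noticeably more machinery than the three lines of control flow above.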
When you are still learning. A loop you understand is better than a functional you do not. Use map() when the pattern clicks; until then, write the loop, get the right answer, and refactor later when the shape of the problem becomes obvious.
Replace a loop with a functional when the pattern is “do this to each thing independently.” Keep a loop when iterations talk to each other.
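As a brief preview of the accumulate() mentioned earlier, the random walk can be written without an explicit loop once the steps are drawn up front. This is a sketch under one assumption that differs from the loop shown above: all 19 steps are generated in a single rnorm() call (with an arbitrary seed) so that the loop and the functional can be compared on identical inputs.

```r
library(purrr)

set.seed(1)             # arbitrary seed, only for reproducibility
steps <- rnorm(19)      # all steps drawn up front

# Explicit loop: each position depends on the previous one.
walk_loop <- numeric(20)
for (i in 2:20) {
  walk_loop[i] <- walk_loop[i - 1] + steps[i - 1]
}

# accumulate(): carry a running total forward, starting from 0.
walk_acc <- accumulate(steps, `+`, .init = 0)

all.equal(walk_loop, walk_acc)
```

The dependence between iterations has not gone away; accumulate() just threads the running value through for you, which is why it counts as a different pattern from map().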
19.8 across() revisited
In Section 21.7, you saw across() for applying a function to multiple columns inside a dplyr verb. So what is across(), exactly? It is map() thinking applied inside a data frame pipeline.
penguins |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 1 × 5
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1           43.9          17.2              201.       4202. 2008.
across(.cols, .fns) takes a column selection and a function (or list of functions) and applies the function to each selected column. It is iteration over columns, the same structural idea as map() iterating over list elements, but specialized for the column-wise case inside dplyr verbs.
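Because a data frame is a list of columns, you can check the correspondence directly. The sketch below uses the built-in mtcars data and three arbitrarily chosen columns; it computes the same column means once with map_dbl() and once with across(), then compares them.

```r
library(dplyr)
library(purrr)

cols <- c("mpg", "hp", "wt")

# map_dbl() walks the selected columns as list elements.
via_map <- map_dbl(mtcars[cols], mean)

# across() walks the same columns inside summarise().
via_across <- mtcars |>
  summarise(across(all_of(cols), mean))

all.equal(via_map, unlist(via_across))
```

The numbers agree; what differs is the packaging. map_dbl() returns a named vector, while across() leaves the results as columns of a one-row data frame, ready for the next dplyr verb.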
Multiple functions at once:
penguins |>
  summarise(across(
    where(is.numeric),
    list(mean = \(x) mean(x, na.rm = TRUE), sd = \(x) sd(x, na.rm = TRUE))
  ))
#> # A tibble: 1 × 10
#>   bill_length_mm_mean bill_length_mm_sd bill_depth_mm_mean bill_depth_mm_sd
#>                 <dbl>             <dbl>              <dbl>            <dbl>
#> 1                43.9              5.46               17.2             1.97
#> # ℹ 6 more variables: flipper_length_mm_mean <dbl>,
#> #   flipper_length_mm_sd <dbl>, body_mass_g_mean <dbl>,
#> #   body_mass_g_sd <dbl>, year_mean <dbl>, year_sd <dbl>
Combined with group_by():
penguins |>
  group_by(species) |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 3 × 6
#>   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>   <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1 Adelie              38.8          18.3              190.       3701. 2008.
#> 2 Chinstrap           48.8          18.4              196.       3733. 2008.
#> 3 Gentoo              47.5          15.0              217.       5076. 2008.
across() replaces the older summarise_at(), summarise_if(), and summarise_all(), which are now superseded. If you encounter them in older code, they do the same thing with less flexible syntax.
Under the hood, across() iterates over the selected columns, applies the function to each one, and assembles the results. That is the same thing map() does over list elements and sapply() does over vector values. The for loop at the start of this chapter, lapply(), map_dbl(), and across() all apply a function to each element of a collection; they differ in which container they walk, what they return, and how much of the iteration you have to spell out yourself.
Exercises
- Use across() inside summarise() to compute the median() of every numeric column in penguins, removing NA values.
- Use across() with a list of two functions to compute both the min() and max() of every numeric column, grouped by species.
- What does across(where(is.character), n_distinct) compute when used inside summarise()? Try it on penguins.