14  Data transformation

Every data analysis is the same four moves: keep the rows you want, pick the columns you need, compute new values, summarize. dplyr gives each move a name.

14.1 The idea: verbs for data

A dplyr verb takes a data frame and returns a data frame. The data frame goes in first, the result comes out the same shape (a data frame), and nothing else changes. Every verb does one thing:

  • filter() keeps rows that match a condition.
  • select() picks columns by name.
  • mutate() adds new columns computed from existing ones.
  • summarise() collapses rows into a summary.
  • arrange() reorders rows.

These five verbs cover most of what you need. You compose them for complex transformations (via pipes, Chapter 15), but each one does exactly one thing. A dplyr pipeline is function composition: each verb transforms a data frame and passes it to the next, the same composition from lambda calculus, (g ∘ f)(x) = g(f(x)), applied to tables instead of numbers.

These verbs did not come from nowhere. In 1970, Edgar Codd, a mathematician at IBM’s San Jose Research Laboratory, published “A Relational Model of Data for Large Shared Data Banks.” At the time, databases were navigational: to find a record, you followed pointers from one record to the next, like walking a linked list. Codd proposed something radical. Forget the pointers. Think of data as tables (he called them relations), and define a small set of operations on those tables: restriction (keep rows where a predicate holds), projection (keep certain columns), extension (add computed columns), and aggregation (collapse groups to summaries). The key insight was that every operation takes a table as input and returns a table as output, so you can chain them: filter a table, then select columns from the result, then summarize that. Codd called this set of composable operations a relational algebra, “algebra” in the mathematical sense of a closed system where combining any two things always gives you back the same kind of thing. IBM initially resisted the idea. Codd’s own employer spent years trying to protect their existing database product (IMS) from the relational model. But the composability was too useful to ignore, and by the 1980s relational databases had won.

dplyr is Codd’s relational algebra with different names. filter() is restriction. select() is projection. mutate() is extension. summarise() is aggregation. arrange() handles ordering, which has no relational equivalent but is essential for display. The reason these five verbs compose so cleanly, and the reason the same ideas appear in SQL, pandas, data.table, and every other language that works with tabular data, is that they are all implementations of the same fifty-year-old algebra.

library(palmerpenguins)
#> 
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#> 
#>     penguins, penguins_raw
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

The penguins data frame (loaded by palmerpenguins) has 344 rows and 8 columns: species, island, bill measurements, flipper length, body mass, sex, and year. Every example in this chapter uses it.

penguins
#> # A tibble: 344 × 8
#>    species island  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>            <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torger…           39.1          18.7               181        3750
#>  2 Adelie  Torger…           39.5          17.4               186        3800
#>  3 Adelie  Torger…           40.3          18                 195        3250
#>  4 Adelie  Torger…           NA            NA                  NA          NA
#>  5 Adelie  Torger…           36.7          19.3               193        3450
#>  6 Adelie  Torger…           39.3          20.6               190        3650
#>  7 Adelie  Torger…           38.9          17.8               181        3625
#>  8 Adelie  Torger…           39.2          19.6               195        4675
#>  9 Adelie  Torger…           34.1          18.1               193        3475
#> 10 Adelie  Torger…           42            20.2               190        4250
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

14.2 filter(): keep rows

filter() takes a data frame and one or more conditions. It returns only the rows where every condition is TRUE.

filter(penguins, species == "Adelie")
#> # A tibble: 152 × 8
#>    species island  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>            <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torger…           39.1          18.7               181        3750
#>  2 Adelie  Torger…           39.5          17.4               186        3800
#>  3 Adelie  Torger…           40.3          18                 195        3250
#>  4 Adelie  Torger…           NA            NA                  NA          NA
#>  5 Adelie  Torger…           36.7          19.3               193        3450
#>  6 Adelie  Torger…           39.3          20.6               190        3650
#>  7 Adelie  Torger…           38.9          17.8               181        3625
#>  8 Adelie  Torger…           39.2          19.6               195        4675
#>  9 Adelie  Torger…           34.1          18.1               193        3475
#> 10 Adelie  Torger…           42            20.2               190        4250
#> # ℹ 142 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

The first argument is the data frame. The second is a logical expression, the same kind you wrote in Section 8.2. Inside filter(), column names work like variables: species refers to the species column, not an object in your global environment.

Multiple conditions, separated by commas, are combined with AND:

filter(penguins, species == "Adelie", body_mass_g > 4000)

This keeps only Adelie penguins heavier than 4000 grams. Comma means “and.” You can also write & explicitly, but the comma is cleaner.

For OR, use |:

filter(penguins, species == "Adelie" | species == "Chinstrap")

When testing membership in a set, %in% is clearer than chaining |:

filter(penguins, species %in% c("Adelie", "Chinstrap"))

One detail worth knowing: filter() drops rows where the condition evaluates to NA. If body_mass_g is NA for a row, then body_mass_g > 4000 is NA, and filter() excludes it. This is usually what you want. If you need to keep NA rows, test for them explicitly: filter(penguins, is.na(body_mass_g)).

Under the hood, filter() is doing the same logical subsetting you saw in Section 8.2: penguins[penguins$species == "Adelie", ]. The verb just makes the intent explicit.

Exercises

  1. Filter penguins to keep only female penguins from Biscoe island.
  2. Filter penguins to keep rows where flipper_length_mm is greater than 200 or body_mass_g is less than 3000.
  3. How many rows have missing values in sex? Use filter() and nrow().

14.3 select(): pick columns

select() keeps the columns you name and drops everything else.

select(penguins, species, bill_length_mm, body_mass_g)
#> # A tibble: 344 × 3
#>    species bill_length_mm body_mass_g
#>    <fct>            <dbl>       <int>
#>  1 Adelie            39.1        3750
#>  2 Adelie            39.5        3800
#>  3 Adelie            40.3        3250
#>  4 Adelie            NA            NA
#>  5 Adelie            36.7        3450
#>  6 Adelie            39.3        3650
#>  7 Adelie            38.9        3625
#>  8 Adelie            39.2        4675
#>  9 Adelie            34.1        3475
#> 10 Adelie            42          4250
#> # ℹ 334 more rows

The result keeps only those three columns; the rest are gone.

Drop columns by prefixing with -:

select(penguins, -year)

For more flexible selection, dplyr provides helper functions. starts_with(), ends_with(), and contains() match column names by pattern:

select(penguins, starts_with("bill"))
select(penguins, ends_with("mm"))

where() selects columns based on their content, not their name. where(is.numeric) keeps every numeric column:

select(penguins, species, where(is.numeric))

You can rename columns inside select():

select(penguins, species, mass = body_mass_g)
#> # A tibble: 344 × 2
#>    species  mass
#>    <fct>   <int>
#>  1 Adelie   3750
#>  2 Adelie   3800
#>  3 Adelie   3250
#>  4 Adelie     NA
#>  5 Adelie   3450
#>  6 Adelie   3650
#>  7 Adelie   3625
#>  8 Adelie   4675
#>  9 Adelie   3475
#> 10 Adelie   4250
#> # ℹ 334 more rows

This selects species and body_mass_g, renaming the latter to mass. If you want to rename without dropping columns, use rename():

rename(penguins, mass = body_mass_g)

To reorder columns (move species to the front, keep everything else), use everything():

select(penguins, island, species, everything())

Exercises

  1. Select all columns from penguins that contain the string "length".
  2. Select species, island, and all numeric columns. (Hint: combine names and where().)
  3. What is the difference between select(penguins, mass = body_mass_g) and rename(penguins, mass = body_mass_g)? Try both.

14.4 mutate(): compute new columns

mutate() adds new columns. The original columns stay.

mutate() is the functor pattern applied to data frames. It maps a function over a column (the container), producing a new column of the same length. The container’s shape (344 rows) is preserved; only the contents change. You saw the same pattern with vectorized sqrt() on vectors (Section 4.4), with sapply() on lists (Section 7.1), and with map() on split data frames (Section 19.3). Each time, the structure is ‘apply a function inside a container, keep the container intact.’ The word for this, across all of these instances, is functor.

mutate(penguins, mass_kg = body_mass_g / 1000)
#> # A tibble: 344 × 9
#>    species island  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>            <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torger…           39.1          18.7               181        3750
#>  2 Adelie  Torger…           39.5          17.4               186        3800
#>  3 Adelie  Torger…           40.3          18                 195        3250
#>  4 Adelie  Torger…           NA            NA                  NA          NA
#>  5 Adelie  Torger…           36.7          19.3               193        3450
#>  6 Adelie  Torger…           39.3          20.6               190        3650
#>  7 Adelie  Torger…           38.9          17.8               181        3625
#>  8 Adelie  Torger…           39.2          19.6               195        4675
#>  9 Adelie  Torger…           34.1          18.1               193        3475
#> 10 Adelie  Torger…           42            20.2               190        4250
#> # ℹ 334 more rows
#> # ℹ 3 more variables: sex <fct>, year <int>, mass_kg <dbl>

You can reference a column you just created in the same mutate() call:

mutate(
  penguins,
  mass_kg = body_mass_g / 1000,
  mass_lb = mass_kg * 2.205
)

mass_lb uses mass_kg, which was defined one line above. dplyr evaluates them in order, top to bottom.

Any function that works on vectors works inside mutate(): log(), sqrt(), cumsum(), lag(), lead(), percent_rank(). The conditional functions from Section 8.7 work here too:

mutate(
  penguins,
  size = if_else(body_mass_g > 4000, "large", "small")
)

For more than two categories, case_when() is the tool:

mutate(
  penguins,
  size = case_when(
    body_mass_g > 5000 ~ "large",
    body_mass_g > 3500 ~ "medium",
    .default = "small"
  )
)

If you want to keep only the new columns and drop everything else, use .keep = "none":

mutate(
  penguins,
  mass_kg = body_mass_g / 1000,
  .keep = "none"
)
#> # A tibble: 344 × 1
#>    mass_kg
#>      <dbl>
#>  1    3.75
#>  2    3.8 
#>  3    3.25
#>  4   NA   
#>  5    3.45
#>  6    3.65
#>  7    3.62
#>  8    4.68
#>  9    3.48
#> 10    4.25
#> # ℹ 334 more rows

Exercises

  1. Add a column bill_ratio that divides bill_length_mm by bill_depth_mm.
  2. Use mutate() and case_when() to create a column island_code that maps "Biscoe" to "B", "Dream" to "D", and "Torgersen" to "T".
  3. What happens if you use a column name that already exists in mutate()? Try mutate(penguins, species = toupper(species)).

14.5 summarise() and group_by(): aggregate

summarise() collapses an entire data frame into a single row:

summarise(penguins, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   mean_mass
#>       <dbl>
#> 1     4202.

On its own, collapsing everything into a single row is limited. Paired with group_by(), it does the real work.

group_by() marks a data frame as grouped. It doesn’t change the data; it attaches metadata that tells subsequent verbs to operate per group:

penguins_grouped <- group_by(penguins, species)
summarise(penguins_grouped, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 3 × 2
#>   species   mean_mass
#>   <fct>         <dbl>
#> 1 Adelie        3701.
#> 2 Chinstrap     3733.
#> 3 Gentoo        5076.

summarise() applied mean() within each group and returned one row per species.

You can group by multiple variables:

summarise(
  group_by(penguins, species, island),
  mean_mass = mean(body_mass_g, na.rm = TRUE),
  count = n()
)
#> `summarise()` has regrouped the output.
#> ℹ Summaries were computed grouped by species and island.
#> ℹ Output is grouped by species.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(species, island))` for per-operation grouping
#>   (`?dplyr::dplyr_by`) instead.
#> # A tibble: 5 × 4
#> # Groups:   species [3]
#>   species   island    mean_mass count
#>   <fct>     <fct>         <dbl> <int>
#> 1 Adelie    Biscoe        3710.    44
#> 2 Adelie    Dream         3688.    56
#> 3 Adelie    Torgersen     3706.    52
#> 4 Chinstrap Dream         3733.    68
#> 5 Gentoo    Biscoe        5076.   124

n() counts the number of rows in each group. n_distinct() counts unique values:

summarise(
  group_by(penguins, species),
  n_islands = n_distinct(island)
)
#> # A tibble: 3 × 2
#>   species   n_islands
#>   <fct>         <int>
#> 1 Adelie            3
#> 2 Chinstrap         1
#> 3 Gentoo            1

The shortcut count() does group_by() plus summarise(n = n()) in one step:

count(penguins, species)
#> # A tibble: 3 × 2
#>   species       n
#>   <fct>     <int>
#> 1 Adelie      152
#> 2 Chinstrap    68
#> 3 Gentoo      124
TipOpinion

After summarise(), the last grouping variable is silently peeled off. The result may still be grouped. This causes subtle bugs downstream: a mutate() you expect to operate on the full data frame operates per group instead. Always add .groups = "drop" to summarise(), or call ungroup() afterward. Be explicit about grouping state.

Exercises

  1. Compute the median flipper_length_mm for each species.
  2. Use group_by() and summarise() to find the maximum bill_length_mm per species and island combination.
  3. Use count() to find how many penguins were observed in each year.

14.6 arrange(): order rows

arrange() sorts rows. The default is ascending:

arrange(penguins, body_mass_g)
#> # A tibble: 344 × 8
#>    species  island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>    <fct>           <dbl>         <dbl>             <int>       <int>
#>  1 Chinstr… Dream            46.9          16.6               192        2700
#>  2 Adelie   Biscoe           36.5          16.6               181        2850
#>  3 Adelie   Biscoe           36.4          17.1               184        2850
#>  4 Adelie   Biscoe           34.5          18.1               187        2900
#>  5 Adelie   Dream            33.1          16.1               178        2900
#>  6 Adelie   Torge…           38.6          17                 188        2900
#>  7 Chinstr… Dream            43.2          16.6               187        2900
#>  8 Adelie   Biscoe           37.9          18.6               193        2925
#>  9 Adelie   Dream            37.5          18.9               179        2975
#> 10 Adelie   Dream            37            16.9               185        3000
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

Wrap a column in desc() for descending order:

arrange(penguins, desc(body_mass_g))

Multiple columns break ties. Sort by species alphabetically, then by mass descending within each species:

arrange(penguins, species, desc(body_mass_g))

Unlike the other verbs, arrange() ignores grouping by default. If you want to sort within groups, set .by_group = TRUE.

arrange() is pure display. It does not change the data, only the order of rows. No information is added or removed. But a well-sorted table can reveal patterns that a jumbled one hides.

14.7 More verbs

The five core verbs handle most transformations. A few more are worth knowing.

distinct() returns unique rows:

distinct(penguins, species, island)
#> # A tibble: 5 × 2
#>   species   island   
#>   <fct>     <fct>    
#> 1 Adelie    Torgersen
#> 2 Adelie    Biscoe   
#> 3 Adelie    Dream    
#> 4 Gentoo    Biscoe   
#> 5 Chinstrap Dream

Seven combinations of species and island exist in the data.

slice_head() and slice_tail() keep the first or last n rows:

slice_head(penguins, n = 3)
#> # A tibble: 3 × 8
#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgers…           39.1          18.7               181        3750
#> 2 Adelie  Torgers…           39.5          17.4               186        3800
#> 3 Adelie  Torgers…           40.3          18                 195        3250
#> # ℹ 2 more variables: sex <fct>, year <int>

slice_max() and slice_min() keep the rows with the largest or smallest values:

slice_max(penguins, body_mass_g, n = 3)
#> # A tibble: 4 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
#> 1 Gentoo  Biscoe           49.2          15.2               221        6300
#> 2 Gentoo  Biscoe           59.6          17                 230        6050
#> 3 Gentoo  Biscoe           51.1          16.3               220        6000
#> 4 Gentoo  Biscoe           48.8          16.2               222        6000
#> # ℹ 2 more variables: sex <fct>, year <int>

pull() extracts a single column as a vector. It is the pipe-friendly equivalent of $:

pull(penguins, species) |> head()
#> [1] Adelie Adelie Adelie Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

relocate() moves columns without dropping any. Move sex and year to the front:

relocate(penguins, sex, year)

Finally, across() applies a function to multiple columns at once. Summarise all numeric columns by their mean:

summarise(penguins, across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 1 × 5
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1           43.9          17.2              201.       4202. 2008.

across() applies a function to multiple columns at once. It pairs naturally with where(), starts_with(), and the other selection helpers from Section 14.3. It becomes easier to use after Chapter 19, where you will see the same pattern of “apply a function to many things” in a broader context.

Exercises

  1. Use distinct() to find all unique values of island.
  2. Use slice_max() to find the 5 penguins with the longest flippers.
  3. Use across() inside summarise() to compute the standard deviation (sd()) of every numeric column, grouped by species.

14.8 The base R equivalents

Everything dplyr does, base R can do too. The verbs are convenient wrappers, not new capabilities. Here are the translations.

filter() is logical subsetting:

# dplyr
filter(penguins, species == "Adelie", body_mass_g > 4000)

# base R
penguins[penguins$species == "Adelie" & penguins$body_mass_g > 4000, ]

select() is column indexing:

# dplyr
select(penguins, species, body_mass_g)

# base R
penguins[, c("species", "body_mass_g")]

mutate() is direct assignment:

# dplyr
mutate(penguins, mass_kg = body_mass_g / 1000)

# base R
penguins$mass_kg <- penguins$body_mass_g / 1000

Note the difference: the base R version modifies the data frame in place (or rather, modifies the binding). The dplyr version returns a new data frame and leaves the original untouched.

summarise() plus group_by() maps to aggregate():

# dplyr
summarise(group_by(penguins, species), mean_mass = mean(body_mass_g, na.rm = TRUE))

# base R
aggregate(body_mass_g ~ species, data = penguins, FUN = mean, na.rm = TRUE)

arrange() is order():

# dplyr
arrange(penguins, desc(body_mass_g))

# base R
penguins[order(-penguins$body_mass_g), ]

Base R also has subset(), which combines filtering and column selection:

subset(penguins, species == "Adelie", select = c(species, body_mass_g))

subset() is clean and readable. For simple operations, it works well. The advantage of dplyr is consistency: every verb takes a data frame first and returns a data frame, so they compose. In base R, aggregate() returns different column names, order() returns indices rather than a data frame, and transform() (the base equivalent of mutate()) cannot reference columns it just created. The dplyr verbs snap together; the base R equivalents each have their own interface.

In fact, dplyr and SQL are the same language wearing different syntax. Both implement relational algebra, which is why they feel similar:

dplyr SQL
filter(penguins, species == "Adelie") SELECT * FROM penguins WHERE species = 'Adelie'
group_by(species) |> summarise(n = n()) SELECT species, COUNT(*) AS n ... GROUP BY species

Call show_query() on a database-backed tibble to see the translation directly.

In category theory, a natural transformation maps one functor to another while preserving structure. filter() transforms a data frame (a container of rows) into a smaller container where columns keep their types and relationships are maintained. The clean composability of dplyr verbs is not an accident; it reflects this mathematical structure, where the transformation commutes with the container.

TipOpinion

Learn the base R equivalents. You will read them in other people’s code, in Stack Overflow answers, in older packages. But write dplyr in your own work. The verbs compose, the names communicate intent, and the argument order is consistent. When you read filter(penguins, species == "Adelie"), the operation is the function name. When you read penguins[penguins$species == "Adelie", ], you have to parse the indexing to understand what’s happening.

Exercises

  1. Rewrite select(penguins, species, island, body_mass_g) using base R bracket notation.
  2. Rewrite arrange(penguins, bill_length_mm) using order().
  3. Use aggregate() to compute the mean flipper_length_mm per island. Compare with the group_by() plus summarise() version.