14  Data transformation

This chapter is about four moves: keeping rows, picking columns, computing new values from old ones, and collapsing many rows into few. The first two you can already approximate with the subsetting tools from Section 8.2; what you cannot yet do is create something that was not there before. mutate() adds new columns while the originals stay untouched — the same functor pattern from Section 4.4: apply a function inside a container, get back the same container with different contents. There, the container was a vector; here it is a data frame.

Are Gentoo penguins heavier than Adelie penguins? The penguins data has the answer, but getting there means filtering rows, picking columns, computing a mean, and grouping by species. Every data analysis you will ever write is some arrangement of those same moves, and the question that shapes this chapter is: what is the smallest set of operations that covers all of them?

14.1 The idea: verbs for data

A dplyr verb takes a data frame and returns a data frame. The data frame goes in as the first argument, a new data frame comes out, and the original is left untouched. Five verbs cover most of what you need:

  • filter() keeps rows that match a condition.
  • select() picks columns by name.
  • mutate() adds new columns computed from existing ones.
  • summarise() collapses rows into a summary.
  • arrange() reorders rows.

You compose them for complex transformations (via pipes, Chapter 15), but each one does exactly one thing. A dplyr pipeline is function composition: each verb transforms a data frame and passes it to the next, the same composition from lambda calculus, (g ∘ f)(x) = g(f(x)), applied to tables instead of numbers.

These verbs did not come from nowhere. Edgar Codd’s 1970 paper defined a relational algebra: a small set of operations on tables where every operation takes a table in and returns a table out. Restriction (keep rows), projection (keep columns), extension (add computed columns), aggregation (collapse groups). IBM resisted — they had a competing database product (IMS) — but the composability was too useful, and by the 1980s relational databases had won.

dplyr is Codd’s algebra with friendlier names. The same ideas appear in SQL, pandas, and data.table because the operations are the same; only the syntax differs. arrange() is the one exception — ordering has no relational equivalent, but you need it for display.

library(palmerpenguins)
#> 
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#> 
#>     penguins, penguins_raw
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

The penguins data frame (loaded by palmerpenguins) has 344 rows and 8 columns: species, island, bill measurements, flipper length, body mass, sex, and year. Every example in this chapter uses it.

penguins
#> # A tibble: 344 × 8
#>    species island  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>            <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torger…           39.1          18.7               181        3750
#>  2 Adelie  Torger…           39.5          17.4               186        3800
#>  3 Adelie  Torger…           40.3          18                 195        3250
#>  4 Adelie  Torger…           NA            NA                  NA          NA
#>  5 Adelie  Torger…           36.7          19.3               193        3450
#>  6 Adelie  Torger…           39.3          20.6               190        3650
#>  7 Adelie  Torger…           38.9          17.8               181        3625
#>  8 Adelie  Torger…           39.2          19.6               195        4675
#>  9 Adelie  Torger…           34.1          18.1               193        3475
#> 10 Adelie  Torger…           42            20.2               190        4250
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

14.2 filter(): keep rows

You have 344 penguins. You only care about the Adelie ones. How do you throw the rest away?

filter(penguins, species == "Adelie")
#> # A tibble: 152 × 8
#>    species island  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>            <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torger…           39.1          18.7               181        3750
#>  2 Adelie  Torger…           39.5          17.4               186        3800
#>  3 Adelie  Torger…           40.3          18                 195        3250
#>  4 Adelie  Torger…           NA            NA                  NA          NA
#>  5 Adelie  Torger…           36.7          19.3               193        3450
#>  6 Adelie  Torger…           39.3          20.6               190        3650
#>  7 Adelie  Torger…           38.9          17.8               181        3625
#>  8 Adelie  Torger…           39.2          19.6               195        4675
#>  9 Adelie  Torger…           34.1          18.1               193        3475
#> 10 Adelie  Torger…           42            20.2               190        4250
#> # ℹ 142 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

The logical expression works the same way as in Section 8.2, but notice that species here refers to the column, not to anything in your global environment. dplyr evaluates the expression inside the data frame.

Multiple conditions, separated by commas, are combined with AND:

filter(penguins, species == "Adelie", body_mass_g > 4000)

This keeps only Adelie penguins heavier than 4000 grams. Comma means “and.” You can also write & explicitly, but the comma is cleaner.
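
The equivalence is easy to check — same penguins data, assuming the libraries loaded above:

```r
library(palmerpenguins)
library(dplyr)

# Comma-separated conditions and an explicit & produce the same rows.
with_comma <- filter(penguins, species == "Adelie", body_mass_g > 4000)
with_and   <- filter(penguins, species == "Adelie" & body_mass_g > 4000)
identical(with_comma, with_and)
#> [1] TRUE
```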

For OR, use |:

filter(penguins, species == "Adelie" | species == "Chinstrap")

When testing membership in a set, %in% is clearer than chaining |:

filter(penguins, species %in% c("Adelie", "Chinstrap"))

One detail worth knowing: filter() drops rows where the condition evaluates to NA. If body_mass_g is NA for a row, then body_mass_g > 4000 is NA, and filter() excludes it. This is usually what you want, but if you need to keep NA rows, test for them explicitly: filter(penguins, is.na(body_mass_g)).
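
The mechanics are visible with a bare vector, no data frame needed:

```r
# A comparison against NA yields NA, not FALSE:
masses <- c(3500, NA, 4500)
masses > 4000
#> [1] FALSE    NA  TRUE

# filter() keeps a row only when the condition is TRUE, so the NA row
# goes the same way as the FALSE one. Base R's `[` instead keeps the
# NA position:
masses[masses > 4000]
#> [1]   NA 4500
```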

Under the hood, filter() is doing the same logical subsetting from Section 8.2: penguins[penguins$species == "Adelie", ]. The verb makes the intent explicit and, more importantly, composable with other verbs. But keeping rows is only half the story; what about columns you don’t need?

Exercises

  1. Filter penguins to keep only female penguins from Biscoe island.
  2. Filter penguins to keep rows where flipper_length_mm is greater than 200 or body_mass_g is less than 3000.
  3. How many rows have missing values in sex? Use filter() and nrow().

14.3 select(): pick columns

Eight columns is manageable. Eighty is not. When you only need species, bill length, and body mass, carrying around five extra columns clutters every printout, slows every join, and obscures what matters. select() strips a data frame down to the columns you name.

select(penguins, species, bill_length_mm, body_mass_g)
#> # A tibble: 344 × 3
#>    species bill_length_mm body_mass_g
#>    <fct>            <dbl>       <int>
#>  1 Adelie            39.1        3750
#>  2 Adelie            39.5        3800
#>  3 Adelie            40.3        3250
#>  4 Adelie            NA            NA
#>  5 Adelie            36.7        3450
#>  6 Adelie            39.3        3650
#>  7 Adelie            38.9        3625
#>  8 Adelie            39.2        4675
#>  9 Adelie            34.1        3475
#> 10 Adelie            42          4250
#> # ℹ 334 more rows

The rest are gone.

Drop columns by prefixing with -:

select(penguins, -year)

For more flexible selection, dplyr provides helper functions. starts_with(), ends_with(), and contains() match column names by pattern:

select(penguins, starts_with("bill"))
select(penguins, ends_with("mm"))

where() selects columns based on their content, not their name. where(is.numeric) keeps every numeric column:

select(penguins, species, where(is.numeric))

You can rename columns inside select():

select(penguins, species, mass = body_mass_g)
#> # A tibble: 344 × 2
#>    species  mass
#>    <fct>   <int>
#>  1 Adelie   3750
#>  2 Adelie   3800
#>  3 Adelie   3250
#>  4 Adelie     NA
#>  5 Adelie   3450
#>  6 Adelie   3650
#>  7 Adelie   3625
#>  8 Adelie   4675
#>  9 Adelie   3475
#> 10 Adelie   4250
#> # ℹ 334 more rows

This selects species and body_mass_g, renaming the latter to mass. But notice what happened: every other column is gone. Renaming inside select() is really selection-with-renaming, not renaming-with-selection. If you want a new name but need the other six columns to survive, you have to decide: do you list every column you want to keep, or do you use a different verb entirely? That trade-off between concision and preservation comes up constantly.

If you want to rename without dropping columns, use rename():

rename(penguins, mass = body_mass_g)

To reorder columns (move species to the front, keep everything else), use everything():

select(penguins, island, species, everything())

Exercises

  1. Select all columns from penguins that contain the string "length".
  2. Select species, island, and all numeric columns. (Hint: combine names and where().)
  3. What is the difference between select(penguins, mass = body_mass_g) and rename(penguins, mass = body_mass_g)? Try both.

14.4 mutate(): compute new columns

Body mass is recorded in grams, but your collaborator’s model expects kilograms.

mutate(penguins, mass_kg = body_mass_g / 1000)
#> # A tibble: 344 × 9
#>    species island  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>            <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torger…           39.1          18.7               181        3750
#>  2 Adelie  Torger…           39.5          17.4               186        3800
#>  3 Adelie  Torger…           40.3          18                 195        3250
#>  4 Adelie  Torger…           NA            NA                  NA          NA
#>  5 Adelie  Torger…           36.7          19.3               193        3450
#>  6 Adelie  Torger…           39.3          20.6               190        3650
#>  7 Adelie  Torger…           38.9          17.8               181        3625
#>  8 Adelie  Torger…           39.2          19.6               195        4675
#>  9 Adelie  Torger…           34.1          18.1               193        3475
#> 10 Adelie  Torger…           42            20.2               190        4250
#> # ℹ 334 more rows
#> # ℹ 3 more variables: sex <fct>, year <int>, mass_kg <dbl>

You can reference a column you just created in the same mutate() call:

mutate(
  penguins,
  mass_kg = body_mass_g / 1000,
  mass_lb = mass_kg * 2.205
)

mass_lb uses mass_kg, which was defined one line above; dplyr evaluates them in order, top to bottom. This is actually surprising. R normally evaluates function arguments independently of each other. If you called a plain function with two arguments, the second could not refer to the first. mutate() breaks that expectation: it processes columns sequentially, so each expression can see everything defined above it. Reverse the order and mass_lb would fail, because mass_kg would not exist yet.
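
A minimal sketch of the ordering rule, using a throwaway two-row tibble rather than penguins:

```r
library(dplyr)

df <- tibble(body_mass_g = c(3750, 3800))

# Works: mass_kg exists by the time mass_lb is evaluated.
mutate(df, mass_kg = body_mass_g / 1000, mass_lb = mass_kg * 2.205)

# Fails in a clean session with "object 'mass_kg' not found":
# mutate(df, mass_lb = mass_kg * 2.205, mass_kg = body_mass_g / 1000)
```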

Any function that works on vectors works inside mutate(): log(), sqrt(), cumsum(), lag(), lead(), percent_rank(). The conditional functions from Section 8.6 work here too:

mutate(
  penguins,
  size = if_else(body_mass_g > 4000, "large", "small")
)

For more than two categories, case_when() is the tool:

mutate(
  penguins,
  size = case_when(
    body_mass_g > 5000 ~ "large",
    body_mass_g > 3500 ~ "medium",
    .default = "small"
  )
)

If you want to keep only the new columns and drop everything else, use .keep = "none":

mutate(
  penguins,
  mass_kg = body_mass_g / 1000,
  .keep = "none"
)
#> # A tibble: 344 × 1
#>    mass_kg
#>      <dbl>
#>  1    3.75
#>  2    3.8 
#>  3    3.25
#>  4   NA   
#>  5    3.45
#>  6    3.65
#>  7    3.62
#>  8    4.68
#>  9    3.48
#> 10    4.25
#> # ℹ 334 more rows

Exercises

  1. Add a column bill_ratio that divides bill_length_mm by bill_depth_mm.
  2. Use mutate() and case_when() to create a column island_code that maps "Biscoe" to "B", "Dream" to "D", and "Torgersen" to "T".
  3. What happens if you use a column name that already exists in mutate()? Try mutate(penguins, species = toupper(species)).

14.5 summarise() and group_by(): aggregate

What is the average body mass of a penguin in this dataset?

summarise(penguins, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 1 × 1
#>   mean_mass
#>       <dbl>
#> 1     4202.

One row. One number. On its own, collapsing everything into a single summary is limited. The interesting question is not “what is the mean?” but “how does the mean differ across species?” That is where group_by() enters.

group_by() marks a data frame as grouped. It doesn’t change the data; it attaches metadata that tells subsequent verbs to operate per group:

penguins_grouped <- group_by(penguins, species)
summarise(penguins_grouped, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 3 × 2
#>   species   mean_mass
#>   <fct>         <dbl>
#> 1 Adelie        3701.
#> 2 Chinstrap     3733.
#> 3 Gentoo        5076.

summarise() applied mean() within each group and returned one row per species.

Compare the two calls. Without group_by(), summarise() collapsed everything into a single row — one mean, one answer. With group_by(species), the same verb produced three rows, one per species, without any change to the summarise() call itself. The column you group by acts as a partitioning function: it splits the data into subsets, and summarise() runs independently inside each one. SQL spells this GROUP BY; Spark uses explicit partitioning; in functional programming the same idea appears as foldByKey, where a reduce operation runs per partition rather than globally. The underlying pattern — an aggregation function paired with a partitioning structure — is the same across all of them. What makes dplyr’s version distinctive is that the partition is invisible metadata attached to the data frame — group_by() changes nothing you can see in a printout, but it changes everything summarise() does. That silent metadata is powerful, and it is also the source of the most common grouping bugs.
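
The invisibility is easy to verify — a sketch with a toy tibble rather than penguins:

```r
library(dplyr)

df <- tibble(species = c("a", "a", "b"), mass = c(10, 20, 30))
g  <- group_by(df, species)

# Same values, different bookkeeping:
identical(as.data.frame(df), as.data.frame(g))
#> [1] TRUE

class(g)       # a grouped_df wrapper around the same data
group_vars(g)  # the attached grouping metadata
#> [1] "species"
```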

You can group by multiple variables:

summarise(
  group_by(penguins, species, island),
  mean_mass = mean(body_mass_g, na.rm = TRUE),
  count = n()
)
#> `summarise()` has regrouped the output.
#> ℹ Summaries were computed grouped by species and island.
#> ℹ Output is grouped by species.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(species, island))` for per-operation grouping
#>   (`?dplyr::dplyr_by`) instead.
#> # A tibble: 5 × 4
#> # Groups:   species [3]
#>   species   island    mean_mass count
#>   <fct>     <fct>         <dbl> <int>
#> 1 Adelie    Biscoe        3710.    44
#> 2 Adelie    Dream         3688.    56
#> 3 Adelie    Torgersen     3706.    52
#> 4 Chinstrap Dream         3733.    68
#> 5 Gentoo    Biscoe        5076.   124

n() counts the number of rows in each group; n_distinct() counts unique values:

summarise(
  group_by(penguins, species),
  n_islands = n_distinct(island)
)
#> # A tibble: 3 × 2
#>   species   n_islands
#>   <fct>         <int>
#> 1 Adelie            3
#> 2 Chinstrap         1
#> 3 Gentoo            1

The shortcut count() does group_by() plus summarise(n = n()) in one step:

count(penguins, species)
#> # A tibble: 3 × 2
#>   species       n
#>   <fct>     <int>
#> 1 Adelie      152
#> 2 Chinstrap    68
#> 3 Gentoo      124

Tip: Opinion

After summarise(), the last grouping variable is silently peeled off, which means the result may still be grouped. A mutate() you expect to operate on the full data frame then operates per group instead. Add .groups = "drop" to summarise(), or call ungroup() afterward.
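
The failure mode the tip describes, sketched with toy data:

```r
library(dplyr)

df <- tibble(a = c("x", "x", "y"), b = c("p", "q", "q"), v = 1:3)

s1 <- summarise(group_by(df, a, b), total = sum(v))
is_grouped_df(s1)  # still grouped by a — a later mutate() runs per group
#> [1] TRUE

s2 <- summarise(group_by(df, a, b), total = sum(v), .groups = "drop")
is_grouped_df(s2)  # grouping fully removed
#> [1] FALSE
```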

Exercises

  1. Compute the median flipper_length_mm for each species.
  2. Use group_by() and summarise() to find the maximum bill_length_mm per species and island combination.
  3. Use count() to find how many penguins were observed in each year.

14.6 arrange(): order rows

arrange() sorts rows. The default is ascending:

arrange(penguins, body_mass_g)
#> # A tibble: 344 × 8
#>    species  island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>    <fct>           <dbl>         <dbl>             <int>       <int>
#>  1 Chinstr… Dream            46.9          16.6               192        2700
#>  2 Adelie   Biscoe           36.5          16.6               181        2850
#>  3 Adelie   Biscoe           36.4          17.1               184        2850
#>  4 Adelie   Biscoe           34.5          18.1               187        2900
#>  5 Adelie   Dream            33.1          16.1               178        2900
#>  6 Adelie   Torge…           38.6          17                 188        2900
#>  7 Chinstr… Dream            43.2          16.6               187        2900
#>  8 Adelie   Biscoe           37.9          18.6               193        2925
#>  9 Adelie   Dream            37.5          18.9               179        2975
#> 10 Adelie   Dream            37            16.9               185        3000
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

Wrap a column in desc() for descending order:

arrange(penguins, desc(body_mass_g))

Multiple columns break ties; sort by species alphabetically, then by mass descending within each species:

arrange(penguins, species, desc(body_mass_g))

Unlike the other verbs, arrange() ignores grouping by default. If you want to sort within groups, set .by_group = TRUE.
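
A sketch of the difference on a toy grouped tibble:

```r
library(dplyr)

df <- group_by(tibble(g = c("b", "a", "b", "a"), v = c(2, 4, 1, 3)), g)

# Grouping ignored: sorted by v across the whole frame.
arrange(df, v)$v
#> [1] 1 2 3 4

# Groups first, then v within each group.
arrange(df, v, .by_group = TRUE)$g
#> [1] "a" "a" "b" "b"
```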

arrange() is pure display. No information is added or removed, only the order of rows changes. But a well-sorted table can reveal patterns that a jumbled one hides, and sometimes the five core verbs are not quite enough to expose them.

14.7 More verbs

distinct() returns unique rows:

distinct(penguins, species, island)
#> # A tibble: 5 × 2
#>   species   island   
#>   <fct>     <fct>    
#> 1 Adelie    Torgersen
#> 2 Adelie    Biscoe   
#> 3 Adelie    Dream    
#> 4 Gentoo    Biscoe   
#> 5 Chinstrap Dream

Five combinations of species and island exist in the data.

slice_head() and slice_tail() keep the first or last n rows:

slice_head(penguins, n = 3)
#> # A tibble: 3 × 8
#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgers…           39.1          18.7               181        3750
#> 2 Adelie  Torgers…           39.5          17.4               186        3800
#> 3 Adelie  Torgers…           40.3          18                 195        3250
#> # ℹ 2 more variables: sex <fct>, year <int>

slice_max() and slice_min() keep the rows with the largest or smallest values. Ties are kept by default, which is why asking for n = 3 here returns four rows:

slice_max(penguins, body_mass_g, n = 3)
#> # A tibble: 4 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
#> 1 Gentoo  Biscoe           49.2          15.2               221        6300
#> 2 Gentoo  Biscoe           59.6          17                 230        6050
#> 3 Gentoo  Biscoe           51.1          16.3               220        6000
#> 4 Gentoo  Biscoe           48.8          16.2               222        6000
#> # ℹ 2 more variables: sex <fct>, year <int>

pull() extracts a single column as a vector, the pipe-friendly equivalent of $:

pull(penguins, species) |> head()
#> [1] Adelie Adelie Adelie Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

relocate() moves columns without dropping any. Move sex and year to the front:

relocate(penguins, sex, year)

Finally, across() applies a function to multiple columns at once. Summarise all numeric columns by their mean:

summarise(penguins, across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 1 × 5
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
#>            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
#> 1           43.9          17.2              201.       4202. 2008.

across() works inside mutate() too. Round every numeric column to one decimal place:

mutate(penguins, across(where(is.numeric), \(x) round(x, 1)))

across() pairs naturally with where(), starts_with(), and the other selection helpers from Section 14.3. Chapter 19 generalizes the same idea (apply a function to each element of a collection) beyond data frames. But before you build elaborate pipelines out of these verbs, it is worth seeing what they look like without dplyr at all.

Exercises

  1. Use distinct() to find all unique values of island.
  2. Use slice_max() to find the 5 penguins with the longest flippers.
  3. Use across() inside summarise() to compute the standard deviation (sd()) of every numeric column, grouped by species.

14.8 The base R equivalents

Everything dplyr does, base R can do too. The verbs are convenient wrappers, not new capabilities. So what exactly do you gain?

filter() is logical subsetting:

# dplyr
filter(penguins, species == "Adelie", body_mass_g > 4000)

# base R (nearly: `[` keeps rows where the condition is NA; filter() drops them)
penguins[penguins$species == "Adelie" & penguins$body_mass_g > 4000, ]

select() is column indexing:

# dplyr
select(penguins, species, body_mass_g)

# base R
penguins[, c("species", "body_mass_g")]

mutate() is direct assignment:

# dplyr
mutate(penguins, mass_kg = body_mass_g / 1000)

# base R
penguins$mass_kg <- penguins$body_mass_g / 1000

The base R version modifies the binding; the dplyr version returns a new data frame and leaves the original untouched. This distinction matters more as pipelines grow longer.
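
The difference is easy to check — a sketch using a throwaway data frame so the real penguins stays out of it:

```r
library(dplyr)

df <- data.frame(body_mass_g = c(3750, 3800))

out <- mutate(df, mass_kg = body_mass_g / 1000)
ncol(df)   # unchanged — mutate() returned a new data frame
#> [1] 1
ncol(out)
#> [1] 2
```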

summarise() plus group_by() maps to aggregate():

# dplyr
summarise(group_by(penguins, species), mean_mass = mean(body_mass_g, na.rm = TRUE))

# base R
aggregate(body_mass_g ~ species, data = penguins, FUN = mean, na.rm = TRUE)

arrange() is order():

# dplyr
arrange(penguins, desc(body_mass_g))

# base R
penguins[order(-penguins$body_mass_g), ]

Base R also has subset(), which combines filtering and column selection:

subset(penguins, species == "Adelie", select = c(species, body_mass_g))

subset() is clean and readable; for simple operations, it works well. Where dplyr pulls ahead is composability: every verb takes a data frame and returns one, so you can chain them. Base R’s equivalents each return something slightly different: aggregate() renames your columns, order() gives you indices instead of a data frame. So you spend time wiring them together instead of thinking about your data.
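
The order() friction is concrete — it hands you indices, not a table, so there is always one more step:

```r
library(palmerpenguins)

# order() returns row positions, heaviest first (NAs sort last);
# you still have to index with them to get a data frame back.
idx <- order(-penguins$body_mass_g)
head(penguins[idx, c("species", "body_mass_g")], 3)
```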

dplyr and SQL are the same language wearing different syntax. Both implement relational algebra, which is why they feel similar:

dplyr                                      SQL
filter(penguins, species == "Adelie")      SELECT * FROM penguins WHERE species = 'Adelie'
group_by(species) |> summarise(n = n())    SELECT species, COUNT(*) AS n ... GROUP BY species

Call show_query() on a database-backed tibble to see the translation directly.

The reason dplyr pipelines compose so cleanly is that every verb has the same type signature: data frame in, data frame out, column structure preserved. This is the same monoid pattern from Section 4.1: an associative operation (verb composition) with an identity element (a verb that returns the data frame unchanged). Compose any sequence of dplyr verbs and the result is still a data frame.

Tip: Opinion

Learn the base R equivalents. You will encounter them in other people’s code, in Stack Overflow answers, in older packages. But write dplyr in your own work, because the verbs compose, the names communicate intent, and the argument order is consistent. When you read filter(penguins, species == "Adelie"), the operation is the function name. When you read penguins[penguins$species == "Adelie", ], you have to parse the indexing to understand what’s happening. That parsing cost, multiplied across every line of a script, is exactly the kind of friction Codd’s algebra was designed to remove.

Now you have every verb you need to answer the question from the start of the chapter:

penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarise(mean_mass = mean(body_mass_g))
#> # A tibble: 3 × 2
#>   species   mean_mass
#>   <fct>         <dbl>
#> 1 Adelie        3701.
#> 2 Chinstrap     3733.
#> 3 Gentoo        5076.

Gentoo penguins average roughly 5,076 grams; Adelie, roughly 3,701. The answer was always in the data. Getting to it required three verbs and four lines.

Exercises

  1. Rewrite select(penguins, species, island, body_mass_g) using base R bracket notation.
  2. Rewrite arrange(penguins, bill_length_mm) using order().
  3. Use aggregate() to compute the mean flipper_length_mm per island. Compare with the group_by() plus summarise() version.