library(palmerpenguins)
#>
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#>
#> penguins, penguins_raw
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

14 Data transformation
So far you can keep rows and pick columns. What you cannot yet do is create something that was not there before: mutate() adds new columns while leaving the originals untouched, the same functor pattern from Section 4.4 (apply a function inside a container, get back the same kind of container with different contents). There the container was a vector; here it is a data frame. Add one more operation, collapsing many rows into few, and you have everything this chapter covers.
Are Gentoo penguins heavier than Adelie penguins? The penguins data has the answer, but getting there means filtering rows, picking columns, computing a mean, and grouping by species. Every data analysis you will ever write is some arrangement of those same moves, and the question that shapes this chapter is: what is the smallest set of operations that covers all of them?
14.1 The idea: verbs for data
A dplyr verb takes a data frame and returns a data frame. The data frame goes in first, the result comes out the same shape, and nothing else changes. Five verbs cover most of what you need:
- filter() keeps rows that match a condition.
- select() picks columns by name.
- mutate() adds new columns computed from existing ones.
- summarise() collapses rows into a summary.
- arrange() reorders rows.
You compose them for complex transformations (via pipes, Chapter 15), but each one does exactly one thing. A dplyr pipeline is function composition: each verb transforms a data frame and passes it to the next, the same composition from lambda calculus, (g ∘ f)(x) = g(f(x)), applied to tables instead of numbers.
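Written without pipes, composition is literal nesting: the inner call runs first and its result feeds the outer call. A minimal sketch using two of the verbs listed above:

```r
library(palmerpenguins)
library(dplyr)

# (g ∘ f)(penguins): filter runs first (f), arrange runs second (g)
arrange(
  filter(penguins, species == "Gentoo"),
  desc(body_mass_g)
)
```

Each step takes a data frame and returns a data frame, which is exactly what makes the nesting legal.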
These verbs did not come from nowhere. Edgar Codd’s 1970 paper defined a relational algebra: a small set of operations on tables where every operation takes a table in and returns a table out. Restriction (keep rows), projection (keep columns), extension (add computed columns), aggregation (collapse groups). IBM resisted — they had a competing database product (IMS) — but the composability was too useful, and by the 1980s relational databases had won.
dplyr is Codd’s algebra with friendlier names. The same ideas appear in SQL, pandas, and data.table because the operations are the same; only the syntax differs. arrange() is the one exception — ordering has no relational equivalent, but you need it for display.
The penguins data frame (loaded by palmerpenguins) has 344 rows and 8 columns: species, island, bill measurements, flipper length, body mass, sex, and year. Every example in this chapter uses it.
penguins
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torger… 39.1 18.7 181 3750
#> 2 Adelie Torger… 39.5 17.4 186 3800
#> 3 Adelie Torger… 40.3 18 195 3250
#> 4 Adelie Torger… NA NA NA NA
#> 5 Adelie Torger… 36.7 19.3 193 3450
#> 6 Adelie Torger… 39.3 20.6 190 3650
#> 7 Adelie Torger… 38.9 17.8 181 3625
#> 8 Adelie Torger… 39.2 19.6 195 4675
#> 9 Adelie Torger… 34.1 18.1 193 3475
#> 10 Adelie Torger… 42 20.2 190 4250
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

14.2 filter(): keep rows
You have 344 penguins. You only care about the Adelie ones. How do you throw the rest away?
filter(penguins, species == "Adelie")
#> # A tibble: 152 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torger… 39.1 18.7 181 3750
#> 2 Adelie Torger… 39.5 17.4 186 3800
#> 3 Adelie Torger… 40.3 18 195 3250
#> 4 Adelie Torger… NA NA NA NA
#> 5 Adelie Torger… 36.7 19.3 193 3450
#> 6 Adelie Torger… 39.3 20.6 190 3650
#> 7 Adelie Torger… 38.9 17.8 181 3625
#> 8 Adelie Torger… 39.2 19.6 195 4675
#> 9 Adelie Torger… 34.1 18.1 193 3475
#> 10 Adelie Torger… 42 20.2 190 4250
#> # ℹ 142 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

The logical expression works the same way as in Section 8.2, but notice that species here refers to the column, not to anything in your global environment. dplyr evaluates the expression inside the data frame.
Multiple conditions, separated by commas, are combined with AND:
filter(penguins, species == "Adelie", body_mass_g > 4000)

This keeps only Adelie penguins heavier than 4000 grams. Comma means “and.” You can also write & explicitly, but the comma is cleaner.
For OR, use |:
filter(penguins, species == "Adelie" | species == "Chinstrap")

When testing membership in a set, %in% is clearer than chaining |:
filter(penguins, species %in% c("Adelie", "Chinstrap"))

One detail worth knowing: filter() drops rows where the condition evaluates to NA. If body_mass_g is NA for a row, then body_mass_g > 4000 is NA, and filter() excludes it. This is usually what you want, but if you need to keep NA rows, test for them explicitly: filter(penguins, is.na(body_mass_g)).
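A quick way to see the NA-dropping rule is to check that three conditions partition the data: every row is either above 4000 g, at or below it, or missing. A sketch:

```r
library(palmerpenguins)
library(dplyr)

heavy   <- filter(penguins, body_mass_g > 4000)   # NA masses silently excluded
light   <- filter(penguins, body_mass_g <= 4000)  # NA masses excluded here too
missing <- filter(penguins, is.na(body_mass_g))   # the NA rows, kept explicitly

# The three filters partition the 344 rows exactly:
nrow(heavy) + nrow(light) + nrow(missing) == nrow(penguins)
#> [1] TRUE
```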
Under the hood, filter() is doing the same logical subsetting from Section 8.2: penguins[penguins$species == "Adelie", ]. The verb makes the intent explicit and, more importantly, composable with other verbs. But keeping rows is only half the story; what about columns you don’t need?
Exercises
- Filter penguins to keep only female penguins from Biscoe island.
- Filter penguins to keep rows where flipper_length_mm is greater than 200 or body_mass_g is less than 3000.
- How many rows have missing values in sex? Use filter() and nrow().
14.3 select(): pick columns
Eight columns is manageable. Eighty is not. When you only need species, bill length, and body mass, carrying around five extra columns clutters every printout, slows every join, and obscures what matters. select() strips a data frame down to the columns you name.
select(penguins, species, bill_length_mm, body_mass_g)
#> # A tibble: 344 × 3
#> species bill_length_mm body_mass_g
#> <fct> <dbl> <int>
#> 1 Adelie 39.1 3750
#> 2 Adelie 39.5 3800
#> 3 Adelie 40.3 3250
#> 4 Adelie NA NA
#> 5 Adelie 36.7 3450
#> 6 Adelie 39.3 3650
#> 7 Adelie 38.9 3625
#> 8 Adelie 39.2 4675
#> 9 Adelie 34.1 3475
#> 10 Adelie 42 4250
#> # ℹ 334 more rows

The rest are gone.
Drop columns by prefixing with -:
select(penguins, -year)

For more flexible selection, dplyr provides helper functions. starts_with(), ends_with(), and contains() match column names by pattern:
select(penguins, starts_with("bill"))
select(penguins, ends_with("mm"))

where() selects columns based on their content, not their name. where(is.numeric) keeps every numeric column:
select(penguins, species, where(is.numeric))

You can rename columns inside select():
select(penguins, species, mass = body_mass_g)
#> # A tibble: 344 × 2
#> species mass
#> <fct> <int>
#> 1 Adelie 3750
#> 2 Adelie 3800
#> 3 Adelie 3250
#> 4 Adelie NA
#> 5 Adelie 3450
#> 6 Adelie 3650
#> 7 Adelie 3625
#> 8 Adelie 4675
#> 9 Adelie 3475
#> 10 Adelie 4250
#> # ℹ 334 more rows

This selects species and body_mass_g, renaming the latter to mass. But notice what happened: every other column is gone. Renaming inside select() is really selection-with-renaming, not renaming-with-selection. If you want a new name but need the other six columns to survive, you have to decide: do you list every column you want to keep, or do you use a different verb entirely? That trade-off between concision and preservation comes up constantly.
If you want to rename without dropping columns, use rename():
rename(penguins, mass = body_mass_g)

To reorder columns (move island and species to the front, keep everything else), use everything():

select(penguins, island, species, everything())

Exercises
- Select all columns from penguins that contain the string "length".
- Select species, island, and all numeric columns. (Hint: combine names and where().)
- What is the difference between select(penguins, mass = body_mass_g) and rename(penguins, mass = body_mass_g)? Try both.
14.4 mutate(): compute new columns
Body mass is recorded in grams, but your collaborator’s model expects kilograms.
mutate(penguins, mass_kg = body_mass_g / 1000)
#> # A tibble: 344 × 9
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torger… 39.1 18.7 181 3750
#> 2 Adelie Torger… 39.5 17.4 186 3800
#> 3 Adelie Torger… 40.3 18 195 3250
#> 4 Adelie Torger… NA NA NA NA
#> 5 Adelie Torger… 36.7 19.3 193 3450
#> 6 Adelie Torger… 39.3 20.6 190 3650
#> 7 Adelie Torger… 38.9 17.8 181 3625
#> 8 Adelie Torger… 39.2 19.6 195 4675
#> 9 Adelie Torger… 34.1 18.1 193 3475
#> 10 Adelie Torger… 42 20.2 190 4250
#> # ℹ 334 more rows
#> # ℹ 3 more variables: sex <fct>, year <int>, mass_kg <dbl>

You can reference a column you just created in the same mutate() call:
mutate(
penguins,
mass_kg = body_mass_g / 1000,
mass_lb = mass_kg * 2.205
)

mass_lb uses mass_kg, which was defined one line above; dplyr evaluates them in order, top to bottom. This is actually surprising. R normally evaluates function arguments independently of each other. If you called a plain function with two arguments, the second could not refer to the first. mutate() breaks that expectation: it processes columns sequentially, so each expression can see everything defined above it. Reverse the order and mass_lb would fail, because mass_kg would not exist yet.
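You can verify the ordering claim by reversing the two lines; a sketch that fails, wrapped in try() so the error does not halt a script:

```r
library(palmerpenguins)
library(dplyr)

# Reversed order: mass_lb refers to mass_kg before it is defined,
# so mutate() errors (there is no mass_kg column and no such object yet)
try(mutate(
  penguins,
  mass_lb = mass_kg * 2.205,
  mass_kg = body_mass_g / 1000
))
```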
Any function that works on vectors works inside mutate(): log(), sqrt(), cumsum(), lag(), lead(), percent_rank(). The conditional functions from Section 8.6 work here too:
mutate(
penguins,
size = if_else(body_mass_g > 4000, "large", "small")
)

For more than two categories, case_when() is the tool:
mutate(
penguins,
size = case_when(
body_mass_g > 5000 ~ "large",
body_mass_g > 3500 ~ "medium",
.default = "small"
)
)

If you want to keep only the new columns and drop everything else, use .keep = "none":
mutate(
penguins,
mass_kg = body_mass_g / 1000,
.keep = "none"
)
#> # A tibble: 344 × 1
#> mass_kg
#> <dbl>
#> 1 3.75
#> 2 3.8
#> 3 3.25
#> 4 NA
#> 5 3.45
#> 6 3.65
#> 7 3.62
#> 8 4.68
#> 9 3.48
#> 10 4.25
#> # ℹ 334 more rows

Exercises
- Add a column bill_ratio that divides bill_length_mm by bill_depth_mm.
- Use mutate() and case_when() to create a column island_code that maps "Biscoe" to "B", "Dream" to "D", and "Torgersen" to "T".
- What happens if you use a column name that already exists in mutate()? Try mutate(penguins, species = toupper(species)).
14.5 summarise() and group_by(): aggregate
What is the average body mass of a penguin in this dataset?
summarise(penguins, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 1 × 1
#> mean_mass
#> <dbl>
#> 1 4202.

One row. One number. On its own, collapsing everything into a single summary is limited. The interesting question is not “what is the mean?” but “how does the mean differ across species?” That is where group_by() enters.
group_by() marks a data frame as grouped. It doesn’t change the data; it attaches metadata that tells subsequent verbs to operate per group:
penguins_grouped <- group_by(penguins, species)
summarise(penguins_grouped, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 3 × 2
#> species mean_mass
#> <fct> <dbl>
#> 1 Adelie 3701.
#> 2 Chinstrap 3733.
#> 3 Gentoo 5076.

summarise() applied mean() within each group and returned one row per species.
Compare the two calls. Without group_by(), summarise() collapsed everything into a single row — one mean, one answer. With group_by(species), the same verb produced three rows, one per species, without any change to the summarise() call itself. The column you group by acts as a partitioning function: it splits the data into subsets, and summarise() runs independently inside each one. SQL spells this GROUP BY; Spark uses explicit partitioning; in functional programming the same idea appears as foldByKey, where a reduce operation runs per partition rather than globally. The underlying pattern — an aggregation function paired with a partitioning structure — is the same across all of them. What makes dplyr’s version distinctive is that the partition is invisible metadata attached to the data frame — group_by() changes nothing you can see in a printout, but it changes everything summarise() does. That silent metadata is powerful, and it is also the source of the most common grouping bugs.
You can group by multiple variables:
summarise(
group_by(penguins, species, island),
mean_mass = mean(body_mass_g, na.rm = TRUE),
count = n()
)
#> `summarise()` has regrouped the output.
#> ℹ Summaries were computed grouped by species and island.
#> ℹ Output is grouped by species.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(species, island))` for per-operation grouping
#> (`?dplyr::dplyr_by`) instead.
#> # A tibble: 5 × 4
#> # Groups: species [3]
#> species island mean_mass count
#> <fct> <fct> <dbl> <int>
#> 1 Adelie Biscoe 3710. 44
#> 2 Adelie Dream 3688. 56
#> 3 Adelie Torgersen 3706. 52
#> 4 Chinstrap Dream 3733. 68
#> 5 Gentoo Biscoe 5076. 124

n() counts the number of rows in each group; n_distinct() counts unique values:
summarise(
group_by(penguins, species),
n_islands = n_distinct(island)
)
#> # A tibble: 3 × 2
#> species n_islands
#> <fct> <int>
#> 1 Adelie 3
#> 2 Chinstrap 1
#> 3 Gentoo 1

The shortcut count() does group_by() plus summarise(n = n()) in one step:
count(penguins, species)
#> # A tibble: 3 × 2
#> species n
#> <fct> <int>
#> 1 Adelie 152
#> 2 Chinstrap 68
#> 3 Gentoo 124

After summarise(), only the last grouping variable is silently peeled off, which means the result may still be grouped. A mutate() you expect to operate on the full data frame then operates per group instead. Add .groups = "drop" to summarise(), or call ungroup() afterward.
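Both fixes are one line each. A sketch, checking the result with dplyr's is_grouped_df():

```r
library(palmerpenguins)
library(dplyr)

by_both <- group_by(penguins, species, island)

kept    <- summarise(by_both, n = n())                    # still grouped by species
dropped <- summarise(by_both, n = n(), .groups = "drop")  # a plain tibble

is_grouped_df(kept)           # TRUE: the species grouping survived
is_grouped_df(dropped)        # FALSE
is_grouped_df(ungroup(kept))  # FALSE: ungroup() repairs it after the fact
```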
Exercises
- Compute the median flipper_length_mm for each species.
- Use group_by() and summarise() to find the maximum bill_length_mm per species and island combination.
- Use count() to find how many penguins were observed in each year.
14.6 arrange(): order rows
arrange() sorts rows. The default is ascending:
arrange(penguins, body_mass_g)
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Chinstr… Dream 46.9 16.6 192 2700
#> 2 Adelie Biscoe 36.5 16.6 181 2850
#> 3 Adelie Biscoe 36.4 17.1 184 2850
#> 4 Adelie Biscoe 34.5 18.1 187 2900
#> 5 Adelie Dream 33.1 16.1 178 2900
#> 6 Adelie Torge… 38.6 17 188 2900
#> 7 Chinstr… Dream 43.2 16.6 187 2900
#> 8 Adelie Biscoe 37.9 18.6 193 2925
#> 9 Adelie Dream 37.5 18.9 179 2975
#> 10 Adelie Dream 37 16.9 185 3000
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

Wrap a column in desc() for descending order:
arrange(penguins, desc(body_mass_g))

Multiple columns break ties; sort by species alphabetically, then by mass descending within each species:
arrange(penguins, species, desc(body_mass_g))

Unlike the other verbs, arrange() ignores grouping by default. If you want to sort within groups, set .by_group = TRUE.
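A sketch of the difference on a grouped data frame:

```r
library(palmerpenguins)
library(dplyr)

by_species <- group_by(penguins, species)

# Grouping ignored: one global sort, lightest penguin first overall
arrange(by_species, body_mass_g)

# Sorted within groups: the grouping variables come first, so all Adelie
# rows precede Chinstrap and Gentoo, each block lightest to heaviest
arrange(by_species, body_mass_g, .by_group = TRUE)
```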
arrange() is pure display. No information is added or removed, only the order of rows changes. But a well-sorted table can reveal patterns that a jumbled one hides, and sometimes the five core verbs are not quite enough to expose them.
14.7 More verbs
distinct() returns unique rows:
distinct(penguins, species, island)
#> # A tibble: 5 × 2
#> species island
#> <fct> <fct>
#> 1 Adelie Torgersen
#> 2 Adelie Biscoe
#> 3 Adelie Dream
#> 4 Gentoo Biscoe
#> 5 Chinstrap Dream

Five combinations of species and island exist in the data.
slice_head() and slice_tail() keep the first or last n rows:
slice_head(penguins, n = 3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgers… 39.1 18.7 181 3750
#> 2 Adelie Torgers… 39.5 17.4 186 3800
#> 3 Adelie Torgers… 40.3 18 195 3250
#> # ℹ 2 more variables: sex <fct>, year <int>

slice_max() and slice_min() keep the rows with the largest or smallest values:
slice_max(penguins, body_mass_g, n = 3)
#> # A tibble: 4 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Gentoo Biscoe 49.2 15.2 221 6300
#> 2 Gentoo Biscoe 59.6 17 230 6050
#> 3 Gentoo Biscoe 51.1 16.3 220 6000
#> 4 Gentoo Biscoe 48.8 16.2 222 6000
#> # ℹ 2 more variables: sex <fct>, year <int>

Notice the four rows for n = 3: slice_max() keeps ties by default, and two penguins weigh exactly 6000 grams. Pass with_ties = FALSE to get exactly n rows.

pull() extracts a single column as a vector, the pipe-friendly equivalent of $:
pull(penguins, species) |> head()
#> [1] Adelie Adelie Adelie Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

relocate() moves columns without dropping any. Move sex and year to the front:
relocate(penguins, sex, year)

Finally, across() applies a function to multiple columns at once. Summarise all numeric columns by their mean:
summarise(penguins, across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 1 × 5
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 43.9 17.2 201. 4202. 2008.

across() works inside mutate() too. Round every numeric column to one decimal place:
mutate(penguins, across(where(is.numeric), \(x) round(x, 1)))

across() pairs naturally with where(), starts_with(), and the other selection helpers from Section 14.3. Chapter 19 generalizes the same idea (apply a function to each element of a collection) beyond data frames. But before you build elaborate pipelines out of these verbs, it is worth seeing what they look like without dplyr at all.
Exercises
- Use distinct() to find all unique values of island.
- Use slice_max() to find the 5 penguins with the longest flippers.
- Use across() inside summarise() to compute the standard deviation (sd()) of every numeric column, grouped by species.
14.8 The base R equivalents
Everything dplyr does, base R can do too. The verbs are convenient wrappers, not new capabilities. So what exactly do you gain?
filter() is logical subsetting:
# dplyr
filter(penguins, species == "Adelie", body_mass_g > 4000)
# base R
penguins[penguins$species == "Adelie" & penguins$body_mass_g > 4000, ]

select() is column indexing:
# dplyr
select(penguins, species, body_mass_g)
# base R
penguins[, c("species", "body_mass_g")]

mutate() is direct assignment:
# dplyr
mutate(penguins, mass_kg = body_mass_g / 1000)
# base R
penguins$mass_kg <- penguins$body_mass_g / 1000

The base R version modifies the binding; the dplyr version returns a new data frame and leaves the original untouched. This distinction matters more as pipelines grow longer.
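The distinction is easy to verify: after mutate(), the new column exists in the returned data frame but not in penguins itself.

```r
library(palmerpenguins)
library(dplyr)

with_kg <- mutate(penguins, mass_kg = body_mass_g / 1000)

"mass_kg" %in% names(with_kg)   # TRUE: the copy has the new column
"mass_kg" %in% names(penguins)  # FALSE: the original is untouched
```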
summarise() plus group_by() maps to aggregate():
# dplyr
summarise(group_by(penguins, species), mean_mass = mean(body_mass_g, na.rm = TRUE))
# base R
aggregate(body_mass_g ~ species, data = penguins, FUN = mean, na.rm = TRUE)

arrange() is order():
# dplyr
arrange(penguins, desc(body_mass_g))
# base R
penguins[order(-penguins$body_mass_g), ]

Base R also has subset(), which combines filtering and column selection:
subset(penguins, species == "Adelie", select = c(species, body_mass_g))

subset() is clean and readable; for simple operations, it works well. Where dplyr pulls ahead is composability: every verb takes a data frame and returns one, so you can chain them. Base R’s equivalents each return something slightly different: aggregate() renames your columns, order() gives you indices instead of a data frame. So you spend time wiring them together instead of thinking about your data.
dplyr and SQL are the same language wearing different syntax. Both implement relational algebra, which is why they feel similar:
| dplyr | SQL |
|---|---|
| filter(penguins, species == "Adelie") | SELECT * FROM penguins WHERE species = 'Adelie' |
| group_by(species) \|> summarise(n = n()) | SELECT species, COUNT(*) AS n ... GROUP BY species |
Call show_query() on a database-backed tibble to see the translation directly.
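A sketch of show_query(), assuming the dbplyr and RSQLite packages are installed (neither is used elsewhere in this chapter). copy_to() loads penguins into a throwaway in-memory SQLite database; verbs applied to the result are translated to SQL instead of being run in R:

```r
library(palmerpenguins)
library(dplyr)

# Assumes dbplyr and RSQLite are installed
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
penguins_db <- copy_to(con, penguins)

penguins_db |>
  filter(species == "Adelie") |>
  show_query()

DBI::dbDisconnect(con)
```

The printed SQL varies slightly between dbplyr versions, but you will recognize the WHERE clause from the table above.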
The reason dplyr pipelines compose so cleanly is that every verb has the same type signature: data frame in, data frame out, column structure preserved. This is the same monoid pattern from Section 4.1: an associative operation (verb composition) with an identity element (a verb that returns the data frame unchanged). Compose any sequence of dplyr verbs and the result is still a data frame.
Learn the base R equivalents. You will encounter them in other people’s code, in Stack Overflow answers, in older packages. But write dplyr in your own work, because the verbs compose, the names communicate intent, and the argument order is consistent. When you read filter(penguins, species == "Adelie"), the operation is the function name. When you read penguins[penguins$species == "Adelie", ], you have to parse the indexing to understand what’s happening. That parsing cost, multiplied across every line of a script, is exactly the kind of friction Codd’s algebra was designed to remove.
Now you have every verb you need to answer the question from the start of the chapter:
penguins |>
filter(!is.na(body_mass_g)) |>
group_by(species) |>
summarise(mean_mass = mean(body_mass_g))
#> # A tibble: 3 × 2
#> species mean_mass
#> <fct> <dbl>
#> 1 Adelie 3701.
#> 2 Chinstrap 3733.
#> 3 Gentoo 5076.

Gentoo penguins average roughly 5,076 grams; Adelie, 3,701. The answer was always in the data. Getting to it required three verbs and four lines.
Exercises
- Rewrite select(penguins, species, island, body_mass_g) using base R bracket notation.
- Rewrite arrange(penguins, bill_length_mm) using order().
- Use aggregate() to compute the mean flipper_length_mm per island. Compare with the group_by() plus summarise() version.