library(palmerpenguins)
#>
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#>
#> penguins, penguins_raw
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

14 Data transformation
Every data analysis is the same four moves: keep the rows you want, pick the columns you need, compute new values, summarize. dplyr gives each move a name.
14.1 The idea: verbs for data
A dplyr verb takes a data frame and returns a data frame. The data frame goes in as the first argument, a data frame comes out as the result, and nothing else changes. Every verb does one thing:
- filter() keeps rows that match a condition.
- select() picks columns by name.
- mutate() adds new columns computed from existing ones.
- summarise() collapses rows into a summary.
- arrange() reorders rows.
These five verbs cover most of what you need. You compose them for complex transformations (via pipes, Chapter 15), but each one does exactly one thing. A dplyr pipeline is function composition: each verb transforms a data frame and passes it to the next, the same composition from lambda calculus, (g ∘ f)(x) = g(f(x)), applied to tables instead of numbers.
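As a sketch of that composition (the pipe |> is covered in Chapter 15, so skim the syntax for now):

```r
library(dplyr)
library(palmerpenguins)

# Three verbs composed: each takes a data frame and returns a data frame,
# so the output of one step feeds straight into the next.
mean_mass <- penguins |>
  filter(!is.na(body_mass_g)) |>          # keep rows with a known mass
  mutate(mass_kg = body_mass_g / 1000) |> # add a computed column
  summarise(mean_kg = mean(mass_kg))      # collapse to a one-row summary

mean_mass
```

The nested spelling, summarise(mutate(filter(...), ...), ...), computes the same thing; the pipe just writes g(f(x)) in reading order.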
These verbs did not come from nowhere. In 1970, Edgar Codd, a mathematician at IBM’s San Jose Research Laboratory, published “A Relational Model of Data for Large Shared Data Banks.” At the time, databases were navigational: to find a record, you followed pointers from one record to the next, like walking a linked list. Codd proposed something radical. Forget the pointers. Think of data as tables (he called them relations), and define a small set of operations on those tables: restriction (keep rows where a predicate holds), projection (keep certain columns), extension (add computed columns), and aggregation (collapse groups to summaries). The key insight was that every operation takes a table as input and returns a table as output, so you can chain them: filter a table, then select columns from the result, then summarize that. Codd called this set of composable operations a relational algebra, “algebra” in the mathematical sense of a closed system where combining any two things always gives you back the same kind of thing. IBM initially resisted the idea. Codd’s own employer spent years trying to protect their existing database product (IMS) from the relational model. But the composability was too useful to ignore, and by the 1980s relational databases had won.
dplyr is Codd’s relational algebra with different names. filter() is restriction. select() is projection. mutate() is extension. summarise() is aggregation. arrange() handles ordering, which has no relational equivalent but is essential for display. The reason these five verbs compose so cleanly, and the reason the same ideas appear in SQL, pandas, data.table, and every other language that works with tabular data, is that they are all implementations of the same fifty-year-old algebra.
The penguins data frame (loaded by palmerpenguins) has 344 rows and 8 columns: species, island, bill measurements, flipper length, body mass, sex, and year. Every example in this chapter uses it.
penguins
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torger… 39.1 18.7 181 3750
#> 2 Adelie Torger… 39.5 17.4 186 3800
#> 3 Adelie Torger… 40.3 18 195 3250
#> 4 Adelie Torger… NA NA NA NA
#> 5 Adelie Torger… 36.7 19.3 193 3450
#> 6 Adelie Torger… 39.3 20.6 190 3650
#> 7 Adelie Torger… 38.9 17.8 181 3625
#> 8 Adelie Torger… 39.2 19.6 195 4675
#> 9 Adelie Torger… 34.1 18.1 193 3475
#> 10 Adelie Torger… 42 20.2 190 4250
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

14.2 filter(): keep rows
filter() takes a data frame and one or more conditions. It returns only the rows where every condition is TRUE.
filter(penguins, species == "Adelie")
#> # A tibble: 152 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torger… 39.1 18.7 181 3750
#> 2 Adelie Torger… 39.5 17.4 186 3800
#> 3 Adelie Torger… 40.3 18 195 3250
#> 4 Adelie Torger… NA NA NA NA
#> 5 Adelie Torger… 36.7 19.3 193 3450
#> 6 Adelie Torger… 39.3 20.6 190 3650
#> 7 Adelie Torger… 38.9 17.8 181 3625
#> 8 Adelie Torger… 39.2 19.6 195 4675
#> 9 Adelie Torger… 34.1 18.1 193 3475
#> 10 Adelie Torger… 42 20.2 190 4250
#> # ℹ 142 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

The first argument is the data frame. The second is a logical expression, the same kind you wrote in Section 8.2. Inside filter(), column names work like variables: species refers to the species column, not an object in your global environment.
Multiple conditions, separated by commas, are combined with AND:
filter(penguins, species == "Adelie", body_mass_g > 4000)

This keeps only Adelie penguins heavier than 4000 grams. Comma means "and." You can also write & explicitly, but the comma is cleaner.
For OR, use |:
filter(penguins, species == "Adelie" | species == "Chinstrap")

When testing membership in a set, %in% is clearer than chaining |:

filter(penguins, species %in% c("Adelie", "Chinstrap"))

One detail worth knowing: filter() drops rows where the condition evaluates to NA. If body_mass_g is NA for a row, then body_mass_g > 4000 is NA, and filter() excludes it. This is usually what you want. If you need to keep NA rows, test for them explicitly: filter(penguins, is.na(body_mass_g)).
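To see the NA behavior concretely, the rows split three ways: condition TRUE, condition FALSE, and condition NA (a sketch using the standard penguins data):

```r
library(dplyr)
library(palmerpenguins)

heavy   <- filter(penguins, body_mass_g > 4000)   # condition TRUE
light   <- filter(penguins, body_mass_g <= 4000)  # condition TRUE the other way
missing <- filter(penguins, is.na(body_mass_g))   # condition NA: dropped by both above

# The NA rows appear in neither comparison; only is.na() finds them.
nrow(heavy) + nrow(light) + nrow(missing) == nrow(penguins)
#> [1] TRUE
```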
Under the hood, filter() is doing the same logical subsetting you saw in Section 8.2: penguins[penguins$species == "Adelie", ]. The verb just makes the intent explicit.
Exercises
- Filter penguins to keep only female penguins from Biscoe island.
- Filter penguins to keep rows where flipper_length_mm is greater than 200 or body_mass_g is less than 3000.
- How many rows have missing values in sex? Use filter() and nrow().
14.3 select(): pick columns
select() keeps the columns you name and drops everything else.
select(penguins, species, bill_length_mm, body_mass_g)
#> # A tibble: 344 × 3
#> species bill_length_mm body_mass_g
#> <fct> <dbl> <int>
#> 1 Adelie 39.1 3750
#> 2 Adelie 39.5 3800
#> 3 Adelie 40.3 3250
#> 4 Adelie NA NA
#> 5 Adelie 36.7 3450
#> 6 Adelie 39.3 3650
#> 7 Adelie 38.9 3625
#> 8 Adelie 39.2 4675
#> 9 Adelie 34.1 3475
#> 10 Adelie 42 4250
#> # ℹ 334 more rows

The result keeps only those three columns; the rest are gone.
Drop columns by prefixing with -:
select(penguins, -year)

For more flexible selection, dplyr provides helper functions. starts_with(), ends_with(), and contains() match column names by pattern:

select(penguins, starts_with("bill"))
select(penguins, ends_with("mm"))

where() selects columns based on their content, not their name. where(is.numeric) keeps every numeric column:

select(penguins, species, where(is.numeric))

You can rename columns inside select():
select(penguins, species, mass = body_mass_g)
#> # A tibble: 344 × 2
#> species mass
#> <fct> <int>
#> 1 Adelie 3750
#> 2 Adelie 3800
#> 3 Adelie 3250
#> 4 Adelie NA
#> 5 Adelie 3450
#> 6 Adelie 3650
#> 7 Adelie 3625
#> 8 Adelie 4675
#> 9 Adelie 3475
#> 10 Adelie 4250
#> # ℹ 334 more rows

This selects species and body_mass_g, renaming the latter to mass. If you want to rename without dropping columns, use rename():

rename(penguins, mass = body_mass_g)

To reorder columns (move island and species to the front, keep everything else), use everything():

select(penguins, island, species, everything())

Exercises
- Select all columns from penguins that contain the string "length".
- Select species, island, and all numeric columns. (Hint: combine names and where().)
- What is the difference between select(penguins, mass = body_mass_g) and rename(penguins, mass = body_mass_g)? Try both.
14.4 mutate(): compute new columns
mutate() adds new columns. The original columns stay.
mutate() is the functor pattern applied to data frames. It maps a function over a column (the container), producing a new column of the same length. The container’s shape (344 rows) is preserved; only the contents change. You saw the same pattern with vectorized sqrt() on vectors (Section 4.4), with sapply() on lists (Section 7.1), and with map() on split data frames (Section 19.3). Each time, the structure is ‘apply a function inside a container, keep the container intact.’ The word for this, across all of these instances, is functor.
mutate(penguins, mass_kg = body_mass_g / 1000)
#> # A tibble: 344 × 9
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torger… 39.1 18.7 181 3750
#> 2 Adelie Torger… 39.5 17.4 186 3800
#> 3 Adelie Torger… 40.3 18 195 3250
#> 4 Adelie Torger… NA NA NA NA
#> 5 Adelie Torger… 36.7 19.3 193 3450
#> 6 Adelie Torger… 39.3 20.6 190 3650
#> 7 Adelie Torger… 38.9 17.8 181 3625
#> 8 Adelie Torger… 39.2 19.6 195 4675
#> 9 Adelie Torger… 34.1 18.1 193 3475
#> 10 Adelie Torger… 42 20.2 190 4250
#> # ℹ 334 more rows
#> # ℹ 3 more variables: sex <fct>, year <int>, mass_kg <dbl>

You can reference a column you just created in the same mutate() call:
mutate(
penguins,
mass_kg = body_mass_g / 1000,
mass_lb = mass_kg * 2.205
)

mass_lb uses mass_kg, which was defined one line above. dplyr evaluates them in order, top to bottom.
Any function that works on vectors works inside mutate(): log(), sqrt(), cumsum(), lag(), lead(), percent_rank(). The conditional functions from Section 8.7 work here too:
mutate(
penguins,
size = if_else(body_mass_g > 4000, "large", "small")
)

For more than two categories, case_when() is the tool:
mutate(
penguins,
size = case_when(
body_mass_g > 5000 ~ "large",
body_mass_g > 3500 ~ "medium",
.default = "small"
)
)

If you want to keep only the new columns and drop everything else, use .keep = "none":
mutate(
penguins,
mass_kg = body_mass_g / 1000,
.keep = "none"
)
#> # A tibble: 344 × 1
#> mass_kg
#> <dbl>
#> 1 3.75
#> 2 3.8
#> 3 3.25
#> 4 NA
#> 5 3.45
#> 6 3.65
#> 7 3.62
#> 8 4.68
#> 9 3.48
#> 10 4.25
#> # ℹ 334 more rows

Exercises
- Add a column bill_ratio that divides bill_length_mm by bill_depth_mm.
- Use mutate() and case_when() to create a column island_code that maps "Biscoe" to "B", "Dream" to "D", and "Torgersen" to "T".
- What happens if you use a column name that already exists in mutate()? Try mutate(penguins, species = toupper(species)).
14.5 summarise() and group_by(): aggregate
summarise() collapses an entire data frame into a single row:
summarise(penguins, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 1 × 1
#> mean_mass
#> <dbl>
#> 1 4202.

On its own, collapsing everything into a single row is limited. Paired with group_by(), it does the real work.
group_by() marks a data frame as grouped. It doesn’t change the data; it attaches metadata that tells subsequent verbs to operate per group:
penguins_grouped <- group_by(penguins, species)
summarise(penguins_grouped, mean_mass = mean(body_mass_g, na.rm = TRUE))
#> # A tibble: 3 × 2
#> species mean_mass
#> <fct> <dbl>
#> 1 Adelie 3701.
#> 2 Chinstrap 3733.
#> 3 Gentoo 5076.

summarise() applied mean() within each group and returned one row per species.
You can group by multiple variables:
summarise(
group_by(penguins, species, island),
mean_mass = mean(body_mass_g, na.rm = TRUE),
count = n()
)
#> `summarise()` has regrouped the output.
#> ℹ Summaries were computed grouped by species and island.
#> ℹ Output is grouped by species.
#> ℹ Use `summarise(.groups = "drop_last")` to silence this message.
#> ℹ Use `summarise(.by = c(species, island))` for per-operation grouping
#> (`?dplyr::dplyr_by`) instead.
#> # A tibble: 5 × 4
#> # Groups: species [3]
#> species island mean_mass count
#> <fct> <fct> <dbl> <int>
#> 1 Adelie Biscoe 3710. 44
#> 2 Adelie Dream 3688. 56
#> 3 Adelie Torgersen 3706. 52
#> 4 Chinstrap Dream 3733. 68
#> 5 Gentoo Biscoe 5076. 124

n() counts the number of rows in each group. n_distinct() counts unique values:
summarise(
group_by(penguins, species),
n_islands = n_distinct(island)
)
#> # A tibble: 3 × 2
#> species n_islands
#> <fct> <int>
#> 1 Adelie 3
#> 2 Chinstrap 1
#> 3 Gentoo 1

The shortcut count() does group_by() plus summarise(n = n()) in one step:
count(penguins, species)
#> # A tibble: 3 × 2
#> species n
#> <fct> <int>
#> 1 Adelie 152
#> 2 Chinstrap 68
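To confirm the shortcut, both spellings produce the same counts:

```r
library(dplyr)
library(palmerpenguins)

a <- count(penguins, species)                           # the shortcut
b <- summarise(group_by(penguins, species), n = n())    # the long form

# Same species, same counts, same column name
identical(a$n, b$n)
#> [1] TRUE
```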
#> 3 Gentoo 124After summarise(), the last grouping variable is silently peeled off. The result may still be grouped. This causes subtle bugs downstream: a mutate() you expect to operate on the full data frame operates per group instead. Always add .groups = "drop" to summarise(), or call ungroup() afterward. Be explicit about grouping state.
Exercises
- Compute the median flipper_length_mm for each species.
- Use group_by() and summarise() to find the maximum bill_length_mm per species and island combination.
- Use count() to find how many penguins were observed in each year.
14.6 arrange(): order rows
arrange() sorts rows. The default is ascending:
arrange(penguins, body_mass_g)
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Chinstr… Dream 46.9 16.6 192 2700
#> 2 Adelie Biscoe 36.5 16.6 181 2850
#> 3 Adelie Biscoe 36.4 17.1 184 2850
#> 4 Adelie Biscoe 34.5 18.1 187 2900
#> 5 Adelie Dream 33.1 16.1 178 2900
#> 6 Adelie Torge… 38.6 17 188 2900
#> 7 Chinstr… Dream 43.2 16.6 187 2900
#> 8 Adelie Biscoe 37.9 18.6 193 2925
#> 9 Adelie Dream 37.5 18.9 179 2975
#> 10 Adelie Dream 37 16.9 185 3000
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>

Wrap a column in desc() for descending order:
arrange(penguins, desc(body_mass_g))Multiple columns break ties. Sort by species alphabetically, then by mass descending within each species:
arrange(penguins, species, desc(body_mass_g))

Unlike the other verbs, arrange() ignores grouping by default. If you want to sort within groups, set .by_group = TRUE.
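A small sketch of the difference:

```r
library(dplyr)
library(palmerpenguins)

by_species <- group_by(penguins, species)

# Default: grouping is ignored, so this is a global sort
arrange(by_species, body_mass_g)

# .by_group = TRUE: sort by species first, then by mass within each species
arrange(by_species, body_mass_g, .by_group = TRUE)
```

The first version puts the 2700 g Chinstrap at the top; the second starts with the lightest Adelie, because Adelie sorts first alphabetically.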
arrange() is pure display. It does not change the data, only the order of rows. No information is added or removed. But a well-sorted table can reveal patterns that a jumbled one hides.
14.7 More verbs
The five core verbs handle most transformations. A few more are worth knowing.
distinct() returns unique rows:
distinct(penguins, species, island)
#> # A tibble: 5 × 2
#> species island
#> <fct> <fct>
#> 1 Adelie Torgersen
#> 2 Adelie Biscoe
#> 3 Adelie Dream
#> 4 Gentoo Biscoe
#> 5 Chinstrap Dream

Five combinations of species and island exist in the data.
slice_head() and slice_tail() keep the first or last n rows:
slice_head(penguins, n = 3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgers… 39.1 18.7 181 3750
#> 2 Adelie Torgers… 39.5 17.4 186 3800
#> 3 Adelie Torgers… 40.3 18 195 3250
#> # ℹ 2 more variables: sex <fct>, year <int>

slice_max() and slice_min() keep the rows with the largest or smallest values:
slice_max(penguins, body_mass_g, n = 3)
#> # A tibble: 4 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Gentoo Biscoe 49.2 15.2 221 6300
#> 2 Gentoo Biscoe 59.6 17 230 6050
#> 3 Gentoo Biscoe 51.1 16.3 220 6000
#> 4 Gentoo Biscoe 48.8 16.2 222 6000
#> # ℹ 2 more variables: sex <fct>, year <int>

Asked for three rows, slice_max() returned four: two penguins tie at 6000 grams, and ties are kept by default (set with_ties = FALSE to get exactly n rows).

pull() extracts a single column as a vector. It is the pipe-friendly equivalent of $:
pull(penguins, species) |> head()
#> [1] Adelie Adelie Adelie Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

relocate() moves columns without dropping any. Move sex and year to the front:
relocate(penguins, sex, year)

Finally, across() applies a function to multiple columns at once. Summarise all numeric columns by their mean:
summarise(penguins, across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))
#> # A tibble: 1 × 5
#> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 43.9 17.2 201. 4202. 2008.

across() pairs naturally with where(), starts_with(), and the other selection helpers from Section 14.3. It becomes easier to use after Chapter 19, where you will see the same pattern of "apply a function to many things" in a broader context.
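across() can also take a named list of functions; by default each output column gets the function's name as a suffix. A sketch, summarising the millimetre columns per species:

```r
library(dplyr)
library(palmerpenguins)

mm_summary <- summarise(
  group_by(penguins, species),
  across(
    ends_with("mm"),
    list(mean = \(x) mean(x, na.rm = TRUE),
         sd   = \(x) sd(x, na.rm = TRUE))
  ),
  .groups = "drop"
)

# Columns come out as bill_length_mm_mean, bill_length_mm_sd, and so on,
# one row per species.
mm_summary
```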
Exercises
- Use distinct() to find all unique values of island.
- Use slice_max() to find the 5 penguins with the longest flippers.
- Use across() inside summarise() to compute the standard deviation (sd()) of every numeric column, grouped by species.
14.8 The base R equivalents
Everything dplyr does, base R can do too. The verbs are convenient wrappers, not new capabilities. Here are the translations.
filter() is logical subsetting:
# dplyr
filter(penguins, species == "Adelie", body_mass_g > 4000)
# base R
penguins[penguins$species == "Adelie" & penguins$body_mass_g > 4000, ]

One wrinkle: bracket subsetting keeps rows where the condition evaluates to NA (they come back as all-NA rows), while filter() drops them.

select() is column indexing:
# dplyr
select(penguins, species, body_mass_g)
# base R
penguins[, c("species", "body_mass_g")]

mutate() is direct assignment:
# dplyr
mutate(penguins, mass_kg = body_mass_g / 1000)
# base R
penguins$mass_kg <- penguins$body_mass_g / 1000

Note the difference: the base R version modifies the data frame in place (or rather, modifies the binding). The dplyr version returns a new data frame and leaves the original untouched.
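The copy-versus-modify difference is easy to verify (a sketch, run in a fresh session):

```r
library(dplyr)
library(palmerpenguins)

p2 <- mutate(penguins, mass_kg = body_mass_g / 1000)

ncol(penguins)   # the original still has its 8 columns
#> [1] 8
ncol(p2)         # the mutated copy has 9
#> [1] 9
```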
summarise() plus group_by() maps to aggregate():
# dplyr
summarise(group_by(penguins, species), mean_mass = mean(body_mass_g, na.rm = TRUE))
# base R
aggregate(body_mass_g ~ species, data = penguins, FUN = mean, na.rm = TRUE)

arrange() is order():
# dplyr
arrange(penguins, desc(body_mass_g))
# base R
penguins[order(-penguins$body_mass_g), ]

Base R also has subset(), which combines filtering and column selection:
subset(penguins, species == "Adelie", select = c(species, body_mass_g))

subset() is clean and readable. For simple operations, it works well. The advantage of dplyr is consistency: every verb takes a data frame first and returns a data frame, so they compose. In base R, aggregate() returns different column names, order() returns indices rather than a data frame, and transform() (the base equivalent of mutate()) cannot reference columns it just created. The dplyr verbs snap together; the base R equivalents each have their own interface.
In fact, dplyr and SQL are the same language wearing different syntax. Both implement relational algebra, which is why they feel similar:
| dplyr | SQL |
|---|---|
| `filter(penguins, species == "Adelie")` | `SELECT * FROM penguins WHERE species = 'Adelie'` |
| `group_by(species) \|> summarise(n = n())` | `SELECT species, COUNT(*) AS n ... GROUP BY species` |
Call show_query() on a database-backed tibble to see the translation directly.
In category theory, a natural transformation maps one functor to another while preserving structure. filter() transforms a data frame (a container of rows) into a smaller container where columns keep their types and relationships are maintained. The clean composability of dplyr verbs is not an accident; it reflects this mathematical structure, where the transformation commutes with the container.
Learn the base R equivalents. You will read them in other people’s code, in Stack Overflow answers, in older packages. But write dplyr in your own work. The verbs compose, the names communicate intent, and the argument order is consistent. When you read filter(penguins, species == "Adelie"), the operation is the function name. When you read penguins[penguins$species == "Adelie", ], you have to parse the indexing to understand what’s happening.
Exercises
- Rewrite select(penguins, species, island, body_mass_g) using base R bracket notation.
- Rewrite arrange(penguins, bill_length_mm) using order().
- Use aggregate() to compute the mean flipper_length_mm per island. Compare with the group_by() plus summarise() version.