library(palmerpenguins)
#>
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#>
#> penguins, penguins_raw
11 Data frames
You have a hundred measurements of penguin body mass, another hundred of bill length, and a character vector of species names. Three vectors, all the same length, all describing the same 100 birds. You could keep them as separate objects, indexing carefully, hoping that nobody reorders one without reordering the others. Or you could store them together in a structure that enforces the constraint for you: every column the same length, every row one observation, accessible with the list operations from Chapter 10.
That structure is the data frame: a list where every element is a vector of the same length. Lists hold anything; a data frame constrains that freedom to equal-length columns, and once you see this, subsetting, column operations, and the entire tidyverse make sense. The idea predates R by decades: Edgar Codd’s 1970 paper at IBM defined relations as sets of tuples, each row a typed record, and R’s data frames, SQL tables, pandas DataFrames, and Spark datasets all descend from that same relational algebra.
So what does this constraint actually buy you?
11.1 A list of equal-length vectors
Ask R what a data frame actually is:
typeof(penguins)
#> [1] "list"
is.list(penguins)
#> [1] TRUE
Type "list", because it is a list. The difference from an ordinary list is that every element (every column) must be a vector of the same length, so each row lines up across all columns; one observation, one position in every vector. Build one yourself:
df <- data.frame(
name = c("Adelie", "Gentoo"),
mass = c(3750, 5200)
)
df
#> name mass
#> 1 Adelie 3750
#> 2 Gentoo 5200
You are building a list with two elements (name and mass), both of length 2. R adds a class attribute ("data.frame") and row numbers, but the underlying structure is just a named list with bookkeeping on top. You can verify this with str():
str(df)
#> 'data.frame': 2 obs. of 2 variables:
#> $ name: chr "Adelie" "Gentoo"
#> $ mass: num 3750 5200
Two columns, two rows, and nothing more than a list wearing a label.
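Because the structure is a list underneath, the ordinary list tools apply to it. The following is a minimal sketch that rebuilds the two-row df from above so it runs on its own; the name bare is introduced here purely for illustration.

```r
# Rebuild the small example so this block is self-contained.
df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)

length(df)      # number of list elements, i.e. columns
names(df)       # element names, i.e. column names
df[["mass"]]    # extract one element: a plain numeric vector

# Strip the class attribute to see the bare list underneath.
bare <- unclass(df)
is.list(bare)
is.data.frame(bare)

# The "bookkeeping" lives in the attributes.
names(attributes(df))
```

The attributes are where the column names, the class label, and the row names are stored; remove the class and what remains is an ordinary named list.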
Type theory calls this a product type. A data frame with columns name (character) and mass (double) has type character × double: each row pairs one character value with one double value. Add a third column and the type grows: character × double × integer.
The equal-length constraint is doing more work than it appears. A plain list lets you pair a five-element vector with a three-element vector and says nothing about the mismatch. A data frame refuses. Why? Because the constraint is the guarantee: row i across all columns describes the same observation. Break the rectangle — one column longer than another — and that guarantee disappears, and every function that touches the data must re-verify alignment on its own. Codd’s 1970 relational model imposed the same rule: every tuple in a relation has the same attributes. SQL inherited it, pandas and Arrow enforce it today, and R’s data frame sits in that same lineage. The trade is flexibility for a promise: if the data is rectangular, row alignment is automatic and every subsetting operation preserves it. Logical indexing, merge(), and rbind() all depend on that promise holding.
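The contrast between a list and a data frame can be sketched directly. The names loose and result below are illustrative, and tryCatch() is used only so the error message is captured as a string instead of stopping the script.

```r
# A plain list accepts vectors of different lengths without complaint.
loose <- list(a = 1:5, b = 1:3)
lengths(loose)

# data.frame() rejects the same input; capture the error instead of stopping.
result <- tryCatch(
  data.frame(a = 1:5, b = 1:3),
  error = function(e) conditionMessage(e)
)
result
```

One nuance: data.frame() recycles a shorter column when its length divides the longest one evenly (for example 1:6 paired with 1:3), so the check is on whether a clean rectangle can be formed, not on whether the inputs were typed at identical lengths.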
So what goes into those columns when you build one from scratch?
Exercises
- Create a data frame with three columns: species (character), island (character), and bill_length (numeric), with at least three rows. Check its typeof().
- What does length(penguins) return? Why? (Hint: a data frame is a list.)
- What does names(penguins) return?
11.2 Creating data frames
data.frame() is the base R workhorse. Column names come from the argument names:
measurements <- data.frame(
species = c("Adelie", "Chinstrap", "Gentoo"),
mass_g = c(3750, 3800, 5200),
island = c("Torgersen", "Dream", "Biscoe")
)
measurements
#> species mass_g island
#> 1 Adelie 3750 Torgersen
#> 2 Chinstrap 3800 Dream
#> 3 Gentoo 5200 Biscoe
Before R 4.0, data.frame() converted character vectors to factors by default (stringsAsFactors = TRUE), which meant that strings that looked like text silently became categorical variables with integer codes underneath. This caused confusion for over two decades.
The stringsAsFactors default was wrong for 23 years and finally fixed in R 4.0 (2020). If someone’s tutorial still warns about it, the tutorial is outdated.
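To see the difference for yourself, here is a short sketch that assumes R 4.0 or later (on older versions the first call would also produce a factor). The names d1 and d2 are illustrative.

```r
# Default since R 4.0: character input stays character.
d1 <- data.frame(species = c("Adelie", "Gentoo"))
class(d1$species)

# Opting back in to the old behaviour converts strings to factors.
d2 <- data.frame(species = c("Adelie", "Gentoo"), stringsAsFactors = TRUE)
class(d2$species)
as.integer(d2$species)   # the integer codes stored underneath the labels
```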
11.2.1 Row names
Some built-in datasets carry a feature you might not expect:
head(mtcars, 3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
The car names (“Mazda RX4”, etc.) are row names, not a column. They look like data, they smell like data, but they live in a separate metadata slot that can’t contain duplicates, gets silently dropped by many functions, and behaves nothing like a real column. You can’t filter by row name without extra syntax, and you can’t have two rows with the same label.
Pretend row names don’t exist. If something is data, put it in a column. tibble::rownames_to_column() converts row names to a proper column when you inherit a dataset that uses them.
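If you would rather not load tibble for this, the same conversion can be sketched in base R. The column name model and the object name mt are choices made here for illustration, not anything mtcars defines.

```r
# Copy the row names into a real column placed first...
mt <- cbind(model = rownames(mtcars), mtcars)

# ...then reset the row names to the default 1, 2, 3, ...
rownames(mt) <- NULL

head(mt[, c("model", "mpg", "cyl")], 3)

# With the tibble package installed, the one-liner from the text would be:
# mt <- tibble::rownames_to_column(mtcars, var = "model")
```

Once the names are in a column, they can be filtered, sorted, and joined on like any other variable.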
Exercises
- Create a data frame where one column is logical (e.g., passed = c(TRUE, FALSE, TRUE)). What does str() show?
- Run row.names(mtcars) and row.names(penguins). What’s the difference?
11.3 Subsetting data frames
Because a data frame is a list, the same accessors work:
penguins$species[1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
$ extracts a column and returns a vector, the same way [[ extracts an element from a list:
penguins[["species"]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
[[ does the same thing but takes a string, which matters when the column name lives in a variable rather than in your source code:
col <- "species"
penguins[[col]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
Single brackets, by contrast, return a one-column data frame (a sub-list), not a vector:
penguins["species"][1:3, ]
#> # A tibble: 3 × 1
#> species
#> <fct>
#> 1 Adelie
#> 2 Adelie
#> 3 Adelie
Same rule as lists: [ keeps the container, [[ extracts the contents. But data frames are rectangular, which opens a second dimension.
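The three accessors can be compared side by side on a small plain data frame. This is a sketch only: df is rebuilt here so the block runs on its own, and col_mean() is a hypothetical helper, not part of any package.

```r
df <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  mass    = c(3750, 5200, 3800)
)

class(df$mass)        # the column's contents
class(df[["mass"]])   # same contents, with the name given as a string
class(df["mass"])     # still a container, with one column

# [[ is the form to use inside functions, where the name arrives as a value.
col_mean <- function(data, col) mean(data[[col]], na.rm = TRUE)
col_mean(df, "mass")
```

Writing data$col inside col_mean() would look for a column literally named col, which is why the string-based [[ form is the one that generalizes.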
11.3.1 Two-dimensional indexing
Because a data frame has rows and columns, it also accepts two indices: [row, column].
penguins[1, ]
#> # A tibble: 1 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgers… 39.1 18.7 181 3750
#> # ℹ 2 more variables: sex <fct>, year <int>
First row, all columns, returned as a data frame. Now try the reverse:
penguins[, 1]
#> # A tibble: 344 × 1
#> species
#> <fct>
#> 1 Adelie
#> 2 Adelie
#> 3 Adelie
#> 4 Adelie
#> 5 Adelie
#> 6 Adelie
#> 7 Adelie
#> 8 Adelie
#> 9 Adelie
#> 10 Adelie
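#> # ℹ 334 more rows
All rows, first column. Here is the gotcha: on a plain data frame, this form returns a vector, not a data frame. Base R “drops” the data frame structure when the result has a single column. penguins is a tibble (Section 11.6), and tibbles never drop, which is why the output above is still a one-column table:
class(penguins[, 1])
#> [1] "tbl_df" "tbl" "data.frame"
To keep the data frame structure on a plain data frame, use drop = FALSE (a tibble keeps it either way):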
class(penguins[, 1, drop = FALSE])
#> [1] "tbl_df" "tbl" "data.frame"
This drop behavior is a common source of bugs. You write a function that subsets a data frame, test it with two columns, and it works; someone passes one column, and your function breaks because it suddenly receives a vector instead of a data frame.
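The drop behavior is easiest to see on a plain data frame such as the built-in mtcars. In this sketch, first_rows() is a hypothetical helper written only to show the defensive form.

```r
class(mtcars[, 1])                  # one column selected: structure dropped
class(mtcars[, 1, drop = FALSE])    # one column selected: structure kept
class(mtcars[, 1:2])                # two columns: nothing to drop

# A subsetting helper that behaves the same for one column or many.
first_rows <- function(data, cols, n = 3) {
  data[seq_len(min(n, nrow(data))), cols, drop = FALSE]
}

is.data.frame(first_rows(mtcars, "mpg"))
is.data.frame(first_rows(mtcars, c("mpg", "cyl")))
```

Passing drop = FALSE unconditionally costs nothing when several columns are selected and removes the special case when only one is.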
penguins[1, 2]
#> # A tibble: 1 × 1
#> island
#> <fct>
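#> 1 Torgersen
Row 1, column 2: a single cell. On a plain data frame this comes back as a bare value (a vector of length 1); the tibble keeps it as a 1 × 1 table. [ with two indices reaches any cell, row, or column, but on plain data frames the drop gotcha still applies whenever your selection narrows to one column.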
11.3.2 Logical subsetting
Logical indexing from Section 4.5 works on data frames too. Put the logical vector in the row position:
heavy <- penguins[penguins$body_mass_g > 5000 & !is.na(penguins$body_mass_g), ]
nrow(heavy)
#> [1] 61
head(heavy, 3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Gentoo Biscoe 50 16.3 230 5700
#> 2 Gentoo Biscoe 50 15.2 218 5700
#> 3 Gentoo Biscoe 47.6 14.5 215 5400
#> # ℹ 2 more variables: sex <fct>, year <int>
penguins$body_mass_g > 5000 produces a logical vector, TRUE for heavy penguins and FALSE (or NA) for the rest, and placing it before the comma selects the matching rows. dplyr::filter() (Chapter 14) does the same filtering without repeating the data frame name; base R makes you write it twice, once for the condition and once for the subset.
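The !is.na() guard in the example above is doing real work. A small sketch with a made-up data frame (birds and its values are invented for illustration) shows what happens without it:

```r
birds <- data.frame(
  species = c("Adelie", "Gentoo", "Gentoo", "Adelie"),
  mass    = c(3750, 5700, NA, 3800)
)

# Without the guard, the NA in the condition produces an all-NA row.
unguarded <- birds[birds$mass > 5000, ]
unguarded

# With the guard, only rows where the test is TRUE survive.
guarded <- birds[birds$mass > 5000 & !is.na(birds$mass), ]
guarded

# which() keeps only the TRUE positions, so it skips NA as well.
via_which <- birds[which(birds$mass > 5000), ]
via_which
```

An NA in a logical row index does not mean "skip this row"; it means "return a row of NAs", which is rarely what a filter should do.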
Exercises
- Predict the class of each result before running:
  penguins$bill_length_mm
  penguins["bill_length_mm"]
  penguins[["bill_length_mm"]]
  penguins[, "bill_length_mm"]
  penguins[, "bill_length_mm", drop = FALSE]
- Extract all penguins from Biscoe island using logical subsetting. How many rows?
- What does penguins[1:5, c("species", "island")] return?
11.4 Adding and modifying columns
Assign to a new name with $:
penguins$mass_kg <- penguins$body_mass_g / 1000
head(penguins$mass_kg)
#> [1] 3.75 3.80 3.25 NA 3.45 3.65
This adds (or overwrites) a column. The right-hand side must be a vector with the same number of rows, or a single value that gets recycled (Section 4.4.1).
For modifying multiple columns at once, base R provides within():
df2 <- within(penguins, {
mass_kg <- body_mass_g / 1000
bill_ratio <- bill_length_mm / bill_depth_mm
})
head(df2[, c("mass_kg", "bill_ratio")])
#> # A tibble: 6 × 2
#> mass_kg bill_ratio
#> <dbl> <dbl>
#> 1 3.75 2.09
#> 2 3.8 2.27
#> 3 3.25 2.24
#> 4 NA NA
#> 5 3.45 1.90
#> 6 3.65 1.91
within() evaluates the expressions inside the data frame’s environment, so you can refer to columns by name without the df$ prefix, and the result is a new data frame. dplyr::mutate() (Chapter 14) has almost entirely replaced within() in modern R. You will still see it in older scripts.
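That the result is a new data frame, with the input left alone, can be checked on a small example. This is a sketch with invented values; birds and out are illustrative names.

```r
birds <- data.frame(
  mass_g         = c(3750, 3800, 3250),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm  = c(18.7, 17.4, 18.0)
)

out <- within(birds, {
  mass_kg    <- mass_g / 1000
  bill_ratio <- bill_length_mm / bill_depth_mm
})

setdiff(names(out), names(birds))   # the columns that within() added
"mass_kg" %in% names(birds)         # the input data frame is unchanged
out$mass_kg
```

Contrast this with the $ assignment earlier in the section, which modifies the data frame it is applied to.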
Exercises
- Add a column bill_area to penguins that multiplies bill_length_mm by bill_depth_mm. What is mean(penguins$bill_area, na.rm = TRUE)?
- What happens if you assign a vector of the wrong length to a new column? Try penguins$test <- 1:10.
11.5 Combining data frames
rbind() stacks rows; the column names and types must match:
batch1 <- data.frame(species = "Adelie", mass = 3750)
batch2 <- data.frame(species = "Gentoo", mass = 5200)
rbind(batch1, batch2)
#> species mass
#> 1 Adelie 3750
#> 2 Gentoo 5200
cbind() glues columns side by side; the row counts must match:
ids <- data.frame(id = c("A1", "G1"))
masses <- data.frame(mass = c(3750, 5200))
cbind(ids, masses)
#> id mass
#> 1 A1 3750
#> 2 G1 5200
When they fail, the error messages can be terse; the real problem is almost always mismatched column names in rbind() or mismatched row counts in cbind(). Check names() and nrow() before combining, and you will save yourself a frustrating debugging session.
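Those pre-checks might look like the following sketch. The data frames are invented for illustration, and tryCatch() is used only to capture each error message as a string so the script keeps running.

```r
a <- data.frame(species = "Adelie", mass = 3750)
b <- data.frame(species = "Gentoo", weight = 5200)   # "weight", not "mass"

# Compare column names before stacking.
setequal(names(a), names(b))
setdiff(names(b), names(a))      # the column that has no partner

msg_rbind <- tryCatch(rbind(a, b), error = function(e) conditionMessage(e))
msg_rbind

ids    <- data.frame(id = c("A1", "G1", "C1"))
masses <- data.frame(mass = c(3750, 5200))

# Compare row counts before gluing columns together.
nrow(ids) == nrow(masses)

msg_cbind <- tryCatch(cbind(ids, masses), error = function(e) conditionMessage(e))
msg_cbind
```

setequal() is used instead of identical() because rbind() on data frames matches columns by name, so the same names in a different order still stack correctly.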
Exercises
- Create two data frames with the same columns but different data. Stack them with rbind().
- What happens if the column names don’t match? Try it.
- Use cbind() to attach a new id column to penguins. What class is the result?
11.6 Tibbles
Print a large data frame in R and thousands of rows flood the console. Subset a single column: you silently get a vector instead of a data frame. Type a column name wrong and partial matching returns the wrong column.
A tibble is the tidyverse’s answer to all three problems:
library(tibble)
tbl <- tibble(
species = c("Adelie", "Gentoo", "Chinstrap"),
mass = c(3750, 5200, 3800)
)
tbl
#> # A tibble: 3 × 2
#> species mass
#> <chr> <dbl>
#> 1 Adelie 3750
#> 2 Gentoo 5200
#> 3 Chinstrap 3800
Three differences matter.
Printing. A tibble shows the first 10 rows, column types, and fits the screen width. No more flooding the console because you forgot head().
Subsetting. tbl[, 1] always returns a tibble, never a vector, so the drop gotcha is gone:
class(tbl[, 1])
#> [1] "tbl_df" "tbl" "data.frame"
No partial matching. With a data frame, df$sp might match df$species. A tibble refuses:
tbl$sp
#> Warning: Unknown or uninitialised column: `sp`.
#> NULL
Partial matching is autocomplete for bugs. You type df$sp, R guesses you meant df$species, and your code works until someone adds a column called spine_count. Then df$sp becomes ambiguous and silently returns NULL. A tibble catches this immediately.
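The spine_count scenario can be reproduced on a plain data frame with base R alone. This is a sketch; the column values are invented, and before and after are illustrative names.

```r
df <- data.frame(species = c("Adelie", "Gentoo"), mass = c(3750, 5200))

# Unique prefix: $ quietly returns the species column.
before <- df$sp
before

# A second column sharing the prefix makes the match ambiguous.
df$spine_count <- c(0L, 0L)
after <- df$sp
after

# [[ matches names exactly by default, so it never guesses.
df[["sp"]]
```

Setting options(warnPartialMatchDollar = TRUE) makes base R warn whenever $ falls back on a partial match, which is a cheap safeguard in scripts that must stay on plain data frames.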
A tibble is still a data frame, which means every function that accepts a data frame accepts a tibble:
is.data.frame(tbl)
#> [1] TRUE
Convert between them with as_tibble() and as.data.frame():
as_tibble(mtcars)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
Use tibbles by default. Use plain data frames when a package demands one (rare) or when you want zero dependencies.
Exercises
- Create a tibble and a data frame with the same data. Compare what [, 1] returns for each.
- Create a data frame with a column called temperature. Access it with df$temp. Then try the same with a tibble.
- What does is.list() return for a tibble?
11.7 Exploring data frames
str() is the first thing to run on a new dataset:
str(penguins)
#> tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
#> $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#> $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#> $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#> $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#> $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
#> $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
344 rows, 8 columns, the type of each column, and a preview of the values.
Other useful functions:
nrow(penguins)
#> [1] 344
ncol(penguins)
#> [1] 8
dim(penguins)
#> [1] 344 8
names(penguins)
#> [1] "species" "island" "bill_length_mm"
#> [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
#> [7] "sex" "year"
head(penguins, 3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgers… 39.1 18.7 181 3750
#> 2 Adelie Torgers… 39.5 17.4 186 3800
#> 3 Adelie Torgers… 40.3 18 195 3250
#> # ℹ 2 more variables: sex <fct>, year <int>
summary(penguins)
#> species island bill_length_mm bill_depth_mm
#> Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
#> Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
#> Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
#> Mean :43.92 Mean :17.15
#> 3rd Qu.:48.50 3rd Qu.:18.70
#> Max. :59.60 Max. :21.50
#> NA's :2 NA's :2
#> flipper_length_mm body_mass_g sex year
#> Min. :172.0 Min. :2700 female:165 Min. :2007
#> 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
#> Median :197.0 Median :4050 NA's : 11 Median :2008
#> Mean :200.9 Mean :4202 Mean :2008
#> 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
#> Max. :231.0 Max. :6300 Max. :2009
#> NA's :2 NA's :2
summary() shows min, max, quartiles, and mean for numeric columns, counts for factors, and NA counts at the bottom of each column. RStudio’s View() gives you a spreadsheet instead, useful for browsing but it gets sluggish past a few thousand rows and does nothing outside interactive sessions.
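When you need those NA counts or column types as values you can compute with, not just as printed text, a couple of base R one-liners do the job. This sketch uses a small invented data frame; birds, na_counts, and col_types are illustrative names.

```r
birds <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  mass    = c(3750, NA, 3800, NA),
  sex     = c("male", NA, "female", "female")
)

# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs.
na_counts <- colSums(is.na(birds))
na_counts

# A data frame is a list, so vapply() visits one column at a time.
col_types <- vapply(birds, function(col) class(col)[1], character(1))
col_types

mean(birds$mass, na.rm = TRUE)   # a summary statistic that ignores the NAs
```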
Exercises
- How many NA values are in the sex column of penguins? Use sum() and is.na().
- What is the mean body mass of penguins? Use mean() with na.rm = TRUE.
- Run str() on the mtcars dataset. How many columns are numeric?
11.8 Putting it together
A data frame is a list (Section 11.1), each column is a vector (Section 4.2), and subsetting uses the same [, [[, and $ operators you learned for lists (Chapter 8) and vectors (Section 4.5). Logical indexing filters rows (Section 11.3.2) with the same mechanism from Section 4.5, just applied to a two-dimensional structure. Everything connects.
One consequence of this design matters more than any syntax detail: R stores data frames column by column, each column a single contiguous vector in memory. Column operations like mean(df$x) walk sequential memory and run fast, while row operations like apply(df, 1, sum) jump between columns on every row and run slow. This is the same trade-off databases make: columnar stores (DuckDB, Parquet) optimize for analytical aggregation; row stores (PostgreSQL, MySQL) optimize for single-record transactions.
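You can get a rough feel for the difference with a timing sketch. The data here is random and the sizes are arbitrary choices for illustration. Timings vary by machine and R version, so no expected numbers are shown.

```r
set.seed(1)
wide <- as.data.frame(matrix(runif(2000 * 50), nrow = 2000))

# Column-wise: each column is already one vector, visited in turn.
col_means <- vapply(wide, mean, numeric(1))

# Row-wise: apply() converts the data frame to a matrix, then loops over rows.
row_means <- apply(wide, 1, mean)

# Repeat each a few times so the difference is measurable.
system.time(for (i in 1:20) vapply(wide, mean, numeric(1)))
system.time(for (i in 1:20) apply(wide, 1, mean))

# rowMeans() is the vectorised choice when a row summary is really needed.
all.equal(unname(row_means), unname(rowMeans(wide)))
```

If a row-wise summary is unavoidable, reaching for a vectorised function such as rowMeans() or rowSums() is usually a better first step than looping over rows.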
Chapter 12 covers the column types you will encounter most often: strings, factors, and dates. Chapter 14 introduces dplyr, which replaces the df[df$x > 5, ] syntax with readable verbs like filter(), mutate(), and summarize(). Those verbs work because data frames are lists of vectors. You started this chapter with three loose vectors and a prayer that they stayed aligned; you end it knowing why R binds them together and what that binding makes possible.