11 Data frames

You have a hundred measurements of penguin body mass, another hundred of bill length, and a character vector of species names. Three vectors, all the same length, all describing the same 100 birds. You could keep them as separate objects, indexing carefully, hoping that nobody reorders one without reordering the others. Or you could store them together in a structure that enforces the constraint for you: every column the same length, every row one observation, accessible with the list operations from Chapter 10.

That structure is the data frame: a list where every element is a vector of the same length. Lists hold anything; a data frame constrains that freedom to equal-length columns, and once you see this, subsetting, column operations, and the entire tidyverse make sense. The idea predates R by decades. Edgar Codd’s 1970 paper at IBM defined relations as sets of tuples, each row a typed record; R’s data frames, SQL tables, pandas DataFrames, and Spark datasets all descend from that relational model.

So what does this constraint actually buy you?

11.1 A list of equal-length vectors

library(palmerpenguins)
#> 
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#> 
#>     penguins, penguins_raw

Ask R what a data frame actually is:

typeof(penguins)
#> [1] "list"
is.list(penguins)
#> [1] TRUE

Type "list", because it is a list. The difference from an ordinary list is that every element (every column) must be a vector of the same length, so each row lines up across all columns; one observation, one position in every vector. Build one yourself:

df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)
df
#>     name mass
#> 1 Adelie 3750
#> 2 Gentoo 5200

You are building a list with two elements (name and mass), both of length 2. R adds a class attribute ("data.frame") and row numbers, but the underlying structure is just a named list with bookkeeping on top. You can verify this with str():

str(df)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ name: chr  "Adelie" "Gentoo"
#>  $ mass: num  3750 5200

Two columns, two rows, and nothing more than a list wearing a label.
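To see the bookkeeping directly, one option is to strip the class and look at what remains. The sketch below rebuilds the same two-column df so it runs on its own; the name plain is only an illustrative choice.

```r
df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)

# The attributes R attached on top of the plain list
names(attributes(df))

# Remove the class: what is left is an ordinary named list
# (the row.names attribute is still attached, but nothing uses it)
plain <- unclass(df)
class(plain)
#> [1] "list"
is.data.frame(plain)
#> [1] FALSE
names(plain)
#> [1] "name" "mass"
```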

Type theory calls this a product type. A data frame with columns name (character) and mass (double) has type character × double: each row pairs one character value with one double value. Add a third column and the type grows: character × double × integer.

The equal-length constraint is doing more work than it appears. A plain list lets you pair a five-element vector with a three-element vector and says nothing about the mismatch. A data frame refuses. Why? Because the constraint is the guarantee: row i across all columns describes the same observation. Break the rectangle — one column longer than another — and that guarantee disappears, and every function that touches the data must re-verify alignment on its own. Codd’s 1970 relational model imposed the same rule: every tuple in a relation has the same attributes. SQL inherited it, pandas and Arrow enforce it today, and R’s data frame sits in that same lineage. The trade is flexibility for a promise: if the data is rectangular, row alignment is automatic and every subsetting operation preserves it. Logical indexing, merge(), and rbind() all depend on that promise holding.
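The contrast is easy to check. This is a minimal sketch with made-up vectors x and y; tryCatch() is used only so the script keeps running after the error.

```r
# A plain list accepts elements of different lengths without complaint
loose <- list(x = 1:5, y = 1:3)
lengths(loose)

# data.frame() rejects the same input; capture the error message
msg <- tryCatch(
  {
    data.frame(x = 1:5, y = 1:3)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

One caveat: data.frame() recycles a shorter vector when its length divides the longest one evenly, so x = 1:4 with y = 1:2 is accepted. The result is still a rectangle, but it may not be the one you intended.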

So what goes into those columns when you build one from scratch?

Exercises

Create a data frame with three columns: species (character), island (character), and bill_length (numeric), with at least three rows. Check its typeof().
What does length(penguins) return? Why? (Hint: a data frame is a list.)
What does names(penguins) return?

11.2 Creating data frames

data.frame() is the base R workhorse. Column names come from the argument names:

measurements <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  mass_g  = c(3750, 3800, 5200),
  island  = c("Torgersen", "Dream", "Biscoe")
)
measurements
#>     species mass_g    island
#> 1    Adelie   3750 Torgersen
#> 2 Chinstrap   3800     Dream
#> 3    Gentoo   5200    Biscoe

Before R 4.0, data.frame() converted character vectors to factors by default (stringsAsFactors = TRUE), which meant that strings that looked like text silently became categorical variables with integer codes underneath. This caused confusion for over two decades.

Opinion

The stringsAsFactors default was wrong for 23 years and finally fixed in R 4.0 (2020). If someone’s tutorial still warns about it, the tutorial is outdated.
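If you want to see the two behaviors side by side, the sketch below assumes R 4.0 or later and opts in to the old default explicitly. The object names modern and legacy are illustrative.

```r
species <- c("Adelie", "Gentoo")

# Default in R >= 4.0: character stays character
modern <- data.frame(species = species)
class(modern$species)
#> [1] "character"

# Opting in to the pre-4.0 behavior
legacy <- data.frame(species = species, stringsAsFactors = TRUE)
class(legacy$species)
#> [1] "factor"

# Underneath, the factor stores integer codes
typeof(legacy$species)
#> [1] "integer"
```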

11.2.1 Row names

Some built-in datasets carry a feature you might not expect:

head(mtcars, 3)
#>                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

The car names (“Mazda RX4”, etc.) are row names, not a column. They look like data, they smell like data, but they live in a separate metadata slot that can’t contain duplicates, gets silently dropped by many functions, and behaves nothing like a real column. You can’t filter by row name without extra syntax, and you can’t have two rows with the same label.

Opinion

Pretend row names don’t exist. If something is data, put it in a column. tibble::rownames_to_column() converts row names to a proper column when you inherit a dataset that uses them.
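A minimal sketch of that conversion on mtcars, assuming the tibble package is installed. The column name "model" is an arbitrary choice, not something the dataset prescribes.

```r
library(tibble)

# Move the row names into a real column called "model"
cars <- rownames_to_column(mtcars, var = "model")
names(cars)[1]
#> [1] "model"

# The car name is now ordinary data, so ordinary filtering works
cars[cars$model == "Datsun 710", c("model", "mpg")]
```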

Exercises

Create a data frame where one column is logical (e.g., passed = c(TRUE, FALSE, TRUE)). What does str() show?
Run row.names(mtcars) and row.names(penguins). What’s the difference?

11.3 Subsetting data frames

Because a data frame is a list, the same accessors work:

penguins$species[1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

$ extracts a column and returns a vector, the same way [[ extracts an element from a list:

penguins[["species"]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

[[ does the same thing but takes a string, which matters when the column name lives in a variable rather than in your source code:

col <- "species"
penguins[[col]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
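This is what makes [[ the right tool inside functions. The helper below, col_mean(), is hypothetical and uses the built-in mtcars data so it runs without extra packages.

```r
# Hypothetical helper: `col` is a string naming the column to average
col_mean <- function(data, col) {
  mean(data[[col]], na.rm = TRUE)
}

col_mean(mtcars, "mpg")
col_mean(mtcars, "hp")

# $ would not work here: it looks for a column literally named "col"
is.null(mtcars$col)
#> [1] TRUE
```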

Single brackets, by contrast, return a one-column data frame (a sub-list), not a vector:

penguins["species"][1:3, ]
#> # A tibble: 3 × 1
#>   species
#>   <fct>  
#> 1 Adelie 
#> 2 Adelie 
#> 3 Adelie

Same rule as lists: [ keeps the container, [[ extracts the contents. But data frames are rectangular, which opens a second dimension.

11.3.1 Two-dimensional indexing

Because a data frame has rows and columns, it also accepts two indices: [row, column].

penguins[1, ]
#> # A tibble: 1 × 8
#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgers…           39.1          18.7               181        3750
#> # ℹ 2 more variables: sex <fct>, year <int>

First row, all columns, returned as a data frame. Now try the reverse:

penguins[, 1]
#> # A tibble: 344 × 1
#>    species
#>    <fct>  
#>  1 Adelie 
#>  2 Adelie 
#>  3 Adelie 
#>  4 Adelie 
#>  5 Adelie 
#>  6 Adelie 
#>  7 Adelie 
#>  8 Adelie 
#>  9 Adelie 
#> 10 Adelie 
#> # ℹ 334 more rows

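All rows, first column, and the result is still a one-column table. That is because penguins is a tibble (Section 11.6), which never simplifies. A plain data frame behaves differently, and here is the gotcha: by default, [, 1] returns a vector, not a data frame. R “drops” the data frame structure when the result has a single column. Convert penguins with as.data.frame() to see it:

class(as.data.frame(penguins)[, 1])
#> [1] "factor"

To keep the data frame structure, use drop = FALSE:

class(as.data.frame(penguins)[, 1, drop = FALSE])
#> [1] "data.frame"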

This drop behavior is a common source of bugs. You write a function that subsets a data frame, test it with two columns, and it works; someone passes one column, and your function breaks because it suddenly receives a vector instead of a data frame.
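A sketch of that failure mode on a plain data frame. Both helper functions are hypothetical, written only to show the effect.

```r
# Hypothetical helper that assumes `[` always returns a data frame
count_selected <- function(data, cols) {
  ncol(data[, cols])
}
count_selected(mtcars, c("mpg", "cyl"))
#> [1] 2
count_selected(mtcars, "mpg")  # the selection dropped to a numeric vector
#> NULL

# Defensive version: drop = FALSE keeps the result a data frame
count_selected_safe <- function(data, cols) {
  ncol(data[, cols, drop = FALSE])
}
count_selected_safe(mtcars, "mpg")
#> [1] 1
```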

penguins[1, 2]
#> # A tibble: 1 × 1
#>   island   
#>   <fct>    
#> 1 Torgersen

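Row 1, column 2: a single cell. Because penguins is a tibble, the cell comes back as a 1 × 1 tibble; a plain data frame would return the bare value, a factor of length 1. [ with two indices reaches any cell, row, or column, but on plain data frames the drop gotcha applies whenever your selection narrows to one column.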

11.3.2 Logical subsetting

Logical indexing from Section 4.5 works on data frames too. Put the logical vector in the row position:

heavy <- penguins[penguins$body_mass_g > 5000 & !is.na(penguins$body_mass_g), ]
nrow(heavy)
#> [1] 61
head(heavy, 3)
#> # A tibble: 3 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
#> 1 Gentoo  Biscoe           50            16.3               230        5700
#> 2 Gentoo  Biscoe           50            15.2               218        5700
#> 3 Gentoo  Biscoe           47.6          14.5               215        5400
#> # ℹ 2 more variables: sex <fct>, year <int>

penguins$body_mass_g > 5000 produces a logical vector: TRUE for heavy penguins, FALSE for lighter ones, and NA where the mass is missing. Placing it before the comma selects the matching rows. The !is.na() guard matters because an NA in the row position returns a row filled with NAs instead of skipping that row. dplyr::filter() (Chapter 14) does the same filtering without repeating the data frame name; base R makes you write it twice, once for the condition and once for the subset.
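The effect of a missing value in the row index is easiest to see on a small example. A minimal sketch, using a made-up four-row data frame d (not part of penguins):

```r
d <- data.frame(id = 1:4, mass = c(5200, NA, 3700, 5600))

# An NA in the row index yields a row of NAs rather than being skipped
nrow(d[d$mass > 5000, ])
#> [1] 3

# Guarding with !is.na() keeps only rows known to match
nrow(d[d$mass > 5000 & !is.na(d$mass), ])
#> [1] 2

# which() discards NA positions, giving the same two rows
nrow(d[which(d$mass > 5000), ])
#> [1] 2
```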

Exercises

Predict the class of each result before running:

penguins$bill_length_mm
penguins["bill_length_mm"]
penguins[["bill_length_mm"]]
penguins[, "bill_length_mm"]
penguins[, "bill_length_mm", drop = FALSE]

Extract all penguins from Biscoe island using logical subsetting. How many rows?
What does penguins[1:5, c("species", "island")] return?

11.4 Adding and modifying columns

Assign to a new name with $:

penguins$mass_kg <- penguins$body_mass_g / 1000
head(penguins$mass_kg)
#> [1] 3.75 3.80 3.25   NA 3.45 3.65

This adds (or overwrites) a column. The right-hand side must be a vector with the same number of rows, or a single value that gets recycled (Section 4.4.1).
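The three cases (new column from a full-length vector, new column from a single value, overwrite of an existing column) look like this on a small made-up data frame. The column names source and mass_kg are illustrative.

```r
d <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  mass_g  = c(3750, 3800, 5200)
)

d$mass_kg <- d$mass_g / 1000   # full-length vector: one value per row
d$source  <- "palmerpenguins"  # single value: recycled to every row
d$source  <- toupper(d$source) # existing name: column is overwritten

ncol(d)
#> [1] 4
d$source
#> [1] "PALMERPENGUINS" "PALMERPENGUINS" "PALMERPENGUINS"
```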

For modifying multiple columns at once, base R provides within():

df2 <- within(penguins, {
  mass_kg <- body_mass_g / 1000
  bill_ratio <- bill_length_mm / bill_depth_mm
})
head(df2[, c("mass_kg", "bill_ratio")])
#> # A tibble: 6 × 2
#>   mass_kg bill_ratio
#>     <dbl>      <dbl>
#> 1    3.75       2.09
#> 2    3.8        2.27
#> 3    3.25       2.24
#> 4   NA         NA   
#> 5    3.45       1.90
#> 6    3.65       1.91

within() evaluates the expressions in an environment built from the data frame’s columns, so you can refer to columns by name without the df$ prefix, and the result is a new data frame. dplyr::mutate() (Chapter 14) has almost entirely replaced within() in modern R, but you will still see within() in older scripts.

Exercises

Add a column bill_area to penguins that multiplies bill_length_mm by bill_depth_mm. What is mean(penguins$bill_area, na.rm = TRUE)?
What happens if you assign a vector of the wrong length to a new column? Try penguins$test <- 1:10.

11.5 Combining data frames

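rbind() stacks rows; the column names must match, and the column types should too, because rbind() coerces mismatched types to a common type rather than rejecting them: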

batch1 <- data.frame(species = "Adelie", mass = 3750)
batch2 <- data.frame(species = "Gentoo", mass = 5200)
rbind(batch1, batch2)
#>   species mass
#> 1  Adelie 3750
#> 2  Gentoo 5200

cbind() glues columns side by side; the row counts must match:

ids <- data.frame(id = c("A1", "G1"))
masses <- data.frame(mass = c(3750, 5200))
cbind(ids, masses)
#>   id mass
#> 1 A1 3750
#> 2 G1 5200

When they fail, the cause is almost always mismatched column names in rbind() or mismatched row counts in cbind(). Check names() and nrow() before combining, and you will save yourself a frustrating debugging session.
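One way to make that check routine is to wrap it in a small function. stack_checked() below is a hypothetical helper, not part of base R; it stops with an explicit message before rbind() is ever called.

```r
# Hypothetical helper: stack two data frames only if their column names agree
stack_checked <- function(a, b) {
  if (!setequal(names(a), names(b))) {
    odd <- setdiff(union(names(a), names(b)), intersect(names(a), names(b)))
    stop("column names differ: ", paste(odd, collapse = ", "))
  }
  rbind(a, b)
}

batch1 <- data.frame(species = "Adelie", mass = 3750)
batch2 <- data.frame(mass = 5200, species = "Gentoo")  # same names, other order

# rbind() matches columns by name, so the differing order is harmless
stacked <- stack_checked(batch1, batch2)
stacked
```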

Exercises

Create two data frames with the same columns but different data. Stack them with rbind().
What happens if the column names don’t match? Try it.
Use cbind() to attach a new id column to penguins. What class is the result?

11.6 Tibbles

Print a large data frame in R and thousands of rows flood the console. Subset a single column: you silently get a vector instead of a data frame. Type a column name wrong and partial matching returns the wrong column.

A tibble is the tidyverse’s answer to all three problems:

library(tibble)
tbl <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  mass = c(3750, 5200, 3800)
)
tbl
#> # A tibble: 3 × 2
#>   species    mass
#>   <chr>     <dbl>
#> 1 Adelie     3750
#> 2 Gentoo     5200
#> 3 Chinstrap  3800

Three differences matter.

Printing. A tibble prints only the first 10 rows, labels each column with its type, and fits the output to the screen width. No more flooding the console because you forgot head().

Subsetting. tbl[, 1] always returns a tibble, never a vector, so the drop gotcha is gone:

class(tbl[, 1])
#> [1] "tbl_df"     "tbl"        "data.frame"

No partial matching. With a data frame, df$sp might match df$species. A tibble refuses:

tbl$sp
#> Warning: Unknown or uninitialised column: `sp`.
#> NULL

Opinion

Partial matching is autocomplete for bugs. You type df$sp, R guesses you meant df$species, and your code works until someone adds a column called spine_count. Then df$sp becomes ambiguous and silently returns NULL. A tibble catches this immediately.
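The scenario plays out like this on a plain data frame. This sketch assumes R 4.0 or later (so species stays a character column) and the default options, under which partial matches with $ produce no warning; spine_count is the made-up column from the paragraph above.

```r
df <- data.frame(species = c("Adelie", "Gentoo"), mass = c(3750, 5200))

df$sp                        # unique prefix: silently resolves to df$species
#> [1] "Adelie" "Gentoo"

df$spine_count <- c(0L, 0L)  # a second column that shares the prefix
df$sp                        # prefix is now ambiguous
#> NULL

df[["sp"]]                   # [[ matches names exactly by default
#> NULL
```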

A tibble is still a data frame, which means almost every function that accepts a data frame accepts a tibble:

is.data.frame(tbl)
#> [1] TRUE

Convert between them with as_tibble() and as.data.frame():

as_tibble(mtcars)
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
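Notice that the car names are gone from that output: as_tibble() discards row names unless told otherwise. A minimal sketch of a round trip that keeps them, assuming the tibble package is installed; the column name "model" is an arbitrary choice.

```r
library(tibble)

# Keep the row names by turning them into a column during conversion
cars_tbl <- as_tibble(mtcars, rownames = "model")
names(cars_tbl)[1]
#> [1] "model"

# And back to a plain data frame
cars_df <- as.data.frame(cars_tbl)
class(cars_df)
#> [1] "data.frame"
```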

Opinion

Use tibbles by default. Use plain data frames when a package demands one (rare) or when you want zero dependencies.

Exercises

Create a tibble and a data frame with the same data. Compare what [, 1] returns for each.
Create a data frame with a column called temperature. Access it with df$temp. Then try the same with a tibble.
What does is.list() return for a tibble?

11.7 Exploring data frames

str() is the first thing to run on a new dataset:

str(penguins)
#> tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#>  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#>  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#>  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#>  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
#>  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

344 rows, 8 columns, the type of each column, and a preview of the values.

Other useful functions:

nrow(penguins)
#> [1] 344
ncol(penguins)
#> [1] 8
dim(penguins)
#> [1] 344   8

names(penguins)
#> [1] "species"           "island"            "bill_length_mm"   
#> [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
#> [7] "sex"               "year"

head(penguins, 3)
#> # A tibble: 3 × 8
#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgers…           39.1          18.7               181        3750
#> 2 Adelie  Torgers…           39.5          17.4               186        3800
#> 3 Adelie  Torgers…           40.3          18                 195        3250
#> # ℹ 2 more variables: sex <fct>, year <int>

summary(penguins)
#>       species          island    bill_length_mm  bill_depth_mm  
#>  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
#>  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
#>  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
#>                                  Mean   :43.92   Mean   :17.15  
#>                                  3rd Qu.:48.50   3rd Qu.:18.70  
#>                                  Max.   :59.60   Max.   :21.50  
#>                                  NA's   :2       NA's   :2      
#>  flipper_length_mm  body_mass_g       sex           year     
#>  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
#>  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
#>  Median :197.0     Median :4050   NA's  : 11   Median :2008  
#>  Mean   :200.9     Mean   :4202                Mean   :2008  
#>  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
#>  Max.   :231.0     Max.   :6300                Max.   :2009  
#>  NA's   :2         NA's   :2

summary() shows min, max, quartiles, and mean for numeric columns, counts for factors, and NA counts at the bottom of each column. RStudio’s View() gives you a spreadsheet instead: useful for browsing, but it gets sluggish past a few thousand rows and is no help outside interactive sessions.
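When you only want the missing-value counts, a shorter route is to combine is.na() with colSums(). A sketch on a small made-up data frame d:

```r
d <- data.frame(
  mass = c(3750, NA, 5200, NA),
  sex  = c("male", NA, "female", "female")
)

# is.na() on a data frame returns a logical matrix; colSums() counts the TRUEs
colSums(is.na(d))
#> mass  sex 
#>    2    1 

# Rows with no missing values at all
sum(complete.cases(d))
#> [1] 2
```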

Exercises

How many NA values are in the sex column of penguins? Use sum() and is.na().
What is the mean body mass of penguins? Use mean() with na.rm = TRUE.
Run str() on the mtcars dataset. How many columns are numeric?

11.8 Putting it together

A data frame is a list (Section 11.1), each column is a vector (Section 4.2), and subsetting uses the same [, [[, and $ operators you learned for lists (Chapter 8) and vectors (Section 4.5). Logical indexing filters rows (Section 11.3.2) with the same mechanism from Section 4.5, just applied to a two-dimensional structure. Everything connects.

One consequence of this design matters more than any syntax detail: R stores data frames column by column, each column a single contiguous vector in memory. Column operations like mean(df$x) walk sequential memory and run fast, while row operations like apply(df, 1, sum) must gather one value from every column for each row (apply() first copies the whole data frame into a matrix) and run slow. This is the same trade-off databases make: columnar stores (DuckDB, Parquet) optimize for analytical aggregation; row stores (PostgreSQL, MySQL) optimize for single-record transactions.
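A rough way to try this on your own machine is to compute the same row totals two ways and time them. The data below is random and the sizes are arbitrary; timings depend on hardware and R version, so none are shown.

```r
set.seed(1)
df <- as.data.frame(matrix(runif(2e5), ncol = 20))  # 10,000 rows, 20 columns

# Column-wise: each column is already one contiguous vector
col_means <- vapply(df, mean, numeric(1))

# Row-wise with apply(): converts to a matrix, then calls sum() once per row
row_sums_apply <- apply(df, 1, sum)

# rowSums() produces the same totals in compiled code
row_sums_fast <- rowSums(df)
all.equal(unname(row_sums_apply), unname(row_sums_fast))
#> [1] TRUE

# Compare elapsed times (output varies by machine)
system.time(apply(df, 1, sum))
system.time(rowSums(df))
```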

Chapter 12 covers the column types you will encounter most often: strings, factors, and dates. Chapter 14 introduces dplyr, which replaces the df[df$x > 5, ] syntax with readable verbs like filter(), mutate(), and summarize(). Those verbs work because data frames are lists of vectors. You started this chapter with three loose vectors and a prayer that they stayed aligned; you end it knowing why R binds them together and what that binding makes possible.