11  Data frames

A data frame is a list where every element is a vector of the same length. That is the entire definition. Chapter 8 showed that lists hold anything; a data frame constrains that freedom to equal-length columns. Once you see this, subsetting, column operations, and the entire tidyverse make sense. The idea is older than R: Edgar Codd’s 1970 paper at IBM defined relations as sets of tuples, each row a tuple of typed values. R’s data frames, SQL tables, pandas DataFrames, and Spark datasets are all descendants of that relational model.

11.1 A list of equal-length vectors

library(palmerpenguins)
#> 
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#> 
#>     penguins, penguins_raw

Start by asking R what a data frame actually is:

typeof(penguins)
#> [1] "list"
is.list(penguins)
#> [1] TRUE

A data frame has type "list" because it is a list. The difference is that every element (every column) must be a vector of the same length. Each column is one vector. Each row is one observation across those vectors.

Build one yourself:

df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)
df
#>     name mass
#> 1 Adelie 3750
#> 2 Gentoo 5200

You are building a list with two elements (name and mass), both of length 2. R adds a class attribute ("data.frame") and row numbers, but the underlying structure is just a list. You can verify this with str():

str(df)
#> 'data.frame':    2 obs. of  2 variables:
#>  $ name: chr  "Adelie" "Gentoo"
#>  $ mass: num  3750 5200

str() reports two observations (rows) of two variables (columns), then lists each column with the same $ prefix it uses for the elements of any list.
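To see the bookkeeping directly, you can strip it off and put it back. This is a minimal sketch: bare and by_hand are hypothetical names, and the row.names value c(NA, -2L) is the compact form R uses internally for "two automatic row numbers".

```r
df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)

# These attributes are all that separates a data frame from a plain list
names(attributes(df))

# unclass() removes the class attribute; what remains is an ordinary list
bare <- unclass(df)
is.data.frame(bare)
#> [1] FALSE
is.list(bare)
#> [1] TRUE

# Rebuild the data frame by hand: a list plus class and row.names
by_hand <- structure(
  list(name = c("Adelie", "Gentoo"), mass = c(3750, 5200)),
  class = "data.frame",
  row.names = c(NA, -2L)
)
all.equal(by_hand, df)
#> [1] TRUE
```

In practice you would always call data.frame(), which also checks that the columns have compatible lengths; the hand-built version skips that check.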

In type theory, this structure is a product type: a tuple of named, typed fields. A data frame with columns name (character) and mass (double) has type character × double, meaning each row is a pair of one character value and one double value. Complex data structures are built from simpler combinators, and the product (pairing things together with AND) is the most basic one. Every time you add a column, you extend the product with another factor.
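One hedged way to read off that product type in R is to ask each column for its type. The pasted string below only illustrates the notation; it is not something R itself uses. The sketch assumes R 4.0 or later, where strings stay character.

```r
df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)

# One type per column: the factors of the product
col_types <- vapply(df, typeof, character(1))
col_types
#>        name        mass 
#> "character"    "double" 

# Write the row type in product notation
paste(col_types, collapse = " × ")
#> [1] "character × double"

# Adding a column extends the product with another factor
df$year <- c(2007L, 2008L)
paste(vapply(df, typeof, character(1)), collapse = " × ")
#> [1] "character × double × integer"
```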

Exercises

  1. Create a data frame with three columns: species (character), island (character), and bill_length (numeric), with at least three rows. Check its typeof().
  2. What does length(penguins) return? Why? (Hint: a data frame is a list.)
  3. What does names(penguins) return?

11.2 Creating data frames

data.frame() is the base R workhorse. Column names come from the argument names:

measurements <- data.frame(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  mass_g  = c(3750, 3800, 5200),
  island  = c("Torgersen", "Dream", "Biscoe")
)
measurements
#>     species mass_g    island
#> 1    Adelie   3750 Torgersen
#> 2 Chinstrap   3800     Dream
#> 3    Gentoo   5200    Biscoe

Before R 4.0, data.frame() converted character vectors to factors by default (stringsAsFactors = TRUE). This caused confusion for over two decades. Strings that looked like text silently became categorical variables with integer codes underneath.
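The old behaviour is still available on request, which makes it easy to see what used to happen. This sketch assumes R 4.0 or later; modern and legacy are hypothetical names.

```r
# Default since R 4.0: strings stay strings
modern <- data.frame(species = c("Adelie", "Gentoo", "Adelie"))
class(modern$species)
#> [1] "character"

# Opting in to the pre-4.0 default
legacy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie"),
  stringsAsFactors = TRUE
)
class(legacy$species)
#> [1] "factor"

# Underneath, the factor is a vector of integer codes plus a lookup table
typeof(legacy$species)
#> [1] "integer"
as.integer(legacy$species)
#> [1] 1 2 1
levels(legacy$species)
#> [1] "Adelie" "Gentoo"
```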

Opinion

The stringsAsFactors default was wrong for 23 years. It was fixed in R 4.0 (2020). If someone’s tutorial still warns about it, their tutorial is outdated.

11.2.1 Row names

Data frames have row names. Some built-in datasets use them:

head(mtcars, 3)
#>                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

The car names (“Mazda RX4”, etc.) are row names, not a column. This causes problems: row names can’t contain duplicates, they’re silently dropped by many functions, and they behave differently from columns. You can’t filter by row name without extra syntax.

Opinion

Pretend row names don’t exist. If something is data, put it in a column. tibble::rownames_to_column() converts row names to a proper column when you inherit a dataset that uses them.
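If you would rather not depend on tibble, the same conversion takes two lines of base R. A minimal sketch, where cars and the column name model are hypothetical choices:

```r
cars <- head(mtcars, 3)

# Copy the row names into an ordinary column, then reset them to 1, 2, 3, ...
cars$model <- rownames(cars)
rownames(cars) <- NULL

# The model name is now data, so it can be filtered like any other column
cars[cars$model == "Datsun 710", c("model", "mpg")]
#>        model  mpg
#> 3 Datsun 710 22.8

# With tibble installed, the one-line equivalent would be:
# tibble::rownames_to_column(head(mtcars, 3), var = "model")
```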

Exercises

  1. Create a data frame where one column is logical (e.g., passed = c(TRUE, FALSE, TRUE)). What does str() show?
  2. Run row.names(mtcars) and row.names(penguins). What’s the difference?

11.3 Subsetting data frames

Because a data frame is a list, everything from Chapter 8 works:

penguins$species[1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

$ extracts a column and returns a vector, the same way [[ extracts an element from a list.

penguins[["species"]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo

[[ does the same thing, but takes a string. This matters when the column name is stored in a variable:

col <- "species"
penguins[[col]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
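The case that bites most often is a function that receives the column name as an argument. A short sketch, with col_mean and scores as hypothetical names:

```r
# [[ looks up whatever string the variable holds
col_mean <- function(data, col) {
  mean(data[[col]], na.rm = TRUE)
}

scores <- data.frame(a = c(1, 2, 3), b = c(10, NA, 30))
col_mean(scores, "b")
#> [1] 20

# $ does not evaluate its argument: it looks for a column literally named "col"
col <- "b"
scores$col
#> NULL
```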

Single brackets return a one-column data frame (a sub-list), not a vector:

penguins["species"][1:3, ]
#> # A tibble: 3 × 1
#>   species
#>   <fct>  
#> 1 Adelie 
#> 2 Adelie 
#> 3 Adelie

This is the same [ vs [[ distinction from Chapter 8: [ keeps the container, [[ extracts the contents.

11.3.1 Two-dimensional indexing

Because a data frame is rectangular, it also takes two indices: [row, column].

penguins[1, ]
#> # A tibble: 1 × 8
#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgers…           39.1          18.7               181        3750
#> # ℹ 2 more variables: sex <fct>, year <int>

This selects the first row across all columns and returns a data frame.

penguins[, 1]
#> # A tibble: 344 × 1
#>    species
#>    <fct>  
#>  1 Adelie 
#>  2 Adelie 
#>  3 Adelie 
#>  4 Adelie 
#>  5 Adelie 
#>  6 Adelie 
#>  7 Adelie 
#>  8 Adelie 
#>  9 Adelie 
#> 10 Adelie 
#> # ℹ 334 more rows

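All rows, first column. Here is the gotcha: with a base data frame, this returns a vector, not a data frame. R “drops” the data frame structure when the result has a single column. The output above stays a data frame only because palmerpenguins ships penguins as a tibble (Section 11.6), and tibbles never drop. Convert to a plain data frame to see the base behaviour:

class(as.data.frame(penguins)[, 1])
#> [1] "factor"

To keep the data frame structure, use drop = FALSE:

class(as.data.frame(penguins)[, 1, drop = FALSE])
#> [1] "data.frame"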

The drop gotcha is a common source of bugs. You write a function that subsets a data frame, test it with two columns, and it works. Someone passes one column, and your function breaks because it suddenly receives a vector instead of a data frame.
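That failure mode is easy to reproduce with a plain data frame. In this sketch, fragile_count, robust_count, and birds are hypothetical names:

```r
# Works only while the selection has two or more columns
fragile_count <- function(data, cols) {
  nrow(data[, cols])
}

# drop = FALSE guarantees a data frame comes back
robust_count <- function(data, cols) {
  nrow(data[, cols, drop = FALSE])
}

birds <- data.frame(
  mass    = c(3750, 3800, 5200),
  flipper = c(181, 186, 230)
)

fragile_count(birds, c("mass", "flipper"))
#> [1] 3
fragile_count(birds, "mass")   # the subset is now a vector, which has no rows
#> NULL
robust_count(birds, "mass")
#> [1] 3
```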

penguins[1, 2]
#> # A tibble: 1 × 1
#>   island   
#>   <fct>    
#> 1 Torgersen

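Row 1, column 2. With a base data frame this returns a single value (a vector of length 1). A tibble keeps even a single cell wrapped as a 1 × 1 tibble, as the output shows; use penguins$island[1] to get the bare value.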

11.3.2 Logical subsetting

Logical indexing from Section 4.5 works on data frames too. Put the logical vector in the row position:

heavy <- penguins[penguins$body_mass_g > 5000 & !is.na(penguins$body_mass_g), ]
nrow(heavy)
#> [1] 61
head(heavy, 3)
#> # A tibble: 3 × 8
#>   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
#> 1 Gentoo  Biscoe           50            16.3               230        5700
#> 2 Gentoo  Biscoe           50            15.2               218        5700
#> 3 Gentoo  Biscoe           47.6          14.5               215        5400
#> # ℹ 2 more variables: sex <fct>, year <int>

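penguins$body_mass_g > 5000 produces a logical vector: TRUE for heavy penguins, FALSE for light ones, and NA where the mass is missing. The !is.na() term turns those NAs into FALSE; without it, each NA would come back as a row filled with NA. Placing the vector before the comma selects rows. dplyr::filter() (Chapter 14) does the same job and drops the NA rows for you.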
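The effect of a missing value in the condition is easiest to see on a tiny plain data frame. A sketch with hypothetical data (birds):

```r
birds <- data.frame(
  id   = c("a", "b", "c"),
  mass = c(3750, NA, 5200)
)

# The comparison carries the NA through
birds$mass > 5000
#> [1] FALSE    NA  TRUE

# An NA in the row position yields an extra row made entirely of NA
nrow(birds[birds$mass > 5000, ])
#> [1] 2

# Guarding with !is.na() keeps only the real match
nrow(birds[birds$mass > 5000 & !is.na(birds$mass), ])
#> [1] 1

# which() is a common alternative: it returns positions of TRUE and skips NA
nrow(birds[which(birds$mass > 5000), ])
#> [1] 1
```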

Exercises

  1. Predict the class of each result before running:

    penguins$bill_length_mm
    penguins["bill_length_mm"]
    penguins[["bill_length_mm"]]
    penguins[, "bill_length_mm"]
    penguins[, "bill_length_mm", drop = FALSE]

  2. Extract all penguins from Biscoe island using logical subsetting. How many rows?

  3. What does penguins[1:5, c("species", "island")] return?

11.4 Adding and modifying columns

Assign to a new name with $:

penguins$mass_kg <- penguins$body_mass_g / 1000
head(penguins$mass_kg)
#> [1] 3.75 3.80 3.25   NA 3.45 3.65

This adds (or overwrites) a column. The right-hand side must be a vector with the same number of rows, or a single value that gets recycled (Section 4.4.1).
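Both cases, a full-length vector and a recycled single value, look like this on a small example. The data frame birds and its columns are hypothetical:

```r
birds <- data.frame(mass_g = c(3750, 3800, 5200))

# Full-length vector: one value per row
birds$mass_kg <- birds$mass_g / 1000

# Single value: recycled to every row
birds$source <- "field"
birds$source
#> [1] "field" "field" "field"

# Assigning to an existing name overwrites the column instead of adding one
birds$source <- "museum"
names(birds)
#> [1] "mass_g"  "mass_kg" "source"
```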

For modifying multiple columns at once, base R provides within():

df2 <- within(penguins, {
  mass_kg <- body_mass_g / 1000
  bill_ratio <- bill_length_mm / bill_depth_mm
})
head(df2[, c("mass_kg", "bill_ratio")])
#> # A tibble: 6 × 2
#>   mass_kg bill_ratio
#>     <dbl>      <dbl>
#> 1    3.75       2.09
#> 2    3.8        2.27
#> 3    3.25       2.24
#> 4   NA         NA   
#> 5    3.45       1.90
#> 6    3.65       1.91

within() evaluates the expressions in an environment built from the data frame’s columns, so you can refer to columns by name without the df$ prefix. The result is a new data frame; the original is left untouched. dplyr::mutate() does this more cleanly, which you will see in Chapter 14.
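A small check makes the copy semantics concrete. In this sketch birds and birds2 are hypothetical names; the position of the new columns is deliberately not asserted, since within() does not promise to add them in the order they were assigned.

```r
birds <- data.frame(
  mass_g  = c(3750, 5200),
  bill_mm = c(39.1, 50)
)

birds2 <- within(birds, {
  mass_kg <- mass_g / 1000
  bill_cm <- bill_mm / 10
})

# The input is unchanged
names(birds)
#> [1] "mass_g"  "bill_mm"

# The result carries the two extra columns
sort(setdiff(names(birds2), names(birds)))
#> [1] "bill_cm" "mass_kg"
birds2$mass_kg
#> [1] 3.75 5.20
```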

Exercises

  1. Add a column bill_area to penguins that multiplies bill_length_mm by bill_depth_mm. What is mean(penguins$bill_area, na.rm = TRUE)?
  2. What happens if you assign a vector of the wrong length to a new column? Try penguins$test <- 1:10.

11.5 Combining data frames

rbind() stacks rows. The column names must match; columns are paired by name, not by position:

batch1 <- data.frame(species = "Adelie", mass = 3750)
batch2 <- data.frame(species = "Gentoo", mass = 5200)
rbind(batch1, batch2)
#>   species mass
#> 1  Adelie 3750
#> 2  Gentoo 5200

cbind() glues columns. The row counts must match:

ids <- data.frame(id = c("A1", "G1"))
masses <- data.frame(mass = c(3750, 5200))
cbind(ids, masses)
#>   id mass
#> 1 A1 3750
#> 2 G1 5200

When they fail, the error messages are terse. rbind() complains that names do not match previous names, and cbind() complains that the arguments imply differing numbers of rows; neither tells you which column or which input is the culprit. Check names() and nrow() before combining.
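One way to follow that advice is to wrap the check in a helper that names the offending columns. This is a sketch only: stack_checked is a hypothetical function, not part of base R.

```r
# Stack two data frames, but fail with a readable message if the names differ
stack_checked <- function(a, b) {
  if (!setequal(names(a), names(b))) {
    odd <- union(setdiff(names(a), names(b)), setdiff(names(b), names(a)))
    stop("column names differ: ", paste(odd, collapse = ", "))
  }
  rbind(a, b)
}

batch1 <- data.frame(species = "Adelie", mass = 3750)
batch2 <- data.frame(mass = 5200, species = "Gentoo")  # same names, other order

# rbind() pairs columns by name, so the different order is harmless
stack_checked(batch1, batch2)
#>   species mass
#> 1  Adelie 3750
#> 2  Gentoo 5200

# A genuinely different name is reported explicitly
batch3 <- data.frame(species = "Chinstrap", weight = 3800)
tryCatch(stack_checked(batch1, batch3), error = conditionMessage)
#> [1] "column names differ: mass, weight"
```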

Exercises

  1. Create two data frames with the same columns but different data. Stack them with rbind().
  2. What happens if the column names don’t match? Try it.
  3. Use cbind() to attach a new id column to penguins. What class is the result?

11.6 Tibbles

A tibble is a modern data frame from the tidyverse:

library(tibble)
tbl <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  mass = c(3750, 5200, 3800)
)
tbl
#> # A tibble: 3 × 2
#>   species    mass
#>   <chr>     <dbl>
#> 1 Adelie     3750
#> 2 Gentoo     5200
#> 3 Chinstrap  3800

Three differences matter:

Printing. A tibble prints only the first 10 rows, shows each column’s type, and trims the output to fit the screen. No more flooding the console with 10,000 rows because you forgot head().

Subsetting. tbl[, 1] always returns a tibble, never a vector. The drop gotcha is gone:

class(tbl[, 1])
#> [1] "tbl_df"     "tbl"        "data.frame"

No partial matching. With a data frame, df$sp might match df$species. A tibble refuses:

tbl$sp
#> Warning: Unknown or uninitialised column: `sp`.
#> NULL

Opinion

Partial matching is autocomplete for bugs. You type df$sp, R guesses you meant df$species, and your code works until someone adds a column called spine_count. Then df$sp becomes ambiguous and silently returns NULL. A tibble catches this immediately.
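The scenario takes four lines to reproduce with a base data frame. The column names here are hypothetical, chosen to match the story:

```r
birds <- data.frame(species = c("Adelie", "Gentoo"))

# "sp" is a unique prefix, so $ quietly expands it
birds$sp
#> [1] "Adelie" "Gentoo"

# A second column starting with "sp" makes the prefix ambiguous
birds$spine_count <- c(12L, 14L)
birds$sp
#> NULL
```

No warning or error is raised under R's default options, which is what makes the failure hard to spot.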

A tibble is still a data frame. Almost every function that takes a data frame takes a tibble:

is.data.frame(tbl)
#> [1] TRUE

Convert between them with as_tibble() and as.data.frame():

as_tibble(mtcars)
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

Opinion

Use tibbles by default. Use data frames when a package demands one (rare) or when you want zero dependencies.
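When a package does demand a plain data frame, the round trip looks roughly like this. The sketch assumes the tibble package is installed; cars_tbl, cars_df, and the column name model are hypothetical. The rownames argument of as_tibble() keeps the row names as a column instead of discarding them.

```r
library(tibble)

# Data frame -> tibble, turning the row names into a real column
cars_tbl <- as_tibble(mtcars, rownames = "model")
class(cars_tbl)
#> [1] "tbl_df"     "tbl"        "data.frame"
names(cars_tbl)[1:3]
#> [1] "model" "mpg"   "cyl"

# Tibble -> plain data frame
cars_df <- as.data.frame(cars_tbl)
class(cars_df)
#> [1] "data.frame"

# The base subsetting rules are back after the conversion
class(cars_df[, "mpg"])
#> [1] "numeric"
class(cars_tbl[, "mpg"])
#> [1] "tbl_df"     "tbl"        "data.frame"
```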

Exercises

  1. Create a tibble and a data frame with the same data. Compare what [, 1] returns for each.
  2. Create a data frame with a column called temperature. Access it with df$temp. Then try the same with a tibble.
  3. What does is.list() return for a tibble?

11.7 Exploring data frames

You have met str() already. It is the single most useful function for understanding any R object:

str(penguins)
#> tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#>  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#>  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#>  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#>  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
#>  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

The output tells you: 344 rows, 8 columns, the type of each column, and a preview of the values. This is the first thing to run when you load a new dataset.

Other useful functions:

nrow(penguins)
#> [1] 344
ncol(penguins)
#> [1] 8
dim(penguins)
#> [1] 344   8
names(penguins)
#> [1] "species"           "island"            "bill_length_mm"   
#> [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
#> [7] "sex"               "year"
head(penguins, 3)
#> # A tibble: 3 × 8
#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgers…           39.1          18.7               181        3750
#> 2 Adelie  Torgers…           39.5          17.4               186        3800
#> 3 Adelie  Torgers…           40.3          18                 195        3250
#> # ℹ 2 more variables: sex <fct>, year <int>
summary(penguins)
#>       species          island    bill_length_mm  bill_depth_mm  
#>  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
#>  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
#>  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
#>                                  Mean   :43.92   Mean   :17.15  
#>                                  3rd Qu.:48.50   3rd Qu.:18.70  
#>                                  Max.   :59.60   Max.   :21.50  
#>                                  NA's   :2       NA's   :2      
#>  flipper_length_mm  body_mass_g       sex           year     
#>  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
#>  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
#>  Median :197.0     Median :4050   NA's  : 11   Median :2008  
#>  Mean   :200.9     Mean   :4202                Mean   :2008  
#>  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
#>  Max.   :231.0     Max.   :6300                Max.   :2009  
#>  NA's   :2         NA's   :2

summary() gives per-column summaries: min, max, mean, and quartiles for numeric columns; counts for factors. NA counts appear at the bottom of each column.
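When you only want the missing-value counts or the column classes, a one-liner is often quicker to scan than the full summary. A sketch on hypothetical data (birds):

```r
birds <- data.frame(
  species = c("Adelie", "Gentoo", "Gentoo"),
  mass    = c(3750, NA, 5200),
  sex     = c("male", NA, NA)
)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUE per column
colSums(is.na(birds))
#> species    mass     sex 
#>       0       1       2 

# Because a data frame is a list, vapply() visits one column at a time
vapply(birds, function(col) class(col)[1], character(1))
#>     species        mass         sex 
#> "character"   "numeric" "character" 
```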

In RStudio, View(penguins) opens a spreadsheet-style viewer. Useful for browsing, but not for code: View() is interactive and does nothing in a script.

Exercises

  1. How many NA values are in the sex column of penguins? Use sum() and is.na().
  2. What is the mean body mass of penguins? Use mean() with na.rm = TRUE.
  3. Run str() on the mtcars dataset. How many columns are numeric?

11.8 Putting it together

A data frame is a list (Section 11.1). Each column is a vector (Section 4.2). Subsetting uses the same [, [[, and $ operators you learned for lists (Chapter 8) and vectors (Section 4.5). Logical indexing filters rows (Section 11.3.2), the same mechanism from Section 4.5 applied to a two-dimensional structure.

One consequence of this structure: R stores data frames column by column. Each column is one contiguous vector in memory, so column operations (mean(df$x)) are fast (sequential memory access) while row operations (apply(df, 1, sum)) are slow (jumping between columns). This is the same trade-off databases make: columnar stores (DuckDB, Parquet) optimize for analytical aggregation; row stores (PostgreSQL, MySQL) optimize for single-record transactions.
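A rough way to feel the difference on your own machine is to time both directions. This is a sketch under assumptions: wide is a hypothetical data frame of random numbers, and no timings are shown because they depend on hardware and R version. Part of the cost on the row side is that apply() first copies the data frame into a matrix.

```r
set.seed(1)
wide <- as.data.frame(matrix(runif(1e5), nrow = 1e4, ncol = 10))

# Column-wise: each mean() call walks one contiguous vector
col_time <- system.time(col_means <- vapply(wide, mean, numeric(1)))

# Row-wise: apply() converts to a matrix, then calls sum() once per row
row_time <- system.time(row_sums <- apply(wide, 1, sum))

# rowSums() computes the same result without the per-row R function calls
row_sums_fast <- rowSums(wide)
all.equal(unname(row_sums), unname(row_sums_fast))
#> [1] TRUE

# The matrix conversion also flattens mixed column types to one type
mixed <- data.frame(a = 1:2, b = c("x", "y"))
apply(mixed, 1, class)
#> [1] "character" "character"

# Inspect col_time and row_time to compare on your machine
```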

The next chapters build on this foundation. Chapter 12 covers the column types you will encounter most often: strings, factors, and dates. Chapter 14 introduces dplyr, which replaces the df[df$x > 5, ] syntax with readable verbs like filter(), mutate(), and summarize(). Those verbs work because data frames are lists of vectors, and you now understand what that representation means in practice.