library(palmerpenguins)
#>
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#>
#> penguins, penguins_raw
11 Data frames
A data frame is a list where every element is a vector of the same length. That is the entire definition. Chapter 8 showed that lists hold anything; a data frame constrains that freedom to equal-length columns. Once you see this, subsetting, column operations, and the entire tidyverse make sense. The idea is older than R: Edgar Codd’s 1970 paper at IBM defined relations as sets of tuples, each row a tuple of typed values. R’s data frames, SQL tables, pandas DataFrames, and Spark datasets are all descendants of that same relational model.
11.1 A list of equal-length vectors
Start by asking R what a data frame actually is:
typeof(penguins)
#> [1] "list"
is.list(penguins)
#> [1] TRUE
A data frame has type "list" because it is a list. The difference is that every element (every column) must be a vector of the same length. Each column is one vector. Each row is one observation across those vectors.
Build one yourself:
df <- data.frame(
name = c("Adelie", "Gentoo"),
mass = c(3750, 5200)
)
df
#> name mass
#> 1 Adelie 3750
#> 2 Gentoo 5200
You are building a list with two elements (name and mass), both of length 2. R adds a class attribute ("data.frame") and row numbers, but the underlying structure is just a list. You can verify this with str():
str(df)
#> 'data.frame': 2 obs. of 2 variables:
#> $ name: chr "Adelie" "Gentoo"
#> $ mass: num 3750 5200
The result has two columns and two rows, but the underlying representation is just a list with extra bookkeeping.
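The "extra bookkeeping" can be inspected directly. The following is a minimal sketch that rebuilds the two-column example and strips its class; the names plain and res are illustrative, not part of the chapter's code.

```r
# Rebuild the small example data frame.
df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)

# The bookkeeping is a handful of attributes attached to the list.
names(attributes(df))

# Remove the class and what remains is an ordinary list.
plain <- unclass(df)
is.data.frame(plain)
is.list(plain)

# Every element has the same length, equal to the number of rows.
lengths(plain)

# Columns whose lengths cannot be reconciled are rejected.
# tryCatch() captures the error message instead of stopping.
res <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
res
```

The last call fails because a vector of length 2 cannot be recycled evenly to length 3, which is the equal-length constraint being enforced.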
In type theory, this structure is a product type: a tuple of named, typed fields. A data frame with columns name (character) and mass (double) has type character × double, meaning each row is a pair of one character value and one double value. Complex data structures are built from simpler combinators, and the product (pairing things together with AND) is the most basic one. Every time you add a column, you extend the product with another factor.
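One way to make the product-type reading concrete is to list the type of each column. This is a small sketch assuming R 4.0 or later (so strings stay character); col_types and the tagged column are hypothetical names used only for illustration.

```r
df <- data.frame(
  name = c("Adelie", "Gentoo"),
  mass = c(3750, 5200)
)

# One type per column: these are the factors of the product.
col_types <- vapply(df, typeof, character(1))
col_types

# A row holds one value from each column, i.e. one tuple.
first_row <- df[1, ]
first_row

# Adding a column extends the product with another factor.
df$tagged <- c(TRUE, FALSE)
vapply(df, typeof, character(1))
```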
Exercises
- Create a data frame with three columns: species (character), island (character), and bill_length (numeric), with at least three rows. Check its typeof().
- What does length(penguins) return? Why? (Hint: a data frame is a list.)
- What does names(penguins) return?
11.2 Creating data frames
data.frame() is the base R workhorse. Column names come from the argument names:
measurements <- data.frame(
species = c("Adelie", "Chinstrap", "Gentoo"),
mass_g = c(3750, 3800, 5200),
island = c("Torgersen", "Dream", "Biscoe")
)
measurements
#> species mass_g island
#> 1 Adelie 3750 Torgersen
#> 2 Chinstrap 3800 Dream
#> 3 Gentoo 5200 Biscoe
Before R 4.0, data.frame() converted character vectors to factors by default (stringsAsFactors = TRUE). This caused confusion for over two decades. Strings that looked like text silently became categorical variables with integer codes underneath.
The stringsAsFactors default was wrong for 23 years. It was fixed in R 4.0 (2020). If someone’s tutorial still warns about it, their tutorial is outdated.
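The difference between the two behaviours is easy to reproduce. The sketch below assumes R 4.0 or later, where the default is stringsAsFactors = FALSE; the variable names are illustrative.

```r
species <- c("Adelie", "Gentoo", "Adelie")

# Default in R 4.0 and later: strings stay character.
as_text <- data.frame(species = species)
class(as_text$species)

# Opting in reproduces the old default: strings become a factor.
as_factor <- data.frame(species = species, stringsAsFactors = TRUE)
class(as_factor$species)

# The integer codes that sit underneath the factor labels.
as.integer(as_factor$species)
levels(as_factor$species)
```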
11.2.1 Row names
Data frames have row names. Some built-in datasets use them:
head(mtcars, 3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
The car names (“Mazda RX4”, etc.) are row names, not a column. This causes problems: row names can’t contain duplicates, they’re silently dropped by many functions, and they behave differently from columns. You can’t filter by row name without extra syntax.
Pretend row names don’t exist. If something is data, put it in a column. tibble::rownames_to_column() converts row names to a proper column when you inherit a dataset that uses them.
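If you prefer to avoid the tibble dependency, the same conversion can be sketched in base R. The column name model and the object cars are choices made for this example, not names used elsewhere in the chapter.

```r
# Copy the row names into a regular column.
cars <- mtcars
cars$model <- rownames(cars)

# Reset the row names to the default 1, 2, 3, ...
rownames(cars) <- NULL

# Move the new column to the front.
cars <- cars[, c("model", setdiff(names(cars), "model"))]
head(cars[, c("model", "mpg", "cyl")], 3)

# With the names in a column, ordinary logical subsetting works.
cars[cars$model == "Datsun 710", c("model", "mpg")]
```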
Exercises
- Create a data frame where one column is logical (e.g., passed = c(TRUE, FALSE, TRUE)). What does str() show?
- Run row.names(mtcars) and row.names(penguins). What’s the difference?
11.3 Subsetting data frames
Because a data frame is a list, everything from Chapter 8 works:
penguins$species[1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
$ extracts a column and returns a vector, the same way [[ extracts an element from a list.
penguins[["species"]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
[[ does the same thing, but takes a string. This matters when the column name is stored in a variable:
col <- "species"
penguins[[col]][1:3]
#> [1] Adelie Adelie Adelie
#> Levels: Adelie Chinstrap Gentoo
Single brackets return a one-column data frame (a sub-list), not a vector:
penguins["species"][1:3, ]
#> # A tibble: 3 × 1
#> species
#> <fct>
#> 1 Adelie
#> 2 Adelie
#> 3 Adelie
This is the same [ vs [[ distinction from Chapter 8: [ keeps the container, [[ extracts the contents.
11.3.1 Two-dimensional indexing
Because a data frame is rectangular, it also takes two indices: [row, column].
penguins[1, ]
#> # A tibble: 1 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgers… 39.1 18.7 181 3750
#> # ℹ 2 more variables: sex <fct>, year <int>
This selects the first row across all columns and returns a data frame.
penguins[, 1]
#> # A tibble: 344 × 1
#> species
#> <fct>
#> 1 Adelie
#> 2 Adelie
#> 3 Adelie
#> 4 Adelie
#> 5 Adelie
#> 6 Adelie
#> 7 Adelie
#> 8 Adelie
#> 9 Adelie
#> 10 Adelie
#> # ℹ 334 more rows
All rows, first column. Here is the gotcha: on a base data frame, this returns a vector by default, not a data frame. R “drops” the data frame structure when the result has a single column. penguins is a tibble, which never drops, so the output above is still a one-column tibble:
class(penguins[, 1])
#> [1] "tbl_df" "tbl" "data.frame"
To keep the structure on a base data frame, use drop = FALSE (on a tibble it changes nothing):
class(penguins[, 1, drop = FALSE])
#> [1] "tbl_df" "tbl" "data.frame"
The drop gotcha is a common source of bugs. You write a function that subsets a data frame, test it with two columns, and it works. Someone passes one column, and your function breaks because it suddenly receives a vector instead of a data frame.
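To see the drop behaviour directly, the sketch below uses a small base data frame (the hypothetical df, not penguins, which is a tibble and keeps its structure).

```r
df <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  mass = c(3750, 5200, 3800)
)

# Two columns selected: the result is still a data frame.
two_cols <- df[, c("species", "mass")]
class(two_cols)

# One column selected: the data frame structure is dropped.
one_col <- df[, "mass"]
class(one_col)

# drop = FALSE keeps the container.
kept <- df[, "mass", drop = FALSE]
class(kept)

# List-style single brackets never drop.
listed <- df["mass"]
class(listed)
```

Inside functions that accept arbitrary column selections, writing drop = FALSE (or using the list-style df[cols]) is one way to keep the return type stable.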
penguins[1, 2]
#> # A tibble: 1 × 1
#> island
#> <fct>
#> 1 Torgersen
Row 1, column 2. On a tibble this returns a 1 × 1 tibble, as shown; on a base data frame it returns a single value (a vector of length 1).
11.3.2 Logical subsetting
Logical indexing from Section 4.5 works on data frames too. Put the logical vector in the row position:
heavy <- penguins[penguins$body_mass_g > 5000 & !is.na(penguins$body_mass_g), ]
nrow(heavy)
#> [1] 61
head(heavy, 3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Gentoo Biscoe 50 16.3 230 5700
#> 2 Gentoo Biscoe 50 15.2 218 5700
#> 3 Gentoo Biscoe 47.6 14.5 215 5400
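#> # ℹ 2 more variables: sex <fct>, year <int>
penguins$body_mass_g > 5000 produces a logical vector, TRUE for heavy penguins and FALSE (or NA) for the rest. The !is.na() term turns those NA entries into FALSE; without it, each NA in the index would produce a row of NAs. Placing the vector before the comma selects rows. This is essentially what dplyr::filter() does under the hood (Chapter 14).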
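The effect of a missing value in the row index is easiest to see on a tiny example. This is a minimal sketch with a hypothetical four-row data frame; none of these names come from the penguins data.

```r
df <- data.frame(
  id = c("a", "b", "c", "d"),
  mass = c(5200, NA, 4800, 5700)
)

# Comparing against NA gives NA, not FALSE.
keep <- df$mass > 5000
keep

# An NA in the row index yields a row filled with NA.
with_na <- df[keep, ]
with_na

# Requiring a non-missing value treats "unknown" as "no".
clean <- df[keep & !is.na(df$mass), ]
clean

# which() returns only the positions that are TRUE, so it skips NA too.
same <- df[which(keep), ]
same
```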
Exercises
- Predict the class of each result before running:
penguins$bill_length_mm
penguins["bill_length_mm"]
penguins[["bill_length_mm"]]
penguins[, "bill_length_mm"]
penguins[, "bill_length_mm", drop = FALSE]
- Extract all penguins from Biscoe island using logical subsetting. How many rows?
- What does penguins[1:5, c("species", "island")] return?
11.4 Adding and modifying columns
Assign to a new name with $:
penguins$mass_kg <- penguins$body_mass_g / 1000
head(penguins$mass_kg)
#> [1] 3.75 3.80 3.25 NA 3.45 3.65
This adds (or overwrites) a column. The right-hand side must be a vector with the same number of rows, or a single value that gets recycled (Section 4.4.1).
For modifying multiple columns at once, base R provides within():
df2 <- within(penguins, {
mass_kg <- body_mass_g / 1000
bill_ratio <- bill_length_mm / bill_depth_mm
})
head(df2[, c("mass_kg", "bill_ratio")])
#> # A tibble: 6 × 2
#> mass_kg bill_ratio
#> <dbl> <dbl>
#> 1 3.75 2.09
#> 2 3.8 2.27
#> 3 3.25 2.24
#> 4 NA NA
#> 5 3.45 1.90
#> 6 3.65 1.91
within() evaluates the expressions in an environment built from the data frame’s columns, so you can refer to columns by name without the df$ prefix. The result is a new data frame; the original is left unchanged. dplyr::mutate() does this more cleanly, which you will see in Chapter 14.
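Base R also offers transform(), which expresses the same idea as name = value arguments. The sketch below compares the two on a hypothetical three-row data frame that borrows the penguins column names; it is an illustration, not the chapter's own code.

```r
df <- data.frame(
  body_mass_g = c(3750, 3800, NA),
  bill_length_mm = c(39.1, 39.5, 40.3),
  bill_depth_mm = c(18.7, 17.4, 18.0)
)

# within(): statements run with the columns visible as variables.
a <- within(df, {
  mass_kg <- body_mass_g / 1000
  bill_ratio <- bill_length_mm / bill_depth_mm
})

# transform(): the same computation written as named arguments.
b <- transform(
  df,
  mass_kg = body_mass_g / 1000,
  bill_ratio = bill_length_mm / bill_depth_mm
)

# Neither call modifies df; both return a modified copy.
names(df)
names(a)
names(b)
```

The two results hold the same values, though the order of the newly added columns can differ between them.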
Exercises
- Add a column bill_area to penguins that multiplies bill_length_mm by bill_depth_mm. What is mean(penguins$bill_area, na.rm = TRUE)?
- What happens if you assign a vector of the wrong length to a new column? Try penguins$test <- 1:10.
11.5 Combining data frames
rbind() stacks rows. The column names must match, and the types should too (mismatched types are coerced rather than rejected):
batch1 <- data.frame(species = "Adelie", mass = 3750)
batch2 <- data.frame(species = "Gentoo", mass = 5200)
rbind(batch1, batch2)
#> species mass
#> 1 Adelie 3750
#> 2 Gentoo 5200
cbind() glues columns. The row counts must match:
ids <- data.frame(id = c("A1", "G1"))
masses <- data.frame(mass = c(3750, 5200))
cbind(ids, masses)
#> id mass
#> 1 A1 3750
#> 2 G1 5200
When they fail, the error messages are terse. Mismatched column names in rbind() produce “names do not match previous names”, and mismatched row counts in cbind() produce “arguments imply differing number of rows”, without saying which column or input is at fault. Check names() and nrow() before combining.
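A pre-check along these lines might look like the sketch below. The helper check_names() and the mismatched batch3 are hypothetical, written only to illustrate comparing names() before calling rbind().

```r
batch1 <- data.frame(species = "Adelie", mass = 3750)
# Hypothetical batch whose second column has a different name.
batch3 <- data.frame(species = "Chinstrap", mass_g = 3800)

# Hypothetical helper: report which column names differ.
check_names <- function(a, b) {
  list(
    only_in_first = setdiff(names(a), names(b)),
    only_in_second = setdiff(names(b), names(a))
  )
}

mismatch <- check_names(batch1, batch3)
mismatch

# Stack only when both sides agree on the set of column names.
names_agree <- length(mismatch$only_in_first) == 0 &&
  length(mismatch$only_in_second) == 0
combined <- if (names_agree) rbind(batch1, batch3) else NULL
combined
```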
Exercises
- Create two data frames with the same columns but different data. Stack them with rbind().
- What happens if the column names don’t match? Try it.
- Use cbind() to attach a new id column to penguins. What class is the result?
11.6 Tibbles
A tibble is a modern data frame from the tidyverse:
library(tibble)
tbl <- tibble(
species = c("Adelie", "Gentoo", "Chinstrap"),
mass = c(3750, 5200, 3800)
)
tbl
#> # A tibble: 3 × 2
#> species mass
#> <chr> <dbl>
#> 1 Adelie 3750
#> 2 Gentoo 5200
#> 3 Chinstrap 3800
Three differences matter:
Printing. A tibble shows the first 10 rows, column types, and fits the screen. No more flooding the console with 10,000 rows because you forgot head().
Subsetting. tbl[, 1] always returns a tibble, never a vector. The drop gotcha is gone:
class(tbl[, 1])
#> [1] "tbl_df" "tbl" "data.frame"
No partial matching. With a data frame, df$sp might match df$species. A tibble refuses:
tbl$sp
#> Warning: Unknown or uninitialised column: `sp`.
#> NULL
Partial matching is autocomplete for bugs. You type df$sp, R guesses you meant df$species, and your code works until someone adds a column called spine_count. Then df$sp becomes ambiguous and silently returns NULL. A tibble catches this immediately.
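The failure mode described here can be reproduced with a base data frame. This is a minimal sketch; spine_count is the made-up column from the text, and the result variables are illustrative.

```r
df <- data.frame(species = c("Adelie", "Gentoo"))

# Exactly one column starts with "sp", so $ quietly completes the name.
unique_match <- df$sp
unique_match

# Add a second column that also starts with "sp".
df$spine_count <- c(0L, 0L)

# Now "sp" is ambiguous and $ returns NULL without an error.
ambiguous_match <- df$sp
ambiguous_match

# [[ matches exactly by default, so it never guesses.
exact_match <- df[["sp"]]
exact_match
```

Using [[ with the full column name is one way to get the stricter behaviour without switching to tibbles.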
A tibble is still a data frame. Every function that takes a data frame takes a tibble:
is.data.frame(tbl)
#> [1] TRUE
Convert between them with as_tibble() and as.data.frame():
as_tibble(mtcars)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
Use tibbles by default. Use data frames when a package demands one (rare) or when you want zero dependencies.
Exercises
- Create a tibble and a data frame with the same data. Compare what [, 1] returns for each.
- Create a data frame with a column called temperature. Access it with df$temp. Then try the same with a tibble.
- What does is.list() return for a tibble?
11.7 Exploring data frames
You have met str() already. It is the single most useful function for understanding any R object:
str(penguins)
#> tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
#> $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#> $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#> $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
#> $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#> $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
#> $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
The output tells you: 344 rows, 8 columns, the type of each column, and a preview of the values. This is the first thing to run when you load a new dataset.
Other useful functions:
nrow(penguins)
#> [1] 344
ncol(penguins)
#> [1] 8
dim(penguins)
#> [1] 344 8
names(penguins)
#> [1] "species" "island" "bill_length_mm"
#> [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
#> [7] "sex" "year"
head(penguins, 3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgers… 39.1 18.7 181 3750
#> 2 Adelie Torgers… 39.5 17.4 186 3800
#> 3 Adelie Torgers… 40.3 18 195 3250
#> # ℹ 2 more variables: sex <fct>, year <int>
summary(penguins)
#> species island bill_length_mm bill_depth_mm
#> Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
#> Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
#> Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
#> Mean :43.92 Mean :17.15
#> 3rd Qu.:48.50 3rd Qu.:18.70
#> Max. :59.60 Max. :21.50
#> NA's :2 NA's :2
#> flipper_length_mm body_mass_g sex year
#> Min. :172.0 Min. :2700 female:165 Min. :2007
#> 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
#> Median :197.0 Median :4050 NA's : 11 Median :2008
#> Mean :200.9 Mean :4202 Mean :2008
#> 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
#> Max. :231.0 Max. :6300 Max. :2009
#> NA's :2 NA's :2
summary() gives per-column summaries: min, max, mean, and quartiles for numeric columns; counts for factors. NA counts appear at the bottom of each column.
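When only the missing-value counts are of interest, they can be computed directly instead of read off summary(). The sketch below uses airquality, a dataset that ships with base R, rather than penguins so that it runs without extra packages; the variable names are illustrative.

```r
# is.na() on a data frame returns a logical matrix of the same shape;
# colSums() then counts the TRUE values in each column.
na_per_column <- colSums(is.na(airquality))
na_per_column

# The same count, one column at a time, treating the data frame as a list.
by_col <- vapply(airquality, function(col) sum(is.na(col)), integer(1))
by_col

# Number of rows with no missing values at all.
n_complete <- sum(complete.cases(airquality))
n_complete
```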
In RStudio, View(penguins) opens a spreadsheet-style viewer. Useful for browsing, but not for code: View() is interactive and does nothing in a script.
Exercises
- How many NA values are in the sex column of penguins? Use sum() and is.na().
- What is the mean body mass of penguins? Use mean() with na.rm = TRUE.
- Run str() on the mtcars dataset. How many columns are numeric?
11.8 Putting it together
A data frame is a list (Section 11.1). Each column is a vector (Section 4.2). Subsetting uses the same [, [[, and $ operators you learned for lists (Chapter 8) and vectors (Section 4.5). Logical indexing filters rows (Section 11.3.2), the same mechanism from Section 4.5 applied to a two-dimensional structure.
One consequence of this structure: R stores data frames column by column. Each column is one contiguous vector in memory, so column operations (mean(df$x)) are fast (sequential memory access) while row operations (apply(df, 1, sum)) are slow (jumping between columns). This is the same trade-off databases make: columnar stores (DuckDB, Parquet) optimize for analytical aggregation; row stores (PostgreSQL, MySQL) optimize for single-record transactions.
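The column-versus-row asymmetry shows up in how the two kinds of operation are written. The sketch below is illustrative only (it does not measure speed) and uses mtcars plus a hypothetical mixed data frame.

```r
# Column-wise: walk the list, one vector at a time.
col_means <- vapply(mtcars, mean, numeric(1))

# The dedicated helper gives the same numbers.
all.equal(col_means, colMeans(mtcars))

# Row-wise: apply() first converts the data frame to a matrix.
row_sums <- apply(mtcars[1:3, c("mpg", "cyl")], 1, sum)
row_sums

# With mixed column types, that conversion coerces everything to character.
mixed <- data.frame(name = c("Adelie", "Gentoo"), mass = c(3750, 5200))
typeof(as.matrix(mixed))
```

The matrix conversion is one reason row-wise apply() on a data frame can be both slow and surprising when columns have different types.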
The next chapters build on this foundation. Chapter 12 covers the column types you will encounter most often: strings, factors, and dates. Chapter 14 introduces dplyr, which replaces the df[df$x > 5, ] syntax with readable verbs like filter(), mutate(), and summarize(). Those verbs work because data frames are lists of vectors, and you now understand what that representation means in practice.