library(palmerpenguins)
#>
#> Attaching package: 'palmerpenguins'
#> The following objects are masked from 'package:datasets':
#>
#> penguins, penguins_raw
library(ggplot2)
17 Visualization
ggplot2 doesn’t give you a scatterplot function and a bar chart function. It gives you a language for describing any plot. Once you see the grammar, you stop memorizing plot types and start constructing them.
This chapter assumes you have tidy data (Chapter 16), can reshape it with dplyr (Chapter 14), and can pipe it through transformations (Chapter 15). Visualization is the payoff: you have cleaned the data, now you see it.
17.1 The grammar of graphics
In 1999, Leland Wilkinson published The Grammar of Graphics, arguing that a plot is not a type (bar chart, scatterplot, heatmap). A plot is a mapping from data to visual properties, rendered by geometric objects, in a coordinate system. Types are consequences of choices, not starting points.
The intellectual roots go further back: Jacques Bertin’s Sémiologie graphique (1967) classified visual variables (position, size, shape, value, color, orientation, texture) and ranked their effectiveness for different data types. Wilkinson built the Grammar of Graphics on Bertin’s foundation, and when ggplot2 maps a variable to color vs size vs shape, it is implementing Bertin’s taxonomy: position is most effective for quantitative data, color for categorical.
Hadley Wickham implemented this idea in 2005 as ggplot2: a layered grammar where you build plots by composing independent components. The components are:
- Data: a tidy data frame (Chapter 16).
- Aesthetics (aes()): which variables map to which visual properties.
- Geoms: what shapes represent the data.
- Stats: statistical transformations (often implicit).
- Scales: how data values translate to aesthetic values (colors, sizes, axes).
- Facets: small multiples, splitting by a variable.
- Themes: non-data visual styling.
You don’t need all seven for every plot. Most plots are data + aesthetics + geom. The rest are refinements.
The simplest ggplot2 call needs three things: data, an aesthetic mapping, and a geom:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point()
ggplot() sets up the coordinate system. aes() maps bill length to the x-axis and body mass to the y-axis. geom_point() draws one point per row. The + adds the geom as a layer. Those two lines contain the core of the grammar: data, a mapping, and a geom.
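To make the implicit parts visible, here is a rough sketch that spells the same plot out with the lower-level layer() function, naming the stat and position adjustment that geom_point() fills in by default. It is meant for intuition, not as something to write day to day; the object names p_short, p_long, and same_x are made up for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Short form: geom_point() fills in the stat and position for you
p_short <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()

# Long form: the same layer with those defaults written out
p_long <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
  layer(geom = "point", stat = "identity", position = "identity")

# Both layers compute the same x positions
# (suppressWarnings() hides any notice about rows with missing values)
same_x <- suppressWarnings(
  identical(layer_data(p_short)$x, layer_data(p_long)$x)
)
same_x
```

The stat is "identity" here because a scatterplot draws the data as-is; geoms such as geom_bar() or geom_smooth() swap in a stat that summarizes first.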
17.2 Aesthetics: data to visuals
aes() creates a mapping from columns to visual properties:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point()
Here color = species maps each species to a different color. ggplot2 picks the colors automatically and adds a legend. Common aesthetics include x, y, color (or colour), fill, shape, size, alpha, linetype, and group.
The critical distinction: inside aes() means mapped to a variable. Outside aes() means fixed for all observations.
# Color varies by species
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point(aes(color = species))
# All points are steel blue
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point(color = "steelblue")
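If someone’s plot has a legend they don’t want, they probably put a constant inside aes(): ggplot2 treats the constant as a one-level variable and builds a legend for it. If their points are all the same color when they shouldn’t be, a constant set outside aes() is probably overriding the mapping. Getting this distinction right answers a large share of beginner ggplot2 questions.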
Aesthetics placed in ggplot() are inherited by all layers. Aesthetics placed in a specific geom_*() apply only to that layer. When multiple geoms share the same mapping, put it in ggplot(). When a mapping is specific to one layer, put it in the geom.
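One way to see the inheritance rule at work is to inspect what each layer actually receives. The sketch below uses layer_data(), which returns the data a layer would draw; the variable names point_colors and smooth_colors are hypothetical, chosen for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +  # x, y: inherited by every layer
  geom_point(aes(color = species)) +                               # color: this layer only
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)          # sees x and y, not color

# Colors used by layer 1 (points) and layer 2 (the fitted line)
point_colors  <- unique(suppressWarnings(layer_data(p, 1))$colour)
smooth_colors <- unique(suppressWarnings(layer_data(p, 2))$colour)

length(point_colors)   # one color per species
length(smooth_colors)  # a single color: one fit across all penguins
```

Moving color = species into ggplot() would hand the mapping to both layers, and geom_smooth() would then fit one line per species.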
Exercises
- Create a scatterplot of flipper_length_mm vs body_mass_g from penguins. Map species to color.
- What happens if you put color = "blue" inside aes()? Try it and explain the result.
- Map island to the shape aesthetic in a scatterplot. What does the legend show?
17.3 Geoms: shapes for data
Each geom renders data differently. The same aesthetics, different geoms, different plots. Here are the essential ones.
geom_point() draws a scatterplot. Two continuous variables:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
geom_histogram() shows the distribution of one continuous variable. Control the resolution with bins or binwidth:
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 200)
geom_boxplot() summarizes distributions by group (median, quartiles, outliers):
ggplot(penguins, aes(x = species, y = body_mass_g)) +
geom_boxplot()
geom_density() is a smooth alternative to the histogram. It estimates the probability density, which makes it easier to compare distributions:
ggplot(penguins, aes(x = body_mass_g, fill = species)) +
geom_density(alpha = 0.4)
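geom_line() connects observations in order of their x values, which makes it the standard geom for time series and trends. (geom_path() connects them in the order they appear in the data.) If a line zigzags, the usual cause is several series sharing one line: map the grouping variable to group or color so each series gets its own line.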
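The penguins data is not a natural time series, but it does have a year column, so a small sketch is possible. The summary table yearly and the plot name p_line are made up for this example; the averaging step uses base R’s aggregate(), which is one of several ways to do it.

```r
library(ggplot2)
library(palmerpenguins)

# Mean body mass per species and year
# (the formula interface of aggregate() drops rows with missing values)
yearly <- aggregate(body_mass_g ~ year + species, data = penguins, FUN = mean)

# Mapping color to species gives each species its own line
p_line <- ggplot(yearly, aes(x = year, y = body_mass_g, color = species)) +
  geom_line() +
  geom_point()

p_line
```

Without color = species (or group = species), all the yearly values would be joined into a single line that jumps between species.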
geom_col() draws bars with heights taken directly from the data. geom_bar() counts rows for you. The difference: geom_bar() applies stat = "count" internally, so you only map x. geom_col() uses stat = "identity", so you map both x and y.
ggplot(penguins, aes(x = species)) +
geom_bar()
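To make the difference concrete, the following sketch builds the same bar chart both ways and compares the bar heights. The names species_counts, p_bar, and p_col are illustrative, and dplyr::count() is just one convenient way to pre-compute the counts.

```r
library(ggplot2)
library(palmerpenguins)

# geom_bar(): ggplot2 counts the rows per species
p_bar <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# geom_col(): count first, then map the result to y
species_counts <- dplyr::count(penguins, species)  # columns: species, n
p_col <- ggplot(species_counts, aes(x = species, y = n)) +
  geom_col()

# Both plots end up with the same bar heights
layer_data(p_bar)$y
layer_data(p_col)$y
```

Reach for geom_col() when the heights already exist as a column, for example after a summarize() step.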
geom_smooth() adds a fitted line or curve. Use method = "lm" for linear, method = "loess" for smooth:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point() +
geom_smooth(method = "lm")
Layers compose with +. The plot above has two layers: points and a linear fit. Each + adds a component to the same plot object.
Exercises
- Create a scatterplot of bill_length_mm vs bill_depth_mm. Add a smooth line with method = "loess".
- Replace geom_point() with geom_density2d() in the same plot. What changes?
- Make a boxplot of flipper_length_mm by island. Add geom_jitter(width = 0.2, alpha = 0.3) as a second layer.
17.4 Scales: controlling the mapping
Every aesthetic has a scale, even if you don’t set one explicitly. Scales control how data values become visual values: which colors, which axis limits, which labels.
When you write aes(x = bill_length_mm), ggplot2 creates a default scale_x_continuous() behind the scenes. You only need to set a scale explicitly when you want to override the default behavior: change axis limits, transform the axis, or pick specific colors.
Position scales control axes:
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point() +
scale_x_continuous(labels = scales::comma)
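Labels are only one of the defaults you can override. As a sketch, the example below also sets breaks and limits, and contrasts scale limits with coord_cartesian(). The limit values are arbitrary choices for illustration, and base, p_limits, and p_zoom are made-up names.

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

# Override breaks and limits on the x scale.
# Values outside `limits` are treated as missing, not just hidden.
p_limits <- base +
  scale_x_continuous(
    limits = c(3000, 6000),
    breaks = seq(3000, 6000, by = 1000),
    labels = scales::comma
  )

# coord_cartesian() zooms instead: every point stays in the data
p_zoom <- base +
  coord_cartesian(xlim = c(3000, 6000))
```

The distinction matters once a stat is involved: a smooth or boxplot computed on p_limits would ignore the dropped penguins, while p_zoom would use all of them and only crop the view.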
Color scales control how values map to colors. For discrete variables, scale_color_viridis_d() is a strong default (colorblind-friendly, perceptually uniform):
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point() +
scale_color_viridis_d()
labs() controls titles and labels. Always label your axes with units. bill_length_mm is a column name, not a label:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point() +
labs(
x = "Bill length (mm)",
y = "Body mass (g)",
color = "Species",
title = "Palmer penguins"
)
A plot with bill_length_mm on the axis is a draft. A plot with “Bill length (mm)” is communication.
Exercises
- Create a scatterplot of bill_length_mm vs body_mass_g, colored by species. Use scale_color_brewer(palette = "Set2") instead of viridis.
- Add a labs() call with a title, subtitle, and proper axis labels.
- Use scale_y_log10() on a plot of body_mass_g. When might a log scale be appropriate?
17.5 Facets: small multiples
Faceting splits one plot into multiple panels, one per level of a variable. facet_wrap() wraps panels into rows:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point() +
facet_wrap(~ island)
facet_grid() creates a grid with rows and columns:
ggplot(penguins |> dplyr::filter(!is.na(sex)),
aes(x = bill_length_mm, y = body_mass_g)) +
geom_point() +
facet_grid(sex ~ island)
By default, all panels share the same axis scales. You can free them with scales = "free_y" or scales = "free", but use this cautiously: free scales make comparison across panels harder.
When to facet vs when to color: facet when the groups would overlap too much, or when you want each group’s pattern to stand alone. Color works when groups are few and visually separable.
Faceting replaces the temptation to build the same plot three times with a loop or copy-paste. One facet_wrap() call produces all the panels, with shared axes for honest comparison. The grammar handles the layout; you focus on the question.
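For contrast, here is a sketch of the loop-style approach next to the faceted one. The names islands, plots, and p_facets are hypothetical; the point is only how much bookkeeping each version needs.

```r
library(ggplot2)
library(palmerpenguins)

# Loop approach: one filtered plot per island, each with its own axes
islands <- levels(penguins$island)
plots <- lapply(islands, function(isl) {
  ggplot(subset(penguins, island == isl),
         aes(x = bill_length_mm, y = body_mass_g)) +
    geom_point() +
    labs(title = isl)
})

# Facet approach: one specification, shared axes across panels
p_facets <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point() +
  facet_wrap(~ island)

length(plots)  # separate plot objects you would still have to arrange
```

The loop version leaves you with a list of plots whose axis ranges differ unless you align them by hand; the faceted version handles both layout and shared scales.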
Exercises
- Facet the penguins scatterplot (bill length vs body mass) by species using facet_wrap().
- Use facet_grid(species ~ island) on the same plot. Which cells are empty, and why?
- Add scales = "free" to your faceted plot. What changes? Is the comparison easier or harder?
17.6 Themes: non-data styling
Themes control the non-data elements of a plot: background, grid lines, fonts, legend position. ggplot2 ships several built-in themes:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point() +
theme_minimal()
Other useful defaults include theme_classic() (white background, no grid) and theme_bw() (white background, light grid). For fine-grained control, theme() adjusts individual elements:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point() +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold")
)
To set a default theme for your entire session, use theme_set():
theme_set(theme_minimal())
Themes are purely cosmetic: the grammar is the substance of this chapter, not the polish, so spend your time on clear mappings and good labels before worrying about fonts.
That said, a consistent theme across all your plots signals professionalism. Pick one default early (most people end up with theme_minimal() or theme_bw()) and use it everywhere; consistency matters more than which theme you choose.
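If you set a session-wide default inside a script, it can be polite to put the previous one back afterwards. theme_set() returns the old theme invisibly, so a minimal sketch looks like this (old_theme is a made-up name):

```r
library(ggplot2)

# theme_set() hands back the previous default, so keep it
old_theme <- theme_set(theme_minimal())

# ... plots created here use theme_minimal() by default ...

# Restore the earlier default, e.g. at the end of a script
theme_set(old_theme)
```

In an interactive session you would usually skip the restore step and simply call theme_set() once at the top.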
Exercises
- Apply theme_classic() to any scatterplot from this chapter. How does it differ from theme_minimal()?
- Use theme(axis.text.x = element_text(angle = 45, hjust = 1)) to rotate x-axis labels. When is this useful?
17.7 Putting it together
Here is a complete example, built step by step. Read it like a sentence: take penguins, remove missing sex, map bill length to x, mass to y, species to color, draw points, add linear fits, split by sex, use viridis colors, label everything, apply a minimal theme.
penguins |>
dplyr::filter(!is.na(sex)) |>
ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ sex) +
scale_color_viridis_d() +
labs(
x = "Bill length (mm)",
y = "Body mass (g)",
color = "Species",
title = "Bill length vs body mass by species and sex"
) +
theme_minimal()
The pipe feeds data into ggplot(), and after that, + composes the layers. The grammar makes it readable: each line adds one component, so you can read from top to bottom to understand the full specification.
This is the pattern you will use for most visualizations:
- Start with the data.
- Pipe it through any needed transformations (filtering, summarizing).
- Hand it to ggplot() with your aesthetic mappings.
- Add geoms.
- Refine with scales, facets, labels, and a theme.
Each step is independent and composable: you can swap out any component without touching the others. Change geom_point() to geom_jitter() and nothing else breaks; switch facet_wrap(~ sex) to facet_grid(sex ~ island) and the rest of the specification stays the same. The pieces are orthogonal, which is why the grammar scales to complex plots without becoming unwieldy.
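A sketch of that swapping in practice: store the shared part of the specification once, then vary one component at a time. The names base, p_points, p_jitter, and p_grid are made up for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Shared part of the specification, stored once
base <- penguins |>
  dplyr::filter(!is.na(sex)) |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species))

# Swap the geom, keep everything else
p_points <- base + geom_point(alpha = 0.6) + facet_wrap(~ sex)
p_jitter <- base + geom_jitter(alpha = 0.6) + facet_wrap(~ sex)

# Swap the facet, keep everything else
p_grid <- base + geom_point(alpha = 0.6) + facet_grid(sex ~ island)
```

Because base is an ordinary R object, each variant costs one line, and none of them modifies the others.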
17.8 ggplot2 as lambda calculus
A ggplot specification is a program. Not metaphorically: it is a tree of expressions that gets evaluated to produce output, just like any other R expression. The connection to the ideas from Chapter 7 runs deep.
aes() is a function that returns unevaluated expressions. When you write aes(x = bill_length_mm, y = body_mass_g), R does not look up bill_length_mm in your environment. It captures the expression and stores it for later evaluation inside the data frame. This is the same quoting mechanism you will see in Chapter 26: aes() is non-standard evaluation applied to visualization. The mapping is a description of a computation, not the computation itself, exactly like a lambda expression describes a function without executing it.
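You can inspect the captured expressions directly. The sketch below calls aes() with no data frame in sight; rlang::quo_get_expr() is used here to pull the expression out of what aes() stored, and mapping is a made-up variable name.

```r
library(ggplot2)

# No data frame exists yet, and no variable named bill_length_mm
# needs to exist: aes() only records the expressions
mapping <- aes(x = bill_length_mm, y = body_mass_g)

names(mapping)                  # which aesthetics were mapped
rlang::quo_get_expr(mapping$x)  # the captured, unevaluated expression
```

Nothing is looked up until the plot is built, which is why the same mapping can be reused with any data frame that has those columns.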
The + operator on ggplot layers behaves like monoidal composition. Each layer is an independent unit, and adding one to a plot with + produces a new plot object. The empty plot ggplot() is the starting point every chain builds on, and adding NULL changes nothing, so NULL serves as an identity element. An associative binary operation with an identity element is a monoid, one of the simplest structures in algebra. String concatenation with "" as identity is a monoid; list concatenation with list() is a monoid; function composition with the identity function is a monoid. ggplot layers are, loosely, another instance (loosely, because in practice the chain always starts from a ggplot() object on the left). We will see in Chapter 21 that Reduce() works precisely because it folds over a monoid, and in Section 30.8 how the pattern generalizes.
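As a small sketch of the fold idea, the components of a plot can be kept in an ordinary list and combined with Reduce(). The names start, parts, p_plus, p_fold, and p_null are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

start <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g))
parts <- list(geom_point(), facet_wrap(~ island), theme_minimal())

# Written out with +
p_plus <- start + geom_point() + facet_wrap(~ island) + theme_minimal()

# Folded with Reduce(): ((start + part1) + part2) + part3
p_fold <- Reduce(`+`, parts, start)

# Adding NULL leaves a plot unchanged
p_null <- p_plus + NULL
```

This is handy when the set of layers is decided at run time, for example when a function optionally adds a smoother or a facet.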
The rendering pipeline itself is function composition. When ggplot2 draws a plot, it runs a sequence of transformations: data → stat (statistical transformation) → scale (map values to aesthetic space) → coord (project into pixel coordinates) → render (draw shapes). Each step takes a data structure and returns a data structure. The pipeline is render ∘ coord ∘ scale ∘ stat, pure composition. A graphics engine, at its core, is a chain of morphisms between categories: data space → statistical space → aesthetic space → physical space (pixels). Each geom_*() specifies what the final rendering step should draw; everything before it is a data transformation pipeline.
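You can step through part of this pipeline by hand. The sketch below uses ggplot_build() and ggplot_gtable(), the two functions ggplot2 exposes for the build and render stages; the variable names built, binned, and grobs are made up here.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

# data -> stat -> scale: ggplot_build() runs the data transformations
built <- suppressWarnings(ggplot_build(p))

# After the stat, rows are bins with counts, not penguins
binned <- suppressWarnings(layer_data(p))
nrow(penguins)     # one row per penguin going in
nrow(binned)       # one row per bin coming out
sum(binned$count)  # penguins with a recorded body mass

# coord + render: ggplot_gtable() turns the built plot into drawable grobs
grobs <- ggplot_gtable(built)
```

Looking at binned is also a practical debugging tool: when a histogram or smoother looks wrong, the computed data usually shows why.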
This is why the grammar works so well. It is not a grab bag of plot types; it is a small set of composable functions. You have already seen this principle: small functions that compose are more powerful than large functions that don’t (Chapter 15). ggplot2 applies that principle to visualization.
Exercises
- Build a plot from scratch: filter penguins to only Adelie penguins, then create a scatterplot of flipper length vs body mass, colored by island. Add proper labels and a theme.
- Create a faceted histogram of body_mass_g by species, with binwidth = 100. Use fill = species and set alpha = 0.7.
- Start from the full example above and modify it: change the geom to geom_density2d(), remove the faceting, and switch to theme_classic(). What does the plot reveal?
17.9 Common mistakes
Putting aes() in the wrong place. A variable inside aes() creates a mapping; a value outside aes() sets a constant. If your legend is unexpected, check which side of aes() your arguments are on (Section 17.2).
Using |> instead of + after ggplot(). The pipe feeds data into ggplot(). After that first call, every addition uses +. This trips up every beginner exactly once. If you get an error saying that mapping must be created by aes(), you probably wrote |> where you needed +; recent versions of ggplot2 even ask whether you used a pipe instead of +. (An error like “could not find function geom_point” points to something else: ggplot2 is not loaded.) The reason ggplot2 uses + rather than |> is partly historical (ggplot2 predates magrittr by the better part of a decade, and the native pipe by longer still) and partly conceptual: the pipe transforms data sequentially, while + accumulates structure. These are different composition models: pipeline (linear) vs builder (additive).
# Wrong
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) |>
geom_point()
# Right
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point()
Making multiple layers when the data needs reshaping. If you find yourself writing geom_line(aes(y = col_a)) and then geom_line(aes(y = col_b)) for different columns, that is a signal to use pivot_longer() first (Chapter 16) and map the new column to an aesthetic.
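A sketch of the reshaping fix, using yearly mean bill measurements as stand-ins for col_a and col_b. The tables wide and long, the new column names measurement and mean_mm, and the plot name p_long are all made up for this illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Wide: one column per measurement (yearly means, missing values dropped)
wide <- aggregate(cbind(bill_length_mm, bill_depth_mm) ~ year,
                  data = penguins, FUN = mean)

# Long: one row per year and measurement
long <- tidyr::pivot_longer(
  wide,
  cols = c(bill_length_mm, bill_depth_mm),
  names_to = "measurement",
  values_to = "mean_mm"
)

# One layer, one mapping, and the legend comes for free
p_long <- ggplot(long, aes(x = year, y = mean_mm, color = measurement)) +
  geom_line()
```

With the wide table you would need one geom_line() per column and a hand-built legend; with the long table, adding a third measurement changes the data, not the plot code.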
Complex calculations inside aes(). Move them to mutate() first (Chapter 14). aes(x = log(value + 1) / max(value)) is hard to read and hard to debug. Create the column, then map it.
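As a sketch of “create the column, then map it”, the example below derives a bill ratio with mutate() before plotting. The column name bill_ratio and the object names penguins_ratio and p_ratio are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Compute the derived column once, under a name you can inspect
penguins_ratio <- penguins |>
  dplyr::mutate(bill_ratio = bill_length_mm / bill_depth_mm)

# The mapping stays a plain column reference
p_ratio <- ggplot(penguins_ratio,
                  aes(x = bill_ratio, y = body_mass_g, color = species)) +
  geom_point() +
  labs(x = "Bill length / bill depth", y = "Body mass (g)", color = "Species")

summary(penguins_ratio$bill_ratio)  # easy to check before plotting
```

If the plot looks odd, you can examine bill_ratio on its own, which is much harder when the arithmetic lives inside aes().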
Overloading a single plot. Too many colors, too many geoms, too much data. If a plot is hard to read, split it into facets or separate plots. A simpler plot almost always communicates better.
Most ggplot2 errors are not about syntax. They are about structure: the data is in the wrong shape, the mapping is in the wrong place, or the wrong composition operator is connecting the layers. When a plot doesn’t look right, check the data and the mappings before adjusting visual parameters. The grammar tells you where the problem is, if you listen to it.
17.10 Saving plots
A ggplot object is a deferred computation: it describes a plot but does not produce pixels until you print or save it. Printing sends the result to the screen; ggsave() sends it to a file:
ggsave("penguins_scatter.png", width = 8, height = 5, dpi = 300)
The file format is inferred from the extension: .png, .pdf, .svg, .jpg. For publication, PDF or SVG gives vector graphics that scale cleanly at any size. For slides and web, PNG at 300 DPI is standard.
You can also pass a stored plot object explicitly:
p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
geom_point() +
theme_minimal()
ggsave("penguins_scatter.pdf", plot = p, width = 8, height = 5)
The width and height arguments control the output dimensions in inches (the default unit). Getting these right matters more than any theme adjustment: a plot squeezed into half its natural width produces unreadable axis labels, and a plot stretched too wide produces sparse, floating data. Experiment with dimensions before finalizing.
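If you think in centimetres rather than inches, ggsave() also takes a units argument. The sketch below writes to a temporary directory so it can be run anywhere; the file name and the 16 by 10 cm size are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  theme_minimal()

# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), "penguins_scatter_16x10cm.png")

# State the unit explicitly instead of relying on the inch default
suppressWarnings(
  ggsave(out_file, plot = p, width = 16, height = 10, units = "cm", dpi = 300)
)

file.exists(out_file)
```

Writing the dimensions and unit into the call means the figure size is part of the script rather than something to remember.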
Always save with ggsave(), never with right-click or the RStudio export button. ggsave() is reproducible: the same code produces the same file tomorrow. The export button is a one-off action with no record of the dimensions or resolution you chose.