13 Reading and writing data
Try loading a CSV you downloaded six months ago from a government portal. The accented characters in city names have turned into question marks, the column that should be numeric is full of strings because row 4,012 contains the word “pending,” and the file path you hard-coded starts with C:/Users/someone_else/. Three lines of read.csv() and three different failures, none of which have anything to do with your analysis. Every data project begins here, in the gap between the file on disk and the clean data frame in memory, and that gap is wider than most tutorials admit.
This chapter is about crossing it. CSV, Excel, databases, binary formats: each has its own traps, and two problems cut across all of them (file paths and character encodings) with a persistence that borders on malice.
13.1 File paths
Every R session has a working directory, and all relative paths resolve from it:

```r
getwd()
```
If your data lives at data/penguins.csv and your working directory happens to be the project root, read.csv("data/penguins.csv") finds the file. Move the script into a subdirectory, or run it on a colleague’s laptop where the project sits in a different folder, and the same call breaks silently.
Don’t use setwd(). It welds your script to one machine, one folder hierarchy, one moment in time. Use RStudio projects or the here package instead; they let the project root follow the project, not the other way around.
The here package builds paths relative to the project root by searching upward for .Rproj files, .git directories, or similar markers, so the same call works regardless of which subdirectory your script lives in:
```r
here::here("data", "penguins.csv")
```

It works on any machine, any OS, any nesting depth. Every file path in this chapter uses here::here().
Paths hide one more trap, and it is entirely cosmetic until it isn’t: slash direction. Forward slashes work everywhere in R, even on Windows, so "data/penguins.csv" is always valid. Backslashes need escaping ("data\\penguins.csv"), which you will forget exactly once before switching to forward slashes permanently.
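If you would rather not type separators at all, base R's file.path() joins path components for you, and it uses forward slashes on every operating system, including Windows:

```r
# file.path() joins components with "/" on every OS,
# so slash direction never becomes your problem.
file.path("data", "penguins.csv")
#> [1] "data/penguins.csv"
```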
Before reading any file, verify it exists. A clear error you write yourself beats a cryptic one from deep inside read.csv():
```r
path <- here::here("data", "penguins.csv")
if (!file.exists(path)) stop("File not found: ", path)
```

dir.exists() does the same for directories. But finding the file is the easy part; what happens after you open it?
Exercises
- Run `getwd()` in your console. What does it return?
- Install the `here` package (`install.packages("here")`). Run `here::here()`. What directory does it point to?
- Why is `here::here("data", "file.csv")` more portable than a hard-coded path like `"C:/Users/me/project/file.csv"`?
13.2 Reading CSV files
Reading a file is a function: hand it a path, get back a data frame. You could write the type signature as read_csv :: FilePath -> DataFrame. In a purely functional world the same path would always produce the same data frame, but I/O demolishes that guarantee because someone could edit the file between calls, or the disk could fail, or the network share could vanish. R doesn’t track this in its type system; the function looks pure but isn’t. Haskell solves the problem by wrapping I/O in a monad (readFile :: FilePath -> IO String), making the side effect visible in the type. R trusts you to know the difference.
That trust has consequences. Every call to read_csv() implicitly assumes: this file has not changed since I last looked, the encoding is what I think it is, and the column that was numeric in row 1 is still numeric in row 4012. When any of those assumptions break, the failure is hard to diagnose because the function looks deterministic — same path, same call — but the world behind the path has shifted. This is why readr lets you specify col_types explicitly: you are declaring the contract the file must satisfy, and readr validates it on your behalf. Without that declaration, the function infers types from the first thousand rows and hopes the rest agree. Specifying types is the functional programmer’s workaround for a function that cannot be pure.
Comma-separated values date to the 1970s on IBM mainframes, and there is still no real standard. RFC 4180 (2005) tried, but Excel, R, Python, and most other tools disagree on quoting rules, newlines inside fields, and encoding. “My CSV won’t load” is perhaps the only truly universal data experience.
Base R reads CSVs with read.csv():
```r
df <- read.csv(here::here("data", "penguins.csv"))
```

It works, but the defaults are quirky. In older versions of R, read.csv() converted strings to factors automatically (stringsAsFactors = TRUE), the same pain point from Chapter 11. Since R 4.0 the default is FALSE, but you will still encounter old code and old advice that assume otherwise. The quirks, however, run deeper than stringsAsFactors.
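You can verify the modern default yourself with a throwaway file; this sketch uses tempfile() so it touches nothing in your project:

```r
# Confirm that read.csv() now keeps strings as character (R >= 4.0)
path <- tempfile(fileext = ".csv")
writeLines(c("species,count", "Adelie,10", "Gentoo,7"), path)

df <- read.csv(path)
class(df$species)   # "character" -- no silent factor conversion
```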
readr::read_csv() addresses most of them:
```r
library(readr)
df <- read_csv(here::here("data", "penguins.csv"))
```

read_csv() never converts strings to factors, it is faster on large files, it prints the column types it guessed so you can catch mistakes immediately, and it returns a tibble (Section 11.6) instead of a plain data frame. For most work, read_csv() is simply the better default.
Key arguments worth knowing:
- `col_types`: specify column types explicitly instead of letting readr guess.
- `na`: which strings should be treated as missing values (default: `c("", "NA")`).
- `skip`: skip header lines before reading data.
- `locale`: control decimal marks, encodings, and date formats.
```r
df <- read_csv(
  here::here("data", "survey.csv"),
  col_types = cols(
    id = col_integer(),
    name = col_character(),
    score = col_double(),
    date = col_date()
  ),
  na = c("", "NA", "missing")
)
```

Specify col_types explicitly in any script you will run more than once. readr guesses types from the first 1,000 rows, which works beautifully until row 1,001 contains a string in what it decided was a numeric column. Guessing is for exploration; production code deserves certainty.
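To watch the 1,000-row guess fail in a controlled way, here is a sketch that plants a string at row 1,001 (assuming readr is installed):

```r
library(readr)

# 1,000 clean numeric rows, then one value the guessed type can't parse
path <- tempfile(fileext = ".csv")
writeLines(c("score", as.character(1:1000), "pending"), path)

df <- read_csv(path, show_col_types = FALSE)  # warns about a parsing problem
class(df$score)    # numeric: guessed from the first 1,000 rows
df$score[1001]     # NA: "pending" could not be parsed
problems(df)       # the offending row, on record
```

Declaring `col_types = cols(score = col_character())` up front would have read every row faithfully and left the cleanup decision to you.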
For European CSVs where ; is the delimiter and , the decimal mark, use read_csv2(). For tab-separated files, read_tsv(). For anything else, read_delim() lets you specify whatever delimiter the file actually uses:
```r
read_delim(here::here("data", "pipes.txt"), delim = "|")
```

Exercises
- Read a CSV file (any file you have, or create one with a text editor). Use `read_csv()` and inspect the column types it prints.
- Read the same file with explicit `col_types`. What happens if you declare a character column as `col_integer()`?
- Create a small CSV where one value is the string `"NA"` and another is genuinely missing (empty cell). Read it with `read_csv()` and check the result with `is.na()`.
13.3 Writing CSV files
So reading is solved, or at least manageable. What about the return trip?

readr::write_csv() writes a data frame to disk:
```r
write_csv(df, here::here("output", "clean_data.csv"))
```

It always writes UTF-8 and never adds row names. Both are good defaults.
Base R’s write.csv(), by contrast, adds row names by default, which usually just clutters the output with an unnamed first column:
```r
write.csv(df, here::here("output", "clean_data.csv"), row.names = FALSE)
```

That row.names = FALSE is easy to forget, and forgetting it connects back to the same problem from Chapter 11: row names cause more confusion than they solve.
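Here is the failure in miniature, using a temporary file so you can run it anywhere:

```r
# Forgetting row.names = FALSE puts the row names into the file...
path <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 3), path)   # header starts with an empty field

# ...and they come back as a column with the made-up name "X"
df <- read.csv(path)
names(df)[1]   # "X"
```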
CSV is the right format for small to medium data that needs to be human-readable or shared across tools. But “small to medium” has a ceiling, and CSV hits it faster than you might expect. What happens when a file is too large to open in a text editor, or when you need to preserve exact types across languages?
Exercises
- Create a small data frame and write it with both `write_csv()` and `write.csv()`. Open both files in a text editor. What’s different?
- Write a data frame with `write.csv()` and forget `row.names = FALSE`. Read it back with `read_csv()`. What happened to the row names?
13.4 Excel files
The readxl package reads both .xls and .xlsx files with no external dependencies (no Java, no Perl, just C++ under the hood):
```r
library(readxl)
df <- read_excel(here::here("data", "survey.xlsx"))
```

Key arguments:
- `sheet`: select a sheet by name or index.
- `range`: read a specific cell range (e.g., `"B3:D15"`).
- `col_types`, `na`, `skip`: same idea as readr.
```r
df <- read_excel(
  here::here("data", "survey.xlsx"),
  sheet = "raw_data",
  range = "A1:F100",
  na = c("", "N/A")
)
```

When you don’t know the structure of an Excel file (and with Excel files, you often don’t), list the sheet names first:
```r
excel_sheets(here::here("data", "survey.xlsx"))
```

To read all sheets into a list of data frames:
```r
path <- here::here("data", "survey.xlsx")
all_sheets <- purrr::map(excel_sheets(path), \(s) read_excel(path, sheet = s))
```

This uses purrr::map() to apply read_excel() to each sheet name, returning a list. The full treatment of map() comes in Chapter 19.
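If you want the resulting list to carry the sheet names, a base-R variant with lapply() and setNames() does the same job. The sketch below builds its own two-sheet workbook with writexl so it is self-contained (assuming readxl and writexl are installed; the sheet names are invented for the demo):

```r
library(readxl)

# Build a throwaway two-sheet workbook to read back
path <- tempfile(fileext = ".xlsx")
writexl::write_xlsx(list(measurements = head(mtcars), lookup = head(iris)), path)

sheets <- excel_sheets(path)
all_sheets <- lapply(setNames(sheets, sheets), \(s) read_excel(path, sheet = s))
names(all_sheets)   # "measurements" "lookup"
```

Named elements pay off later: `all_sheets$lookup` reads better than `all_sheets[[2]]`.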
For writing Excel files, writexl::write_xlsx() is lightweight and sufficient for most cases:
```r
writexl::write_xlsx(df, here::here("output", "results.xlsx"))
```

Excel is a fine data source. It is a terrible data format. Merged cells break read_excel() because R has no concept of a cell spanning two columns; multiple tables on one sheet force you to guess where each one starts; and meaning encoded in cell colors is invisible to anything that reads only the values. Read from Excel, clean in R, save as CSV or Parquet. Do not round-trip back to .xlsx.
Exercises
- If you have an Excel file with multiple sheets, use `excel_sheets()` to list them, then read one sheet by name.
- Use the `range` argument to read only a subset of cells from a sheet.
13.5 Databases
When data lives in a database, R connects to it through the DBI package, and the pattern is always the same three steps: connect, query, disconnect.
```r
library(DBI)
con <- dbConnect(RSQLite::SQLite(), here::here("data", "example.sqlite"))
result <- dbGetQuery(con, "SELECT * FROM measurements LIMIT 10")
dbDisconnect(con)
```

dbConnect() takes a driver (here, SQLite) and connection parameters; dbGetQuery() sends SQL and returns a data frame; dbDisconnect() closes the connection when you are done. This three-step shape works with every database R supports: PostgreSQL, MySQL, SQL Server, DuckDB, and others. Only the driver and connection string change.
SQLite is self-contained (one file, no server, no configuration) and therefore ideal for learning:
```r
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT mpg, cyl FROM mtcars WHERE cyl = 6")
dbDisconnect(con)
```

DuckDB is the modern choice for local analytical work. It stores data in columns rather than rows, runs aggregations fast, and can read Parquet and CSV files directly without an import step:
```r
con <- dbConnect(duckdb::duckdb())
result <- dbGetQuery(con, "SELECT * FROM read_parquet('data/big_table.parquet')")
dbDisconnect(con, shutdown = TRUE)
```

The real power, though, comes from combining databases with dplyr. Instead of writing SQL by hand you can use dplyr verbs on a database table and let R translate them into the appropriate SQL dialect behind the scenes:
```r
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), here::here("data", "example.sqlite"))
tbl(con, "measurements") |>
  filter(year > 2020) |>
  summarise(mean_value = mean(value))
```

dplyr generates the SQL for you through the dbplyr package; when you want the results in memory, collect() runs the query and returns a tibble.
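If you want to see the translation before trusting it, show_query() prints the generated SQL. A minimal sketch against an in-memory SQLite database (assuming the dbplyr and RSQLite packages are installed):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

q <- tbl(con, "mtcars") |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

show_query(q)   # the SQL dplyr wrote for you
collect(q)      # run it and pull the result into a tibble
dbDisconnect(con)
```

Nothing touches the database until show_query() or collect() forces it: the pipeline up to that point is just a description of the query.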
What these tools share is a common interface: you describe what you want, and the backend figures out how to get it. But databases solve the “too big” problem by moving computation to the data. What if you just want a faster file?
Exercises
- Create an in-memory SQLite database. Write the `iris` data frame to it, then query it with SQL.
- Use `dbListTables()` to see what tables exist in a connection.
13.6 Binary and columnar formats
CSV is human-readable, which is wonderful for debugging and terrible for performance. The bottleneck is structural: CSV is row-oriented, storing each record as a unit, so reading a single column means scanning every row and throwing away everything you don’t need. Columnar formats (Parquet, Feather, DuckDB’s native format) flip this arrangement by storing each column contiguously, which makes analytical queries that touch few columns but many rows dramatically faster while also compressing far better because similar values sit next to each other. This row-versus-column distinction shapes every file format decision in data engineering.
saveRDS() and readRDS() serialize any R object to disk:
```r
saveRDS(df, here::here("cache", "processed.rds"))
df <- readRDS(here::here("cache", "processed.rds"))
```

RDS is fast, compact, and preserves R types exactly: factors, dates, attributes, nested lists, everything. The catch is that it is R-only. Python can’t read it, Excel can’t read it, and the moment a collaborator opens Jupyter instead of RStudio, you need a different format.
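“Preserves R types exactly” is easy to verify; this sketch round-trips a date and a factor through a temporary .rds file:

```r
df_typed <- data.frame(
  when  = as.Date("2024-01-15"),
  group = factor("a", levels = c("a", "b"))
)
path <- tempfile(fileext = ".rds")
saveRDS(df_typed, path)

identical(df_typed, readRDS(path))   # TRUE: Date, factor, and levels all survive
```

A CSV round trip of the same data frame would hand both columns back as plain character. Only R can perform this trick, though; sharing outside R needs a cross-language format.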
Parquet is that format:
```r
library(arrow)
write_parquet(df, here::here("data", "measurements.parquet"))
df <- read_parquet(here::here("data", "measurements.parquet"))
```

Parquet files are columnar, compressed, and readable by R, Python, Spark, DuckDB, and dozens of other tools. They are typically a fraction of the size of equivalent CSVs and far faster to read.
Arrow’s Feather format offers yet another trade-off: faster than Parquet for reading and writing, but producing larger files because it prioritizes speed over compression. Use Parquet for storage and sharing; use Feather for speed-critical intermediate files where disk space is cheap:
```r
write_feather(df, here::here("cache", "fast_cache.feather"))
df <- read_feather(here::here("cache", "fast_cache.feather"))
```

When to use what:
| Format | Strengths | Weaknesses |
|---|---|---|
| CSV | Universal, human-readable | Slow, large, lossy types |
| RDS | Fast, preserves all R types | R-only |
| Parquet | Fast, compressed, cross-language | Not human-readable |
| Feather | Very fast read/write | Larger than Parquet, less adopted |
If your data doesn’t fit in memory at all, vectra (a package I wrote for exactly this case) can stream a CSV and run dplyr-like verbs (filter, select, mutate, group_by, summarise) without loading the full file into R.
If your CSV takes more than a few seconds to read, switch to Parquet. You will wonder why you waited.
Exercises
- Save the `mtcars` data frame as CSV, RDS, and (if you have the `arrow` package) Parquet. Compare the file sizes with `file.size()`.
- Read each format back and verify the result is identical with `all.equal()`.
- Time reading the CSV and RDS versions with `system.time()`. Which is faster?
13.7 Encodings
You have read the file, parsed the columns, and everything looks right until you notice that half the city names contain \xe9 instead of an accented e, or that German umlauts have become pairs of nonsense characters. The data isn’t corrupted. The bytes are fine. Your reader is just interpreting them with the wrong mapping.
An encoding is that mapping: a table that says “byte 0xE9 means e-with-accent” (in Latin-1) or “byte sequence 0xC3 0xA9 means e-with-accent” (in UTF-8). UTF-8 is the modern standard, capable of representing every character in every living script; Latin-1 (ISO-8859-1) is the old European standard; Windows-1252 is Microsoft’s variant of Latin-1. All three agree on basic ASCII (English letters, digits, punctuation) and disagree on everything else, which is precisely why files that contain only English text never trigger encoding problems and files with accented characters break constantly.
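You can watch the mapping at work from the R console; charToRaw() shows the bytes behind a string, and iconv() re-encodes it:

```r
x <- "\u00e9"                              # the accented e from above
charToRaw(x)                               # c3 a9 -- two bytes in UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(y)                               # e9 -- one byte in Latin-1, same character
```

Same character, different bytes: a reader that picks the wrong table has no way to know it guessed wrong, which is why the damage is silent.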
readr::read_csv() assumes UTF-8. Base R’s read.csv() assumes your system’s default encoding, which on Windows is often not UTF-8. This is why the same CSV can load perfectly on a Mac and produce garbled text on a Windows machine in the next room.
When you encounter garbled text, specify the encoding explicitly:
```r
df <- read_csv(
  here::here("data", "european_data.csv"),
  locale = locale(encoding = "latin1")
)
```

For base R:
```r
df <- read.csv(here::here("data", "european_data.csv"), fileEncoding = "latin1")
```

The BOM (byte order mark) is another common irritant. Excel sometimes prepends three invisible bytes to UTF-8 files as a signature, and read_csv() strips them automatically; read.csv() sometimes doesn’t, producing a garbled first column name (you’ll see something like X.U.FEFF.id where id should be).
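In base R, fileEncoding = "UTF-8-BOM" strips the mark on read. The sketch below manufactures a BOM-prefixed file so the fix is reproducible:

```r
# Write the three BOM bytes (EF BB BF) followed by ordinary CSV content
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id,value\n1,10\n"), con)
close(con)

df <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(df)   # "id" "value" -- the BOM is stripped, column names are clean
```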
Save everything as UTF-8. If you receive a file in a different encoding, convert it once, save the result, and never look back. Life is too short to debug the same encoding issue twice.
You can check and convert encodings with base R:
```r
# Check the declared encoding of a string
Encoding(x)

# Convert from Latin-1 to UTF-8
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
```

The practical rule fits in one sentence: use readr (it assumes UTF-8), and when characters still come out wrong, pass the correct encoding explicitly. Which brings us back to where we started: the gap between the file on disk and the data frame in memory is full of invisible decisions (paths, delimiters, types, encodings) that someone made for you, often badly, and that break the moment you move the file to a different machine.
Exercises
- Create a CSV file in a text editor with a non-ASCII character (an accented letter, an umlaut). Read it with `read_csv()`. Does it display correctly?
- What does `Encoding("hello")` return? What about `Encoding("\u00e9")`?