The Problem: Silent Data Corruption

You receive monthly customer exports from a CRM system. The data should have unique customer_id values and complete email addresses. One month, someone upstream changes the export logic. Now customer_id has duplicates and some emails are missing.

Without explicit checks, you won’t notice until something breaks downstream—wrong row counts after a join, duplicated invoices, failed email campaigns.

# January export: clean data
january <- data.frame(
  customer_id = c(101, 102, 103, 104, 105),
  email = c("alice@example.com", "bob@example.com", "carol@example.com",
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "premium", "basic", "premium")
)

# February export: corrupted upstream (duplicates + missing email)
february <- data.frame(
  customer_id = c(101, 102, 102, 104, 105),  # Note: 102 is duplicated
  email = c("alice@example.com", "bob@example.com", NA,
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "basic", "basic", "premium")
)

The February data looks fine at a glance:

head(february)
#>   customer_id             email segment
#> 1         101 alice@example.com premium
#> 2         102   bob@example.com   basic
#> 3         102              <NA>   basic
#> 4         104  dave@example.com   basic
#> 5         105   eve@example.com premium
nrow(february)  # Same row count
#> [1] 5

But it will silently corrupt your analysis.
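
Here is a minimal sketch of that corruption, assuming dplyr is attached and using a hypothetical invoices table with one invoice per customer. Joining it against the February export silently inflates the result:

library(dplyr)

invoices <- data.frame(
  customer_id = c(101, 102, 103, 104, 105),
  invoice_total = c(250, 120, 310, 90, 180)
)

invoices |>
  left_join(february, by = "customer_id") |>
  nrow()
# 6 rows instead of 5: customer 102's invoice now appears twice,
# and customer 103's invoice has no matching February record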


The Solution: Make Assumptions Explicit

keyed catches these issues by making your assumptions explicit:

# Define what you expect: customer_id is unique
january_keyed <- january |>
  key(customer_id) |>
  assume_no_na(email)

# This works - January data is clean
january_keyed
#> # A keyed tibble: 5 x 3
#> # Key:            customer_id
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1         101 alice@example.com premium
#> 2         102 bob@example.com   basic  
#> 3         103 carol@example.com premium
#> 4         104 dave@example.com  basic  
#> 5         105 eve@example.com   premium

Now try the same with February’s corrupted data:

# This is flagged immediately - duplicates detected
february |>
  key(customer_id)
#> Warning: Key is not unique.
#>  1 duplicate key value(s) found.
#>  Key columns: customer_id
#> # A keyed tibble: 5 x 3
#> # Key:            customer_id
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1         101 alice@example.com premium
#> 2         102 bob@example.com   basic  
#> 3         102 NA                basic  
#> 4         104 dave@example.com  basic  
#> 5         105 eve@example.com   premium

The warning catches the problem at import time, not downstream when you’re debugging a mysterious row count mismatch.


Workflow 1: Monthly Data Validation

Goal: Validate each month’s export against expected constraints before processing.

Challenge: Data quality varies month-to-month. Silent corruption causes cascading errors.

Strategy: Define keys and assumptions once, apply consistently to each import.

Define a validation function

validate_customer_export <- function(df) {
  df |>
    key(customer_id) |>
    assume_no_na(email) |>
    assume_nrow(min = 1)
}

# January: passes
january_clean <- validate_customer_export(january)
summary(january_clean)
#> 
#> ── Keyed Data Frame Summary
#> Dimensions: 5 rows x 3 columns
#> 
#> Key columns: customer_id
#>  Key is unique
#> 
#> Row IDs: none
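
Running the same function on the February export surfaces the problems at import time rather than downstream. A sketch of that check (key() warns about the duplicate key as shown earlier; the missing email should be flagged by assume_no_na()):

# February: the duplicate customer_id and missing email are flagged here
february_checked <- validate_customer_export(february)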

Keys survive transformations

Once defined, keys persist through dplyr operations:

# Filter preserves key
premium_customers <- january_clean |>
  filter(segment == "premium")

has_key(premium_customers)
#> [1] TRUE
get_key_cols(premium_customers)
#> [1] "customer_id"

# Mutate preserves key
enriched <- january_clean |>
  mutate(domain = sub(".*@", "", email))

has_key(enriched)
#> [1] TRUE

Graceful degradation

If an operation breaks uniqueness, keyed drops the key and warns you rather than letting the problem pass silently:

# This creates duplicates - key is dropped with warning
january_clean |>
  mutate(customer_id = 1)
#> Warning: Key modified and is no longer unique.
#> # A tibble: 5 × 3
#>   customer_id email             segment
#>         <dbl> <chr>             <chr>  
#> 1           1 alice@example.com premium
#> 2           1 bob@example.com   basic  
#> 3           1 carol@example.com premium
#> 4           1 dave@example.com  basic  
#> 5           1 eve@example.com   premium

The warning tells you exactly what happened. No silent corruption.


Workflow 2: Safe Joins

Goal: Join customer data with orders without accidentally duplicating rows.

Challenge: Join cardinality mistakes are common and hard to debug. A “one-to-one” join that’s actually one-to-many silently inflates your data.

Strategy: Use diagnose_join() to understand cardinality before joining.

Create sample data

customers <- data.frame(
  customer_id = 1:5,
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  tier = c("gold", "silver", "gold", "bronze", "silver")
) |>
  key(customer_id)

orders <- data.frame(
  order_id = 1:8,
  customer_id = c(1, 1, 2, 3, 3, 3, 4, 5),
  amount = c(100, 150, 200, 50, 75, 125, 300, 80)
) |>
  key(order_id)

Diagnose before joining

diagnose_join(customers, orders, by = "customer_id", use_joinspy = FALSE)
#> 
#> ── Join Diagnosis
#> Cardinality: one-to-many
#> x: 5 rows, unique
#> y: 8 rows, 3 duplicates

The diagnosis shows:

  • Cardinality is one-to-many: Each customer can have multiple orders

  • Coverage: Shows how many keys match vs. don’t match

Now you know what to expect. A left_join() will create 8 rows (one per order), not 5 (one per customer).
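
A quick sanity check of that prediction, assuming dplyr is attached (how keyed’s key metadata carries through the join is not shown here):

customers |>
  left_join(orders, by = "customer_id") |>
  nrow()
# 8: one row per order, as the diagnosis predicted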

Compare key structures

compare_keys(customers, orders)
#> 
#> ── Key Comparison
#> Comparing on: customer_id
#> 
#> x: 5 unique keys
#> y: 5 unique keys
#> 
#> Common: 5 (100.0% of x)
#> Only in x: 0
#> Only in y: 0

This shows the join key exists in both tables but with different uniqueness properties—essential information before joining.


Workflow 3: Row Identity Tracking

Goal: Track which original rows survive through a complex pipeline.

Challenge: After filtering, aggregating, and joining, you lose track of which source rows contributed to your final data.

Strategy: Use add_id() to attach stable identifiers that survive transformations.

Add row IDs

# Add UUIDs to rows
customers_tracked <- customers |>
  add_id()

customers_tracked
#> # A keyed tibble: 5 x 4
#> # Key:            customer_id | .id
#>   .id                                  customer_id name  tier  
#>   <chr>                                      <int> <chr> <chr> 
#> 1 7c04f88e-e5bd-4329-a3ba-d4f8f65f9c7a           1 Alice gold  
#> 2 e0aafb43-0892-43ad-8039-ad8733617c25           2 Bob   silver
#> 3 8119100f-1e5c-4338-bd15-677017694593           3 Carol gold  
#> 4 42cfb1b6-1b7c-4e4d-9bb4-4dc35bba61c5           4 Dave  bronze
#> 5 b8b047db-1474-47b1-a73d-c3bceea59076           5 Eve   silver

IDs survive transformations

# Filter: IDs persist
gold_customers <- customers_tracked |>
  filter(tier == "gold")

get_id(gold_customers)
#> [1] "7c04f88e-e5bd-4329-a3ba-d4f8f65f9c7a"
#> [2] "8119100f-1e5c-4338-bd15-677017694593"

# Compare with original
compare_ids(customers_tracked, gold_customers)
#> $lost
#> [1] "e0aafb43-0892-43ad-8039-ad8733617c25"
#> [2] "42cfb1b6-1b7c-4e4d-9bb4-4dc35bba61c5"
#> [3] "b8b047db-1474-47b1-a73d-c3bceea59076"
#> 
#> $gained
#> character(0)
#> 
#> $preserved
#> [1] "7c04f88e-e5bd-4329-a3ba-d4f8f65f9c7a"
#> [2] "8119100f-1e5c-4338-bd15-677017694593"

The comparison shows exactly which rows were lost (filtered out) and which were preserved.
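
Because the result is a plain list, it is easy to turn into a pipeline guard. A small sketch using only the structure shown above:

# Fail loudly if a transformation gained rows it should not have
cmp <- compare_ids(customers_tracked, gold_customers)
stopifnot(length(cmp$gained) == 0)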

Combining data with ID handling

When appending new data, bind_id() handles ID conflicts:

batch1 <- data.frame(x = 1:3) |> add_id()
batch2 <- data.frame(x = 4:6)  # No IDs yet

# bind_id assigns new IDs to batch2 and checks for conflicts
combined <- bind_id(batch1, batch2)
combined
#>                                    .id x
#> 1 e19bac07-f676-4724-b252-b5f1116d5a5a 1
#> 2 852c2ec5-95f1-439f-a6f9-fb7a31e77e42 2
#> 3 d4b115d6-6200-4ed1-ad63-c79eb6cfb722 3
#> 4 6fd72f58-a192-4094-9daa-fd259bcc3f6b 4
#> 5 6abf5c8c-9846-41ac-bcfc-ba6e94a66eb6 5
#> 6 7d12d6fc-5807-4f57-a935-10278143c5c9 6


Workflow 4: Drift Detection

Goal: Detect when data changes unexpectedly between pipeline runs.

Challenge: Reference data (lookup tables, dimension tables) changes upstream without notice. Your pipeline silently uses stale assumptions.

Strategy: Commit snapshots with commit_keyed() and check for drift with check_drift().

Commit a reference snapshot

# Commit current state as reference
reference_data <- data.frame(
  region_id = c("US", "EU", "APAC"),
  tax_rate = c(0.08, 0.20, 0.10)
) |>
  key(region_id) |>
  commit_keyed()
#>  Snapshot committed: 76a76466...

Check for drift

# No changes yet
check_drift(reference_data)
#> 
#> ── Drift Report
#>  No drift detected
#> Snapshot: 76a76466... (2026-01-24 01:27)

Detect changes

# Simulate upstream change: EU tax rate changed
modified_data <- reference_data
modified_data$tax_rate[2] <- 0.21

# Drift detected!
check_drift(modified_data)
#> 
#> ── Drift Report
#> ! Drift detected
#> Snapshot: 76a76466... (2026-01-24 01:27)
#>  Key values changed
#>  Cell values modified

The drift report shows exactly what changed, letting you decide whether to accept the new data or investigate.
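
If the change turns out to be legitimate, one option (a sketch, assuming commit_keyed() can be re-run to record a new reference snapshot) is to commit the modified data as the reference going forward:

# Accept the new tax rates as the reference for future drift checks
reference_data <- modified_data |>
  commit_keyed()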

Cleanup

# Remove snapshots when done
clear_all_snapshots()
#> ! This will remove 1 snapshot(s) from cache.
#>  Cleared 1 snapshot(s).

Quick Reference

Core Functions

Function                    Purpose
key()                       Define key columns (validates uniqueness)
unkey()                     Remove key
has_key(), get_key_cols()   Query key status

Assumption Checks

Function            Validates
assume_unique()     No duplicate values
assume_no_na()      No missing values
assume_complete()   All expected values present
assume_coverage()   Reference values covered
assume_nrow()       Row count within bounds

Diagnostics

Function            Purpose
diagnose_join()     Analyze join cardinality
compare_keys()      Compare key structures
compare_ids()       Compare row identities
find_duplicates()   Find duplicate key values
key_status()        Quick status summary

Row Identity

Function     Purpose
add_id()     Add UUIDs to rows
get_id()     Retrieve row IDs
bind_id()    Combine data with ID handling
make_id()    Create deterministic IDs from columns
check_id()   Validate ID integrity

Drift Detection

Function            Purpose
commit_keyed()      Save reference snapshot
check_drift()       Compare against snapshot
list_snapshots()    View saved snapshots
clear_snapshot()    Remove specific snapshot

When to Use Something Else

keyed is designed for flat-file workflows without database infrastructure. If you need any of the following, reach for a heavier tool:

Need                   Better Alternative
Enforced schema        Database (SQLite, DuckDB)
Version history        Git, git2r
Full data validation   pointblank, validate
Production pipelines   targets

keyed fills a specific gap: lightweight key tracking for exploratory and semi-structured workflows where heavier tools add friction.


See Also