Why Your Keys Don't Match

A join runs without error but the row count is wrong – fewer rows than expected, or more. The columns look fine. The key values look identical in the console.

R’s merge() and dplyr’s *_join() compare key values byte-for-byte. When keys fail to match, they are genuinely different at the byte level: a trailing space, a case mismatch, or a zero-width Unicode character that occupies no screen width.

This vignette walks through five scenarios where joins fail for string-level reasons that resist casual inspection. The data is synthetic; the patterns come from real pipelines. The first two sections cover the mechanics: what R compares when it joins two key columns, and which checks join_spy() runs against them. The scenarios then put both to work.

What a Join Actually Compares

Every join reduces to the same primitive: for each key in one table, find the positions of equal keys in the other. merge() builds on match(); dplyr and data.table run their own join engines and apply the same standard of equality. Two character keys are equal when their byte sequences are equal, after R translates any declared encodings to a common representation. There is no trimming, no case folding, no Unicode normalization along the way. A single byte of difference makes two keys strangers.

identical() applies the same standard, which makes it a convenient way to test what a join will see:

a <- "CUST-1002"
b <- "CUST-1002 "
identical(a, b)
#> [1] FALSE
c(nchar(a), nchar(b))
#> [1]  9 10

The strings differ by one character, and charToRaw() shows which byte is responsible:

charToRaw(b)
#>  [1] 43 55 53 54 2d 31 30 30 32 20

The final 20 is an ordinary space. This one happens to be easy to spot once printed as bytes. The characters behind longer debugging sessions render as nothing at all, or as something indistinguishable from a space. The non-breaking space is the classic example:

space <- "\u0020"
nbsp <- "\u00A0"
space == nbsp
#> [1] FALSE
c(utf8ToInt(space), utf8ToInt(nbsp))
#> [1]  32 160

utf8ToInt() returns the Unicode code point of each character, which makes it the most direct inspection tool for a suspicious key: 32 is the space on your keyboard, 160 is the non-breaking space from a web page or a PDF. Both render as a gap of the same width.

Two strings can also be visually identical with no invisible characters anywhere in them. Unicode allows an accented letter to be written two ways: as one precomposed code point, or as a base letter followed by a combining accent.

e_one <- "\u00E9"
e_two <- "e\u0301"
e_one
#> [1] "é"
e_two
#> [1] "é"
e_one == e_two
#> [1] FALSE

Both print as the same accented e. The first is a single code point, the second is two, and R compares the sequences without normalizing them:

c(nchar(e_one), nchar(e_two))
#> [1] 1 2
utf8ToInt(e_one)
#> [1] 233
utf8ToInt(e_two)
#> [1] 101 769

A reference table typed by hand usually carries the precomposed form. Data that passed through certain text processors or through macOS file paths can arrive decomposed, and a column of city or person names will then fail to match a column that looks identical character for character.

Encoding adds one more layer. The same character can be stored as different bytes depending on the encoding. The umlauted u is one byte in latin1 and two bytes in UTF-8:

city_utf8 <- "Z\u00FCrich"
city_latin1 <- iconv(city_utf8, from = "UTF-8", to = "latin1")
c(Encoding(city_utf8), Encoding(city_latin1))
#> [1] "UTF-8"  "latin1"
charToRaw(city_utf8)
#> [1] 5a c3 bc 72 69 63 68
charToRaw(city_latin1)
#> [1] 5a fc 72 69 63 68

Different bytes, same declared meaning. As long as the declaration is correct, R translates before comparing and the keys still match:

city_utf8 == city_latin1
#> [1] TRUE

Trouble starts when bytes and declaration disagree. A latin1 file read as if it were UTF-8, or read with no declaration on a system that assumes one, produces strings whose bytes no longer mean what R thinks they mean. Accented keys then differ from their clean counterparts, and nothing in the comparison machinery reconciles them. join_spy() flags key columns that mix declared encodings, since mixed declarations within a single column usually mean some values entered the session by a different route than the rest.

None of the failures above produce a warning, because none of them are errors from R’s point of view. String equality is exact and well defined. A join that matches three keys out of six did what it was told, and a zero-row inner join is a legitimate result (an anti-join hopes for exactly that). merge() has no way to know that six matches were expected. The evidence that something went wrong lives in the difference between the rows you expected and the rows you got, which is why diagnosing it takes a separate step.

What join_spy() Scans For

Every scenario below ends with a join_spy() call, so it is worth laying out what that call checks. Everything happens inside the one function and comes back in the printed JoinReport. The chunks in this section reproduce individual checks with base R so the mechanics are visible; none of them are needed in normal use.

Whitespace

Leading and trailing whitespace is detected with anchored regular expressions over each character key column:

ids <- c("CUST-1001", "CUST-1002 ", " CUST-1004")
grepl("^\\s+", ids)
#> [1] FALSE FALSE  TRUE
grepl("\\s+$", ids)
#> [1] FALSE  TRUE FALSE

Values failing either test are collected and reported with their column and table. Interior whitespace is left alone: "New York" and "New York" differ by an interior space and fall to the near-match check further down.

Case

A case mismatch is defined relationally. A key counts only when it has no exact match in the other table and would gain one if both sides were lowercased:

crm <- c("ALICE@ACME.COM", "EVE@ACME.COM")
clicks <- c("alice@acme.com", "frank@acme.com")
crm[tolower(crm) %in% tolower(clicks)]
#> [1] "ALICE@ACME.COM"

Eve appears in neither form on the right, so she is a genuinely unmatched key; Alice is a case mismatch. The report keeps example pairs, which is how the output in Scenario 2 can name the specific addresses involved.

Invisible characters and encodings

The encoding check looks for two conditions: mixed declared encodings within one column, read off Encoding(), and the presence of any of five code points that render as nothing or as an ordinary-looking space:

hidden <- "[\u200B\u200C\u200D\uFEFF\u00A0]"
grepl(hidden, c("India\u200B", "India"), perl = TRUE)
#> [1]  TRUE FALSE

The watch list is U+200B (zero-width space), U+200C (zero-width non-joiner), U+200D (zero-width joiner), U+FEFF (byte order mark), and U+00A0 (non-breaking space). U+00A0 earns its place on the list because regex whitespace classes and trimws() treat it inconsistently across engines and locales, so a whitespace scan alone cannot be trusted to catch it.

Empty strings

Empty strings sit in a blind spot: they read as missing data to a human and as a valid key to a join.

c("" == "", "" == NA)
#> [1] TRUE   NA

An empty string equals another empty string, so two tables that both use "" as a placeholder will match those rows to each other. The report flags empty keys as informational, and join_repair(empty_to_na = TRUE) converts them when that match behavior is unwanted.

Types, factors, and numeric keys

Key columns can also be factors, integers, or doubles, and the type checks cover the seams between column types. Character against factor is reported as informational, since the join will coerce. Character against numeric is reported as a warning, because that coercion is a recurring source of surprises. When both sides are factors, their level sets are compared and any levels unique to one side are listed. Numeric keys get two checks of their own: keys with decimal parts, and keys large enough to exhaust double precision.

0.1 + 0.2 == 0.3
#> [1] FALSE
2^53 == 2^53 + 1
#> [1] TRUE

Both lines are standard floating-point behavior, and both translate directly into joins that quietly drop or merge rows: a key computed as 0.1 + 0.2 will never find a key stored as 0.3, and above 2^53 distinct integer IDs stop being distinguishable as doubles.

Duplicates, NAs, and predicted row counts

Separately from string quality, join_spy() summarizes each table’s keys: rows, unique keys, duplicated keys, NA keys. Duplicates matter because each match against a duplicated key multiplies rows; NA keys matter because they will not match. From the per-key counts the report predicts the result size of all four join types. The inner-join figure is the sum, over matched keys, of the product of each side’s counts, which is exactly the arithmetic that turns a harmless-looking duplicate into a memory problem. The report object also carries a heuristic memory estimate per join type for the same reason.

Near matches

Whatever remains unmatched after the exact analysis goes through a typo hunt. Unmatched left keys are compared against right keys by Levenshtein edit distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other, computed with utils::adist():

utils::adist("WDG102", "WDG-102")
#>      [,1]
#> [1,]    1
utils::adist("Asia Pacific ", "Asia Pacific")
#>      [,1]
#> [1,]    1

A pair is reported as a near match when its distance is 1 or 2. Keys shorter than three characters are skipped, since nearly everything sits within two edits of a string like "A7". A length pre-check discards pairs whose lengths already differ by more than two, and to keep the scan cheap the comparison covers at most 50 unmatched left keys against 100 right keys, reporting the 10 closest pairs. Near matches stay informational: "WDG102" sitting one edit from "WDG-102" is strong evidence of a typo, and the report points it out without changing anything.

Compound keys

For multi-column keys, an extra pass analyzes each key column on its own: unique values per side, matches between them, and a per-column match rate. The column with the lowest match rate is called out as the likely problem column. Scenario 5 shows what this buys: an overall match rate says something is wrong, and the per-column breakdown says where.

Scenario 1: The Excel Export

A retail analytics team receives monthly sales data from a distribution partner as a CSV exported from Excel. They join it against their internal customer database on customer_id. For two quarters, everything works. Then one month, 30% of the sales records stop matching. Nobody changed the code or the customer database. The partner’s IDs are all present in the internal system – or so it appears.

partner_sales <- data.frame(
  customer_id = c("CUST-1001", "CUST-1002 ", "CUST-1003",
                  " CUST-1004", "CUST-1005 ", "CUST-1006"),
  amount = c(2500, 1800, 3200, 950, 4100, 1600),
  stringsAsFactors = FALSE
)

internal_db <- data.frame(
  customer_id = c("CUST-1001", "CUST-1002", "CUST-1003",
                  "CUST-1004", "CUST-1005", "CUST-1006", "CUST-1007"),
  region = c("West", "East", "West", "South", "East", "North", "West"),
  stringsAsFactors = FALSE
)

Nothing in str() or print() reveals the issue – trailing spaces are invisible in console output. join_spy() catches it:

report <- join_spy(partner_sales, internal_db, by = "customer_id")
report
#> 
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: customer_id
#> 
#> ── Table Summary ──
#> 
#> Left table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 7 Unique keys: 7 Duplicate keys: 0 NA keys: 0
#> 
#> ── Match Analysis ──
#> 
#> Keys in both: 3
#> Keys only in left: 3
#> Keys only in right: 4
#> Match rate (left): "50%"
#> 
#> ── Issues Detected ──
#> 
#> ! Left column 'customer_id' has 3 value(s) with leading/trailing whitespace
#> ℹ 10 near-match(es) found (e.g., 'CUST-1002 ' ~ 'CUST-1002', ' CUST-1004' ~ 'CUST-1004', 'CUST-1002 ' ~ 'CUST-1001') - possible typos?
#> 
#> ── Expected Row Counts ──
#> 
#> inner_join: 3
#> left_join: 6
#> right_join: 7
#> full_join: 10

Three of the six partner IDs carry whitespace. "CUST-1002 " is a different string from "CUST-1002" as far as R is concerned. join_repair() trims both tables at once:

repaired <- join_repair(partner_sales, internal_db, by = "customer_id")
#> ✔ Repaired 3 value(s)
partner_fixed <- repaired$x
internal_fixed <- repaired$y

We can verify the repair worked:

key_check(partner_fixed, internal_fixed, by = "customer_id")
#> ✔ Key check passed: no issues detected

And now the join gives us what we expected:

result <- merge(partner_fixed, internal_fixed, by = "customer_id")
nrow(result)
#> [1] 6

The root cause was an Excel CONCATENATE formula that preserved trailing spaces from a variable-width source column. Excel renders "CUST-1002" and "CUST-1002 " identically, so nobody noticed. Trailing whitespace is the single most common join failure we see in practice.

Excel is one of several routes for this defect. SQL CHAR(n) columns pad values with spaces to a fixed width, and some database drivers deliver the padding to R intact. Fixed-width flat files do it by construction. Manual entry adds a trailing space whenever someone types one before tabbing out of a cell. Every route passes through an ingestion point, so the cheapest prevention is to trim key columns once, where external data enters the pipeline, and to put a key_check() call at that boundary. It would have turned this quarter’s silent 30% match drop into a loud failure on the first run.

Scenario 2: Two Databases, Two Conventions

A SaaS company wants to join CRM profiles to clickstream events for a churn analysis. The CRM stores email addresses in uppercase (a database migration decision from the late 1990s). The web app stores them in lowercase. Both systems are internally consistent.

crm_profiles <- data.frame(
  email = c("ALICE@ACME.COM", "BOB@ACME.COM", "CAROL@ACME.COM",
            "DAVE@ACME.COM", "EVE@ACME.COM"),
  plan = c("enterprise", "starter", "pro", "enterprise", "starter"),
  stringsAsFactors = FALSE
)

click_events <- data.frame(
  email = c("alice@acme.com", "bob@acme.com", "carol@acme.com",
            "dave@acme.com", "frank@acme.com"),
  page_views = c(47, 12, 89, 33, 5),
  stringsAsFactors = FALSE
)

An inner join returns zero rows. R’s string comparison is case-sensitive, so every key pair fails. join_spy() flags the situation before the join:

report <- join_spy(crm_profiles, click_events, by = "email")
report
#> 
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: email
#> 
#> ── Table Summary ──
#> 
#> Left table: Rows: 5 Unique keys: 5 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 5 Unique keys: 5 Duplicate keys: 0 NA keys: 0
#> 
#> ── Match Analysis ──
#> 
#> Keys in both: 0
#> Keys only in left: 5
#> Keys only in right: 5
#> Match rate (left): "0%"
#> 
#> ── Issues Detected ──
#> 
#> ! 4 key(s) would match if case-insensitive (e.g., 'ALICE@ACME.COM' vs 'alice@acme.com')
#> 
#> ── Expected Row Counts ──
#> 
#> inner_join: 0
#> left_join: 5
#> right_join: 5
#> full_join: 10

suggest_repairs() generates the fix:

suggest_repairs(report)
#> 
#> ── Suggested Repairs ───────────────────────────────────────────────────────────
#> # Standardize case:
#> x[["email"]] <- tolower(x[["email"]])
#> y[["email"]] <- tolower(y[["email"]])

Or we can use join_repair() directly, specifying case standardization:

repaired <- join_repair(
  crm_profiles, click_events,
  by = "email",
  standardize_case = "lower"
)
#> ✔ Repaired 5 value(s)

After lowercasing both sides, the inner join returns four matched rows (everyone except Eve, who has no click data, and Frank, who is not in the CRM):

result <- merge(repaired$x, repaired$y, by = "email")
nrow(result)
#> [1] 4
result
#>            email       plan page_views
#> 1 alice@acme.com enterprise         47
#> 2   bob@acme.com    starter         12
#> 3 carol@acme.com        pro         89
#> 4  dave@acme.com enterprise         33

Email addresses are case-insensitive by RFC 5321, so lowercasing is the right normalization here. For other identifier types (product codes, country abbreviations), "upper" may be more appropriate.

Conventions like these are system-wide and invisible from inside the system that holds them. Many database collations compare strings case-insensitively, so the CRM’s uppercase migration never broke an internal query; the web stack lowercased on write and was just as self-consistent. The mismatch exists only at the seam where the two systems meet, which is where an R join usually sits. The first join between two new sources is the right moment to run join_spy() and settle the convention question, and spelling the choice out through standardize_case documents it in code for the next person who touches the pipeline.

Scenario 3: The PDF Copy-Paste

A public health researcher compiles data from multiple sources for a systematic review. A few studies published supplementary tables only as PDF, so she copies the table from the PDF viewer, pastes into a spreadsheet, cleans up the columns, and reads the CSV into R. The data looks perfect – every country name is spelled correctly. But half the countries fail to match a reference population table.

# Simulating PDF copy-paste artifacts:
# \u00A0 is non-breaking space, \u200B is zero-width space
pdf_data <- data.frame(
  country = c("Brazil", "India\u200B", "Germany",
              "Japan\u00A0", "Canada", "France\u200B"),
  prevalence = c(12.3, 8.7, 5.1, 3.9, 6.2, 4.8),
  stringsAsFactors = FALSE
)

reference <- data.frame(
  country = c("Brazil", "India", "Germany", "Japan",
              "Canada", "France", "Italy"),
  population_m = c(214, 1408, 84, 125, 38, 68, 59),
  stringsAsFactors = FALSE
)

Printing the PDF data shows nothing wrong:

pdf_data$country
#> [1] "Brazil"  "India"   "Germany" "Japan "  "Canada"  "France"

The zero-width space after “India” and “France” occupies zero pixels. The non-breaking space after “Japan” renders like a regular space but is U+00A0, not U+0020 – trimws() will not always remove it. The merge reflects this:

nrow(merge(pdf_data, reference, by = "country"))
#> [1] 3

Three of six countries match. join_spy() detects the invisible characters:

report <- join_spy(pdf_data, reference, by = "country")
report
#> 
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: country
#> 
#> ── Table Summary ──
#> 
#> Left table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 7 Unique keys: 7 Duplicate keys: 0 NA keys: 0
#> 
#> ── Match Analysis ──
#> 
#> Keys in both: 3
#> Keys only in left: 3
#> Keys only in right: 4
#> Match rate (left): "50%"
#> 
#> ── Issues Detected ──
#> 
#> ! Left column 'country' has encoding issues (invisible chars or mixed encoding)
#> ℹ 3 near-match(es) found (e.g., 'India' ~ 'India', 'Japan ' ~ 'Japan', 'France' ~ 'France') - possible typos?
#> 
#> ── Expected Row Counts ──
#> 
#> inner_join: 3
#> left_join: 6
#> right_join: 7
#> full_join: 10

Alongside the encoding warning, the near-match check reaches the same conclusion from a different direction: each contaminated name sits one edit away from its clean counterpart in the reference table.

repaired <- join_repair(pdf_data, reference, by = "country")
#> ✔ Repaired 3 value(s)
nrow(merge(repaired$x, repaired$y, by = "country"))
#> [1] 6

Six matches. PDF copy-paste is the most common source of these artifacts, but web scraping, OCR output, and legacy mainframe exports can produce them too. One useful debugging trick outside of joinspy: nchar("India\u200B") returns 6, not 5. But that requires already suspecting the problem.

Each watched code point has a typical entry route. The non-breaking space comes from typesetting: PDF layout engines and HTML ( ) use it to keep words on one line, and the clipboard preserves it faithfully. Zero-width spaces are inserted by some content-management systems and word processors as line-break hints. The byte order mark arrives at the front of files saved as “UTF-8 with BOM” by some Windows editors, where it fuses with the first field name or the first key of a CSV. All of these survive visual inspection, and all of them disappear under join_repair()’s invisible-character pass. Stripping them right after read.csv() keeps them out of every downstream join at once.

Scenario 4: The Slowly Growing Mismatch

An e-commerce pipeline joins transaction records to a product catalogue. The pipeline ran cleanly for months, then match rates started drifting: 99% in January, 97% in February, 94% in March. Nobody noticed until finance flagged a margin discrepancy in April.

The code had not changed. A new data entry clerk had joined the warehouse team in December. The canonical product code format was "WDG-100" – uppercase prefix, dash, three-digit suffix. The new clerk sometimes omitted the dash, sometimes typed lowercase. The warehouse system did fuzzy matching internally, so it accepted codes that the ETL join’s exact comparison rejected.

# Product catalogue (canonical format)
catalogue <- data.frame(
  product_code = c("WDG-100", "WDG-101", "WDG-102",
                   "WDG-103", "WDG-104", "WDG-105"),
  product_name = c("Widget A", "Widget B", "Widget C",
                   "Widget D", "Widget E", "Widget F"),
  margin = c(0.35, 0.28, 0.42, 0.31, 0.39, 0.25),
  stringsAsFactors = FALSE
)

# Recent transactions (mix of old and new clerk entries)
transactions <- data.frame(
  product_code = c("WDG-100", "WDG-101", "WDG102",
                   "wdg-103", "WDG-104", "wdg105",
                   "WDG-100", "WDG103"),
  quantity = c(5, 3, 7, 2, 4, 6, 1, 8),
  stringsAsFactors = FALSE
)

Some codes match and some do not, which makes partial failures harder to spot than complete ones.

report <- join_spy(transactions, catalogue, by = "product_code")
report
#> 
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: product_code
#> 
#> ── Table Summary ──
#> 
#> Left table: Rows: 8 Unique keys: 7 Duplicate keys: 1 NA keys: 0
#> Right table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> 
#> ── Match Analysis ──
#> 
#> Keys in both: 3
#> Keys only in left: 4
#> Keys only in right: 3
#> Match rate (left): "42.9%"
#> 
#> ── Issues Detected ──
#> 
#> ! Left table has 1 duplicate key(s) affecting 2 rows - may cause row multiplication
#> ! 1 key(s) would match if case-insensitive (e.g., 'wdg-103' vs 'WDG-103')
#> ℹ 10 near-match(es) found (e.g., 'WDG102' ~ 'WDG-102', 'WDG103' ~ 'WDG-103', 'WDG102' ~ 'WDG-100') - possible typos?
#> 
#> ── Expected Row Counts ──
#> 
#> inner_join: 4
#> left_join: 8
#> right_join: 7
#> full_join: 11

The report stacks several findings: the case check catches "wdg-103", and the near-match check lines the dashless codes up against catalogue entries one or two edits away.

Here is where this scenario differs from the previous ones. join_repair() can fix the case issue, but it cannot insert the missing dashes – that requires domain knowledge about the code format.

We can do a dry run to see what join_repair() would fix:

join_repair(transactions, catalogue,
            by = "product_code",
            standardize_case = "upper",
            dry_run = TRUE)
#> 
#> ── Repair Preview (Dry Run) ────────────────────────────────────────────────────
#> 
#> ── Left table (x) ──
#> 
#> ℹ product_code: upper case (2)

After applying the mechanical fixes:

repaired <- join_repair(transactions, catalogue,
                        by = "product_code",
                        standardize_case = "upper")
#> ✔ Repaired 2 value(s)

The case issues are resolved, but the missing dashes remain. A manual transformation handles those:

# Manual fix: insert dash if missing in product codes matching the pattern
fix_codes <- function(codes) {
  gsub("^([A-Z]{3})(\\d)", "\\1-\\2", codes)
}
repaired$x$product_code <- fix_codes(repaired$x$product_code)

result <- merge(repaired$x, repaired$y, by = "product_code")
nrow(result)
#> [1] 8

All eight transactions match. join_repair() handles context-free transformations (trimming, case normalization, stripping invisible characters). Inserting a dash into "WDG102" requires knowing the canonical format – that fix has to come from someone who understands the data.

The upstream story here is the most general of the five: two systems with different matching strictness, connected by people. The warehouse software tolerated sloppy codes because its internal lookup was fuzzy; the ETL join was exact; the clerk had no way to see the difference. Keys typed by humans drift like this wherever entry goes unvalidated. A rule rejecting anything that fails ^[A-Z]{3}-\d{3}$ at the warehouse boundary would have caught every malformed code in December. The drift is also visible from the R side: a match rate sliding from 99% to 94% over three months is a measurable signal, and logging a report per pipeline run (vignette("production") covers log_report() and set_log_file()) turns that slide into an alert long before finance notices the margins.

Scenario 5: Compound Keys

Two government datasets need to be linked: regional economic indicators and regional population estimates, keyed on region and year. The year column is numeric and matches without trouble. The region column has a whitespace problem that affects only some records.

economics <- data.frame(
  region = c("North America", "Europe", "Asia Pacific ",
             "North America", "Europe", "Asia Pacific "),
  year = c(2022, 2022, 2022, 2023, 2023, 2023),
  gdp_growth = c(2.1, 1.8, 4.2, 1.9, 0.9, 3.8),
  stringsAsFactors = FALSE
)

population <- data.frame(
  region = c("North America", "Europe", "Asia Pacific",
             "North America", "Europe", "Asia Pacific"),
  year = c(2022, 2022, 2022, 2023, 2023, 2023),
  pop_millions = c(580, 450, 4300, 585, 448, 4350),
  stringsAsFactors = FALSE
)

In a compound key join, both columns must match. The year column is fine everywhere, but "Asia Pacific " with a trailing space will not match "Asia Pacific".

merged <- merge(economics, population, by = c("region", "year"))
nrow(merged)
#> [1] 4

Four rows instead of six. North America and Europe match; Asia Pacific does not, because of the trailing space.

join_spy() pinpoints which column in the compound key has the problem:

report <- join_spy(economics, population, by = c("region", "year"))
report
#> 
#> ── Join Diagnostic Report ──────────────────────────────────────────────────────
#> Join columns: region, year
#> 
#> ── Table Summary ──
#> 
#> Left table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> Right table: Rows: 6 Unique keys: 6 Duplicate keys: 0 NA keys: 0
#> 
#> ── Match Analysis ──
#> 
#> Keys in both: 4
#> Keys only in left: 2
#> Keys only in right: 2
#> Match rate (left): "66.7%"
#> 
#> ── Issues Detected ──
#> 
#> ! Left column 'region' has 1 value(s) with leading/trailing whitespace
#> ℹ 1 near-match(es) found (e.g., 'Asia Pacific ' ~ 'Asia Pacific') - possible typos?
#> 
#> ── Per-Column Breakdown ──
#> 
#> region: "66.7%" match rate (2/3)
#> year: "100%" match rate (2/2)
#> ℹ Lowest match rate: region
#> 
#> ── Expected Row Counts ──
#> 
#> inner_join: 4
#> left_join: 6
#> right_join: 6
#> full_join: 8

The per-column breakdown shows the year column matching fully while region falls short, and names region as the column with the lowest match rate. From there the whitespace warning above it identifies the exact values at fault.

repaired <- join_repair(economics, population, by = c("region", "year"))
#> ✔ Repaired 2 value(s)
result <- merge(repaired$x, repaired$y, by = c("region", "year"))
nrow(result)
#> [1] 6

Six rows. With compound keys, a string issue in any single column is enough to break the match. The more columns in the key, the more places a byte-level discrepancy can occur.

Compound keys generalize everything in this vignette: each column added to by is one more surface where any of these defects can land. Region-by-year panels, site-by-date field records, patient-by-visit tables: wherever observations are keyed by a combination of identifiers, the component columns tend to come from different upstream systems with different hygiene. An aggregate match rate cannot say which component is at fault. The per-column breakdown can, and checking it first rules out most candidate explanations before any bytes get inspected.

The Pattern

These five scenarios share three properties. The data looks correct to standard inspection tools – str(), summary(), print() all render the values identically. R returns fewer (or more) rows without a warning, because the key values genuinely differ at the byte level. And the fix is mechanical once the cause is known – trimming, lowercasing, or stripping invisible Unicode are all one-line operations.

join_spy() surfaces the cause directly, which is especially useful with data from external sources, manual entry, PDF extraction, or cross-system integrations.

Worked the other way around, the shared structure becomes a checklist. Each question in a join post-mortem maps to one call:

Is this join safe to run? join_spy(x, y, by) gives the full report: key summaries, match rates, detected issues, and expected row counts for all four join types.
I want a yes/no gate in a script. key_check(x, y, by) prints a short status and invisibly returns a logical; warn = FALSE makes it silent.
Which rows share a key? key_duplicates(data, by) returns the duplicated rows with a .n_duplicates count column.
What would a cleanup change? join_repair() with dry_run = TRUE previews the repairs without touching the data.
Fix the mechanical problems. join_repair() trims whitespace, standardizes case, strips the invisible characters, and optionally converts empty strings to NA.
Show me the fix as code. suggest_repairs(report) prints copy-pasteable snippets matched to the issues a report found.
The join already ran and the row count is wrong. join_explain(result, x, y, by) reconstructs where the rows went.
Two versions of a table differ and the change is unclear. join_diff(before, after, by) compares dimensions, columns, and key statistics.
This must never happen silently again. join_strict() performs the join and errors when an expectation like expect = "1:1" is violated.

The split between diagnosis and repair is deliberate. join_spy() never modifies data, and join_repair() only applies transformations that are correct regardless of what the keys mean: trimming, casing, stripping invisible characters. Anything that requires knowing the data, like the missing dashes of Scenario 4, stays a human decision, with the diagnostics pointing at it.

For the wider catalogue of join failures beyond strings (duplicate keys multiplying rows, NA keys, accidental cartesian products), see vignette("common-issues"). For a tour of the full API on one worked example, see vignette("quickstart"). The string-level defects collected here are the ones that resist inspection longest; one diagnostic call before joining data you did not generate yourself covers every one of them.

Gilles Colling

2026-06-13