At the console, we see the wrong row count, fix the key, and re-run.
In a scheduled script, the same issue goes unnoticed until someone looks
at the output. This vignette wires joinspy into automated pipelines so
that join problems surface as errors or log entries. Every example uses
synthetic data and tempfile() paths, so the whole thing
runs end to end.
Assertions with key_check()
key_check() returns a single logical – TRUE
if no issues were detected, FALSE otherwise. The simplest
assertion wraps it in stopifnot():
orders <- data.frame(
customer_id = c("C01", "C02", "C03", "C04"),
amount = c(100, 200, 150, 300),
stringsAsFactors = FALSE
)
customers <- data.frame(
customer_id = c("C01", "C02", "C03", "C04"),
region = c("East", "West", "East", "North"),
stringsAsFactors = FALSE
)
# This passes -- keys are clean
stopifnot(key_check(orders, customers, by = "customer_id", warn = FALSE))When the keys have problems, the script halts:
orders_dirty <- data.frame(
customer_id = c("C01", "C02 ", "C03 ", "C04"),
amount = c(100, 200, 150, 300),
stringsAsFactors = FALSE
)
stopifnot(key_check(orders_dirty, customers, by = "customer_id", warn = FALSE))
#> Error:
#> ! key_check(orders_dirty, customers, by = "customer_id", warn = FALSE) is not TRUEWith warn = FALSE, the printed diagnostics are
suppressed – in a cron job or CI pipeline, we want the script to fail
hard rather than print warnings. An explicit
if/stop gives us a custom error message:
if (!key_check(orders_dirty, customers, by = "customer_id", warn = FALSE)) {
stop("Key quality check failed for orders-customers join. ",
"Run join_spy() interactively for details.", call. = FALSE)
}
#> Error:
#> ! Key quality check failed for orders-customers join. Run join_spy() interactively for details.We can also chain the assertion with a repair step – run
key_check() first, repair on failure, then re-check:
ok <- key_check(orders_dirty, customers, by = "customer_id", warn = FALSE)
if (!ok) {
repaired <- join_repair(
orders_dirty, customers,
by = "customer_id",
trim_whitespace = TRUE,
remove_invisible = TRUE
)
orders_clean <- repaired$x
customers_clean <- repaired$y
# Re-check after repair
stopifnot(key_check(orders_clean, customers_clean,
by = "customer_id", warn = FALSE))
}
#> ✔ Repaired 2 value(s)key_check() is a binary pass/fail gate;
join_spy() builds a full report with match rates, expected
row counts, and categorized issues. In production,
key_check() runs on every execution.
join_spy() is what we reach for when an assertion fails and
we need to understand why.
Silent Joins in Pipelines
The *_join_spy() wrappers print diagnostic output by
default. In a scheduled script, .quiet = TRUE suppresses
all printed output while still computing the report internally.
sensors <- data.frame(
sensor_id = c("S01", "S02", "S03", "S04"),
location = c("Roof", "Basement", "Lobby", "Garage"),
stringsAsFactors = FALSE
)
readings <- data.frame(
sensor_id = c("S01", "S02", "S03", "S05"),
temperature = c(22.1, 18.5, 21.0, 19.3),
stringsAsFactors = FALSE
)
# Nothing printed
result <- left_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)The report is still available via last_report():
rpt <- last_report()
rpt$match_analysis$match_rate
#> [1] 0.75The join runs silently; later, we pull the report and check its contents programmatically:
rpt <- last_report()
if (rpt$match_analysis$match_rate < 0.95) {
warning(sprintf(
"Low match rate (%.1f%%) in sensor join -- check for missing sensor IDs",
rpt$match_analysis$match_rate * 100
))
}
#> Warning: Low match rate (75.0%) in sensor join -- check for missing sensor IDsThe report object is a plain list, so standard R subsetting works for arbitrarily complex validation logic.
One caveat: last_report() stores only the most recent
report. If a script performs three joins in sequence, only the third
report survives. To retain earlier reports, capture them explicitly:
result1 <- left_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
report1 <- last_report()
# ... later ...
result2 <- inner_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
report2 <- last_report()Cardinality Guards
A join that was one-to-one in development can become many-to-many in
production when upstream data changes. join_strict()
enforces a cardinality constraint and throws an error if it is
violated.
In development, we use detect_cardinality() to
understand the actual relationship:
products <- data.frame(
product_id = c("P1", "P2", "P3"),
name = c("Widget", "Gadget", "Gizmo"),
stringsAsFactors = FALSE
)
line_items <- data.frame(
product_id = c("P1", "P1", "P2", "P3", "P3"),
order_id = c(101, 102, 103, 104, 105),
stringsAsFactors = FALSE
)
detect_cardinality(products, line_items, by = "product_id")
#> ℹ Detected cardinality: "1:m"
#> Right duplicates: 2 key(s)One-to-many: each product appears once in products but
can appear multiple times in line_items. We encode that
expectation in production:
result <- join_strict(
products, line_items,
by = "product_id",
type = "left",
expect = "1:m"
)
nrow(result)
#> [1] 5If someone loads a products table with duplicate product
IDs, the script fails immediately:
products_bad <- data.frame(
product_id = c("P1", "P1", "P2", "P3"),
name = c("Widget", "Widget v2", "Gadget", "Gizmo"),
stringsAsFactors = FALSE
)
join_strict(
products_bad, line_items,
by = "product_id",
type = "left",
expect = "1:m"
)
#> Error:
#> ! Cardinality violation: expected 1:m but found m:m
#> Left duplicates: 1, Right duplicates: 2The four cardinality levels:
1:1 – lookup to lookup. Each key appears exactly once on both sides.
1:m – reference on the left, transactions on the right (products to line items, stations to hourly readings).
m:1 – transactions on the left, lookup on the right (sales joined to a region table).
m:m – duplicates on both sides. Almost always a bug; requiring an explicit
expect = "m:m"acts as a speed bump.
In practice, "1:m" and "m:1" cover most
production joins. detect_cardinality() confirms the
relationship during development; the expect value is then
hard-coded in the production script.
check_cartesian() solves a different problem: it warns
about Cartesian product explosion when a key has many duplicates on
both sides. A join can violate a "1:1" constraint
without triggering a Cartesian explosion (one extra duplicate is
enough), and a "m:m" join can produce a massive product
that join_strict() would allow. The two functions
complement each other.
Logging and Audit Trails
Manual logging
log_report() writes a single report to a file. The
format depends on the file extension:
report <- join_spy(sensors, readings, by = "sensor_id")
# Text format -- human-readable
txt_log <- tempfile(fileext = ".log")
log_report(report, txt_log)
#> ✔ Report logged to C:/Temp\Rtmp2BnYNR\file31bf43ca12605.log
cat(readLines(txt_log), sep = "\n")
#> Logged: 2026-03-31 23:21:09
#> ------------------------------------------------------------
#> Join Key: sensor_id
#>
#> Left Table (x):
#> Rows: 4
#> Unique keys: 4
#> Duplicated keys:
#> NA keys: 0
#>
#> Right Table (y):
#> Rows: 4
#> Unique keys: 4
#> Duplicated keys:
#> NA keys: 0
#>
#> Match Analysis:
#> Keys in both: 3
#> Keys only in left: 1
#> Keys only in right: 1
#> Match rate: 75%
#>
#> Expected Rows:
#> inner_join: 3
#> left_join: 4
#> right_join: 4
#> full_join: 5
#>
#> Issues Detected: 1
#> near_match: 1
#> ============================================================
unlink(txt_log)
# JSON format -- machine-readable
json_log <- tempfile(fileext = ".json")
log_report(report, json_log)
#> ✔ Report logged to C:/Temp\Rtmp2BnYNR\file31bf41787369d.json
cat(readLines(json_log), sep = "\n")
#> {
#> "by": "sensor_id",
#> "x_summary": {
#> "n_rows": 4,
#> "n_unique": 4,
#> "n_duplicated": null,
#> "n_na": 0
#> },
#> "y_summary": {
#> "n_rows": 4,
#> "n_unique": 4,
#> "n_duplicated": null,
#> "n_na": 0
#> },
#> "match_analysis": {
#> "n_matched": 3,
#> "n_left_only": 1,
#> "n_right_only": 1,
#> "match_rate": 0.75
#> },
#> "expected_rows": {
#> "inner": 3,
#> "left": 4,
#> "right": 4,
#> "full": 5
#> },
#> "n_issues": 1,
#> "issue_types": "near_match",
#> "logged_at": "2026-03-31 23:21:09"
#> }
unlink(json_log)Text format works for tailing logs during a batch run; JSON format
feeds into monitoring systems or downstream scripts. Reports can also be
saved as .rds files, which preserves the full R object for
later interactive inspection.
Automatic logging
For scripts with many joins, set_log_file() at the top
is cleaner than calling log_report() after each one. Every
subsequent *_join_spy() call appends its report to the
file.
auto_log <- tempfile(fileext = ".log")
set_log_file(auto_log, format = "text")
#> ℹ Automatic logging enabled: C:/Temp\Rtmp2BnYNR\file31bf429b81d46.log
# These joins are automatically logged
result1 <- left_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
result2 <- inner_join_spy(sensors, readings, by = "sensor_id", .quiet = TRUE)
# Check what got logged
cat(readLines(auto_log), sep = "\n")
#>
#> Logged: 2026-03-31 23:21:09
#> ------------------------------------------------------------
#> Join Key: sensor_id
#>
#> Left Table (x):
#> Rows: 4
#> Unique keys: 4
#> Duplicated keys:
#> NA keys: 0
#>
#> Right Table (y):
#> Rows: 4
#> Unique keys: 4
#> Duplicated keys:
#> NA keys: 0
#>
#> Match Analysis:
#> Keys in both: 3
#> Keys only in left: 1
#> Keys only in right: 1
#> Match rate: 75%
#>
#> Expected Rows:
#> inner_join: 3
#> left_join: 4
#> right_join: 4
#> full_join: 5
#>
#> Issues Detected: 1
#> near_match: 1
#> ============================================================
#>
#> Logged: 2026-03-31 23:21:09
#> ------------------------------------------------------------
#> Join Key: sensor_id
#>
#> Left Table (x):
#> Rows: 4
#> Unique keys: 4
#> Duplicated keys:
#> NA keys: 0
#>
#> Right Table (y):
#> Rows: 4
#> Unique keys: 4
#> Duplicated keys:
#> NA keys: 0
#>
#> Match Analysis:
#> Keys in both: 3
#> Keys only in left: 1
#> Keys only in right: 1
#> Match rate: 75%
#>
#> Expected Rows:
#> inner_join: 3
#> left_join: 4
#> right_join: 4
#> full_join: 5
#>
#> Issues Detected: 1
#> near_match: 1
#> ============================================================
# Clean up
set_log_file(NULL)
#> ℹ Automatic logging disabled
unlink(auto_log)Automatic logging only triggers from *_join_spy()
wrappers. join_strict() and bare merge() calls
are not logged – the wrappers are the instrumented path. To combine
cardinality enforcement with logging, run
detect_cardinality() as a separate check and use a
*_join_spy() wrapper for the actual join.
get_log_file() returns the current log path (or
NULL if logging is disabled):
# Only log if logging is configured
if (!is.null(get_log_file())) {
message("Logging is active at: ", get_log_file())
}Sampling for Large Datasets
The sample parameter in join_spy() runs the
analysis on a random subset while the actual join (via a
*_join_spy() wrapper) still operates on the full data.
# Simulate a large dataset
set.seed(42)
big_orders <- data.frame(
customer_id = sample(paste0("C", sprintf("%04d", 1:5000)), 50000, replace = TRUE),
amount = round(runif(50000, 10, 500), 2),
stringsAsFactors = FALSE
)
big_customers <- data.frame(
customer_id = paste0("C", sprintf("%04d", 1:6000)),
region = sample(c("North", "South", "East", "West"), 6000, replace = TRUE),
stringsAsFactors = FALSE
)
# Full analysis
system.time(report_full <- join_spy(big_orders, big_customers, by = "customer_id"))
#> user system elapsed
#> 0.11 0.06 0.18
# Sampled analysis
system.time(report_sampled <- join_spy(big_orders, big_customers,
by = "customer_id", sample = 5000))
#> user system elapsed
#> 0.09 0.00 0.09The sampled report is approximate – match rates and duplicate counts are estimated from the subset. For production monitoring, we typically care whether the match rate is roughly 95% or roughly 60%, not whether it is 94.7% or 95.1%. Sampling catches systemic problems (wrong key column, widespread encoding issues, duplicate explosion) with a fraction of the runtime.
Sampling can miss rare issues. If 0.1% of keys have a zero-width space, a 5,000-row sample from a 10-million-row table might not include any. Running full diagnostics periodically (weekly, or when the upstream source changes) alongside sampled daily runs covers both speed and thoroughness.
A Complete Production Pattern
Here is a realistic production workflow: a nightly job loads order and customer data, validates keys, repairs if needed, joins with cardinality enforcement, and logs everything.
# ============================================================
# Nightly order enrichment pipeline
# ============================================================
# --- Setup logging ---
pipeline_log <- tempfile(fileext = ".log")
set_log_file(pipeline_log, format = "text")
#> ℹ Automatic logging enabled: C:/Temp\Rtmp2BnYNR\file31bf4120083b.log
# --- Load data (simulated) ---
orders <- data.frame(
order_id = 1:6,
customer_id = c("C001", "C002 ", "C003", "C003", "C004", "C005"),
amount = c(150, 230, 89, 410, 320, 175),
stringsAsFactors = FALSE
)
customers <- data.frame(
customer_id = c("C001", "C002", "C003", "C004", "C005", "C006"),
name = c("Acme Corp", "Globex", "Initech", "Umbrella", "Soylent", "Wonka"),
tier = c("gold", "silver", "gold", "bronze", "silver", "gold"),
stringsAsFactors = FALSE
)
# --- Gate 1: key quality assertion ---
keys_ok <- key_check(orders, customers, by = "customer_id", warn = FALSE)
if (!keys_ok) {
message("Key issues detected -- attempting repair")
repaired <- join_repair(
orders, customers,
by = "customer_id",
trim_whitespace = TRUE,
remove_invisible = TRUE
)
orders <- repaired$x
customers <- repaired$y
}
#> Key issues detected -- attempting repair
#> ✔ Repaired 1 value(s)
# --- Gate 2: cardinality check ---
card <- detect_cardinality(orders, customers, by = "customer_id")
#> ℹ Detected cardinality: "m:1"
#> Left duplicates: 1 key(s)
if (card == "m:m") {
set_log_file(NULL)
unlink(pipeline_log)
stop("Unexpected m:m cardinality in orders-customers join", call. = FALSE)
}
# --- Join (with auto-logging via *_join_spy) ---
enriched <- left_join_spy(orders, customers, by = "customer_id", .quiet = TRUE)
# --- Gate 3: row count sanity check ---
# A left join should never lose rows from the left table
if (nrow(enriched) < nrow(orders)) {
set_log_file(NULL)
unlink(pipeline_log)
stop("Row count decreased after left join -- possible data corruption",
call. = FALSE)
}
# --- Output ---
message(sprintf("Pipeline complete: %d enriched orders", nrow(enriched)))
#> Pipeline complete: 6 enriched orders
head(enriched)
#> customer_id order_id amount name tier
#> 1 C001 1 150 Acme Corp gold
#> 2 C002 2 230 Globex silver
#> 3 C003 3 89 Initech gold
#> 4 C003 4 410 Initech gold
#> 5 C004 5 320 Umbrella bronze
#> 6 C005 6 175 Soylent silver
# --- Review the log ---
if (file.exists(pipeline_log)) {
cat(readLines(pipeline_log), sep = "\n")
}
#>
#> Logged: 2026-03-31 23:21:10
#> ------------------------------------------------------------
#> Join Key: customer_id
#>
#> Left Table (x):
#> Rows: 6
#> Unique keys: 5
#> Duplicated keys:
#> NA keys: 0
#>
#> Right Table (y):
#> Rows: 6
#> Unique keys: 6
#> Duplicated keys:
#> NA keys: 0
#>
#> Match Analysis:
#> Keys in both: 5
#> Keys only in left: 0
#> Keys only in right: 1
#> Match rate: 100%
#>
#> Expected Rows:
#> inner_join: 6
#> left_join: 6
#> right_join: 7
#> full_join: 7
#>
#> Issues Detected: 1
#> duplicates: 1
#> ============================================================
# --- Cleanup ---
set_log_file(NULL)
#> ℹ Automatic logging disabled
unlink(pipeline_log)The three gates catch different failure modes:
key_check() catches string-level problems and attempts
repair, detect_cardinality() halts on unexpected
many-to-many relationships, and the row count check guards against
anything the first two gates missed. Logging runs throughout because
set_log_file() was called at the top. The structure scales
– more tables, more gates, more joins, same pattern.