Skip to contents

Analyzes two data frames before joining to detect potential issues and predict the outcome. Returns a detailed report of key quality, match rates, and detected problems.

Usage

join_spy(x, y, by, sample = NULL, ...)

Arguments

x

A data frame (left table in the join).

y

A data frame (right table in the join).

by

A character vector of column names to join by, or a named character vector for joins where column names differ (e.g., c("id" = "customer_id")).

sample

Integer or NULL. If provided, randomly sample this many rows from each table for faster diagnostics on large datasets. Default NULL (analyze all rows).

...

Reserved for future use.

Value

A JoinReport object with the following components:

x_summary

Summary statistics for keys in the left table

y_summary

Summary statistics for keys in the right table

match_analysis

Details of which keys will/won't match

issues

List of detected problems (duplicates, whitespace, etc.)

expected_rows

Predicted row counts for each join type

Details

This function detects the following common join issues:

  • Duplicate keys: Keys appearing multiple times, which cause row multiplication during joins

  • Whitespace: Leading or trailing spaces that prevent matches

  • Case mismatches: Keys that differ only by case (e.g., "ABC" vs "abc")

  • Encoding issues: Different character encodings or invisible Unicode characters

  • NA values: Missing values in key columns

Examples

# Create sample data with issues
orders <- data.frame(
  order_id = 1:5,
  customer_id = c("A", "B", "B", "C", "D ")
)
customers <- data.frame(
  customer_id = c("A", "B", "C", "E"),
  name = c("Alice", "Bob", "Carol", "Eve")
)

# Get diagnostic report
join_spy(orders, customers, by = "customer_id")