What couplr Does
couplr creates matched samples from two groups of observations. Given a “left” group (e.g., treatment) and a “right” group (e.g., control), it finds optimal pairings based on similarity across variables you specify. couplr supports both one-to-one matching (each treatment unit paired with one control) and full matching (variable-ratio groups where every unit is assigned).
Common use cases:
Matching treated patients to similar controls in observational studies
Pairing survey respondents for comparison
Creating balanced samples for causal inference
Full matching when discarding unmatched units is undesirable
Documentation Roadmap
| Vignette | Focus | Audience |
|---|---|---|
| Quick Start (this) | Basic matching with match_couples() | Everyone |
| Matching Workflows | Full pipeline: preprocessing, blocking, diagnostics | Researchers |
| Algorithms | Mathematical foundations, solver selection | Technical users |
| Comparison | vs MatchIt, optmatch, designmatch | Package evaluators |
Start here, then proceed to whichever vignette matches your use case.
Your First Match
The simplest workflow uses match_couples():
library(couplr)
library(dplyr)
# Create example data: treatment and control groups
set.seed(123)
treatment <- tibble(
id = 1:50,
age = rnorm(50, mean = 45, sd = 10),
income = rnorm(50, mean = 55000, sd = 12000)
)
control <- tibble(
id = 1:80,
age = rnorm(80, mean = 50, sd = 12),
income = rnorm(80, mean = 48000, sd = 15000)
)
# Match on age and income
result <- match_couples(
left = treatment,
right = control,
vars = c("age", "income"),
auto_scale = TRUE
)
#> Auto-selected scaling method: standardize
# View matched pairs
head(result$pairs)
#> # A tibble: 6 × 5
#> left_id right_id distance .age_diff .income_diff
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 1 46 0.779 -4.23 9144.
#> 2 2 11 0.257 -0.398 3441.
#> 3 3 52 0.594 1.36 7840.
#> 4 4 66 0.809 -7.87 -5112.
#> 5 5 65 0.530 1.30 6977.
#> 6 6 36 0.130 -1.43 62.2
What happened:
couplr calculated how similar each treatment unit is to each control unit
It found the optimal one-to-one pairing that minimizes total distance
Each treatment unit gets matched to exactly one control unit
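To see why global optimality matters, here is a minimal base-R sketch (not couplr code) where a greedy pairing is beaten by the optimal one on a tiny 2 x 2 distance matrix:

```r
# Two treated units (rows) and two controls (cols); entries are scaled distances
d <- matrix(c(0.1, 0.3,
              0.2, 0.9), nrow = 2, byrow = TRUE)

# Greedy: unit 1 grabs its closest control (col 1), forcing unit 2 onto col 2
greedy_total <- d[1, 1] + d[2, 2]   # 0.1 + 0.9 = 1.0

# Optimal: accept a slightly worse match for unit 1 so unit 2 matches well
optimal_total <- d[1, 2] + d[2, 1]  # 0.3 + 0.2 = 0.5
```

Optimal matching minimizes the total, not each pair in isolation.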
Understanding the Output
# Quick overview with summary()
summary(result)
#> Matching Result Summary
#> =======================
#>
#> Method: lap
#> Pairs matched: 50
#> Unmatched: 0 left, 30 right
#>
#> Distance Statistics:
#> Total: 19.6423
#> Mean: 0.3928
#> Min: 0.0615
#> Q1: 0.1845
#> Median: 0.2967
#> Q3: 0.5039
#> Max: 1.2184
#> SD: 0.2863
#>
#> Distance Percentiles:
#> 5%: 0.0993
#> 10%: 0.1234
#> 25%: 0.1845
#> 50%: 0.2967
#> 75%: 0.5039
#> 90%: 0.7819
#> 95%: 1.0396
# Or access specific info
result$info$n_matched
#> [1] 50
The result$pairs table contains:
left_id: Row number from the treatment group
right_id: Row number from the control group
distance: How different the matched units are (lower = more similar)
Why Scaling Matters
Without scaling, variables with larger values dominate the matching. Income (measured in thousands) would overwhelm age (measured in decades):
# BAD: Without scaling, income dominates
result_unscaled <- match_couples(
treatment, control,
vars = c("age", "income"),
auto_scale = FALSE
)
# GOOD: With scaling, both variables contribute equally
result_scaled <- match_couples(
treatment, control,
vars = c("age", "income"),
auto_scale = TRUE
)
#> Auto-selected scaling method: standardize
# Compare mean distances
cat("Unscaled mean distance:", round(mean(result_unscaled$pairs$distance), 1), "\n")
#> Unscaled mean distance: 2769.1
cat("Scaled mean distance:", round(mean(result_scaled$pairs$distance), 3), "\n")
#> Scaled mean distance: 0.393
Rule of thumb: Always use auto_scale = TRUE unless you have a specific reason not to.
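The effect of standardization can be sketched in base R. This illustration uses scale(), which couplr's "standardize" method is assumed to resemble; note how the raw distance between two units is driven almost entirely by income:

```r
# Two units measured on age (decades) and income (tens of thousands)
age    <- c(40, 50)
income <- c(50000, 62000)

# Raw Euclidean distance: the income gap swamps the age gap
raw_dist <- sqrt((age[1] - age[2])^2 + (income[1] - income[2])^2)

# After z-scoring each variable, both contribute on the same footing
z <- scale(cbind(age, income))
scaled_dist <- sqrt(sum((z[1, ] - z[2, ])^2))
```

Here raw_dist is about 12000 (essentially the income difference), while scaled_dist treats a one-SD gap in age the same as a one-SD gap in income.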
Checking Match Quality
After matching, verify that treatment and control groups are now balanced:
# Get the matched observations
matched_treatment <- treatment[result$pairs$left_id, ]
matched_control <- control[result$pairs$right_id, ]
# Compare means before and after matching
cat("BEFORE matching:\n")
#> BEFORE matching:
cat(" Age difference:", round(mean(treatment$age) - mean(control$age), 1), "years\n")
#> Age difference: -3.4 years
cat(" Income difference: $", round(mean(treatment$income) - mean(control$income), 0), "\n\n")
#> Income difference: $ 9266
cat("AFTER matching:\n")
#> AFTER matching:
cat(" Age difference:", round(mean(matched_treatment$age) - mean(matched_control$age), 1), "years\n")
#> Age difference: -1.3 years
cat(" Income difference: $", round(mean(matched_treatment$income) - mean(matched_control$income), 0), "\n")
#> Income difference: $ 2667
For formal balance assessment, use balance_diagnostics() (covered in Matching Workflows).
Large Datasets: Use Greedy Matching
For datasets larger than a few thousand observations, optimal
matching becomes slow. Use greedy_couples() instead; it’s
10-100x faster with nearly identical results:
# Create larger datasets
set.seed(456)
large_treatment <- tibble(
id = 1:2000,
age = rnorm(2000, 45, 10),
income = rnorm(2000, 55000, 12000)
)
large_control <- tibble(
id = 1:3000,
age = rnorm(3000, 50, 12),
income = rnorm(3000, 48000, 15000)
)
# Fast greedy matching
result_greedy <- greedy_couples(
large_treatment, large_control,
vars = c("age", "income"),
auto_scale = TRUE,
strategy = "row_best" # fastest strategy
)
#> Auto-selected scaling method: standardize
cat("Matched", result_greedy$info$n_matched, "pairs\n")
#> Matched 2000 pairs
cat("Mean distance:", round(mean(result_greedy$pairs$distance), 3), "\n")
#> Mean distance: 0.201
When to use which:
| Dataset size | Recommended function |
|---|---|
| < 1,000 per group | match_couples() |
| 1,000 - 5,000 | Either works; greedy is faster |
| > 5,000 | greedy_couples() |
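For intuition, greedy matching can be sketched in a few lines of base R. This sketch assumes strategy = "row_best" means each left unit, in order, takes its closest still-available right unit; the actual couplr implementation may differ:

```r
# Distance matrix: 2 left units (rows) x 3 right units (cols)
d <- matrix(c(0.1, 0.3, 0.6,
              0.2, 0.9, 0.4), nrow = 2, byrow = TRUE)

available <- rep(TRUE, ncol(d))   # which right units are still unmatched
match_of <- integer(nrow(d))      # chosen right unit for each left unit
for (i in seq_len(nrow(d))) {
  cols <- which(available)                 # candidates still on offer
  j <- cols[which.min(d[i, cols])]         # closest remaining right unit
  match_of[i] <- j
  available[j] <- FALSE                    # taken; remove from the pool
}
match_of  # left 1 -> right 1, left 2 -> right 3
```

Each left unit is visited once, which is why greedy scales so much better than optimal matching on large problems.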
Setting a Maximum Distance (Caliper)
Sometimes you want to reject poor matches rather than force bad
pairings. Use max_distance to set a caliper:
# Allow any match
result_loose <- match_couples(
treatment, control,
vars = c("age", "income"),
auto_scale = TRUE
)
#> Auto-selected scaling method: standardize
# Only allow close matches
result_strict <- match_couples(
treatment, control,
vars = c("age", "income"),
auto_scale = TRUE,
max_distance = 0.5 # reject pairs more different than this
)
#> Auto-selected scaling method: standardize
cat("Without caliper:", result_loose$info$n_matched, "pairs\n")
#> Without caliper: 50 pairs
cat("With caliper:", result_strict$info$n_matched, "pairs\n")
#> With caliper: 40 pairs
Stricter calipers mean fewer but better matches.
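The trade-off is easy to see in base R with made-up pair distances: applying a cutoff keeps fewer pairs, but the kept pairs are closer on average:

```r
# Hypothetical distances for five matched pairs
pair_dist <- c(0.06, 0.18, 0.30, 0.50, 0.78, 1.22)
caliper <- 0.5

sum(pair_dist <= caliper)              # pairs surviving the caliper: 4
mean(pair_dist[pair_dist <= caliper])  # mean of kept pairs: 0.26
mean(pair_dist)                        # mean of all pairs: ~0.51
```

Choosing the cutoff is a judgment call: tighter calipers improve match quality at the cost of sample size.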
Matching Within Groups (Blocking)
When you have natural groups in your data (e.g., hospitals, regions, study sites), you can match within each group separately. This ensures exact balance on the grouping variable.
First, create blocks with matchmaker(), then pass the
result to match_couples():
# Data from multiple hospital sites
set.seed(321)
treated <- tibble(
id = 1:60,
site = rep(c("Hospital A", "Hospital B", "Hospital C"), each = 20),
age = rnorm(60, 55, 10),
severity = rnorm(60, 5, 2)
)
controls <- tibble(
id = 1:90,
site = rep(c("Hospital A", "Hospital B", "Hospital C"), each = 30),
age = rnorm(90, 52, 12),
severity = rnorm(90, 4.5, 2.5)
)
# Step 1: Create blocks by hospital site
blocks <- matchmaker(
left = treated,
right = controls,
block_type = "group",
block_by = "site"
)
# Step 2: Match within each block
result_blocked <- match_couples(
left = blocks$left,
right = blocks$right,
vars = c("age", "severity"),
block_id = "block_id",
auto_scale = TRUE
)
#> Auto-selected scaling method: standardize
# Verify: matches stay within their block
result_blocked$pairs |> count(block_id)
#> # A tibble: 3 × 2
#> block_id n
#> <chr> <int>
#> 1 Hospital A 20
#> 2 Hospital B 20
#> 3 Hospital C 20
Blocking guarantees that Hospital A patients are only matched to Hospital A controls, etc.
Full Matching: Keep Every Unit
One-to-one matching discards unmatched controls. If you want every
unit in a group, use full_match(). It creates
variable-ratio groups (e.g., 1 treatment + 3 controls) that minimize
total distance:
result_full <- full_match(
left = treatment,
right = control,
vars = c("age", "income")
)
result_full
#>
#> Full Matching Result
#> ====================
#>
#> Groups formed: 50
#> Left units: 50 matched, 0 unmatched (of 50)
#> Right units: 80 matched, 0 unmatched (of 80)
#>
#> Right units per group: min=1, median=1, max=13
# Each group has one or more left and right units with matching weights
head(result_full$groups)
#> # A tibble: 6 × 4
#> group_id id side weight
#> <int> <chr> <chr> <dbl>
#> 1 1 1 left 1
#> 2 1 42 right 1
#> 3 2 2 left 1
#> 4 2 11 right 1
#> 5 3 3 left 1
#> 6 3 59 right 1
Full matching is useful when your control pool is much larger than
treatment and you don’t want to waste data. See
vignette("matching-workflows") for details on constraints
(min_controls, max_controls,
caliper) and the choice between
method = "optimal" (default, globally optimal) and
method = "greedy" (faster).
Other Matching Methods
couplr also supports several alternative matching strategies. Each is covered in detail in vignette("matching-workflows"):
- cem_match() — Coarsened exact matching: bins continuous variables and matches exactly within strata, avoiding model dependence
- subclass_match() — Propensity score subclassification: divides units into PS strata with target estimand weighting (ATT, ATE, ATC)
- ps_match() — Propensity score matching with a logit caliper
- cardinality_match() — Maximizes sample size subject to strict balance constraints
All result types work with balance_diagnostics(),
match_data(), and as_matchit() for ecosystem
interoperability with cobalt and marginaleffects.
Complete Example
Here’s a realistic workflow from start to finish:
# 1. Prepare your data
set.seed(789)
patients_treated <- tibble(
patient_id = paste0("T", 1:100),
age = rnorm(100, 62, 8),
bmi = rnorm(100, 28, 4),
smoker = sample(0:1, 100, replace = TRUE, prob = c(0.6, 0.4))
)
patients_control <- tibble(
patient_id = paste0("C", 1:200),
age = rnorm(200, 58, 10),
bmi = rnorm(200, 26, 5),
smoker = sample(0:1, 200, replace = TRUE, prob = c(0.7, 0.3))
)
# 2. Match on clinical variables
matched <- match_couples(
left = patients_treated,
right = patients_control,
vars = c("age", "bmi", "smoker"),
auto_scale = TRUE
)
#> Auto-selected scaling method: standardize
# 3. Check how many matched
cat("Treated patients:", nrow(patients_treated), "\n")
#> Treated patients: 100
cat("Successfully matched:", matched$info$n_matched, "\n")
#> Successfully matched: 100
cat("Match rate:", round(100 * matched$info$n_matched / nrow(patients_treated), 1), "%\n")
#> Match rate: 100 %
# 4. Extract matched samples for analysis
# left_id/right_id are row numbers returned as character, so convert before indexing
treated_matched <- patients_treated[as.integer(matched$pairs$left_id), ]
control_matched <- patients_control[as.integer(matched$pairs$right_id), ]
# 5. Verify balance
cat("\nBalance check (difference in means):\n")
#>
#> Balance check (difference in means):
cat(" Age:", round(mean(treated_matched$age) - mean(control_matched$age), 2), "\n")
cat(" BMI:", round(mean(treated_matched$bmi) - mean(control_matched$bmi), 2), "\n")
cat(" Smoker %:", round(100*(mean(treated_matched$smoker) - mean(control_matched$smoker)), 1), "\n")
Next Steps
You now know the basics of matching with couplr. Here’s where to go next:
For production research workflows:
- Matching Workflows covers preprocessing, blocking, formal balance diagnostics, and publication-ready output
For understanding algorithm choices:
- Algorithms explains when different solvers are faster or more appropriate
For comparing with other packages:
- Comparison shows how couplr differs from MatchIt, optmatch, and designmatch
Additional: Direct Assignment Problem Solving
If you need to solve assignment problems directly (not matching workflows), couplr also provides lower-level functions.
lap_solve(): Matrix-Based Assignment
Given a cost matrix where entry (i,j) is the cost of assigning row i to column j:
# Cost matrix: 3 workers x 3 tasks
cost <- matrix(c(
4, 2, 5,
3, 3, 6,
7, 5, 4
), nrow = 3, byrow = TRUE)
result <- lap_solve(cost)
print(result)
#> Assignment Result
#> =================
#>
#> # A tibble: 3 × 3
#> source target cost
#> <int> <int> <dbl>
#> 1 1 2 2
#> 2 2 1 3
#> 3 3 3 4
#>
#> Total cost: 9
#> Method: bruteforce
Row 1 is assigned to column 2 (cost 2), row 2 to column 1 (cost 3), row 3 to column 3 (cost 4). Total cost: 9.
Forbidden Assignments
Use NA or Inf for impossible
assignments:
cost_forbidden <- matrix(c(
4, 2, NA, # Row 1 cannot go to column 3
Inf, 3, 6, # Row 2 cannot go to column 1
7, 5, 4
), nrow = 3, byrow = TRUE)
lap_solve(cost_forbidden)
#> Assignment Result
#> =================
#>
#> # A tibble: 3 × 3
#> source target cost
#> <int> <int> <dbl>
#> 1 1 1 4
#> 2 2 2 3
#> 3 3 3 4
#>
#> Total cost: 11
#> Method: bruteforce
Maximization
For preference or profit maximization:
preferences <- matrix(c(
8, 5, 3,
4, 7, 6,
2, 4, 9
), nrow = 3, byrow = TRUE)
lap_solve(preferences, maximize = TRUE)
#> Assignment Result
#> =================
#>
#> # A tibble: 3 × 3
#> source target cost
#> <int> <int> <dbl>
#> 1 1 1 8
#> 2 2 2 7
#> 3 3 3 9
#>
#> Total cost: 24
#> Method: bruteforce
Grouped Data
Solve multiple assignment problems at once using grouped data frames:
# Weekly nurse-shift scheduling: solve each day separately
schedule <- tibble(
day = rep(c("Mon", "Tue", "Wed"), each = 9),
nurse = rep(rep(1:3, each = 3), 3),
shift = rep(1:3, 9),
cost = c(4,2,5, 3,3,6, 7,5,4, # Monday costs
5,3,4, 2,4,5, 6,4,3, # Tuesday costs
3,4,5, 4,2,6, 5,5,4) # Wednesday costs
)
# Solve all three days at once
schedule |>
group_by(day) |>
lap_solve(nurse, shift, cost)
#> # A tibble: 9 × 4
#> day source target cost
#> <chr> <int> <int> <dbl>
#> 1 Mon 1 2 2
#> 2 Mon 2 1 3
#> 3 Mon 3 3 4
#> 4 Tue 1 2 3
#> 5 Tue 2 1 2
#> 6 Tue 3 3 3
#> 7 Wed 1 1 3
#> 8 Wed 2 2 2
#> 9 Wed 3 3 4
This solves each day’s assignment problem independently and returns all results in one tidy table.
K-Best Solutions
Find multiple near-optimal solutions:
cost <- matrix(c(1, 2, 3, 4, 3, 2, 5, 4, 1), nrow = 3, byrow = TRUE)
kbest <- lap_solve_kbest(cost, k = 3)
print(kbest)
#> K-Best Assignment Results
#> =========================
#>
#> Number of solutions: 3
#>
#> Solution costs:
#> Rank 1: 5.0000
#> Rank 2: 7.0000
#> Rank 3: 7.0000
#>
#> Assignments:
#> # A tibble: 9 × 6
#> rank solution_id source target cost total_cost
#> <int> <int> <int> <int> <dbl> <dbl>
#> 1 1 1 1 1 1 5
#> 2 1 1 2 2 3 5
#> 3 1 1 3 3 1 5
#> 4 2 2 1 2 2 7
#> 5 2 2 2 1 4 7
#> 6 2 2 3 3 1 7
#> 7 3 3 1 1 1 7
#> 8 3 3 2 3 2 7
#> 9 3 3 3 2 4
See Also
?match_couples - Optimal one-to-one matching
?full_match - Full matching (variable-ratio groups)
?greedy_couples - Fast approximate matching
?balance_diagnostics - Formal balance assessment
?lap_solve - Direct assignment problem solving