Compute and Cache Distance Matrix for Reuse
Source:R/matching_distance_cache.R
compute_distances.RdPrecomputes a distance matrix between left and right datasets, allowing it to be reused across multiple matching operations with different constraints. This is particularly useful when exploring different matching parameters (max_distance, calipers, methods) without recomputing distances.
Usage
compute_distances(
left,
right,
vars,
distance = "euclidean",
weights = NULL,
scale = FALSE,
auto_scale = FALSE,
left_id = "id",
right_id = "id",
block_id = NULL
)Arguments
- left
Left dataset (data frame)
- right
Right dataset (data frame)
- vars
Character vector of variable names to use for distance computation
- distance
Distance metric (default: "euclidean")
- weights
Optional numeric vector of variable weights
- scale
Scaling method: FALSE, "standardize", "range", or "robust"
- auto_scale
Apply automatic preprocessing (default: FALSE)
- left_id
Name of ID column in left (default: "id")
- right_id
Name of ID column in right (default: "id")
- block_id
Optional block ID column name for blocked matching
Value
An S3 object of class "distance_object" containing:
cost_matrix: Numeric matrix of distancesleft_ids: Character vector of left IDsright_ids: Character vector of right IDsblock_id: Block ID column name (if specified)metadata: List with computation details (vars, distance, scale, etc.)original_left: Original left dataset (for later joining)original_right: Original right dataset (for later joining)
Details
This function computes distances once and stores them in a reusable object.
The resulting distance_object can be passed to match_couples() or
greedy_couples() instead of providing datasets and variables.
Benefits:
Performance: Avoid recomputing distances when trying different constraints
Exploration: Quickly test max_distance, calipers, or methods
Consistency: Ensures same distances used across comparisons
Memory efficient: Can use sparse matrices when many pairs are forbidden
The distance_object stores the original datasets, allowing downstream
functions like join_matched() to work seamlessly.
Examples
# Compute distances once
left <- data.frame(id = 1:5, age = c(25, 30, 35, 40, 45), income = c(45, 52, 48, 61, 55) * 1000)
right <- data.frame(id = 6:10, age = c(24, 29, 36, 41, 44), income = c(46, 51, 47, 60, 54) * 1000)
dist_obj <- compute_distances(
left, right,
vars = c("age", "income"),
scale = "standardize"
)
# Reuse for different matching strategies
result1 <- match_couples(dist_obj, max_distance = 0.5)
result2 <- match_couples(dist_obj, max_distance = 1.0)
result3 <- greedy_couples(dist_obj, strategy = "sorted")
# All use the same precomputed distances