This document catalogs all evaluation risks that BORG detects, organized by severity and mechanism.
Risk Classification
BORG classifies risks into two categories based on their impact on evaluation validity:
| Category | Impact | BORG Response |
|---|---|---|
| Hard Violation | Results are invalid | Blocks evaluation, requires fix |
| Soft Inflation | Results are biased | Warns, allows with caution |
Hard Violations
These make your evaluation results invalid. Any metrics computed with these violations are unreliable.
1. Index Overlap
What: Same row indices appear in both training and test sets.
Why it matters: The model has seen the exact data it’s being tested on. This is the most basic form of leakage.
Detection: Set intersection of train_idx and test_idx.
data <- data.frame(x = 1:100, y = rnorm(100))
# Accidental overlap
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (hard violations detected)
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 60 rows
#> Test indices: 50 rows
#> Inspected at: 2026-03-04 12:49:54
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] index_overlap
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Source: train_idx/test_idx
#> Affected: 10 indices (first 5: 51, 52, 53, 54, 55)

Fix: Ensure indices are mutually exclusive. Use setdiff() to create non-overlapping sets.
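As a minimal base-R sketch, the overlapping split shown above can be repaired with setdiff():

```r
# Fix for the overlapping 1:60 / 51:100 split shown above:
# carve out the test set first, then remove it from the training pool.
test_idx  <- 51:100
train_idx <- setdiff(1:60, test_idx)  # drops the shared 51:60, leaving 1:50

length(intersect(train_idx, test_idx))  # 0 -- no overlap remains
```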
2. Duplicate Rows
What: Test set contains rows identical to training rows.
Why it matters: Model may have memorized these exact patterns. Even without index overlap, identical feature values constitute leakage.
Detection: Row hashing and comparison (C++ backend for numeric data).
# Data with duplicate rows
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)  # Duplicates
)
result <- borg_inspect(dup_data, train_idx = 1:5, test_idx = 6:10)
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (hard violations detected)
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 5 rows
#> Test indices: 5 rows
#> Inspected at: 2026-03-04 12:49:54
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] duplicate_rows
#> Test set contains 5 rows identical to training rows (memorization risk)
#> Source: data.frame
#> Affected: 6, 7, 8, 9, 10

Fix: Remove duplicate rows before splitting, or ensure splits respect duplicates (keep all copies in the same set).
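A base-R sketch of both options (nothing here is BORG-specific):

```r
dup_data <- rbind(
  data.frame(x = 1:5, y = 1:5),
  data.frame(x = 1:5, y = 1:5)
)

# Option 1: deduplicate before splitting.
deduped <- dup_data[!duplicated(dup_data), ]
nrow(deduped)  # 5

# Option 2: split on a row key so identical rows always land together.
row_key <- paste(dup_data$x, dup_data$y)
train_rows <- which(row_key %in% c("1 1", "2 2", "3 3"))  # rows 1:3 and 6:8
```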
3. Preprocessing Leakage
What: Normalization, imputation, or dimensionality reduction fitted on full data before splitting.
Why it matters: Test set statistics influenced the preprocessing parameters applied to training data. Information flows backwards from test to train.
Detection: Recompute statistics on train-only data and compare to stored parameters. Discrepancy indicates leakage.
Supported objects:
| Object Type | Parameters Checked |
|---|---|
| caret::preProcess | $mean, $std |
| recipes::recipe | Step parameters after prep() |
| prcomp | $center, $scale, rotation matrix |
| scale() attributes | center, scale |
# BAD: Scale fitted on all data
scaled_data <- scale(data) # Uses all rows!
train <- scaled_data[1:70, ]
test <- scaled_data[71:100, ]
# BORG detects this
borg_inspect(scaled_data, train_idx = 1:70, test_idx = 71:100)

Fix: Fit preprocessing on training data only, then apply the fitted parameters to the test set.
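A base-R sketch of the train-only pattern, using scale()'s stored scaled:center and scaled:scale attributes to carry the training parameters onto the test rows:

```r
data <- data.frame(x = rnorm(100), y = rnorm(100))
train_idx <- 1:70
test_idx  <- 71:100

# Fit the scaling parameters on the training rows only.
train_scaled <- scale(data[train_idx, ])
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the *training* parameters to the test rows.
test_scaled <- scale(data[test_idx, ], center = centers, scale = scales)
```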
4. Target Leakage (Direct)
What: Feature has absolute correlation > 0.99 with the target.
Why it matters: The feature is almost certainly derived from the outcome. Examples:
- days_since_diagnosis when predicting has_disease
- total_spent when predicting is_customer
- Aggregated future values leaked into current features
Detection: Compute Pearson correlation of each numeric feature with target on training data.
# Simulate target leakage
leaky <- data.frame(
  x = rnorm(100),
  outcome = rnorm(100)
)
leaky$leaked <- leaky$outcome + rnorm(100, sd = 0.01) # Near-perfect correlation
result <- borg_inspect(leaky, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: INVALID (hard violations detected)
#> Hard violations: 1
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-04 12:49:54
#>
#> --- HARD VIOLATIONS (must fix) ---
#>
#> [1] target_leakage_direct
#> Feature 'leaked' has correlation 1.000 with target 'outcome'. Likely derived from outcome.
#> Source: data.frame$leaked

Fix: Remove or investigate the leaky feature. If it’s a legitimate predictor, document why correlation > 0.99 is expected.
5. Group Leakage
What: Same group (patient, site, species) appears in both train and test.
Why it matters: Observations within a group tend to be similar. If the same patient appears in train and test, the model can exploit patient-specific patterns that won’t exist for new patients.
Detection: Set intersection of group membership values.
# Clinical data with patient IDs
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)
# Random split ignoring patients
set.seed(123)
all_idx <- sample(100)
train_idx <- all_idx[1:70]
test_idx <- all_idx[71:100]
result <- borg_inspect(clinical, train_idx = train_idx, test_idx = test_idx,
                       groups = "patient_id")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-04 12:49:54
#>
#> No risks detected.

Fix: Use group-aware splitting so every group's rows stay in a single set.
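A base-R sketch of a group-aware split for the clinical example above, sampling whole patients rather than rows (holding out 3 patients is an arbitrary choice for illustration):

```r
clinical <- data.frame(
  patient_id = rep(1:10, each = 10),
  measurement = rnorm(100)
)

# Hold out whole patients, not individual rows.
set.seed(123)
test_patients <- sample(unique(clinical$patient_id), 3)
test_idx  <- which(clinical$patient_id %in% test_patients)
train_idx <- which(!clinical$patient_id %in% test_patients)

# No patient appears on both sides of the split.
length(intersect(clinical$patient_id[train_idx],
                 clinical$patient_id[test_idx]))  # 0
```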
6. Temporal Ordering Violation
What: Test observations predate training observations.
Why it matters: Model uses future information to predict the past. In deployment, future data won’t be available.
Detection: Compare max training timestamp to min test timestamp.
# Time series data
ts_data <- data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)
# Wrong: random split ignores time
set.seed(42)
random_idx <- sample(100)
train_idx <- random_idx[1:70]
test_idx <- random_idx[71:100]
result <- borg_inspect(ts_data, train_idx = train_idx, test_idx = test_idx,
                       time = "date")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-04 12:49:54
#>
#> No risks detected.

Fix: Use chronological splits where all test data comes after training:
train_idx <- 1:70
test_idx <- 71:100

7. CV Fold Contamination
What: Cross-validation folds contain test indices, or folds overlap incorrectly.
Why it matters: Nested CV requires the outer test set to be completely held out from all inner training.
Detection: Check if any fold’s training indices intersect with held-out test set.
Supported objects:
- caret::trainControl - checks $index and $indexOut
- rsample::vfold_cv and other rset objects
- rsample::rsplit objects
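The detection idea can be sketched in base R. The folds list below is hypothetical, standing in for the per-fold training indices a caret or rsample object would carry:

```r
# Held-out outer test set and hypothetical inner-CV training folds.
test_idx <- 81:100
folds <- list(
  fold1 = 1:60,
  fold2 = c(21:80, 85)  # index 85 leaks in from the held-out test set
)

# Flag any fold whose training indices intersect the test set.
contaminated <- vapply(
  folds,
  function(f) length(intersect(f, test_idx)) > 0,
  logical(1)
)
names(folds)[contaminated]  # "fold2"
```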
8. Model Scope
What: Model was trained on more rows than claimed training set.
Why it matters: Model saw test data during training, even if indirectly (e.g., through hyperparameter tuning on full data).
Detection: Compare nrow(trainingData) or length(fitted.values) to length(train_idx).
Supported objects: lm, glm, ranger, caret::train, parsnip models, workflows.
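A sketch of the row-count check against a plain lm() fit:

```r
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
train_idx <- 1:70

fit_all   <- lm(y ~ x, data = d)               # accidentally fit on all rows
fit_train <- lm(y ~ x, data = d[train_idx, ])  # fit on the claimed training set

# A fitted-value count larger than the training set signals a scope violation.
length(fitted(fit_all)) == length(train_idx)    # FALSE: scope violation
length(fitted(fit_train)) == length(train_idx)  # TRUE
```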
Soft Inflation Risks
These bias results but may not completely invalidate them. Model ranking might be preserved even if absolute metrics are optimistic.
1. Target Leakage (Proxy)
What: Feature has absolute correlation between 0.95 and 0.99 with the target.
Why a warning, not an error: It may be a legitimate strong predictor. Requires domain knowledge to judge.
Detection: Same as direct leakage, different threshold.
# Strong but not extreme correlation
proxy <- data.frame(
  x = rnorm(100),
  outcome = rnorm(100)
)
proxy$strong_predictor <- proxy$outcome + rnorm(100, sd = 0.3) # r ~ 0.96
result <- borg_inspect(proxy, train_idx = 1:70, test_idx = 71:100, target = "outcome")
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 1
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-04 12:49:54
#>
#> --- SOFT INFLATIONS (warnings) ---
#>
#> [1] target_leakage_proxy
#> Feature 'strong_predictor' has correlation 0.959 with target 'outcome'. May be a proxy for outcome.
#> Source: data.frame$strong_predictor

Action: Review whether the feature should be available at prediction time in production.
2. Spatial Proximity
What: Test points are very close to training points in geographic space.
Why it matters: Spatial autocorrelation means nearby points share variance. Model learns local patterns that don’t generalize to distant locations.
Detection: Compute minimum distance from each test point to nearest training point. Flag if < 1% of spatial spread.
set.seed(42)
spatial <- data.frame(
  lon = runif(100, 0, 100),
  lat = runif(100, 0, 100),
  value = rnorm(100)
)
# Random split intermixes nearby points
train_idx <- sample(100, 70)
test_idx <- setdiff(1:100, train_idx)
result <- borg_inspect(spatial, train_idx = train_idx, test_idx = test_idx,
                       coords = c("lon", "lat"))
result
#> BorgRisk Assessment
#> ===================
#>
#> Status: VALID (no hard violations)
#> Hard violations: 0
#> Soft inflations: 0
#> Train indices: 70 rows
#> Test indices: 30 rows
#> Inspected at: 2026-03-04 12:49:54
#>
#> No risks detected.

Fix: Use spatial blocking so test locations are held out as contiguous regions rather than intermixed points.
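A base-R sketch of spatial blocking via a coarse grid; the 2 x 2 grid is an arbitrary illustration (dedicated tools such as the blockCV package choose block sizes more carefully):

```r
set.seed(42)
spatial <- data.frame(
  lon = runif(100, 0, 100),
  lat = runif(100, 0, 100),
  value = rnorm(100)
)

# Assign each point to a coarse grid cell, then hold out whole cells.
block <- paste(cut(spatial$lon, breaks = 2), cut(spatial$lat, breaks = 2))
holdout <- unique(block)[1]
test_idx  <- which(block == holdout)
train_idx <- which(block != holdout)
```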
3. Spatial Overlap
What: Test region falls inside training region’s convex hull.
Why it matters: Interpolation is easier than extrapolation. Model performance on “surrounded” test points overestimates performance on truly new regions.
Detection: Compute convex hull of training points, count test points inside.
Threshold: Warning if > 50% of test points fall inside training hull.
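The hull test can be sketched in base R with chull() plus a sign-consistency check; the in_convex_hull() helper is illustrative, not part of BORG:

```r
# Assumes hull vertices are given in order, as grDevices::chull() returns them.
in_convex_hull <- function(px, py, hx, hy) {
  n <- length(hx)
  sides <- vapply(seq_len(n), function(i) {
    j <- if (i == n) 1L else i + 1L
    # Cross product of hull edge (i -> j) with the vector to the point
    (hx[j] - hx[i]) * (py - hy[i]) - (hy[j] - hy[i]) * (px - hx[i])
  }, numeric(1))
  all(sides <= 0) || all(sides >= 0)  # point lies on one side of every edge
}

set.seed(1)
train <- data.frame(lon = runif(50, 0, 100), lat = runif(50, 0, 100))
test  <- data.frame(lon = runif(20, 25, 75), lat = runif(20, 25, 75))

hull <- chull(train$lon, train$lat)  # indices of hull vertices, in order
inside <- mapply(in_convex_hull, test$lon, test$lat,
                 MoreArgs = list(hx = train$lon[hull], hy = train$lat[hull]))
mean(inside)  # fraction of test points inside the training hull
```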
4. Random CV on Dependent Data
What: Using random k-fold CV when data has spatial, temporal, or group structure.
Why it matters: Random folds break dependencies artificially, leading to optimistic error estimates.
# Diagnose data dependencies
spatial <- data.frame(
  lon = runif(200, 0, 100),
  lat = runif(200, 0, 100),
  response = rnorm(200)
)
diagnosis <- borg_diagnose(spatial, coords = c("lon", "lat"), target = "response",
                           verbose = FALSE)
diagnosis@recommended_cv
#> [1] "random"

Fix: Use borg() to generate appropriate blocked CV folds.
Quick Reference
| Risk Type | Severity | Detection Method | Fix |
|---|---|---|---|
| index_overlap | Hard | Index intersection | Use setdiff() |
| duplicate_rows | Hard | Row hashing | Deduplicate or group |
| preprocessing_leak | Hard | Parameter comparison | Fit on train only |
| target_leakage | Hard | Correlation > 0.99 | Remove feature |
| group_leakage | Hard | Group intersection | Group-aware split |
| temporal_leak | Hard | Timestamp comparison | Chronological split |
| cv_contamination | Hard | Fold index check | Rebuild folds |
| model_scope | Hard | Row count | Refit on train only |
| proxy_leakage | Soft | Correlation 0.95-0.99 | Domain review |
| spatial_proximity | Soft | Distance check | Spatial blocking |
| spatial_overlap | Soft | Convex hull | Geographic split |
Accessing Risk Details
# Create result with violations
result <- borg_inspect(
  data.frame(x = 1:100, y = rnorm(100)),
  train_idx = 1:60,
  test_idx = 51:100
)
# Summary
cat("Valid:", result@is_valid, "\n")
#> Valid: FALSE
cat("Hard violations:", result@n_hard, "\n")
#> Hard violations: 1
cat("Soft warnings:", result@n_soft, "\n")
#> Soft warnings: 0
# Individual risks
for (risk in result@risks) {
  cat("\n", risk$type, "(", risk$severity, "):\n", sep = "")
  cat(" ", risk$description, "\n")
  if (!is.null(risk$affected)) {
    cat(" Affected:", head(risk$affected, 5), "...\n")
  }
}
#>
#> index_overlap(hard_violation):
#> Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> Affected: 51 52 53 54 55 ...
# Tabular format
as.data.frame(result)
#> type severity
#> 1 index_overlap hard_violation
#> description
#> 1 Train and test indices overlap (10 shared indices). This invalidates evaluation.
#> source_object n_affected
#> 1 train_idx/test_idx 10

See Also
- vignette("quickstart") - Basic usage
- vignette("frameworks") - Framework integration