Creates cross-validation folds that respect data dependency structure. When spatial, temporal, or clustered dependencies are detected, random CV is disabled and appropriate blocking strategies are enforced.
Usage
borg_cv(
data,
diagnosis = NULL,
v = 5,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
block_size = NULL,
embargo = NULL,
output = c("list", "rsample", "caret", "mlr3"),
allow_random = FALSE,
verbose = FALSE
)Arguments
- data
A data frame to create CV folds for.
- diagnosis
A
BorgDiagnosisobject fromborg_diagnose. If NULL, diagnosis is performed automatically.- v
Integer. Number of folds. Default: 5.
- coords
Character vector of length 2 specifying coordinate column names. Required for spatial blocking if diagnosis is NULL.
- time
Character string specifying the time column name. Required for temporal blocking if diagnosis is NULL.
- groups
Character string specifying the grouping column name. Required for group CV if diagnosis is NULL.
- target
Character string specifying the response variable column name.
- block_size
Numeric. For spatial blocking, the minimum block size. If NULL, automatically determined from diagnosis. Should be larger than the autocorrelation range.
- embargo
Integer. For temporal blocking, minimum gap between train and test. If NULL, automatically determined from diagnosis.
- output
Character. Output format: "list" (default), "rsample", "caret", "mlr3".
- allow_random
Logical. If TRUE, allows random CV even when dependencies detected. Default: FALSE. Setting to TRUE requires explicit acknowledgment.
- verbose
Logical. If TRUE, print diagnostic messages. Default: FALSE.
Value
Depending on output:
- "list"
A list with elements:
folds(list of train/test index vectors),diagnosis(the BorgDiagnosis used),strategy(CV strategy name),params(parameters used).- "rsample"
An
rsamplersetobject compatible with tidymodels.- "caret"
A
trainControlobject for caret.- "mlr3"
An
mlr3Resamplingobject.
Details
The Enforcement Principle
Unlike traditional CV helpers, borg_cv enforces valid evaluation:
If spatial autocorrelation is detected, random CV is disabled
If temporal autocorrelation is detected, random CV is disabled
If clustered structure is detected, random CV is disabled
To use random CV on dependent data, you must set
allow_random = TRUEand provide justification (this is logged).
Spatial Blocking
When spatial dependencies are detected, data are partitioned into spatial blocks using k-means clustering on coordinates. Block size is set to exceed the estimated autocorrelation range. This ensures train and test sets are spatially separated.
Examples
# Spatial data with autocorrelation
set.seed(42)
spatial_data <- data.frame(
x = runif(200, 0, 100),
y = runif(200, 0, 100),
response = rnorm(200)
)
# Diagnose and create CV
cv <- borg_cv(spatial_data, coords = c("x", "y"), target = "response")
str(cv$folds) # List of train/test indices
# Clustered data
clustered_data <- data.frame(
site = rep(1:20, each = 10),
value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)
cv <- borg_cv(clustered_data, groups = "site", target = "value")
cv$strategy # "group_fold"
# Get rsample-compatible output for tidymodels
if (FALSE) { # \dontrun{
cv_rsample <- borg_cv(spatial_data, coords = c("x", "y"), output = "rsample")
} # }