Automatically detects spatial autocorrelation, temporal autocorrelation, and clustered structure in data. Returns a diagnosis object that specifies appropriate cross-validation strategies.
Usage
borg_diagnose(
data,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
alpha = 0.05,
verbose = FALSE
)Arguments
- data
A data frame to diagnose.
- coords
Character vector of length 2 specifying coordinate column names (e.g.,
c("lon", "lat")orc("x", "y")). If NULL, spatial autocorrelation is not tested.- time
Character string specifying the time column name. Can be Date, POSIXct, or numeric. If NULL, temporal autocorrelation is not tested.
- groups
Character string specifying the grouping column name (e.g., "site_id", "patient_id"). If NULL, clustered structure is not tested.
- target
Character string specifying the response variable column name. Used for more accurate autocorrelation diagnostics on residuals. Optional.
- alpha
Numeric. Significance level for autocorrelation tests. Default: 0.05.
- verbose
Logical. If TRUE, print diagnostic progress. Default: FALSE.
Value
A BorgDiagnosis object containing:
Detected dependency type(s)
Severity assessment
Recommended CV strategy
Detailed diagnostics for each dependency type
Estimated metric inflation from using random CV
Details
Spatial Autocorrelation
Detected using Moran's I test on the target variable (or first numeric column). The autocorrelation range is estimated from the empirical variogram. Effective sample size is computed as \(n_{eff} = n / DEFF\) where DEFF is the design effect.
Examples
# Spatial data example
set.seed(42)
spatial_data <- data.frame(
x = runif(100, 0, 100),
y = runif(100, 0, 100),
response = rnorm(100)
)
# Add spatial autocorrelation (nearby points are similar)
for (i in 2:100) {
nearest <- which.min((spatial_data$x[1:(i-1)] - spatial_data$x[i])^2 +
(spatial_data$y[1:(i-1)] - spatial_data$y[i])^2)
spatial_data$response[i] <- 0.7 * spatial_data$response[nearest] +
0.3 * rnorm(1)
}
diagnosis <- borg_diagnose(spatial_data, coords = c("x", "y"),
target = "response")
print(diagnosis)
# Clustered data example
clustered_data <- data.frame(
site = rep(1:10, each = 20),
value = rep(rnorm(10, sd = 2), each = 20) + rnorm(200, sd = 0.5)
)
diagnosis <- borg_diagnose(clustered_data, groups = "site", target = "value")
print(diagnosis)