corrPrune() performs model-free variable subset selection by iteratively
removing predictors until all pairwise associations fall below a specified
threshold. It returns a single pruned data frame with predictors that satisfy
the association constraint.
Usage
corrPrune(
data,
threshold = 0.7,
measure = "auto",
mode = "auto",
force_in = NULL,
by = NULL,
group_q = 1,
max_exact_p = 100,
...
)
Arguments
- data
A data.frame containing candidate predictors.
- threshold
Numeric scalar. Maximum allowed pairwise association (default: 0.7). Must be non-negative.
- measure
Character string specifying the association measure to use. Options:
"auto" (default), "pearson", "spearman", "kendall", "cramersv", "eta", etc. When "auto", Pearson correlation is used for all-numeric data, and appropriate measures are selected for mixed-type data.
- mode
Character string specifying the search algorithm. Options:
"auto" (default): uses exact search if the number of predictors is <= max_exact_p, otherwise uses greedy search
"exact": exhaustive search for maximal subsets (may be slow for large p)
"greedy": fast approximate search using iterative removal
- force_in
Character vector of variable names that must be retained in the final subset. Default: NULL.
- by
Character vector naming one or more grouping variables. If provided, associations are computed separately within each group, then aggregated using the quantile specified by group_q. Default: NULL (no grouping).
- group_q
Numeric scalar in (0, 1]. Quantile used to aggregate associations across groups when by is provided. Default: 1 (maximum, ensuring the threshold holds in all groups). Use 0.9 for the 90th percentile, etc.
- max_exact_p
Integer. Maximum number of predictors for which exact mode is used when mode = "auto". Default: 100.
- ...
Additional arguments (reserved for future use).
Value
A data.frame containing the pruned subset of predictors. The result has the following attributes:
- selected_vars
Character vector of retained variable names
- removed_vars
Character vector of removed variable names
- mode
Character string indicating which mode was used ("exact" or "greedy")
- measure
Character string indicating which association measure was used
- threshold
The threshold value used
Details
corrPrune() identifies a subset of predictors whose pairwise associations
are all below threshold. The function works in several stages:
Variable type detection: Identifies numeric vs. categorical predictors
Association measurement: Computes appropriate pairwise associations
Grouping (optional): If by is specified, computes associations within each group and aggregates using the specified quantile
Feasibility check: Verifies that force_in variables satisfy the threshold constraint
Subset selection: Uses either exact or greedy search to find a valid subset
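The feasibility check can be pictured with a short base R sketch. It is an illustration under assumptions, not the package's internal code: the helper name force_in_feasible is hypothetical, associations are taken to be absolute Pearson correlations, and a pair is treated as violating the constraint when its association exceeds threshold.

```r
# Hypothetical helper (not part of the package): TRUE when every pair of
# forced variables has an association no larger than the threshold.
force_in_feasible <- function(assoc, force_in, threshold) {
  sub <- assoc[force_in, force_in, drop = FALSE]
  all(sub[upper.tri(sub)] <= threshold)
}

data(mtcars)
assoc <- abs(cor(mtcars))  # absolute Pearson correlations

ok_pair  <- force_in_feasible(assoc, c("mpg", "qsec"), threshold = 0.7)
bad_pair <- force_in_feasible(assoc, c("mpg", "wt"), threshold = 0.7)
```

In mtcars, mpg and wt are strongly correlated, so forcing both in at threshold = 0.7 cannot satisfy the constraint, whereas mpg and qsec can coexist.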
Grouped Pruning: When by is provided, the function ensures the selected
predictors satisfy the threshold constraint across groups. For example, with
group_q = 1 (default), the returned predictors will have pairwise associations
below threshold in all groups. With group_q = 0.9, the 90th percentile of each
pairwise association across groups must fall below threshold, so the constraint
holds in roughly 90% of groups.
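The quantile aggregation described above can be sketched in base R. This is only an illustration of the idea, assuming absolute Pearson correlations and R's default quantile() definition; the helper aggregate_assoc is hypothetical and the package's own computation may differ.

```r
data(mtcars)
vars <- c("mpg", "disp", "hp", "wt")

# One absolute correlation matrix per level of the grouping variable `cyl`
groups <- split(mtcars[vars], mtcars$cyl)
group_mats <- lapply(groups, function(d) abs(cor(d)))

# Stack the per-group matrices into a p x p x G array
p <- length(vars)
arr <- array(unlist(group_mats),
             dim = c(p, p, length(group_mats)),
             dimnames = list(vars, vars, names(group_mats)))

# Hypothetical helper: aggregate each matrix cell across groups by quantile q
aggregate_assoc <- function(arr, q) {
  apply(arr, c(1, 2), quantile, probs = q, names = FALSE)
}

agg_max <- aggregate_assoc(arr, 1)    # q = 1: worst case over all groups
agg_q90 <- aggregate_assoc(arr, 0.9)  # q = 0.9: 90th percentile over groups
```

With q = 1 each cell is the largest association seen in any group, which is why the default is the strictest setting; lower quantiles can only loosen the aggregated values.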
Mode Selection: Exact mode guarantees finding all maximal subsets and returns the largest one. Greedy mode is faster but approximate, using an iterative removal strategy based on association scores.
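To make the idea of an exhaustive search concrete, the following base R sketch finds one largest valid subset by brute force. It is not the algorithm used by the package: the function name largest_valid_subset is hypothetical, it simply tries subset sizes from largest to smallest, it returns the first valid subset in enumeration order rather than applying the documented tie-breaking, and it is only practical for small numbers of predictors.

```r
# Hypothetical brute-force helper: returns the names of one largest subset
# whose pairwise associations are all <= threshold.
largest_valid_subset <- function(assoc, threshold) {
  p <- ncol(assoc)
  for (k in p:1) {
    # Enumerate every subset of size k
    for (idx in combn(p, k, simplify = FALSE)) {
      a <- assoc[idx, idx, drop = FALSE]
      if (all(a[upper.tri(a)] <= threshold)) {
        return(colnames(assoc)[idx])
      }
    }
  }
}

data(mtcars)
assoc <- abs(cor(mtcars))
exact_sel <- largest_valid_subset(assoc, threshold = 0.7)
```

The cost grows exponentially with the number of predictors, which illustrates why an exhaustive strategy may be slow for large p.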
Tie-Breaking: When multiple subsets or variables are equally good, deterministic tie-breaking is applied:
Exact mode: Selects by (1) largest subset size, (2) lowest average correlation, (3) alphabetically first variable names. Column order does not affect the result.
Greedy mode: Removes the variable with (1) most constraint violations, (2) highest max association, (3) highest average association, (4) lowest column index. Column order can influence the result when earlier criteria are tied.
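The greedy removal order listed above can be sketched in base R as follows. This is a minimal illustration under assumptions, not the package's implementation: the function name greedy_prune_sketch is hypothetical, a violation is taken to mean an association greater than threshold, and ties are compared with exact numeric equality.

```r
# Hypothetical helper: repeatedly drop the "worst" variable until no pair
# exceeds the threshold, using the four tie-breaking keys in order.
greedy_prune_sketch <- function(assoc, threshold) {
  keep <- seq_len(ncol(assoc))
  repeat {
    a <- assoc[keep, keep, drop = FALSE]
    diag(a) <- NA  # ignore self-association
    violations <- rowSums(a > threshold, na.rm = TRUE)
    if (all(violations == 0)) break
    max_a  <- apply(a, 1, max, na.rm = TRUE)
    mean_a <- rowMeans(a, na.rm = TRUE)
    # Most violations, then highest max, then highest mean,
    # then lowest column position
    worst <- order(-violations, -max_a, -mean_a, seq_along(keep))[1]
    keep <- keep[-worst]
  }
  colnames(assoc)[keep]
}

data(mtcars)
greedy_sel <- greedy_prune_sketch(abs(cor(mtcars)), threshold = 0.7)
```

Because the last key is the column position, reordering the columns of the input can change which variable is dropped when the first three keys are tied.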
To see all maximal subsets instead of a single selection, use
corrSelect().
See also
corrSelect for exhaustive subset enumeration,
assocSelect for mixed-type data subset enumeration,
modelPrune for model-based predictor pruning.
Examples
# Basic numeric data pruning
data(mtcars)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)
# Force certain variables to be included
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")
# Use greedy mode for faster computation
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")