Association-Based Predictor Pruning

corrPrune() performs model-free variable subset selection: it removes predictors until every remaining pairwise association falls below a specified threshold, then returns a single pruned data frame containing only the retained predictors.

Usage

corrPrune(
  data,
  threshold = 0.7,
  measure = "auto",
  mode = "auto",
  force_in = NULL,
  by = NULL,
  group_q = 1,
  max_exact_p = 100,
  ...
)

Arguments

data

A data.frame containing candidate predictors.

threshold

Numeric scalar. Maximum allowed pairwise association (default: 0.7). Must be non-negative.

measure

Character string specifying the association measure to use. Options include "auto" (default), "pearson", "spearman", "kendall", "cramersv", "eta", etc. With "auto", Pearson correlation is used when all predictors are numeric; for mixed-type data, a measure appropriate to the variable types is selected.

mode

Character string specifying the search algorithm. Options:

"auto" (default): uses exact search if the number of predictors is <= max_exact_p, and greedy search otherwise
"exact": exhaustive search for maximal subsets (may be slow for large p)
"greedy": fast approximate search using iterative removal

force_in

Character vector of variable names that must be retained in the final subset. Default: NULL.

by

Character vector naming one or more grouping variables. If provided, associations are computed separately within each group, then aggregated using the quantile specified by group_q. Default: NULL (no grouping).

group_q

Numeric scalar in (0, 1]. Quantile used to aggregate associations across groups when by is provided. Default: 1 (the maximum, which ensures the threshold holds in every group). For example, 0.9 aggregates with the 90th percentile.

max_exact_p

Integer. Maximum number of predictors for which exact mode is used when mode = "auto". Default: 100.

...

Additional arguments (reserved for future use).

Value

A data.frame containing the pruned subset of predictors. The result has the following attributes:

selected_vars: Character vector of retained variable names
removed_vars: Character vector of removed variable names
mode: Character string indicating which mode was used ("exact" or "greedy")
measure: Character string indicating which association measure was used
threshold: The threshold value used

Details

corrPrune() identifies a subset of predictors whose pairwise associations are all below threshold. The function works in several stages:

Variable type detection: Identifies numeric vs. categorical predictors
Association measurement: Computes appropriate pairwise associations
Grouping (optional): If by is specified, computes associations within each group and aggregates using the specified quantile
Feasibility check: Verifies that the force_in variables satisfy the threshold constraint among themselves
Subset selection: Uses either exact or greedy search to find a valid subset
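
The feasibility check can be pictured with a few lines of base R. This is a minimal sketch, not the package's code: it assumes absolute Pearson correlation as the measure and treats a value exactly equal to threshold as acceptable, and check_force_in is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper (not part of the package): do all pairs among the
# forced variables respect the threshold?
check_force_in <- function(data, force_in, threshold = 0.7) {
  if (length(force_in) < 2) return(TRUE)             # a single variable cannot conflict
  assoc <- abs(cor(data[, force_in, drop = FALSE]))  # assumed measure: |Pearson r|
  all(assoc[upper.tri(assoc)] <= threshold)          # assumed boundary rule: <= is allowed
}

ok_pair  <- check_force_in(mtcars, c("mpg", "qsec"))  # weakly correlated pair
bad_pair <- check_force_in(mtcars, c("cyl", "disp"))  # strongly correlated pair
```

If a check like this fails, no valid subset can contain all forced variables, so it makes sense to detect the problem before any search starts.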

Grouped Pruning: When by is provided, the function ensures the selected predictors satisfy the threshold constraint across groups. With group_q = 1 (default), every pair of returned predictors has an association below threshold in all groups. With group_q = 0.9, the 90th percentile of each pair's per-group associations must be below threshold, so each pair satisfies the constraint in roughly 90% of groups.
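
The aggregation step can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's implementation: it uses absolute Pearson correlation, R's default quantile type, and a hypothetical helper named grouped_assoc.

```r
# Hypothetical helper (not part of the package): aggregate per-group
# |Pearson r| matrices with a quantile across groups.
grouped_assoc <- function(data, by, group_q = 1) {
  vars <- setdiff(names(data), by)
  p <- length(vars)
  groups <- split(data[, vars, drop = FALSE], data[by], drop = TRUE)
  # One absolute correlation matrix per group, stacked into a p x p x G array
  mats <- vapply(groups, function(g) abs(cor(g)), matrix(0, p, p))
  # For every pair, take the group_q quantile across groups (group_q = 1 is the maximum)
  agg <- apply(mats, c(1, 2), quantile, probs = group_q, na.rm = TRUE, names = FALSE)
  dimnames(agg) <- list(vars, vars)
  agg
}

agg_max <- grouped_assoc(iris, by = "Species", group_q = 1)    # worst case over species
agg_med <- grouped_assoc(iris, by = "Species", group_q = 0.5)  # median over species
round(agg_max, 2)
```

Pruning against agg_max is the strictest choice, because a pair is accepted only if it stays below the threshold in every group. Lower values of group_q tolerate a few groups in which a pair is strongly associated.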

Mode Selection: Exact mode guarantees finding all maximal subsets and returns the largest one. Greedy mode is faster but approximate, using an iterative removal strategy based on association scores.

Tie-Breaking: When multiple subsets or variables are equally good, deterministic tie-breaking is applied:

Exact mode: Selects by (1) largest subset size, (2) lowest average association, (3) alphabetically first variable names. Column order does not affect the result.
Greedy mode: Removes the variable with (1) most constraint violations, (2) highest max association, (3) highest average association, (4) lowest column index. Column order can influence the result when earlier criteria are tied.
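
The greedy removal rule above can be sketched in base R. This is a simplified illustration, not the package's code: greedy_prune is a hypothetical name, the input is assumed to be a symmetric matrix of non-negative associations with column names, and values equal to threshold are assumed to be acceptable.

```r
# Hypothetical sketch of greedy pruning on an association matrix.
greedy_prune <- function(assoc, threshold = 0.7, force_in = character()) {
  keep <- colnames(assoc)
  repeat {
    a <- assoc[keep, keep, drop = FALSE]
    diag(a) <- 0                                   # ignore self-association
    if (all(a <= threshold)) break                 # constraint satisfied
    cand <- setdiff(keep, force_in)                # forced variables are never removed
    if (length(cand) == 0) stop("force_in variables violate the threshold")
    rows <- a[cand, , drop = FALSE]
    stats <- data.frame(
      violations = rowSums(rows > threshold),      # (1) number of violated pairs
      max_assoc  = apply(rows, 1, max),            # (2) strongest association
      mean_assoc = rowMeans(rows),                 # (3) average association
      col_index  = match(cand, colnames(assoc))    # (4) original column position
    )
    # First three criteria descending, then lowest column index
    worst <- order(-stats$violations, -stats$max_assoc,
                   -stats$mean_assoc, stats$col_index)[1]
    keep <- setdiff(keep, cand[worst])
  }
  keep
}

assoc <- abs(cor(mtcars))
kept <- greedy_prune(assoc, threshold = 0.7, force_in = "mpg")
kept
```

Because the last criterion is the column index, reordering the columns of the input can change which variable is dropped when the first three criteria are tied.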

To see all maximal subsets instead of a single selection, use corrSelect().
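
What a maximal subset means can be illustrated by brute force on a small problem. The sketch below is not how corrSelect() or exact mode is implemented. It enumerates every subset, which is practical only for a handful of variables, and it assumes absolute Pearson correlation and a hypothetical helper named maximal_subsets.

```r
# Hypothetical brute-force enumeration of maximal valid subsets (small p only).
maximal_subsets <- function(assoc, threshold = 0.7) {
  vars <- colnames(assoc)
  is_valid <- function(s) {
    a <- assoc[s, s, drop = FALSE]
    all(a[upper.tri(a)] <= threshold)   # every pair within the subset is acceptable
  }
  # All non-empty subsets, grouped by size
  subsets <- unlist(
    lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
    recursive = FALSE
  )
  valid <- Filter(is_valid, subsets)
  # A valid subset is maximal if no larger valid subset contains it
  is_maximal <- vapply(valid, function(s) {
    !any(vapply(valid, function(t) length(t) > length(s) && all(s %in% t), logical(1)))
  }, logical(1))
  valid[is_maximal]
}

small <- abs(cor(mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "qsec")]))
subs <- maximal_subsets(small, threshold = 0.7)
lengths(subs)   # sizes of the maximal subsets
```

A maximal subset is one that cannot be extended by any further variable without breaking the threshold. Several such subsets usually exist, which is why a single selection needs the tie-breaking rules described above.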

Examples

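# Basic numeric data pruning
data(mtcars)
pruned <- corrPrune(mtcars, threshold = 0.7)
names(pruned)

# Inspect which variables were removed and which mode was used
attr(pruned, "removed_vars")
attr(pruned, "mode")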

# Force certain variables to be included
pruned <- corrPrune(mtcars, threshold = 0.7, force_in = "mpg")

# Use greedy mode for faster computation
pruned <- corrPrune(mtcars, threshold = 0.7, mode = "greedy")
