Select Variable Subsets with Low Association (Mixed-Type Data Frame Interface)
Source:R/assocSelect.R
assocSelect.Rd
Identifies combinations of variables of any common data type (numeric,
ordered factors, or unordered) factors—whose pair-wise association does not
exceed a user-supplied threshold.
The routine wraps MatSelect()
and handles all pre-processing
(type conversion, missing-row removal, constant-column checks) for typical
data-frame/tibble/data-table inputs.
Arguments
- df
A data frame (or tibble / data.table). May contain any mix of:
numeric / integer (treated as numeric)
ordered factors
unordered factors (character vectors are coerced to factors)
- threshold
Numeric in \((0,1)\). Maximum allowed pair-wise absolute association. Default
0.7
.- method
Character; the subset-search algorithm. One of
"els"
or"bron-kerbosch"
. IfNULL
(default) the function selects automatically: ELS whenforce_in
is supplied, otherwise Bron–Kerbosch.- force_in
Optional character vector or column indices specifying variables that must appear in every returned subset.
- method_num_num
Association measure for numeric–numeric pairs. One of
"pearson"
(default),"spearman"
,"kendall"
,"bicor"
,"distance"
, or"maximal"
.- method_num_ord
Association measure for numeric–ordered pairs. One of
"spearman"
(default) or"kendall"
.- method_ord_ord
Association measure for ordered–ordered pairs. One of
"spearman"
(default) or"kendall"
.- ...
Additional arguments passed unchanged to
MatSelect()
(e.g.,use_pivot = TRUE
for Bron–Kerbosch).
Value
A CorrCombo
S4 object containing:
all valid subsets,
their summary association statistics,
metadata (algorithm used, rows kept, forced-in variables, etc.).
The object’s show()
method prints the association metrics that were
actually used for this data set.
Details
A single call can therefore screen a data set that mixes continuous and categorical features and return every subset whose internal associations are “sufficiently low” under the metric(s) you choose.
Rows containing NA
are dropped with a warning; constant columns are
treated as having zero association with every other variable.
The default association measure for each variable-type combination is:
- numeric – numeric
method_num_num
(default"pearson"
)- numeric – ordered
method_num_ord
- numeric – unordered
"eta"
(ANOVA \(\eta^{2}\))- ordered – ordered
method_ord_ord
- ordered – unordered
"cramersv"
- unordered – unordered
"cramersv"
All association measures are rescaled to \([0,1]\) before thresholding.
External packages are required for
"bicor"
(WGCNA),
"distance"
(energy),
and "maximal"
(minerva); an informative error is thrown if they
are missing.
Examples
df <- data.frame(
height = rnorm(15, 170, 10),
weight = rnorm(15, 70, 12),
group = factor(rep(LETTERS[1:3], each = 5)),
score = ordered(sample(c("low","med","high"), 15, TRUE))
)
## keep every subset whose internal associations ≤ 0.6
assocSelect(df, threshold = 0.6)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: mixed
#> AssocMethod: numeric_numeric = pearson, numeric_factor = eta, numeric_ordered
#> = spearman, factor_ordered = cramersv
#> Threshold: 0.600
#> Subsets: 1 valid combinations
#> Data Rows: 15 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] height, weight, group, score 0.174 0.474 4
## use Kendall for all rank-based comparisons and force 'height' to appear
assocSelect(df,
threshold = 0.5,
method_num_num = "kendall",
method_num_ord = "kendall",
method_ord_ord = "kendall",
force_in = "height")
#> CorrCombo object
#> -----------------
#> Method: els
#> Correlation: mixed
#> AssocMethod: numeric_numeric = kendall, numeric_factor = eta, numeric_ordered
#> = kendall, factor_ordered = cramersv
#> Threshold: 0.500
#> Subsets: 1 valid combinations
#> Data Rows: 15 used in correlation
#> Forced-in: height
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] height, weight, group, score 0.173 0.474 4