Title: | Stability-enHanced Approaches using Resampling Procedures |
---|---|
Description: | In stability selection (N Meinshausen, P Bühlmann (2010) <doi:10.1111/j.1467-9868.2010.00740.x>) and consensus clustering (S Monti et al (2003) <doi:10.1023/A:1023949509487>), resampling techniques are used to enhance the reliability of the results. In this package, hyper-parameters are calibrated by maximising model stability, which is measured under the null hypothesis that all selection (or co-membership) probabilities are identical (B Bodinier et al (2023a) <doi:10.1093/jrsssc/qlad058> and B Bodinier et al (2023b) <doi:10.1093/bioinformatics/btad635>). Functions are readily implemented for the use of LASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO in stability selection, and hierarchical clustering, partitioning around medoids, K means or Gaussian mixture models in consensus clustering. |
Authors: | Barbara Bodinier [aut, cre] |
Maintainer: | Barbara Bodinier <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.4.6 |
Built: | 2025-02-27 03:48:12 UTC |
Source: | https://github.com/barbarabodinier/sharp |
In stability selection and consensus clustering, resampling techniques are used to enhance the reliability of the results. In this package, hyper-parameters are calibrated by maximising model stability, which is measured under the null hypothesis that all selection (or co-membership) probabilities are identical. Functions are readily implemented for the use of LASSO regression, sparse PCA, sparse (group) PLS or graphical LASSO in stability selection, and hierarchical clustering, partitioning around medoids, K means or Gaussian mixture models in consensus clustering.
Package: | sharp |
Type: | Package |
Version: | 1.4.6 |
Date: | 2024-02-03 |
License: | GPL (>= 3) |
Maintainer: | Barbara Bodinier <[email protected]> |
Bodinier B, Vuckovic D, Rodrigues S, Filippi S, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration of consensus weighted distance-based clustering approaches using sharp.” Bioinformatics, btad635. ISSN 1367-4811, doi:10.1093/bioinformatics/btad635, https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad635/52191190/btad635.pdf.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Monti S, Tamayo P, Mesirov J, Golub T (2003). “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.” Machine Learning, 52(1), 91–118. doi:10.1023/A:1023949509487.
oldpar <- par(no.readonly = TRUE) par(mar = c(5, 5, 5, 5)) ## Regression models # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50) # Stability selection stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata) CalibrationPlot(stab) summary(stab) SelectedVariables(stab) ## Graphical models # Data simulation set.seed(1) simul <- SimulateGraphical(n = 100, pk = 20, topology = "scale-free") # Stability selection stab <- GraphicalModel(xdata = simul$data) CalibrationPlot(stab) summary(stab) plot(stab) ## PCA models if (requireNamespace("elasticnet", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateComponents(pk = c(5, 3, 4)) plot(simul) # Stability selection stab <- BiSelection( xdata = simul$data, ncomp = 3, implementation = SparsePCA ) CalibrationPlot(stab) summary(stab) SelectedVariables(stab) } ## PLS models if (requireNamespace("sgPLS", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(n = 50, pk = c(10, 20, 30), family = "gaussian") # Stability selection stab <- BiSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", ncomp = 3, implementation = SparsePLS ) CalibrationPlot(stab) summary(stab) plot(stab) } par(oldpar)
Computes descriptive statistics (defined by FUN) for coefficients of
the (calibrated) models conditionally on selection across resampling
iterations.
AggregatedEffects( stability, lambda_id = NULL, side = "X", comp = 1, FUN = stats::median, ... )
stability |
output of |
lambda_id |
parameter ID with respect to the grid |
side |
character string indicating if coefficients of predictors
( |
comp |
component ID. Only applicable to PLS models. |
FUN |
function to use to aggregate coefficients of visited models over
resampling iterations. Recommended functions include
|
... |
additional arguments to be passed to |
A matrix of summarised coefficients conditionally on selection across
resampling iterations. Missing values (NA) are returned for variables that
are never selected.
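To make the aggregation concrete, the following is a minimal base-R sketch (not the package implementation), where beta is a hypothetical K x p matrix of coefficients across resampling iterations and a zero coefficient encodes non-selection:

set.seed(1)
beta <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5) # K = 100 iterations, p = 5 variables
beta[sample(length(beta), 300)] <- 0 # sparsify: zero encodes non-selection
aggregated <- apply(beta, 2, function(b) {
  b <- b[b != 0] # condition on selection: keep iterations where the variable is selected
  if (length(b) == 0) NA else median(b) # NA for variables that are never selected
})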
VariableSelection
, BiSelection
,
Refit
# Example with univariate outcome set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian") median_betas <- AggregatedEffects(stab) # Comparison with refitted model refitted <- Refit(xdata = simul$xdata, ydata = simul$ydata, stability = stab) refitted_betas <- coef(refitted)[-1, 1] plot(median_betas[names(refitted_betas), ], refitted_betas, panel.first = abline(0, 1, lty = 2) ) # Extracting mean betas conditionally on selection mean_betas <- AggregatedEffects(stab, FUN = mean) plot(median_betas, mean_betas) # Regression with multivariate outcomes set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, q = 2, family = "gaussian") stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "mgaussian") median_betas <- AggregatedEffects(stab) dim(median_betas) # Sparse PLS with multivariate outcome if (requireNamespace("sgPLS", quietly = TRUE)) { set.seed(1) simul <- SimulateRegression(n = 50, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), implementation = SparsePLS ) median_betas <- AggregatedEffects(stab) dim(median_betas) median_betas <- AggregatedEffects(stab, side = "Y") dim(median_betas) }
Extracts the calibrated hyper-parameters (or their indices for ArgmaxId)
with respect to the grids provided in Lambda and pi_list in argument
stability.
ArgmaxId(stability = NULL, S = NULL) Argmax(stability)
stability |
output of |
S |
matrix of stability scores obtained with different combinations of
parameters where rows correspond to different values of the parameter
controlling the level of sparsity in the underlying feature selection
algorithm and columns correspond to different values of the threshold in
selection proportions. If |
A matrix of hyper-parameters (Argmax) or indices (ArgmaxId). For
multi-block graphical models, rows correspond to different blocks.
VariableSelection
, GraphicalModel
# Data simulation set.seed(1) simul <- SimulateGraphical(pk = 20) # Stability selection stab <- GraphicalModel(xdata = simul$data) # Extracting calibrated hyper-parameters Argmax(stab) # Extracting calibrated hyper-parameters IDs ids <- ArgmaxId(stab) ids # Relationship between the two functions stab$Lambda[ids[1], 1] stab$params$pi_list[ids[2]]
Performs stability selection for dimensionality reduction. The underlying variable selection algorithm (e.g. sparse PLS) is run with different combinations of parameters controlling the sparsity (e.g. number of selected variables per component) and thresholds in selection proportions. These hyper-parameters are jointly calibrated by maximisation of the stability score.
BiSelection( xdata, ydata = NULL, group_x = NULL, group_y = NULL, LambdaX = NULL, LambdaY = NULL, AlphaX = NULL, AlphaY = NULL, ncomp = 1, scale = TRUE, pi_list = seq(0.01, 0.99, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, family = "gaussian", implementation = SparsePLS, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
group_x |
vector encoding the grouping structure among predictors. This
argument indicates the number of variables in each group. Only used for
models with group penalisation (e.g. |
group_y |
optional vector encoding the grouping structure among
outcomes. This argument indicates the number of variables in each group.
Only used if |
LambdaX |
matrix of parameters controlling the number of selected variables (for sparse PCA/PLS) or groups (for group and sparse group PLS) in X. |
LambdaY |
matrix of parameters controlling the number of selected
variables (for sparse PLS) or groups (for group or sparse group PLS) in Y.
Only used if |
AlphaX |
matrix of parameters controlling the level of sparsity within
groups in X. Only used if |
AlphaY |
matrix of parameters controlling the level of sparsity within
groups in Y. Only used if |
ncomp |
number of components. |
scale |
logical indicating if the data should be scaled (i.e. transformed so that all variables have a standard deviation of one). |
pi_list |
vector of thresholds in selection proportions. If
|
K |
number of resampling iterations. |
tau |
subsample size. Only used if |
seed |
value of the seed to initialise the random number generator and
ensure reproducibility of the results (see |
n_cat |
computation options for the stability score. Default is
|
family |
type of PLS model. This parameter must be set to
|
implementation |
function to use for feature selection. Possible
functions are: |
resampling |
resampling approach. Possible values are:
|
cpss |
logical indicating if complementary pair stability selection
should be done. For this, the algorithm is applied on two non-overlapping
subsets of half of the observations. A feature is considered as selected if
it is selected for both subsamples. With this method, the data is split
|
PFER_method |
method used to compute the upper-bound of the expected
number of False Positives (or Per Family Error Rate, PFER). If
|
PFER_thr |
threshold in PFER for constrained calibration by error
control. If |
FDP_thr |
threshold in the expected proportion of falsely selected
features (or False Discovery Proportion) for constrained calibration by
error control. If |
n_cores |
number of cores to use for parallel computing (see argument
|
output_data |
logical indicating if the input datasets |
verbose |
logical indicating if a loading bar and messages should be printed. |
beep |
sound indicating the end of the run. Possible values are:
|
... |
additional parameters passed to the functions provided in
|
In stability selection, a feature selection algorithm is fitted on
K
subsamples (or bootstrap samples) of the data with different
parameters controlling the sparsity (LambdaX
, LambdaY
,
AlphaX
, and/or AlphaY
). For a given (set of) sparsity
parameter(s), the proportion out of the K
models in which each
feature is selected is calculated. Features with selection proportions
above a threshold pi are considered stably selected. The stability
selection model is controlled by the sparsity parameter(s) (denoted by
lambda) for the underlying algorithm, and the threshold in selection
proportion pi, which define the stable set:
V(lambda, pi) = {j : p_lambda(j) >= pi},
where p_lambda(j) is the selection proportion of feature j.
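As a minimal illustration of this definition (a sketch on simulated selection indicators, not the package internals), the stable set can be computed from a K x p matrix of binary selection statuses:

set.seed(1)
K <- 100 # number of resampling iterations
selected <- matrix(rbinom(K * 10, size = 1, prob = 0.3), nrow = K) # selection status of 10 features per iteration, for a fixed lambda
selprop <- colMeans(selected) # selection proportions p_lambda(j)
pi_thr <- 0.9 # threshold in selection proportions
stable_set <- which(selprop >= pi_thr) # V(lambda, pi) = {j : p_lambda(j) >= pi}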
For sparse and sparse group dimensionality reduction, "feature" refers to
variable (variable selection model). For group PLS, "feature" refers to
group (group selection model). For (sparse) group PLS, groups need to be
defined a priori and specified in arguments group_x
and/or
group_y
.
These parameters can be calibrated by maximisation of a stability score
(see ConsensusScore
if n_cat=NULL
or
StabilityScore
otherwise) calculated under the null
hypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully to
check that the grids of parameters Lambda
and pi_list
do not
restrict the calibration to a region that would not include the global
maximum (see CalibrationPlot
). In particular, the grid
Lambda
may need to be extended when the maximum stability is
observed on the left or right edges of the calibration heatmap. In some
instances, multiple peaks of stability score can be observed. Simulation
studies suggest that the peak corresponding to the largest number of
selected features tends to give better selection performance. This is not
necessarily the highest peak (which is automatically retained by the
functions in this package). The user can decide to manually choose another
peak.
To control the expected number of False Positives (Per Family Error Rate)
in the results, a threshold PFER_thr
can be specified. The
optimisation problem is then constrained to sets of parameters that
generate models with an upper-bound in PFER below PFER_thr
(see
Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
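For instance, the call below (a hedged sketch assuming a predictor matrix x and an outcome matrix y as in the examples) restricts calibration to parameter pairs whose PFER upper-bound is below 10:

stab <- BiSelection(
  xdata = x, ydata = y, # assumed predictor and outcome matrices
  family = "gaussian", ncomp = 2,
  implementation = SparsePLS,
  PFER_thr = 10 # constrained calibration by error control
)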
Possible resampling procedures include defining (i) K
subsamples of
a proportion tau
of the observations, (ii) K
bootstrap samples
with the full sample size (obtained with replacement), and (iii) K/2
splits of the data in half for complementary pair stability selection (see
arguments resampling
and cpss
). In complementary pair
stability selection, a feature is considered selected at a given resampling
iteration if it is selected in the two complementary subsamples.
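The three schemes can be sketched in base R as follows (an illustration of the index draws, not the package's Resample implementation):

n <- 100 # number of observations
tau <- 0.5 # subsample size
subsample <- sample(n, size = tau * n) # (i) subsampling without replacement
bootstrap <- sample(n, size = n, replace = TRUE) # (ii) bootstrapping
half1 <- sample(n, size = n / 2) # (iii) complementary pairs:
half2 <- setdiff(seq_len(n), half1) # two non-overlapping halves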
For categorical outcomes (argument family
is "binomial"
or
"multinomial"
), the proportions of observations from each category
in all subsamples or bootstrap samples are the same as in the full sample.
To ensure reproducibility of the results, the starting number of the random
number generator is set to seed
.
For parallelisation, stability selection with different sets of parameters
can be run on n_cores
cores. Using n_cores > 1
creates a
multisession
.
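Alternatively, runs with different seeds can be parallelised manually and merged with Combine (a sketch assuming xdata and ydata exist; mclapply forks and is not available on Windows):

runs <- parallel::mclapply(1:2, function(s) {
  VariableSelection(xdata = xdata, ydata = ydata, seed = s, K = 50) # one seed per core
}, mc.cores = 2)
stab <- Combine(stability1 = runs[[1]], stability2 = runs[[2]])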
An object of class bi_selection
. A list with:
summary |
a
matrix of the best stability scores and corresponding parameters
controlling the level of sparsity in the underlying algorithm for different
numbers of components. Possible columns include: |
summary_full |
a matrix of the best stability scores for different combinations of parameters controlling the sparsity and components. |
selectedX |
a binary matrix encoding stably selected predictors. |
selpropX |
a matrix of calibrated selection proportions for predictors. |
selectedY |
a binary matrix encoding stably selected outcomes. Only returned for PLS models. |
selpropY |
a matrix of calibrated selection proportions for outcomes. Only returned for PLS models. |
selected |
a binary matrix encoding stable relationships between predictor and outcome variables. Only returned for PLS models. |
selectedX_full |
a binary matrix encoding stably selected predictors. |
selpropX_full |
a matrix of selection proportions for predictors. |
selectedY_full |
a binary matrix encoding stably selected outcomes. Only returned for PLS models. |
selpropY_full |
a matrix of selection proportions for outcomes. Only returned for PLS models. |
coefX |
an
array of estimated loading coefficients for the different components
(rows), for the predictors (columns), as obtained across the |
coefY |
an array of
estimated loading coefficients for the different components (rows), for
the outcomes (columns), as obtained across the |
method |
a
list with |
params |
a list with values used
for arguments |
The rows of
summary
and columns of selectedX
, selectedY
,
selpropX
, selpropY
, selected
, coefX
and
coefY
are ordered in the same way and correspond to components and
parameter values stored in summary
. The rows of summary_full
and columns of selectedX_full
, selectedY_full
,
selpropX_full
and selpropY_full
are ordered in the same way
and correspond to components and parameter values stored in
summary_full
.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016). “Group and sparse group partial least square approaches applied in genomics context.” Bioinformatics, 32(1), 35-42. ISSN 1367-4803, doi:10.1093/bioinformatics/btv535.
Lê Cao KA, Rossouw D, Robert-Granié C, Besse P (2008). “A sparse PLS for variable selection when integrating omics data.” Stat Appl Genet Mol Biol, 7(1), Article 35. ISSN 1544-6115, doi:10.2202/1544-6115.1390.
Shen H, Huang JZ (2008). “Sparse principal component analysis via regularized low rank matrix approximation.” Journal of Multivariate Analysis, 99(6), 1015-1034. ISSN 0047-259X, doi:10.1016/j.jmva.2007.06.007.
Zou H, Hastie T, Tibshirani R (2006). “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics, 15(2), 265-286. doi:10.1198/106186006X113430.
SparsePCA
, SparsePLS
,
GroupPLS
, SparseGroupPLS
,
VariableSelection
, Resample
,
StabilityScore
Other stability functions:
Clustering()
,
GraphicalModel()
,
StructuralModel()
,
VariableSelection()
if (requireNamespace("sgPLS", quietly = TRUE)) { oldpar <- par(no.readonly = TRUE) par(mar = c(12, 5, 1, 1)) ## Sparse Principal Component Analysis # Data simulation set.seed(1) simul <- SimulateComponents(pk = c(5, 3, 4)) # sPCA: sparsity on X (unsupervised) stab <- BiSelection( xdata = simul$data, ncomp = 2, LambdaX = seq_len(ncol(simul$data) - 1), implementation = SparsePCA ) print(stab) # Calibration plot CalibrationPlot(stab) # Visualisation of the results summary(stab) plot(stab) SelectedVariables(stab) ## Sparse (Group) Partial Least Squares # Data simulation (continuous outcomes) set.seed(1) simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # sPLS: sparsity on X stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), implementation = SparsePLS ) CalibrationPlot(stab) summary(stab) plot(stab) # sPLS: sparsity on both X and Y stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), LambdaY = seq_len(ncol(y) - 1), implementation = SparsePLS, n_cat = 2 ) CalibrationPlot(stab) summary(stab) plot(stab) # sgPLS: sparsity on X stab <- BiSelection( xdata = x, ydata = y, K = 10, group_x = c(2, 8, 5), family = "gaussian", ncomp = 3, LambdaX = seq_len(2), AlphaX = seq(0.1, 0.9, by = 0.1), implementation = SparseGroupPLS ) CalibrationPlot(stab) summary(stab) par(oldpar) }
if (requireNamespace("sgPLS", quietly = TRUE)) { oldpar <- par(no.readonly = TRUE) par(mar = c(12, 5, 1, 1)) ## Sparse Principal Component Analysis # Data simulation set.seed(1) simul <- SimulateComponents(pk = c(5, 3, 4)) # sPCA: sparsity on X (unsupervised) stab <- BiSelection( xdata = simul$data, ncomp = 2, LambdaX = seq_len(ncol(simul$data) - 1), implementation = SparsePCA ) print(stab) # Calibration plot CalibrationPlot(stab) # Visualisation of the results summary(stab) plot(stab) SelectedVariables(stab) ## Sparse (Group) Partial Least Squares # Data simulation (continuous outcomes) set.seed(1) simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # sPLS: sparsity on X stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), implementation = SparsePLS ) CalibrationPlot(stab) summary(stab) plot(stab) # sPLS: sparsity on both X and Y stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), LambdaY = seq_len(ncol(y) - 1), implementation = SparsePLS, n_cat = 2 ) CalibrationPlot(stab) summary(stab) plot(stab) # sgPLS: sparsity on X stab <- BiSelection( xdata = x, ydata = y, K = 10, group_x = c(2, 8, 5), family = "gaussian", ncomp = 3, LambdaX = seq_len(2), AlphaX = seq(0.1, 0.9, by = 0.1), implementation = SparseGroupPLS ) CalibrationPlot(stab) summary(stab) par(oldpar) }
Generates a matrix of parameters controlling the sparsity of the underlying selection algorithm for multi-block calibration.
BlockLambdaGrid(Lambda, lambda_other_blocks = NULL)
Lambda |
vector or matrix of penalty parameters. |
lambda_other_blocks |
optional vector of penalty parameters to use for other blocks in the iterative multi-block procedure. |
A list with:
Lambda |
a matrix of (block-specific) penalty parameters. In multi-block stability selection, rows correspond to sets of penalty parameters and columns correspond to different blocks. |
Sequential_template |
logical matrix encoding the type of procedure
for data with multiple blocks in stability selection graphical modelling.
For multi-block estimation, each block is calibrated separately while
other blocks are weakly penalised ( |
# Multi-block grid Lambda <- matrix( c( 0.8, 0.6, 0.3, 0.5, 0.4, 0.2, 0.7, 0.5, 0.1 ), ncol = 3, byrow = TRUE ) mygrid <- BlockLambdaGrid(Lambda, lambda_other_blocks = 0.1) # Multi-parameter grid (not recommended) Lambda <- matrix( c( 0.8, 0.6, 0.3, 0.5, 0.4, 0.2, 0.7, 0.5, 0.1 ), ncol = 3, byrow = TRUE ) mygrid <- BlockLambdaGrid(Lambda, lambda_other_blocks = NULL)
Creates a plot showing the stability score as a function of the parameter(s)
controlling the level of sparsity in the underlying feature selection
algorithm and/or the threshold in selection proportions. See examples in
VariableSelection
, GraphicalModel
,
Clustering
and BiSelection
.
CalibrationPlot( stability, block_id = NULL, col = NULL, pch = 19, cex = 0.7, xlim = NULL, ylim = NULL, bty = "o", lines = TRUE, lty = 3, lwd = 2, show_argmax = TRUE, show_pix = FALSE, show_piy = FALSE, offset = 0.3, legend = TRUE, legend_length = NULL, legend_range = NULL, ncol = 1, xlab = NULL, ylab = NULL, zlab = expression(italic(q)), xlas = 2, ylas = NULL, zlas = 2, cex.lab = 1.5, cex.axis = 1, cex.legend = 1.2, xgrid = FALSE, ygrid = FALSE, params = c("ny", "alphay", "nx", "alphax") )
stability |
output of |
block_id |
ID of the block to visualise. Only used for multi-block
stability selection graphical models. If |
col |
vector of colours. |
pch |
type of point, as in |
cex |
size of point. |
xlim |
displayed range along the x-axis. Only used if |
ylim |
displayed range along the y-axis. Only used if |
bty |
character string indicating if the box around the plot should be
drawn. Possible values include: |
lines |
logical indicating if the points should be linked by lines. Only
used if |
lty |
line type, as in |
lwd |
line width, as in |
show_argmax |
logical indicating if the calibrated parameter(s) should be indicated by lines. |
show_pix |
logical indicating if the calibrated threshold in selection
proportion in |
show_piy |
logical indicating if the calibrated threshold in selection
proportion in |
offset |
distance between the point and the text, as in
|
legend |
logical indicating if the legend should be included. |
legend_length |
length of the colour bar. Only used if |
legend_range |
range of the colour bar. Only used if |
ncol |
integer indicating the number of columns in the legend. |
xlab |
label of the x-axis. |
ylab |
label of the y-axis. |
zlab |
label of the z-axis. Only used if |
xlas |
orientation of labels on the x-axis, as |
ylas |
orientation of labels on the y-axis, as |
zlas |
orientation of labels on the z-axis, as |
cex.lab |
font size for labels. |
cex.axis |
font size for axes. |
cex.legend |
font size for text legend entries. |
xgrid |
logical indicating if a vertical grid should be drawn. Only used
if |
ygrid |
logical indicating if a horizontal grid should be drawn. Only
used if |
params |
vector of possible parameters if |
A calibration plot.
VariableSelection
, GraphicalModel
,
Clustering
, BiSelection
Runs decision trees using the implementation from rpart.
This function does not use stability.
CART(xdata, ydata, Lambda = NULL, family, ...)
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
Lambda |
matrix of parameters controlling the number of splits in the decision tree. |
family |
type of regression model. This argument is defined as in
|
... |
additional parameters passed to |
A list with:
selected |
matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
beta_full |
array of model coefficients. Rows correspond to different model parameters. Columns correspond to predictors. Indices along the third dimension correspond to outcome variable(s). |
Breiman L, Friedman JH, Olshen R, Stone CJ (1984). Classification and Regression Trees. Wadsworth.
SelectionAlgo
, VariableSelection
Other underlying algorithm functions:
ClusteringAlgo()
,
PenalisedGraphical()
,
PenalisedOpenMx()
,
PenalisedRegression()
if (requireNamespace("rpart", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(pk = 50) # Running the LASSO mycart <- CART( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian" ) head(mycart$selected) }
if (requireNamespace("rpart", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(pk = 50) # Running the LASSO mycart <- CART( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian" ) head(mycart$selected) }
Performs consensus (weighted) clustering. The underlying algorithm (e.g.
hierarchical clustering) is run with different numbers of clusters nc.
In consensus weighted clustering, weighted distances are calculated using the
cosa2
algorithm with different penalty parameters
Lambda
. The hyper-parameters are calibrated by maximisation of the
consensus score.
Clustering( xdata, nc = NULL, eps = NULL, Lambda = NULL, K = 100, tau = 0.5, seed = 1, n_cat = 3, implementation = HierarchicalClustering, scale = TRUE, linkage = "complete", row = TRUE, optimisation = c("grid_search", "nloptr"), n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ... )
xdata |
data matrix with observations as rows and variables as columns. |
nc |
matrix of parameters controlling the number of clusters in the
underlying algorithm specified in |
eps |
radius in density-based clustering, see
|
Lambda |
vector of penalty parameters for weighted distance calculation.
Only used for distance-based clustering, including for example
|
K |
number of resampling iterations. |
tau |
subsample size. |
seed |
value of the seed to initialise the random number generator and
ensure reproducibility of the results (see |
n_cat |
computation options for the stability score. Default is
|
implementation |
function to use for clustering. Possible functions
include |
scale |
logical indicating if the data should be scaled to ensure that all variables contribute equally to the clustering of the observations. |
linkage |
character string indicating the type of linkage used in
hierarchical clustering to define the stable clusters. Possible values
include |
row |
logical indicating if rows (if |
optimisation |
character string indicating the type of optimisation
method to calibrate the regularisation parameter (only used if
|
n_cores |
number of cores to use for parallel computing (see argument
|
output_data |
logical indicating if the input datasets |
verbose |
logical indicating if a loading bar and messages should be printed. |
beep |
sound indicating the end of the run. Possible values are:
|
... |
additional parameters passed to the functions provided in
|
In consensus clustering, a clustering algorithm is applied on
K
subsamples of the observations with different numbers of clusters
provided in nc
. If row=TRUE
(the default), the observations
(rows) are the items to cluster. If row=FALSE
, the variables
(columns) are the items to cluster. For a given number of clusters, the
consensus matrix coprop
stores the proportion of iterations where
two items were in the same estimated cluster, out of all iterations where
both items were drawn in the subsample.
Stable cluster membership is obtained by applying a distance-based
clustering method using (1-coprop)
as distance (see
Clusters).
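The construction can be sketched in base R as follows (a simplified illustration using k-means, not the package code):

set.seed(1)
x <- matrix(rnorm(30 * 5), nrow = 30) # 30 items to cluster
K <- 50; tau <- 0.5; nc <- 3
copair <- matrix(0, 30, 30) # co-membership counts
sampled <- matrix(0, 30, 30) # co-sampling counts
for (k in seq_len(K)) {
  ids <- sample(30, size = tau * 30) # subsample of the items
  cl <- kmeans(x[ids, ], centers = nc)$cluster
  sampled[ids, ids] <- sampled[ids, ids] + 1
  copair[ids, ids] <- copair[ids, ids] + outer(cl, cl, "==")
}
coprop <- copair / pmax(sampled, 1) # consensus matrix
stable <- cutree(hclust(as.dist(1 - coprop)), k = nc) # stable clusters from (1 - coprop)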
These parameters can be calibrated by maximisation of a stability score
(see ConsensusScore
) calculated under the null hypothesis of
equiprobability of co-membership.
It is strongly recommended to examine the calibration plot (see
CalibrationPlot
) to check that there is a clear maximum. The
absence of a clear maximum suggests that the clustering is not stable;
consensus clustering outputs should not be trusted in that case.
To ensure reproducibility of the results, the starting number of the random
number generator is set to seed
.
For parallelisation, stability selection with different sets of parameters
can be run on n_cores
cores. Using n_cores > 1
creates a
multisession
.
An object of class clustering
. A list with:
Sc |
a matrix of the best stability scores for different (sets of) parameters controlling the number of clusters and penalisation of attribute weights. |
nc |
a matrix of numbers of clusters. |
Lambda |
a matrix of regularisation parameters for attribute weights. |
Q |
a matrix of the average number of selected attributes by the underlying algorithm with different regularisation parameters. |
coprop |
an array of consensus matrices. Rows and columns correspond to items. Indices along the third dimension correspond to different parameters controlling the number of clusters and penalisation of attribute weights. |
selprop |
an array of selection proportions. Columns correspond to attributes. Rows correspond to different parameters controlling the number of clusters and penalisation of attribute weights. |
method |
a list with |
params |
a list with values used for arguments
|
The rows of Sc
, nc
,
Lambda
, Q
, selprop
and indices along the third
dimension of coprop
are ordered in the same way and correspond to
parameter values stored in nc
and Lambda
.
Bodinier B, Vuckovic D, Rodrigues S, Filippi S, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration of consensus weighted distance-based clustering approaches using sharp.” Bioinformatics, btad635. ISSN 1367-4811, doi:10.1093/bioinformatics/btad635, https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad635/52191190/btad635.pdf.
Kampert MM, Meulman JJ, Friedman JH (2017). “rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.” Journal of Classification, 34(3), 514–547. doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004). “Clustering objects on subsets of attributes (with discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4), 815-849. doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
Monti S, Tamayo P, Mesirov J, Golub T (2003). “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.” Machine Learning, 52(1), 91–118. doi:10.1023/A:1023949509487.
Resample
, ConsensusScore
,
HierarchicalClustering
, PAMClustering
,
KMeansClustering
, GMMClustering
Other stability functions:
BiSelection()
,
GraphicalModel()
,
StructuralModel()
,
VariableSelection()
# Consensus clustering set.seed(1) simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1, ev_xc = 0.5 ) stab <- Clustering(xdata = simul$data) print(stab) CalibrationPlot(stab) summary(stab) Clusters(stab) plot(stab) # Consensus weighted clustering if (requireNamespace("rCOSA", quietly = TRUE)) { set.seed(1) simul <- SimulateClustering( n = c(30, 30, 30), pk = 20, theta_xc = c(rep(1, 10), rep(0, 10)), ev_xc = 0.9 ) stab <- Clustering( xdata = simul$data, Lambda = LambdaSequence(lmin = 0.1, lmax = 10, cardinal = 10), noit = 20, niter = 10 ) print(stab) CalibrationPlot(stab) summary(stab) Clusters(stab) plot(stab) WeightBoxplot(stab) }
Runs the (weighted) clustering algorithm specified in the argument
implementation
and returns matrices of variable weights, and the
co-membership structure. This function does not use stability.
ClusteringAlgo( xdata, nc = NULL, eps = NULL, Lambda = NULL, scale = TRUE, row = TRUE, implementation = HierarchicalClustering, ... )
xdata |
data matrix with observations as rows and variables as columns. |
nc |
matrix of parameters controlling the number of clusters in the
underlying algorithm specified in |
eps |
radius in density-based clustering, see
|
Lambda |
vector of penalty parameters. |
scale |
logical indicating if the data should be scaled to ensure that all variables contribute equally to the clustering of the observations. |
row |
logical indicating if rows (if |
implementation |
function to use for clustering. Possible functions
include |
... |
additional parameters passed to the function provided in
|
A list with:
selected |
matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
weight |
array of variable weights. Rows correspond to different model parameters. Columns correspond to variables. |
comembership |
array of binary and symmetric co-membership matrices. Indices along the third dimension correspond to different model parameters. |
Other underlying algorithm functions:
CART()
,
PenalisedGraphical()
,
PenalisedOpenMx()
,
PenalisedRegression()
# Simulation of 15 observations belonging to 3 groups set.seed(1) simul <- SimulateClustering( n = c(5, 5, 5), pk = 100 ) # Running hierarchical clustering myclust <- ClusteringAlgo( xdata = simul$data, nc = 2:5, implementation = HierarchicalClustering )
Computes different metrics of clustering performance by comparing true and predicted co-membership. This function can only be used in simulation studies (i.e. when the true cluster membership is known).
ClusteringPerformance(theta, theta_star, ...)
theta |
output from |
theta_star |
output from |
... |
additional arguments to be passed to |
A matrix of selection metrics including:
TP |
number of True Positives (TP) |
FN |
number of False Negatives (FN) |
FP |
number of False Positives (FP) |
TN |
number of True Negatives (TN) |
sensitivity |
sensitivity, i.e. TP/(TP+FN) |
specificity |
specificity, i.e. TN/(TN+FP) |
accuracy |
accuracy, i.e. (TP+TN)/(TP+TN+FP+FN) |
precision |
precision (p), i.e. TP/(TP+FP) |
recall |
recall (r), i.e. TP/(TP+FN) |
F1_score |
F1-score, i.e. 2*p*r/(p+r) |
rand |
Rand Index, i.e. (TP+TN)/(TP+FP+TN+FN) |
ari |
Adjusted Rand Index (ARI), i.e. 2*(TP*TN-FP*FN)/((TP+FP)*(TN+FP)+(TP+FN)*(TN+FN)) |
jaccard |
Jaccard index, i.e. TP/(TP+FP+FN) |
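These metrics are computed over pairs of items; a minimal sketch (not the package implementation) computes them from two co-membership matrices:

pair_metrics <- function(theta, theta_star) {
  est <- theta[upper.tri(theta)] # each pair of items counted once
  true <- theta_star[upper.tri(theta_star)]
  tp <- sum(est == 1 & true == 1); fp <- sum(est == 1 & true == 0)
  fn <- sum(est == 0 & true == 1); tn <- sum(est == 0 & true == 0)
  p <- tp / (tp + fp); r <- tp / (tp + fn) # precision and recall
  c(TP = tp, FP = fp, FN = fn, TN = tn, F1_score = 2 * p * r / (p + r))
}
pair_metrics(CoMembership(c(1, 1, 2, 2)), CoMembership(c(1, 1, 1, 2)))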
Other functions for model performance:
SelectionPerformance()
,
SelectionPerformanceGraph()
# Data simulation set.seed(1) simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1 ) plot(simul) # Consensus clustering stab <- Clustering( xdata = simul$data, nc = seq_len(5) ) # Clustering performance ClusteringPerformance(stab, simul) # Alternative formulation ClusteringPerformance( theta = CoMembership(Clusters(stab)), theta_star = simul$theta )
Merges the outputs from two runs of VariableSelection
,
GraphicalModel
or Clustering
. The two runs must
have been done using the same methods
and the same params
but
with different seed
s. The combined output will contain results based
on iterations from both stability1
and stability2
. This
function can be used for parallelisation.
Combine(stability1, stability2, include_beta = TRUE)
stability1 |
output from a first run of |
stability2 |
output from a second run of
|
include_beta |
logical indicating if the beta coefficients of visited models should be concatenated. Only applicable to variable selection or clustering. |
A single output of the same format.
VariableSelection
, GraphicalModel
## Variable selection # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") # Two runs stab1 <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, seed = 1, K = 10) stab2 <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, seed = 2, K = 10) # Merging the outputs stab <- Combine(stability1 = stab1, stability2 = stab2, include_beta = FALSE) str(stab) ## Graphical modelling # Data simulation simul <- SimulateGraphical(pk = 20) # Two runs stab1 <- GraphicalModel(xdata = simul$data, seed = 1, K = 10) stab2 <- GraphicalModel(xdata = simul$data, seed = 2, K = 10) # Merging the outputs stab <- Combine(stability1 = stab1, stability2 = stab2) str(stab) ## Clustering # Data simulation simul <- SimulateClustering(n = c(15, 15, 15)) # Two runs stab1 <- Clustering(xdata = simul$data, seed = 1) stab2 <- Clustering(xdata = simul$data, seed = 2) # Merging the outputs stab <- Combine(stability1 = stab1, stability2 = stab2) str(stab)
Generates a symmetric and binary matrix indicating if two items are co-members, i.e. if they belong to the same cluster.
CoMembership(groups)
groups |
vector of group membership. |
A symmetric and binary matrix.
# Simulated grouping structure mygroups <- c(rep(1, 3), rep(2, 5), rep(3, 2)) # Co-membership matrix CoMembership(mygroups)
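For intuition, the matrix from the example above can equivalently be obtained with a base-R outer comparison (the zero diagonal is an assumption here; compare against the output of CoMembership directly):

comember <- 1 * outer(mygroups, mygroups, "==") # 1 if two items share a group
diag(comember) <- 0 # assuming items are not counted as their own co-members; verify against CoMembership(mygroups)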
Computes the consensus score from the consensus matrix, matrix of co-sampling counts and consensus clusters. The score is a z statistic for the comparison of the co-membership proportions observed within and between the consensus clusters.
ConsensusScore(prop, K, theta)
prop |
consensus matrix. |
K |
matrix of co-sampling counts. |
theta |
consensus co-membership matrix. |
To calculate the consensus score, the features are classified as being stably
selected or not (in selection) or as being in the same consensus cluster or
not (in clustering). In selection, the quantities N_1 and N_0 are defined as
the sum of the selection counts for features that are stably selected or not,
respectively. In clustering, the quantities N_1 and N_0 are defined as the
sum of the co-membership counts for pairs of items in the same consensus
cluster or in different consensus clusters, respectively.
Conditionally on this classification, and under the assumption that the
selection (or co-membership) probabilities are the same for all features (or
item pairs) in each of these two categories, the quantities N_1 and N_0
follow binomial distributions with probabilities p_1 and p_0, respectively.
In the most unstable situation, we suppose that all features (or item pairs)
would have the same probability of being selected (or co-members). The
consensus score is the z statistic from a z test where the null hypothesis is
p_1 = p_0, i.e. that the selection (or co-membership) probabilities are
identical in the two categories.
The consensus score increases with stability.
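As an illustration, a two-proportion z statistic of this kind can be computed as follows (a sketch under the stated assumptions; the package's exact formula may differ):

zscore <- function(N1, K1, N0, K0) {
  p_pooled <- (N1 + N0) / (K1 + K0) # pooled proportion under the null p_1 = p_0
  (N1 / K1 - N0 / K0) / sqrt(p_pooled * (1 - p_pooled) * (1 / K1 + 1 / K0))
}
zscore(N1 = 900, K1 = 1000, N0 = 100, K0 = 1000) # large score: stable clustering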
A consensus score.
Other stability metric functions:
FDP()
,
PFER()
,
StabilityMetrics()
,
StabilityScore()
# Data simulation set.seed(2) simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1 ) plot(simul) # Consensus clustering stab <- Clustering( xdata = simul$data ) stab$Sc[3] # Calculating the consensus score theta <- CoMembership(Clusters(stab, argmax_id = 3)) ConsensusScore( prop = (stab$coprop[, , 3])[upper.tri(stab$coprop[, , 3])], K = stab$sampled_pairs[upper.tri(stab$sampled_pairs)], theta = theta[upper.tri(theta)] )
# Data simulation set.seed(2) simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1 ) plot(simul) # Consensus clustering stab <- Clustering( xdata = simul$data ) stab$Sc[3] # Calculating the consensus score theta <- CoMembership(Clusters(stab, argmax_id = 3)) ConsensusScore( prop = (stab$coprop[, , 3])[upper.tri(stab$coprop[, , 3])], K = stab$sampled_pairs[upper.tri(stab$sampled_pairs)], theta = theta[upper.tri(theta)] )
Runs Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
using the implementation from dbscan. If Lambda is provided,
clustering is applied on the weighted distance matrix calculated using the
COSA algorithm as implemented in cosa2. Otherwise,
distances are calculated using dist. This function does not use stability.
DBSCANClustering( xdata, nc = NULL, eps = NULL, Lambda = NULL, distance = "euclidean", ... )
xdata |
data matrix with observations as rows and variables as columns. |
nc |
matrix of parameters controlling the number of clusters in the
underlying algorithm specified in |
eps |
radius in density-based clustering, see
|
Lambda |
vector of penalty parameters (see argument |
distance |
character string indicating the type of distance to use. If
|
... |
additional parameters passed to |
A list with:
comembership |
an array of binary and symmetric co-membership matrices. |
weights |
a matrix of median weights by feature. |
Kampert MM, Meulman JJ, Friedman JH (2017). “rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.” Journal of Classification, 34(3), 514–547. doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004). “Clustering objects on subsets of attributes (with discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4), 815-849. doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
Other clustering algorithms:
GMMClustering()
,
HierarchicalClustering()
,
KMeansClustering()
,
PAMClustering()
if (requireNamespace("dbscan", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateClustering(n = c(10, 10), pk = 50) plot(simul) # DBSCAN clustering myclust <- DBSCANClustering( xdata = simul$data, eps = seq(0, 2 * sqrt(ncol(simul$data) - 1), by = 0.1) ) # Weighted PAM clustering (using COSA) if (requireNamespace("rCOSA", quietly = TRUE)) { myclust <- DBSCANClustering( xdata = simul$data, eps = c(0.25, 0.5, 0.75), Lambda = c(0.2, 0.5) ) } }
if (requireNamespace("dbscan", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateClustering(n = c(10, 10), pk = 50) plot(simul) # DBSCAN clustering myclust <- DBSCANClustering( xdata = simul$data, eps = seq(0, 2 * sqrt(ncol(simul$data) - 1), by = 0.1) ) # Weighted PAM clustering (using COSA) if (requireNamespace("rCOSA", quietly = TRUE)) { myclust <- DBSCANClustering( xdata = simul$data, eps = c(0.25, 0.5, 0.75), Lambda = c(0.2, 0.5) ) } }
Creates an ensemble predictive model from VariableSelection
outputs.
Ensemble(stability, xdata, ydata)
stability |
output of |
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
An object of class ensemble_model
. A list with:
intercept |
a vector of refitted intercepts for the |
beta |
a matrix of beta coefficients from the
|
models |
a list of |
family |
type of regression, extracted from
|
Other ensemble model functions:
EnsemblePredictions()
# Linear regression set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian") ensemble <- Ensemble(stability = stab, xdata = simul$xdata, ydata = simul$ydata) # Logistic regression set.seed(1) simul <- SimulateRegression(n = 200, pk = 20, family = "binomial") stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "binomial") ensemble <- Ensemble(stability = stab, xdata = simul$xdata, ydata = simul$ydata)
Makes predictions using an ensemble model created from
VariableSelection
outputs. For each observation in
xdata
, the predictions are calculated as the average predicted values
obtained for that observation over the K
models fitted in calibrated
stability selection.
EnsemblePredictions(ensemble, xdata, ...)
ensemble |
output of |
xdata |
matrix of predictors with observations as rows and variables as columns. |
... |
additional parameters passed to |
A matrix of predictions computed from the observations in
xdata
.
Other ensemble model functions:
Ensemble()
# Data simulation set.seed(1) simul <- SimulateRegression(n = 1000, pk = 50, family = "gaussian") # Training/test split ids <- Split(data = simul$ydata, tau = c(0.8, 0.2)) stab <- VariableSelection( xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ] ) # Constructing the ensemble model ensemble <- Ensemble( stability = stab, xdata = simul$xdata[ids[[1]], ], ydata = simul$ydata[ids[[1]], ] ) # Making predictions yhat <- EnsemblePredictions( ensemble = ensemble, xdata = simul$xdata[ids[[2]], ] ) # Calculating Q-squared cor(simul$ydata[ids[[2]], ], yhat)^2
Calculates model performance for linear (measured by Q-squared), logistic
(AUC) or Cox (C-statistic) regression. This is done by (i) refitting the
model on a training set including a proportion tau
of the
observations, and (ii) evaluating the performance on the remaining
observations (test set). For more reliable results, the procedure can be
repeated K
times (default K=1
).
ExplanatoryPerformance( xdata, ydata, new_xdata = NULL, new_ydata = NULL, stability = NULL, family = NULL, implementation = NULL, prediction = NULL, resampling = "subsampling", K = 1, tau = 0.8, seed = 1, n_thr = NULL, time = 1000, verbose = FALSE, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
new_xdata |
optional test set (predictor data). |
new_ydata |
optional test set (outcome data). |
stability |
output of |
family |
type of regression model. Possible values include
|
implementation |
optional function to refit the model. If
|
prediction |
optional function to compute predicted values from the
model refitted with |
resampling |
resampling approach to create the training set. The default
is |
K |
number of training-test splits. Only used if |
tau |
proportion of observations used in the training set. Only used if
|
seed |
value of the seed to ensure reproducibility of the results. Only
used if |
n_thr |
number of thresholds to use to construct the ROC curve. If
|
time |
numeric indicating the time for which the survival probabilities are computed. Only applicable to Cox regression. |
verbose |
logical indicating if a loading bar and messages should be printed. |
... |
additional parameters passed to the function provided in
|
For a fair evaluation of the prediction performance, the data is
split into a training set (including a proportion tau
of the
observations) and test set (remaining observations). The regression model
is fitted on the training set and applied on the test set. Performance
metrics are computed in the test set by comparing predicted and observed
outcomes.
For logistic regression, a Receiver Operating Characteristic (ROC) analysis is performed: the True and False Positive Rates (TPR and FPR), and Area Under the Curve (AUC) are computed for different thresholds in predicted probabilities.
For Cox regression, the Concordance Index (as implemented in
concordance
) looking at survival probabilities up
to a specific time
is computed.
For linear regression, the squared correlation between predicted and observed outcome in the test set (Q-squared) is reported.
A list with:
TPR | True Positive Rate (for logistic regression only). |
FPR | False Positive Rate (for logistic regression only). |
AUC | Area Under the Curve (for logistic regression only). |
concordance | Concordance index (for Cox regression only). |
Beta | matrix of estimated beta coefficients across the |
Other prediction performance functions:
Incremental()
# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 1000, pk = 20,
  family = "binomial", ev_xy = 0.8
)

# Data split: selection, training and test set
ids <- Split(
  data = simul$ydata,
  family = "binomial",
  tau = c(0.4, 0.3, 0.3)
)
xselect <- simul$xdata[ids[[1]], ]
yselect <- simul$ydata[ids[[1]], ]
xtrain <- simul$xdata[ids[[2]], ]
ytrain <- simul$ydata[ids[[2]], ]
xtest <- simul$xdata[ids[[3]], ]
ytest <- simul$ydata[ids[[3]], ]

# Stability selection
stab <- VariableSelection(
  xdata = xselect,
  ydata = yselect,
  family = "binomial"
)

# Performances in test set of model refitted in training set
roc <- ExplanatoryPerformance(
  xdata = xtrain, ydata = ytrain,
  new_xdata = xtest, new_ydata = ytest,
  stability = stab
)
plot(roc)
roc$AUC

# Alternative with multiple training/test splits
roc <- ExplanatoryPerformance(
  xdata = rbind(xtrain, xtest),
  ydata = c(ytrain, ytest),
  stability = stab,
  K = 100
)
plot(roc)
boxplot(roc$AUC)

# Partial Least Squares Discriminant Analysis
if (requireNamespace("sgPLS", quietly = TRUE)) {
  stab <- VariableSelection(
    xdata = xselect,
    ydata = yselect,
    implementation = SparsePLS,
    family = "binomial"
  )

  # Defining wrapping functions for predictions from PLS-DA
  PLSDA <- function(xdata, ydata, family = "binomial") {
    model <- mixOmics::plsda(X = xdata, Y = as.factor(ydata), ncomp = 1)
    return(model)
  }
  PredictPLSDA <- function(xdata, model) {
    xdata <- xdata[, rownames(model$loadings$X), drop = FALSE]
    predicted <- predict(object = model, newdata = xdata)$predict[, 2, 1]
    return(predicted)
  }

  # Performances with custom models
  roc <- ExplanatoryPerformance(
    xdata = rbind(xtrain, xtest), ydata = c(ytrain, ytest),
    stability = stab, K = 100,
    implementation = PLSDA, prediction = PredictPLSDA
  )
  plot(roc)
}
Computes the False Discovery Proportion (upper-bound) as the ratio of the PFER (upper-bound) to the number of stably selected features. In stability selection, the FDP corresponds to the expected proportion of stably selected features that are not relevant to the outcome (i.e. the proportion of False Positives among stably selected features).
FDP(selprop, PFER, pi)
selprop | matrix or vector of selection proportions. |
PFER | Per Family Error Rate. |
pi | threshold in selection proportions. |
The estimated upper-bound in FDP.
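This ratio can be written out directly. Below is a minimal sketch, assuming the denominator is the number of features with selection proportions above the threshold (the exact handling of ties and of empty selections in FDP() may differ):

# Minimal sketch: FDP upper-bound as PFER over the number of stably selected features
ManualFDP <- function(selprop, PFER, pi) {
  n_selected <- sum(selprop >= pi) # features considered stably selected
  PFER / max(n_selected, 1) # guard against empty selection (assumption)
}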
Other stability metric functions: ConsensusScore(), PFER(), StabilityMetrics(), StabilityScore()
# Simulating set of selection proportions
selprop <- round(runif(n = 20), digits = 2)

# Computing the FDP with a threshold of 0.8
fdp <- FDP(PFER = 3, selprop = selprop, pi = 0.8)
Generates a list of n_folds non-overlapping sets of observation IDs (folds).
Folds(data, family = NULL, n_folds = 5)
data | vector or matrix of data. In regression, this should be the outcome data. |
family | type of regression model. This argument is defined as in |
n_folds | number of folds. |
For categorical outcomes (i.e. family argument is set to "binomial", "multinomial" or "cox"), the split is done such that the proportion of observations from each of the categories in each of the folds is representative of that of the full sample.
A list of length n_folds with sets of non-overlapping observation IDs.
# Splitting into 5 folds
simul <- SimulateRegression()
ids <- Folds(data = simul$ydata)
lapply(ids, length)

# Balanced folds with respect to a binary variable
simul <- SimulateRegression(family = "binomial")
ids <- Folds(data = simul$ydata, family = "binomial")
lapply(ids, FUN = function(x) {
  table(simul$ydata[x, ])
})
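Since the folds are non-overlapping, the output can also be used to build a manual cross-validation loop. Below is a minimal sketch, where the linear model fitted with lm() and the use of Q-squared as performance metric are illustrative choices, not part of Folds():

# Manual 5-fold cross-validation using Folds()
simul <- SimulateRegression()
mydata <- data.frame(y = simul$ydata[, 1], simul$xdata)
ids <- Folds(data = simul$ydata, n_folds = 5)
q_squared <- sapply(ids, FUN = function(test_ids) {
  mymodel <- lm(y ~ ., data = mydata[-test_ids, ]) # fitting outside of the fold
  yhat <- predict(mymodel, newdata = mydata[test_ids, ]) # predicting within the fold
  cor(mydata[test_ids, "y"], yhat)^2
})
q_squared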
Runs clustering with Gaussian Mixture Models (GMM) using the implementation from Mclust. This function does not use stability.
GMMClustering(xdata, nc = NULL, ...)
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in the underlying algorithm specified in |
... | additional parameters passed to Mclust. |
A list with:
comembership | an array of binary and symmetric co-membership matrices. |
weights | a matrix of median weights by feature. |
Other clustering algorithms: DBSCANClustering(), HierarchicalClustering(), KMeansClustering(), PAMClustering()
# Data simulation
set.seed(1)
simul <- SimulateClustering(n = c(10, 10), pk = 50)

# Clustering using Gaussian Mixture Models
mygmm <- GMMClustering(xdata = simul$data, nc = seq_len(30))
Produces an igraph object from an adjacency matrix.
Graph(
  adjacency,
  node_label = NULL, node_colour = NULL, node_shape = NULL,
  edge_colour = "grey60", label_colour = "grey20",
  mode = "undirected", weighted = FALSE, satellites = FALSE
)
adjacency | adjacency matrix or output of |
node_label | optional vector of node labels. This vector must contain as many entries as there are rows/columns in the adjacency matrix and must be in the same order (the order is used to assign labels to nodes). |
node_colour | optional vector of node colours. This vector must contain as many entries as there are rows/columns in the adjacency matrix and must be in the same order (the order is used to assign colours to nodes). Integers, named colours or RGB values can be used. |
node_shape | optional vector of node shapes. This vector must contain as many entries as there are rows/columns in the adjacency matrix and must be in the same order (the order is used to assign shapes to nodes). Possible values are |
edge_colour | optional character string for edge colour. Integers, named colours or RGB values can be used. |
label_colour | optional character string for label colour. Integers, named colours or RGB values can be used. |
mode | character string indicating how the adjacency matrix should be interpreted. Possible values include |
weighted | indicating if entries of the adjacency matrix should define edge width. If |
satellites | logical indicating if unconnected nodes (satellites) should be included in the igraph object. |
All functionalities implemented in igraph can be used on the output. These include cosmetic changes for the visualisation, but also various tools for network analysis (including topological properties and community detection).

The R package visNetwork offers interactive network visualisation tools. An igraph object can easily be converted to a visNetwork object (see example below).

For Cytoscape users, the RCy3 package can be used to open the network in Cytoscape.
An igraph object.
Adjacency, GraphicalModel, igraph manual, visNetwork manual, Cytoscape
## From adjacency matrix
# Un-weighted
adjacency <- SimulateAdjacency(pk = 20, topology = "scale-free")
plot(Graph(adjacency))

# Weighted
adjacency <- adjacency * runif(prod(dim(adjacency)))
adjacency <- adjacency + t(adjacency)
plot(Graph(adjacency, weighted = TRUE))

# Node colours and shapes
plot(Graph(adjacency, weighted = TRUE, node_shape = "star", node_colour = "red"))

## From stability selection outputs
# Graphical model
set.seed(1)
simul <- SimulateGraphical(pk = 20)
stab <- GraphicalModel(xdata = simul$data)
plot(Graph(stab))

# Sparse PLS
if (requireNamespace("sgPLS", quietly = TRUE)) {
  set.seed(1)
  simul <- SimulateRegression(n = 50, pk = c(5, 5, 5), family = "gaussian")
  x <- simul$xdata
  y <- simul$ydata
  stab <- BiSelection(
    xdata = simul$xdata, ydata = simul$ydata,
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(ncol(x) - 1),
    implementation = SparsePLS
  )
  plot(Graph(stab))
}

## Tools from other packages
# Applying some igraph functionalities
adjacency <- SimulateAdjacency(pk = 20, topology = "scale-free")
mygraph <- Graph(adjacency)
igraph::degree(mygraph)
igraph::betweenness(mygraph)
igraph::shortest_paths(mygraph, from = 1, to = 2)
igraph::walktrap.community(mygraph)

# Interactive view using visNetwork
if (requireNamespace("visNetwork", quietly = TRUE)) {
  vgraph <- mygraph
  igraph::V(vgraph)$shape <- rep("dot", length(igraph::V(vgraph)))
  v <- visNetwork::visIgraph(vgraph)
  mylayout <- as.matrix(v$x$nodes[, c("x", "y")])
  mylayout[, 2] <- -mylayout[, 2]
  plot(mygraph, layout = mylayout)
}

# Opening in Cytoscape using RCy3
if (requireNamespace("RCy3", quietly = TRUE)) {
  # Make sure that Cytoscape is open before running the following line
  # RCy3::createNetworkFromIgraph(mygraph)
}
Generates an igraph object representing the common and graph-specific edges.
GraphComparison(
  graph1, graph2,
  col = c("tomato", "forestgreen", "navy"),
  lty = c(2, 3, 1),
  node_colour = NULL, show_labels = TRUE, ...
)
graph1 | first graph. Possible inputs are: adjacency matrix, or |
graph2 | second graph. |
col | vector of edge colours. The first entry of the vector defines the colour of edges in |
lty | vector of line types for edges. The order is defined as for argument col. |
node_colour | optional vector of node colours. This vector must contain as many entries as there are rows/columns in the adjacency matrix and must be in the same order (the order is used to assign colours to nodes). Integers, named colours or RGB values can be used. |
show_labels | logical indicating if the node labels should be displayed. |
... | additional arguments to be passed to |
An igraph object.
# Data simulation
set.seed(1)
simul1 <- SimulateGraphical(pk = 30)
set.seed(2)
simul2 <- SimulateGraphical(pk = 30)

# Edge-wise comparison of the two graphs
mygraph <- GraphComparison(
  graph1 = simul1,
  graph2 = simul2
)
plot(mygraph, layout = igraph::layout_with_kk(mygraph))
Runs the algorithm specified in the argument implementation and returns the estimated adjacency matrix. This function does not use stability.
GraphicalAlgo(
  xdata, pk = NULL, Lambda,
  Sequential_template = NULL, scale = TRUE,
  implementation = PenalisedGraphical, start = "cold", ...
)
xdata | matrix with observations as rows and variables as columns. |
pk | optional vector encoding the grouping structure. Only used for multi-block stability selection where |
Lambda | matrix of parameters controlling the level of sparsity in the underlying feature selection algorithm specified in |
Sequential_template | logical matrix encoding the type of procedure to use for data with multiple blocks in stability selection graphical modelling. For multi-block estimation, the stability selection model is constructed as the union of block-specific stable edges estimated while the others are weakly penalised ( |
scale | logical indicating if the correlation ( |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should be initialised at the estimated (inverse) covariance with previous penalty parameters ( |
... | additional parameters passed to the function provided in |
The use of the procedure from Equation (4) or (5) is controlled by the argument "Sequential_template".
An array with binary and symmetric adjacency matrices along the third dimension.
GraphicalModel, PenalisedGraphical
Other wrapping functions:
SelectionAlgo()
# Data simulation
set.seed(1)
simul <- SimulateGraphical()

# Running graphical LASSO
myglasso <- GraphicalAlgo(
  xdata = simul$data,
  Lambda = cbind(c(0.1, 0.2))
)
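As stated in the Value section above, the third dimension of the returned array indexes the rows of Lambda, which can be checked on the example (a quick sketch; the exact dimensions assume the default simulation settings):

# One adjacency matrix per row of Lambda
dim(myglasso) # expected: p x p x nrow(Lambda)
table(myglasso[, , 1]) # binary entries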
Performs stability selection for graphical models. The underlying graphical model (e.g. graphical LASSO) is run with different combinations of parameters controlling the sparsity (e.g. penalty parameter) and thresholds in selection proportions. These two hyper-parameters are jointly calibrated by maximisation of the stability score.
GraphicalModel(
  xdata, pk = NULL, Lambda = NULL, lambda_other_blocks = 0.1,
  pi_list = seq(0.01, 0.99, by = 0.01),
  K = 100, tau = 0.5, seed = 1, n_cat = NULL,
  implementation = PenalisedGraphical, start = "warm", scale = TRUE,
  resampling = "subsampling", cpss = FALSE,
  PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf,
  Lambda_cardinal = 50, lambda_max = NULL,
  lambda_path_factor = 0.001, max_density = 0.5,
  optimisation = c("grid_search", "nloptr"),
  n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ...
)
xdata | data matrix with observations as rows and variables as columns. For multi-block stability selection, the variables in data have to be ordered by group. |
pk | optional vector encoding the grouping structure. Only used for multi-block stability selection where |
Lambda | matrix of parameters controlling the level of sparsity in the underlying feature selection algorithm specified in |
lambda_other_blocks | optional vector of parameters controlling the level of sparsity in neighbour blocks for the multi-block procedure. To use jointly a specific set of parameters for each block, |
pi_list | vector of thresholds in selection proportions. If |
K | number of resampling iterations. |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator and ensure reproducibility of the results (see |
n_cat | computation options for the stability score. Default is |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should be initialised at the estimated (inverse) covariance with previous penalty parameters ( |
scale | logical indicating if the correlation ( |
resampling | resampling approach. Possible values are: |
cpss | logical indicating if complementary pair stability selection should be done. For this, the algorithm is applied on two non-overlapping subsets of half of the observations. A feature is considered as selected if it is selected for both subsamples. With this method, the data is split |
PFER_method | method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by error control. If |
FDP_thr | threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If |
Lambda_cardinal | number of values in the grid of parameters controlling the level of sparsity in the underlying algorithm. Only used if |
lambda_max | optional maximum value for the grid in penalty parameters. If |
lambda_path_factor | multiplicative factor used to define the minimum value in the grid. |
max_density | threshold on the density. The grid is defined such that the density of the estimated graph does not exceed max_density. |
optimisation | character string indicating the type of optimisation method. With |
n_cores | number of cores to use for parallel computing (see argument |
output_data | logical indicating if the input datasets |
verbose | logical indicating if a loading bar and messages should be printed. |
beep | sound indicating the end of the run. Possible values are: |
... | additional parameters passed to the functions provided in |
In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (Lambda). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions above a threshold pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) for the underlying algorithm, and by the threshold in selection proportion.
These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully to check that the grids of parameters Lambda and pi_list do not restrict the calibration to a region that would not include the global maximum (see CalibrationPlot). In particular, the grid Lambda may need to be extended when the maximum stability is observed on the left or right edges of the calibration heatmap. In some instances, multiple peaks of stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tends to give better selection performance. This is not necessarily the highest peak (which is automatically retained by the functions in this package). The user can decide to manually choose another peak.
To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
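As an illustration, calibration can be restricted to models with at most 10 expected false positives. Below is a minimal sketch on simulated data, using only the PFER_thr argument documented above:

# Constrained calibration with an upper-bound in PFER below 10
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20)
stab <- GraphicalModel(xdata = simul$data, PFER_thr = 10)
summary(stab)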
Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss). In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.
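These three options correspond to different combinations of the resampling and cpss arguments. A minimal sketch, reusing the simulated data from the sketch above (the value "bootstrap" is assumed to be the accepted name for option (ii); see the description of the resampling argument):

# (i) Subsampling of a proportion tau of the observations (default)
stab <- GraphicalModel(xdata = simul$data, resampling = "subsampling", tau = 0.5)

# (ii) Bootstrapping with the full sample size
stab <- GraphicalModel(xdata = simul$data, resampling = "bootstrap")

# (iii) Complementary pair stability selection
stab <- GraphicalModel(xdata = simul$data, cpss = TRUE)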
To ensure reproducibility of the results, the starting number of the random number generator is set to seed.
For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession. Alternatively, the function can be run manually with different seeds and all other parameters equal. The results can then be combined using Combine.
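A minimal sketch of this manual approach, assuming here that Combine() takes the two partial runs as its first two arguments (check the Combine documentation for its exact interface):

# Two runs differing only by their seeds
stab1 <- GraphicalModel(xdata = simul$data, seed = 1, K = 50)
stab2 <- GraphicalModel(xdata = simul$data, seed = 2, K = 50)

# Combining the two runs into a single output (assumed interface, see ?Combine)
stab <- Combine(stab1, stab2)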
The generated network can be converted into an igraph object using Graph. The R package visNetwork can be used for interactive network visualisation (see examples in Graph).
An object of class graphical_model. A list with:
S | a matrix of the best stability scores for different (sets of) parameters controlling the level of sparsity in the underlying algorithm. |
Lambda | a matrix of parameters controlling the level of sparsity in the underlying algorithm. |
Q | a matrix of the average number of features selected by the underlying algorithm with different parameters controlling the level of sparsity. |
Q_s | a matrix of the calibrated number of stably selected features with different parameters controlling the level of sparsity. |
P | a matrix of calibrated thresholds in selection proportions for different parameters controlling the level of sparsity in the underlying algorithm. |
PFER | a matrix of upper-bounds in PFER of calibrated stability selection models with different parameters controlling the level of sparsity. |
FDP | a matrix of upper-bounds in FDP of calibrated stability selection models with different parameters controlling the level of sparsity. |
S_2d | a matrix of stability scores obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. |
PFER_2d | a matrix of upper-bounds in PFER obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. Only returned if |
FDP_2d | a matrix of upper-bounds in FDP obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. Only returned if |
selprop | an array of selection proportions. Rows and columns correspond to nodes in the graph. Indices along the third dimension correspond to different parameters controlling the level of sparsity in the underlying algorithm. |
sign | a matrix of signs of Pearson's correlations estimated from |
method | a list with |
params | a list with values used for arguments |
The rows of S, Lambda, Q, Q_s, P, PFER, FDP, S_2d, PFER_2d and FDP_2d, and the indices along the third dimension of selprop, are ordered in the same way and correspond to parameter values stored in Lambda. For multi-block inference, the columns of S, Lambda, Q, Q_s, P, PFER and FDP, and the indices along the third dimension of S_2d, correspond to the different blocks.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Friedman J, Hastie T, Tibshirani R (2008). “Sparse inverse covariance estimation with the graphical lasso.” Biostatistics, 9(3), 432–441.
PenalisedGraphical, GraphicalAlgo, LambdaGridGraphical, Resample, StabilityScore
Graph, Adjacency
Other stability functions: BiSelection(), Clustering(), StructuralModel(), VariableSelection()
oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))

## Single-block stability selection
# Data simulation
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)

# Stability selection
stab <- GraphicalModel(xdata = simul$data)
print(stab)

# Calibration heatmap
CalibrationPlot(stab)

# Visualisation of the results
summary(stab)
plot(stab)

# Extraction of adjacency matrix or igraph object
Adjacency(stab)
Graph(stab)

## Multi-block stability selection
# Data simulation
set.seed(1)
simul <- SimulateGraphical(pk = c(10, 10))

# Stability selection
stab <- GraphicalModel(xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10)
print(stab)

# Calibration heatmap
# par(mfrow = c(1, 3))
CalibrationPlot(stab) # Producing three plots

# Visualisation of the results
summary(stab)
plot(stab)

# Multi-parameter stability selection (not recommended)
Lambda <- matrix(c(0.8, 0.6, 0.3, 0.5, 0.4, 0.3, 0.7, 0.5, 0.1), ncol = 3)
stab <- GraphicalModel(
  xdata = simul$data, pk = c(10, 10),
  Lambda = Lambda, lambda_other_blocks = NULL
)
stab$Lambda

## Example with user-defined function: shrinkage estimation and selection
# Data simulation
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)

if (requireNamespace("corpcor", quietly = TRUE)) {
  # Writing user-defined algorithm in a portable function
  ShrinkageSelection <- function(xdata, Lambda, ...) {
    mypcor <- corpcor::pcor.shrink(xdata, verbose = FALSE)
    adjacency <- array(NA, dim = c(nrow(mypcor), ncol(mypcor), nrow(Lambda)))
    for (k in seq_len(nrow(Lambda))) {
      A <- ifelse(abs(mypcor) >= Lambda[k, 1], yes = 1, no = 0)
      diag(A) <- 0
      adjacency[, , k] <- A
    }
    return(list(adjacency = adjacency))
  }

  # Running the algorithm without stability
  myglasso <- GraphicalAlgo(
    xdata = simul$data,
    Lambda = matrix(c(0.05, 0.1), ncol = 1),
    implementation = ShrinkageSelection
  )

  # Stability selection using shrinkage estimation and selection
  stab <- GraphicalModel(
    xdata = simul$data,
    Lambda = matrix(c(0.01, 0.05, 0.1), ncol = 1),
    implementation = ShrinkageSelection
  )
  CalibrationPlot(stab)
  stable_adjacency <- Adjacency(stab)
}

par(oldpar)
Runs a group Partial Least Squares model using the implementation from sgPLS-package. This function does not use stability.
GroupPLS(
  xdata, ydata, family = "gaussian",
  group_x, group_y = NULL,
  Lambda, keepX_previous = NULL, keepY = NULL,
  ncomp = 1, scale = TRUE, ...
)
xdata | matrix of predictors with observations as rows and variables as columns. |
ydata | optional vector or matrix of outcome(s). If |
family | type of PLS model. If |
group_x | vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group. |
group_y | optional vector encoding the grouping structure among outcomes. This argument indicates the number of variables in each group. |
Lambda | matrix of parameters controlling the number of selected groups at current component, as defined by |
keepX_previous | number of selected groups in previous components. Only used if |
keepY | number of selected groups of outcome variables. This argument is defined as in |
ncomp | number of components. |
scale | logical indicating if the data should be scaled (i.e. transformed so that all variables have a standard deviation of one). Only used if |
... |
A list with:
selected | matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
beta_full | array of model coefficients. Rows correspond to different model parameters. Columns correspond to predictors (starting with "X") or outcomes (starting with "Y") variables for different components (denoted by "PC"). |
Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016). “Group and sparse group partial least square approaches applied in genomics context.” Bioinformatics, 32(1), 35-42. ISSN 1367-4803, doi:10.1093/bioinformatics/btv535.
VariableSelection, BiSelection
Other penalised dimensionality reduction functions: SparseGroupPLS(), SparsePCA(), SparsePLS()
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Group PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running gPLS with 1 group and sparsity of 0.5 mypls <- GroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 25), ) # Running gPLS with groups on outcomes mypls <- GroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 25), group_y = c(2, 1), keepY = 1 ) }
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Group PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running gPLS with 1 group and sparsity of 0.5 mypls <- GroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 25), ) # Running gPLS with groups on outcomes mypls <- GroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 25), group_y = c(2, 1), keepY = 1 ) }
Runs hierarchical clustering using the implementation from hclust. If Lambda is provided, clustering is applied on the weighted distance matrix calculated using the cosa2 algorithm. Otherwise, distances are calculated using dist. This function does not use stability.
HierarchicalClustering(
  xdata, nc = NULL, Lambda = NULL,
  distance = "euclidean", linkage = "complete", ...
)
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in the underlying algorithm specified in |
Lambda | vector of penalty parameters (see argument |
distance | character string indicating the type of distance to use. If |
linkage | character string indicating the type of linkage used in hierarchical clustering to define the stable clusters. Possible values include |
... | additional parameters passed to |
A list with:
comembership | an array of binary and symmetric co-membership matrices. |
weights | a matrix of median weights by feature. |
Kampert MM, Meulman JJ, Friedman JH (2017). “rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.” Journal of Classification, 34(3), 514–547. doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004). “Clustering objects on subsets of attributes (with discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4), 815-849. doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
Other clustering algorithms: DBSCANClustering(), GMMClustering(), KMeansClustering(), PAMClustering()
# Data simulation
set.seed(1)
simul <- SimulateClustering(n = c(10, 10), pk = 50)

# Hierarchical clustering
myhclust <- HierarchicalClustering(
  xdata = simul$data,
  nc = seq_len(20)
)

# Weighted Hierarchical clustering (using COSA)
if (requireNamespace("rCOSA", quietly = TRUE)) {
  myhclust <- HierarchicalClustering(
    xdata = simul$data,
    weighted = TRUE,
    nc = seq_len(20),
    Lambda = c(0.2, 0.5)
  )
}
Computes the prediction performance of regression models where predictors are sequentially added by order of decreasing selection proportion. This function can be used to evaluate the marginal contribution of each of the selected predictors over and above more stable predictors. Performances are evaluated as in ExplanatoryPerformance.
Incremental(
  xdata, ydata,
  new_xdata = NULL, new_ydata = NULL,
  stability = NULL, family = NULL,
  implementation = NULL, prediction = NULL,
  resampling = "subsampling", n_predictors = NULL,
  K = 100, tau = 0.8, seed = 1,
  n_thr = NULL, time = 1000, verbose = TRUE, ...
)
xdata | matrix of predictors with observations as rows and variables as columns. |
ydata | optional vector or matrix of outcome(s). If |
new_xdata | optional test set (predictor data). |
new_ydata | optional test set (outcome data). |
stability | output of VariableSelection. |
family | type of regression model. Possible values include "gaussian" (linear regression), "binomial" (logistic regression) and "cox" (Cox regression). |
implementation | optional function to refit the model. If |
prediction | optional function to compute predicted values from the model refitted with implementation. |
resampling | resampling approach to create the training set. The default is "subsampling". |
n_predictors | number of predictors to consider. |
K | number of training-test splits. Only used if |
tau | proportion of observations used in the training set. Only used if |
seed | value of the seed to ensure reproducibility of the results. Only used if |
n_thr | number of thresholds to use to construct the ROC curve. If |
time | numeric indicating the time for which the survival probabilities are computed. Only applicable to Cox regression. |
verbose | logical indicating if a loading bar and messages should be printed. |
... | additional parameters passed to the function provided in implementation. |
An object of class incremental.
For logistic regression, a list with:
FPR | A list with, for each of the models (sequentially added predictors), the False Positive Rates for different thresholds (columns) and different data splits (rows). |
TPR | A list with, for each of the models (sequentially added predictors), the True Positive Rates for different thresholds (columns) and different data splits (rows). |
AUC | A list with, for each of the models (sequentially added predictors), a vector of Area Under the Curve (AUC) values obtained with different data splits. |
Beta | Estimated regression coefficients from visited models. |
names | Names of the predictors by order of inclusion. |
stable | Binary vector indicating which predictors are stably selected. Only returned if |
For Cox regression, a list with:
concordance | A list with, for each of the models (sequentially added predictors), a vector of concordance indices obtained with different data splits. |
Beta | Estimated regression coefficients from visited models. |
names | Names of the predictors by order of inclusion. |
stable | Binary vector indicating which predictors are stably selected. Only returned if |
For linear regression, a list with:
Q_squared | A list with, for each of the models (sequentially added predictors), a vector of Q-squared obtained with different data splits. |
Beta | Estimated regression coefficients from visited models. |
names | Names of the predictors by order of inclusion. |
stable | Binary vector indicating which predictors are stably selected. Only returned if |
Other prediction performance functions:
ExplanatoryPerformance()
# Data simulation
set.seed(1)
simul <- SimulateRegression(
  n = 1000, pk = 20,
  family = "binomial", ev_xy = 0.8
)

# Data split: selection, training and test set
ids <- Split(
  data = simul$ydata,
  family = "binomial",
  tau = c(0.4, 0.3, 0.3)
)
xselect <- simul$xdata[ids[[1]], ]
yselect <- simul$ydata[ids[[1]], ]
xtrain <- simul$xdata[ids[[2]], ]
ytrain <- simul$ydata[ids[[2]], ]
xtest <- simul$xdata[ids[[3]], ]
ytest <- simul$ydata[ids[[3]], ]

# Stability selection
stab <- VariableSelection(
  xdata = xselect,
  ydata = yselect,
  family = "binomial"
)

# Performances in test set of model refitted in training set
incr <- Incremental(
  xdata = xtrain, ydata = ytrain,
  new_xdata = xtest, new_ydata = ytest,
  stability = stab, n_predictors = 10
)
plot(incr)

# Alternative with multiple training/test splits
incr <- Incremental(
  xdata = rbind(xtrain, xtest),
  ydata = c(ytrain, ytest),
  stability = stab, K = 10,
  n_predictors = 10
)
plot(incr)
Runs k-means clustering using the implementation from kmeans. This function does not use stability.
KMeansClustering(xdata, nc = NULL, Lambda = NULL, ...)
xdata | data matrix with observations as rows and variables as columns. |
nc | matrix of parameters controlling the number of clusters in the underlying algorithm specified in |
Lambda | vector of penalty parameters (see argument |
... | additional parameters passed to |
A list with:
comembership | an array of binary and symmetric co-membership matrices. |
weights | a matrix of median weights by feature. |
Witten DM, Tibshirani R (2010). “A Framework for Feature Selection in Clustering.” Journal of the American Statistical Association, 105(490), 713-726. doi:10.1198/jasa.2010.tm09415, PMID: 20811510.
Other clustering algorithms: DBSCANClustering(), GMMClustering(), HierarchicalClustering(), PAMClustering()
# Data simulation
set.seed(1)
simul <- SimulateClustering(n = c(10, 10), pk = 50)

# K means clustering
mykmeans <- KMeansClustering(xdata = simul$data, nc = seq_len(20))

# Sparse K means clustering
if (requireNamespace("sparcl", quietly = TRUE)) {
  mykmeans <- KMeansClustering(
    xdata = simul$data, nc = seq_len(20),
    Lambda = c(2, 5)
  )
}
Generates a relevant grid of penalty parameter values for penalised graphical models.
LambdaGridGraphical(
  xdata, pk = NULL, lambda_other_blocks = 0.1,
  K = 100, tau = 0.5, n_cat = 3,
  implementation = PenalisedGraphical, start = "cold", scale = TRUE,
  resampling = "subsampling",
  PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf,
  Lambda_cardinal = 50, lambda_max = NULL,
  lambda_path_factor = 0.001, max_density = 0.5, ...
)
xdata | data matrix with observations as rows and variables as columns. For multi-block stability selection, the variables in data have to be ordered by group. |
pk | optional vector encoding the grouping structure. Only used for multi-block stability selection where |
lambda_other_blocks | optional vector of parameters controlling the level of sparsity in neighbour blocks for the multi-block procedure. To use jointly a specific set of parameters for each block, |
K | number of resampling iterations. |
tau | subsample size. Only used if |
n_cat | computation options for the stability score. Default is |
implementation | function to use for graphical modelling. If |
start | character string indicating if the algorithm should be initialised at the estimated (inverse) covariance with previous penalty parameters ( |
scale | logical indicating if the correlation ( |
resampling | resampling approach. Possible values are: |
PFER_method | method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If |
PFER_thr | threshold in PFER for constrained calibration by error control. If |
FDP_thr | threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If |
Lambda_cardinal | number of values in the grid of parameters controlling the level of sparsity in the underlying algorithm. |
lambda_max | optional maximum value for the grid in penalty parameters. If |
lambda_path_factor | multiplicative factor used to define the minimum value in the grid. |
max_density | threshold on the density. The grid is defined such that the density of the estimated graph does not exceed max_density. |
... | additional parameters passed to the functions provided in |
A matrix of lambda values with length(pk) columns and Lambda_cardinal rows.
Other lambda grid functions: LambdaGridRegression(), LambdaSequence()
# Single-block simulation
set.seed(1)
simul <- SimulateGraphical()

# Generating grid of 10 values
Lambda <- LambdaGridGraphical(xdata = simul$data, Lambda_cardinal = 10)

# Ensuring PFER < 5
Lambda <- LambdaGridGraphical(xdata = simul$data, Lambda_cardinal = 10, PFER_thr = 5)

# Multi-block simulation
set.seed(1)
simul <- SimulateGraphical(pk = c(10, 10))

# Multi-block grid
Lambda <- LambdaGridGraphical(xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10)

# Denser neighbouring blocks
Lambda <- LambdaGridGraphical(
  xdata = simul$data, pk = c(10, 10),
  Lambda_cardinal = 10, lambda_other_blocks = 0
)

# Using different neighbour penalties
Lambda <- LambdaGridGraphical(
  xdata = simul$data, pk = c(10, 10),
  Lambda_cardinal = 10,
  lambda_other_blocks = c(0.1, 0, 0.1)
)
stab <- GraphicalModel(
  xdata = simul$data, pk = c(10, 10),
  Lambda = Lambda,
  lambda_other_blocks = c(0.1, 0, 0.1)
)
stab$Lambda

# Visiting from empty to full graphs with max_density=1
Lambda <- LambdaGridGraphical(
  xdata = simul$data, pk = c(10, 10),
  Lambda_cardinal = 10, max_density = 1
)
bigblocks <- BlockMatrix(pk = c(10, 10))
bigblocks_vect <- bigblocks[upper.tri(bigblocks)]
N_blocks <- unname(table(bigblocks_vect))
N_blocks # max number of edges per block
stab <- GraphicalModel(xdata = simul$data, pk = c(10, 10), Lambda = Lambda)
apply(stab$Q, 2, max, na.rm = TRUE) # max average number of edges from underlying algo
Generates a relevant grid of penalty parameter values for penalised regression using the implementation in glmnet.
LambdaGridRegression(
  xdata, ydata, tau = 0.5, seed = 1,
  family = "gaussian", resampling = "subsampling",
  Lambda_cardinal = 100, check_input = TRUE, ...
)
xdata | matrix of predictors with observations as rows and variables as columns. |
ydata | optional vector or matrix of outcome(s). If |
tau | subsample size. Only used if |
seed | value of the seed to initialise the random number generator and ensure reproducibility of the results (see |
family | type of regression model. This argument is defined as in |
resampling | resampling approach. Possible values are: |
Lambda_cardinal | number of values in the grid of parameters controlling the level of sparsity in the underlying algorithm. |
check_input | logical indicating if input values should be checked (recommended). |
... | additional parameters passed to the functions provided in |
A matrix of lambda values with one column and as many rows as indicated in Lambda_cardinal.
Other lambda grid functions: LambdaGridGraphical(), LambdaSequence()
# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") # simulated data

# Lambda grid for linear regression
Lambda <- LambdaGridRegression(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian", Lambda_cardinal = 20
)

# Grid can be used in VariableSelection()
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian", Lambda = Lambda
)
print(SelectedVariables(stab))
Generates a sequence of penalty parameters from extreme values and the required number of elements. The sequence is defined on the log-scale.
LambdaSequence(lmax, lmin, cardinal = 100)
lmax | maximum value in the grid. |
lmin | minimum value in the grid. |
cardinal | number of values in the grid. |
A vector with values between "lmin" and "lmax" and as many values as indicated by "cardinal".
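As the sequence is defined on the log-scale, the output should correspond to equally spaced values between log(lmax) and log(lmin) that are then exponentiated. A minimal sketch of this equivalence in base R (assuming this is the exact spacing used internally):

# Equally spaced values on the log-scale
exp(seq(log(0.7), log(0.001), length.out = 10))

# Expected to match (up to ordering and numerical precision)
LambdaSequence(lmax = 0.7, lmin = 0.001, cardinal = 10)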
Other lambda grid functions: LambdaGridGraphical(), LambdaGridRegression()
# Grid from extreme values
mygrid <- LambdaSequence(lmax = 0.7, lmin = 0.001, cardinal = 10)
Returns a matrix from the output of PenalisedLinearSystem.
LinearSystemMatrix(vect, adjacency)
vect | vector of coefficients to assign to entries of the matrix. |
adjacency | binary adjacency matrix of the Directed Acyclic Graph (transpose of the asymmetric matrix A in Reticular Action Model notation). The row and column names of this matrix must be defined. |
An asymmetric matrix.
Returns a matrix from the output of mxPenaltySearch.
OpenMxMatrix(vect, adjacency, residual_covariance = NULL)
vect | vector of coefficients to assign to entries of the matrix. |
adjacency | binary adjacency matrix of the Directed Acyclic Graph (transpose of the asymmetric matrix A in Reticular Action Model notation). The row and column names of this matrix must be defined. |
residual_covariance | binary and symmetric matrix encoding the nonzero entries in the residual covariance matrix (symmetric matrix S in Reticular Action Model notation). By default, this is the identity matrix (no residual covariance). |
An asymmetric matrix.
Returns the matrix specification for use in mxModel from (i) the adjacency matrix of a Directed Acyclic Graph (asymmetric matrix A in Reticular Action Model notation), and (ii) a binary matrix encoding nonzero entries in the residual covariance matrix (symmetric matrix S in Reticular Action Model notation).
OpenMxModel(adjacency, residual_covariance = NULL, manifest = NULL)
adjacency | binary adjacency matrix of the Directed Acyclic Graph (transpose of the asymmetric matrix A in Reticular Action Model notation). The row and column names of this matrix must be defined. |
residual_covariance | binary and symmetric matrix encoding the nonzero entries in the residual covariance matrix (symmetric matrix S in Reticular Action Model notation). By default, this is the identity matrix (no residual covariance). |
manifest | optional vector of manifest variable names. |
A list of RAM matrices that can be used in mxRun.
if (requireNamespace("OpenMx", quietly = TRUE)) { # Definition of simulated effects pk <- c(3, 2, 3) dag <- LayeredDAG(layers = pk) theta <- dag theta[2, 4] <- 0 theta[3, 7] <- 0 theta[4, 7] <- 0 # Data simulation set.seed(1) simul <- SimulateStructural(n = 500, v_between = 1, theta = theta, pk = pk) # Writing RAM matrices for mxModel ram_matrices <- OpenMxModel(adjacency = dag) # Running unpenalised model unpenalised <- OpenMx::mxRun(OpenMx::mxModel( "Model", OpenMx::mxData(simul$data, type = "raw"), ram_matrices$Amat, ram_matrices$Smat, ram_matrices$Fmat, ram_matrices$Mmat, OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)), OpenMx::mxFitFunctionML() ), silent = TRUE, suppressWarnings = TRUE) unpenalised$A$values # Incorporating latent variables ram_matrices <- OpenMxModel( adjacency = dag, manifest = paste0("x", seq_len(7)) ) ram_matrices$Fmat$values # Running unpenalised model unpenalised <- OpenMx::mxRun(OpenMx::mxModel( "Model", OpenMx::mxData(simul$data[, seq_len(7)], type = "raw"), ram_matrices$Amat, ram_matrices$Smat, ram_matrices$Fmat, ram_matrices$Mmat, OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)), OpenMx::mxFitFunctionML() ), silent = TRUE, suppressWarnings = TRUE) unpenalised$A$values }
if (requireNamespace("OpenMx", quietly = TRUE)) { # Definition of simulated effects pk <- c(3, 2, 3) dag <- LayeredDAG(layers = pk) theta <- dag theta[2, 4] <- 0 theta[3, 7] <- 0 theta[4, 7] <- 0 # Data simulation set.seed(1) simul <- SimulateStructural(n = 500, v_between = 1, theta = theta, pk = pk) # Writing RAM matrices for mxModel ram_matrices <- OpenMxModel(adjacency = dag) # Running unpenalised model unpenalised <- OpenMx::mxRun(OpenMx::mxModel( "Model", OpenMx::mxData(simul$data, type = "raw"), ram_matrices$Amat, ram_matrices$Smat, ram_matrices$Fmat, ram_matrices$Mmat, OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)), OpenMx::mxFitFunctionML() ), silent = TRUE, suppressWarnings = TRUE) unpenalised$A$values # Incorporating latent variables ram_matrices <- OpenMxModel( adjacency = dag, manifest = paste0("x", seq_len(7)) ) ram_matrices$Fmat$values # Running unpenalised model unpenalised <- OpenMx::mxRun(OpenMx::mxModel( "Model", OpenMx::mxData(simul$data[, seq_len(7)], type = "raw"), ram_matrices$Amat, ram_matrices$Smat, ram_matrices$Fmat, ram_matrices$Mmat, OpenMx::mxExpectationRAM("A", "S", "F", "M", dimnames = colnames(dag)), OpenMx::mxFitFunctionML() ), silent = TRUE, suppressWarnings = TRUE) unpenalised$A$values }
Runs Partitioning Around Medoids (PAM) clustering using the implementation from pam. This is also known as the k-medoids algorithm. If Lambda is provided, clustering is applied on the weighted distance matrix calculated using the COSA algorithm as implemented in cosa2. Otherwise, distances are calculated using dist. This function does not use stability.
PAMClustering(xdata, nc = NULL, Lambda = NULL, distance = "euclidean", ...)
xdata |
data matrix with observations as rows and variables as columns. |
nc |
matrix of parameters controlling the number of clusters in the
underlying algorithm specified in |
Lambda |
vector of penalty parameters (see argument |
distance |
character string indicating the type of distance to use. If
|
... |
additional parameters passed to |
A list with:
comembership |
an array of binary and symmetric co-membership matrices. |
weights |
a matrix of median weights by feature. |
Kampert MM, Meulman JJ, Friedman JH (2017). “rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.” Journal of Classification, 34(3), 514–547. doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004). “Clustering objects on subsets of attributes (with discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4), 815-849. doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
Other clustering algorithms: DBSCANClustering(), GMMClustering(), HierarchicalClustering(), KMeansClustering()
if (requireNamespace("cluster", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateClustering(n = c(10, 10), pk = 50) # PAM clustering myclust <- PAMClustering( xdata = simul$data, nc = seq_len(20) ) # Weighted PAM clustering (using COSA) if (requireNamespace("rCOSA", quietly = TRUE)) { myclust <- PAMClustering( xdata = simul$data, nc = seq_len(20), Lambda = c(0.2, 0.5) ) } }
if (requireNamespace("cluster", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateClustering(n = c(10, 10), pk = 50) # PAM clustering myclust <- PAMClustering( xdata = simul$data, nc = seq_len(20) ) # Weighted PAM clustering (using COSA) if (requireNamespace("rCOSA", quietly = TRUE)) { myclust <- PAMClustering( xdata = simul$data, nc = seq_len(20), Lambda = c(0.2, 0.5) ) } }
Runs the graphical LASSO algorithm for estimation of a Gaussian Graphical Model (GGM). This function does not use stability.
PenalisedGraphical( xdata, pk = NULL, Lambda, Sequential_template = NULL, scale = TRUE, start = "cold", output_omega = FALSE, ... )
xdata |
matrix with observations as rows and variables as columns. |
pk |
optional vector encoding the grouping structure. Only used for
multi-block stability selection where |
Lambda |
matrix of parameters controlling the level of sparsity. |
Sequential_template |
logical matrix encoding the type of procedure to
use for data with multiple blocks in stability selection graphical
modelling. For multi-block estimation, the stability selection model is
constructed as the union of block-specific stable edges estimated while the
others are weakly penalised ( |
scale |
logical indicating if the correlation ( |
start |
character string indicating if the algorithm should be
initialised at the estimated (inverse) covariance with previous penalty
parameters ( |
output_omega |
logical indicating if the estimated precision matrices should be stored and returned. |
... |
additional parameters passed to the function provided in
|
The use of the procedure from Equation (4) or (5) is controlled by the argument "Sequential_template".
An array with binary and symmetric adjacency matrices along the third dimension.
Friedman J, Hastie T, Tibshirani R (2008). “Sparse inverse covariance estimation with the graphical lasso.” Biostatistics, 9(3), 432–441.
Other underlying algorithm functions: CART(), ClusteringAlgo(), PenalisedOpenMx(), PenalisedRegression()
# Data simulation
set.seed(1)
simul <- SimulateGraphical()

# Running graphical LASSO
myglasso <- PenalisedGraphical(
  xdata = simul$data,
  Lambda = matrix(c(0.1, 0.2), ncol = 1)
)

# Returning estimated precision matrix
myglasso <- PenalisedGraphical(
  xdata = simul$data,
  Lambda = matrix(c(0.1, 0.2), ncol = 1),
  output_omega = TRUE
)
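As a point of comparison, a single graphical LASSO fit of this kind can be obtained from the glassoFast package, reading the adjacency matrix off the nonzero off-diagonal entries of the estimated precision matrix. A minimal sketch, assuming glassoFast is installed; this is an illustration of the algorithm, not the exact internals of PenalisedGraphical.

if (requireNamespace("glassoFast", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(100 * 10), nrow = 100)

  # Graphical LASSO on the correlation matrix with penalty rho = 0.1
  fit <- glassoFast::glassoFast(S = cor(x), rho = 0.1)

  # Adjacency matrix from nonzero off-diagonal entries of the precision matrix
  omega <- fit$wi
  adjacency <- ifelse(omega != 0, yes = 1, no = 0)
  diag(adjacency) <- 0
}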
Runs penalised Structural Equation Modelling using implementations from OpenMx functions (for PenalisedOpenMx), or using series of penalised regressions with glmnet (for PenalisedLinearSystem). The function PenalisedLinearSystem does not accommodate latent variables. These functions do not use stability.
PenalisedOpenMx( xdata, adjacency, penalised = NULL, residual_covariance = NULL, Lambda, ... ) PenalisedLinearSystem(xdata, adjacency, penalised = NULL, Lambda = NULL, ...)
xdata |
matrix with observations as rows and variables as columns.
Column names must be defined and in line with the row and column names of
|
adjacency |
binary adjacency matrix of the Directed Acyclic Graph (transpose of the asymmetric matrix A in Reticular Action Model notation). The row and column names of this matrix must be defined. |
penalised |
optional binary matrix indicating which coefficients are regularised. |
residual_covariance |
binary and symmetric matrix encoding the nonzero entries in the residual covariance matrix (symmetric matrix S in Reticular Action Model notation). By default, this is the identity matrix (no residual covariance). |
Lambda |
matrix of parameters controlling the level of sparsity. Only
the minimum, maximum and length are used in |
... |
additional parameters passed to |
A list with:
selected |
matrix of binary selection status. Rows correspond to different regularisation parameters. Columns correspond to different parameters to be estimated. |
beta_full |
matrix of model coefficients. Rows correspond to different regularisation parameters. Columns correspond to different parameters to be estimated. |
Jacobucci R, Grimm KJ, McArdle JJ (2016). “Regularized structural equation modeling.” Structural equation modeling: a multidisciplinary journal, 23(4), 555–566. doi:10.1080/10705511.2016.1154793.
SelectionAlgo, VariableSelection, OpenMxMatrix, LinearSystemMatrix
Other underlying algorithm functions: CART(), ClusteringAlgo(), PenalisedGraphical(), PenalisedRegression()
# Data simulation
pk <- c(3, 2, 3)
dag <- LayeredDAG(layers = pk)
theta <- dag
theta[2, 4] <- 0
set.seed(1)
simul <- SimulateStructural(theta = theta, pk = pk, output_matrices = TRUE)

# Running regularised SEM (OpenMx)
if (requireNamespace("OpenMx", quietly = TRUE)) {
  mysem <- PenalisedOpenMx(
    xdata = simul$data, adjacency = dag,
    Lambda = seq(1, 10, 1)
  )
  OpenMxMatrix(vect = mysem$selected[3, ], adjacency = dag)
}

# Running regularised SEM (glmnet)
mysem <- PenalisedLinearSystem(
  xdata = simul$data, adjacency = dag
)
LinearSystemMatrix(vect = mysem$selected[20, ], adjacency = dag)
Runs penalised regression using the implementation from glmnet. This function does not use stability.
PenalisedRegression( xdata, ydata, Lambda = NULL, family, penalisation = c("classic", "randomised", "adaptive"), gamma = NULL, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
Lambda |
matrix of parameters controlling the level of sparsity. |
family |
type of regression model. This argument is defined as in
|
penalisation |
type of penalisation to use. If
|
gamma |
parameter for randomised or adaptive regularisation. Default is
|
... |
additional parameters passed to |
A list with:
selected |
matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
beta_full |
array of model coefficients. Rows correspond to different model parameters. Columns correspond to predictors. Indices along the third dimension correspond to outcome variable(s). |
Zou H (2006). “The adaptive lasso and its oracle properties.” Journal of the American statistical association, 101(476), 1418–1429.
Tibshirani R (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. ISSN 00359246, http://www.jstor.org/stable/2346178.
SelectionAlgo, VariableSelection
Other underlying algorithm functions: CART(), ClusteringAlgo(), PenalisedGraphical(), PenalisedOpenMx()
# Data simulation
set.seed(1)
simul <- SimulateRegression(pk = 50)

# Running the LASSO
mylasso <- PenalisedRegression(
  xdata = simul$xdata, ydata = simul$ydata,
  Lambda = c(0.1, 0.2), family = "gaussian"
)

# Using glmnet arguments
mylasso <- PenalisedRegression(
  xdata = simul$xdata, ydata = simul$ydata,
  Lambda = c(0.1), family = "gaussian",
  penalty.factor = c(rep(0, 10), rep(1, 40))
)
mylasso$beta_full
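The adaptive penalisation option is related to the classical two-step recipe of Zou (2006), which can be mimicked in glmnet by passing coefficient-based weights through penalty.factor. The sketch below shows this generic construction (ridge first, then a weighted LASSO); it is one common variant of adaptive weighting, not necessarily the exact scheme applied when penalisation = "adaptive".

if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(100 * 20), nrow = 100)
  y <- x[, 1] - x[, 2] + rnorm(100)

  # Step 1: ridge regression to obtain preliminary coefficients
  ridge <- glmnet::cv.glmnet(x, y, alpha = 0)
  beta0 <- as.numeric(coef(ridge, s = "lambda.min"))[-1]

  # Step 2: LASSO with weights inversely proportional to |beta0|^gamma
  gamma <- 1
  weights <- 1 / (abs(beta0)^gamma + 1e-6)
  adaptive <- glmnet::glmnet(x, y, alpha = 1, penalty.factor = weights)
}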
Computes the Per Family Error Rate upper-bound of a stability selection model using the methods proposed by Meinshausen and Bühlmann (2010) or Shah and Samworth (2013). In stability selection, the PFER corresponds to the expected number of stably selected features that are not relevant to the outcome (i.e. False Positives).
PFER(q, pi, N, K, PFER_method = "MB")
q |
average number of features selected by the underlying algorithm. |
pi |
threshold in selection proportions. |
N |
total number of features. |
K |
number of resampling iterations. |
PFER_method |
method used to compute the upper-bound of the expected
number of False Positives (or Per Family Error Rate, PFER). If
|
The estimated upper-bound of the PFER.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Other stability metric functions: ConsensusScore(), FDP(), StabilityMetrics(), StabilityScore()
# Computing PFER for 10/50 selected features and threshold of 0.8
pfer_mb <- PFER(q = 10, pi = 0.8, N = 50, K = 100, PFER_method = "MB")
pfer_ss <- PFER(q = 10, pi = 0.8, N = 50, K = 100, PFER_method = "SS")
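For PFER_method = "MB", the bound of Meinshausen and Bühlmann (2010) has a closed form that does not involve K, namely q^2 / ((2 * pi - 1) * N). A minimal sketch recomputing it by hand; assuming this is the formula applied here, the value is expected to match pfer_mb above.

# Meinshausen and Bühlmann (2010) bound: q^2 / ((2 * pi - 1) * N)
q <- 10
pi_thr <- 0.8
N <- 50
pfer_manual <- q^2 / ((2 * pi_thr - 1) * N)
pfer_manual # expected to match pfer_mb computed above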
Creates a heatmap of the (calibrated) consensus matrix. See examples in Clustering.
## S3 method for class 'clustering' plot( x, linkage = "complete", argmax_id = NULL, theta = NULL, theta_star = NULL, col = c("ivory", "navajowhite", "tomato", "darkred"), lines = TRUE, col.lines = c("blue"), lwd.lines = 2, tick = TRUE, axes = TRUE, col.axis = NULL, cex.axis = 1, xlas = 2, ylas = 2, bty = "n", ... )
x |
output of |
linkage |
character string indicating the type of linkage used in
hierarchical clustering to define the stable clusters. Possible values
include |
argmax_id |
optional indices of hyper-parameters. If
|
theta |
optional vector of cluster membership. If provided, the ordering
of the items should be the same as in |
theta_star |
optional vector of true cluster membership. If provided,
the ordering of the items should be the same as in |
col |
vector of colours. |
lines |
logical indicating if lines separating the clusters provided in
|
col.lines |
colour of the lines separating the clusters. |
lwd.lines |
width of the lines separating the clusters. |
tick |
logical indicating if axis tickmarks should be displayed. |
axes |
logical indicating if item labels should be displayed. |
col.axis |
optional vector of cluster colours. |
cex.axis |
font size for axes. |
xlas |
orientation of labels on the x-axis, as |
ylas |
orientation of labels on the y-axis, as |
bty |
character string indicating if the box around the plot should be
drawn. Possible values include: |
... |
additional arguments passed to |
A heatmap.
Represents prediction performances upon sequential inclusion of the predictors in a logistic or Cox regression model as produced by Incremental. The median and quantiles of the performance metric are reported. See examples in Incremental.
## S3 method for class 'incremental' plot( x, quantiles = c(0.05, 0.95), col = c("red", "grey"), col.axis = NULL, xgrid = FALSE, ygrid = FALSE, output_data = FALSE, ... ) IncrementalPlot( x, quantiles = c(0.05, 0.95), col = c("red", "grey"), col.axis = NULL, xgrid = FALSE, ygrid = FALSE, output_data = FALSE, ... ) PlotIncremental( x, quantiles = c(0.05, 0.95), col = c("red", "grey"), col.axis = NULL, xgrid = FALSE, ygrid = FALSE, output_data = FALSE, ... )
x |
output of |
quantiles |
quantiles defining the lower and upper bounds. |
col |
vector of colours by stable selection status. |
col.axis |
optional vector of label colours by stable selection status. |
xgrid |
logical indicating if a vertical grid should be drawn. |
ygrid |
logical indicating if a horizontal grid should be drawn. |
output_data |
logical indicating if the median and quantiles should be returned in a matrix. |
... |
additional plotting arguments (see |
A plot.
Plots the True Positive Rate (TPR) as a function of the False Positive Rate (FPR) for different thresholds in predicted probabilities. If the results from multiple ROC analyses are provided (e.g. output of ExplanatoryPerformance with large K), the point-wise median is represented and flanked by a transparent band defined by point-wise quantiles. See examples in ROC and ExplanatoryPerformance.
## S3 method for class 'roc_band' plot( x, col_band = NULL, alpha = 0.5, quantiles = c(0.05, 0.95), add = FALSE, ... )
x |
output of |
col_band |
colour of the band defined by point-wise |
alpha |
level of opacity for the band. |
quantiles |
point-wise quantiles of the performances defining the band. |
add |
logical indicating if the curve should be added to the current plot. |
... |
additional plotting arguments (see |
A base plot.
Makes a barplot of selection proportions in decreasing order. See examples in VariableSelection.
## S3 method for class 'variable_selection' plot( x, col = c("red", "grey"), col.axis = NULL, col.thr = "darkred", lty.thr = 2, n_predictors = NULL, ... )
x |
output of |
col |
vector of colours by stable selection status. |
col.axis |
optional vector of label colours by stable selection status. |
col.thr |
threshold colour. |
lty.thr |
threshold line type as |
n_predictors |
number of predictors to display. |
... |
additional plotting arguments (see |
A plot.
Runs a Partial Least Squares (PLS) model in regression mode using the algorithm implemented in pls. This function allows for the construction of components based on different sets of predictor and/or outcome variables. This function does not use stability.
PLS( xdata, ydata, selectedX = NULL, selectedY = NULL, family = "gaussian", ncomp = NULL, scale = TRUE )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
selectedX |
binary matrix of size |
selectedY |
binary matrix of size |
family |
type of PLS model. Only |
ncomp |
number of components. |
scale |
logical indicating if the data should be scaled (i.e. transformed so that all variables have a standard deviation of one). |
All matrices are defined as in (Wold et al. 2001). The weight matrix Wmat is the equivalent of loadings$X in pls. The loadings matrix Pmat is the equivalent of mat.c in pls. The score matrices Tmat and Qmat are the equivalent of variates$X and variates$Y in pls.
A list with:
Wmat |
matrix of X-weights. |
Wstar |
matrix of transformed X-weights. |
Pmat |
matrix of X-loadings. |
Cmat |
matrix of Y-weights. |
Tmat |
matrix of X-scores. |
Umat |
matrix of Y-scores. |
Qmat |
matrix needed for predictions. |
Rmat |
matrix needed for predictions. |
meansX |
vector used for centering of predictors, needed for predictions. |
sigmaX |
vector used for scaling of predictors, needed for predictions. |
meansY |
vector used for centering of outcomes, needed for predictions. |
sigmaY |
vector used for scaling of outcomes, needed for predictions. |
methods |
a list with |
params |
a list with
|
Wold S, Sjöström M, Eriksson L (2001). “PLS-regression: a basic tool of chemometrics.” Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130. ISSN 0169-7439, doi:10.1016/S0169-7439(01)00155-1, PLS Methods.
VariableSelection, BiSelection
if (requireNamespace("mixOmics", quietly = TRUE)) { oldpar <- par(no.readonly = TRUE) # Data simulation set.seed(1) simul <- SimulateRegression(n = 200, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # PLS mypls <- PLS(xdata = x, ydata = y, ncomp = 3) if (requireNamespace("sgPLS", quietly = TRUE)) { # Sparse PLS to identify relevant variables stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), LambdaY = seq_len(ncol(y) - 1), implementation = SparsePLS, n_cat = 2 ) plot(stab) # Refitting of PLS model mypls <- PLS( xdata = x, ydata = y, selectedX = stab$selectedX, selectedY = stab$selectedY ) # Nonzero entries in weights are the same as in selectedX par(mfrow = c(2, 2)) Heatmap(stab$selectedX, legend = FALSE ) title("Selected in X") Heatmap(ifelse(mypls$Wmat != 0, yes = 1, no = 0), legend = FALSE ) title("Nonzero entries in Wmat") Heatmap(stab$selectedY, legend = FALSE ) title("Selected in Y") Heatmap(ifelse(mypls$Cmat != 0, yes = 1, no = 0), legend = FALSE ) title("Nonzero entries in Cmat") } # Multilevel PLS # Generating random design z <- rep(seq_len(50), each = 4) # Extracting the within-variability x_within <- mixOmics::withinVariation(X = x, design = cbind(z)) # Running PLS on within-variability mypls <- PLS(xdata = x_within, ydata = y, ncomp = 3) par(oldpar) }
if (requireNamespace("mixOmics", quietly = TRUE)) { oldpar <- par(no.readonly = TRUE) # Data simulation set.seed(1) simul <- SimulateRegression(n = 200, pk = 15, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # PLS mypls <- PLS(xdata = x, ydata = y, ncomp = 3) if (requireNamespace("sgPLS", quietly = TRUE)) { # Sparse PLS to identify relevant variables stab <- BiSelection( xdata = x, ydata = y, family = "gaussian", ncomp = 3, LambdaX = seq_len(ncol(x) - 1), LambdaY = seq_len(ncol(y) - 1), implementation = SparsePLS, n_cat = 2 ) plot(stab) # Refitting of PLS model mypls <- PLS( xdata = x, ydata = y, selectedX = stab$selectedX, selectedY = stab$selectedY ) # Nonzero entries in weights are the same as in selectedX par(mfrow = c(2, 2)) Heatmap(stab$selectedX, legend = FALSE ) title("Selected in X") Heatmap(ifelse(mypls$Wmat != 0, yes = 1, no = 0), legend = FALSE ) title("Nonzero entries in Wmat") Heatmap(stab$selectedY, legend = FALSE ) title("Selected in Y") Heatmap(ifelse(mypls$Cmat != 0, yes = 1, no = 0), legend = FALSE ) title("Nonzero entries in Cmat") } # Multilevel PLS # Generating random design z <- rep(seq_len(50), each = 4) # Extracting the within-variability x_within <- mixOmics::withinVariation(X = x, design = cbind(z)) # Running PLS on within-variability mypls <- PLS(xdata = x_within, ydata = y, ncomp = 3) par(oldpar) }
Computes predicted values from the output of VariableSelection.
## S3 method for class 'variable_selection' predict( object, xdata, ydata, newdata = NULL, method = c("ensemble", "refit"), ... )
object |
output of |
xdata |
predictor data (training set). |
ydata |
outcome data (training set). |
newdata |
optional predictor data (test set). |
method |
character string indicating if predictions should be obtained
from an |
... |
additional arguments passed to |
Predicted values.
Refit, Ensemble, EnsemblePredictions
## Linear regression

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 500, pk = 50, family = "gaussian")

# Training/test split
ids <- Split(data = simul$ydata, tau = c(0.8, 0.2))

# Stability selection
stab <- VariableSelection(
  xdata = simul$xdata[ids[[1]], ],
  ydata = simul$ydata[ids[[1]], ]
)

# Predictions from post stability selection estimation
yhat <- predict(stab,
  xdata = simul$xdata[ids[[1]], ],
  ydata = simul$ydata[ids[[1]], ],
  newdata = simul$xdata[ids[[2]], ],
  method = "refit"
)
cor(simul$ydata[ids[[2]], ], yhat)^2 # Q-squared

# Predictions from ensemble model
yhat <- predict(stab,
  xdata = simul$xdata[ids[[1]], ],
  ydata = simul$ydata[ids[[1]], ],
  newdata = simul$xdata[ids[[2]], ],
  method = "ensemble"
)
cor(simul$ydata[ids[[2]], ], yhat)^2 # Q-squared

## Logistic regression

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 500, pk = 20, family = "binomial", ev_xy = 0.9)

# Training/test split
ids <- Split(data = simul$ydata, family = "binomial", tau = c(0.8, 0.2))

# Stability selection
stab <- VariableSelection(
  xdata = simul$xdata[ids[[1]], ],
  ydata = simul$ydata[ids[[1]], ],
  family = "binomial"
)

# Predictions from post stability selection estimation
yhat <- predict(stab,
  xdata = simul$xdata[ids[[1]], ],
  ydata = simul$ydata[ids[[1]], ],
  newdata = simul$xdata[ids[[2]], ],
  method = "refit", type = "response"
)
plot(ROC(predicted = yhat, observed = simul$ydata[ids[[2]], ]))

# Predictions from ensemble model
yhat <- predict(stab,
  xdata = simul$xdata[ids[[1]], ],
  ydata = simul$ydata[ids[[1]], ],
  newdata = simul$xdata[ids[[2]], ],
  method = "ensemble", type = "response"
)
plot(ROC(predicted = yhat, observed = simul$ydata[ids[[2]], ]),
  add = TRUE, col = "blue"
)
Computes predicted values from a Partial Least Squares (PLS) model in regression mode applied on xdata. This function uses the algorithm implemented in predict.pls.
PredictPLS(xdata, model)
xdata |
matrix of predictors with observations as rows and variables as columns. |
model |
output of |
An array of predicted values.
if (requireNamespace("mixOmics", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = c(5, 5, 5), family = "gaussian") x <- simul$xdata y <- simul$ydata # PLS mypls <- PLS(xdata = x, ydata = y, ncomp = 3) # Predicted values predicted <- PredictPLS(xdata = x, model = mypls) }
if (requireNamespace("mixOmics", quietly = TRUE)) { # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = c(5, 5, 5), family = "gaussian") x <- simul$xdata y <- simul$ydata # PLS mypls <- PLS(xdata = x, ydata = y, ncomp = 3) # Predicted values predicted <- PredictPLS(xdata = x, model = mypls) }
Refits the regression model with stably selected variables as predictors (without penalisation). Variables in xdata not evaluated in the stability selection model will automatically be included as predictors.
Refit( xdata, ydata, stability = NULL, family = NULL, implementation = NULL, Lambda = NULL, seed = 1, verbose = TRUE, ... ) Recalibrate( xdata, ydata, stability = NULL, family = NULL, implementation = NULL, Lambda = NULL, seed = 1, verbose = TRUE, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
stability |
output of |
family |
type of regression model. Possible values include
|
implementation |
optional function to refit the model. If |
Lambda |
optional vector of penalty parameters. |
seed |
value of the seed to initialise the random number generator and
ensure reproducibility of the results (see |
verbose |
logical indicating if a loading bar and messages should be printed. |
... |
additional arguments to be passed to the function provided in
|
The output as obtained from:
lm | for linear regression ("gaussian" family). |
coxph | for Cox regression ("cox" family). |
glm | for logistic regression ("binomial" family). |
multinom | for multinomial regression ("multinomial" family). |
## Linear regression

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")

# Data split
ids_train <- Resample(
  data = simul$ydata,
  tau = 0.5, family = "gaussian"
)
xtrain <- simul$xdata[ids_train, , drop = FALSE]
ytrain <- simul$ydata[ids_train, , drop = FALSE]
xrefit <- simul$xdata[-ids_train, , drop = FALSE]
yrefit <- simul$ydata[-ids_train, , drop = FALSE]

# Stability selection
stab <- VariableSelection(xdata = xtrain, ydata = ytrain, family = "gaussian")
print(SelectedVariables(stab))

# Refitting the model
refitted <- Refit(
  xdata = xrefit, ydata = yrefit,
  stability = stab
)
refitted$coefficients # refitted coefficients
head(refitted$fitted.values) # refitted predicted values

# Fitting the full model (including all possible predictors)
refitted <- Refit(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian"
)
refitted$coefficients # refitted coefficients

## Logistic regression

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")

# Data split
ids_train <- Resample(
  data = simul$ydata,
  tau = 0.5, family = "binomial"
)
xtrain <- simul$xdata[ids_train, , drop = FALSE]
ytrain <- simul$ydata[ids_train, , drop = FALSE]
xrefit <- simul$xdata[-ids_train, , drop = FALSE]
yrefit <- simul$ydata[-ids_train, , drop = FALSE]

# Stability selection
stab <- VariableSelection(xdata = xtrain, ydata = ytrain, family = "binomial")

# Refitting the model
refitted <- Refit(
  xdata = xrefit, ydata = yrefit,
  stability = stab
)
refitted$coefficients # refitted coefficients
head(refitted$fitted.values) # refitted predicted probabilities

## Partial Least Squares (multiple components)
if (requireNamespace("sgPLS", quietly = TRUE)) {
  # Data simulation
  set.seed(1)
  simul <- SimulateRegression(n = 500, pk = 15, q = 3, family = "gaussian")

  # Data split
  ids_train <- Resample(
    data = simul$ydata,
    tau = 0.5, family = "gaussian"
  )
  xtrain <- simul$xdata[ids_train, , drop = FALSE]
  ytrain <- simul$ydata[ids_train, , drop = FALSE]
  xrefit <- simul$xdata[-ids_train, , drop = FALSE]
  yrefit <- simul$ydata[-ids_train, , drop = FALSE]

  # Stability selection
  stab <- BiSelection(
    xdata = xtrain, ydata = ytrain,
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(ncol(xtrain) - 1),
    LambdaY = seq_len(ncol(ytrain) - 1),
    implementation = SparsePLS
  )
  plot(stab)

  # Refitting the model
  refitted <- Refit(
    xdata = xrefit, ydata = yrefit,
    stability = stab
  )
  refitted$Wmat # refitted X-weights
  refitted$Cmat # refitted Y-weights
}
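For the linear case, refitting reduces to an unpenalised lm() restricted to the stably selected predictors. A minimal sketch of this equivalence; an illustration of the idea, not the exact internals of Refit.

# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")

# Stability selection
stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)

# Manual refit restricted to stably selected predictors
selected <- names(which(SelectedVariables(stab) == 1))
y <- as.vector(simul$ydata)
refit_manual <- lm(y ~ ., data = as.data.frame(simul$xdata[, selected, drop = FALSE]))
coef(refit_manual)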
Generates a vector of resampled observation IDs.
Resample(data, family = NULL, tau = 0.5, resampling = "subsampling", ...)
data |
vector or matrix of data. In regression, this should be the outcome data. |
family |
type of regression model. This argument is defined as in
|
tau |
subsample size. Only used if |
resampling |
resampling approach. Possible values are:
|
... |
additional parameters passed to the function provided in
|
With categorical outcomes (i.e. "family" argument is set to "binomial", "multinomial" or "cox"), the resampling is done such that the proportion of observations from each of the categories is representative of that of the full sample.
A vector of resampled IDs.
## Linear regression framework

# Data simulation
simul <- SimulateRegression()

# Subsampling
ids <- Resample(data = simul$ydata, family = "gaussian")
sum(duplicated(ids))

# Bootstrapping
ids <- Resample(data = simul$ydata, family = "gaussian", resampling = "bootstrap")
sum(duplicated(ids))

## Logistic regression framework

# Data simulation
simul <- SimulateRegression(family = "binomial")

# Subsampling
ids <- Resample(data = simul$ydata, family = "binomial")
sum(duplicated(ids))
prop.table(table(simul$ydata))
prop.table(table(simul$ydata[ids]))

# Data simulation for a binary confounder
conf <- ifelse(runif(n = 100) > 0.5, yes = 1, no = 0)

# User-defined resampling function
BalancedResampling <- function(data, tau, Z, ...) {
  s <- NULL
  for (z in unique(Z)) {
    s <- c(s, sample(which((data == "0") & (Z == z)),
      size = tau * sum((data == "0") & (Z == z))
    ))
    s <- c(s, sample(which((data == "1") & (Z == z)),
      size = tau * sum((data == "1") & (Z == z))
    ))
  }
  return(s)
}

# Resampling keeping proportions by Y and Z
ids <- Resample(
  data = simul$ydata, family = "binomial",
  resampling = BalancedResampling, Z = conf
)
prop.table(table(simul$ydata, conf))
prop.table(table(simul$ydata[ids], conf[ids]))

# User-defined resampling for stability selection
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "binomial",
  resampling = BalancedResampling, Z = conf
)
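The proportional behaviour described above corresponds to stratified subsampling within outcome categories; a minimal sketch of the idea on a standalone binary vector, illustrative rather than the exact internals of Resample.

# Stratified subsampling keeping class proportions, as done for "binomial"
set.seed(1)
y <- rep(c(0, 1), times = c(70, 30))
tau <- 0.5
ids <- unlist(lapply(unique(y), function(class) {
  sample(which(y == class), size = tau * sum(y == class))
}))
prop.table(table(y[ids])) # matches prop.table(table(y))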
Runs the variable selection algorithm specified in the argument implementation. This function does not use stability.
SelectionAlgo( xdata, ydata = NULL, Lambda, group_x = NULL, scale = TRUE, family = NULL, implementation = PenalisedRegression, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
Lambda |
matrix of parameters controlling the level of sparsity in the
underlying feature selection algorithm specified in |
group_x |
vector encoding the grouping structure among predictors. This
argument indicates the number of variables in each group. Only used for
models with group penalisation (e.g. |
scale |
logical indicating if the predictor data should be scaled. |
family |
type of regression model. This argument is defined as in
|
implementation |
function to use for variable selection. Possible
functions are: |
... |
additional parameters passed to the function provided in
|
A list with:
selected |
matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
beta_full |
array of model coefficients. Rows correspond to different model parameters. Columns correspond to predictors. Indices along the third dimension correspond to outcome variable(s). |
VariableSelection, PenalisedRegression, SparsePCA, SparsePLS, GroupPLS, SparseGroupPLS
Other wrapping functions: GraphicalAlgo()
# Data simulation (univariate outcome)
set.seed(1)
simul <- SimulateRegression(pk = 50)

# Running the LASSO
mylasso <- SelectionAlgo(
  xdata = simul$xdata, ydata = simul$ydata,
  Lambda = c(0.1, 0.2), family = "gaussian"
)

# Data simulation (multivariate outcome)
set.seed(1)
simul <- SimulateRegression(pk = 50, q = 3)

# Running multivariate Gaussian LASSO
mylasso <- SelectionAlgo(
  xdata = simul$xdata, ydata = simul$ydata,
  Lambda = c(0.1, 0.2), family = "mgaussian"
)
str(mylasso)
Computes different metrics of selection performance by comparing the set of selected features to the set of true predictors/edges. This function can only be used in simulation studies (i.e. when the true model is known).
SelectionPerformance(theta, theta_star, pk = NULL, cor = NULL, thr = 0.5)
theta |
output from |
theta_star |
output from |
pk |
optional vector encoding the grouping structure. Only used for
multi-block stability selection where |
cor |
optional correlation matrix. Only used in graphical modelling. |
thr |
optional threshold in correlation. Only used in graphical modelling and when argument "cor" is not NULL. |
A matrix of selection metrics including:
TP |
number of True Positives (TP) |
FN |
number of False Negatives (FN) |
FP |
number of False Positives (FP) |
TN |
number of True Negatives (TN) |
sensitivity |
sensitivity, i.e. TP/(TP+FN) |
specificity |
specificity, i.e. TN/(TN+FP) |
accuracy |
accuracy, i.e. (TP+TN)/(TP+TN+FP+FN) |
precision |
precision (p), i.e. TP/(TP+FP) |
recall |
recall (r), i.e. TP/(TP+FN) |
F1_score |
F1-score, i.e. 2*p*r/(p+r) |
If argument "cor" is provided, the number of False Positives among correlated (FP_c) and uncorrelated (FP_i) pairs, defined as having correlations (provided in "cor") above or below the threshold "thr", are also reported.
Block-specific performances are reported if "pk" is not NULL. In this case, the first row of the matrix corresponds to the overall performances, and subsequent rows correspond to each of the blocks. The order of the blocks is defined as in BlockStructure.
Other functions for model performance: ClusteringPerformance(), SelectionPerformanceGraph()
# Variable selection model
set.seed(1)
simul <- SimulateRegression(pk = 30, nu_xy = 0.5)
stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)

# Selection performance
SelectionPerformance(theta = stab, theta_star = simul)

# Alternative formulation
SelectionPerformance(
  theta = SelectedVariables(stab),
  theta_star = simul$theta
)
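The reported metrics all derive from the four confusion counts; the sketch below recomputes precision, recall and the F1-score by hand on hypothetical selection and truth vectors.

# Hypothetical binary selection status and ground truth
theta <- c(1, 1, 0, 0, 1, 0)
theta_star <- c(1, 0, 0, 0, 1, 1)

TP <- sum(theta == 1 & theta_star == 1)
FP <- sum(theta == 1 & theta_star == 0)
FN <- sum(theta == 0 & theta_star == 1)

precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1_score = f1_score)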
Generates an igraph object representing the True Positive, False Positive and False Negative edges by comparing the set of selected edges to the set of true edges. This function can only be used in simulation studies (i.e. when the true model is known).
SelectionPerformanceGraph( theta, theta_star, col = c("tomato", "forestgreen", "navy"), lty = c(2, 3, 1), node_colour = NULL, show_labels = TRUE, ... )
theta |
binary adjacency matrix or output of
|
theta_star |
true binary adjacency matrix or output of
|
col |
vector of edge colours. The first entry of the vector defines the colour of False Positive edges, second entry is for True Negatives and third entry is for True Positives. |
lty |
vector of line types for edges. The order is defined as for
argument |
node_colour |
optional vector of node colours. This vector must contain as many entries as there are rows/columns in the adjacency matrix and must be in the same order (the order is used to assign colours to nodes). Integers, named colours or RGB values can be used. |
show_labels |
logical indicating if the node labels should be displayed. |
... |
additional arguments to be passed to |
An igraph object.
SimulateGraphical, SimulateRegression, GraphicalModel, VariableSelection, BiSelection
Other functions for model performance: ClusteringPerformance(), SelectionPerformance()
# Data simulation
set.seed(1)
simul <- SimulateGraphical(pk = 30)

# Stability selection
stab <- GraphicalModel(xdata = simul$data, K = 10)

# Performance graph
perfgraph <- SelectionPerformanceGraph(
  theta = stab,
  theta_star = simul
)
plot(perfgraph)
Extracts selection proportions (for stability selection) or co-membership proportions (for consensus clustering).
SelectionProportions(stability, argmax_id = NULL) ConsensusMatrix(stability, argmax_id = NULL)
stability |
output of |
argmax_id |
optional indices of hyper-parameters. If
|
A vector or matrix of proportions.
VariableSelection, GraphicalModel, BiSelection, Clustering
# Stability selection
set.seed(1)
simul <- SimulateRegression(pk = 50)
stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)
SelectionProportions(stab)

# Consensus clustering
set.seed(1)
simul <- SimulateClustering(
  n = c(30, 30, 30),
  nu_xc = 1, ev_xc = 0.5
)
stab <- Clustering(xdata = simul$data)
ConsensusMatrix(stab)
Runs a sparse group Partial Least Squares model using the implementation from sgPLS-package. This function does not use stability.
SparseGroupPLS( xdata, ydata, family = "gaussian", group_x, group_y = NULL, Lambda, alpha.x, alpha.y = NULL, keepX_previous = NULL, keepY = NULL, ncomp = 1, scale = TRUE, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
family |
type of PLS model. If |
group_x |
vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group. |
group_y |
optional vector encoding the grouping structure among outcomes. This argument indicates the number of variables in each group. |
Lambda |
matrix of parameters controlling the number of selected groups
at current component, as defined by |
alpha.x |
vector of parameters controlling the level of sparsity within groups of predictors. |
alpha.y |
optional vector of parameters controlling the level of
sparsity within groups of outcomes. Only used if |
keepX_previous |
number of selected groups in previous components. Only
used if |
keepY |
number of selected groups of outcome variables. This argument is
defined as in |
ncomp |
number of components. |
scale |
logical indicating if the data should be scaled (i.e.
transformed so that all variables have a standard deviation of one). Only
used if |
... |
A list with:
selected |
matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
beta_full |
array of model coefficients. Rows correspond to different model parameters. Columns correspond to predictors (starting with "X") or outcomes (starting with "Y") variables for different components (denoted by "PC"). |
Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016). “Group and sparse group partial least square approaches applied in genomics context.” Bioinformatics, 32(1), 35-42. ISSN 1367-4803, doi:10.1093/bioinformatics/btv535.
VariableSelection, BiSelection
Other penalised dimensionality reduction functions: GroupPLS(), SparsePCA(), SparsePLS()
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Sparse group PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 30, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running sgPLS with 1 group and sparsity of 0.5 mypls <- SparseGroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 5), alpha.x = 0.5 ) # Running sgPLS with groups on outcomes mypls <- SparseGroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 5), alpha.x = 0.5, group_y = c(2, 1), keepY = 1, alpha.y = 0.9 ) ## Sparse group PLS-DA # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "binomial") # Running sgPLS-DA with 1 group and sparsity of 0.9 mypls <- SparseGroupPLS( xdata = simul$xdata, ydata = simul$ydata, Lambda = 1, family = "binomial", group_x = c(10, 15, 25), alpha.x = 0.9 ) }
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Sparse group PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 30, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running sgPLS with 1 group and sparsity of 0.5 mypls <- SparseGroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 5), alpha.x = 0.5 ) # Running sgPLS with groups on outcomes mypls <- SparseGroupPLS( xdata = x, ydata = y, Lambda = 1, family = "gaussian", group_x = c(10, 15, 5), alpha.x = 0.5, group_y = c(2, 1), keepY = 1, alpha.y = 0.9 ) ## Sparse group PLS-DA # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "binomial") # Running sgPLS-DA with 1 group and sparsity of 0.9 mypls <- SparseGroupPLS( xdata = simul$xdata, ydata = simul$ydata, Lambda = 1, family = "binomial", group_x = c(10, 15, 25), alpha.x = 0.9 ) }
Runs a sparse Principal Component Analysis model using the implementation from spca in elasticnet (if algorithm="sPCA") or spca in mixOmics (if algorithm="rSVD"). This function does not use stability.
SparsePCA( xdata, Lambda, ncomp = 1, scale = TRUE, keepX_previous = NULL, algorithm = "sPCA", ... )
xdata |
data matrix with observations as rows and variables as columns. |
Lambda |
matrix of parameters controlling the number of selected
variables at current component, as defined by |
ncomp |
number of components. |
scale |
logical indicating if the data should be scaled (i.e. transformed so that all variables have a standard deviation of one). |
keepX_previous |
number of selected predictors in previous components.
Only used if |
algorithm |
character string indicating the name of the algorithm to use for
sparse PCA. Possible values are: "sPCA" (for the algorithm proposed by Zou,
Hastie and Tibshirani and implemented in |
... |
additional arguments to be passed to
|
A list with:
selected |
matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
beta_full |
array of model coefficients. Rows correspond to different model parameters. Columns correspond to predictors (starting with "X") or outcomes (starting with "Y") variables for different components (denoted by "PC"). |
Zou H, Hastie T, Tibshirani R (2006). “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics, 15(2), 265-286. doi:10.1198/106186006X113430.
Shen H, Huang JZ (2008). “Sparse principal component analysis via regularized low rank matrix approximation.” Journal of Multivariate Analysis, 99(6), 1015-1034. ISSN 0047-259X, doi:10.1016/j.jmva.2007.06.007.
VariableSelection, BiSelection
Other penalised dimensionality reduction functions: GroupPLS(), SparseGroupPLS(), SparsePLS()
# Data simulation
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
x <- simul$xdata

# Sparse PCA (by Zou, Hastie, Tibshirani)
if (requireNamespace("elasticnet", quietly = TRUE)) {
  mypca <- SparsePCA(
    xdata = x, ncomp = 2,
    Lambda = c(1, 2), keepX_previous = 10, algorithm = "sPCA"
  )
}

# Sparse PCA (by Shen and Huang)
if (requireNamespace("mixOmics", quietly = TRUE)) {
  mypca <- SparsePCA(
    xdata = x, ncomp = 2,
    Lambda = c(1, 2), keepX_previous = 10, algorithm = "rSVD"
  )
}
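For reference, the algorithm behind algorithm="sPCA" is available directly in elasticnet, where sparsity can be expressed as the number of variables kept per component. A minimal sketch of a direct call, with argument names as understood from elasticnet; shown as an illustration, not the exact internals of SparsePCA.

if (requireNamespace("elasticnet", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(100 * 20), nrow = 100)

  # Sparse PCA with 5 variables kept on each of 2 components
  myspca <- elasticnet::spca(x,
    K = 2, para = c(5, 5),
    type = "predictor", sparse = "varnum"
  )
  myspca$loadings
}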
Runs a sparse Partial Least Squares model using the implementation from sgPLS-package. This function does not use stability.
SparsePLS( xdata, ydata, Lambda, family = "gaussian", ncomp = 1, scale = TRUE, keepX_previous = NULL, keepY = NULL, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
Lambda |
matrix of parameters controlling the number of selected
predictors at current component, as defined by |
family |
type of PLS model. If |
ncomp |
number of components. |
scale |
logical indicating if the data should be scaled (i.e.
transformed so that all variables have a standard deviation of one). Only
used if |
keepX_previous |
number of selected predictors in previous components.
Only used if |
keepY |
number of selected outcome variables. This argument is defined
as in |
... |
A list with:
selected |
matrix of binary selection status. Rows correspond to different model parameters. Columns correspond to predictors. |
beta_full |
array of model coefficients. Rows correspond to different model parameters. Columns correspond to predictors (starting with "X") or outcomes (starting with "Y") variables for different components (denoted by "PC"). |
Lê Cao KA, Rossouw D, Robert-Granié C, Besse P (2008). “A sparse PLS for variable selection when integrating omics data.” Stat Appl Genet Mol Biol, 7(1), Article 35. ISSN 1544-6115, doi:10.2202/1544-6115.1390.
VariableSelection, BiSelection
Other penalised dimensionality reduction functions: GroupPLS(), SparseGroupPLS(), SparsePCA()
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Sparse PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running sPLS with 2 X-variables and 1 Y-variable mypls <- SparsePLS(xdata = x, ydata = y, Lambda = 2, family = "gaussian", keepY = 1) ## Sparse PLS-DA # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 20, family = "binomial") # Running sPLS-DA with 2 X-variables and 1 Y-variable mypls <- SparsePLS(xdata = simul$xdata, ydata = simul$ydata, Lambda = 2, family = "binomial") }
if (requireNamespace("sgPLS", quietly = TRUE)) { ## Sparse PLS # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian") x <- simul$xdata y <- simul$ydata # Running sPLS with 2 X-variables and 1 Y-variable mypls <- SparsePLS(xdata = x, ydata = y, Lambda = 2, family = "gaussian", keepY = 1) ## Sparse PLS-DA # Data simulation set.seed(1) simul <- SimulateRegression(n = 100, pk = 20, family = "binomial") # Running sPLS-DA with 2 X-variables and 1 Y-variable mypls <- SparsePLS(xdata = simul$xdata, ydata = simul$ydata, Lambda = 2, family = "binomial") }
Generates a list of length(tau) non-overlapping sets of observation IDs.
Split(data, family = NULL, tau = c(0.5, 0.25, 0.25))
data |
vector or matrix of data. In regression, this should be the outcome data. |
family |
type of regression model. This argument is defined as in
|
tau |
vector of the proportion of observations in each of the sets. |
With categorical outcomes (i.e. family argument is set to "binomial", "multinomial" or "cox"), the split is done such that the proportion of observations from each of the categories in each of the sets is representative of that of the full sample.
A list of length length(tau) with sets of non-overlapping observation IDs.
# Splitting into 3 sets
simul <- SimulateRegression()
ids <- Split(data = simul$ydata)
lapply(ids, length)

# Balanced splits with respect to a binary variable
simul <- SimulateRegression(family = "binomial")
ids <- Split(data = simul$ydata, family = "binomial")
lapply(ids, FUN = function(x) {
  table(simul$ydata[x, ])
})
Generates a symmetric adjacency matrix encoding a bipartite graph.
Square(x)
x |
matrix encoding the edges between two types of nodes (rows and columns). |
A symmetric adjacency matrix encoding a bipartite graph.
# Simulated links between two sets set.seed(1) mat <- matrix(sample(c(0, 1), size = 15, replace = TRUE), nrow = 5, ncol = 3 ) # Adjacency matrix of a bipartite graph Square(mat)
Computes the stability score (see StabilityScore) and upper-bounds of the PFER and FDP from selection proportions of models with a given parameter controlling the sparsity of the underlying algorithm and for different thresholds in selection proportions.
StabilityMetrics( selprop, pk = NULL, pi_list = seq(0.6, 0.9, by = 0.01), K = 100, n_cat = NULL, PFER_method = "MB", PFER_thr_blocks = Inf, FDP_thr_blocks = Inf, Sequential_template = NULL, graph = TRUE, group = NULL )
selprop |
array of selection proportions. |
pk |
optional vector encoding the grouping structure. Only used for
multi-block stability selection where |
pi_list |
vector of thresholds in selection proportions. If
|
K |
number of resampling iterations. |
n_cat |
computation options for the stability score. Default is
|
PFER_method |
method used to compute the upper-bound of the expected
number of False Positives (or Per Family Error Rate, PFER). If
|
PFER_thr_blocks |
vector of block-specific thresholds in PFER for
constrained calibration by error control. If |
FDP_thr_blocks |
vector of block-specific thresholds in the expected
proportion of falsely selected features (or False Discovery Proportion,
FDP) for constrained calibration by error control. If |
Sequential_template |
logical matrix encoding the type of procedure to
use for data with multiple blocks in stability selection graphical
modelling. For multi-block estimation, the stability selection model is
constructed as the union of block-specific stable edges estimated while the
others are weakly penalised ( |
graph |
logical indicating if stability selection is performed in a
regression (if |
group |
vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group and only needs to be provided for group (but not sparse group) penalisation. |
A list with:
S |
a matrix of the best (block-specific) stability scores for different (sets of) penalty parameters. In multi-block stability selection, rows correspond to different sets of penalty parameters (values are stored in the output "Lambda") and columns correspond to different blocks. |
Lambda |
a matrix of (block-specific) penalty parameters. In multi-block stability selection, rows correspond to sets of penalty parameters and columns correspond to different blocks. |
Q |
a matrix of average numbers of (block-specific) edges selected by the underlying algorithm for different (sets of) penalty parameters. In multi-block stability selection, rows correspond to different sets of penalty parameters (values are stored in the output "Lambda") and columns correspond to different blocks. |
Q_s |
a matrix of calibrated numbers of (block-specific) stable edges for different (sets of) penalty parameters. In multi-block stability selection, rows correspond to different sets of penalty parameters (values are stored in the output "Lambda") and columns correspond to different blocks. |
P |
a matrix of calibrated (block-specific) thresholds in selection proportions for different (sets of) penalty parameters. In multi-block stability selection, rows correspond to different sets of penalty parameters (values are stored in the output "Lambda") and columns correspond to different blocks. |
PFER |
a matrix of computed (block-specific) upper-bounds in PFER of calibrated graphs for different (sets of) penalty parameters. In multi-block stability selection, rows correspond to different sets of penalty parameters (values are stored in the output "Lambda") and columns correspond to different blocks. |
FDP |
a matrix of computed (block-specific) upper-bounds in FDP of calibrated stability selection models for different (sets of) penalty parameters. In multi-block stability selection, rows correspond to different sets of penalty parameters (values are stored in the output "Lambda") and columns correspond to different blocks. |
S_2d |
an array of (block-specific) stability scores obtained with different combinations of parameters. Rows correspond to different (sets of) penalty parameters and columns correspond to different thresholds in selection proportions. In multi-block stability selection, indices along the third dimension correspond to different blocks. |
PFER_2d |
an array of computed upper-bounds of PFER obtained with different combinations of parameters. Rows correspond to different penalty parameters and columns correspond to different thresholds in selection proportions. Not available in multi-block stability selection graphical modelling. |
FDP_2d |
an array of computed upper-bounds of FDP obtained with different combinations of parameters. Rows correspond to different penalty parameters and columns correspond to different thresholds in selection proportions. Not available in multi-block stability selection graphical modelling. |
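As a hedged illustration of how these outputs can be used, the sketch below locates the most stable combination of parameters on the S_2d surface (rows: penalty parameters, columns: thresholds in selection proportions); this mirrors what the calibration functions do internally but is not the package's own code.

# Hedged sketch: locating the most stable (lambda, pi) combination
set.seed(1)
selprop <- matrix(round(runif(n = 20), digits = 2), nrow = 2)
metrics <- StabilityMetrics(
  selprop = selprop, pi_list = c(0.6, 0.7, 0.8), K = 100, graph = FALSE
)
# Row and column indices of the maximum stability score
id <- which(metrics$S_2d == max(metrics$S_2d, na.rm = TRUE), arr.ind = TRUE)
id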
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Other stability metric functions: ConsensusScore(), FDP(), PFER(), StabilityScore()
## Sparse or sparse group penalisation # Simulating set of selection proportions set.seed(1) selprop <- matrix(round(runif(n = 20), digits = 2), nrow = 2) # Computing stability scores for different thresholds metrics <- StabilityMetrics( selprop = selprop, pi = c(0.6, 0.7, 0.8), K = 100, graph = FALSE ) ## Group penalisation # Simulating set of selection proportions set.seed(1) selprop <- matrix(round(runif(n = 6), digits = 2), nrow = 2) selprop <- cbind( selprop[, 1], selprop[, 1], selprop[, 2], selprop[, 2], matrix(rep(selprop[, 3], each = 6), nrow = 2, byrow = TRUE) ) # Computing stability scores for different thresholds metrics <- StabilityMetrics( selprop = selprop, pi = c(0.6, 0.7, 0.8), K = 100, graph = FALSE, group = c(2, 2, 6) )
Computes the stability score from selection proportions of models with a given parameter controlling the sparsity and for different thresholds in selection proportions. The score measures how unlikely it is that the selection procedure is uniform (i.e. uninformative) for a given combination of parameters.
StabilityScore( selprop, pi_list = seq(0.6, 0.9, by = 0.01), K, n_cat = 3, group = NULL )
selprop |
array of selection proportions. |
pi_list |
vector of thresholds in selection proportions. If
|
K |
number of resampling iterations. |
n_cat |
computation options for the stability score. Default is
|
group |
vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group and only needs to be provided for group (but not sparse group) penalisation. |
The stability score is derived from the likelihood under the assumption of uniform (uninformative) selection.
We classify the features into three categories: the stably selected ones (selection proportions \(\geq \pi\)), the stably excluded ones (selection proportions \(\leq 1 - \pi\)), and the unstable ones (selection proportions strictly between \(1 - \pi\) and \(\pi\)).

Under the hypothesis of equiprobability of selection (instability), the likelihood of observing stably selected, stably excluded and unstable features can be expressed as:

\[ L_{\lambda, \pi} = \prod_{j = 1}^{N} \left[ \left( 1 - F(K \pi) \right)^{\mathbb{1}\{ H_{\lambda, j} \geq K \pi \}} \left( F(K \pi) - F(K (1 - \pi)) \right)^{\mathbb{1}\{ K (1 - \pi) < H_{\lambda, j} < K \pi \}} F(K (1 - \pi))^{\mathbb{1}\{ H_{\lambda, j} \leq K (1 - \pi) \}} \right] \]

where \(H_{\lambda, j}\) is the selection count of feature \(j\) and \(F\) is the cumulative probability function of the binomial distribution with parameters \(K\) and the average proportion of selected features over resampling iterations.

The stability score is computed as the minus log-transformed likelihood under the assumption of equiprobability of selection:

\[ S_{\lambda, \pi} = - \log(L_{\lambda, \pi}) \]

The stability score increases with stability.

Alternatively, the stability score can be computed by considering only two sets of features: stably selected (selection proportions \(\geq \pi\)) or not (selection proportions \(< \pi\)). This can be done using n_cat=2.
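For concreteness, the sketch below reproduces this three-category computation in base R for a single threshold. The boundary conventions (strict versus non-strict inequalities at \(K \pi\) and \(K (1 - \pi)\)) are assumptions and may differ slightly from the package internals.

# Minimal sketch of the three-category score for one threshold pi
# (boundary conventions are assumptions, not the package internals)
K <- 100
set.seed(1)
selprop <- round(runif(n = 20), digits = 2)
H <- round(selprop * K) # selection counts H_j
p <- mean(selprop) # average selection probability under the null
pi <- 0.8
p_selected <- 1 - pbinom(K * pi, size = K, prob = p) # stably selected
p_excluded <- pbinom(K * (1 - pi), size = K, prob = p) # stably excluded
p_unstable <- 1 - p_selected - p_excluded # unstable
category_prob <- ifelse(H >= K * pi, p_selected,
  ifelse(H <= K * (1 - pi), p_excluded, p_unstable)
)
score <- -sum(log(category_prob)) # increases with stability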
A vector of stability scores obtained with the different thresholds in selection proportions.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Other stability metric functions: ConsensusScore(), FDP(), PFER(), StabilityMetrics()
# Simulating set of selection proportions set.seed(1) selprop <- round(runif(n = 20), digits = 2) # Computing stability scores for different thresholds score <- StabilityScore(selprop, pi_list = c(0.6, 0.7, 0.8), K = 100)
Extracts stable results for stability selection or consensus clustering.
Stable(stability, argmax_id = NULL, linkage = "complete") SelectedVariables(stability, argmax_id = NULL) Adjacency(stability, argmax_id = NULL) Clusters(stability, linkage = "complete", argmax_id = NULL)
stability |
output of |
argmax_id |
optional indices of hyper-parameters. If
|
linkage |
character string indicating the type of linkage used in
hierarchical clustering to define the stable clusters. Possible values
include |
A binary vector or matrix encoding the selection status (1 if selected, 0 otherwise) in stability selection, or the stable cluster membership in consensus clustering.
VariableSelection, BiSelection, GraphicalModel, Clustering
# Variable selection set.seed(1) simul <- SimulateRegression(pk = 20) stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata) SelectedVariables(stab) Stable(stab) # Graphical model set.seed(1) simul <- SimulateGraphical(pk = 10) stab <- GraphicalModel(xdata = simul$data) Adjacency(stab) Stable(stab) # Clustering set.seed(1) simul <- SimulateClustering( n = c(30, 30, 30), nu_xc = 1 ) stab <- Clustering(xdata = simul$data) Clusters(stab) Stable(stab)
Performs stability selection for Structural Equation Models. The underlying arrow selection algorithm (e.g. regularised Structural Equation Modelling) is run with different combinations of parameters controlling the sparsity (e.g. penalty parameter) and thresholds in selection proportions. These two hyper-parameters are jointly calibrated by maximisation of the stability score.
StructuralModel( xdata, adjacency, residual_covariance = NULL, Lambda = NULL, pi_list = seq(0.01, 0.99, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, implementation = PenalisedLinearSystem, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 100, optimisation = c("grid_search", "nloptr"), n_cores = 1, output_data = FALSE, verbose = TRUE, ... )
xdata |
matrix with observations as rows and variables as columns.
Column names must be defined and in line with the row and column names of
|
adjacency |
binary adjacency matrix of the Directed Acyclic Graph (transpose of the asymmetric matrix A in Reticular Action Model notation). The row and column names of this matrix must be defined. |
residual_covariance |
binary and symmetric matrix encoding the nonzero entries in the residual covariance matrix (symmetric matrix S in Reticular Action Model notation). By default, this is the identity matrix (no residual covariance). |
Lambda |
matrix of parameters controlling the level of sparsity in the
underlying feature selection algorithm specified in |
pi_list |
vector of thresholds in selection proportions. If
|
K |
number of resampling iterations. |
tau |
subsample size. Only used if |
seed |
value of the seed to initialise the random number generator and
ensure reproducibility of the results (see |
n_cat |
computation options for the stability score. Default is
|
implementation |
function to use for variable selection. |
resampling |
resampling approach. Possible values are:
|
cpss |
logical indicating if complementary pair stability selection
should be done. For this, the algorithm is applied on two non-overlapping
subsets of half of the observations. A feature is considered as selected if
it is selected for both subsamples. With this method, the data is split
|
PFER_method |
method used to compute the upper-bound of the expected
number of False Positives (or Per Family Error Rate, PFER). If
|
PFER_thr |
threshold in PFER for constrained calibration by error
control. If |
FDP_thr |
threshold in the expected proportion of falsely selected
features (or False Discovery Proportion) for constrained calibration by
error control. If |
Lambda_cardinal |
number of values in the grid of parameters controlling
the level of sparsity in the underlying algorithm. Only used if
|
optimisation |
character string indicating the type of optimisation
method. With |
n_cores |
number of cores to use for parallel computing (see argument
|
output_data |
logical indicating if the input datasets |
verbose |
logical indicating if a loading bar and messages should be printed. |
... |
additional parameters passed to the functions provided in
|
In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (Lambda). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions above a threshold pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) for the underlying algorithm, and the threshold in selection proportion:

\[ V_{\lambda, \pi} = \{ j : p_{\lambda}(j) \geq \pi \} \]

where \(p_{\lambda}(j)\) denotes the selection proportion of feature \(j\) obtained with penalty \(\lambda\).
In Structural Equation Modelling, "feature" refers to an arrow in the corresponding Directed Acyclic Graph.
These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL, or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully to check that the grids of parameters Lambda and pi_list do not restrict the calibration to a region that would not include the global maximum (see CalibrationPlot). In particular, the grid Lambda may need to be extended when the maximum stability is observed on the left or right edges of the calibration heatmap. In some instances, multiple peaks of stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tends to give better selection performance. This is not necessarily the highest peak (which is automatically retained by the functions in this package). The user can decide to manually choose another peak.
To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss). In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.
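As a hedged illustration (simulation arguments left at their defaults and K kept small for speed; see the canonical examples below for complete calls), complementary pair stability selection is requested via cpss:

# Hedged sketch: complementary pair stability selection for
# arrow selection (K reduced for illustration)
set.seed(1)
pk <- c(3, 2, 3)
simul <- SimulateStructural(n = 500, pk = pk)
dag <- LayeredDAG(layers = pk)
stab <- StructuralModel(
  xdata = simul$data, adjacency = dag,
  cpss = TRUE, K = 20
)
summary(stab)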
To ensure reproducibility of the results, the starting number of the random number generator is set to seed.
For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession. Alternatively, the function can be run manually with different seeds and all other parameters equal. The results can then be combined using Combine.
An object of class variable_selection. A list with:
S |
a matrix of the best stability scores for different parameters controlling the level of sparsity in the underlying algorithm. |
Lambda |
a matrix of parameters controlling the level of sparsity in the underlying algorithm. |
Q |
a matrix of the average number of selected features by the underlying algorithm with different parameters controlling the level of sparsity. |
Q_s |
a matrix of the calibrated number of stably selected features with different parameters controlling the level of sparsity. |
P |
a matrix of calibrated thresholds in selection proportions for different parameters controlling the level of sparsity in the underlying algorithm. |
PFER |
a matrix of upper-bounds in PFER of calibrated stability selection models with different parameters controlling the level of sparsity. |
FDP |
a matrix of upper-bounds in FDP of calibrated stability selection models with different parameters controlling the level of sparsity. |
S_2d |
a matrix of stability scores obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. |
PFER_2d |
a matrix of upper-bounds in PFER obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. |
FDP_2d |
a matrix of upper-bounds in FDP obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. |
selprop |
a matrix of selection proportions.
Columns correspond to predictors from |
Beta |
an array
of model coefficients. Columns correspond to predictors from |
method |
a list with
|
params |
a list with values used for arguments
|
For all matrices and arrays returned, the rows are ordered in the same way and correspond to parameter values stored in Lambda.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Jacobucci R, Grimm KJ, McArdle JJ (2016). “Regularized structural equation modeling.” Structural equation modeling: a multidisciplinary journal, 23(4), 555–566. doi:10.1080/10705511.2016.1154793.
SelectionAlgo, Resample, StabilityScore

Other stability functions: BiSelection(), Clustering(), GraphicalModel(), VariableSelection()
oldpar <- par(no.readonly = TRUE) par(mar = rep(7, 4)) # Data simulation set.seed(1) pk <- c(3, 2, 3) simul <- SimulateStructural( n = 500, pk = pk, nu_between = 0.5, v_between = 1, v_sign = 1 ) # Stability selection (using glmnet) dag <- LayeredDAG(layers = pk) stab <- StructuralModel( xdata = simul$data, adjacency = dag ) CalibrationPlot(stab) LinearSystemMatrix(vect = Stable(stab), adjacency = dag) # Stability selection (using OpenMx) if (requireNamespace("OpenMx", quietly = TRUE)) { stab <- StructuralModel( xdata = simul$data, implementation = PenalisedOpenMx, Lambda = seq(50, 500, by = 50), adjacency = dag ) CalibrationPlot(stab) OpenMxMatrix(SelectedVariables(stab), adjacency = dag) } ## Not run: # Data simulation with latent variables set.seed(1) pk <- c(3, 2, 3) simul <- SimulateStructural( n = 500, pk = pk, nu_between = 0.5, v_sign = 1, v_between = 1, n_manifest = 3, ev_manifest = 0.95 ) # Stability selection (using OpenMx) if (requireNamespace("OpenMx", quietly = TRUE)) { dag <- LayeredDAG(layers = pk, n_manifest = 3) penalised <- dag penalised[, seq_len(ncol(simul$data))] <- 0 stab <- StructuralModel( xdata = simul$data, implementation = PenalisedOpenMx, adjacency = dag, penalised = penalised, Lambda = seq(10, 100, by = 20), K = 10 # to increase for real use ) CalibrationPlot(stab) ids_latent <- grep("f", colnames(dag)) OpenMxMatrix(SelectedVariables(stab), adjacency = dag )[ids_latent, ids_latent] } ## End(Not run) par(oldpar)
Performs stability selection for regression models. The underlying variable selection algorithm (e.g. LASSO regression) is run with different combinations of parameters controlling the sparsity (e.g. penalty parameter) and thresholds in selection proportions. These two hyper-parameters are jointly calibrated by maximisation of the stability score.
VariableSelection( xdata, ydata = NULL, Lambda = NULL, pi_list = seq(0.01, 0.99, by = 0.01), K = 100, tau = 0.5, seed = 1, n_cat = NULL, family = "gaussian", implementation = PenalisedRegression, resampling = "subsampling", cpss = FALSE, PFER_method = "MB", PFER_thr = Inf, FDP_thr = Inf, Lambda_cardinal = 100, group_x = NULL, group_penalisation = FALSE, optimisation = c("grid_search", "nloptr"), n_cores = 1, output_data = FALSE, verbose = TRUE, beep = NULL, ... )
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
Lambda |
matrix of parameters controlling the level of sparsity in the
underlying feature selection algorithm specified in |
pi_list |
vector of thresholds in selection proportions. If
|
K |
number of resampling iterations. |
tau |
subsample size. Only used if |
seed |
value of the seed to initialise the random number generator and
ensure reproducibility of the results (see |
n_cat |
computation options for the stability score. Default is
|
family |
type of regression model. This argument is defined as in
|
implementation |
function to use for variable selection. Possible
functions are: |
resampling |
resampling approach. Possible values are:
|
cpss |
logical indicating if complementary pair stability selection
should be done. For this, the algorithm is applied on two non-overlapping
subsets of half of the observations. A feature is considered as selected if
it is selected for both subsamples. With this method, the data is split
|
PFER_method |
method used to compute the upper-bound of the expected
number of False Positives (or Per Family Error Rate, PFER). If
|
PFER_thr |
threshold in PFER for constrained calibration by error
control. If |
FDP_thr |
threshold in the expected proportion of falsely selected
features (or False Discovery Proportion) for constrained calibration by
error control. If |
Lambda_cardinal |
number of values in the grid of parameters controlling
the level of sparsity in the underlying algorithm. Only used if
|
group_x |
vector encoding the grouping structure among predictors. This
argument indicates the number of variables in each group. Only used for
models with group penalisation (e.g. |
group_penalisation |
logical indicating if a group penalisation should
be considered in the stability score. The use of
|
optimisation |
character string indicating the type of optimisation
method. With |
n_cores |
number of cores to use for parallel computing (see argument
|
output_data |
logical indicating if the input datasets |
verbose |
logical indicating if a loading bar and messages should be printed. |
beep |
sound indicating the end of the run. Possible values are:
|
... |
additional parameters passed to the functions provided in
|
In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (Lambda). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions above a threshold pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) for the underlying algorithm, and the threshold in selection proportion:

\[ V_{\lambda, \pi} = \{ j : p_{\lambda}(j) \geq \pi \} \]

where \(p_{\lambda}(j)\) denotes the selection proportion of feature \(j\) obtained with penalty \(\lambda\).
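As a hedged sketch of this definition in practice (assuming Argmax() returns the calibrated lambda and pi in that column order), the stable set can be recovered by thresholding the selection proportions at the calibrated pi:

# Hedged sketch: recovering the stable set by thresholding selection
# proportions at the calibrated pi (column order of Argmax() is assumed)
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50)
stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata)
hat_pi <- Argmax(stab)[1, 2] # calibrated threshold in selection proportions
stable_set <- names(which(SelectionProportions(stab) >= hat_pi))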
If argument group_penalisation=FALSE, "feature" refers to a variable (variable selection model). If argument group_penalisation=TRUE, "feature" refers to a group (group selection model). In this case, groups need to be defined a priori and specified in argument group_x.
These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL, or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully to check that the grids of parameters Lambda and pi_list do not restrict the calibration to a region that would not include the global maximum (see CalibrationPlot). In particular, the grid Lambda may need to be extended when the maximum stability is observed on the left or right edges of the calibration heatmap. In some instances, multiple peaks of stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tends to give better selection performance. This is not necessarily the highest peak (which is automatically retained by the functions in this package). The user can decide to manually choose another peak.
To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
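For instance (a hedged sketch, re-using simul from the sketch above):

# Hedged sketch: constrained calibration by error control (PFER <= 2)
stab_pfer <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, PFER_thr = 2
)
summary(stab_pfer)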
Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss). In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.
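A hedged sketch of the alternative resampling schemes, re-using simul and leaving all other arguments at their defaults:

# Hedged sketch: bootstrap resampling and complementary pairs
stab_boot <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, resampling = "bootstrap"
)
stab_cpss <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, cpss = TRUE
)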
For categorical or time-to-event outcomes (argument family is "binomial", "multinomial" or "cox"), the proportions of observations from each category in all subsamples or bootstrap samples are the same as in the full sample.
To ensure reproducibility of the results, the starting number of the random number generator is set to seed.
For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession. Alternatively, the function can be run manually with different seeds and all other parameters equal. The results can then be combined using Combine.
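A hedged sketch of the manual alternative, re-using simul and assuming Combine() accepts two such runs:

# Hedged sketch: two runs differing only by seed, merged afterwards
stab1 <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, seed = 1, K = 50
)
stab2 <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, seed = 2, K = 50
)
stab_all <- Combine(stability1 = stab1, stability2 = stab2)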
An object of class variable_selection. A list with:
S |
a matrix of the best stability scores for different parameters controlling the level of sparsity in the underlying algorithm. |
Lambda |
a matrix of parameters controlling the level of sparsity in the underlying algorithm. |
Q |
a matrix of the average number of selected features by the underlying algorithm with different parameters controlling the level of sparsity. |
Q_s |
a matrix of the calibrated number of stably selected features with different parameters controlling the level of sparsity. |
P |
a matrix of calibrated thresholds in selection proportions for different parameters controlling the level of sparsity in the underlying algorithm. |
PFER |
a matrix of upper-bounds in PFER of calibrated stability selection models with different parameters controlling the level of sparsity. |
FDP |
a matrix of upper-bounds in FDP of calibrated stability selection models with different parameters controlling the level of sparsity. |
S_2d |
a matrix of stability scores obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. |
PFER_2d |
a matrix of upper-bounds in PFER obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. |
FDP_2d |
a matrix of upper-bounds in FDP obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. |
selprop |
a matrix of selection proportions.
Columns correspond to predictors from |
Beta |
an array
of model coefficients. Columns correspond to predictors from |
method |
a list with
|
params |
a list with values used for arguments
|
For all matrices and arrays returned, the rows are ordered in the same way and correspond to parameter values stored in Lambda.
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Tibshirani R (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. ISSN 00359246, http://www.jstor.org/stable/2346178.
PenalisedRegression, SelectionAlgo, LambdaGridRegression, Resample, StabilityScore, Refit, ExplanatoryPerformance, Incremental

Other stability functions: BiSelection(), Clustering(), GraphicalModel(), StructuralModel()
oldpar <- par(no.readonly = TRUE) par(mar = rep(7, 4)) # Linear regression set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian" ) # Calibration plot CalibrationPlot(stab) # Extracting the results summary(stab) Stable(stab) SelectionProportions(stab) plot(stab) # Using randomised LASSO stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", penalisation = "randomised" ) plot(stab) # Using adaptive LASSO stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", penalisation = "adaptive" ) plot(stab) # Using additional arguments from glmnet (e.g. penalty.factor) stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", penalty.factor = c(rep(1, 45), rep(0, 5)) ) head(coef(stab)) # Using CART if (requireNamespace("rpart", quietly = TRUE)) { stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, implementation = CART, family = "gaussian", ) plot(stab) } # Regression with multivariate outcomes set.seed(1) simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian") stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "mgaussian" ) summary(stab) # Logistic regression set.seed(1) simul <- SimulateRegression(n = 200, pk = 10, family = "binomial", ev_xy = 0.8) stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "binomial" ) summary(stab) # Sparse PCA (1 component, see BiSelection for more components) if (requireNamespace("elasticnet", quietly = TRUE)) { set.seed(1) simul <- SimulateComponents(pk = c(5, 3, 4)) stab <- VariableSelection( xdata = simul$data, Lambda = seq_len(ncol(simul$data) - 1), implementation = SparsePCA ) CalibrationPlot(stab, xlab = "") summary(stab) } # Sparse PLS (1 outcome, 1 component, see BiSelection for more options) if (requireNamespace("sgPLS", quietly = TRUE)) { set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, Lambda = seq_len(ncol(simul$xdata) - 1), implementation = SparsePLS, family = "gaussian" ) CalibrationPlot(stab, xlab = "") SelectedVariables(stab) } # Group PLS (1 outcome, 1 component, see BiSelection for more options) if (requireNamespace("sgPLS", quietly = TRUE)) { stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, Lambda = seq_len(5), group_x = c(5, 5, 10, 20, 10), group_penalisation = TRUE, implementation = GroupPLS, family = "gaussian" ) CalibrationPlot(stab, xlab = "") SelectedVariables(stab) } # Example with more hyper-parameters: elastic net set.seed(1) simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian") TuneElasticNet <- function(xdata, ydata, family, alpha) { stab <- VariableSelection( xdata = xdata, ydata = ydata, family = family, alpha = alpha, verbose = FALSE ) return(max(stab$S, na.rm = TRUE)) } myopt <- optimise(TuneElasticNet, lower = 0.1, upper = 1, maximum = TRUE, xdata = simul$xdata, ydata = simul$ydata, family = "gaussian" ) stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, family = "gaussian", alpha = myopt$maximum ) summary(stab) enet <- SelectedVariables(stab) # Comparison with LASSO stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian") summary(stab) lasso <- SelectedVariables(stab) table(lasso, enet) # Example using an external function: group-LASSO with gglasso if 
(requireNamespace("gglasso", quietly = TRUE)) { set.seed(1) simul <- SimulateRegression(n = 200, pk = 20, family = "binomial") ManualGridGroupLasso <- function(xdata, ydata, family, group_x, ...) { # Defining the grouping group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) { rep(i, group_x[i]) })) if (family == "binomial") { ytmp <- ydata ytmp[ytmp == min(ytmp)] <- -1 ytmp[ytmp == max(ytmp)] <- 1 return(gglasso::gglasso(xdata, ytmp, loss = "logit", group = group, ...)) } else { return(gglasso::gglasso(xdata, ydata, lambda = lambda, loss = "ls", group = group, ...)) } } Lambda <- LambdaGridRegression( xdata = simul$xdata, ydata = simul$ydata, family = "binomial", Lambda_cardinal = 20, implementation = ManualGridGroupLasso, group_x = rep(5, 4) ) GroupLasso <- function(xdata, ydata, Lambda, family, group_x, ...) { # Defining the grouping group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) { rep(i, group_x[i]) })) # Running the regression if (family == "binomial") { ytmp <- ydata ytmp[ytmp == min(ytmp)] <- -1 ytmp[ytmp == max(ytmp)] <- 1 mymodel <- gglasso::gglasso(xdata, ytmp, lambda = Lambda, loss = "logit", group = group, ...) } if (family == "gaussian") { mymodel <- gglasso::gglasso(xdata, ydata, lambda = Lambda, loss = "ls", group = group, ...) } # Extracting and formatting the beta coefficients beta_full <- t(as.matrix(mymodel$beta)) beta_full <- beta_full[, colnames(xdata)] selected <- ifelse(beta_full != 0, yes = 1, no = 0) return(list(selected = selected, beta_full = beta_full)) } stab <- VariableSelection( xdata = simul$xdata, ydata = simul$ydata, implementation = GroupLasso, family = "binomial", Lambda = Lambda, group_x = rep(5, 4), group_penalisation = TRUE ) summary(stab) } par(oldpar)
Creates boxplots of the distribution of (calibrated) median attribute weights obtained from the COSA algorithm across the subsampling iterations. See examples in Clustering.
WeightBoxplot( stability, at = NULL, argmax_id = NULL, col = NULL, boxwex = 0.3, xlab = "", ylab = "Weight", cex.lab = 1.5, las = 3, frame = "F", add = FALSE, ... )
stability |
output of |
at |
coordinates along the x-axis (more details in
|
argmax_id |
optional indices of hyper-parameters. If
|
col |
optional vector of colours. |
boxwex |
box width (more details in |
xlab |
label of the x-axis. |
ylab |
label of the y-axis. |
cex.lab |
font size for labels. |
las |
orientation of labels on the x-axis (see
|
frame |
logical indicating if the box around the plot should be drawn
(more details in |
add |
logical indicating if the boxplot should be added to the current plot. |
... |
additional parameters passed to |
A boxplot.
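Complete examples appear in Clustering. The sketch below is a hedged minimal illustration only: it assumes that the rCOSA package is installed, that supplying Lambda to Clustering() triggers weighted (COSA) clustering, and that LambdaSequence() generates the penalty grid.

# Hedged sketch (assumptions noted above; not the canonical example)
if (requireNamespace("rCOSA", quietly = TRUE)) {
  set.seed(1)
  simul <- SimulateClustering(n = c(30, 30, 30), nu_xc = 1)
  stab <- Clustering(
    xdata = simul$data,
    Lambda = LambdaSequence(lmax = 10, lmin = 0.1, cardinal = 10)
  )
  WeightBoxplot(stab)
}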