Article

Graphical modeling of binary data using the LASSO: a simulation study.

Institute for Medical Informatics, Biometrics and Epidemiology, Ludwig-Maximilians-Universität München, Munich, Germany.
BMC Medical Research Methodology (impact factor: 2.67). 02/2012; 12:16. DOI:10.1186/1471-2288-12-16
Source: PubMed

ABSTRACT Graphical models were identified as a promising new approach to modeling high-dimensional clinical data. They provided a probabilistic tool to display, analyze and visualize net-like dependence structures by drawing a graph that describes the conditional dependencies between the variables. Until now, research has focused mainly on building Gaussian graphical models for continuous multivariate data following a multivariate normal distribution. Satisfactory solutions for binary data were missing. We adapted the method of Meinshausen and Bühlmann to binary data and used the LASSO for logistic regression. The objective of this paper was to examine the performance of the Bolasso in the development of graphical models for high-dimensional binary data. We hypothesized that the Bolasso is superior to competing LASSO methods in identifying graphical models.
We analyzed the Bolasso for deriving graphical models in comparison with other LASSO-based methods. Model performance was assessed in a simulation study with random data generated via symmetric local logistic regression models and Gibbs sampling. The main outcome variables were the Structural Hamming Distance and the Youden Index. We applied the results of the simulation study to a real-life data set containing functioning data of patients with head and neck cancer.
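The two outcome measures named above can be illustrated with a small sketch. It assumes undirected graphs given as collections of node pairs, and it takes the Structural Hamming Distance for undirected graphs to be the number of edges that must be added or deleted to turn the estimated graph into the true one. The function names are illustrative and are not taken from the paper.

```python
from itertools import combinations


def undirected(edges):
    """Normalize (i, j) pairs so that (i, j) and (j, i) are the same edge."""
    return {frozenset(e) for e in edges}


def structural_hamming_distance(true_edges, est_edges):
    """Count edge additions and deletions separating the two undirected graphs."""
    return len(undirected(true_edges) ^ undirected(est_edges))


def youden_index(true_edges, est_edges, n_nodes):
    """Sensitivity + specificity - 1, computed over all possible node pairs.

    Assumes the true graph has at least one edge and at least one non-edge,
    otherwise sensitivity or specificity is undefined.
    """
    t, e = undirected(true_edges), undirected(est_edges)
    all_pairs = {frozenset(p) for p in combinations(range(n_nodes), 2)}
    tp = len(t & e)              # edges correctly recovered
    fn = len(t - e)              # true edges that were missed
    fp = len(e - t)              # spurious edges
    tn = len(all_pairs - t - e)  # non-edges correctly left out
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


# Toy graphs on 4 nodes (made-up values, for illustration only).
true_graph = [(0, 1), (1, 2), (2, 3)]
estimated_graph = [(1, 0), (1, 2), (0, 3)]
shd = structural_hamming_distance(true_graph, estimated_graph)
youden = youden_index(true_graph, estimated_graph, n_nodes=4)
```

A lower Structural Hamming Distance and a higher Youden Index both indicate an estimated graph closer to the true one.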
Bootstrap aggregating as incorporated in the Bolasso algorithm greatly improved performance at larger sample sizes. The number of bootstraps had minimal impact on performance. Bolasso performed reasonably well with a cutpoint of 0.90 and a small penalty term. Optimal prediction for Bolasso led to very conservative models in comparison with AIC, BIC or cross-validated optimal penalty terms.
Bootstrap aggregating may improve variable selection if the underlying selection process is not too unstable due to small sample size and if one is mainly interested in reducing the false discovery rate. We propose using the Bolasso for graphical modeling in large sample sizes.
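The bootstrap aggregation step with a cutpoint, as described in the abstract, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_edges` is a hypothetical plug-in that stands for any edge selector (in the paper, LASSO-penalized logistic regression of each variable on all others), and `toy_selector` is a deliberately simple stand-in based on pairwise agreement, not a LASSO fit.

```python
import random
from collections import Counter


def bolasso_edges(data, select_edges, n_boot=100, cutpoint=0.90, seed=0):
    """Keep the edges chosen in at least `cutpoint` of the bootstrap samples.

    data         : list of rows, each a tuple of 0/1 values
    select_edges : hypothetical callable mapping a sample to an iterable of
                   (i, j) edges, e.g. a neighborhood-selection routine
    """
    rng = random.Random(seed)
    n = len(data)
    counts = Counter()
    for _ in range(n_boot):
        # Draw n rows with replacement to form one bootstrap sample.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        # Count each undirected edge at most once per bootstrap sample.
        counts.update({frozenset(e) for e in select_edges(sample)})
    return {edge for edge, c in counts.items() if c / n_boot >= cutpoint}


def toy_selector(sample, threshold=0.8):
    """Stand-in selector (NOT the LASSO): link two binary variables when
    they take the same value in more than `threshold` of the rows."""
    p = len(sample[0])
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            agree = sum(row[i] == row[j] for row in sample) / len(sample)
            if agree > threshold:
                edges.add((i, j))
    return edges
```

Calling `bolasso_edges(data, toy_selector, n_boot=50)` returns a set of `frozenset` edges. Raising `cutpoint` makes the resulting graph sparser, which matches the conservative behavior reported above.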

  • Article: Heuristics of instability and stabilization in model selection
    ABSTRACT: In model selection, usually a "best" predictor is chosen from a collection $\{\hat{\mu}(\cdot, s)\}$ of predictors where $\hat{\mu}(\cdot, s)$ is the minimum least-squares predictor in a collection $\mathsf{U}_s$ of predictors. Here s is a complexity parameter; that is, the smaller s, the lower dimensional/smoother the models in $\mathsf{U}_s$. If $\mathsf{L}$ is the data used to derive the sequence $\{\hat{\mu}(\cdot, s)\}$, the procedure is called unstable if a small change in $\mathsf{L}$ can cause large changes in $\{\hat{\mu}(\cdot, s)\}$. With a crystal ball, one could pick the predictor in $\{\hat{\mu}(\cdot, s)\}$ having minimum prediction error. Without prescience, one uses test sets, cross-validation and so forth. The difference in prediction error between the crystal ball selection and the statistician's choice we call predictive loss. For an unstable procedure the predictive loss is large. This is shown by some analytics in a simple case and by simulation results in a more complex comparison of four different linear regression methods. Unstable procedures can be stabilized by perturbing the data, getting a new predictor sequence $\{\hat{\mu}'(\cdot, s)\}$ and then averaging over many such predictor sequences.
  • Article: Methodological considerations, such as directed acyclic graphs, for studying "acute on chronic" disease epidemiology: chronic obstructive pulmonary disease example.
    ABSTRACT: Acute exacerbations of chronic disease are ubiquitous in clinical medicine, and thus far, there has been a paucity of integrated methodological discussion on this phenomenon. We use acute exacerbations of chronic obstructive pulmonary disease as an example to emphasize key epidemiological and statistical issues for this understudied field in clinical epidemiology. Directed acyclic graphs are a useful epidemiological tool to explain the differential effects of risk factor on health outcomes in studies of acute and chronic phases of disease. To study the pathogenesis of acute exacerbations of chronic disease, case-crossover design and time-series analysis are well-suited study designs to differentiate acute and chronic effect. Modeling changes over time and setting appropriate thresholds are important steps to separate acute from chronic phases of disease in serial measurements. In statistical analysis, acute exacerbations are recurrent events, and some individuals are more prone to recurrences than others. Therefore, appropriate statistical modeling should take into account intraindividual dependence. Finally, we recommend the use of "event-based" number needed to treat (NNT) to prevent a single exacerbation instead of traditional patient-based NNT. Addressing these methodological challenges will advance research quality in acute on chronic disease epidemiology.
    Journal of clinical epidemiology 03/2009; 62(9):982-90. · 2.96 Impact Factor
  • Article: Graphical models illustrated complex associations between variables describing human functioning.
    ABSTRACT: To examine whether graphical modeling is a potentially useful method for the study of human functioning using data collected by means of the International Classification of Functioning, Disability and Health (ICF). The applicability of the method was examined in a convenience sample of 616 patients from a cross-sectional multicentric study undergoing early postacute rehabilitation. Functioning was qualified using 115 second-level ICF categories. The modeling was carried out on a data set with imputed missing values. The least absolute shrinkage and selection operator (LASSO) for generalized linear models was used to identify conditional dependencies between the ICF categories. Bootstrap aggregating was used to enhance the accuracy and validity of model selection. The resulting graph showed highly meaningful relationships. For example, one structure centered around speaking and included three paths addressing conversation, speech functions, and mental functions of language. Graphical modeling of human functioning using data collected by means of the ICF yields clinically meaningful results. The structures found may be the basis for the identification of suitable targets for rehabilitation interventions, the identification of confounders and intermediate variables, and the selection of parsimonious sets of variables for multivariate epidemiological modeling.
    Journal of clinical epidemiology 07/2009; 62(9):922-33. · 2.96 Impact Factor


Keywords

binary data
Bootstrap aggregating
building Gaussian graphical models
conservative models
continuous multivariate data
derive graphical models
dimensional binary data
Graphical models
modeling high-dimensional clinical data
multivariate normal distribution
promising new approach
random data
real-life data
Satisfactory solutions
simulation study
small penalty term
small sample size
symmetric local logistic regression models
variable selection
Youden Index