ABSTRACT: In this paper, we propose a majorization-minimization (MM) algorithm for
high-dimensional fused lasso regression (FLR) suitable for parallelization
using graphics processing units (GPUs). The MM algorithm is stable and flexible
as it can solve the FLR problems with various types of design matrices and
penalty structures within a few tens of iterations. We also show that the
convergence of the proposed algorithm is guaranteed. We conduct numerical
studies to compare our algorithm with other existing algorithms. We demonstrate
that the proposed MM algorithm is competitive in general settings. The merit of
GPU parallelization is also exhibited.
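For orientation, the standard one-dimensional fused lasso objective and the usual quadratic majorizer of the absolute value can be written as below. This is only a sketch: the symbols (y, X, β, λ₁, λ₂, t₀) are assumptions introduced here, and the abstract does not say whether the paper uses exactly this penalty or this surrogate.

```latex
% Standard fused lasso regression objective (assumed notation):
\min_{\beta \in \mathbb{R}^p} \;
  \frac{1}{2}\,\lVert y - X\beta \rVert_2^2
  + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
  + \lambda_2 \sum_{j=2}^{p} \lvert \beta_j - \beta_{j-1} \rvert

% A common quadratic majorizer of |t| at a point t_0 \neq 0:
\lvert t \rvert \;\le\; \frac{t^2}{2\lvert t_0 \rvert} + \frac{\lvert t_0 \rvert}{2},
\qquad \text{with equality when } \lvert t \rvert = \lvert t_0 \rvert .
```

The inequality follows from (|t| − |t₀|)² ≥ 0. Replacing each absolute value with such a bound turns every MM iteration into a quadratic problem, which is one reason schemes of this kind lend themselves to parallel linear algebra.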
ABSTRACT: Estimation of high-dimensional covariance matrices is known to be a difficult problem, has many applications, and is of current interest to the larger statistics community. In many applications, including the so-called "large p small n" setting, the estimate of the covariance matrix is required to be not only invertible, but also well-conditioned. Although many regularization schemes attempt to do this, none of them addresses the ill-conditioning problem directly. In this paper, we propose a maximum likelihood approach, with the direct goal of obtaining a well-conditioned estimator. No sparsity assumptions on either the covariance matrix or its inverse are imposed, thus making our procedure more widely applicable. We demonstrate that the proposed regularization scheme is computationally efficient, yields a type of Steinian shrinkage estimator, and has a natural Bayesian interpretation. We investigate the theoretical properties of the regularized covariance estimator comprehensively, including its regularization path, and proceed to develop an approach that adaptively determines the level of regularization that is required. Finally, we demonstrate the performance of the regularized estimator in decision-theoretic comparisons and in the financial portfolio optimization setting. The proposed approach has desirable properties, and can serve as a competitive procedure, especially when the sample size is small and when a well-conditioned estimator is required.
Journal of the Royal Statistical Society Series B (Statistical Methodology) 06/2013; 75(3):427-450. · 4.81 Impact Factor
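To make "well-conditioned" concrete, the sketch below bounds the condition number of a sample covariance matrix by truncating its eigenvalues to an interval [tau, kappa_max * tau]. It illustrates the general idea only: the function name `clip_condition_number` and the parameters `kappa_max` and `tau` are hypothetical, and the abstract does not state how the paper's likelihood-based procedure chooses the level of regularization.

```python
import numpy as np


def clip_condition_number(sample_cov, kappa_max, tau):
    """Return a symmetric matrix whose eigenvalues lie in [tau, kappa_max * tau].

    kappa_max and tau are user-supplied here (an assumption of this sketch).
    """
    # eigh is appropriate because a covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(sample_cov)
    clipped = np.clip(eigvals, tau, kappa_max * tau)
    # Rebuild the matrix as V diag(clipped) V^T.
    return (eigvecs * clipped) @ eigvecs.T


# "Large p small n" toy data: 5 observations of 20 variables,
# so the sample covariance matrix is singular.
rng = np.random.default_rng(0)
data = rng.standard_normal((5, 20))
sample_cov = np.cov(data, rowvar=False)

estimate = clip_condition_number(sample_cov, kappa_max=50.0, tau=0.1)
print(np.linalg.cond(estimate))
```

Because every eigenvalue of the result lies between `tau` and `kappa_max * tau`, its condition number cannot exceed `kappa_max`, and the matrix is invertible even though the sample covariance is not.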
ABSTRACT: This paper proposes a procedure to obtain monotone estimates of both the
local and the tail false discovery rates that arise in large-scale multiple
testing. The proposed monotonization is asymptotically optimal for controlling
the false discovery rate and also has many attractive finite-sample properties.
Statistics & Probability Letters 05/2013; · 0.53 Impact Factor
ABSTRACT: As the need for large-scale data analysis rapidly increases, Hadoop, a platform for large-scale data processing, and MapReduce, its internal computational model, are receiving great attention. This paper reviews the basic concepts of Hadoop and MapReduce that data analysts familiar with statistical programming need, through examples that combine the R programming language and Hadoop.
Journal of the Korean Data and Information Science Society. 01/2013; 24(5).
ABSTRACT: The ROC convex hull (ROCCH) is the least convex majorant of the empirical ROC curve, and represents the optimal ROC curve of a set of classifiers. This paper provides a probabilistic view of the ROCCH. We show that the ROCCH can be characterized as a nonparametric maximum likelihood estimator (NPMLE) of a convex ROC curve. We provide two NPMLE formulations, one unconditional and the other conditional, both of which yield the ROCCH as the solution. The solution technique relates the NPMLEs to convex optimization and classifier calibration. The connection between the NPMLEs and the ROCCH also suggests efficient algorithms to compute NPMLEs of a convex ROC curve, and a conditional bootstrap procedure for assessing uncertainties in the ROCCH.
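As a point of reference, the ROCCH of a finite set of classifier operating points can be computed geometrically as the upper hull of the (false positive rate, true positive rate) pairs together with the trivial points (0, 0) and (1, 1). The sketch below uses a standard monotone-chain scan. It is a plain geometric construction, not the NPMLE-based algorithm the abstract refers to, and the function names are hypothetical.

```python
def _cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Return the upper hull of ROC points, ordered by false positive rate.

    points: iterable of (fpr, tpr) tuples. The trivial classifiers
    (0, 0) and (1, 1) are always included.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last vertex while it lies on or below the segment
        # from the vertex before it to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


classifiers = [(0.1, 0.5), (0.3, 0.6), (0.5, 0.9), (0.7, 0.8)]
hull = roc_convex_hull(classifiers)
print(hull)  # [(0.0, 0.0), (0.1, 0.5), (0.5, 0.9), (1.0, 1.0)]
```

In this toy input, the points (0.3, 0.6) and (0.7, 0.8) fall below the hull, meaning those classifiers are dominated by a mixture of their neighbours.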
ABSTRACT: Highly multiplexed assays using antibody-coated, fluorescent (xMap) beads are widely used to measure quantities of soluble analytes, such as cytokines and antibodies, in clinical and other studies. Current analyses of these assays use methods based on standard curves that have limitations in detecting low or high abundance analytes. Here we describe SAxCyB (Significance Analysis of xMap Cytokine Beads), a method that uses fluorescence measurements of individual beads to find significant differences between experimental conditions. We show that SAxCyB outperforms conventional analysis schemes in both sensitivity (low fluorescence) and robustness (high variability) and has enabled us to find many new differentially expressed cytokines in published studies.
Proceedings of the National Academy of Sciences 02/2012; 109(8):2848-53. · 9.81 Impact Factor
ABSTRACT: Direct projection of three-dimensional branching structures, such as networks of cables, blood vessels, or neurons onto a 2D image creates the illusion of intersecting structural parts and creates challenges for understanding and communication. We present a method for visualizing such structures, and demonstrate its utility in visualizing the abdominal aorta and its branches, whose tomographic images might be obtained by computed tomography or magnetic resonance angiography, in a single two-dimensional stylistic image, without overlaps among branches. The visualization method, termed uncluttered single-image visualization (USIV), involves optimization of geometry. This paper proposes a novel optimization technique that utilizes an interesting connection of the optimization problem regarding USIV to the protein structure prediction problem. Adopting the integer linear programming-based formulation for the protein structure prediction problem, we tested the proposed technique using 30 visualizations produced from five patient scans with representative anatomical variants in the abdominal aortic vessel tree. The novel technique can exploit commodity-level parallelism, enabling use of general-purpose graphics processing unit (GPGPU) technology that yields a significant speedup. Comparison of the results with the other optimization technique previously reported elsewhere suggests that, in most aspects, the quality of the visualization is comparable to that of the previous one, with a significant gain in the computation time of the algorithm.
IEEE Transactions on Visualization and Computer Graphics. 01/2012;
ABSTRACT: Though recently they have fallen into some disrepute, genome-wide association studies (GWAS) have been formulated and applied to understanding essential hypertension. The principal goal here is to use data gathered in a GWAS to gauge the extent to which SNPs and their interactions with other features can be combined to predict mean arterial blood pressure (MAP) in 3138 pre-menopausal and naturally post-menopausal white women. More precisely, we quantify the extent to which data as described permit prediction of MAP beyond what is possible from traditional risk factors such as blood cholesterol levels and glucose levels. Of course, these traditional risk factors are genetic, though typically not explicitly so. In all, there were 44 such risk factors/clinical variables measured and 377,790 single nucleotide polymorphisms (SNPs) genotyped. Data for the women we studied are from first visit measurements taken as part of the Atherosclerotic Risk in Communities (ARIC) study. We begin by assessing non-SNP features in their abilities to predict MAP, employing a novel regression technique with two stages, first the discovery of main effects and next the discovery of their interactions. The long list of SNPs genotyped is reduced to a manageable list for combining with non-SNP features in prediction. We adapted Efron's local false discovery rate to produce this reduced list. Selected non-SNP and SNP features and their interactions are used to predict MAP using adaptive linear regression. We quantify quality of prediction by an estimated coefficient of determination (R²). We compare the accuracy of prediction with and without information from SNPs.
PLoS ONE 01/2011; 6(11):e27891. · 3.53 Impact Factor
ABSTRACT: The authors develop a method to visualize the abdominal aorta and its branches, obtained by CT or MR angiography, in a single 2D stylistic image without overlap among branches.
The abdominal aortic vasculature is modeled as an articulated object whose underlying topology is a rooted tree. The inputs to the algorithm are the 3D centerlines of the abdominal aorta, its branches, and their associated diameter information. The visualization problem is formulated as an optimization problem that finds a spatial configuration of the bounding boxes of the centerlines most similar to the projection of the input into a given viewing direction (e.g., anteroposterior), while not introducing intersections among the boxes. The optimization algorithm minimizes a score function regarding the overlap of the bounding boxes and the deviation from the input. The output of the algorithm is used to produce a stylistic visualization, made of the 2D centerlines modulated by the associated diameter information, on a plane. The authors performed a preliminary evaluation by asking three radiologists to label 366 arterial branches from the 30 visualizations of five cases produced by the method. Each of the five patients was presented in six different variant images, selected from ten variants with the three lowest and three highest scores. For each label, they assigned confidence and distortion ratings (low/medium/high). They studied the association between the quantitative metrics measured from the visualization and the subjective ratings by the radiologists.
All resulting visualizations were free from branch overlaps. Labeling accuracies of the three readers were 93.4%, 94.5%, and 95.4%, respectively. For the total of 1098 samples, the distortion ratings were low: 77.39%, medium: 10.48%, and high: 12.12%. The confidence ratings were low: 5.56%, medium: 16.50%, and high: 77.94%. The association study shows that the proposed quantitative metrics can predict a reader's subjective ratings and suggests that the visualization with the lowest score should be selected for readers.
The method for eliminating misleading false intersections in 2D projections of the abdominal aortic tree conserves the overall shape and does not diminish accurate identifiability of the branches.
Medical Physics 11/2009; 36(11):5245-60. · 2.91 Impact Factor