Grid Binary LOgistic REgression (GLORE): building shared models without sharing data.

Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, California 92093, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.57). 04/2012; 19(5):758-64. DOI: 10.1136/amiajnl-2012-000862
Source: PubMed

ABSTRACT The classification of complex or rare patterns in clinical and genomic data requires the availability of a large, labeled patient set. While methods that operate on large, centralized data sources have been extensively used, little attention has been paid to understanding whether models such as binary logistic regression (LR) can be developed in a distributed manner, allowing researchers to share models without necessarily sharing patient data.
Instead of bringing data to a central repository for computation, we bring computation to the data. The Grid Binary LOgistic REgression (GLORE) model integrates decomposable partial elements or non-privacy sensitive prediction values to obtain model coefficients, the variance-covariance matrix, the goodness-of-fit test statistic, and the area under the receiver operating characteristic (ROC) curve.
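The decomposition described above can be sketched as follows. This is a minimal illustration, not the published GLORE implementation: the function names, the fixed iteration count, and the two-site setup are assumptions for the example. Each site returns only its aggregate gradient and Hessian contributions for the logistic log-likelihood; the coordinator sums these partial terms and takes a Newton-Raphson step, so raw patient records never leave their sites.

```python
# Minimal sketch of a GLORE-style distributed Newton-Raphson fit for
# binary logistic regression. Names and iteration count are illustrative.
import numpy as np

def local_terms(X, y, beta):
    """Each site computes its decomposable gradient and Hessian pieces."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
    grad = X.T @ (y - p)                     # local gradient contribution
    W = p * (1.0 - p)
    hess = -(X * W[:, None]).T @ X           # local Hessian contribution
    return grad, hess

def glore_fit(sites, n_features, n_iter=25):
    """Coordinator sums partial terms and updates beta; only aggregates
    travel over the network, never patient-level records."""
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        grad = np.zeros(n_features)
        hess = np.zeros((n_features, n_features))
        for X, y in sites:                   # one (X, y) pair per site
            g, h = local_terms(X, y, beta)
            grad += g
            hess += h
        beta -= np.linalg.solve(hess, grad)  # Newton-Raphson step
    # Variance-covariance matrix from the summed observed information
    return beta, np.linalg.inv(-hess)
```

Because the summed gradient and Hessian are identical to those computed on pooled data, the iterates match a centralized fit up to floating-point error, which is consistent with the precision result reported above.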
We conducted experiments on both simulated and clinically relevant data, and compared the computational costs of GLORE with those of a traditional LR model estimated using the combined data. We showed that our results match those of LR to within 10^-15. In addition, GLORE is computationally efficient.
In GLORE, the calculation of coefficient gradients must be synchronized across sites, which requires some effort to ensure the integrity of communication. It is also necessary to ensure that the predictors have the same format and meaning across the data sets.
The results suggest that GLORE performs as well as LR and allows data to remain protected at their original sites.

  •
    ABSTRACT: Assuring confidentiality of personal information and preserving privacy are vital when data are harvested from multiple institutions for business decision-making. An algorithm that builds knowledge using statistics based on subject data from distributed sites that satisfy specified selection criteria is presented here. The algorithm maintains complete fidelity of information structures in the distributed data compared to the centralized equivalent. Heterogeneous data schemas across sites can be accommodated, and thresholds can be set for global minimum saturation for attributes to participate in the prediction model building. Policies for inclusion and exclusion of non-exhaustive attributes among sites are introduced. Unification of attributes is introduced for homogenizing attribute values globally. Results of experiments using data from medical, higher education, and social domains elucidate the value of our algorithm in regulated industries, where shipping raw data outside the parent institution is not practical.
    Journal of Intelligent Information Systems 02/2015; 44(1). · 0.63 Impact Factor
  •
    ABSTRACT: Bringing together the information latent in distributed medical databases promises to personalize medical care by enabling reliable, stable modeling of outcomes with rich feature sets (including patient characteristics and treatments received). However, there are barriers to aggregation of medical data, due to lack of standardization of ontologies, privacy concerns, proprietary attitudes toward data, and a reluctance to give up control over end use. Aggregation of data is not always necessary for model fitting. In models based on maximizing a likelihood, the computations can be distributed, with aggregation limited to the intermediate results of calculations on local data, rather than raw data. Distributed fitting is also possible for singular value decomposition. There has been work on the technical aspects of shared computation for particular applications, but little has been published on the software needed to support the "social networking" aspect of shared computing, to reduce the barriers to collaboration. We describe a set of software tools that allow the rapid assembly of a collaborative computational project, based on the flexible and extensible R statistical software and other open source packages, that can work across a heterogeneous collection of database environments, with full transparency to allow local officials concerned with privacy protections to validate the safety of the method. We describe the principles, architecture, and successful test results for the site-stratified Cox model and rank-k Singular Value Decomposition (SVD).
  •
    ABSTRACT: Privacy protection is an important issue in medical informatics, and differential privacy is a state-of-the-art framework for data privacy research. Differential privacy offers provable privacy against attackers who have auxiliary information, and can be applied to data mining models (for example, logistic regression). However, differentially private methods sometimes introduce too much noise and make outputs less useful. Given available public data in medical research (e.g. from patients who sign open-consent agreements), we can design algorithms that use both public and private data sets to decrease the amount of noise that must be introduced.
    BMC Medical Genomics 05/2014; 7(Suppl 1):S14. · 3.91 Impact Factor
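The differentially private approach in the last abstract rests on calibrated noise. A minimal sketch of the underlying Laplace mechanism, assuming a simple count query (the query, epsilon value, and function name are illustrative, not from the cited paper): a count changes by at most 1 when one patient's record is added or removed, so Laplace noise with scale 1/epsilon suffices for epsilon-differential privacy.

```python
# Minimal sketch of the Laplace mechanism behind differential privacy.
# The query and parameters are illustrative assumptions.
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a differentially private count. A count query has
    sensitivity 1, so noise with scale 1/epsilon masks any single
    record's contribution."""
    true_count = sum(1 for record in data if predicate(record))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise
```

Smaller epsilon (stronger privacy) means larger noise, which is the utility loss the abstract describes; leveraging public data lets noise be confined to the private portion of the data set.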

