Grid Binary LOgistic REgression (GLORE): building shared models without sharing data

Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, California 92093, USA.
Journal of the American Medical Informatics Association (Impact Factor: 3.93). 04/2012; 19(5):758-64. DOI: 10.1136/amiajnl-2012-000862
Source: PubMed

ABSTRACT The classification of complex or rare patterns in clinical and genomic data requires the availability of a large, labeled patient set. While methods that operate on large, centralized data sources have been extensively used, little attention has been paid to understanding whether models such as binary logistic regression (LR) can be developed in a distributed manner, allowing researchers to share models without necessarily sharing patient data.
Instead of bringing data to a central repository for computation, we bring computation to the data. The Grid Binary LOgistic REgression (GLORE) model integrates decomposable partial elements or non-privacy sensitive prediction values to obtain model coefficients, the variance-covariance matrix, the goodness-of-fit test statistic, and the area under the receiver operating characteristic (ROC) curve.
We conducted experiments on both simulated and clinically relevant data, and compared the computational costs of GLORE with those of a traditional LR model estimated using the combined data. We showed that our results are the same as those of LR to a precision of 10^-15. In addition, GLORE is computationally efficient.
In GLORE, the calculation of coefficient gradients must be synchronized at different sites, which involves some effort to ensure the integrity of communication. Ensuring that the predictors have the same format and meaning across the data sets is necessary.
The results suggest that GLORE performs as well as LR and allows data to remain protected at their original sites.
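
The idea of bringing computation to the data can be illustrated with a minimal Python sketch. It assumes a Newton-Raphson fit in which each site returns only its summed gradient and information matrix for the current coefficients, and a coordinator adds them up. The function names (`local_summaries`, `fit_distributed`) and the simulated three-site data are hypothetical and are not taken from the GLORE software; the abstract does not spell out the exact update rule, so treat this as one plausible reading rather than the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def local_summaries(X, y, beta):
    """Run at one site: return aggregates only, never patient-level rows."""
    p = sigmoid(X @ beta)
    gradient = X.T @ (y - p)                      # score vector, length d
    info = (X * (p * (1.0 - p))[:, None]).T @ X   # d x d information matrix
    return gradient, info


def fit_distributed(sites, n_features, tol=1e-10, max_iter=50):
    """Coordinator: sum per-site aggregates and take Newton-Raphson steps."""
    beta = np.zeros(n_features)
    for _ in range(max_iter):
        parts = [local_summaries(X, y, beta) for X, y in sites]
        gradient = sum(g for g, _ in parts)
        info = sum(h for _, h in parts)
        step = np.linalg.solve(info, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Variance-covariance matrix: inverse of the summed information matrix
    info = sum(local_summaries(X, y, beta)[1] for X, y in sites)
    return beta, np.linalg.inv(info)


# Simulated data (assumption for illustration): intercept plus two predictors
rng = np.random.default_rng(0)
n = 600
X_all = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y_all = (rng.random(n) < sigmoid(X_all @ true_beta)).astype(float)

# Split the rows across three hypothetical sites
row_groups = np.array_split(np.arange(n), 3)
sites = [(X_all[idx], y_all[idx]) for idx in row_groups]

beta_dist, cov_dist = fit_distributed(sites, n_features=3)
# Treating the pooled data as a single site gives the centralized fit
beta_pooled, cov_pooled = fit_distributed([(X_all, y_all)], n_features=3)

print("distributed:", beta_dist)
print("pooled:     ", beta_pooled)
print("max abs difference:", np.max(np.abs(beta_dist - beta_pooled)))
```

Because the gradient and information matrix are sums over patients, adding the per-site pieces gives the same quantities as computing them on the pooled data, up to floating-point rounding. Every site must evaluate its summaries at the same `beta` in each iteration, which is the synchronization requirement the abstract mentions.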

  • ABSTRACT: We describe functional specifications and practicalities in the software development process for a web service that allows the construction of the multivariate logistic regression model, Grid Logistic Regression (GLORE), by aggregating partial estimates from distributed sites, with no exchange of patient-level data. We recently developed and published a web service for model construction and data analysis in a distributed environment. That paper provided an overview of the system that is useful for users, but included very few details that are relevant for biomedical informatics developers or network security personnel who may be interested in implementing this or similar systems. We focus here on how the system was conceived and implemented. We followed a two-stage development approach, first implementing the backbone system and then incrementally improving the user experience through interactions with potential users during development. Our system went through various stages such as proof of concept, algorithm validation, user interface development, and system testing. We used the Zoho Project management system to track tasks and milestones. We leveraged Google Code and Apache Subversion to share code among team members, and developed an applet-servlet architecture to support cross-platform deployment. During the development process, we encountered challenges such as Information Technology (IT) infrastructure gaps and limited team experience in user-interface design. We identified solutions, as well as enabling factors, to support the translation of an innovative privacy-preserving, distributed modeling technology into a working prototype.
    Using GLORE (a distributed model that we developed earlier) as a pilot example, we demonstrated the feasibility of building and integrating distributed modeling technology into a usable framework that can support privacy-preserving, distributed data analysis among researchers at geographically dispersed institutes.
    12/2014; 2(1):1053. DOI:10.13063/2327-9214.1053
  • ABSTRACT: Assuring confidentiality of personal information and preserving privacy are vital when data is harvested from multiple institutions for business decision-making. An algorithm that builds knowledge using statistics based on subject data from distributed sites that satisfy specified selection criteria is presented here. The algorithm maintains complete fidelity of information structures in the distributed data compared to the centralized equivalent. Heterogeneous data schemas across sites can be accommodated, and thresholds can be set for the global minimum saturation that attributes must reach to participate in building the prediction model. Policies for inclusion and exclusion of non-exhaustive attributes among sites are introduced. Unification of attributes is introduced for homogenizing attribute values globally. Results of experiments using data from medical, higher education, and social domains elucidate the value of our algorithm in regulated industries, where shipping raw data outside the parent institution is not practical.
    Journal of Intelligent Information Systems 02/2015; 44(1). DOI:10.1007/s10844-014-0331-6
  • ABSTRACT: Multi-category response models are very important complements to binary logistic models in medical decision-making. Decomposing model construction by aggregating computation developed at different sites is necessary when data cannot be moved outside institutions due to privacy or other concerns. Such decomposition makes it possible to conduct grid computing to protect the privacy of individual observations. This paper proposes two grid multi-category response models for ordinal and multinomial logistic regressions. Grid computation to test model assumptions is also developed for these two types of models. In addition, we present grid methods for goodness-of-fit assessment and for classification performance evaluation. Simulation results show that the grid models produce the same results as those obtained from corresponding centralized models, demonstrating that it is possible to build models using multi-center data without losing accuracy or transmitting observation-level data. Two real data sets are used to evaluate the performance of our proposed grid models. The grid fitting method offers a practical solution for resolving privacy and other issues caused by pooling all data in a central site. The proposed method is applicable for various likelihood estimation problems, including other generalized linear models.
    BMC Medical Informatics and Decision Making 12/2015; 15(1). DOI:10.1186/s12911-015-0133-y
