Grid Binary LOgistic REgression (GLORE): building
shared models without sharing data
Yuan Wu, Xiaoqian Jiang, Jihoon Kim, Lucila Ohno-Machado
Objective The classification of complex or rare patterns
in clinical and genomic data requires the availability of
a large, labeled patient set. While methods that operate
on large, centralized data sources have been extensively
used, little attention has been paid to understanding
whether models such as binary logistic regression (LR)
can be developed in a distributed manner, allowing
researchers to share models without necessarily sharing data.
Material and methods Instead of bringing data to
a central repository for computation, we bring
computation to the data. The Grid Binary LOgistic
REgression (GLORE) model integrates decomposable
partial elements or non-privacy sensitive prediction
values to obtain model coefficients, the variance-
covariance matrix, the goodness-of-fit test statistic, and
the area under the receiver operating characteristic curve.
Results We conducted experiments on both simulated
and clinically relevant data, and compared the
computational costs of GLORE with those of a traditional
LR model estimated using the combined data. We
showed that our results are the same as those of LR to
a precision of 10⁻¹⁵. In addition, GLORE is computationally
Limitation In GLORE, the calculation of coefficient
gradients must be synchronized at different sites, which
involves some effort to ensure the integrity of
communication. Ensuring that the predictors have the
same format and meaning across the data sets is
Conclusion The results suggest that GLORE performs as
well as LR and allows data to remain protected at their
In biomedical, translational, and clinical research, it
is important to share data to obtain sample sizes
that are meaningful and potentially accelerate
discoveries.[1] This is necessary to expedite pattern
recognition related to relatively rare events or
conditions, such as complications from invasive
procedures, adverse events associated with new
medications, association of disease with a rare gene
variant, and many others. Although electronic data
networks have been established for this purpose, in
the form of disease registries, clinical data ware-
discovery related to clinical trial recruitment, etc,
many of these initiatives are based on federated
models in which the actual data never leave the
institution of origin, for example, as in the model
used at the Clinical Evaluative Sciences in Ontario
(MCHP).[2] However, the statistics and predictive
models that can be developed in these distributed
networks have been very limited, often consisting
somewhat obfuscated to preserve the privacy of
individuals).[3,4] Many clinical pattern recognition
tasks[5–8] are highly complex, involving multiple
factors. To support human decision making in
models[9–16] have been developed and applied in
a clinical context. Recently, various systems were
developed for assisting with tasks as diverse as
automatically discovering drug treatment patterns
in electronic health records,[17] improving patient
safety via automated laboratory-based adverse
event grading,[18] predicting the outcome of renal
transplantation,[19] guiding the treatment of
hypercholesterolemia,[11] making prognoses for patients
undergoing surgical procedures,[20,21] and estimating
the success of assisted reproduction techniques.[22]
Multiple risk calculators for cardiovascular disease
prediction are based on the Framingham study.[13]
Among the most popular prediction models, the
logistic regression (LR)[23] model is widely adopted
in biomedical research, such as in the Model for
End-stage Liver Disease (MELD)[24] and many other
simplicity and the interpretability of the estimated
parameters. In an LR model, the independent variables
constitute a vector X of several variables that
help classify a case as positive or negative, as
represented by the dependent binary variable Y. To
do this, the LR model estimates a coefficient
for each of the independent variables. For example,
the classification of temperature (independent
variable X) into ‘fever’ (dependent variable Y) may
be done by an LR model given sufficient examples,
such that the model ‘discovers’ that temperatures
above 38°C (100.4°F) are associated with ‘fever’.
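The fever example can be made concrete with a minimal sketch. It assumes a single predictor (temperature in °C) and uses hypothetical coefficients, picked by hand so that the predicted probability crosses 0.5 at 38°C; in practice the coefficients would be estimated from labeled examples, and the names `SLOPE`, `INTERCEPT`, `fever_probability`, and `classify` are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (not estimated from data): they are chosen so that
# INTERCEPT + SLOPE * 38.0 == 0, which places the 0.5 probability point at 38 °C.
SLOPE = 4.0
INTERCEPT = -SLOPE * 38.0


def fever_probability(temp_c):
    """Model-based probability that a temperature reading is labeled 'fever'."""
    return sigmoid(INTERCEPT + SLOPE * temp_c)


def classify(temp_c, threshold=0.5):
    """Return True ('fever') when the predicted probability exceeds the threshold."""
    return fever_probability(temp_c) > threshold


if __name__ == "__main__":
    for temp in (36.5, 37.5, 38.0, 38.5, 39.5):
        print(temp, round(fever_probability(temp), 3), classify(temp))
```

A larger slope would make the transition around 38°C sharper; the intercept-to-slope ratio is what fixes the location of the decision boundary.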
The LR is based on a simple sigmoid function (see
figure 1) and is backed by information theory,[28]
which provides a theoretical justification for its
The classic LR model, however, has limitations in
operating on federated data sets, or distributed
data, since the training phase (ie, learning of
parameters) involves looking at all the data, which
are usually located at a central repository. Data that
are distributed across institutions have to be
combined for the classic LR algorithm to work.
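One way to see why pooling is not strictly required is that the gradient and the information matrix of the LR log-likelihood are sums over individual records, so each site can compute its own share and release only those aggregates. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes Newton-Raphson optimization, horizontally partitioned data (same predictors at every site), and a trusted aggregator; the function names and the synthetic data are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def site_summaries(X, y, beta):
    """Run locally at one site; only these aggregates would leave the site."""
    p = sigmoid(X @ beta)
    gradient = X.T @ (y - p)                            # partial score vector
    information = (X * (p * (1.0 - p))[:, None]).T @ X  # partial information matrix
    return gradient, information


def aggregate(sites, beta):
    """Sum the per-site partial results (the role of a central aggregator)."""
    parts = [site_summaries(X, y, beta) for X, y in sites]
    gradient = sum(g for g, _ in parts)
    information = sum(i for _, i in parts)
    return gradient, information


def fit(sites, n_features, max_iter=25, tol=1e-10):
    """Newton-Raphson on the summed gradient and information matrix."""
    beta = np.zeros(n_features)
    for _ in range(max_iter):
        gradient, information = aggregate(sites, beta)
        step = np.linalg.solve(information, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Variance-covariance estimate: inverse information at the final estimate.
    _, information = aggregate(sites, beta)
    return beta, np.linalg.inv(information)


# Synthetic data (hypothetical), split across three "sites".
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(600), rng.normal(size=(600, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(600) < sigmoid(X @ true_beta)).astype(float)

sites = [(X[:200], y[:200]), (X[200:350], y[200:350]), (X[350:], y[350:])]
beta_distributed, cov_distributed = fit(sites, 3)
beta_pooled, cov_pooled = fit([(X, y)], 3)  # classic LR on the combined data

if __name__ == "__main__":
    print("max coefficient difference:",
          np.max(np.abs(beta_distributed - beta_pooled)))
```

Because the summed quantities are mathematically identical to those computed on the combined data, the two fits should differ only by floating-point rounding. Note that every iteration requires all sites to evaluate their summaries at the same current coefficients.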
Although sharing and dissemination can largely
improve the power of the analysis,[29] it is often not
possible to combine distributed data due to
concerns related to the privacy of individuals[30,31] or
still need to be
owing largely to its
An additional appendix is
published online only. To view
this file please visit the journal
Division of Biomedical
Informatics, Department of
Medicine, University of
California San Diego, La Jolla,
Dr Yuan Wu, Division of
Department of Medicine,
University of California San
Diego, La Jolla, CA 92093, USA;
Received 19 January 2012
Accepted 19 March 2012
Published Online First
17 April 2012
This paper is freely available
online under the BMJ Journals
unlocked scheme, see http://
J Am Med Inform Assoc 2012;19:758–764. doi:10.1136/amiajnl-2012-000862
Research and applications
Acknowledgments We thank Dr. Hamish Fraser and Dr. Kelly Zou for providing the
clinical data, and Mr. Kiltesh Patel for helpful discussions. We thank two anonymous
reviewers for their feedback, which helped us improve the original manuscript.
Contributors YW and LOM contributed equally to the writing of this article. The other
authors are ranked according to their contributions.
Funding The authors were funded in part by NIH grants R01LM009520,
U54HL108460, R01HS019913, and UL1RR031980.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
Ohno-Machado L, Bafna V, Boxwala AA, et al. iDASH: integrating data for analysis,
anonymization, and sharing. J Am Med Inform Assoc 2012;19:196–201.
Willison DJ. Use of data from the electronic health record for health research:
current governance challenges and potential approaches. In: Johnston S, Ranford J,
eds. OPC Guidance Documents, Annual Reports to Parliament. Ottawa, Ont: Office of
the Privacy Commissioner of Canada, 2009:1–32.
Murphy SN, Gainer V, Mendis M, et al. Strategies for maintaining patient privacy in
i2b2. J Am Med Inform Assoc 2011;18(Suppl 1):103–8.
Vinterbo SA, Sarwate AD, Boxwala A. Protecting count queries in study design.
J Am Med Inform Assoc 2012;19:750–7.
Denekamp Y, Boxwala AA, Kuperman G, et al. A meta-data model for knowledge in
decision support systems. AMIA Annu Symp Proc 2003:826.
Jiang X. Predictions for Biomedical Decision Support. 2010. http://reports-archive.
Katz MS, Efstathiou JA, D’Amico AV, et al. The ’CaP Calculator’: an online
decision support tool for clinically localized prostate cancer. BJU Int
Ohno-Machado L, Wang SJ, Mar P, et al. Decision support for clinical trial eligibility
determination in breast cancer. Proc AMIA Symp 1999:340–4.
Jiang X, Boxwala A, El-Kareh R, et al. A patient-Driven Adaptive Prediction
Technique (ADAPT) to improve personalized risk estimation for clinical decision
support. J Am Med Inform Assoc 2012;19:e137–e44.
Jiang X, El-Kareh R, Ohno-Machado L. Improving Predictions in Imbalanced Data
Using Pairwise Expanded Logistic Regression. Washington, DC: AMIA Annual
Symposium Proceedings, 2011:625–34.
Karp I, Abrahamowicz M, Bartlett G, et al. Updated risk factor values and the ability
of the multivariable risk score to predict coronary heart disease. Am J Epidemiol
Leslie WD, Lix LM, Johansson H, et al. Independent clinical validation of a Canadian
FRAX tool: fracture prediction and model calibration. J Bone Mineral Res
Wilson PW, D’Agostino RB, Levy D, et al. Prediction of coronary heart disease using
risk factor categories. Circulation 1998;97:1837–47.
Matheny M, Ohno-Machado L, Resnic F. Discrimination and calibration of
mortality risk prediction models in interventional cardiology. J Biomed Inform
Ohno-Machado L. Modeling medical prognosis: survival analysis techniques. J
Biomed Inform 2001;34:428–39.
Ohno-Machado L, Resnic FS, Matheny ME. Prognosis in critical care. Annu Rev
Biomed Eng 2006;8:567–99.
Savova GK, Olson JE, Murphy SP, et al. Automated discovery of drug treatment
patterns for endocrine therapy of breast cancer within an electronic medical record. J
Am Med Inform Assoc 2012;19:e83–e9.
Niland JC, Stiller T, Neat J, et al. Improving patient safety via automated laboratory-
based adverse event grading. J Am Med Inform Assoc 2011;19:111–15.
Lasserre J, Arnold S, Vingron M, et al. Predicting the outcome of renal
transplantation. J Am Med Inform Assoc 2012;19:255–62.
Talos I, Zou K, Ohno-Machado L, et al. Supratentorial low-grade glioma resectability:
statistical predictive analysis based on anatomic MR features and tumor
characteristics. Radiology 2006;239:506–13.
Resnic F, Ohno-Machado L, Selwyn A, et al. Simplified risk score models accurately
predict the risk of major in-hospital complications following percutaneous coronary
intervention. Am J Cardiol 2001;88:5–9.
Racowsky C, Ohno-Machado L, Kim J, et al. Is there an advantage in scoring early
embryos on more than one day? Hum Reprod 2009;24:2104–13.
Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: Wiley-
Kamath PS, Kim W. The model for end-stage liver disease (MELD). Hepatology
Boxwala AA, Kim J, Grillo JM, et al. Using statistical and machine learning to help
institutions detect suspicious access to electronic health records. J Am Med Inform
Frisse ME, Johnson KB, Nian H, et al. The financial impact of health information
exchange on emergency department care. J Am Med Inform Assoc
Seidling HM, Phansalkar S, Seger DL, et al. Factors influencing alert acceptance:
a novel approach for predicting the success of clinical decision support. J Am Med
Inform Assoc 2011;18:479–84.
Shtatland ES, Barton MB. Information theory makes logistic regression special.
Annual Conference of NorthEast SAS Users’ Group (neseg). Pittsburgh, PA: NESEG,
Osl M, Dreiseitl S, Kim J, et al. Effect of data combination on predictive modeling:
a study using gene expression data. AMIA Annu Symp Proc 2010:567–71.
El Emam K, Hu J, Mercer J, et al. A secure protocol for protecting the identity of
providers when disclosing data for disease surveillance. J Am Med Inform Assoc
Loukides G, Denny JC, Malin B. The disclosure of diagnosis codes can breach
research participants’ privacy. J Am Med Inform Assoc 2010;17:322–7.
Vaszar LT, Cho MK, Raffin TA. Privacy issues in personalized medicine.
Sweeney L. Privacy and medical-records research. N Engl J Med 1998;338:1077.
author reply 77–8.
Calloway SD, Venegas LM. The new HIPAA law on privacy and confidentiality. Nurs
Adm Q 2002;26:40–54.
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an
application to boosting. J Comput Syst Sci 1997;55:119–39.
Vapnik VN. The Nature of Statistical Learning Theory. New York: Springer-Verlag,
Gambs S, Kégl B, Aïmeur E. Privacy-preserving boosting. Data Min Knowl Discov
Vaidya J, Yu H, Jiang X. Privacy-preserving SVM classification. Knowl Inform Syst
Yu H, Jiang X, Vaidya J. Privacy-preserving SVM using nonlinear kernels on
horizontally partitioned data. Symposium on Applied Computing (SAC). Dijon, France:
Yu H, Vaidya J, Jiang X. Privacy-preserving SVM classification on vertically partitioned
data. Advances in Knowledge Discovery and Data Mining. 2006;3918:647–56.
Chu CT, Kim SK, Lin YA, et al. Map-reduce for machine learning on multicore. Adv
Neural Inform Process Syst 2007;19:281–88.
Sanil AP, Karr AF, Lin X, et al. Privacy Preserving Regression Modelling via
Distributed Computation. Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. Seattle, WA: ACM,
Minka T. A Comparison of Numerical Optimizers for Logistic Regression. Pittsburgh,
PA: Carnegie Mellon University, Technical Report, 2003.
Hosmer DW, Hosmer T, Le Cessie S, et al. A comparison of goodness-of-fit tests for
the logistic regression model. Stat Med 1997;16:965–80.
Kramer AA, Zimmerman JE. Assessing the calibration of mortality benchmarks in
critical care: the Hosmer-Lemeshow test revisited. Crit Care Med 2007;35:2052–56.
Lemeshow S, Hosmer DW. A review of goodness of fit statistics for use in the
development of logistic regression models. Am J Epidemiol 1982;115:92–106.
Hanley J, McNeil B. The meaning and use of the area under a receiver operating
characteristic (ROC) curve. Radiology 1982;143:29–36.
Harrell F, Califf R, Pryor D, et al. Evaluating the yield of medical tests. JAMA
Zou KH, Liu AI, Bandos AI, et al. Statistical Evaluation Of Diagnostic Performance:
Topics in ROC Analysis. Boca Raton, FL: Chapman & Hall/CRC Biostatistics Series,
Lasko TA, Bhagwat JG, Zou KH, et al. The use of receiver operating characteristic
curves in biomedical informatics. J Biomed Inform 2005;38:404–15.
Kennedy RL, Burton AM, Fraser HS, et al. Early diagnosis of acute myocardial
infarction using clinical and electrocardiographic data at presentation: derivation and
evaluation of logistic regression models. Eur Heart J 1996;17:1181–91.
Weber GM, Murphy SN, McMurry AJ, et al. The Shared Health Research
Information Network (SHRINE): a prototype federated query tool for clinical data
repositories. J Am Med Inform Assoc 2009;16:624–30.
Stolba N, Nguyen TM, Tjoa AM. Data Warehouse Facilitating Evidence-Based
Medicine. In: Nguyen TM, ed. Complex Data Warehousing and Knowledge Discovery
for Advanced Retrieval Development: Innovative Methods and Applications. Hershey,
PA: IGI Global, 2010:174–207.
Dwork C. Differential privacy. Proceedings of the 33rd International Colloquium on
Automata, Languages and Programming (ICALP) (2); 10–14 July 2006, Venice, Italy.
Germany: Springer, 2006.