
Grid Binary LOgistic REgression (GLORE): building shared models without sharing data

Yuan Wu, Xiaoqian Jiang, Jihoon Kim, Lucila Ohno-Machado

ABSTRACT

Objective The classification of complex or rare patterns in clinical and genomic data requires the availability of a large, labeled patient set. While methods that operate on large, centralized data sources have been extensively used, little attention has been paid to understanding whether models such as binary logistic regression (LR) can be developed in a distributed manner, allowing researchers to share models without necessarily sharing patient data.

Material and methods Instead of bringing data to a central repository for computation, we bring computation to the data. The Grid Binary LOgistic REgression (GLORE) model integrates decomposable partial elements or non-privacy-sensitive prediction values to obtain model coefficients, the variance-covariance matrix, the goodness-of-fit test statistic, and the area under the receiver operating characteristic (ROC) curve.

Results We conducted experiments on both simulated and clinically relevant data, and compared the computational costs of GLORE with those of a traditional LR model estimated using the combined data. We showed that our results are the same as those of LR to a precision of 10^-15. In addition, GLORE is computationally efficient.

Limitation In GLORE, the calculation of coefficient gradients must be synchronized at different sites, which involves some effort to ensure the integrity of communication. Ensuring that the predictors have the same format and meaning across the data sets is necessary.

Conclusion The results suggest that GLORE performs as well as LR and allows data to remain protected at their original sites.

An additional appendix is published online only. To view this file please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2012-000862).

Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, California, USA

Correspondence to Dr Yuan Wu, Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA; y6wu@ucsd.edu

Received 19 January 2012
Accepted 19 March 2012
Published Online First 17 April 2012

This paper is freely available online under the BMJ Journals unlocked scheme, see http://jamia.bmj.com/site/about/unlocked.xhtml

J Am Med Inform Assoc 2012;19:758–764. doi:10.1136/amiajnl-2012-000862

Research and applications

INTRODUCTION

In biomedical, translational, and clinical research, it is important to share data to obtain sample sizes that are meaningful and potentially accelerate discoveries.1 This is necessary to expedite pattern recognition related to relatively rare events or conditions, such as complications from invasive procedures, adverse events associated with new medications, association of disease with a rare gene variant, and many others. Although electronic data networks have been established for this purpose, in the form of disease registries, clinical data warehouses for quality improvement, and cohort discovery related to clinical trial recruitment, etc, many of these initiatives are based on federated models in which the actual data never leave the institution of origin, for example, as in the models used at the Institute for Clinical Evaluative Sciences (ICES) in Ontario and the Manitoba Centre for Health Policy (MCHP).2 However, the statistics and predictive models that can be developed in these distributed networks have been very limited, often consisting of simple counts (which still need to be somewhat obfuscated to preserve the privacy of individuals).3 4 Many clinical pattern recognition tasks5–8 are highly complex, involving multiple factors. To support human decision making in complex situations, numerous prediction models9–16 have been developed and applied in a clinical context. Recently, various systems were developed for assisting with tasks as diverse as automatically discovering drug treatment patterns in electronic health records,17 improving patient safety via automated laboratory-based adverse event grading,18 predicting the outcome of renal transplantation,19 guiding the treatment of hypercholesterolemia,11 making prognoses for patients undergoing surgical procedures,20 21 and estimating the success of assisted reproduction techniques.22 Multiple risk calculators for cardiovascular disease prediction are based on the Framingham study.13 Among the most popular prediction models, the logistic regression (LR)23 model is widely adopted in biomedical research, as in the Model for End-stage Liver Disease (MELD)24 and many other clinical applications,25–27 owing largely to its simplicity and the interpretability of the estimated parameters. In an LR model, the independent variables constitute a vector X of several variables that help classify a case into positive or negative as represented by the dependent binary variable Y. In order to do this, the LR model estimates coefficients for each of the independent variables. For example, the classification of temperature (independent variable X) into ‘fever’ (dependent variable Y) may be done by an LR model and sufficient examples, such that the model ‘discovers’ that temperatures above 38°C (100.4°F) are associated with ‘fever’. The LR is based on a simple sigmoid function (see figure 1) and is backed by information theory,28 which provides a theoretical justification for its power.

The classic LR model, however, has limitations in operating on federated data sets, or distributed data, since the training phase (ie, learning of parameters) involves looking at all the data, which are usually located at a central repository. Data that are distributed across institutions have to be combined for the classic LR algorithm to work. Although sharing and dissemination can largely improve the power of the analysis,29 it is often not possible to combine distributed data due to concerns related to the privacy of individuals30 31 or

the privacy of institutions.32–34 Such a scenario brings new challenges to the classic learning problem of the LR model. Although we, among others, have shown that certain machine learning models like boosting35 and support vector machines36 can be trained in a distributed fashion37–40 and produce accurate models, extending this advantage to LR requires a specialized strategy. A recent work to compute LR with Map-reduce41 is most relevant to our work, but its focus lies in parallelization for computational speed rather than for privacy-preserving data analysis. Furthermore, it does not elaborate on how to provide evaluation indices for these models (eg, the Hosmer-Lemeshow goodness-of-fit test or areas under the receiver operating characteristic (ROC) curve (AUCs)) in a distributed fashion. Researchers often refine their models for inclusion or exclusion of some predictors, variable transformations, and other preprocessing steps. Without evaluation strategies that can be privacy-preserving, the value of performing LR in a distributed fashion is very limited. Previous work by other authors in privacy-preserving distributed linear regression was based on vertical partitions of data: multiple data owners each had different attributes for the same observation.42 Our previous work in distributed support vector machines is also related to vertical partitions.40 In this manuscript, we propose a new algorithm, Grid Binary LOgistic REgression (GLORE), to fit an LR model in a distributed fashion using information from locally hosted databases containing different observations that share the same attributes (ie, horizontal partitions of data: stackable sets of patient records), without sharing the sensitive original patient data from these databases, as shown in figure 2. This is not a trivial task: distributed linear regression is much easier to implement than distributed logistic regression, since there is a closed form solution for the former but not for the latter.

Specifically, GLORE estimates the coefficients of an ordinary LR model by integrating decomposable intermediary results (and not the actual patient data) to calculate model parameters and test statistics. The resulting model is calculated in a privacy-preserving manner and performs as well as classic LR. In the Methodology section, we will discuss details for estimating model coefficients, estimating the variance-covariance matrix of coefficients, performing a goodness-of-fit test, and calculating the AUC in a distributed fashion. Commonly reported statistics for LR, including CIs, Z test statistics and their p values, and ORs can be obtained by using these estimated coefficients and their variance-covariance matrix.

METHODOLOGY

The LR model is an instance of a generalized linear model with a logit (ie, f(z) = log(z/(1 - z))) link function (illustrated in figure 1):

log [P(Y = 1|X) / (1 - P(Y = 1|X))] = Xb,

where P(Y = 1|X) = p(X;b) = 1/(1 + e^(-Xb)) and P(Y = 0|X) = 1 - p(X;b) = 1/(1 + e^(Xb)) denote the probabilities of an event Y being 1 (ie, ‘fever’) or 0 (ie, absence of ‘fever’), conditioned on observation of a feature vector X (eg, 101°F), respectively. The parameter
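The logit link and its sigmoid inverse can be checked with a few lines of code. Below is a minimal Python sketch of the two functions (purely illustrative and ours; the paper's own experiments were run in R):

```python
import math

def logit(z):
    # The link function: f(z) = log(z / (1 - z)), defined for 0 < z < 1.
    return math.log(z / (1.0 - z))

def sigmoid(xb):
    # The inverse link: p(X;b) = 1 / (1 + exp(-Xb)) maps the log odds Xb
    # back to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-xb))
```

For any probability p in (0, 1), sigmoid(logit(p)) recovers p, which is the relationship between the two displayed formulas.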

Figure 1 Illustration of a logistic regression model using one-dimensional data (for example, X = body temperature). p(X;b) is a sigmoid function relating temperature and the probability that a particular record contained the word ‘fever’. Dots on the upper and lower horizontal lines correspond to positive (‘fever’) and negative (absence of ‘fever’) observations, respectively. Beta is the estimated parameter.

Figure 2 Pipeline for the Grid Binary LOgistic REgression (GLORE) model. Data sets hosted in different institutions (ie, A, B, and C) are processed locally through the same virtual engine (ie, GLORE code) to compute non-sensitive intermediary results, which are exchanged and combined to obtain the final global model parameters at the central site. A similar distributed process is executed for evaluation of the model.


vector b corresponds to the set of weights or coefficients that need to be estimated and that will be multiplied by the values for the features X (ie, Xb) to make predictions.

Estimating model coefficients

In order to explain how GLORE works, it is important to remind readers how traditional LR works. To estimate the final value for the parameter vector b, an LR model iteratively maximizes the likelihood of the observed data given b. At each iteration, the algorithm produces b(k+1), which is based on the previous b(k). The initial b can start with all coefficients set to zero, eg, the estimated probability P(Y = 1|X) = p(X;b) = 1/(1 + e^(-Xb)) is based on observations of a binary response Y and a feature vector X (ie, a set of predictors) from each of the data sites. To compute the maximum likelihood of getting the observed data, the LR estimation algorithm first finds the derivative of the log likelihood function, then applies the Newton-Raphson method43 to find its maximum, that is, the value of b for which the derivative function equals zero. For details of the log likelihood function, please refer to section 1 of the online supplementary appendix. We also explain how the Newton-Raphson method works in section 4 of the online supplementary appendix.

In each Newton-Raphson iteration for the LR parameter estimation problem, the first and second derivatives of the log likelihood function are both decomposable, that is, they can be calculated separately for a subset of observations and then combined, with the same result as if they were calculated on the complete set. Hence, a GLORE update can be finished by combining intermediary results from across all local sites. Please refer to the online appendix (sections 1 and 2) for technical details on model coefficient estimation.

Because intermediary results from individual sites do not lose any information, GLORE guarantees accurate estimation of parameters through summation. Note that the information exchanged consists of partially aggregated intermediary results rather than the raw data; hence the exchange is more privacy-preserving than would be the case if we transmitted all patient data to a central site.

Furthermore, since the calculations can be done in parallel, each step takes only as long as the maximum time across the sites (ie, the slowest site will determine how long each step takes). The time to transmit the intermediary results is usually negligible, as only one vector of coefficient adjustments needs to be sent. After setting the same initial values, at each iteration, GLORE uses the summation of partial intermediary results from multiple sites to update the coefficients and sends them back to the sites for another iteration.
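The iteration just described can be sketched compactly. The following Python code is our illustration under stated assumptions (numpy for the linear algebra; names such as `site_contributions` and `glore_fit` are ours, not from the GLORE code): each site computes its decomposable pieces, and the central engine sums them and solves for the next coefficient vector.

```python
import numpy as np

def site_contributions(X, y, beta):
    # Decomposable pieces of one Newton-Raphson step, computed locally:
    # the (m+1)x(m+1) information matrix X'WX and the gradient X'(y - p).
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = p * (1.0 - p)                      # diagonal of the weight matrix
    H = (X * W[:, None]).T @ X             # X' W X
    g = X.T @ (y - p)                      # gradient of the log likelihood
    return H, g

def glore_fit(sites, m, tol=1e-6, max_iter=25):
    # Central engine: sum the partial results from every site, update
    # beta, and send it back; the raw records never leave the sites.
    beta = np.zeros(m + 1)
    for _ in range(max_iter):
        H = np.zeros((m + 1, m + 1))
        g = np.zeros(m + 1)
        for X, y in sites:                 # only H_k and g_k are exchanged
            Hk, gk = site_contributions(X, y, beta)
            H += Hk
            g += gk
        step = np.linalg.solve(H, g)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Here `sites` is a list of (X, y) pairs, one per institution, with the first column of each X set to the constant 1 for the intercept. Because addition over horizontal partitions is exact, the fitted coefficients match a pooled fit to machine precision.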

A central engine is efficient in terms of computation, but the process could occur via peer-to-peer transmittal of intermediary results. Assuming that there are m features (ie, variables) available over multiple sites, at each iteration, intermediary results consisting of an (m+1)×(m+1) matrix (ie, the variance-covariance matrix of coefficients) and an (m+1)-dimensional vector (ie, the gradients for coefficient adjustment) from each site must be transmitted to the central engine or to all other sites, depending on the design choice. The GLORE framework with a central computing node can reduce the probability of network delays when compared to the GLORE framework based on a peer-to-peer architecture.


Estimating the variance-covariance matrix

Besides model coefficients, variance-covariance matrix estima-

tion is also important, as it is necessary for the computation of

the CIs of individual predictions.9Like the model coefficient

estimation, it can be done by integrating decomposable partial

elements. Please refer to the online supplementary appendix

(section 2) for technical details.
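A sketch of this one-time step, under the same illustrative assumptions as before (numpy; the function name is ours): once the coefficients have converged, each site ships its partial information matrix X'WX once, and the central engine inverts the sum to obtain the variance-covariance matrix and standard errors.

```python
import numpy as np

def glore_covariance(site_X, beta):
    # One-time exchange: each site computes its decomposable piece
    # X' W X at the converged beta; the central engine sums the pieces
    # and inverts the total information matrix.
    m1 = beta.shape[0]
    H = np.zeros((m1, m1))
    for X in site_X:
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                  # diagonal weights p(1 - p)
        H += (X * W[:, None]).T @ X
    cov = np.linalg.inv(H)
    se = np.sqrt(np.diag(cov))             # standard errors for CIs/Z tests
    return cov, se
```

Because the site pieces sum exactly to the pooled information matrix, the resulting standard errors match those of a centralized fit.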

Estimating goodness-of-fit test statistics

The Hosmer and Lemeshow (H-L) test23 44 45 is commonly used to check model fit for LR. This section discusses how to perform an H-L test to check GLORE's fit, without sharing patient data. The null and alternative hypotheses of the H-L test are that (1) ‘the model provides an adequate fit,’ and (2) ‘the model does not fit the data,’ respectively. That is, when the p values for this test are below 0.05, we can reject the hypothesis that the model fits the data well, meaning that the model is not well calibrated. To calculate the H-L statistic, we have to sort cases by their predictions and create groups from which we establish a proxy for the ‘true probability’ (ie, the fraction of positive cases in the group). Let us denote Oj as the number of positive observations in the jth group and Ej as the sum of predictions in the jth group, respectively. The parameter nj refers to the number of records in the jth group. In the box ‘Algorithm 1’ below, we introduce a procedure to obtain the H-L test statistic for GLORE (ie, the C version of this test, which is based on percentiles to determine the groups and is more robust than the H version, which is based on fixed thresholds46) without sharing patient data (ie, individual patient features or individual patient outcomes). In the following algorithm, class labels are not shared with all other sites or the central engine. Instead, only the total number of labeled records with outcome ‘1’ per group from a site needs to be transmitted to the central engine.

Estimating the area under the ROC curve (AUC)

The AUC47 is widely used to evaluate the discrimination performance of predictive models. To calculate the AUC, it is necessary to estimate the total number of pairs for which positive observations (ie, ‘one’-labeled records) rank higher than negative observations (ie, ‘zero’-labeled records). The box ‘Algorithm 2’ below shows the details of how AUCs are estimated without sharing individual patient features or patient outcomes. Besides transmissions between the central engine and all sites,

Algorithm 1 Computing the H-L statistic for GLORE (C version)

Step 1. Each site transmits probability estimates, that is, p(xi; b̂)'s for their observations to the central engine.
Step 2. The central engine sorts all p(xi; b̂)'s and evenly groups the sorted p(xi; b̂)'s into deciles*, and computes the estimated probability for each decile as the sum of predictions in that decile.
Step 3. The central engine sends the indices of predictions in each decile to their original sites.
Step 4. Each site finds the number of records labeled ‘1’ in each decile among its own records based on the indices from the central engine, and transmits this number to the central engine.
Step 5. The central engine combines the numbers of records labeled ‘1’ from all sites to obtain the total number of records labeled ‘1’ in each category, and, together with the results from step 2, computes the H-L statistic.

*Deciles are commonly used in the H-L C test, but other types of percentiles can be used depending on the size of the data set.
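Collapsing the exchanges of Algorithm 1 into a single process gives a short sketch of the computation. This is our illustration (pure Python; function and variable names are ours): the pooled, label-free predictions are sorted into groups, and only the per-group counts of ‘1’-labeled records, the quantities transmitted in steps 4 and 5, enter the statistic together with the per-group prediction sums.

```python
def hl_statistic_glore(site_preds, site_labels, groups=10):
    # Pool (prediction, site, index) tuples at the 'central engine';
    # no class labels are attached to the pooled predictions.
    tagged = [(p, s, i)
              for s, preds in enumerate(site_preds)
              for i, p in enumerate(preds)]
    tagged.sort()                                   # step 2: sort and group
    n = len(tagged)
    edges = [round(j * n / groups) for j in range(groups + 1)]
    stat = 0.0
    for j in range(groups):
        group = tagged[edges[j]:edges[j + 1]]
        n_j = len(group)
        e_j = sum(p for p, _, _ in group)           # sum of predictions
        # steps 4-5: each site reports only its count of '1' labels
        o_j = sum(site_labels[s][i] for _, s, i in group)
        pbar = e_j / n_j
        stat += (o_j - e_j) ** 2 / (n_j * pbar * (1.0 - pbar))
    return stat
```

Because sorting the pooled predictions does not depend on how records are partitioned across sites, the statistic is identical to the one a centralized H-L C test would produce.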


this algorithm requires peer-to-peer communication to keep patient outcomes protected.

In figure 3 we use a simple artificial example with only two sites (A and B) to explain how algorithm 2 works. pA and OA indicate predicted probabilities and class labels from site A, and pB and OB are predicted probabilities and class labels from site B. Note that in procedures 1 and 5, only predicted probabilities (ie, no class labels) are sent to the other site, as in our previous strategy for H-L tests. The AUC, which is equivalent to the c-index,48 can be calculated in three steps: (1) count all one-labeled records that have predictions that are larger than the predictions across all zero-labeled records; (2) count all one-labeled records that have predictions that are equal to the predictions across all zero-labeled records; and (3) sum the counts of step (1) and half the counts of step (2) and divide this number by the total number of pairs formed by zero-labeled and one-labeled records.47

Sometimes, we might also want to display an entire ROC curve instead of calculating a single AUC score. In this case, using each prediction p(xi;b) as a threshold, a full contingency table (ie, true positive, false positive, true negative, and false negative counts) can be calculated for each threshold. The ROC curve results from the points (1 - specificity, sensitivity) that are calculated from each of these contingency tables. Please refer to Zou et al49 and a review article by Lasko et al50 for details on ROC curves. In GLORE, one site needs to send all predictions and their corresponding contingency tables to the central engine. The central engine then needs to merge the information to compute the sensitivity and specificity at all thresholds. It is worth noting that, although this

Algorithm 2 Computing the AUC using GLORE

Step 1. Each site transmits probability estimates, that is, p(xi; b̂)'s of their observations to all other sites.
Step 2. Each site finds the ranking* of each predicted probability transmitted from all other sites among the zero-labeled records in this site, and sends the rankings of these probabilities back to their original sites.
Step 3. Each site calculates the rank sum for each of its one-labeled records using the ranks sent back from all other sites.
Step 4. Each site finds the ranking* of each of its one-labeled records among the zero-labeled records in this site.
Step 5. Each site computes the sums of the ranks regarding its own one-labeled records using the intermediary results from steps 3 and 4.
Step 6. Each site sends rank sums from step 5 and counts of its own one-labeled and zero-labeled records to the central engine.
Step 7. The central engine computes the AUC as the summation of all rank sums from step 6 divided by the product of the total one-labeled and zero-labeled records.

*Ranking is the sum of the number of zero-labeled records with smaller predictions and half the number of zero-labeled records with equal predictions.
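Algorithm 2 reduces to a rank-sum computation. Below is our in-process Python sketch (names are ours; the message exchanges of steps 1-6 become plain loops): each one-labeled prediction is ranked against the zero-labeled records of every site, with ties counting one half, and the AUC is the total rank sum divided by the number of zero-one pairs.

```python
def rank_among_zeros(p, zero_preds):
    # 'Ranking' as defined in Algorithm 2: zeros with smaller predictions
    # count 1, zeros with equal predictions count 1/2.
    smaller = sum(1 for z in zero_preds if z < p)
    equal = sum(1 for z in zero_preds if z == p)
    return smaller + 0.5 * equal

def glore_auc(site_preds, site_labels):
    # Each one-labeled prediction is ranked against every site's
    # zero-labeled records (local and remote); the central engine only
    # ever receives per-site rank sums and counts, never labels.
    zeros = [[p for p, y in zip(ps, ys) if y == 0]
             for ps, ys in zip(site_preds, site_labels)]
    ones = [[p for p, y in zip(ps, ys) if y == 1]
            for ps, ys in zip(site_preds, site_labels)]
    total_rank = 0.0
    for one_preds in ones:                 # one site's rank sums (step 5)
        for p in one_preds:
            for z in zeros:                # local + remote zero records
                total_rank += rank_among_zeros(p, z)
    n1 = sum(len(o) for o in ones)
    n0 = sum(len(z) for z in zeros)
    return total_rank / (n1 * n0)          # step 7 at the central engine
```

This is exactly the three-step pairwise counting described in the text for the c-index, so the result matches a centralized AUC computation.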

Figure 3 Calculating the area under the curve (AUC) using GLORE. (A) Exchange numbers of one-labeled and zero-labeled records between site A and site B. (B) Compute rank sums for records in A. 1: Calculate the rank of each probability in A among zero-labeled records in B. 2: Calculate the rank of each one-labeled probability in A among zero-labeled records in A. 3: Find the one-labeled records from step 1 (ie, bounded in red boxes). 4: Combine ranks for one-labeled records from procedures 2 and 3 to get the rank sums for A. (C) Compute rank sums for records in site B. 5: Calculate the rank of each probability in B among zero-labeled records in A. 6: Calculate the rank of each one-labeled probability in B among zero-labeled records in B. 7: Find the one-labeled records from step 5 (ie, bounded in red boxes). 8: Combine ranks for one-labeled records from procedures 6 and 7 to get the rank sums for B.


procedure is straightforward, it may lead to more privacy leakage than algorithm 1, since class labels (ie, patient outcomes) associated with predicted probabilities can be recovered by the central or peer site, depending on the strategy selected.

Remark

We verified in both simulated and clinical data experiments that the proposed GLORE will produce the same estimated coefficients, that is, with precision O(10^-15), together with accurate variance-covariance matrix estimation, goodness-of-fit statistics, and AUC, when compared to the classic LR trained in a centralized manner. It is also worth noting that, although the GLORE coefficient estimation process needs to transmit intermediary results in each Newton-Raphson iteration, usually a small number (<15) of iterations is necessary to achieve high precision such as O(10^-6). After the parameter estimation is done, only a one-time data transmission is needed for estimating the variance-covariance matrix, computing the model fit statistic, and computing the AUC.

Experiments

We used the statistical computing language R to conduct our experiments with simulated data, in which the true generating model is known, and also on clinical data to validate GLORE.

Simulation study

In a simulation study, we compared two-site GLORE (assuming data are evenly partitioned between two sites) and ordinary LR (combining all data for computation). We used a total sample size of 1000 (500 for each site) and nine features (ie, variables). First, we simulated all features from a standard normal distribution, then simulated the response from a binomial distribution assuming that the log odds of the response being 1 was a linear function of the features (ie, all coefficients were set to 1). We conducted the study on 100 runs to compare the coefficient estimation difference between GLORE and LR for the same simulated data. This simple study shows that the number of Newton-Raphson iterations to convergence is always six when 10^-6 precision is set for the iteration stop criterion.
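The data-generating setup of this simulation can be sketched as follows. This is our Python illustration of the description above (the original experiments were run in R; the intercept is also set to 1 here, an assumption consistent with "all coefficients were set to 1"):

```python
import math
import random

def simulate(n=1000, m=9, seed=0):
    # Features drawn from a standard normal distribution; the response
    # is binomial with log odds that are a linear function of the
    # features, all coefficients (including the intercept) set to 1.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(m)]
        logodds = 1.0 + sum(x)                 # all coefficients equal 1
        p = 1.0 / (1.0 + math.exp(-logodds))
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    # even two-site horizontal partition: 500 records per site
    half = n // 2
    return (X[:half], y[:half]), (X[half:], y[half:])
```

Each returned pair (X, y) plays the role of one site's locally held records in the two-site comparison.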

Table 1 shows the mean absolute difference between two-site GLORE and LR estimations for all 10 coefficients (nine features plus one intercept) at each iteration, where the mean is calculated over the 100 runs. There are no substantial differences between estimations from GLORE and LR for any coefficient at any iteration. We also pick one of the 100 runs to graphically present the convergence paths of the estimations for three coefficients (intercept, X1, and X2) in figure 4, showing that there is no difference between the two convergence paths for these three coefficients.

Experiments using a clinical data set

Our clinical data set is related to myocardial infarction at Edinburgh, UK,51 and has one binary outcome, 48 features, and 1253 records; it was used to illustrate GLORE with two sites. Records are evenly partitioned (627 vs 626) between the two sites. We picked nine non-redundant features in this data set using methods described in Kennedy et al51 for GLORE fitting. We also used another data set,49 which contains two cancer biomarkers (CA-19 and CA-125). This data set has 141 records, one binary outcome denoting the presence or absence of cancer, and two features denoting the two cancer markers. The 141 records were split into 71 and 70 for the two sites. Tables 2 and 3 show coefficient estimates and their standard errors, Z test statistics, and p values for the Edinburgh data and for the CA-19 and CA-125 cancer marker data, respectively. Using algorithm 1, the H-L test statistic equals 8.036 with a p value of 0.430 for the Edinburgh data, and the H-L test statistic equals 3.510 with a p value of 0.898 for the CA-19 and CA-125 data, which are no different from the results obtained from traditional centralized LR models. Seven and 12 Newton-Raphson iterations were needed for convergence with 10^-6 precision for the Edinburgh data and the CA-19/CA-125 data, respectively. In addition, using algorithm 2, we found that the AUCs were 0.965 and 0.891 for the Edinburgh data and for the CA-19 and CA-125 data, respectively, which are no different from the results obtained from traditional centralized LR models.


Figure 4 The convergence paths of the two-site GLORE estimations for intercept, X1, and X2. The estimation difference between GLORE and classic LR is smaller than 10^-15 for all iterations, as shown in table 1.

Table 1 Mean absolute difference between two-site GLORE and LR estimations

Coefficient  1st       2nd       3rd       4th       5th       6th
Intercept    7.33E-17  3.05E-16  2.09E-16  1.11E-16  8.55E-17  8.88E-17
X1           4.42E-16  3.34E-16  2.52E-16  1.02E-16  8.77E-17  8.88E-17
X2           5.30E-16  3.15E-16  2.35E-16  1.02E-16  8.44E-17  8.66E-17
X3           4.46E-16  2.60E-16  2.28E-16  1.21E-16  7.99E-17  9.21E-17
X4           3.87E-16  3.12E-16  2.29E-16  1.19E-16  7.55E-17  8.10E-17
X5           4.88E-16  3.19E-16  2.11E-16  1.18E-16  8.66E-17  8.55E-17
X6           4.62E-16  3.09E-16  2.45E-16  1.30E-16  8.10E-17  7.55E-17
X7           4.25E-16  2.96E-16  2.45E-16  1.29E-16  9.21E-17  8.88E-17
X8           4.61E-16  3.06E-16  2.43E-16  1.24E-16  6.44E-17  7.77E-17
X9           4.45E-16  3.26E-16  2.48E-16  1.17E-16  7.77E-17  8.55E-17
