# Aggregating published prediction models with individual participant data: a comparison of different approaches.

**ABSTRACT** During recent decades, interest in prediction models has substantially increased, but approaches to synthesize evidence from previously developed models have failed to keep pace. This causes researchers to ignore potentially useful past evidence when developing a novel prediction model with individual participant data (IPD) from their population of interest. We aimed to evaluate approaches to aggregate previously published prediction models with new data. We consider the situation that models are reported in the literature with predictors similar to those available in an IPD dataset. We adopt a two-stage method and explore three approaches to calculate a synthesis model, relying on the principles of multivariate meta-analysis. The first approach employs a naive pooling strategy, whereas the other two account for within-study and between-study covariance. These approaches are applied to a collection of 15 datasets of patients with traumatic brain injury, and to five previously published models for predicting deep venous thrombosis. Here, we illustrate how the generally unrealistic assumption of consistency in the availability of evidence across included studies can be relaxed. Results from the case studies demonstrate that aggregation yields prediction models with improved discrimination and calibration in a vast majority of scenarios, and results in equivalent performance (compared with the standard approach) in a small minority of situations. The proposed aggregation approaches are particularly useful when few participant data are at hand. Assessing the degree of heterogeneity between IPD and literature findings remains crucial to determine the optimal approach in aggregating previous evidence into new prediction models. Copyright © 2012 John Wiley & Sons, Ltd.




Thomas P. A. Debray (a,∗), Hendrik Koffijberg (a), Yvonne Vergouwe (b), Karel G. M. Moons (a,†) and Ewout W. Steyerberg (b,†)


Keywords: Prediction research; Prediction models; Meta-analysis; Logistic regression; Multivariable; Bayesian inference

Cite as: Debray, TP., Koffijberg, H., Vergouwe, Y., Moons, KGM., Steyerberg EW. (2012). Aggregating published prediction models with individual participant data: a comparison of different approaches. Statistics in Medicine, Accepted for publication. DOI: 10.1002/sim.5412

(a) Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands.

(b) Center for Medical Decision Sciences, Department of Public Health, Erasmus Medical Center Rotterdam, Rotterdam, The Netherlands.

(∗) Correspondence to: Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Stratenum 6.131, P.O.Box 85500, 3508GA Utrecht, The Netherlands. E-mail: T.Debray@umcutrecht.nl

Contract/grant sponsor: Netherlands Organization for Scientific Research (9120.8004, 918.10.615 and 916.11.126)

(†) Equal Contribution


T. P. A. Debray, H. Koffijberg, Y. Vergouwe et al.

1. Introduction

It is well known that many prediction models do not generalize well across patient populations [1, 2, 3, 4, 5, 6]. This

quandary may occur, e.g., when prediction models are developed from small data sets, when too many predictors were

studied compared to the effective sample size, or when the population in which the model is validated or applied diverges

(substantially) from the population where the model was developed. Although the use of larger datasets for model

development offers a straightforward solution, in practice this option is frequently not feasible due to, for example,

cost constraints, ethical considerations or inclusion problems.

It is remarkable that despite the scarcity of individual participant data, there is an abundance of prediction models in the

medical literature, even for the same clinical problem. For example, there are over 60 published models aiming to predict

outcome after breast cancer [7, 8], over 25 for predicting long-term outcome in neurotrauma patients [9], and about 10 to

diagnose venous thromboembolism. This dispersion of information reduces the scientific and clinical utility of prognostic

research overall. Prior knowledge from previous research goes unused and clinicians are left to pick from a cacophony of

unreliable prognostic models with limited scope. This is undesirable for all parties involved.

Conceptually, combining prior knowledge from multiple studies is already widespread in etiologic and intervention

research, in the form of meta-analyses [10]. More elaborate approaches, e.g. for synthesizing the accuracy of diagnostic

tests [11], have also recently emerged but remain largely lacking in prediction research, despite the fact that the potential

gains are arguably even greater [12]. The closest existing equivalent techniques focus upon updating of existing prediction

models that are being applied to a different setting [3, 5, 13, 14, 15]. Approaches for using prior knowledge in prediction

research are underdeveloped [12]. Some published approaches rely on evidence that is typically not published, such as

covariance matrices or regression coefficients, or lack a formal statistical foundation [16, 17].

We aimed to investigate how previously published prediction models or studies can be used in the development of

a (new) prediction model when published models and the individual participant data incorporate similar predictors. We

realize that published prediction models often differ in their composition through the inclusion of different covariates in

the models, the transformations and coding applied, and adjustment for overfitting [18, 19]. We here assume as a start that

identical model formulations are available for the published prediction models.

We adopt the two-stage method proposed by Riley et al [20] and explore three approaches to aggregate the published

prediction models (with similar predictors) with individual participant data (IPD). These approaches reduce the available

IPD to Aggregate Data (AD), and combine this evidence with the AD from the literature (i.e. the published prediction

models). The first two approaches calculate an overall synthesis model, whereas the third approach employs a Bayesian

perspective to adapt the coefficients of previously published prediction models with the IPD at hand. The approaches are

evaluated here through testing the predictive performance of prediction models for 6 month outcome in 15 Traumatic Brain

Injury (TBI) datasets [21, 22]. In addition, we illustrate their application in a genuine example involving the prediction of

Deep Vein Thrombosis (DVT).

2. Methods

We consider the situation in which an individual participant dataset (IPD) as well as a number of previously published multivariate logistic regression models are available. The IPD is described by $i = 1,\dots,K$ independent predictors, a dichotomous outcome, and contains $N_{\mathrm{IPD}}$ subjects. The characteristics and observed outcome of subject $s = 1,\dots,N_{\mathrm{IPD}}$ in these data are denoted as $x_{s1},\dots,x_{sK}$ and $y_s$ respectively. The Aggregate Data (AD) from the literature studies are represented by the published prediction models, and can be obtained from individual study publications or directly from the study authors themselves. We assume that the literature models have a similar set of predictors as the IPD, and were developed with a similar prediction task in mind. Furthermore, we assume that for each of $j = 1,\dots,M$ previously published prediction models, the estimated regression coefficients $\hat\beta_{0j},\dots,\hat\beta_{Kj}$ and their corresponding standard errors $\hat\sigma_{0j},\dots,\hat\sigma_{Kj}$ are available. The regression coefficients obtained from the IPD are denoted as $\hat\beta_{1,\mathrm{IPD}},\dots,\hat\beta_{K,\mathrm{IPD}}$ (with intercept $\hat\beta_{0,\mathrm{IPD}}$) and their respective variance-covariance matrix as $\hat\Sigma_{\mathrm{IPD}}$. Although we focus on the presence of one IPD, it is possible to add additional IPDs in a similar manner.

From this situation, we propose three approaches to then combine the literature models with the IPD and derive a novel, aggregated prediction model with coefficients $\beta_{0,\mathrm{UPD}},\dots,\beta_{K,\mathrm{UPD}}$ and variance-covariance matrix $\Sigma_{\mathrm{UPD}}$ (with variance elements $\sigma^2_{0,\mathrm{UPD}},\dots,\sigma^2_{K,\mathrm{UPD}}$, where UPD stands for “updated”). These approaches adopt the two-stage method described by Riley et al. [20], where the available IPD are reduced to AD, and then combined with existing AD using meta-analytical techniques. Specifically, the IPD is first reduced to $\hat\beta_{0,\mathrm{IPD}},\dots,\hat\beta_{K,\mathrm{IPD}}$ and $\hat\Sigma_{\mathrm{IPD}}$, and then aggregated with $\hat\beta_{0j},\dots,\hat\beta_{Kj}$ and $\hat\sigma_{0j},\dots,\hat\sigma_{Kj}$ using meta-analysis techniques appropriate for multivariate synthesis. The first two approaches derive an average synthesis model across the included study populations, which may not be relevant to the population of interest.


www.netstorm.beStatist. Med. 2012


For this reason, the third approach assumes that the IPD reflects the clinically relevant population, and uses the synthesis

model from the literature for updating the regression coefficients from the IPD. Finally, all aggregation approaches re-

estimate the model intercept in the IPD to ensure that updated models remain well calibrated. For all three approaches

this can be achieved by fitting a logistic regression model in the IPD, using an offset variable that is calculated from the

updated regression coefficients:

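$$\Pr(y_s = 1) = \mathrm{logit}^{-1}(\beta_{0,\mathrm{adj}} + \mathrm{offset}) \qquad (1)$$

$$\text{where } \mathrm{offset} = \hat\beta_{1,\mathrm{UPD}}\, x_{s1} + \dots + \hat\beta_{K,\mathrm{UPD}}\, x_{sK} \qquad (2)$$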

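In this expression, $\beta_{0,\mathrm{adj}}$ is the only free parameter that is used as new estimate for the intercept of the aggregated prediction model. The variance-covariance matrix $\hat\Sigma_{\mathrm{UPD}}$ can be adjusted according to the variance-correlation decomposition:

$$\widehat{\mathrm{Cov}}(\hat\beta_{0,\mathrm{adj}}, \hat\beta_{i,\mathrm{UPD}}) = \frac{\hat\sigma_{0,\mathrm{adj}}}{\hat\sigma_{0,\mathrm{UPD}}}\, \widehat{\mathrm{Cov}}(\hat\beta_{0,\mathrm{UPD}}, \hat\beta_{i,\mathrm{UPD}}) \quad \text{where } i = 1,\dots,K \qquad (3)$$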
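The intercept re-estimation of equations (1) and (2) can be sketched in a few lines. The Python fragment below is an illustration only and is not the authors' R implementation; the function name `refit_intercept`, the Newton-Raphson solver and its tolerance are assumptions made for this sketch. The updated slopes enter through a fixed offset, so the intercept is the only parameter that is estimated.

```python
import math

def refit_intercept(offsets, outcomes, tol=1e-10, max_iter=100):
    """Maximum-likelihood intercept of a logistic model with a fixed offset.

    offsets  : linear predictor of each subject without the intercept,
               i.e. sum_i beta_i_UPD * x_si (hypothetical input)
    outcomes : observed 0/1 outcomes of the same subjects
    Returns the adjusted intercept and its standard error.
    """
    b0 = 0.0
    info = 0.0
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(b0 + o))) for o in offsets]
        score = sum(y - p for y, p in zip(outcomes, probs))   # first derivative
        info = sum(p * (1.0 - p) for p in probs)              # Fisher information
        step = score / info
        b0 += step                                            # Newton-Raphson update
        if abs(step) < tol:
            break
    return b0, math.sqrt(1.0 / info)
```

At convergence the score is zero, so the mean predicted risk equals the observed event rate in the IPD, which is the calibration property that the re-estimation is meant to restore.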

All approaches were implemented in R 2.14.1 [23]. The corresponding source code is available on request.

2.1. Univariate meta-analysis

A straightforward strategy to combine the previously published prediction models with IPD is to summarize their

corresponding multivariate coefficients and standard errors. We propose the weighted least squares (WLS) approach

as a first simple approach to combine the coefficients. Appropriate weights for the coefficients can be obtained from

their corresponding standard errors or study sample size when these are not available. This approach corresponds to a

typical meta-analysis involving fixed or random effects as commonly applied to univariate regression coefficients or effect

estimates. Here, the coefficient $\hat\beta_{ij}$ (predictor $i$, study $j$) is weighted according to $w_{ij} = 1/(\hat\sigma^2_{ij} + \tau^2_i)$, with $\tau^2_i$ the between-study variance of $\hat\beta_i$.

As the coefficients are pooled independently for each predictor, dependencies between regression coefficients are ignored. This simplification is not necessarily problematic when the previously published regression coefficients are homogeneous. However, when estimates for these coefficients are known to be correlated across studies, a more advanced approach that accounts for between-study covariance may be more appropriate. We will discuss such an approach below.
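As a rough illustration of the independent pooling described in this section, the Python sketch below combines one regression coefficient across studies with weights of the form $1/(\hat\sigma^2 + \tau^2)$. This section does not state how the between-study variance is estimated; the DerSimonian and Laird moment estimator is assumed here, and the function name is hypothetical.

```python
def pool_random_effects(betas, ses):
    """Pool one coefficient across studies (random effects, DerSimonian-Laird).

    betas : estimates of the same coefficient, one per study
    ses   : their standard errors
    Returns (pooled estimate, its standard error, tau^2).
    """
    variances = [se ** 2 for se in ses]
    w_fixed = [1.0 / v for v in variances]            # fixed-effect weights
    sum_w = sum(w_fixed)
    beta_fixed = sum(w * b for w, b in zip(w_fixed, betas)) / sum_w
    # Cochran's Q measures the spread of the estimates around the fixed-effect mean
    q = sum(w * (b - beta_fixed) ** 2 for w, b in zip(w_fixed, betas))
    df = len(betas) - 1
    c = sum_w - sum(w ** 2 for w in w_fixed) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # moment estimator, truncated at 0
    w_random = [1.0 / (v + tau2) for v in variances]  # w = 1 / (sigma^2 + tau^2)
    beta_pooled = sum(w * b for w, b in zip(w_random, betas)) / sum(w_random)
    se_pooled = (1.0 / sum(w_random)) ** 0.5
    return beta_pooled, se_pooled, tau2
```

Calling the function once per predictor (and once for the intercept) reproduces the independent pooling; setting `tau2` to zero would give the fixed-effect variant.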

2.2. Multivariate meta-analysis

The concept of multivariate meta-analysis is relatively new to the medical literature, and can be seen as a generalization

of DerSimonian and Laird’s methodology for summarizing effect estimates [10, 24]. In contrast to univariate meta-

analysis, the multivariate approach accounts for within-study covariance (instead of within-study variance). Furthermore,

multivariate meta-analysis estimates between-study covariance (rather than between-study variance) of regression

coefficients, and may therefore better account for heterogeneity across studies. This explicit distinction of within- and

between study (co)variance has become paramount in epidemiological research. For this reason, we do not pursue other

potentially useful approaches where evidence is aggregated from a different perspective, such as the Generalized Least

Squares approach proposed by Becker et al [16].

In this section we present a generalized random effects model that accounts for within-study and between-study

covariance of the regression coefficients when pooling them. A univariate [25] and bivariate random effects model [26]

for this purpose can be generalized as follows:

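$$(\beta_0, \beta_1, \dots, \beta_K)^T_l \sim N_{K+1}\left(\mu_{re}, (\Sigma_{re})_l\right) \qquad (4)$$

with

$$(\Sigma_{re})_l = \Sigma_{bs} + \Sigma_l \qquad (5)$$

and

$$\Sigma_{bs} = \begin{pmatrix} \tau_0^2 & \tau_{01} & \cdots & \tau_{0K} \\ \tau_{01} & \tau_1^2 & \cdots & \tau_{1K} \\ \vdots & \vdots & \ddots & \vdots \\ \tau_{0K} & \tau_{1K} & \cdots & \tau_K^2 \end{pmatrix} \qquad (6)$$

and

$$\Sigma_l = \begin{pmatrix} \sigma_0^2 & \mathrm{cov}(\beta_0,\beta_1) & \cdots & \mathrm{cov}(\beta_0,\beta_K) \\ \mathrm{cov}(\beta_0,\beta_1) & \sigma_1^2 & \cdots & \mathrm{cov}(\beta_1,\beta_K) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(\beta_0,\beta_K) & \mathrm{cov}(\beta_1,\beta_K) & \cdots & \sigma_K^2 \end{pmatrix}_l \qquad (7)$$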


In the expressions above, between-study estimates are denoted as bs, and random-effects estimates as re. Here l denotes

each included set of predictors from literature and IPD, i.e. l = {1,...,M,IPD}.

We explicitly distinguish between the within-study and between-study covariance of the regression coefficients, denoted as $\Sigma_l$ (for study $l$) and $\Sigma_{bs}$ respectively. Estimates for $(\beta_0,\beta_1,\dots,\beta_K)_l$ and $\Sigma_l$ can be obtained from $(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_K)_l$ and $\hat\Sigma_l$ respectively. The unknown parameters in $\mu_{re}$ and $\Sigma_{bs}$ can be estimated with maximum likelihood, and provide the pooled means $\mu_{\mathrm{UPD}} = \mu_{re}$ and covariance matrix $\Sigma_{\mathrm{UPD}} = \left(\sum_{l=1}^{M+1} (\Sigma_{re})_l^{-1}\right)^{-1}$. Their corresponding log-likelihood is given by $\ell(\mu_{re},\Sigma_{bs}) = \sum_l \ell_l(\mu_{re},\Sigma_{bs})$ where $\ell_l(\mu_{re},\Sigma_{bs}) = \log(\Pr(\beta_{0l},\dots,\beta_{Kl} \mid \mu_{re},(\Sigma_{re})_l))$ and $\Pr(\beta_{0l},\dots,\beta_{Kl} \mid \mu_{re},(\Sigma_{re})_l) \sim N_{K+1}(\mu_{re},(\Sigma_{re})_l)$. To facilitate convergence of the maximum likelihood estimation procedure, we used the independently pooled estimates of the previously published regression coefficients as initial values for $\mu_{re}$, and a zero-matrix as initial choice for $\Sigma_{bs}$. In addition, we used the Cholesky decomposition to ensure that $\Sigma_{bs}$ is positive semi-definite.

Although $\Sigma_l$ is fully defined for the individual participant data, its non-diagonal entries are usually unknown for previously published regression coefficients. For this reason, we propose to impute missing entries in $\hat\Sigma_l$ based on the observed correlations in $\hat\Sigma_{\mathrm{IPD}}$, according to

$$\hat\Sigma_{\phi\psi l} = \widehat{\mathrm{Cov}}(\hat\beta_{\phi l}, \hat\beta_{\psi l}) = \widehat{\mathrm{Cov}}(\hat\beta_{\phi,\mathrm{IPD}}, \hat\beta_{\psi,\mathrm{IPD}})\, \frac{\hat\sigma_{\phi l}\, \hat\sigma_{\psi l}}{\hat\sigma_{\phi,\mathrm{IPD}}\, \hat\sigma_{\psi,\mathrm{IPD}}} \qquad (8)$$

with $\phi,\psi = 0,\dots,K$. This imputation strategy assumes that the within-study covariance of regression coefficients is exchangeable across all studies. Alternatively, it is possible to restrict non-diagonal entries in $\hat\Sigma_l$ to zero, according to $\hat\Sigma_l = \mathrm{diag}(\hat\sigma^2_{0l}, \hat\sigma^2_{1l}, \dots, \hat\sigma^2_{Kl})$. The former approach may be more appropriate in more homogeneous sets of studies, as then the correlations from the IPD are likely to be closer to the underlying correlations in the included AD. Furthermore, it is possible to assume a common correlation value amongst all slopes (e.g. $\hat\Sigma_{\phi\psi l} = 0.2\, \hat\sigma_{\phi l}\, \hat\sigma_{\psi l}$), or to introduce uncertainty in the correlation parameter(s) by adopting a Bayesian perspective [16, 27]. Finally, simulation studies have revealed that multivariate meta-analysis models appear to be fairly robust to errors made in approximating within-study covariances when only summary effect estimates (here represented by the regression coefficients) are of interest [27].

The complexity of the meta-analysis is mostly defined by $\Sigma_{bs}$. If each element in this matrix is modeled as an unknown parameter, a full random effects meta-analysis is performed. Conversely, if all (non-diagonal) entries in $\Sigma_{bs}$ and $\Sigma_l$ are restricted to zero, the regression coefficients are pooled independently as described in section 2.1. Furthermore, it is possible to perform a reduced random effects meta-analysis by restricting a selection of $\Sigma_{bs}$-elements to zero. For instance, we can assume fixed effects for $\beta_1$ by choosing $\tau^2_1 = \tau_{0,1} = \tau_{1,2} = \dots = \tau_{1,K} = 0$. Additional fixed effects can be introduced in a similar manner. We argue that by restricting the number of unknown parameters in $\Sigma_{bs}$, estimates for their corresponding values may become more robust. The stability of $\mu_{re}$ and $\Sigma_{bs}$ may further be improved by introducing (weakly) informative prior distributions. Unfortunately, such an approach ultimately requires the use of highly advanced distributional families which may not have a straightforward interpretation or implementation. Implementing these is beyond the scope of this article.

Finally, the described approach can easily be extended to scenarios in which multiple IPDs are available. In these scenarios, $\Sigma_l$ is fully defined for multiple studies and hence allows an improved estimation of the unknown parameters. Alternatively, it is possible to adopt a one-stage approach that does not reduce the IPD to AD, but instead accounts for the fact that some studies provide IPD, and some studies provide only AD [28]. Similarly, when no IPDs are available, the non-diagonal entries of $\Sigma_l$ are (probably) undefined for all studies and making reasonable assumptions about these entries becomes more important to obtaining valid results.
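The imputation of within-study covariances in equation (8) amounts to borrowing the correlation matrix of the IPD and rescaling it with the published standard errors. A minimal Python sketch follows, assuming the covariance matrix is held as a nested list ordered as intercept followed by the $K$ slopes; the function name and the example values are hypothetical.

```python
def impute_within_study_cov(sigma_ipd, ses_lit):
    """Impute a literature model's covariance matrix as in equation (8).

    sigma_ipd : (K+1) x (K+1) covariance matrix estimated from the IPD
    ses_lit   : the K+1 published standard errors of one literature model
    Each entry is the IPD correlation times the two published standard errors.
    """
    n = len(ses_lit)
    se_ipd = [sigma_ipd[i][i] ** 0.5 for i in range(n)]
    return [
        [sigma_ipd[i][j] * ses_lit[i] * ses_lit[j] / (se_ipd[i] * se_ipd[j])
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical two-parameter example (intercept and one slope)
sigma_ipd = [[0.04, 0.01],
             [0.01, 0.09]]
sigma_lit = impute_within_study_cov(sigma_ipd, [0.4, 0.6])
```

The diagonal of the result reproduces the published variances, so only the off-diagonal entries are actually imputed. Restricting them to zero instead corresponds to the diagonal alternative described above.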

2.3. Bayesian Inference

The approaches described in section 2.1 and 2.2 estimate a ‘pooled’ prediction model whenever a number of previously

published prediction models as well as IPD are available. It may be clear that an average synthesis model across the

included study populations may not always reflect the population of interest. Here, we assume that the IPD represents

the clinically relevant population. Good prediction in these particular subjects is hence of primary interest. Therefore,

we consider an alternative approach where the evidence from existing prediction models is used to update the regression

coefficients from the IPD. To this purpose, we apply a Bayesian framework where a summary of the previously published

regression coefficients serves as prior for the regression coefficients in the IPD. This summary of literature evidence can


be obtained through the approach described in section 2.2:

$$\mu_{\mathrm{PRIOR}} = \mu_{re} \qquad (9)$$

$$\Sigma_{\mathrm{PRIOR}} = \left(\sum_{j=1}^{M} (\Sigma_{re})_j^{-1}\right)^{-1} \qquad (10)$$

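Note that this prior distribution does not include estimates from the IPD. Instead, we assume that the estimated coefficients from the IPD follow a multivariate normal distribution with mean $\mu_{\mathrm{IPD}}$ and covariance matrix $\Sigma_{\mathrm{IPD}}$. This distribution represents the likelihood and can be formulated as $\Pr(\beta_{0,\mathrm{IPD}},\dots,\beta_{K,\mathrm{IPD}} \mid \mu_{\mathrm{IPD}},\Sigma_{\mathrm{IPD}}) \sim N_{K+1}(\mu_{\mathrm{IPD}},\Sigma_{\mathrm{IPD}})$. We propose to construct a conjugate prior distribution for $\mu_{\mathrm{IPD}}$ with $\Pr(\mu_{\mathrm{IPD}}) \sim N_{K+1}(\mu_{\mathrm{PRIOR}},\Sigma_{\mathrm{PRIOR}})$ such that the posterior density $\Pr(\mu_{\mathrm{IPD}} \mid \beta_{0,\mathrm{IPD}},\dots,\beta_{K,\mathrm{IPD}},\Sigma_{\mathrm{IPD}}) \sim N_{K+1}(\mu_{\mathrm{POST}},\Sigma_{\mathrm{POST}})$ can be determined analytically:

$$\mu_{\mathrm{UPD}} = \left(\Sigma_{\mathrm{PRIOR}}^{-1} + \Sigma_{\mathrm{IPD}}^{-1}\right)^{-1} \left(\Sigma_{\mathrm{PRIOR}}^{-1}\, \mu_{\mathrm{PRIOR}} + \Sigma_{\mathrm{IPD}}^{-1}\, \mu_{\mathrm{IPD}}\right) \qquad (11)$$

$$\Sigma_{\mathrm{UPD}} = \left(\Sigma_{\mathrm{PRIOR}}^{-1} + \Sigma_{\mathrm{IPD}}^{-1}\right)^{-1} \qquad (12)$$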

Here, the parameters $\mu_{\mathrm{IPD}}$ and $\Sigma_{\mathrm{IPD}}$ can be substituted by $(\hat\beta_{0,\mathrm{IPD}},\dots,\hat\beta_{K,\mathrm{IPD}})$ and $\hat\Sigma_{\mathrm{IPD}}$ respectively. Consequently, the vector $\mu_{\mathrm{UPD}}$ represents the expected (posterior) value of the multivariate regression coefficients $\beta_{0,\mathrm{UPD}},\dots,\beta_{K,\mathrm{UPD}}$, and $\Sigma_{\mathrm{UPD}}$ represents the expected (posterior) value of the corresponding variance-covariance matrix. When multiple IPDs are available, it is possible to subsequently add each IPD using Bayesian Inference.
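The conjugate update of equations (11) and (12) is a precision-weighted average of the literature summary and the IPD estimates. The Python sketch below illustrates it with plain nested lists and a small Gauss-Jordan inverse so that it has no dependencies; the helper names are hypothetical and the code is not the authors' R implementation.

```python
def mat_inv(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [x / piv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def bayes_update(mu_prior, sigma_prior, mu_ipd, sigma_ipd):
    """Posterior mean and covariance following equations (11) and (12)."""
    prec_prior = mat_inv(sigma_prior)
    prec_ipd = mat_inv(sigma_ipd)
    prec_sum = [[p + q for p, q in zip(rp, rq)]
                for rp, rq in zip(prec_prior, prec_ipd)]
    sigma_upd = mat_inv(prec_sum)                                    # equation (12)
    weighted = [p + q for p, q in zip(mat_vec(prec_prior, mu_prior),
                                      mat_vec(prec_ipd, mu_ipd))]
    mu_upd = mat_vec(sigma_upd, weighted)                            # equation (11)
    return mu_upd, sigma_upd
```

Two limiting cases are useful checks: when the prior and the IPD carry the same covariance matrix the updated coefficients are the simple average of both, and when the prior variance is very large the update returns (almost) the IPD estimates.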

3. Application: Traumatic Brain Injury

We tested univariate meta-analysis, multivariate meta-analysis, Bayesian Inference and Standard Logistic Regression

(SLR) modeling (i.e. analysis using the IPD only) on 15 empirical datasets of Traumatic Brain Injury (TBI) patients. TBI

is a leading cause of death and disability worldwide with a substantial economic burden [29, 30]. It is difficult to establish

a reliable prognosis on admission [31]. This requires the consideration of multiple and easily accessible risk factors in

multivariable prognostic models [22, 5, 32, 33]. Many prognostic models with admission data are readily available from

the literature [32]. However, most models were developed on relatively small sample sizes originating from a single center

or region and lack external validation [9, 32]. Therefore, their aggregation might improve the generalization of novel

prognostic models.

3.1. Application Setup

To test the potential value of our approaches we used 15 series of individual participant data collected in the International

Mission for Prognosis and Analysis of Clinical Trials in TBI (IMPACT) project [21]. The outcome used in each of these

trials was the Glasgow Outcome Scale score (GOS) at 6 months after injury, dichotomized between severe and moderate

disability.

We fitted a logistic regression model to each of the available datasets, and considered a core set of conventional TBI

prognostic factors (age, motor score and pupil response to light) (Table 1) [22, 32]. In this manner, we aimed to simulate

scenarios in which a common set of core predictors is available and can be aggregated with individual participant data.

We realize that for many genuine examples the assumption of literature models sharing the same set of parameters is

unrealistic. This problem also arises in our application, where some of the previously published regression coefficients are

unknown because some studies did not contain all categories of the motor score or pupil response. Instead of discarding

the corresponding predictors from the aggregated model, we propose using uninformative regression coefficients when

they cannot be estimated from the data. We argue that this strategy can also be applied in other examples where the

literature models do not share the same set of parameters. Finally, we measured the Area under the Receiver Operator

Characteristic curve (AUC) and the Brier Score (BS) of the aggregated models as indication of performance. Whereas the

former quantifies the model’s ability to distinguish high-risk from low-risk patients, the latter assesses the accuracy of its

predictions [34, 35].
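Both performance measures can be computed directly from predicted risks and observed outcomes. The Python sketch below is a plain illustration with hypothetical function names: the AUC is computed as the proportion of case/non-case pairs in which the case receives the higher predicted risk (ties counted as one half), and the Brier Score as the mean squared difference between predicted risk and outcome.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted risk and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def auc(probs, outcomes):
    """Concordance between predicted risks and outcomes over all case/non-case pairs."""
    cases = [p for p, y in zip(probs, outcomes) if y == 1]
    controls = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = sum(
        (pc > pn) + 0.5 * (pc == pn)     # 1 if the case ranks higher, 0.5 for a tie
        for pc in cases for pn in controls
    )
    return concordant / (len(cases) * len(controls))

# Hypothetical predictions for four patients
probs = [0.10, 0.40, 0.35, 0.80]
outcomes = [0, 0, 1, 1]
print(auc(probs, outcomes), brier_score(probs, outcomes))
```

The pairwise AUC is quadratic in the number of patients, which is acceptable for a sketch; a rank-based formula would be preferable for large validation sets.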


Table 1. Estimated regression coefficients (and standard error) from the IMPACT data. Logistic regression coefficients for favorable versus unfavorable outcome at 6 months after TBI.

| Characteristics | Coding | Coefficient | TINT | TIUS | SLIN | SAPHIR | PEGSOD | HIT I | UK4 | TCDB |
|---|---|---|---|---|---|---|---|---|---|---|
| Patients | | | 1118 | 1041 | 409 | 919 | 1510 | 350 | 791 | 603 |
| Study Type | | | RCT | RCT | RCT | RCT | RCT | RCT | Obs. | Obs. |
| Intercept | | $\hat\beta_0$ | -2.48 (0.21) | -3.06 (0.25) | -2.06 (0.34) | -2.43 (0.24) | -2.76 (0.21) | -2.66 (0.47) | -2.13 (0.26) | -2.24 (0.33) |
| Age, years | | $\hat\beta_1$ | 0.03 (0.00) | 0.04 (0.01) | 0.03 (0.01) | 0.04 (0.00) | 0.04 (0.00) | 0.03 (0.01) | 0.04 (0.01) | 0.05 (0.01) |
| Motor score | None | $\hat\beta_2$ | 1.49 (0.95) | 1.42 (0.76) | NA | 0.69 (0.23) | 1.52 (0.17) | 1.36 (0.37) | 1.33 (0.35) | 2.05 (0.39) |
| | Extension | $\hat\beta_3$ | 1.84 (0.24) | 1.93 (0.25) | 1.69 (0.38) | 1.50 (0.25) | 2.58 (0.25) | 2.53 (0.54) | 1.67 (0.42) | 2.14 (0.37) |
| | Abnormal flexion | $\hat\beta_4$ | 1.10 (0.19) | 1.61 (0.23) | 0.63 (0.29) | 0.47 (0.23) | 1.47 (0.21) | 1.95 (0.47) | 1.18 (0.49) | 0.74 (0.33) |
| | Normal flexion | $\hat\beta_5$ | 0.51 (0.17) | 0.81 (0.18) | 0.28 (0.27) | 0.19 (0.20) | 0.82 (0.18) | 0.80 (0.42) | 0.44 (0.25) | 0.48 (0.28) |
| | Localizes/obeys | | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. |
| | Untestable/missing | $\hat\beta_6$ | NA | NA | NA | 0.34 (1.23) | NA | 1.08 (0.77) | 0.94 (0.24) | -0.14 (0.47) |
| Pupillary reactivity | Both pupils reacted | | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. |
| | One pupil reacted | $\hat\beta_7$ | 0.82 (0.19) | 0.28 (0.23) | 1.08 (0.28) | 1.22 (0.17) | 0.48 (0.19) | 0.42 (0.35) | 0.80 (0.26) | 0.70 (0.35) |
| | No pupil reacted | $\hat\beta_8$ | 1.28 (0.22) | 1.29 (0.19) | 2.08 (0.80) | NA | 1.05 (0.13) | 2.15 (0.42) | 2.04 (0.28) | 1.54 (0.26) |

| Characteristics | Coding | Coefficient | SKB | EBIC | HIT II | NABIS | CSTAT | PHARMOS | APOE |
|---|---|---|---|---|---|---|---|---|---|
| Patients | | | 126 | 822 | 819 | 385 | 517 | 856 | 756 |
| Study Type | | | RCT | Obs. | RCT | RCT | RCT | RCT | Obs. |
| Intercept | | $\hat\beta_0$ | -1.77 (0.68) | -3.12 (0.28) | -2.70 (0.28) | -2.14 (0.41) | -2.46 (0.35) | -1.50 (0.24) | -3.15 (0.27) |
| Age, years | | $\hat\beta_1$ | 0.04 (0.02) | 0.04 (0.00) | 0.03 (0.01) | 0.04 (0.01) | 0.03 (0.01) | 0.02 (0.01) | 0.04 (0.00) |
| Motor score | None | $\hat\beta_2$ | 0.56 (0.63) | 1.61 (0.28) | 1.07 (0.24) | 0.97 (0.33) | 0.88 (0.41) | 0.54 (0.34) | 1.31 (1.15) |
| | Extension | $\hat\beta_3$ | 0.63 (0.71) | 1.90 (0.35) | 2.07 (0.35) | 1.69 (0.40) | 1.49 (0.33) | 1.31 (0.27) | NA |
| | Abnormal flexion | $\hat\beta_4$ | 1.30 (0.76) | 1.53 (0.36) | 1.63 (0.29) | 1.76 (0.40) | 1.14 (0.32) | 1.03 (0.23) | NA |
| | Normal flexion | $\hat\beta_5$ | -0.18 (0.74) | 1.33 (0.27) | 0.48 (0.25) | 0.75 (0.33) | 0.07 (0.29) | 0.64 (0.19) | 1.28 (0.56) |
| | Localizes/obeys | | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. |
| | Untestable/missing | $\hat\beta_6$ | -0.65 (0.74) | 1.12 (0.25) | 0.97 (0.34) | 0.77 (0.72) | NA | 0.51 (0.23) | 1.17 (0.19) |
| Pupillary reactivity | Both pupils reacted | | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. | Ref. |
| | One pupil reacted | $\hat\beta_7$ | 1.09 (0.46) | 1.01 (0.29) | 0.37 (0.24) | 1.03 (0.37) | 1.53 (0.28) | 0.52 (0.19) | 0.87 (0.37) |
| | No pupil reacted | $\hat\beta_8$ | NA | 1.44 (0.23) | 1.26 (0.23) | 1.18 (0.29) | 1.87 (0.32) | 0.47 (0.37) | 2.04 (0.36) |

Note: NA, not available; Ref., reference category; RCT, randomized controlled trial; Obs., observational study.


3.2. Practical Example

As an illustration, we used the HIT I study [36] as IPD, the HIT II study [37] as validation data, and the prediction models of the remaining studies as previously published evidence (Table 2). We calculated the $I^2$ index of heterogeneity for each separate (and known) regression coefficient of the previously published prediction models by performing a univariate meta-analysis [38]. These coefficients were found to be moderately to strongly heterogeneous, with $I^2(\hat\beta_0) = 0.71$, $I^2(\hat\beta_1) = 0.15$, $I^2(\hat\beta_2) = 0.49$, $I^2(\hat\beta_3) = 0.40$, $I^2(\hat\beta_4) = 0.52$, $I^2(\hat\beta_5) = 0.48$, $I^2(\hat\beta_6) = 0.54$, $I^2(\hat\beta_7) = 0.53$ and $I^2(\hat\beta_8) = 0.61$. These estimates should, however, be interpreted with caution, as much of the discrepancy between the previously published regression coefficients is due to their small standard errors. Next, we imputed previously published regression coefficients that could not be estimated from the data and performed a sensitivity analysis to assess two different imputation approaches. To this effect we evaluated $\hat\beta_\phi = 0$ with $\hat\sigma^2_\phi = 100$, and compared it with a mean imputation with $\hat\sigma^2_\phi = \sum_{j=1}^{M} \hat\sigma^2_{\phi j}$.

Finally, we aggregated the previously published prediction models with the IPD. The approaches considered are: standard logistic regression (SLR) modeling ignoring the literature studies, univariate meta-analysis, multivariate meta-analysis and Bayesian Inference. We also performed a logistic regression analysis using all available IPD datasets (except for the validation study), and used the resulting model as "gold standard" for comparing the aggregated models. Because the multivariate meta-analysis approach requires the within-study covariance of the previously published prediction models to be fully specified, we evaluated two strategies for imputing missing (i.e. non-diagonal) entries in $\Sigma_l$. As explained above, we compared a strategy that imputes missing covariance entries based on the observed correlation in the IPD with a strategy that restricts the non-diagonal entries in $\Sigma_l$ to zero.

Results (Table 2) from this example illustrate that particular choices for imputing missing regression coefficients and unknown within-study covariance do not have a large impact on the resulting prediction model. Although each strategy yields somewhat different estimated regression coefficients, most variation seems to arise from the uncertainty in the available regression coefficients. The example also illustrates that the regression coefficients of aggregated prediction models are more similar to the coefficients from the reference "gold standard" model than those obtained with SLR modeling. Furthermore, we noticed that prediction models incorporating prior evidence achieved slightly improved AUC and Brier scores. It is possible that improvements in this particular example are relatively small due to the strong relation between the IPD and validation data (the HIT II study is a follow-up study of the HIT I study). Finally, we noticed a considerable decrease in the standard errors of estimated regression coefficients when prior evidence was incorporated. Although these errors are not of primary concern in prediction research, they reflect an improved stability of the derived prediction models.
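The $I^2$ index used in this example can be illustrated with a short sketch. It assumes the usual definition based on Cochran's $Q$ from an inverse-variance (fixed-effect) meta-analysis, $I^2 = \max(0, (Q - df)/Q)$, and uses the rounded intercepts and standard errors of the 13 literature studies in Table 1 (all studies except HIT I and HIT II). With these rounded inputs the result is approximately 0.71, in line with the value reported for $\hat\beta_0$. The function name is hypothetical.

```python
def i_squared(estimates, std_errors):
    """I^2 = max(0, (Q - df) / Q), with Q Cochran's heterogeneity statistic."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q)


# Rounded intercepts (and standard errors) from Table 1, in the order
# TINT, TIUS, SLIN, SAPHIR, PEGSOD, UK4, TCDB, SKB, EBIC, NABIS, CSTAT,
# PHARMOS, APOE. HIT I (IPD) and HIT II (validation) are left out.
intercepts = [-2.48, -3.06, -2.06, -2.43, -2.76, -2.13, -2.24,
              -1.77, -3.12, -2.14, -2.46, -1.50, -3.15]
intercept_ses = [0.21, 0.25, 0.34, 0.24, 0.21, 0.26, 0.33,
                 0.68, 0.28, 0.41, 0.35, 0.24, 0.27]

print(round(i_squared(intercepts, intercept_ses), 2))
```

The same function cannot be applied directly to the age coefficient as printed in Table 1, because several of its standard errors are rounded to 0.00.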

3.3. Performance study

In order to evaluate the overall performance of aggregation models, we performed a split-sample procedure where individual participant data and validation data were sampled (without replacement) from a common dataset. The prediction models generated from the remaining datasets were used as prior evidence for the aggregation methods. This procedure was repeated 100 times for each scenario to ensure stable estimates of model performance. We evaluated $N_{IPD} = 500$ and $N_{IPD} = 200$, and imputed unknown regression coefficients according to $\hat\beta_\phi = 0$ with $\hat\sigma^2_\phi = 100$.

Results indicate that all aggregation approaches perform similarly, and yield prediction models with an improved AUC and Brier score (Table 3). These improvements particularly occur in small datasets ($N_{IPD} = 200$), but do not necessarily disappear when more IPD is at hand ($N_{IPD} = 500$). Furthermore, we noticed that aggregated prediction models perform similarly to models derived with the IPD from all original studies (Full IPD modeling). Finally, we noticed that standard errors of aggregated regression coefficients tend to be smaller when estimated with multivariate meta-analysis (compared to univariate meta-analysis).
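The split-sample procedure can be sketched as follows. This is a minimal illustration under the assumption that each repetition draws $N_{IPD}$ participants without replacement and uses all remaining participants of the same study for validation, which agrees with the $N_{VAL}$ values in Table 3 (for example 791 - 500 = 291 for UK4). The function names and the random seed are hypothetical.

```python
import random


def split_sample(n_total, n_ipd, rng):
    """Draw n_ipd row indices without replacement; the rest form the validation set."""
    ipd_indices = set(rng.sample(range(n_total), n_ipd))
    validation_indices = [i for i in range(n_total) if i not in ipd_indices]
    return sorted(ipd_indices), validation_indices


def repeated_splits(n_total, n_ipd, n_repeats=100, seed=2012):
    """Repeat the split n_repeats times with a reproducible random generator."""
    rng = random.Random(seed)
    return [split_sample(n_total, n_ipd, rng) for _ in range(n_repeats)]


# Example: the UK4 study (N = 791) with 500 participants used as IPD.
splits = repeated_splits(n_total=791, n_ipd=500)
first_ipd, first_validation = splits[0]
print(len(splits), len(first_ipd), len(first_validation))
```

In each repetition the aggregation methods would be fitted on the IPD indices and evaluated on the validation indices, after which the performance measures are averaged over the repetitions.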


Table 2. An illustration of the proposed approaches in the TBI application: updated intercept (Int.) and regression coefficients (and standard error) when the HIT I study (N = 350) is used as individual participant dataset, the HIT II study (N = 819) as validation dataset and the remaining studies as evidence from the literature. Motor Score (*) comprises $\hat\beta_2$ to $\hat\beta_6$ and Pupillary Reactivity (**) comprises $\hat\beta_7$ and $\hat\beta_8$.

| Missing literature estimates | Approach | (Int.) $\hat\beta_0$ | Age, years $\hat\beta_1$ | $\hat\beta_2$ * | $\hat\beta_3$ * | $\hat\beta_4$ * | $\hat\beta_5$ * | $\hat\beta_6$ * | $\hat\beta_7$ ** | $\hat\beta_8$ ** | AUC | BS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Not applicable | SLR modeling (analysis ignoring literature studies) | -2.66 (0.47) | 0.03 (0.01) | 1.36 (0.37) | 2.53 (0.54) | 1.95 (0.47) | 0.80 (0.42) | 1.08 (0.77) | 0.42 (0.35) | 2.15 (0.42) | 0.745 (0.017) | 0.206 (0.008) |
| Not applicable | Full IPD modeling (analysis with IPD of all original studies stacked) | -2.52 (0.07) | 0.04 (0.00) | 1.22 (0.07) | 1.88 (0.08) | 1.21 (0.07) | 0.60 (0.06) | 0.98 (0.08) | 0.80 (0.06) | 1.48 (0.06) | 0.749 (0.017) | 0.207 (0.007) |
| Uninformative ($\hat\beta_\phi = 0$ with $\hat\sigma^2_\phi = 100$) | Univariate meta-analysis | -2.67 (0.12) | 0.04 (0.00) | 1.20 (0.13) | 1.81 (0.12) | 1.17 (0.12) | 0.60 (0.09) | 0.82 (0.13) | 0.83 (0.10) | 1.46 (0.12) | 0.749 (0.017) | 0.203 (0.007) |
| Uninformative | Multivariate meta-analysis (missing within-study covariance restricted to zero) | -2.67 (0.12) | 0.04 (0.00) | 1.21 (0.10) | 1.81 (0.09) | 1.17 (0.10) | 0.60 (0.07) | 0.81 (0.11) | 0.83 (0.07) | 1.44 (0.12) | 0.749 (0.017) | 0.203 (0.007) |
| Uninformative | Multivariate meta-analysis (missing within-study covariance imputed from IPD) | -2.67 (0.12) | 0.04 (0.00) | 1.20 (0.08) | 1.81 (0.08) | 1.17 (0.07) | 0.60 (0.06) | 0.82 (0.10) | 0.83 (0.07) | 1.46 (0.07) | 0.749 (0.017) | 0.203 (0.007) |
| Uninformative | Bayesian Inference (missing within-study covariance restricted to zero) | -2.65 (0.12) | 0.04 (0.00) | 1.19 (0.11) | 1.83 (0.09) | 1.19 (0.09) | 0.59 (0.07) | 0.81 (0.11) | 0.81 (0.07) | 1.51 (0.12) | 0.749 (0.017) | 0.203 (0.007) |
| Mean imputation (with $\hat\sigma^2_\phi = \sum_{j=1}^{M} \hat\sigma^2_{\phi j}$) | Univariate meta-analysis | -2.67 (0.12) | 0.04 (0.00) | 1.20 (0.13) | 1.81 (0.12) | 1.17 (0.12) | 0.60 (0.09) | 0.81 (0.13) | 0.83 (0.10) | 1.46 (0.12) | 0.749 (0.017) | 0.203 (0.007) |
| Mean imputation | Multivariate meta-analysis (missing within-study covariance restricted to zero) | -2.67 (0.12) | 0.04 (0.00) | 1.21 (0.10) | 1.81 (0.09) | 1.17 (0.10) | 0.60 (0.07) | 0.81 (0.11) | 0.83 (0.07) | 1.44 (0.12) | 0.749 (0.017) | 0.203 (0.007) |
| Mean imputation | Multivariate meta-analysis (missing within-study covariance imputed from IPD) | -2.67 (0.12) | 0.04 (0.00) | 1.20 (0.08) | 1.81 (0.08) | 1.17 (0.07) | 0.60 (0.06) | 0.81 (0.10) | 0.83 (0.07) | 1.46 (0.07) | 0.749 (0.017) | 0.203 (0.007) |
| Mean imputation | Bayesian Inference (missing within-study covariance restricted to zero) | -2.65 (0.12) | 0.04 (0.00) | 1.19 (0.11) | 1.83 (0.09) | 1.21 (0.08) | 0.59 (0.07) | 0.81 (0.11) | 0.79 (0.10) | 1.47 (0.08) | 0.749 (0.017) | 0.202 (0.007) |

Note: The Area under the Receiver Operator Characteristic curve (AUC) and the Brier Score (BS) of the aggregated models are presented as measure of performance in HIT II. Standard errors for the AUC were obtained through the standard error of the Somers' D statistic. Standard errors for the Brier score were estimated according to $\mathrm{sd}[(p_s - o_s)^2]/\sqrt{N}$. The categorical variables Motor Score (*) and Pupillary Reactivity (**) were coded as factors (cf. Table 1).


Table 3. Performance of aggregated prediction models, expressed by means of the Area under the Receiver Operator Characteristic curve (AUC) and the Brier Score (BS).

UK4 and EBIC

| Approach | UK4, $N_{IPD} = 500$ ($N_{VAL} = 291$): AUC (SE) | BS (SE) | UK4, $N_{IPD} = 200$ ($N_{VAL} = 591$): AUC (SE) | BS (SE) | EBIC, $N_{IPD} = 500$ ($N_{VAL} = 322$): AUC (SE) | BS (SE) | EBIC, $N_{IPD} = 200$ ($N_{VAL} = 622$): AUC (SE) | BS (SE) |
|---|---|---|---|---|---|---|---|---|
| SLR modeling (analysis ignoring literature studies) | 0.813 (0.022) | 0.165 (0.010) | 0.801 (0.011) | 0.172 (0.006) | 0.810 (0.019) | 0.179 (0.010) | 0.801 (0.013) | 0.185 (0.007) |
| Full IPD modeling (analysis with IPD of all original studies stacked) | 0.822 (0.020) | 0.174 (0.009) | 0.821 (0.008) | 0.176 (0.003) | 0.814 (0.019) | 0.176 (0.009) | 0.814 (0.010) | 0.176 (0.004) |
| Univariate meta-analysis | 0.821 (0.020) | 0.162 (0.009) | 0.820 (0.008) | 0.164 (0.005) | 0.815 (0.019) | 0.176 (0.009) | 0.814 (0.010) | 0.176 (0.004) |
| Multivariate meta-analysis (missing within-study covariance restricted to zero) | 0.820 (0.020) | 0.162 (0.009) | 0.820 (0.008) | 0.164 (0.005) | 0.815 (0.019) | 0.176 (0.009) | 0.814 (0.010) | 0.177 (0.005) |
| Bayesian Inference (missing within-study covariance restricted to zero) | 0.820 (0.020) | 0.162 (0.009) | 0.820 (0.008) | 0.164 (0.005) | 0.814 (0.019) | 0.176 (0.009) | 0.814 (0.010) | 0.177 (0.005) |

HIT II and PHARMOS

| Approach | HIT II, $N_{IPD} = 500$ ($N_{VAL} = 319$): AUC (SE) | BS (SE) | HIT II, $N_{IPD} = 200$ ($N_{VAL} = 619$): AUC (SE) | BS (SE) | PHARMOS, $N_{IPD} = 500$ ($N_{VAL} = 356$): AUC (SE) | BS (SE) | PHARMOS, $N_{IPD} = 200$ ($N_{VAL} = 656$): AUC (SE) | BS (SE) |
|---|---|---|---|---|---|---|---|---|
| SLR modeling (analysis ignoring literature studies) | 0.739 (0.021) | 0.201 (0.008) | 0.728 (0.013) | 0.207 (0.007) | 0.642 (0.024) | 0.237 (0.007) | 0.627 (0.017) | 0.243 (0.007) |
| Full IPD modeling (analysis with IPD of all original studies stacked) | 0.744 (0.020) | 0.205 (0.007) | 0.742 (0.010) | 0.207 (0.004) | 0.653 (0.022) | 0.242 (0.008) | 0.656 (0.009) | 0.242 (0.004) |
| Univariate meta-analysis | 0.744 (0.020) | 0.199 (0.008) | 0.743 (0.010) | 0.199 (0.005) | 0.654 (0.023) | 0.236 (0.008) | 0.657 (0.009) | 0.236 (0.004) |
| Multivariate meta-analysis (missing within-study covariance restricted to zero) | 0.745 (0.020) | 0.198 (0.008) | 0.743 (0.010) | 0.199 (0.005) | 0.654 (0.024) | 0.236 (0.008) | 0.657 (0.009) | 0.236 (0.004) |
| Bayesian Inference (missing within-study covariance restricted to zero) | 0.745 (0.019) | 0.198 (0.008) | 0.743 (0.010) | 0.199 (0.005) | 0.654 (0.024) | 0.236 (0.008) | 0.657 (0.009) | 0.236 (0.004) |

Note: For multivariate meta-analysis and Bayesian Inference, we used uninformative regression coefficients when missing.


4. Application: Deep Venous Thrombosis

To confirm the potential value of the proposed approaches, we describe a genuine clinical example involving the

prediction of Deep Venous Thrombosis (DVT). In this example, we aggregated 5 previously published prediction

models [39, 40, 41, 42, 43, 44] with one IPD set, and evaluated different strategies for coping with missing predictor

values and within-study covariance. We used an IPD (N = 1,028) from the Amsterdam-Maastricht-Utrecht Study on

thromboEmbolism (AMUSE-1) [45] and aggregated these data with the prediction models described below. A detailed

description of the predictors can be found in the Appendix. After aggregation, we validated the original and aggregated

models in an independent dataset of 791 participants [46].

Unfortunately, we encountered some difficulties during incorporation of the previously published prediction models. For instance, some articles did not report the original regression coefficients and standard errors of the prediction model and reported a scoring rule with weights instead, with $\text{score} = \text{weight}_1 x_1 + \ldots + \text{weight}_K x_K$ (e.g. Wells rule, modified Wells rule and Hamilton rule). We attempted to reconstruct the original regression coefficients and standard errors by deriving a prediction model in the IPD with the scoring rule as single variable, according to:

$$\Pr(\text{DVT presence}) = \operatorname{logit}^{-1}\left(\beta_{\text{adj}0} + \beta_{\text{adj}1}\,\text{score}\right) \qquad (13)$$

The resulting slope $\hat\beta_{\text{adj}1}$ is then multiplied with the reported weights to obtain an estimate for the original regression coefficients, and $\hat\beta_{\text{adj}0}$ is used as estimate for the model intercept. Conservative estimates for the corresponding standard errors can be obtained by assuming

$$\sigma_{\text{adj}1} = \left(\sum_{j=1}^{M} \sigma_j^{-2}\right)^{-1/2} \qquad (14)$$

This assumption implies that the standard errors $\sigma_j$ are equal for all regression coefficients of the model under consideration. The standard error for the model intercept can be directly obtained from $\hat\sigma_{\text{adj}0}$. Alternatively, reported p-values of regression coefficients can be converted into standard errors by assuming normality [47]. An advantage of the reconstruction from a scoring rule is that the AUC of reconstructed models remains equal to the performance of the original models, as the linear predictors are proportionally identical.
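The conversion of a reported p-value into a standard error can be sketched as follows. The sketch assumes that the p-value comes from a two-sided Wald test, so that $|\hat\beta| / \hat\sigma$ equals the standard normal quantile at $1 - p/2$. This is one common reading of "assuming normality" and not necessarily the exact procedure of [47]. The function name and the example values are hypothetical.

```python
from statistics import NormalDist


def se_from_p_value(beta, p_value):
    """Standard error implied by a two-sided Wald p-value for coefficient beta."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    # Absolute z statistic that corresponds to the two-sided p-value.
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return abs(beta) / z


# Hypothetical example: a coefficient of 0.80 reported with p = 0.01.
print(round(se_from_p_value(0.80, 0.01), 3))
```

The conversion becomes unstable when a p-value is reported only as a threshold (for instance "p < 0.05"), in which case the result is an upper bound for the standard error rather than an estimate.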

We illustrate the reconstruction from a scoring rule using the Wells rule. This rule consists of nine clinical items, where

$$\text{WellsScore} = 1\,\text{malign} + 1\,\text{par} + 1\,\text{surg} + 1\,\text{tend} + 1\,\text{leg} + 1\,\text{calfdif3} + 1\,\text{pit} + 1\,\text{vein} - 2\,\text{altdiagn}.$$

Deriving a prediction model in the IPD with the Wells score as single variable yielded the following model: $\Pr(\text{DVT presence}) = \operatorname{logit}^{-1}(-2.66 + 0.52\,\text{WellsScore})$. Consequently, we may reconstruct the original regression coefficients as follows: $\hat\beta_0 = -2.66$, $\hat\beta_{\text{malign}} = 0.52$, $\hat\beta_{\text{par}} = 0.52$, $\hat\beta_{\text{surg}} = 0.52$, $\hat\beta_{\text{tend}} = 0.52$, $\hat\beta_{\text{leg}} = 0.52$, $\hat\beta_{\text{calfdif3}} = 0.52$, $\hat\beta_{\text{pit}} = 0.52$, $\hat\beta_{\text{vein}} = 0.52$ and $\hat\beta_{\text{altdiagn}} = -1.04$. We found $\hat\sigma_{\text{adj}0} = 0.15$ and $\hat\sigma_{\text{adj}1} = 0.05$, such that $\hat\sigma_0 = 0.15$ and $\hat\sigma_{\text{malign}}, \ldots, \hat\sigma_{\text{altdiagn}} = 0.16$.

We applied the previously published models in the validation data, and observed an AUC < 0.634 and a Brier score > 0.133 for most models, with the exception of the Oudega model (AUC = 0.767 and Brier score = 0.125).
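The reconstruction of regression coefficients from a scoring rule can be sketched with the Wells numbers quoted in the text. The sketch assumes that $M$ in Equation (14) is the number of items in the rule and that all item standard errors are equal, so that Equation (14) reduces to $\sigma_j = \sqrt{M}\,\sigma_{\text{adj}1}$. With the rounded slope standard error of 0.05 this gives about 0.15, slightly below the reported 0.16, presumably because the reported value was computed before rounding. The function and variable names are hypothetical.

```python
import math

# Item weights of the Wells rule as given in the text.
WELLS_WEIGHTS = {
    "malign": 1, "par": 1, "surg": 1, "tend": 1, "leg": 1,
    "calfdif3": 1, "pit": 1, "vein": 1, "altdiagn": -2,
}


def reconstruct_from_score(weights, intercept, slope, intercept_se, slope_se):
    """Rescale scoring-rule weights into logistic regression coefficients.

    Assumes equal standard errors for all items, so that Equation (14)
    gives item_se = sqrt(M) * slope_se with M the number of items.
    """
    coefficients = {name: slope * weight for name, weight in weights.items()}
    item_se = math.sqrt(len(weights)) * slope_se
    return {
        "intercept": intercept,
        "intercept_se": intercept_se,
        "coefficients": coefficients,
        "item_se": item_se,
    }


wells_model = reconstruct_from_score(
    WELLS_WEIGHTS, intercept=-2.66, slope=0.52, intercept_se=0.15, slope_se=0.05
)
print(wells_model["coefficients"])
print(round(wells_model["item_se"], 2))
```

Because every coefficient is the same multiple of its weight, the ranking of patients by the reconstructed linear predictor is identical to the ranking by the original score.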

4.1. Evidence Aggregation

Next, we aggregated the previously published prediction models with the IPD. The approaches considered are: standard logistic regression (ignoring the evidence from the literature), univariate meta-analysis, multivariate meta-analysis and Bayesian Inference. Because a relatively large number of predictors were considered (15 predictors + intercept), including all of them would preclude a multivariate meta-analysis that leads to clinically viable prediction models. Hence we focused on a subset of 4 important predictors: malign, surg, calfdif3 and ddimdich. A summary of the evidence from each of the literature sources and from the IPD is presented in Table 4. These were then pooled. In order to appraise the quality of the derived model (which only included 4 core predictors), we also fitted a more complex prediction model in which we considered the 8 predictors from the Oudega model. The AUC of the resulting model, however, decreased from 0.72 to 0.70, indicating that the simplified model is more generalizable and presents a better reference for comparing the aggregated prediction models. Finally, we compared the simplified aggregated models to a more extensive model derived with univariate meta-analysis using the 8 predictors from the Oudega model. This model yielded the following regression coefficients (and standard error): $\hat\beta_0 = -4.70$ (0.10), $\hat\beta_{\text{calfdif3}} = 0.63$ (0.08), $\hat\beta_{\text{ddimdich}} = 2.45$ (0.28), $\hat\beta_{\text{malign}} = 0.79$ (0.20), $\hat\beta_{\text{notraum}} = 0.58$ (0.15), $\hat\beta_{\text{oachst}} = 1.01$ (0.15), $\hat\beta_{\text{sex}} = 0.54$ (0.11), $\hat\beta_{\text{surg}} = 0.46$ (0.08) and $\hat\beta_{\text{vein}} = 0.48$ (0.09).


Table 4. Overview of reconstructed regression coefficients (and standard errors) of the previously published prediction models in the DVT application. Logistic regression coefficients for DVT outcome.

Prediction model (patients): Wells (593), Modified Wells (530), Gagne (276), Hamilton (309), Oudega (1295), IPD (4) (1028), IPD (8) (1028).

Characteristics: (Intercept), altdiagn, calfdif3, ddimdich, eryt, histdvt, leg, malign, notraum, oachst, par, pit, sex, surg, tend, vein.

-2.66 (0.15)

-1.05 (0.16)

0.52 (0.16)

-2.77 (0.15)

-1.06 (0.17)

0.53 (0.17)

-1.69 (0.10)

-1.77 (0.19)

0.70 (0.19)

-2.72 (0.17) -5.47 (NA)-3.95 (0.28)-4.67 (0.37)

0.43 (0.18) 1.13 (0.34)

3.01 (0.91)

0.86 (0.20)

2.39 (0.29)

0.87 (0.21)

2.40 (0.30)

0.43 (0.18)

0.87 (0.18) 0.53 (0.17)

0.53 (0.17)

0.53 (0.17)

0.63 (0.19)

0.52 (0.16)

0.52 (0.16) 1.69 (0.19)0.87 (0.18) 0.42 (0.24)

0.60 (0.19)

0.75 (0.24)

0.77 (0.36)0.68 (0.36)

0.55 (0.25)

-12.44 (535) 1.17 (0.19)

0.52 (0.16)

0.52 (0.16)

0.53 (0.17)

0.53 (0.17)

0.87 (0.18)

0.43 (0.18)

0.43 (0.18)

0.59 (0.18)

0.38 (0.19)

0.60 (0.21)

-0.04 (0.38)0.52 (0.16)

0.52 (0.16)

0.52 (0.16)

0.53 (0.17)

0.53 (0.17)

0.53 (0.17)

0.53 (0.19)-0.13 (0.37)

0.48 (0.16)0.22 (0.26)

Note: IPD (4) and IPD (8) represent the models derived from the AMUSE-1 study, with 4 and 8 core predictors respectively.

4.2. Results in the DVT case study

Results in Table 5 indicate that the aggregated prediction models, despite including few(er) predictors, are superior to models that do not incorporate evidence from the literature. However, we also noticed that the Oudega model outperforms the aggregated models in terms of AUC (but achieves a similar Brier score). This discrepancy decreases when an extended model with 8 predictors is derived using univariate meta-analysis (AUC = 0.759 and Brier Score = 0.124). These results possibly indicate that the Oudega model considerably contributes to the discriminative ability of the aggregated models. In particular, it is the only literature model with a regression coefficient for ddimdich, a relatively strong predictor in DVT. We noticed that $\hat\beta_{\text{ddimdich}}$ was considerably smaller in the IPD and aggregated models, and much larger in the Oudega model and validation data ($\hat\beta_{\text{ddimdich}} = 3.95$, adjusted for the 4 core predictors), which may partially explain the decrease in discriminative ability. Furthermore, results indicate that different implementations of multivariate meta-analysis perform similarly. Estimated regression coefficients and standard errors, on the other hand, may differ considerably according to the implemented approach. For instance, we noticed that uninformative imputation yielded relatively large standard errors for $\hat\beta_{\text{ddimdich}}$. Possibly, these errors are inflated in multivariate meta-analysis because some of the estimated between-study correlations take extreme values: $\rho(\hat\beta_{\text{ddimdich}}, \hat\beta_0) = -0.79$ and $\rho(\hat\beta_{\text{ddimdich}}, \hat\beta_{\text{malign}}) = -0.97$ [48]. Finally, we noticed that standard errors of aggregated regression coefficients tend to be smallest when estimated with Bayesian Inference.
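As a rough check on how a univariate estimate for ddimdich may arise, the sketch below assumes that the univariate approach amounts to inverse-variance pooling of each coefficient separately across the literature models and the IPD. This is an interpretation of the results and not a statement of the exact implementation. It pools the Oudega estimate, which appears as 3.01 (0.91) in Table 4, with the IPD estimate 2.39 (0.29), and uses the uninformative imputation (0 with variance 100) for the four literature models that do not report ddimdich. Under these assumptions the result is close to the 2.44 (0.28) shown in Table 5. The function name is hypothetical.

```python
import math


def pool_inverse_variance(estimates, variances):
    """Inverse-variance weighted mean of one coefficient and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# ddimdich: Oudega model, IPD, and four literature models without an
# estimate, imputed as 0 with variance 100 (uninformative imputation).
ddimdich_estimates = [3.01, 2.39, 0.0, 0.0, 0.0, 0.0]
ddimdich_variances = [0.91 ** 2, 0.29 ** 2, 100.0, 100.0, 100.0, 100.0]

pooled, pooled_se = pool_inverse_variance(ddimdich_estimates, ddimdich_variances)
print(round(pooled, 2), round(pooled_se, 2))
```

The imputed entries carry a weight of only 0.01 each, which explains why the uninformative imputation barely moves the pooled estimate away from the two observed values.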

5. Discussion

In line with previous research, we found that the aggregation and incorporation of previously published prediction models can indeed improve the performance of a novel prediction model [3, 13, 26, 49]. The case-studies demonstrate that the proposed methods are particularly useful when few participant data are at hand. Although the aggregation methods perform similarly in most scenarios, multivariate meta-analysis and Bayesian Inference tend to yield smaller confidence intervals for the regression coefficients. According to previous research, this may be related to the fact that these approaches take more evidence into account [50], and allow more flexibility. The inclusion of additional evidence (i.e. within-study covariance) may, however, also introduce additional uncertainty and cause estimation difficulties, resulting in an inflation of standard errors [48, 27]. Finally, results indicate that the proposed aggregation approaches may considerably reduce model complexity without compromising predictive accuracy. In particular, by focusing on a set of core predictors, the model can be pruned effectively.

Table 5. Multivariate regression coefficients (and standard error) of the aggregated prediction models in the DVT application.

| Missing literature estimates | Approach | $\hat\beta_0$ | $\hat\beta_{\text{malign}}$ | $\hat\beta_{\text{surg}}$ | $\hat\beta_{\text{calfdif3}}$ | $\hat\beta_{\text{ddimdich}}$ | AUC | BS |
|---|---|---|---|---|---|---|---|---|
| Not applicable | SLR modeling | -3.95 (0.28) | 0.77 (0.36) | -0.13 (0.37) | 0.86 (0.20) | 2.39 (0.29) | 0.723 (0.021) | 0.123 (0.007) |
| Uninformative ($\hat\beta_\phi = 0$ with $\hat\sigma^2_\phi = 100$) | Univariate meta-analysis | -3.94 (0.10) | 0.80 (0.20) | 0.46 (0.08) | 0.63 (0.08) | 2.44 (0.28) | 0.730 (0.019) | 0.123 (0.007) |
| Uninformative | Multivariate meta-analysis | -3.52 (0.10) | 0.75 (0.17) | 0.40 (0.11) | 0.64 (0.10) | 1.95 (1.02) | 0.730 (0.019) | 0.122 (0.007) |
| Uninformative | Bayesian Inference | -3.28 (0.10) | 0.49 (0.14) | 0.45 (0.08) | 0.68 (0.10) | 1.64 (0.20) | 0.738 (0.020) | 0.122 (0.007) |
| Mean imputation (with $\hat\sigma^2_\phi = \sum_{j=1}^{M} \hat\sigma^2_{\phi j}$) | Univariate meta-analysis | -4.08 (0.10) | 0.80 (0.20) | 0.46 (0.08) | 0.63 (0.08) | 2.60 (0.24) | 0.730 (0.019) | 0.123 (0.007) |
| Mean imputation | Multivariate meta-analysis | -3.96 (0.10) | 0.72 (0.18) | 0.40 (0.09) | 0.74 (0.12) | 2.43 (0.45) | 0.738 (0.020) | 0.123 (0.007) |
| Mean imputation | Bayesian Inference | -3.88 (0.10) | 0.72 (0.16) | 0.38 (0.08) | 0.80 (0.10) | 2.30 (0.21) | 0.738 (0.020) | 0.123 (0.007) |

Note: SLR modeling is a standard logistic regression analysis ignoring evidence from the literature, univariate meta-analysis ignores within- and between-study covariance, multivariate meta-analysis and Bayesian Inference restrict missing within-study covariance to zero. The Area under the Receiver Operator Characteristic curve (AUC) and the Brier Score (BS) of the aggregated models are presented together with their standard error as measure of performance in the validation dataset.

In this article we evaluated and compared three evidence aggregation approaches in two case studies using real clinical data. The two case-studies demonstrate that aggregation yields prediction models with an improved discrimination and calibration in a vast majority of scenarios, and results in equivalent performance (compared to the standard approach) in a small minority of situations. The exact preconditions for the latter could not be definitively established here. Possibly, data aggregation is of little added value in scenarios where derivation and validation populations are highly similar and the aggregate data (AD) from the literature are relatively different. The exact causes need to be further explored.

Finally, we have illustrated how the generally unrealistic assumption of consistency in the availability of evidence

across included studies can be relaxed for real-life scenarios. Specifically, we have demonstrated how these methods can

be applied when predictor values, covariance data and even original regression coefficients are unknown. The fact that

aggregation of such evidence succeeds in improving the performance of novel prediction models underscores the value

and versatility of this methodology, as illustrated in the DVT example.

Based on these results from our empirical studies, the following tentative guidelines can be proposed. First, when there are relatively many IPD at hand and evidence from the literature is strongly heterogeneous with these data, the standard approach of fitting a new model (from scratch) to that data set without incorporating or synthesizing the published evidence is acceptable. Secondly, when the evidence from the literature is moderately heterogeneous, or the IPD is relatively small, Bayesian Inference (and multivariate meta-analysis) may improve calibration and discrimination of the newly developed prediction model. Even when the actual degree of heterogeneity is unknown, these approaches may still be preferred to the standard approach of fitting an entirely new model from scratch, and they are relatively easy to implement. Finally, when the evidence from the literature is (relatively) homogeneous, univariate meta-analysis represents a superior approach for improving or updating the newly developed prediction model. Heterogeneity may be quantified using the $I^2$ statistic, where published criteria assign the adjectives low, moderate, and high to $I^2$ values of 25%, 50%, and 75% [38].

Limitations

Although we addressed important aspects of aggregating data in the two case-studies, we did not assess or address the potential impact of selection bias. Conceivably, pooled regression coefficients may be over- or underestimated when important predictors are excluded. This problem may arise when literature models are derived using data-driven selection with stepwise methods, and particularly in small samples [51]. Furthermore, the selection of a core set of predictors may introduce additional bias when the excluded regression coefficients are strongly influential or correlated with the included predictors. This is known as confounding of pooled effects, and usually results in underestimation of pooled regression coefficients (as predictor effects are typically positive in clinical prediction research). It is therefore important to select a reasonable set of core predictors when pooling differently specified prediction models.

Another potential limitation of this article is the fact that only two clinical examples were examined. Conceivably, these may not be representative of the majority of clinical prediction research, and our evaluation of the evidence aggregation methods may not be reproducible in different scenarios. We feel that this is unlikely since the examples used, TBI and DVT, are two typical areas of clinical prediction research for which we included numerous articles (15 and 5, respectively). We welcome the evaluation of these approaches in other case-studies by other authors.

Finally, our DVT application illustrates that aggregated prediction models generally improve the predictive accuracy of

novel prediction models, but do not always outperform previously published prediction models in terms of discriminative

ability. We demonstrated that this situation may occur when a strong predictor is poorly available from the literature,

and not well estimated in the IPD. Moreover, it is well known that the AUC is not the most sensitive measure to assess

incremental value of predictors [52, 53]. For this reason, we also considered model accuracy in terms of the Brier Score.

Conclusion

The incorporation of previously published prediction models into the development of a novel prediction model with a similar set of predictors is both feasible and beneficial when IPD are available. Particularly in small datasets we noticed that the inclusion of such aggregate evidence may provide considerable leverage to improve the regression coefficients and discriminative ability of the new prediction model. However, it remains paramount that researchers identify to what extent the previously published prediction models are comparable with those in the available IPD, as the justification of the considered approaches depends on the clinical relevance of the aggregated model. Future research may therefore focus on the quantification of heterogeneity across prediction models. In conclusion, aggregation yields performance that is better than, or at least equivalent to, that of the standard approach. Real-life clinical examples support these conclusions.

Acknowledgement

We would like to thank Stan Buckens for his input and comments during the review process.

References

1. Bleeker SE, Moll HA, Steyerberg EW, Donders ART, Derksen-Lubsen G, Grobbee DE, Moons KGM. External validation is necessary in prediction

research: a clinical example. Journal of Clinical Epidemiology Sep 2003; 56(9):826–832, doi:10.1016/S0895-4356(03)00207-5.

2. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Annals of Internal Medicine Mar 1999; 130(6):515–524.

3. Moons KGM, Altman DG, Vergouwe Y, Royston P. Prognosis and prognostic research: application and impact of prognostic models in clinical practice.

British Medical Journal 2009; 338, doi:10.1136/bmj.b606.

4. Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KGM. Internal and external validation of predictive models: a simulation study of bias and

precision in small samples. Journal of Clinical Epidemiology May 2003; 56(5):441–447, doi:10.1016/S0895-4356(03)00047-7.

5. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer: New York, 2009.

6. Toll DB, Janssen KJM, Vergouwe Y, Moons KGM. Validation, updating and impact of clinical prediction rules: a review. Journal of Clinical

Epidemiology Nov 2008; 61(11):1085–1094, doi:10.1016/j.jclinepi.2008.04.008.

7. Altman DG. Prognostic models: a methodological framework and review of models for breast cancer. Cancer Investigation Mar 2009; 27(3):235–243,

doi:10.1080/07357900802572110.

8. Meads C, Ahmed I, Riley RD. A systematic review of breast cancer incidence risk prediction models with meta-analysis of their performance. Breast

Cancer Research and Treatment Oct 2011; 132:1–13, doi:10.1007/s10549-011-1818-2.

9. Perel P, Edwards P, Wentz R, Roberts I. Systematic review of prognostic models in traumatic brain injury. BMC Medical Informatics and Decision

Making 2006; 6:38, doi:10.1186/1472-6947-6-38.

10. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials Sep 1986; 7(3):177–188, doi:10.1016/0197-2456(86)90046-2.

11. ReitsmaJB,GlasAS,RutjesAWS,ScholtenRJPM,BossuytPM,ZwindermanAH.Bivariateanalysisofsensitivityandspecificityproducesinformative

summary measures in diagnostic reviews. Journal of Clinical Epidemiology Oct 2005; 58(10):982–990, doi:10.1016/j.jclinepi.2005.02.022.

12. Hemingway H, Riley RD, Altman DG. Ten steps towards improving prognosis research. British Medical Journal 2009; 339:b4184, doi:10.1136/bmj.

b4184.

13. Steyerberg EW, Eijkemans MJ, Van Houwelingen JC, Lee KL, Habbema JD. Prognostic models based on literature and individual patient data in logistic

regression analysis. Statistics in Medicine Jan 2000; 19(2):141–160, doi:10.1002/(SICI)1097-0258(20000130)19:2?141::AID-SIM334?3.0.CO;2-O.

14. Steyerberg EW, Borsboom GJJM, van Houwelingen HC, Eijkemans MJC, Habbema JDF. Validation and updating of predictive logistic regression

models: a study on sample size and shrinkage. Statistics in Medicine Aug 2004; 23(16):2567–2586, doi:10.1002/sim.1844.

15. Van Houwelingen HC, Thorogood J. Construction, validation and updating of a prognostic model for kidney graft survival. Statistics in Medicine Sep

1995; 14(18):1999–2008, doi:10.1002/sim.4780141806.

16. Becker B, Wu M. The synthesis of regression slopes in meta-analysis. Statistical Science 2007; 22:414–429, doi:10.1214/07-STS243.

Statist. Med. 2012

www.netstorm.be

13

Page 14

T. P. A. Debray, H. Koffijberg, Y. Vergouwe et al.

17. de Leeuw C, Klugkist I. Augmenting data with published results in Bayesian linear regression. Multivariate Behavioral Research 2012; 47(3):369–391,

doi:10.1080/00273171.2012.673957.

18. Higgins J, Thompson S, Deeks J, Altman D. Statistical heterogeneity in systematic reviews of clinical trials: a critical appraisal of guidelines and practice. Journal of Health Services Research and Policy Jan 2002; 7(1):51–61, doi:10.1258/1355819021927674.

19. Balázs K, Hidegkuti I, De Boeck P. Detecting Heterogeneity in Logistic Regression Models. Applied Psychological Measurement 2006; 30(4):322–344, doi:10.1177/0146621605286315.

20. Riley RD, Simmonds MC, Look MP. Evidence synthesis combining individual patient data and aggregate data: a systematic review identified current practice and possible methods. Journal of Clinical Epidemiology May 2007; 60(5):431–439, doi:10.1016/j.jclinepi.2006.09.009.

21. Marmarou A, Lu J, Butcher I, McHugh GS, Mushkudiani NA, Murray GD, Steyerberg EW, Maas AIR. IMPACT database of traumatic brain injury: design and description. Journal of Neurotrauma Feb 2007; 24(2):239–250, doi:10.1089/neu.2006.0036.

22. Steyerberg EW, Mushkudiani N, Perel P, Butcher I, Lu J, McHugh GS, Murray GD, Marmarou A, Roberts I, Habbema JDF, et al. Predicting Outcome after Traumatic Brain Injury: Development and International Validation of Prognostic Scores Based on Admission Characteristics. PLoS Medicine Aug 2008; 5(8):e165, doi:10.1371/journal.pmed.0050165.

23. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011.

24. Jackson D, White IR, Thompson SG. Extending DerSimonian and Laird’s methodology to perform multivariate random effects meta-analyses. Statistics in Medicine May 2010; 29(12):1282–1297, doi:10.1002/sim.3602.

25. Sutton AJ, Kendrick D, Coupland C. Meta-analysis of individual- and aggregate-level data. Statistics in Medicine 2008; 27(5):651–669, doi:10.1002/sim.2916.

26. van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Statistics in Medicine Feb 2002; 21(4):589–624, doi:10.1002/sim.1040.

27. Ishak KJ, Platt RW, Joseph L, Hanley JA. Impact of approximating or ignoring within-study covariances in multivariate meta-analyses. Statistics in Medicine Feb 2008; 27(5):670–686, doi:10.1002/sim.2913.

28. Riley RD, Lambert PC, Staessen JA, Wang J, Gueyffier F, Thijs L, Boutitie F. Meta-analysis of continuous outcomes combining individual patient data and aggregate data. Statistics in Medicine May 2008; 27(11):1870–1893, doi:10.1002/sim.3165.

29. Hyder AA, Wunderlich CA, Puvanachandra P, Gururaj G, Kobusingye OC. The impact of traumatic brain injuries: a global perspective. NeuroRehabilitation 2007; 22(5):341–353.

30. Levack WMM, Kayes NM, Fadyl JK. Experience of recovery and outcome following traumatic brain injury: a metasynthesis of qualitative research. Disability and Rehabilitation 2010; 32(12):986–999, doi:10.3109/09638281003775394.

31. Jennett B, Teasdale G, Braakman R, Minderhoud J, Knill-Jones R. Predicting outcome in individual patients after severe head injury. Lancet May 1976; 1(7968):1031–1034, doi:10.1016/S0140-6736(76)92215-7.

32. Mushkudiani NA, Hukkelhoven CWPM, Hernández AV, Murray GD, Choi SC, Maas AIR, Steyerberg EW. A systematic review finds methodological improvements necessary for prognostic models in determining traumatic brain injury outcomes. Journal of Clinical Epidemiology Apr 2008; 61(4):331–343, doi:10.1016/j.jclinepi.2007.06.011.

33. Abu-Hanna A, Lucas PJ. Prognostic models in medicine. AI and statistical approaches. Methods of Information in Medicine Mar 2001; 40(1):1–5.

34. Brier GW. Verification of forecasts expressed in terms of probability. Monthly Weather Review 1950; 78(1):1–3.

35. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology Apr 1982; 143(1):29–36.

36. Bailey I, Bell A, Gray J, Gullan R, Heiskanan O, Marks PV, Marsh H, Mendelow DA, Murray G, Ohman J, et al. A trial of the effect of nimodipine on outcome after head injury. Acta Neurochir (Wien) 1991; 110(3-4):97–105.

37. The European Study Group on Nimodipine in Severe Head Injury. A multicenter trial of the efficacy of nimodipine on outcome after severe head injury. Journal of Neurosurgery 1994; 80(5):797–804, doi:10.3171/jns.1994.80.5.0797.

38. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. British Medical Journal Sep 2003; 327(7414):557–560, doi:10.1136/bmj.327.7414.557.

39. Wells PS, Anderson DR, Bormanis J, Guy F, Mitchell M, Gray L, Clement C, Robinson KS, Lewandowski B. Value of assessment of pretest probability of deep-vein thrombosis in clinical management. Lancet Dec 1997; 350(9094):1795–1798.

40. Wells PS, Anderson DR, Rodger M, Forgie M, Kearon C, Dreyer J, Kovacs G, Mitchell M, Lewandowski B, Kovacs MJ. Evaluation of D-dimer in the diagnosis of suspected deep-vein thrombosis. The New England Journal of Medicine Sep 2003; 349(13):1227–1235, doi:10.1056/NEJMoa023153.

41. Gagne P, Simon L, Le Pape F, Bressollette L, Mottier D, Le Gal G. [Clinical prediction rule for diagnosing deep vein thrombosis in primary care]. La Presse Médicale Apr 2009; 38(4):525–533, doi:10.1016/j.lpm.2008.09.022.

42. Subramaniam RM, Snyder B, Heath R, Tawse F, Sleigh J. Diagnosis of lower limb deep venous thrombosis in emergency department patients: performance of Hamilton and modified Wells scores. Annals of Emergency Medicine Dec 2006; 48(6):678–685, doi:10.1016/j.annemergmed.2006.04.010.

43. Geersing GJ, Janssen KJ, Oudega R, van Weert H, Stoffers H, Hoes A, Moons K, on behalf of the AMUSE Study. Diagnostic classification in patients with suspected deep venous thrombosis: physicians’ judgement or a decision rule? Journal of the Royal College of General Practitioners Oct 2010; 60(579):742–748, doi:10.3399/bjgp10X532387.

44. Oudega R, Moons KGM, Hoes AW. Ruling out deep venous thrombosis in primary care. A simple diagnostic algorithm including D-dimer testing. Thrombosis and Haemostasis Jul 2005; 94(1):200–205, doi:10.1160/TH04-12-0829.

45. Büller HR, Ten Cate-Hoek AJ, Hoes AW, Joore MA, Moons KGM, Oudega R, Prins MH, Stoffers HEJH, Toll DB, van der Velde EF, et al. Safely ruling out deep venous thrombosis in primary care. Annals of Internal Medicine Feb 2009; 150(4):229–235.

46. Toll DB, Oudega R, Vergouwe Y, Moons KGM, Hoes AW. A new diagnostic rule for deep vein thrombosis: safety and efficiency in clinically relevant subgroups. Family Practice Feb 2008; 25(1), doi:10.1093/fampra/cmm075.

47. Altman DG, Bland JM. How to obtain the confidence interval from a p value. British Medical Journal 2011; 343:d2090, doi:10.1136/bmj.d2090.

48. Riley RD, Abrams KR, Sutton AJ, Lambert PC, Thompson JR. Bivariate random-effects meta-analysis and the estimation of between-study correlation. BMC Medical Research Methodology 2007; 7:3, doi:10.1186/1471-2288-7-3.

49. Janssen KJM, Vergouwe Y, Kalkman CJ, Grobbee DE, Moons KGM. A simple method to adjust clinical prediction models to local circumstances. Canadian Journal of Anaesthesia Mar 2009; 56(3):194–201, doi:10.1007/s12630-009-9041-x.

50. Jackson D, Riley R, White IR. Multivariate meta-analysis: Potential and promise. Statistics in Medicine Jan 2011; 30(20):2481–2498, doi:10.1002/sim.4172.

51. Steyerberg EW, Eijkemans MJ, Habbema JD. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. Journal of Clinical Epidemiology Oct 1999; 52(10):935–942, doi:10.1016/S0895-4356(99)00103-1.

52. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation Feb 2007; 115(7):928–935, doi:10.1161/CIRCULATIONAHA.106.672402.

53. Pencina MJ, D’Agostino Sr RB, D’Agostino Jr RB, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine Jan 2008; 27(2):157–172, doi:10.1002/sim.2929.
