Data-driven preventive maintenance for a heterogeneous machine portfolio∗

Laurens Deprez†1, Katrien Antonio2,3, Joachim Arts1, and Robert Boute2,4,5

1Luxembourg Centre for Logistics and Supply Chain Management, University of Luxembourg, Luxembourg.

2Faculty of Economics and Business, KU Leuven, Belgium.

3Faculty of Economics and Business, University of Amsterdam, The Netherlands.

4Technology & Operations Management Area, Vlerick Business School, Belgium.

5VCCM, Flanders Make, Belgium.

January 6, 2023

Abstract

We describe a data-driven approach to optimize periodic maintenance policies for a heterogeneous port-

folio with diﬀerent machine proﬁles. When insuﬃcient data are available per proﬁle to assess failure

intensities and costs accurately, we pool the data of all machine proﬁles and evaluate the eﬀect of (ob-

servable) machine characteristics by calibrating appropriate statistical models. This reduces maintenance

costs compared to a stratiﬁed approach that splits the data into subsets per proﬁle and a uniform ap-

proach that treats all proﬁles the same.

Keywords: preventive maintenance, data pooling, proportional hazards, small data

1 Introduction

Despite advances in condition monitoring, time- or age-based periodic preventive maintenance is still com-

mon practice in many companies. Under this policy, preventive maintenance (PM) interventions are sched-

uled at a periodic interval, either measured in calendar time or running hours. This interval is determined by

trading oﬀ the failure rate and corresponding failure costs against the cost of performing preventive mainte-

nance. Most literature assumes that the failure behaviour, or failure intensity, is known, in line with Barlow

and Hunter [1960]. In practice, however, the failure distribution may not be known and should be estimated

based on historical data. This may introduce some challenges. A small machine population may not provide

suﬃcient maintenance and failure data for an accurate estimation. The lack of suﬃcient data to accurately

estimate regression parameters is known as the ‘small data’ problem. Even when the machine population is

of adequate size, the machine population might be heterogeneous and as such may contain multiple diﬀerent

machine proﬁles. This heterogeneity can be caused by diﬀerent running conditions, environmental factors

or diﬀerent manufacturing plants where the machines were assembled and might induce diﬀerent failure

intensities and costs. Estimation by stratifying, or splitting the data, per machine proﬁle might again induce

the small data problem. Aggregating these machine data without taking into account the heterogeneity in

the machine population leads to a maintenance policy that is optimal for an average machine, but it is not

tailored to a particular machine (proﬁle). The novelty of our approach is that we introduce a multivariate

time-to-failure and cost model such that the data of all machines (across the diﬀerent machine proﬁles) can

be pooled. This allows differentiating the PM policy over the different machine profiles, while at the same time

∗To appear in Operations Research Letters

†corresponding author: laurens.deprez@uni.lu, 6 Rue Richard Coudenhove-Kalergi, 1359 Luxembourg, Luxembourg


obtaining accurate time-to-failure and maintenance cost estimates for a heterogeneous machine population.

The latter is not possible when stratifying data per machine proﬁle.

Service providers, who maintain the assets of their customers, can make use of such an approach. Their

maintenance portfolio provides them potentially with ample historical maintenance and failure data, yet

data per machine proﬁle might be limited. Our objective is to optimize the periodic maintenance interval

by explicitly considering the heterogeneity in the machine portfolio that is induced by observable machine

characteristics. Our policy optimization applies to a ﬁnite horizon, motivated by the ﬁnite duration of service

maintenance contracts or (extended) warranties [Nakagawa and Mizutani, 2009; Dursun et al., 2022]. We

focus on observable machine characteristics that remain constant during the contract horizon, e.g. running

conditions, country, operating environment and industry type. This categorizes each machine in a speciﬁc

machine proﬁle, speciﬁed by these machine characteristics. By tailoring the maintenance policy to the

machine proﬁle, the resulting cost optimization may lead to higher proﬁt margins or lower contract prices.

Population heterogeneity has been studied by Dursun et al. [2022], Abdul-Malak et al. [2019], and

de Jonge et al. [2015], among others. In Dursun et al. [2022], Abdul-Malak et al. [2019], and de Jonge et al.

[2015], parts (or machines) originate from multiple, mostly two, diﬀerent sub-populations with diﬀerent fail-

ure distributions. It is unknown to which sub-population a spare part belongs but the failure distribution of

each sub-population itself is known. In their analysis, the sub-population of the part is inferred by observing

its failure behaviour and adapting the maintenance policy accordingly. In contrast, we consider the case

where it is known to which sub-population machines belong as machines are labeled by their machine pro-

ﬁle, characterized by observable machine characteristics. The underlying failure distribution, however, is not

known and we infer the failure behaviour and maintenance costs for each machine profile from data. Drent

et al. [2020] also deal with an unknown failure distribution but assume population homogeneity. In their

paper the failure distribution and maintenance policy are inferred by means of a Bayesian approach as infor-

mation accumulates during the machine operation. The information accumulation process consists of both

censored (i.e., preventive replacements) and uncensored (i.e., corrective replacements) observations of the un-

derlying lifetime distribution. This leads to an inherent exploration-exploitation trade-oﬀ. Exploration (i.e.,

a longer periodic maintenance interval) increases the probability of corrective replacement, but at the same

time leads to accumulating more, valuable, information. In our approach there is no exploration-exploitation

trade-oﬀ as the data over all machine proﬁles has been collected prior to the moment of estimation.

Our methodology relies on a data set with historical failure and maintenance data of a heterogeneous

machine portfolio. We learn the failure behaviour and costs and assess the effect of (observable) machine

characteristics by calibrating appropriate statistical models. The calibrated statistical models enable the

optimization of each machine proﬁle’s periodic maintenance interval. We study how the resulting mainte-

nance policies are more cost-eﬀective than (1) a uniform approach that disregards the machine proﬁles, and

(2) a stratiﬁed approach that splits the data per machine proﬁle to take into account each machine proﬁle.

The next section introduces our reliability model, the optimality condition for the periodic maintenance

interval and the predictive models to calibrate the failure behaviour and costs. Section 3 numerically evaluates

how and when our data pooling approach reduces maintenance costs. Section 4 concludes.

2 Maintenance policies for a heterogeneous machine population

We ﬁrst set out the details of our reliability model. We then specify the optimality condition for a periodic,

preventive maintenance policy over a ﬁnite horizon for a heterogeneous machine population. Finally we

establish the statistical models and their calibration to data to estimate the failure rate, and the costs of

failure and preventive maintenance.


2.1 Reliability model

We consider the optimization of the number of preventive maintenance interventions, n, for a machine

during a ﬁnite horizon [0,∆t], e.g. the coverage period of a service agreement. In our analysis we will

assume that failures require minimal, as-good-as-old, corrective maintenance [see e.g. Barlow and Proschan,

1965; Lindqvist et al., 2006; Wu and Zuo, 2010; Doyen and Gaudoin, 2011; Arts and Basten, 2018] and

planned, preventive maintenance actions are perfect. The latter justifies a periodic maintenance policy, where

preventive maintenance is executed with a fixed interval $\Delta t_{pm} = \frac{\Delta t}{n+1}$, at times $t_k = k \Delta t_{pm}$ ($k = 1, \ldots, n$).

Each machine is characterized by a set of machine characteristics or covariates x, e.g. operating conditions,

industry type, or country, that remain the same during the planning horizon.

Failure times We specify the machine-dependent failure intensity function λ(t) in each maintenance in-

terval under the Cox proportional hazards assumption [Cox,1975] with the same baseline failure intensity

function $\lambda_0(t)$ for each machine and $\beta$ representing the impact of the machine characteristics $x$. The proportional

hazards model is a versatile model to analyze the effect of operating conditions (or covariates) on

the lifetime of a system. Kumar and Westberg [1997] have shown that models from the proportional hazards

family appear to be the better ones for analyzing the eﬀect of the covariates. It is also very convenient to

add terms or cross products. The practical value of the proportional hazards model has been demonstrated,

for instance, by Barabadi et al. [2014] in a case study in mining equipment to identify the covariates that in-

ﬂuence the reliability. We acknowledge that, like any multi-variate model, there is a risk of mis-speciﬁcation

in proportional hazard models. We show in Section 3.5 that the beneﬁts of data pooling outweigh this risk.

The machine-dependent failure intensity function λ(t) deﬁnes a non-homogeneous Poisson process [Moller

and Waagepetersen, 2003] for the arrival of failures. With $t$ the time since the last PM intervention, the

machine-specific failure intensity function $\lambda(t)$ is then characterized by

$$\lambda(t) = \lambda_0(t) \exp(\beta' \cdot x) \quad \text{for } t \in [0, t_k - t_{k-1}) \ (\forall k), \tag{1}$$

where $t_0 = 0$ and $t_{n+1} = \Delta t$, respectively referring to the start and end of the planning horizon. Assuming

perfect preventive maintenance, the failure intensity function λ(t) is the same for each inter-PM interval.

The baseline failure intensity λ0(t) can take any parametric form. We deﬁne λ0(t) by a Weibull failure

intensity function, which is regularly used in the reliability literature to model failure times [Bobbio et al.,

1980; Wang et al., 2000; Wu, 2019], with scale parameter $\alpha \in \mathbb{R}^+_0$ and shape $\gamma \in \mathbb{R}^+$,

$$\lambda_0(t) = \gamma \alpha^{\gamma} t^{\gamma - 1}.$$
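To make the intensity model concrete, the following minimal Python sketch evaluates the Weibull baseline, its cumulative form $(\alpha t)^{\gamma}$ obtained by integrating the baseline from 0 to $t$, and the proportional hazards intensity of Eq. (1). The function names and the numerical values used in any call are illustrative assumptions and are not taken from the paper.

```python
import math


def baseline_intensity(t, alpha, gamma):
    """Weibull baseline intensity: gamma * alpha**gamma * t**(gamma - 1)."""
    return gamma * alpha ** gamma * t ** (gamma - 1)


def baseline_cumulative(t, alpha, gamma):
    """Integral of the baseline intensity over [0, t], i.e. (alpha * t)**gamma."""
    return (alpha * t) ** gamma


def intensity(t, x, beta, alpha, gamma):
    """Proportional hazards intensity: baseline scaled by exp(beta' x)."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return baseline_intensity(t, alpha, gamma) * math.exp(linear_predictor)
```

With a shape of 1 the baseline reduces to a constant rate, so ageing (and hence the benefit of preventive maintenance) only appears for shapes above 1.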

Costs of maintenance The expected preventive maintenance cost, cp(x), and expected failure cost,

cf(x) are modeled with a gamma generalized linear model (GLM) to account for the impact of the machine

characteristics $x$. Delong et al. [2021] argue that gamma models are appropriate to model cost data given

their positive support and right-skewed distribution. We denote the scale of the gamma distribution by $\theta_f$

for the failure costs. For the expected failure costs, the GLM with categorical explanatory variables $x$ is

specified by [see Ohlsson and Johansson, 2010, for an overview]

$$c_f(x) = \exp\left(\beta_f' \cdot (1, x)\right) = \exp\left(\beta_{f,0} + \sum_{j=1}^{q} \beta_{f,j} x_j\right), \tag{2}$$

with $\beta_f' \cdot (1, x)$ the linear predictor. The impact of the machine characteristics $x$ on the cost of failure is

captured by $\beta_f$. We remark that the first component of the vector $\beta_f$ acts as an intercept and consequently

the length of the vector $\beta_f$ is one plus the length of vector $x$, i.e. $1 + q$. The expected preventive maintenance

costs $c_p(x)$ are similarly specified. Their respective parameters are denoted with subscript $p$. We remark


that we set the parameters for the cost of preventive maintenance cp(x) and the cost of failure cf(x) such

that cp(x)< cf(x) for each combination of machine characteristics x.

Total maintenance costs The expected total maintenance cost over the horizon [0,∆t] for a policy with

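preventive maintenance at (consecutive) times $\{t_1, t_2, \ldots, t_n\}$ ($\subset [0, \Delta t]$), on a machine with characteristics

$x$ is then defined by

$$\begin{aligned}
C_x(\{t_1, t_2, \ldots, t_n\}) &= c_f(x) \sum_{k=0}^{n} \int_0^{t_{k+1} - t_k} \lambda(t)\,dt + n\,c_p(x) \\
&= c_f(x) \sum_{k=0}^{n} \int_0^{t_{k+1} - t_k} \lambda_0(t) \exp(\beta' \cdot x)\,dt + n\,c_p(x) \\
&= c_f(x) \exp(\beta' \cdot x) \sum_{k=0}^{n} \Lambda_0(t_{k+1} - t_k) + n\,c_p(x),
\end{aligned} \tag{3}$$

where $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$. The integral $\int_0^{t_{k+1} - t_k} \lambda(u)\,du$ represents the expected number of failures between

$t_k$ and $t_{k+1}$. The latter is a property of non-homogeneous Poisson processes [Lewis and Shedler, 1979].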

By summing over all preventive maintenance intervals we obtain all failures during the horizon [0,∆t].

Equation (3) can be simplified for periodic maintenance with fixed interval $\Delta t_{pm} = \frac{\Delta t}{n+1}$. In this case, the

expected total maintenance cost $C_x(n)$ is a function of the number of preventive maintenance interventions, $n$,

$$C_x(n) = c_f(x) \exp(\beta' \cdot x)(n+1)\Lambda_0(\Delta t_{pm}) + n\,c_p(x). \tag{4}$$

Observe that the impact of the machine characteristics $x$ on the failure rate, i.e. $\exp(\beta' \cdot x)$, is equivalent to

an increase in the expected failure costs with the same factor, $\exp(\beta' \cdot x)$.
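As a worked illustration (not part of the original derivation), substituting the Weibull baseline of Section 2.1, for which $\Lambda_0(t) = (\alpha t)^{\gamma}$, into Eq. (4) gives a closed form in $n$. Treating $n$ as continuous, an assumption made here only for intuition, and taking $\gamma > 1$, the stationary point follows directly:

```latex
\begin{align*}
C_x(n) &= c_f(x)\exp(\beta'\cdot x)\,(n+1)\left(\frac{\alpha\,\Delta t}{n+1}\right)^{\gamma} + n\,c_p(x) \\
       &= c_f(x)\exp(\beta'\cdot x)\,(\alpha\,\Delta t)^{\gamma}\,(n+1)^{1-\gamma} + n\,c_p(x), \\
\frac{dC_x(n)}{dn} = 0 \;&\Longleftrightarrow\;
n + 1 = \left(\frac{(\gamma-1)\,c_f(x)\exp(\beta'\cdot x)\,(\alpha\,\Delta t)^{\gamma}}{c_p(x)}\right)^{1/\gamma}.
\end{align*}
```

The continuous solution is only a guide to the order of magnitude of $n$; the integer optimum is the one characterized in Section 2.2.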

2.2 Optimality condition for a diﬀerentiated, periodic policy

We denote by $n^\star$ the optimal number of PMs that minimizes the total maintenance costs in Eq. (4) during the

planning horizon $[0, \Delta t]$. Although $n^\star$ depends on the machine characteristics $x$, we will adopt $n^\star$ instead

of $n^\star(x)$ to simplify notation.

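Proposition 1. If the baseline failure intensity $\lambda_0(t)$ is strictly increasing in $t$ ($\in \mathbb{R}^+$), then the optimal

number of PMs, $n^\star$, is the smallest $n \in \mathbb{N}_0$ that satisfies

$$(n+1)\Lambda_0\left(\frac{\Delta t}{n+1}\right) - (n+2)\Lambda_0\left(\frac{\Delta t}{n+2}\right) \le \frac{c_p(x)}{c_f(x) \exp(\beta' \cdot x)}. \tag{5}$$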

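Proof. The second order derivative of the total maintenance cost (for $n \in \mathbb{R}^+$) is

$$\frac{d^2 C_x(n)}{dn^2} = c_f(x) \exp(\beta' \cdot x) \frac{\Delta t^2}{(n+1)^3} \lambda_0'\left(\frac{\Delta t}{n+1}\right),$$

with $\lambda_0'(t)$ the first order derivative of $\lambda_0(t)$. If $\lambda_0(t)$ is strictly increasing, i.e. $\lambda_0'(t) > 0$ ($\forall t \in \mathbb{R}^+$), then $C_x(n)$

is convex on $\mathbb{R}^+$. The optimal $n^\star$ satisfies the first order optimality condition, i.e. it is the smallest $n$ ($\in \mathbb{N}_0$)

for which $\Delta_1[C_x](n) \ge 0$, where the forward difference of the cost function is $\Delta_1[C_x](n) = C_x(n+1) - C_x(n)$. ■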
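The optimality condition lends itself to a direct search over $n$. The sketch below, in Python, implements the left-hand side of Eq. (5) for a Weibull baseline and returns the smallest $n$ that satisfies the condition. The function names, the search cap `n_max` and the parameter values in the usage note are assumptions for illustration; the paper's own simulation parameters are listed in its Table 3, which is not reproduced here.

```python
def cumulative_baseline(t, alpha, gamma):
    """Weibull cumulative baseline intensity (alpha * t)**gamma."""
    return (alpha * t) ** gamma


def expected_total_cost(n, horizon, c_p, c_f, hazard_factor, alpha, gamma):
    """Eq. (4): expected failure cost over n + 1 equal intervals plus n PM costs."""
    interval = horizon / (n + 1)
    expected_failures = hazard_factor * (n + 1) * cumulative_baseline(interval, alpha, gamma)
    return c_f * expected_failures + n * c_p


def condition_lhs(n, horizon, alpha, gamma):
    """Left-hand side of Eq. (5)."""
    return ((n + 1) * cumulative_baseline(horizon / (n + 1), alpha, gamma)
            - (n + 2) * cumulative_baseline(horizon / (n + 2), alpha, gamma))


def optimal_number_of_pms(horizon, c_p, c_f, hazard_factor, alpha, gamma, n_max=10_000):
    """Smallest n >= 0 satisfying Eq. (5); requires an increasing baseline (gamma > 1)."""
    if gamma <= 1:
        raise ValueError("Proposition 1 assumes a strictly increasing baseline (gamma > 1).")
    threshold = c_p / (c_f * hazard_factor)
    for n in range(n_max + 1):
        if condition_lhs(n, horizon, alpha, gamma) <= threshold:
            return n
    raise RuntimeError("No n up to n_max satisfies the optimality condition.")


# Assumed illustrative values: 5-year horizon, c_p = 30, c_f = 300, exp(beta' x) = 1,
# alpha = 0.7 and gamma = 2 (mean time to failure of roughly 1.27 years).
n_star = optimal_number_of_pms(5.0, 30.0, 300.0, 1.0, 0.7, 2.0)
cost_star = expected_total_cost(n_star, 5.0, 30.0, 300.0, 1.0, 0.7, 2.0)
```

Under these assumed values the search returns 10 PMs at an expected cost of about 634.09. The same cost function can also price a non-optimal prescription: 9 PMs would cost 637.5 under these values, roughly 100.5% of the optimum.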

2.3 Calibration of predictive models to estimate failure intensity and costs

The optimal periodic maintenance policy depends on the failure intensity and the costs of failure and

maintenance of each machine proﬁle (as deﬁned by the characteristics x). We now describe the statistical

models and their calibration to estimate this failure intensity and the costs given a data set with failure and

maintenance records.


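Time-to-failure model From the failure intensity function for each inter-PM interval, $\lambda(t)$ (Eq. (1)), and

the timings of the PM interventions, $\{t_1, t_2, \ldots, t_n\}$, we define the failure intensity function $\lambda_T(t)$ in absolute

time, i.e. since the start of the observation horizon,

$$\lambda_T(t) = \begin{cases}
\lambda_0(t) \exp(\beta' \cdot x) & \text{if } 0 \le t < t_1 \\
\lambda_0(t - t_1) \exp(\beta' \cdot x) & \text{if } t_1 \le t < t_2 \\
\lambda_0(t - t_2) \exp(\beta' \cdot x) & \text{if } t_2 \le t < t_3 \\
\quad \vdots \\
\lambda_0(t - t_n) \exp(\beta' \cdot x) & \text{if } t_n \le t \le \Delta t.
\end{cases}$$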

The timings of the PM interventions, or the PM interval, should be known to set up the expression for

$\lambda_T(t)$. This is not an issue, however, since the PM interventions are planned upfront. Denote by $R(t \mid t_-)$ the

reliability, or survival function, for the time-to-next-failure, with $t_-$ the time of the previous failure,

$$R(t \mid t_-) = \exp\left(-\int_{t_-}^{t} \lambda_T(u)\,du\right).$$

To calibrate the time-to-failure model and consequently ﬁnd estimates for the parameters of the baseline

failure intensity function $\lambda_0(t)$, i.e. $\alpha$ and $\gamma$, and the impact of the covariates $\beta$, we maximize the time-to-failure

log-likelihood. The events of interest are the failures as well as the end of the observation horizon for

each machine in the data. The latter acts as a censoring event. Each event is characterized by the vector

$(t_f, t_{f,-}, x, \delta)$, where $t_f$ is the event time, $t_{f,-}$ the time of the previous event, $x$ the machine characteristics

and $\delta \in \{0, 1\}$, where $\delta = 0$ indicates a censored event, i.e. the end of the observation horizon, and $\delta = 1$ a

failure. Summing over failure events $j$ on machine $i$ in the data provides the expression for the log-likelihood

for the time-to-failure data,

$$L(\alpha, \gamma, \beta) = \sum_{i=1}^{N} \sum_{j=1}^{f_i} \left[ \delta_j \log\left(\lambda_T(t_{f,j})\right) + \log\left(R(t_{f,j} \mid t_{f,j,-})\right) \right], \tag{6}$$

where $N$ is the total number of machines and $f_i$ is the number of failures (including the end of the observation

horizon) on machine $i$. Consequently, the failures contribute with the logarithm of the probability density

function and the end of the observation horizon with the logarithm of the reliability. Maximizing the

log-likelihood $L(\alpha, \gamma, \beta)$ for the time-to-failure model leads to estimates for the parameters $\alpha$ and $\gamma$ of the

baseline failure intensity function $\lambda_0(t)$ and the impact of the covariates $\beta$.
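The log-likelihood in Eq. (6) can be evaluated in a few lines once the cumulative intensity in absolute time is available. The Python sketch below is a minimal illustration under stated assumptions: the data layout (a list of dictionaries with keys `x`, `pm_times` and `events`), the function names, and the convention that the first event on a machine has the start of the horizon (time 0) as its previous event are all hypothetical choices, not the authors' implementation. It also assumes that PM times are sorted and that no failure coincides exactly with a PM time.

```python
import math
from bisect import bisect_right


def cumulative_intensity(t, pm_times, x, beta, alpha, gamma):
    """Integral of lambda_T over [0, t]; the Weibull clock restarts at every PM."""
    k = bisect_right(pm_times, t)  # number of PMs carried out at or before t
    starts = [0.0] + list(pm_times[:k])
    ends = list(pm_times[:k]) + [t]
    total = sum((alpha * (end - start)) ** gamma for start, end in zip(starts, ends))
    return math.exp(sum(b * xi for b, xi in zip(beta, x))) * total


def log_intensity(t, pm_times, x, beta, alpha, gamma):
    """log lambda_T(t), with the machine age measured since the last PM before t."""
    k = bisect_right(pm_times, t)
    age = t - (pm_times[k - 1] if k > 0 else 0.0)
    log_baseline = math.log(gamma) + gamma * math.log(alpha) + (gamma - 1) * math.log(age)
    return log_baseline + sum(b * xi for b, xi in zip(beta, x))


def log_likelihood(machines, alpha, gamma, beta):
    """Eq. (6): sum over machines and events of delta * log(lambda_T) + log(R)."""
    total = 0.0
    for machine in machines:
        x, pm_times = machine["x"], machine["pm_times"]
        previous = 0.0  # assumed start of the observation horizon
        for time, delta in machine["events"]:
            if delta == 1:  # failure: add the log-intensity at the failure time
                total += log_intensity(time, pm_times, x, beta, alpha, gamma)
            # log R(time | previous) = -(cumulative intensity between the two events)
            total -= (cumulative_intensity(time, pm_times, x, beta, alpha, gamma)
                      - cumulative_intensity(previous, pm_times, x, beta, alpha, gamma))
            previous = time
    return total
```

Passing the negative of this function to a general-purpose numerical optimizer would give the maximum likelihood estimates; which optimizer the authors used is not stated in this section.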

Costs model We calibrate separate gamma generalized linear models (GLMs), as speciﬁed in Eq. (2), for

the expected preventive maintenance costs cp(x) and the expected failure costs cf(x) taking into account the

machine characteristics $x$. This provides estimates for $\beta_p$ and $\beta_f$, the impact of the machine characteristics

$x$ on the preventive maintenance costs and on the failure costs respectively, and for $\theta_p$ and $\theta_f$, the scale

parameter of the gamma distribution of the preventive maintenance costs and of the failure costs respectively.

Benchmark approaches The approach described above makes use of all available failure and mainte-

nance data by pooling the data across all machines. We therefore refer to this approach as the pooling

approach. Yet, by specifying and calibrating the impact of the machine characteristics $x$ on the failure intensity

and costs, our approach can differentiate the optimal periodic policy per machine profile. We specify

two benchmark approaches. First, a uniform approach that aggregates the data, but disregards the machine

profiles, calibrates the time-to-failure and costs models ignoring the machine characteristics $x$. This is identical

to setting $\beta = 0$ when optimizing the log-likelihood in Eq. (6). Similarly, we also ignore the machine


characteristics $x$ when calibrating the cost models. This gives us only estimates for $\beta_{p,0}$, $\theta_p$, $\beta_{f,0}$ and $\theta_f$.

This approach also makes use of all available data, but the models are calibrated as if all machines would

have no machine-speciﬁc characteristics, resulting in a uniform PM policy with identical, optimized number

of PM interventions $n^\star$ for each machine (profile). Second, we also benchmark against a stratified approach.

For this approach, we split or stratify the data in subsets that only contain data on a single machine proﬁle,

i.e. combination of machine characteristics, and then calibrate the time-to-failure and costs models in similar

fashion to the uniform approach, i.e. by ignoring the machine characteristics x, for each subset. The models

for each machine proﬁle serve as input to optimize the number of PM interventions. Although these policies

are capable of diﬀerentiating over the diﬀerent machine proﬁles, they use less data to calibrate the models.

The latter can lead to less accurate estimates and inferior cost performance of the resulting maintenance

policies. In our numerical analysis we will also benchmark against the oracle approach. The oracle knows

the distribution of the failure behaviour and costs, and their associated parameters and is hence equivalent

to assuming an inﬁnite number of observations. Here we use the true distributions (rather than estimates)

to ﬁnd the optimal number of preventive maintenance interventions for each machine proﬁle. These oracle

policies serve as a lower bound on the costs.

3 Results and insights

We set up a numerical experiment to assess the value of diﬀerentiating the PM policy per machine proﬁle by

pooling the data using the approach described in the previous section. To do so, we generate a data set with

maintenance and failure records from a heterogeneous machine portfolio with diﬀerent machine proﬁles. The

simulation engine to generate this data set is described in Section 3.1. We calibrate the parameters of the

time-to-failure and cost models making use of maximum likelihood estimation (described in Section 2.3), and

apply the optimality condition (Proposition 1) to prescribe the optimal number of preventive maintenance

interventions for each machine proﬁle. We report the cost performance of this approach in Section 3.2 and

illustrate how each approach performs with limited amounts of data in Section 3.3. In Section 3.4 we check

whether it is actually worth diﬀerentiating the PM policy at all compared to adopting a uniform PM policy

that is identical across all machine proﬁles. Finally, Section 3.5 studies model mis-speciﬁcation where the

ﬁtted model has a diﬀerent parametric form from the model from which the data were simulated.

3.1 Simulation engine

We generate a data set of failures and maintenance records for a heterogeneous machine portfolio, following

the reliability model introduced in Section 2.1. We consider a portfolio with 240 machines, of which 90%

is observed for $\Delta t_i = 5$ years and 10% has a shorter history of $\Delta t_i \overset{d}{\sim} U[1, 5]$ years. We characterize each

machine by 4 features, $x = (x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4})^T \in \{0, 1\}^4$, which are randomly assigned following a uniform

distribution. This leads to 16 diﬀerent machine proﬁles, each occurring in the portfolio with equal probability.

The preventive maintenance interventions in the data set are executed periodically with an interval of

∆tPM = 1 year. Recurrent failure times in a PM interval are generated by inverse transform sampling from

the failure distribution determined by the failure intensity function in Eq. (1) [Metcalfe and Thompson,

2006; Cook and Lawless, 2007; Jahn-Eimermacher et al., 2015; Pénichoux et al., 2015]. The costs of failures

and preventive maintenance interventions are positive, follow a right-skewed distribution and depend

on the machine characteristics $x$. To accommodate these properties, they are sampled from a gamma GLM

[Denuit et al., 2007; De Jong et al., 2008].
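The inverse transform step for recurrent failures can be sketched as follows. Given the machine age $s$ at the previous failure within a PM interval, the next failure age $t$ solves $\exp(\beta' \cdot x)\,[(\alpha t)^{\gamma} - (\alpha s)^{\gamma}] = E$ with $E$ a unit-rate exponential draw, and the age resets to zero at every PM. This is a minimal Python sketch of that idea for a single machine with a Weibull baseline; the function name, the seed and the parameter values are assumptions for illustration and do not reproduce the authors' simulation engine.

```python
import math
import random


def simulate_failures(horizon, pm_interval, hazard_factor, alpha, gamma, rng):
    """Absolute failure times of one machine under periodic, perfect PM."""
    failures = []
    start = 0.0
    while start < horizon:
        length = min(pm_interval, horizon - start)
        age = 0.0  # perfect PM: the machine is as good as new at the interval start
        while True:
            unit_exponential = -math.log(1.0 - rng.random())
            # Invert the cumulative intensity conditional on the current age.
            age = ((alpha * age) ** gamma
                   + unit_exponential / hazard_factor) ** (1.0 / gamma) / alpha
            if age >= length:
                break  # next failure would fall after the upcoming PM
            failures.append(start + age)
        start += pm_interval
    return failures


rng = random.Random(2023)  # arbitrary seed for reproducibility
example = simulate_failures(5.0, 1.0, 1.0, 0.7, 2.0, rng)
```

A quick sanity check on such a generator is that the average number of simulated failures per machine approaches the expected count, here 5 intervals times $(0.7 \cdot 1)^2 = 0.49$, i.e. 2.45, under the assumed values.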

The simulation parameters that we have used to generate failure times and costs are summarized in

Table 3 (Appendix A). Our parameters set the mean time-to-failure for the machine profile $x = (0000)$ equal

to 1.27 years, if the machine would not be maintained preventively. The expected costs cp(x) and cf(x) for

machine proﬁle x= (0000) are respectively 30 and 300. Without loss of generality, we let the costs of a PM


machine profile   oracle   pooling   stratified   uniform
0 0 0 0           10       9         11           10
0 0 0 1           6        6         5            10
0 0 1 0           8        8         7            10
0 0 1 1           5        5         4            10
0 1 0 0           13       11        6            10
0 1 0 1           9        8         6            10
0 1 1 0           11       9         9            10
0 1 1 1           7        6         5            10
1 0 0 0           14       13        13           10
1 0 0 1           9        9         10           10
1 0 1 0           11       11        13           10
1 0 1 1           7        7         5            10
1 1 0 0           18       15        21           10
1 1 0 1           12       11        10           10
1 1 1 0           15       13        15           10
1 1 1 1           10       9         5            10

Table 1: Prescribed number of PMs during a contract horizon of 5 years for a single data set.

be independent of the machine characteristics, as the optimality condition (Eq. (5)) only depends on the

ratio of $c_p(x)$ and $c_f(x)$. In Table 4 (Appendix B), we display an extract of the failure and maintenance

records in our simulated data set. In Sections 3.2-3.4, the data are generated from correctly speciﬁed models,

i.e., of the same parametric form. Section 3.5 studies model mis-speciﬁcation where the ﬁtted model has a

diﬀerent parametric form from the model from which the data were simulated.

3.2 What is the value of data pooling?

Table 1 reports, for a single simulated data set, the optimal number of preventive maintenance interventions,

$n^\star$, during a time horizon of $\Delta t = 5$ years for each of the 16 machine profiles, as determined by the

diﬀerent approaches. These diﬀerent approaches all rely on the optimality condition in Proposition 1, yet

with diﬀerent estimations of the failure behaviour and the costs. We refer to Section 2.3 for a discussion

of these approaches. Depending on the (simulated) data set, the estimations of the failure behaviour and

the costs, and therefore also the prescribed number of PM interventions, may slightly diﬀer. To report the

cost performance of each of these approaches, we therefore generated 100 data sets with identical simulation

parameters. Table 2 reports the average total maintenance costs for each approach (making use of Eq. (4)),

as well as the empirical 95%-conﬁdence interval over the 100 simulation runs. The costs are given with

respect to the costs of the oracle approach by means of a multiplicative ratio. Furthermore, we include the

cost performance for an average machine proﬁle. The maintenance cost for the average machine proﬁle is

determined by taking the average of the costs of all machine proﬁles with equal weights. Since the oracle

serves as a lower bound, all other approaches will have higher or at best equal costs. The oracle provides

the optimal diﬀerentiated maintenance strategy over the diﬀerent machine proﬁles. To obtain these policies,

however, we would need an infinite amount of data to exactly know the underlying failure behaviour and costs.

The maintenance costs obtained under the stratiﬁed approach are on average only 5% higher compared

to the oracle for the average machine profile (see Table 2). However, for machine profile $x = (1100)$ the

cost performance resulting from a stratified approach is worse, i.e. 8.3% higher on average than the oracle.

For a speciﬁc data set, we also observe that there can be a large discrepancy between the prescribed number

of PMs by the oracle and the stratiﬁed approach, e.g. the oracle and the stratiﬁed approach respectively

prescribe 13 and 6 PMs for machine proﬁle x= (0100) (see Table 1). In general, there is also a lot of spread

in the cost performance under the stratiﬁed approach. The 97.5% quantiles for the maintenance costs are

very high, up to 74.4% higher than the costs obtained by the oracle (see proﬁle x= (0011) in Table 2).


machine profile   oracle     pooling (%)            stratified (%)         uniform (%)
0 0 0 0           634.09     100.5 (100,102.2)      104.2 (100,120.3)      100.3 (100,102.2)
0 0 0 1           415.9      100.2 (100,102.2)      105 (100,134.5)        109.5 (101.8,112.4)
0 0 1 0           513.71     100.3 (100,100.8)      104.6 (100,137.4)      102.8 (100,104.2)
0 0 1 1           334.48     100.3 (100,102.1)      107.5 (100,174.4)      121.6 (108.5,126.2)
0 1 0 0           822.79     100.9 (100,105)        105 (100,134.4)        103.2 (101.5,111)
0 1 0 1           542.25     100.3 (100,101.5)      105.6 (100,132.8)      101.6 (100,102.7)
0 1 1 0           668.46     100.6 (100,103.4)      103.7 (100,123.7)      100.3 (100,103.4)
0 1 1 1           438.12     100.3 (100,100.6)      104.9 (100,139.3)      107.5 (101.1,110)
1 0 0 0           866.42     100.9 (100,104.9)      103.7 (100,126.3)      104.6 (102.5,113.6)
1 0 0 1           570.88     100.3 (100,100.6)      103.8 (100,114.1)      101 (100,101.7)
1 0 1 0           704.05     100.6 (100,102.1)      103.9 (100,127.6)      100.7 (100,104.9)
1 0 1 1           462.11     100.3 (100,101.3)      104.2 (100,128.6)      105.6 (100.4,107.8)
1 1 0 0           1121.07    101.8 (100,111.5)      108.3 (100,136.6)      115.4 (111.5,130.8)
1 1 0 1           741.59     100.6 (100,103.3)      103.5 (100,119.8)      101.3 (100.2,106.7)
1 1 1 0           912.53     101.2 (100,106.6)      106.5 (100,135.6)      106.2 (103.7,116.4)
1 1 1 1           602.3      100.3 (100,101.2)      103.3 (100,116.9)      100.4 (100,101.2)
average           646.92     100.7 (100,103.5)      105 (101.1,111.7)      105 (104.6,108.5)

Table 2: Average costs (and empirical 95%-confidence interval) over 100 simulated data sets with identical parameters. Costs are determined exactly using Eq. (4). We display the absolute costs for the oracle and the relative costs w.r.t. the oracle for the other approaches. The relative costs are determined by dividing by the costs realized under the oracle. The average machine profile's costs are determined by taking the average over all machine profiles with equal weights.

The lack of sufficient data per profile, due to the stratification of the data set, may lead to poor and volatile

performance.

The uniform approach pools the data across machine proﬁles, yet without tailoring the maintenance

policies to each machine proﬁle. This approach alleviates the lack of data, but also leads to an average loss

of 5% compared to the oracle, in this case due to lack of diﬀerentiation in the PM policies. Although the

average performance of the uniform approach and the stratiﬁed approach is (almost) the same, the spread

on the costs under the uniform approach is much smaller. This is also observed from the smaller 97.5%

quantiles for speciﬁc machine proﬁles compared to the stratiﬁed approach.

The pooling approach makes use of all the data, and also diﬀerentiates the maintenance policies over

the diﬀerent machine proﬁles. This leads to considerably better performance compared to the stratiﬁed and

uniform approaches, with a loss of only 0.7% on average with respect to the oracle. The improvement holds

not only for the average machine but also for the individual machine profiles. Furthermore, the

spread in performance, as quantiﬁed by the 95%-conﬁdence intervals, is considerably smaller compared to

the stratiﬁed and uniform approach, both for an average machine and for each individual machine proﬁle.

3.3 How does data pooling overcome the small data problem?

Although the pooling and stratiﬁed approaches have a similar goal, i.e. tailoring the maintenance policy

to its machine proﬁle, they use the data set in a diﬀerent way. While the pooling approach uses all the

data over the diﬀerent proﬁles, relying on the assumption of proportional hazards for the failure behaviour

and the assumption of a GLM for the costs, the stratiﬁed approach only considers the data per proﬁle

completely disjoint from the others. Splitting the data set per machine proﬁle produces smaller subsets of

data, potentially inducing a small data problem. The consequent underperformance of the stratiﬁed approach

is due to the fact that insuﬃcient data may be available per machine proﬁle to estimate the failure behaviour

and costs accurately from the data. Clearly, if suﬃcient data would be available for each machine proﬁle,

both the stratiﬁed and pooling approach converge to the oracle. Yet, the rate at which both approaches

converge is diﬀerent.


[Figure 1: two panels comparing the pooling and stratified approaches; x-axis: number of machines (0 to 600), y-axis: relative costs. Panel (a): average machine profile. Panel (b): worst machine profile.]

Figure 1: Relative costs for the pooling approach and the stratified approach with respect to the oracle as a function of the number of machines that generate data, averaged over 40 simulated data sets with identical parameters and with the empirical 90%-confidence interval shaded. Panel (a): relative costs of the average machine profile; panel (b): relative costs of the worst performing machine profile.

Figure 1 demonstrates this convergence by displaying the relative costs of the stratified and the pooling

approach with respect to the oracle when we gradually increase the number of machines in the data set,

and thus the number of failure and maintenance records that are used to estimate the failure behaviour and

costs. We start from a data set generated by 10 machines observed during 5 years and consider increases

of 10 machines at a time. These increases correspond to an additional 50 machine years of historical data

(recall that the mean time-to-failure for machine proﬁle x= (0000) is equal to 1.27 years). We report the

average cost performance over 40 simulated data sets with identical parameters and focus on the average

machine and worst performing machine proﬁle, i.e. the machine proﬁle that has the highest relative costs

with respect to the oracle at any given size of the data set. Figure 1 shows how both the stratified and

the pooling approach converge to the oracle costs. Yet, the pooling approach converges much faster

than the stratified approach. The 90% confidence intervals of the pooling approach are also narrower

and shrink faster than those of the stratified approach. To gain insight into the rate of convergence, we consider the

relative costs, averaged over the 40 simulated data sets, of the average profile as a function of the number

of machines, and we look for $a$ such that

\[
\text{average relative costs} = \frac{a}{\text{number of machines}} + 1.
\]

We find fitted values for $a$ equal to 1.715 and 13.341 for the pooling and stratified approach respectively.

The ratio of these values (13.341/1.715 ≈ 7.8) indicates that, on average, the pooling approach converges to the oracle

more than seven and a half times faster than the stratified approach in the number of machines. It shows that the pooling

approach requires much less data to obtain adequate performance.
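The fit above can be illustrated with a short sketch. The text does not state how $a$ was estimated, so ordinary least squares is assumed here; the function names `fit_a` and `machines_needed` are hypothetical, and the small data set in the demo is synthetic, not taken from the experiments. Only the two fitted values 1.715 and 13.341 come from the text.

```python
import math

# Fitted values reported in the text.
A_POOLING = 1.715
A_STRATIFIED = 13.341


def fit_a(n_machines, rel_costs):
    """Least-squares estimate of a in: rel_cost = a / n + 1 (assumed fitting method)."""
    num = sum((y - 1.0) / n for n, y in zip(n_machines, rel_costs))
    den = sum(1.0 / n ** 2 for n in n_machines)
    return num / den


def machines_needed(a, target_rel_cost):
    """Smallest number of machines N with a / N + 1 <= target_rel_cost."""
    return math.ceil(a / (target_rel_cost - 1.0))


# Synthetic check: points generated exactly from a = 2 are recovered.
demo_a = fit_a([10, 20, 40], [1.2, 1.1, 1.05])

# Ratio of the reported fitted values and the implied data requirement
# to get within 5% of the oracle costs under the fitted curve.
ratio = A_STRATIFIED / A_POOLING
need_pooling = machines_needed(A_POOLING, 1.05)
need_stratified = machines_needed(A_STRATIFIED, 1.05)
print(demo_a, ratio, need_pooling, need_stratified)
```

Under the fitted curves, the number of machines required to reach a given relative cost scales linearly with $a$, which is why the ratio of the two fitted values can be read as a ratio of data requirements.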

Another downside of the stratiﬁed approach is that it cannot prescribe the number of PMs for a machine

profile that is not available in the data set. For instance, in a data set generated by only 10

machines, at most 10 of the 16 profiles can be represented. This also explains the sharp cost increase of

the worst performing machine proﬁle under the stratiﬁed approach when the machine portfolio is small (see

Figure 1, panel (b)). In that case, increasing the number of machines also increases the number of machine

proﬁles in the data set, and with that also the likelihood of a worse performing proﬁle. This is not the

case under a pooling approach. Even when certain combinations of machine characteristics, i.e. machine

proﬁles, may not be observed in the data, the pooling approach can still prescribe the preferred number

of PM interventions. This property is essential for an OEM or a service provider that expands its

maintenance portfolio with machine profiles for which no data are yet available.
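To give a feeling for how many profiles a small data set covers, the sketch below computes the expected number of distinct profiles among N machines. It assumes, purely for illustration, that each machine is assigned one of the 16 profiles uniformly and independently; the section does not state how profiles are assigned, and the function name is hypothetical.

```python
def expected_distinct_profiles(n_machines, n_profiles=16):
    """Expected number of distinct profiles observed among n_machines,
    ASSUMING each machine draws its profile uniformly and independently."""
    p_unseen = (1.0 - 1.0 / n_profiles) ** n_machines  # a given profile is never drawn
    return n_profiles * (1.0 - p_unseen)


for n in (10, 20, 50, 100):
    print(n, round(expected_distinct_profiles(n), 2))
```

Under this assumption, a data set of 10 machines covers fewer than half of the 16 profiles in expectation, so the stratified approach has nothing to estimate from for the remaining profiles.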

[Figure 2. Panels: (a) Simulation parameters; (b) Less heterogeneity; (c) Larger Weibull shape γ. Each panel plots the left-hand side of the optimality condition ("condition LHS", y-axis) against the number of PMs (x-axis).]

Figure 2: Left-hand side of the optimality condition (dots) and $\min_x c_{\mathrm{ratio}}(x)$ and $\max_x c_{\mathrm{ratio}}(x)$ (dashed

horizontal lines). The numbers of PMs that lie between $\min_x c_{\mathrm{ratio}}(x)$ and $\max_x c_{\mathrm{ratio}}(x)$ are

highlighted in light blue.

3.4 When is it valuable to diﬀerentiate the PM policy?

Whereas our numerical experiment has shown that our pooling approach is capable of differentiating the

PM policy per machine profile, it is worthwhile to check whether differentiating the maintenance policy is

actually worth the effort compared to adopting a uniform PM policy that is the same for all machines. We

do so by analysing the optimality condition in Proposition 1. Specifically, we consider the equivalent

condition under the assumption of a Weibull failure intensity function,

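\[
(\alpha \Delta t)^{\gamma} \, \frac{(n+2)^{\gamma-1} - (n+1)^{\gamma-1}}{\big((n+1)(n+2)\big)^{\gamma-1}} \le c_{\mathrm{ratio}}(x), \tag{7}
\]

where

\[
c_{\mathrm{ratio}}(x) = \frac{c_p(x)}{c_f(x) \exp(\beta' \cdot x)}
\]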

denotes the machine-profile-specific cost ratio. This ratio indicates how many PM interventions

are economical for a machine profile $x$: a low value of $c_{\mathrm{ratio}}(x)$ indicates that it is preferred to perform

many PM interventions on this machine profile, as the cost of preventive maintenance is low compared to

the cost of failure, and vice versa for a high value of $c_{\mathrm{ratio}}(x)$. The left-hand side of (7) depends on the

baseline failure intensity, characterized by the Weibull scale $\alpha$ and shape $\gamma$, and on the contract length $\Delta t$.

Both are independent of the machine profile $x$. The right-hand side $c_{\mathrm{ratio}}(x)$ captures all the effects of the

heterogeneity in the machine portfolio. It depends on the machine profile's costs of failure via $c_f(x)$ and

preventive maintenance via $c_p(x)$, and on the impact of the profile on the failure behaviour via $\exp(\beta' \cdot x)$.
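As a worked instance, offered as a sketch rather than as part of the original derivation, consider the Weibull shape $\gamma = 2$ listed in Appendix A, for which condition (7) simplifies considerably. The values $\alpha = 0.7$ and $\Delta t = 5$ years are read from Appendices A and B. Reading the mean costs of profile $x = (0, 0, 0, 0)$ as 30 (PM) and 300 (failure) assumes a log link for the gamma GLM, and treating the smallest $n$ that satisfies (7) as the optimum is an interpretation of Proposition 1, which is not reproduced in this section.

```latex
% gamma = 2: the numerator of (7) becomes (n+2) - (n+1) = 1
\[
(\alpha \Delta t)^{2} \, \frac{(n+2) - (n+1)}{(n+1)(n+2)}
  = \frac{(0.7 \cdot 5)^{2}}{(n+1)(n+2)}
  = \frac{12.25}{(n+1)(n+2)} \le c_{\mathrm{ratio}}(x).
\]
% Profile x = (0,0,0,0): c_ratio = 30 / (300 * exp(0)) = 0.1
\[
n = 9:\ \frac{12.25}{10 \cdot 11} \approx 0.111 > 0.1,
\qquad
n = 10:\ \frac{12.25}{11 \cdot 12} \approx 0.093 \le 0.1 .
\]
```

Under these assumptions, 10 PM interventions over the five-year horizon is the smallest number that satisfies the condition for this profile.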

It is insightful to visualize this optimality condition. The dots in Figure 2 show the left-hand side of (7)

for each value of $n$, the number of PM interventions performed during the contract horizon. The minimum

and maximum value of $c_{\mathrm{ratio}}(x)$ over the machine profiles, $\min_x c_{\mathrm{ratio}}(x)$ and $\max_x c_{\mathrm{ratio}}(x)$ respectively, are

indicated by dashed horizontal lines. These two horizontal lines delimit the range of optimal numbers

of PMs across the machine profiles $x$. We highlight these PM policies in light blue. When there

are many light blue dots, the optimal PM policies differ considerably across the machine profiles. In the

extreme case where there is only a single point in this region, the same number of PMs is optimal for

all machine profiles and differentiation of the PM policy is not required.
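The construction behind Figure 2 can be sketched numerically. The snippet below evaluates the left-hand side of (7) and the cost ratio for all 16 profiles, using the parameter values of Appendix A and $\Delta t = 5$ from Appendix B. Two assumptions are made that the section does not confirm: the gamma GLMs use a log link (so that the mean cost is the exponential of the linear predictor), and the optimal number of PMs is the smallest $n$ satisfying (7). All function names are hypothetical.

```python
import math
from itertools import product

ALPHA, GAMMA, DT = 0.7, 2.0, 5.0      # Weibull scale, shape (Table 3); horizon (Table 4)
BETA = (0.4, 0.3, -0.3, -0.5)         # covariate effects on the failure intensity
BETA_P = (0.0, 0.0, 0.0, 0.0)         # covariate effects on PM costs
BETA_F = (0.2, 0.2, -0.1, -0.3)       # covariate effects on failure costs
CP0, CF0 = 30.0, 300.0                # exp of the cost intercepts (log link ASSUMED)


def dot(coef, x):
    return sum(c * xi for c, xi in zip(coef, x))


def lhs(n):
    """Left-hand side of condition (7) for n PM interventions."""
    g = GAMMA - 1.0
    return (ALPHA * DT) ** GAMMA * ((n + 2) ** g - (n + 1) ** g) / ((n + 1) * (n + 2)) ** g


def c_ratio(x):
    """Profile-specific cost ratio c_p(x) / (c_f(x) * exp(beta' x))."""
    cp = CP0 * math.exp(dot(BETA_P, x))
    cf = CF0 * math.exp(dot(BETA_F, x))
    return cp / (cf * math.exp(dot(BETA, x)))


def optimal_n(x, n_max=10_000):
    """Smallest n satisfying (7); ASSUMED to be the optimal number of PMs."""
    ratio = c_ratio(x)
    for n in range(n_max + 1):
        if lhs(n) <= ratio:
            return n
    raise ValueError("no n <= n_max satisfies the condition")


profiles = list(product((0, 1), repeat=4))
policy = {x: optimal_n(x) for x in profiles}
n_low, n_high = min(policy.values()), max(policy.values())
print("range of optimal numbers of PMs:", n_low, "to", n_high)
print("distinct policies:", len(set(policy.values())))
```

The interval from `n_low` to `n_high` corresponds to the highlighted region in the figure: the wider it is, the more the optimal policy differs between profiles.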

An extensive analysis of the optimality condition and the resulting number of differentiated PM policies

reveals that less heterogeneity in the machine portfolio, which narrows the gap between $\min_x c_{\mathrm{ratio}}(x)$

and $\max_x c_{\mathrm{ratio}}(x)$ and moves the dashed horizontal lines closer together, leads to less differentiation in the PM

policies over the different machine profiles (see Figure 2b). Also, a larger Weibull shape $\gamma$ of the failure

behaviour (while adapting the scale $\alpha$ accordingly to maintain a constant mean time-to-failure) makes the

left-hand side of the optimality condition steeper, i.e. the left-hand side decreases faster (see Figure 2c). This

reduces the number of different optimal PM policies across the profiles and diminishes the effect of the

heterogeneity of the machine population.
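The effect of the Weibull shape can be checked with a small sketch that recalibrates the scale so that the baseline mean time-to-failure stays constant. It assumes a baseline cumulative intensity $(\alpha t)^{\gamma}$, so that the mean time-to-failure equals $\Gamma(1 + 1/\gamma)/\alpha$ (this reproduces the 1.27 years quoted earlier for $\alpha = 0.7$, $\gamma = 2$), a log link for the cost models, and the smallest $n$ satisfying (7) as the optimum. The shape value 3 is a hypothetical choice for illustration, not necessarily the one used in Figure 2c.

```python
import math
from itertools import product

DT = 5.0                               # contract length (Appendix B)
MTTF = math.gamma(1.5) / 0.7           # baseline MTTF implied by alpha = 0.7, gamma = 2
BASE_RATIO = 30.0 / 300.0              # c_ratio of profile (0,0,0,0), log link ASSUMED
COEF = (0.6, 0.5, -0.4, -0.8)          # beta_i + beta_f,i - beta_p,i from Table 3


def alpha_for(shape):
    """Weibull scale that keeps the baseline MTTF constant for a given shape."""
    return math.gamma(1.0 + 1.0 / shape) / MTTF


def lhs(n, shape):
    """Left-hand side of condition (7) with the recalibrated scale."""
    g = shape - 1.0
    scale = alpha_for(shape)
    return (scale * DT) ** shape * ((n + 2) ** g - (n + 1) ** g) / ((n + 1) * (n + 2)) ** g


def optimal_n(ratio, shape, n_max=10_000):
    """Smallest n satisfying (7); ASSUMED to be the optimal number of PMs."""
    for n in range(n_max + 1):
        if lhs(n, shape) <= ratio:
            return n
    raise ValueError("no n <= n_max satisfies the condition")


def policy_range(shape):
    """(smallest, largest) optimal number of PMs over the 16 profiles."""
    ns = []
    for x in product((0, 1), repeat=4):
        ratio = BASE_RATIO * math.exp(-sum(c * xi for c, xi in zip(COEF, x)))
        ns.append(optimal_n(ratio, shape))
    return min(ns), max(ns)


for shape in (2.0, 3.0):
    print("shape", shape, "-> optimal n ranges over", policy_range(shape))
```

With these inputs the range of optimal policies is narrower for the larger shape, in line with the observation above.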

3.5 Model mis-speciﬁcation vs data-pooling

The data pooling approach requires the speciﬁcation of relevant terms in (1). In certain contexts, a modeler

may not correctly identify all the terms that should be present in a proportional hazards model. The stratified

approach is immune to such mis-speciﬁcations, but requires large amounts of data to obtain accurate esti-

mates. The data-pooling approach is susceptible to such mis-speciﬁcations but has the advantage of being

able to pool the data. Below we consider a set-up to study which eﬀect is dominant.

We consider a set-up with only two binary covariates, i.e. $x_1, x_2 \in \{0, 1\}$, impacting the time-to-failure.

We simulate the failure times for each machine from the following machine-specific failure intensity function

$\tilde{\lambda}(t)$,

\[
\tilde{\lambda}(t) = \lambda_0(t) \exp\left(\beta x_1 + \beta x_2 + \frac{\beta}{\rho} x_1 x_2\right) \quad \text{for } t \in [0, t_k - t_{k-1}) \ (\forall k), \tag{8}
\]

where $t_0 = 0$ and $t_{n+1} = \Delta t$ refer to the start and end of the planning horizon, respectively. We choose

the same $\beta$ for the covariates $x_1$ and $x_2$ to ensure that the impact of the cross-term ($x_1 x_2$) is of the same

magnitude as the main effects. The parameter $\rho$ controls the relative impact of the cross-term. The simulation

of failure and maintenance costs is not changed from the original set-up. We compare the stratified approach

with the data pooling approach but let the data pooling approach fit the following mis-specified model:

\[
\hat{\lambda}(t) = \lambda_0(t) \exp\left(\beta x_1 + \beta x_2\right) \quad \text{for } t \in [0, t_k - t_{k-1}) \ (\forall k). \tag{9}
\]

Thus we can interpret $\rho^{-1}$ as measuring the amount of model mis-specification of the data pooling approach.

Note that the stratified approach does not suffer from mis-specification since it makes no assumption about the

impact of the covariates on the time-to-failure (nor on the failure or maintenance costs). The

maintenance policies of the pooling approach, in contrast, follow from the mis-specified model, which allows us

to test the impact of this mis-specification.
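A minimal sketch of the set-up may help to see where the mis-specification bites. It compares the intensity multiplier of the true model (8) with that of the fitted model (9) and draws failure times within one PM interval by inverting the cumulative intensity. Several choices are assumptions made for illustration only: the value $\beta = 0.4$, a Weibull baseline with cumulative intensity $(\alpha t)^{\gamma}$ and the Appendix A values, and the inversion method itself, which is not necessarily how the data in the paper were generated. All function names are hypothetical.

```python
import math
import random

ALPHA, GAMMA = 0.7, 2.0   # Weibull baseline, values from Appendix A (ASSUMED to apply here)
BETA = 0.4                # HYPOTHETICAL covariate effect; not given in this section


def multiplier(x1, x2, rho=None):
    """Intensity multiplier exp(.) of model (8); rho=None drops the
    cross-term and gives the mis-specified model (9)."""
    cross = 0.0 if rho is None else (BETA / rho) * x1 * x2
    return math.exp(BETA * x1 + BETA * x2 + cross)


def simulate_failures(mult, tau, rng):
    """Failure times in [0, tau) of a non-homogeneous Poisson process with
    cumulative intensity mult * (ALPHA * t) ** GAMMA, obtained by inversion."""
    times, cum = [], 0.0
    while True:
        cum += rng.expovariate(1.0)                      # unit-rate gap on the transformed scale
        t = (cum / mult) ** (1.0 / GAMMA) / ALPHA        # map back to the time scale
        if t >= tau:
            return times
        times.append(t)


rng = random.Random(2023)
for x in ((0, 0), (1, 0), (0, 1), (1, 1)):
    true_m, fitted_m = multiplier(*x, rho=2.0), multiplier(*x)
    print(x, round(true_m, 3), round(fitted_m, 3), simulate_failures(true_m, 1.0, rng))
```

Only the profile $x = (1, 1)$ is affected by the omitted cross-term: its true multiplier exceeds the mis-specified one by a factor $\exp(\beta/\rho)$, which tends to 1 as $\rho$ grows.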

In order to assess the impact of the mis-specification, we compare the costs relative to the oracle of both

the pooling approach and the stratified approach for decreasing $\rho$ (increased model mis-specification). The

oracle is of course adapted to consider the influence of the cross-term. The results are displayed in Figure 3.

They show that the effect of data pooling outweighs the effect of model mis-specification in all considered

settings. This is a strong indication that, while mis-specification is likely to happen, its negative impact is

easily offset by the benefits of data pooling.

4 Conclusion

This paper describes a data-driven approach to optimize the periodic, preventive maintenance policies for a

heterogeneous machine population over a ﬁnite time horizon. The heterogeneity of the machine population

is characterized by observable machine characteristics that induce diﬀerent machine proﬁles. Our approach

pools the available data of failure and maintenance records over the different machine profiles in order to

learn, as accurately as possible, the failure behaviour and the costs of failure and maintenance for each machine

proﬁle. We rely on the assumption of proportional hazards for the failure behaviour and on the assumption of

a gamma GLM for the costs to accomplish this data pooling. In conjunction with the estimates for the failure

behaviour and the costs, our optimality condition for the number of preventive maintenance interventions

delivers tailored maintenance policies for each machine proﬁle. By means of numerical experiments, we

compare our pooling approach with both a stratiﬁed approach that splits the data per machine proﬁle and

a uniform approach that disregards the machine profiles and prescribes the same uniform PM policy for

all machine proﬁles. The pooling approach outperforms these benchmarks and additionally has a smaller

spread on its performance. We also show how the pooling approach is more data-eﬃcient than the stratiﬁed

approach, even under mild model mis-speciﬁcation. This means that the pooling approach obtains better

performing maintenance policies for the same amount of data. Finally, we investigate when it is worth

diﬀerentiating the PM policy given the heterogeneity in the machine population and the failure intensity,

motivating the approach introduced in this paper.


[Figure 3. Panels: (a) ρ = 100; (b) ρ = 50; (c) ρ = 20; (d) ρ = 10; (e) ρ = 5; (f) ρ = 2. Each panel plots the relative costs (y-axis) against the number of machines (x-axis, 0 to 250) for the pooling and the stratified approach.]

Figure 3: Relative costs of the pooling approach and the stratified approach with respect to the oracle as a

function of the number of machines that generate data according to the set-up in Section 3.5, averaged over

20 simulated data sets with identical parameters, with the empirical 90% confidence interval shaded.


A Parameters used to generate data

Time-to-failure   α      γ          β1      β2      β3      β4
                  0.7    2          0.4     0.3     −0.3    −0.5

PM costs          θp     βp,0       βp,1    βp,2    βp,3    βp,4
                  15     log(30)    0       0       0       0

Failure costs     θf     βf,0       βf,1    βf,2    βf,3    βf,4
                  15     log(300)   0.2     0.2     −0.1    −0.3

Table 3: Parameter values used to generate a data set of failure and maintenance records, with α and γ

respectively the scale and shape of the Weibull baseline failure intensity function, βi the impact of covariate xi,

θp and θf respectively the shape of the gamma-distributed preventive maintenance and failure costs, and βp,i, βf,i

their respective machine-profile-dependent impact.

B Extract of the generated data set

i   machine profile   time   type   costs    ∆ti   δ
1   1 1 1 0           1      PM     28.26    5     1
1   1 1 1 0           1.91   FAIL   400.33   5     1
1   1 1 1 0           2      PM     29.4     5     1
1   1 1 1 0           3      PM     23.82    5     1
1   1 1 1 0           3.86   FAIL   333.31   5     1
1   1 1 1 0           4      PM     37.74    5     1
1   1 1 1 0           4.93   FAIL   616.39   5     1
1   1 1 1 0           5      END    0        5     0
2   0 1 1 0           1      PM     13.48    5     1
2   0 1 1 0           1.59   FAIL   274.38   5     1
2   0 1 1 0           2      PM     47.39    5     1
2   0 1 1 0           3      PM     25.78    5     1
2   0 1 1 0           3.51   FAIL   254.78   5     1
2   0 1 1 0           4      PM     37.03    5     1
2   0 1 1 0           5      END    0        5     0
3   1 1 0 1           0.98   FAIL   375.79   5     1
3   1 1 0 1           1      PM     29.93    5     1
3   1 1 0 1           2      PM     32.52    5     1
3   1 1 0 1           2.52   FAIL   215.42   5     1
3   1 1 0 1           3      PM     34.29    5     1
3   1 1 0 1           3.71   FAIL   334.9    5     1
3   1 1 0 1           4      PM     24.39    5     1
3   1 1 0 1           4.59   FAIL   212.41   5     1
3   1 1 0 1           5      END    0        5     0

Table 4: Extract of simulated data set with records of three machines.
