
Selecting number of breakpoints in segmented regression: implementation in the R package segmented

Vito M.R. Muggeo∗
Università di Palermo, Italy

The R package segmented performs estimation of segmented regression models with a fixed number of breakpoints. In this brief note, we discuss the implementation of two procedures to select the number of breakpoints: 0, 1, ..., up to a specified maximum number Kmax.

The former approach is based on the Bayesian information criterion, BIC = −2ℓ + log(n) edf, where ℓ is the maximized model log likelihood, n the sample size, and edf the degrees of freedom. For Gaussian errors, a 'generalized' BIC can be written as

\mathrm{BIC}_{C_n} = \log(\hat{\sigma}^2) + \mathrm{edf}\,\frac{\log(n)}{n}\,C_n \qquad (1)

where σ̂² is the estimated residual variance and C_n is a constant regulating the penalty term, for instance C_n = log(n). The 'right' number of breakpoints corresponds to the model with the lowest BIC: the larger C_n, the more parsimonious the selected model. C_n = 1 leads to the usual BIC, which appears to perform well in segmented regression models.
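As a minimal base-R sketch of Eq. (1), the criterion can be computed directly from a fit's residual variance and degrees of freedom. The function name gen_bic and the numeric inputs below are illustrative only; in practice σ̂² and edf would come from a fitted segmented model.

```r
# Sketch of the generalized BIC in Eq. (1); all names are illustrative.
gen_bic <- function(sigma2_hat, edf, n, Cn = 1) {
  log(sigma2_hat) + edf * (log(n) / n) * Cn
}

# Two hypothetical fits on n = 100 observations: the second spends more
# edf (extra breakpoints) but achieves a smaller residual variance.
b0 <- gen_bic(sigma2_hat = 4.1, edf = 3, n = 100)
b2 <- gen_bic(sigma2_hat = 2.0, edf = 9, n = 100)
c(b0 = b0, b2 = b2)  # the fit with the lowest value is preferred
```

Note that increasing Cn, e.g. Cn = log(n), inflates the penalty term and therefore favours models with fewer breakpoints.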

An alternative approach to selecting the number of breakpoints relies on a sequential hypothesis testing procedure: it consists in testing different hypothesis systems, starting from the first one, H0: 0 vs. H1: Kmax breakpoints. Depending on rejection or not of the null hypothesis, the procedure moves to the next hypothesis system by increasing the number of breakpoints specified in H0 or decreasing the one postulated under H1. Figure 1 illustrates the procedure when the maximum number of breakpoints is Kmax = 2.

Figure 1: Selecting the number of breakpoints via sequential hypothesis testing. Starting with H0: 0 breakpoints vs. H1: 2 breakpoints, the procedure ends up selecting 0, 1, or 2 breakpoints.
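The decision logic of Figure 1 for Kmax = 2 can be sketched in base R as follows. This is an illustration only, not code from the segmented package: select_K is a hypothetical name, the p-values are assumed to be pre-computed (e.g. by davies.test() or pscore.test()), and the significance levels follow the corrections discussed below.

```r
# Illustrative sketch (not part of segmented) of the Kmax = 2 sequential
# rule of Figure 1, given pre-computed p-values for the three systems.
select_K <- function(p_0vs2, p_0vs1, p_1vs2, alpha = 0.05,
                     bonferroni = TRUE) {
  a1 <- alpha / 2                              # first-level '0 vs 2'
  a2 <- if (bonferroni) alpha / 2 else alpha   # second-level systems
  if (p_0vs2 >= a1) return(0)  # '0 vs 2' not rejected: no breakpoint
  if (p_1vs2 <  a2) return(2)  # '1 vs 2' rejected: two breakpoints
  if (p_0vs1 <  a2) return(1)  # '0 vs 1' rejected: one breakpoint
  0
}

select_K(p_0vs2 = 1e-8, p_0vs1 = 0.001, p_1vs2 = 0.40)  # selects 1
```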

The p-value for each hypothesis in Figure 1 can be obtained via the Davies or the Score test (functions davies.test() or pscore.test()). To control for over-rejection of the null hypotheses at the overall level α, the Bonferroni correction can be employed, comparing each p-value with α/Kmax. Alternatively, to mitigate the conservativeness of the Bonferroni correction, we could consider different significance levels, namely the level α/2 for the first-level hypothesis ('0 vs 2') and α for the second-level hypotheses ('0 vs 1' or '1 vs 2'). Table 1 reports results from a simple simulation study: data were generated from regression models with K0 = 0, 1, or 2 breakpoints and two sample sizes, and at each run the procedure displayed in Figure 1 was applied using the nominal level α = 0.05 and the simple or the adjusted Bonferroni correction. Both procedures guarantee acceptable performance, with empirical rejection rates lower than the nominal 0.05; further discussion of the results is beyond the goal of this note.

∗Email: vito.muggeo@unipa.it

Table 1: Number of times (out of 1,000 runs) that 0, 1, or 2 breakpoints were selected when using the simple (B) or the adjusted (aB) Bonferroni correction. The Score test and α = 0.05 are used, and each pair of values refers to the two adjustments discussed in the text, 'simple' and 'adjusted' respectively.

              K̂ = 0       K̂ = 1       K̂ = 2
 K0    n      B    aB      B    aB      B    aB
  0    50   966   948     33    50      1     2
      100   975   947     25    53      0     0
  1    50     0     0    985   949     15    51
      100     0     0    975   950     25    50
  2    50   127    78    572   566    301   356
      100     0     0    404   291    596   709

The aforementioned procedures are implemented in the function selgmented(), which will be included in the next releases of segmented. Currently the function performs selection of the number of breakpoints via the BIC (any Kmax can be set) or via sequential hypothesis testing (only Kmax = 2 is allowed), and it is available at

https://www.researchgate.net/publication/343737731

The function reads as

> args(selgmented)

function(olm, seg.Z, alpha = 0.05, type = c("score","davies","bic"),

control = seg.control(), return.fit = FALSE, bonferroni=FALSE,

Kmax = 2)

olm is a simple lm object, and seg.Z is a one-sided formula reporting the segmented variable (it can be omitted if olm includes just one covariate). type specifies the test statistic to be used (BIC, or hypothesis testing via the Score or Davies test); if bonferroni=TRUE the significance level is always α/2, otherwise it is α for the second-level hypotheses, as discussed above. Finally, return.fit=FALSE just returns K̂ (and possible p-values), while return.fit=TRUE also returns the fitted segmented object. Kmax, the maximum number of breakpoints to be tested, is currently 2 and cannot be changed if type is 'score' or 'davies'.

Here is a simple example using simulated data, by means of the code from ?segmented:

> set.seed(12)


> xx<-1:100

> zz<-runif(100)

> yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)

> dati<-data.frame(x=xx,y=yy,z=zz)

> out.lm<-lm(y~x,data=dati)

To select the number of breakpoints, we just need to fit a simple linear model and use the selgmented() function.

> #--- using the Score test (default)

> os<-selgmented(out.lm)

Hypothesis testing to detect no. of breakpoints

statistic: score level: 0.05 Bonferroni correction: FALSE

p-value ’0 vs 2’ = < 2.2e-16 p-value ’1 vs 2’ = 2.453e-09

Overall p-value = < 2.2e-16

No. of selected breakpoints: 2

>

> #--- using the BIC (up to 3 breakpoints)

> os<-selgmented(out.lm, Kmax=3, type="bic")

BIC to detect no. of breakpoints

BIC values:

       0        1        2        3
716.3031 696.9431 545.1816 552.3765

No. of selected breakpoints: 2
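The selection step underlying the output above is simply the minimizer of the BIC values; this can be reproduced directly in base R from the printed values (the vector name bic below is illustrative).

```r
# The selected number of breakpoints is the name of the smallest BIC value
bic <- c("0" = 716.3031, "1" = 696.9431, "2" = 545.1816, "3" = 552.3765)
names(which.min(bic))  # "2", i.e. two breakpoints
```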

selgmented() is at a preliminary stage, but it appears to work reasonably well on basic examples, although some bugs could occur. Next releases of segmented will include updates and improvements of the new function selgmented().

Bugs and comments should be sent to vito.muggeo@unipa.it
