Technical Report

Selecting number of breakpoints in segmented regression: implementation in the R package segmented

Authors:
Vito M.R. Muggeo
Università di Palermo, Italy
The R package segmented performs estimation of segmented regression models
with a fixed number of breakpoints. In this brief note, we discuss the implementation
of two procedures for selecting the number of breakpoints: 0, 1, ..., up
to a specified maximum number Kmax.
The first approach is based on the Bayesian information criterion
$\mathrm{BIC} = -2\ell + \log(n)\,\mathrm{edf}$, where $\ell$ is the maximized log likelihood of the model, $n$ the sample size,
and edf the degrees of freedom. For Gaussian errors, a ‘generalized’ BIC can be
written as
$$\mathrm{BIC}_{C_n} = \log(\hat\sigma^2) + \mathrm{edf}\,\frac{\log(n)}{n}\,C_n \qquad (1)$$
where $\hat\sigma^2$ is the estimated residual variance and $C_n$ is a constant regulating the
penalty term, $C_n = \log(n)$ for instance. The ‘right’ number of breakpoints
corresponds to the model with the lowest BIC. The larger $C_n$, the more parsimonious
the selected model. $C_n = 1$ leads to the usual BIC, which appears to perform well
in segmented regression models.
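As a minimal sketch of equation (1), the generalized BIC can be computed by hand from a Gaussian linear model fit. The function name gen_bic is hypothetical and not part of segmented; we also assume the maximum likelihood estimate of the residual variance (residual sum of squares divided by n) and count edf as the number of regression coefficients plus one for the variance. For a segmented fit, the estimated breakpoints would have to be counted in edf as well.

```r
# Generalized BIC of equation (1) for a Gaussian linear model fit.
# 'gen_bic' is an illustrative name, not a function of the segmented package.
gen_bic <- function(fit, Cn = 1) {
  res <- residuals(fit)
  n <- length(res)
  sigma2 <- sum(res^2) / n          # assumption: ML estimate of the residual variance
  edf <- length(coef(fit)) + 1      # assumption: coefficients plus the variance
  log(sigma2) + edf * log(n) / n * Cn
}

# Simulated data, used only to illustrate the computation
set.seed(1)
d <- data.frame(x = 1:50)
d$y <- 1 + 0.5 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
n <- nrow(d)

# With Cn = 1, n * gen_bic(fit) differs from the usual BIC only by the
# additive constant n * (1 + log(2 * pi)), so both rank models identically.
all.equal(n * gen_bic(fit) + n * (1 + log(2 * pi)), BIC(fit))
```

Since the additive constant does not depend on the model, comparing models by equation (1) with $C_n = 1$ gives the same ranking as comparing them by BIC().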
An alternative approach to selecting the number of breakpoints relies on a sequential
hypothesis testing procedure: it consists of testing a sequence of hypothesis systems,
starting from H0: 0 breakpoints vs. H1: Kmax breakpoints. Depending on whether
the null hypothesis is rejected, the procedure moves to the next hypothesis system
either by increasing the number of breakpoints specified in H0 (rejection) or by
decreasing the number postulated under H1 (no rejection). Figure 1 illustrates the
procedure when the maximum number of breakpoints is Kmax = 2.
Figure 1: Selecting the number of breakpoints via sequential hypothesis testing.
Starting with H0: 0 breakpoints vs. H1: 2 breakpoints, the procedure ends up
selecting 0, 1, or 2 breakpoints.
The p-value for each hypothesis in Figure 1 can be obtained via the Davies or
the Score test (functions davies.test() or pscore.test()), while to control for
over-rejection of the null hypotheses at the overall level α, the Bonferroni correction
can be employed, comparing each p-value with α/Kmax. Alternatively, to mitigate the
conservativeness of the Bonferroni correction, we could consider different significance
levels, namely α/2 for the first-level hypothesis (‘0 vs 2’) and α for
the second-level hypotheses (‘0 vs 1’ or ‘1 vs 2’). Table 1 reports results from a simple
simulation study: data were generated from regression models with K0 = 0, 1,
or 2 breakpoints and two sample sizes (n = 50 and n = 100), and at each run the procedure
displayed in Figure 1 was applied using the nominal level α = 0.05 and either the simple
or the adjusted Bonferroni correction. Both procedures perform acceptably: when the
true number of breakpoints is 0 or 1, the proportion of runs selecting too many
breakpoints never exceeds 0.053, so it stays at or close to the nominal 0.05. Further
discussion of the results is beyond the scope of this note.
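The decision rule of Figure 1 for Kmax = 2 can be sketched as a small function of the three p-values. This is only an illustration of the logic described above, not the code of selgmented(): the function name select_breakpoints and its arguments p02, p01, p12 are hypothetical, and in practice the p-values would come from pscore.test() or davies.test().

```r
# Sequential selection of 0, 1, or 2 breakpoints from the p-values of the
# hypothesis systems '0 vs 2' (p02), '0 vs 1' (p01) and '1 vs 2' (p12).
# Illustrative sketch: names and structure are assumptions, not package code.
select_breakpoints <- function(p02, p01, p12, alpha = 0.05, bonferroni = FALSE) {
  level1 <- alpha / 2                               # first-level hypothesis
  level2 <- if (bonferroni) alpha / 2 else alpha    # second-level hypotheses
  if (p02 < level1) {
    # '0 vs 2' rejected: raise the null to 1 breakpoint and test '1 vs 2'
    if (p12 < level2) 2L else 1L
  } else {
    # '0 vs 2' not rejected: lower the alternative to 1 and test '0 vs 1'
    if (p01 < level2) 1L else 0L
  }
}

# Hypothetical p-values: strong evidence against both '0 vs 2' and '1 vs 2'
select_breakpoints(p02 = 1e-10, p01 = 0.5, p12 = 1e-6)   # returns 2
```

Only one of the two second-level p-values is used in any given run, so the adjusted correction spends α/2 on the first test and α on the single second-level test that follows it.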
Table 1: Number of times (out of 1,000 runs) that 0, 1, or 2 breakpoints were
selected when using the simple (B) or the adjusted (aB) Bonferroni correction.
The Score test and α = 0.05 are used. K0 is the true number of breakpoints,
n the sample size, and K̂ the selected number of breakpoints.
               K̂ = 0        K̂ = 1        K̂ = 2
K0    n       B    aB      B    aB      B    aB
0     50    966   948     33    50      1     2
      100   975   947     25    53      0     0
1     50      0     0    985   949     15    51
      100     0     0    975   950     25    50
2     50    127    78    572   566    301   356
      100     0     0    404   291    596   709
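To read Table 1 as rates rather than counts, the sketch below re-enters the table and computes, for each row, the proportion of runs selecting the true K0 and the proportion selecting too many breakpoints. The counts are copied from Table 1; the object and column names (tab, sel0, correct, over) are ours.

```r
# Counts from Table 1; each (K0, n) scenario has one row per correction (B, aB)
tab <- data.frame(
  K0     = rep(c(0, 0, 1, 1, 2, 2), each = 2),
  n      = rep(c(50, 100, 50, 100, 50, 100), each = 2),
  method = rep(c("B", "aB"), times = 6),
  sel0   = c(966, 948, 975, 947, 0, 0, 0, 0, 127, 78, 0, 0),
  sel1   = c(33, 50, 25, 53, 985, 949, 975, 950, 572, 566, 404, 291),
  sel2   = c(1, 2, 0, 0, 15, 51, 25, 50, 301, 356, 596, 709)
)

sel  <- as.matrix(tab[, c("sel0", "sel1", "sel2")])
runs <- rowSums(sel)                      # number of runs per row

# Proportion of runs in which the selected number equals the true K0
tab$correct <- sel[cbind(seq_len(nrow(tab)), tab$K0 + 1)] / runs

# Proportion of runs in which more than K0 breakpoints were selected
tab$over <- sapply(seq_len(nrow(tab)), function(i) {
  sum(sel[i, seq_len(3) > tab$K0[i] + 1])
}) / runs

tab[, c("K0", "n", "method", "correct", "over")]
```

With K0 = 2 no over-selection is possible because Kmax = 2, so the over column is zero for those rows and only the correct column is informative there.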
The aforementioned procedures are implemented by the function selgmented(),
which will be included in future releases of segmented. Currently the function
performs selection of the number of breakpoints via the BIC (any Kmax can be set) or via
sequential hypothesis testing (only Kmax = 2 is allowed), and it is available at
https://www.researchgate.net/publication/343737731
The function's arguments are:
> args(selgmented)
function(olm, seg.Z, alpha = 0.05, type = c("score","davies","bic"),
control = seg.control(), return.fit = FALSE, bonferroni=FALSE,
Kmax = 2)
olm is a simple lm object, and seg.Z is a one-sided formula specifying the segmented
variable (it can be omitted if olm includes just one covariate). type specifies the
criterion being used (the BIC, or hypothesis testing via the Score or Davies test).
If bonferroni=TRUE the significance level is always α/2; otherwise it is α/2 for the
first-level hypothesis and α for the second-level hypotheses, as discussed above.
Finally, return.fit=FALSE just returns K̂ (and possibly the p-values), while
return.fit=TRUE also returns the fitted segmented object. Kmax, the maximum number
of breakpoints to be tested, is currently 2 and cannot be changed if type is
‘score’ or ‘davies’.
Here is a simple example using simulated data, based on code from ?segmented:
> set.seed(12)
> xx<-1:100
> zz<-runif(100)
> yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
> dati<-data.frame(x=xx,y=yy,z=zz)
> out.lm<-lm(y~x,data=dati)
To select the number of breakpoints, we just need to fit a simple linear model,
and to use the selgmented() function.
> #--- using the Score test (default)
> os<-selgmented(out.lm)
Hypothesis testing to detect no. of breakpoints
statistic: score level: 0.05 Bonferroni correction: FALSE
p-value ’0 vs 2’ = < 2.2e-16 p-value ’1 vs 2’ = 2.453e-09
Overall p-value = < 2.2e-16
No. of selected breakpoints: 2
>
> #--- using the BIC (up to 3 breakpoints)
> os<-selgmented(out.lm, Kmax=3, type="bic")
BIC to detect no. of breakpoints
BIC values:
       0        1        2        3
716.3031 696.9431 545.1816 552.3765
No. of selected breakpoints: 2
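The BIC comparison can also be approximated by hand, fitting one model for each number of breakpoints and comparing the values returned by BIC(). The sketch below is not the code of selgmented(): it assumes that the installed version of segmented accepts the npsi argument, and how the breakpoints enter the degrees of freedom depends on the package's own methods, so the values need not match those printed above. Fits that fail are recorded as NA.

```r
# Same simulated data as in the example above
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5 * pmax(xx - 35, 0) - 1.5 * pmax(xx - 70, 0) +
  15 * pmax(zz - .5, 0) + rnorm(100, 0, 2)
dati <- data.frame(x = xx, y = yy, z = zz)
out.lm <- lm(y ~ x, data = dati)

# BIC of the model with no breakpoints
bic <- c("0" = BIC(out.lm))

# BIC of the models with 1, 2, 3 breakpoints (only if segmented is installed)
if (requireNamespace("segmented", quietly = TRUE)) {
  for (k in 1:3) {
    fit <- tryCatch(
      segmented::segmented(out.lm, seg.Z = ~x, npsi = k),
      error = function(e) NULL
    )
    bic[as.character(k)] <- if (inherits(fit, "segmented")) BIC(fit) else NA_real_
  }
}

bic
names(bic)[which.min(bic)]   # number of breakpoints with the lowest BIC
```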
selgmented() is at a preliminary stage; it appears to work reasonably well
for basic examples, although bugs may occur. Future releases of segmented
will include updates and improvements of selgmented().
Bugs and comments should be sent to vito.muggeo@unipa.it