Journal of Statistical Software
June 2014, Volume 58, Issue 3. http://www.jstatsoft.org/
changepoint: An R Package for Changepoint Analysis
Rebecca Killick
Lancaster University
Idris A. Eckley
Lancaster University
Abstract
One of the key challenges in changepoint analysis is the ability to detect multiple
changes within a given time series or sequence. The changepoint package has been
developed to provide users with a choice of multiple changepoint search methods to use in
conjunction with a given changepoint method, and in particular provides an implementation
of the recently proposed PELT algorithm. This article describes the search methods
which are implemented in the package as well as some of the available test statistics, whilst
highlighting their application with simulated and practical examples. Particular emphasis
is placed on the PELT algorithm and how results differ from the binary segmentation
approach.
Keywords: segmentation, break points, search methods, bioinformatics, energy time series, R.
1. Introduction
There is a growing need to be able to identify the location of multiple change points within
time series. However, as datasets increase in length the number of possible solutions to
the multiple changepoint problem increases combinatorially. Over the years several multiple
changepoint search algorithms have been proposed to overcome this challenge, most notably
the binary segmentation algorithm (Scott and Knott 1974; Sen and Srivastava 1975); the
segment neighborhood algorithm (Auger and Lawrence 1989; Bai and Perron 1998) and more
recently the PELT algorithm (Killick, Fearnhead, and Eckley 2012a). This paper describes
the changepoint package (Killick, Eckley, and Haynes 2014), available for R (R Core Team
2014) from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.
org/package=changepoint. Package changepoint makes each of these algorithms available,
thus enabling users to select which method they would like to use for their analysis.
We are by no means the first to develop a changepoint package for the R environment. At
the time of writing several such packages exist, including those which provide a single test
statistic, e.g., sde (Iacus 2009), bcp (Erdman and Emerson 2007), and/or are designed for a
specific (typically genomic) application, e.g., cumSeg (Muggeo 2012), DNAcopy (Seshan and
Olshen 2008). More comprehensive R packages are also available, such as strucchange (Zeileis,
Leisch, Hornik, and Kleiber 2002) for changes in regression and cpm (Ross 2013) for online
changepoint detection. However, all of the aforementioned packages implement a single search
method for detecting multiple changepoints. In contrast, the changepoint package uniquely
provides a choice of search algorithms for multiple changepoint detection in addition to a
variety of test statistics. In particular the package implements the search algorithms for a
selection of popular changepoint and penalty types. Specifically methods are implemented
for the change in mean and/or variance settings with a similar argument structure where
each function outputs an object of class ‘cpt’. Such an approach is deliberate to breed
familiarity and ease of use. Whilst the package is driven from these core functions, part of
our philosophy is to make it easier for others to use and adapt code snippets as appropriate.
To this end we have deliberately coded each part of a method in an individual function
which is also exported. Whilst several test statistics are included in the changepoint package
there are currently some notable gaps which are covered by other software. These include
changes in regression (see strucchange, Zeileis et al. 2002) and changes in autocorrelation
(see AutoPARM available from Davis, Lee, and Rodriguez-Yam 2006). In addition there is
currently no general software available whereby the user can supply their own cost function
and this would be an interesting avenue to pursue. A list of general changepoint software, and
indeed recent preprints in the area, is available from The Changepoint Repository (Killick,
Nam, Aston, and Eckley 2012b, http://changepoint.info).
The remainder of the paper is structured as follows. A brief background to changepoint
analysis is given in Section 2 before Section 3 describes the ‘cpt’ class and its methods.
Following this, the three main functions, cpt.mean, cpt.var and cpt.meanvar, are described
and explored using simulated and practical examples. In these sections particular emphasis
is placed on how to identify multiple changepoints and the difference between exact and
approximate methods. The paper is summarized in Section 7, where we provide a discussion.
2. Changepoint detection
This section begins by introducing the reader to changepoints through the single changepoint
problem before considering the extension to multiple changepoints. In its simplest form,
changepoint detection is the name given to the problem of estimating the point at which the
statistical properties of a sequence of observations change. Detecting such changes is important
in many different application areas. Recent examples include climatology (Reeves, Chen,
Wang, Lund, and Lu 2007), bioinformatic applications (Erdman and Emerson 2008), finance
(Zeileis, Shah, and Patnaik 2010), oceanography (Killick, Eckley, Jonathan, and Ewans 2010)
and medical imaging (Nam, Aston, and Johansen 2012).
More formally, let us assume we have an ordered sequence of data, $y_{1:n} = (y_1, \ldots, y_n)$. A
changepoint is said to occur within this set when there exists a time, $\tau \in \{1, \ldots, n-1\}$,
such that the statistical properties of $\{y_1, \ldots, y_\tau\}$ and $\{y_{\tau+1}, \ldots, y_n\}$ are different in some
way. Extending this idea of a single changepoint to multiple changes, we will have a number
of changepoints, $m$, together with their positions, $\tau_{1:m} = (\tau_1, \ldots, \tau_m)$. Each changepoint
position is an integer between 1 and $n-1$ inclusive. We define $\tau_0 = 0$ and $\tau_{m+1} = n$, and
assume that the changepoints are ordered so that $\tau_i < \tau_j$ if, and only if, $i < j$. Consequently
the $m$ changepoints will split the data into $m+1$ segments, with the $i$th segment containing
data $y_{(\tau_{i-1}+1):\tau_i}$. Each segment will be summarized by a set of parameters. The parameters
associated with the $i$th segment will be denoted $\{\theta_i, \phi_i\}$, where $\phi_i$ is a (possibly null) set of
nuisance parameters and $\theta_i$ is the set of parameters that we believe may contain changes.
Typically we want to test how many segments are needed to represent the data, i.e., how
many changepoints are present, and estimate the values of the parameters associated with
each segment.
2.1. Single changepoint detection
Let us briefly recap the likelihood based framework for changepoint detection. Before considering
the more general problem of identifying $\tau_{1:m}$ changepoint positions, we first consider
the identification of a single changepoint. The detection of a single changepoint can be posed
as a hypothesis test. The null hypothesis, $H_0$, corresponds to no changepoint ($m = 0$) and
the alternative hypothesis, $H_1$, is a single changepoint ($m = 1$).
We now introduce the general likelihood ratio based approach to test this hypothesis. The
potential for using a likelihood based approach to detect changepoints was first proposed by
Hinkley (1970) who derives the asymptotic distribution of the likelihood ratio test statistic
for a change in the mean within normally distributed observations. The likelihood based
approach was extended to changes in variance within normally distributed observations by
Gupta and Tang (1987). The interested reader is referred to Silva and Teixeira (2008) and
Eckley, Fearnhead, and Killick (2011) for a more comprehensive review.
A test statistic can be constructed which we will use to decide whether a change has occurred.
The likelihood ratio method requires the calculation of the maximum log-likelihood under
both null and alternative hypotheses. For the null hypothesis the maximum log-likelihood is
$\log p(y_{1:n} \mid \hat{\theta})$, where $p(\cdot)$ is the probability density function associated with the distribution of
the data and $\hat{\theta}$ is the maximum likelihood estimate of the parameters.

Under the alternative hypothesis, consider a model with a changepoint at $\tau_1$, with
$\tau_1 \in \{1, 2, \ldots, n-1\}$. Then the maximum log-likelihood for a given $\tau_1$ is

$$ML(\tau_1) = \log p(y_{1:\tau_1} \mid \hat{\theta}_1) + \log p(y_{(\tau_1+1):n} \mid \hat{\theta}_2). \qquad (1)$$

Given the discrete nature of the changepoint location, the maximum log-likelihood value
under the alternative is simply $\max_{\tau_1} ML(\tau_1)$, where the maximum is taken over all possible
changepoint locations. The test statistic is thus

$$\lambda = 2 \left[ \max_{\tau_1} ML(\tau_1) - \log p(y_{1:n} \mid \hat{\theta}) \right].$$

The test involves choosing a threshold, $c$, such that we reject the null hypothesis if $\lambda > c$. If
we reject the null hypothesis, i.e., detect a changepoint, then we estimate its position as $\hat{\tau}_1$,
the value of $\tau_1$ that maximizes $ML(\tau_1)$. The appropriate value for this parameter $c$ is still an
open research question with several authors devising $p$ values and other information criteria
under different types of changes. We refer the interested reader to Guyon and Yao (1999);
Chen and Gupta (2000); Lavielle (2005); Birge and Massart (2007) for interesting discussions
and suggestions for $c$.
It is clear that the likelihood test statistic can be extended to multiple changes simply by
summing the likelihood for each of the $m$ segments. The problem becomes one of identifying
the maximum of $ML(\tau_{1:m})$ over all possible combinations of $\tau_{1:m}$. The following section
explores existing search methods that address this problem.
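To make the hypothesis test above concrete, the following is a minimal sketch of the likelihood-ratio test for a single change in mean, assuming Normal data with known unit variance so that the log-likelihoods reduce to residual sums of squares (the additive constants cancel in $\lambda$). The function name single.cpt.mean and the SIC-style default threshold 2 log n are our own illustrative choices; the package's cpt.mean function (Section 4) provides the supported implementation.

# Illustrative likelihood-ratio test for a single change in mean, assuming
# Normal data with known variance 1 (names and threshold are hypothetical).
single.cpt.mean <- function(y, c = 2 * log(length(y))) {
  n <- length(y)
  stopifnot(n >= 2)                              # need at least two points
  null.ll <- -sum((y - mean(y))^2) / 2           # log-likelihood under H0
  ML <- sapply(1:(n - 1), function(tau) {        # ML(tau_1) of Equation 1
    -sum((y[1:tau] - mean(y[1:tau]))^2) / 2 -
      sum((y[(tau + 1):n] - mean(y[(tau + 1):n]))^2) / 2
  })
  lambda <- 2 * (max(ML) - null.ll)              # test statistic
  list(reject = lambda > c, tau.hat = which.max(ML), lambda = lambda)
}

set.seed(1)
single.cpt.mean(c(rnorm(50, 0), rnorm(50, 2)))   # should flag a change near 50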
2.2. Multiple changepoint detection
With increased collection of time series and signal streams there is a growing need to be
able to efficiently and accurately estimate the location of multiple changepoints. This section
briefly introduces the main search methods available for identifying multiple changepoints
within the changepoint package. Arguably the most common approach to identify multiple
changepoints in the literature is to minimize

$$\sum_{i=1}^{m+1} C\left(y_{(\tau_{i-1}+1):\tau_i}\right) + \beta f(m) \qquad (2)$$

where $C$ is a cost function for a segment, e.g., negative log-likelihood, and $\beta f(m)$ is a penalty
to guard against over fitting (a multiple changepoint version of the threshold $c$). This is
the approach which we adopt in this paper and the accompanying package. A brute force
approach to solve this minimization considers $2^{n-1}$ solutions, reducing to $\binom{n-1}{m}$ if $m$ is known.
The changepoint package implements three multiple changepoint algorithms that minimize
(2): binary segmentation (Edwards and Cavalli-Sforza 1965), segment neighborhoods (Auger
and Lawrence 1989) and the recently proposed pruned exact linear time (PELT) algorithm
(Killick et al. 2012a). Each of these algorithms is briefly described in the following paragraphs;
for more information see the corresponding references.
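As a concrete illustration of Equation 2, the sketch below evaluates the penalized cost of a candidate changepoint set using a residual sum of squares segment cost and $f(m) = m$. The helper names seg.cost and pen.cost are hypothetical and didactic only; the package computes such costs internally.

# Penalized cost of Equation 2 for a candidate changepoint set tau, using a
# residual sum of squares segment cost and f(m) = m (hypothetical helpers).
seg.cost <- function(y) sum((y - mean(y))^2)     # C(.) for one segment
pen.cost <- function(y, tau, beta = 2 * log(length(y))) {
  bounds <- c(0, sort(tau), length(y))           # tau_0 = 0, tau_{m+1} = n
  costs <- sapply(seq_len(length(bounds) - 1), function(i)
    seg.cost(y[(bounds[i] + 1):bounds[i + 1]])) # cost of each segment
  sum(costs) + beta * length(tau)                # sum of C plus beta * m
}

Comparing pen.cost(y, tau) across candidate sets tau is exactly the minimization that the three search algorithms below perform, each trading off exhaustiveness against speed.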
At the time of writing binary segmentation is arguably the most widely used multiple changepoint
search method and originates from the work of Edwards and Cavalli-Sforza (1965), Scott
and Knott (1974) and Sen and Srivastava (1975). Briefly, binary segmentation first applies a
single changepoint test statistic to the entire data; if a changepoint is identified, the data are
split into two at the changepoint location. The single changepoint procedure is repeated on
the two new data sets, before and after the change. If changepoints are identified in either
of the new data sets, they are split further. This process continues until no changepoints are
found in any parts of the data. This procedure is an approximate minimization of (2) with
$f(m) = m$ as any changepoint locations are conditional on changepoints identified previously.
Binary segmentation is thus an approximate algorithm but is computationally fast as it only
considers a subset of the $2^{n-1}$ possible solutions. The computational complexity of the
algorithm is $O(n \log n)$ but this speed can come at the expense of accuracy of the resulting
changepoints (see Killick et al. 2012a, for details).
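The recursion just described is compact enough to sketch directly. The code below reuses the hypothetical single.cpt.mean helper from Section 2.1 and is illustrative only; the offset argument keeps reported locations on the scale of the full series, and the package's "BinSeg" method is the supported implementation.

# Illustrative binary segmentation for a change in mean, built on the
# hypothetical single.cpt.mean sketch above (not the package code).
binseg <- function(y, offset = 0, c = 2 * log(length(y))) {
  if (length(y) < 2) return(integer(0))          # segment too short to split
  res <- single.cpt.mean(y, c)
  if (!res$reject) return(integer(0))            # no significant change here
  tau <- res$tau.hat
  c(binseg(y[1:tau], offset, c),                 # recurse before the change
    offset + tau,                                # record this changepoint
    binseg(y[(tau + 1):length(y)], offset + tau, c))  # recurse after it
}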
The segment neighborhood algorithm was proposed by Auger and Lawrence (1989) and further
explored in Bai and Perron (1998). The algorithm minimizes the expression given by
Equation 2 exactly, using a dynamic programming technique to obtain the optimal segmentation
for $m+1$ changepoints by reusing the information that was calculated for $m$ changepoints.
This reduces the computational complexity from $O(2^n)$ for a naive search to $O(Qn^2)$, where
$Q$ is the maximum number of changepoints to identify. Whilst this algorithm is exact, the
computational complexity is considerably higher than that of binary segmentation.
The binary segmentation and segment neighborhood algorithms would appear to indicate a
trade-off between speed and accuracy; however, this need not be the case. The PELT algorithm
proposed by Killick et al. (2012a) is similar to the segment neighborhood algorithm
in that it provides an exact segmentation. However, due to the construction of the PELT
algorithm, it can be shown to be more computationally efficient, due to its use of dynamic
programming and pruning, which can result in an $O(n)$ search algorithm subject to certain
assumptions being satisfied, the majority of which are not particularly onerous. Indeed the
main assumption that controls the computational time is that the number of changepoints
increases linearly as the data set grows, i.e., changepoints are spread throughout the data
rather than confined to one portion.
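For intuition, the following didactic sketch implements the underlying dynamic programming recursion, $F(t) = \min_{s < t} \{F(s) + C(y_{(s+1):t}) + \beta\}$, together with the candidate pruning step that gives PELT its speed. It reuses the hypothetical seg.cost helper above and assumes the residual sum of squares cost, for which the pruning constant $K = 0$ is valid; the package's PELT implementation is the supported, optimized version.

# Didactic PELT-style recursion with pruning (hypothetical seg.cost helper
# from above; not the package's optimized implementation).
pelt.sketch <- function(y, beta = 2 * log(length(y))) {
  n <- length(y)
  Fcost <- c(-beta, rep(NA, n))                  # Fcost[s + 1] stores F(s)
  cp <- vector("list", n + 1); cp[[1]] <- integer(0)
  cand <- 0                                      # candidate last changepoints
  for (t in 1:n) {
    vals <- sapply(cand, function(s)
      Fcost[s + 1] + seg.cost(y[(s + 1):t]) + beta)
    Fcost[t + 1] <- min(vals)
    best <- cand[which.min(vals)]
    cp[[t + 1]] <- c(cp[[best + 1]], best)       # trace back optimal partition
    cand <- c(cand[vals - beta <= Fcost[t + 1]], t)  # prune (K = 0 for RSS)
  }
  setdiff(cp[[n + 1]], 0)                        # changepoints, excluding tau_0
}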
All three search algorithms are available within the changepoint package. The following
sections introduce the structure of the package, its S4 class ‘cpt’, and the core functions
that enable quick and efficient analysis of changepoint problems.
3. Introduction to the package and the ‘cpt’ class
The changepoint package introduces a new object class called ‘cpt’ to store changepoint
analysis objects. This section provides an introduction to the structure and methods associated
with the ‘cpt’ class, together with examples of its specific use.
Each of the core functions outputs an object of the ‘cpt’ S4 class. The class has been
constructed such that the ‘cpt’ object contains the main features required for a changepoint
analysis and future summaries. Each of these is stored within a slot entry in the ‘cpt’ class.
The slots within the class are:
data.set – a time series (‘ts’) object containing the numeric values of the data;
cpttype – characters describing the type of changepoint sought e.g., mean, variance;
method – characters denoting the single or multiple changepoint search method applied;
test.stat – characters denoting the test statistic, i.e., assumed distribution / distribution-
free method;
pen.type – characters denoting the penalty type, e.g., AIC, BIC, manual;
pen.value – the numeric value of the penalty used in the analysis;
cpts – a numeric vector giving the estimated changepoint locations, always ending in n,
the length of the time series in the data.set slot;
ncpts.max – the numeric maximum number of changepoints searched for, e.g., 1, 5, Inf,
denoted Q in Section 2;
param.est – a list of parameters where each element in the list is a vector of the
estimated numeric parameter values for each segment, denoted $\theta_i$ in Section 2;
date – the system time / date when the analysis was performed.
Slots of an S4 object are typically accessed using the @ symbol (in contrast to the $ for S3
objects). Whilst this is still possible in the changepoint package, we have created accessor and
replacement functions to control the access and replacement of slots. The accessor functions
are simply the slot names. For example data.set(x) displays the vector of data contained
within the ‘cpt’ object x. The class slots are automatically populated with the correct infor-
mation obtained from the completed analysis. Feedback from trials with package users
indicates that the accessor and replacement functions aid ease-of-use for those unfamiliar with
S4 classes. Further demonstration of how the accessor and replacement functions work in
practice is given in the examples within each section.
In addition to accessor and replacement functions, the changepoint package also contains a
couple of extra functions that a user may find useful. The first of these is the ncpts function
which, given a ‘cpt’ object from a changepoint analysis, returns the number of identified
changepoints. This can be particularly useful if the number of changepoints is expected to be
large and/or users wish to quickly check whether the returned number of changepoints is equal
to the maximum searched for when using the binary segmentation or segment neighborhood
search algorithms. Similarly the second additional function, seg.len, returns the size of
the segments, i.e., how many observations there are between consecutive changepoints. This
may be useful when performing a changepoint analysis as short segments can be used as an
indicator that the penalty function may be set too low.
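A short hypothetical session illustrates the accessor and utility functions described above; the simulated data and the object name x are our own illustrative choices.

# Hypothetical session with a fitted 'cpt' object (illustrative only).
library("changepoint")
set.seed(1)
x <- cpt.mean(c(rnorm(100, 0), rnorm(100, 5)))  # default single change search
cpts(x)       # estimated changepoint location(s)
data.set(x)   # the data as a 'ts' object
pen.type(x)   # penalty type used, e.g., "SIC"
ncpts(x)      # number of changepoints identified
seg.len(x)    # segment lengths; very short segments can flag a low penalty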
All the functions described above are related to the ‘cpt’ class within the changepoint package.
The following section reviews the methods that act on the ‘cpt’ class.
3.1. Methods within the ‘cpt’ class
The methods associated with the ‘cpt’ class are summary, print, plot, coef and logLik.
The summary and print methods display standard information about the ‘cpt’ object. The
summary function displays a synopsis of the results from the analysis including number of
changepoints and, where this number is small, the location of those changepoints. In contrast,
the print function prints details pertaining to the S4 class including slot names and when
the S4 object was created.
Having performed a changepoint analysis, it is often helpful to be able to plot the changepoints
on the original data to visually inspect whether the estimated changepoints are reasonable. To
this end we include a plot method for the ‘cpt’ class. The method adapts to the assumed type
of changepoint, providing a different output dependent on the type of change. For example, a
change in variance is denoted by a vertical line at the changepoint location whereas a change
in mean is indicated by horizontal lines depicting the mean value in different segments.
Similarly once a changepoint analysis has been conducted one may wish to retrieve the
parameter values for each segment or the log-likelihood for the fitted data. These can be obtained
using the standard coef and logLik generics; examples are given in the code detailed below.
The following sections explore the use of the core functions within the changepoint package.
We begin in Section 4 by demonstrating the key steps to a changepoint analysis via the
cpt.mean function. Sections 5 and 6 utilize the steps in the change in mean analysis to
explore changes in variance and both mean and variance respectively.
4. Changes in mean: The cpt.mean function
Early work on changepoint problems focused on identifying changes in mean and includes the
work of Page (1954) and Hinkley (1970) who created the likelihood ratio and cumulative sum
(CUSUM) test statistics respectively.
Within the changepoint package all change in mean methods are accessed using the cpt.mean
function. The function is structured as follows:
cpt.mean(data, penalty = "SIC", pen.value = 0, method = "AMOC", Q = 5,
test.stat = "Normal", class = TRUE, param.estimates = TRUE)
The arguments within this function are:
data – A vector or ‘ts’ object containing the data within which to find a change
in mean. If multiple datasets are to be analyzed, then this can be a matrix where
each row is considered a separate dataset.
penalty – Choice of "None", "SIC", "BIC", "AIC", "Hannan-Quinn", "Asymptotic"
and "Manual" penalties. If "Manual" is specified, the manual penalty is contained in
pen.value. If "Asymptotic" is specified, the theoretical type I error is contained in
pen.value. The predefined penalties listed do not count the changepoint as a parameter;
postfix a 1, e.g., "SIC1", to count the changepoint as a parameter.
pen.value – The theoretical type I error, e.g., 0.05, when using the "Asymptotic"
penalty. Alternatively, when using the "Manual" penalty it is a numeric value or text
which, when evaluated, results in a penalty value.
method – Single or multiple changepoint method. Choice of "AMOC" (at most one
change), "PELT", "SegNeigh" or "BinSeg". Default is "AMOC". See Section 2 for further
details of methods.
Q – When using the "BinSeg" method this is the maximum number of changepoints
to search for. When using the "SegNeigh" method this is the maximum number of
segments (number of changepoints + 1) to search for. This is not required for the
"PELT" method as this automatically selects the number of segments.
test.stat – The test statistic, i.e., assumed distribution or distribution-free method
for data. Choice of "Normal" or "CUSUM". The test statistics behind the distributional
options are contained within Hinkley (1970) for the "Normal" option and Page (1954)
for the "CUSUM" option.
class – Logical. If TRUE then an object of class ‘cpt’ is returned.
param.estimates – Logical. If TRUE and class = TRUE then parameter estimates are
returned. If FALSE or class = FALSE no parameter estimates are returned.
Briefly the search options consist of exact methods: PELT ($O(n)$ if assumptions are satisfied)
and segment neighborhoods ($O(Qn^2)$); and approximate methods: binary segmentation
($O(n \log n)$). Further details of the search options in the method argument are given in
Section 2.
Several standard penalty functions used within changepoint analysis have been included in
this function. These are: SIC (Schwarz information criterion), BIC (Bayesian information
criterion), AIC (Akaike information criterion) and Hannan-Quinn. The authors will seek to
include further penalty functions, such as minimum description length (MDL) (Davis et al.
2006), in future versions of the package. The user can also enter a manual penalty value by
numeric value or formula. An example of using a manual penalty value with a formula is given
in Section 4.1. In addition to the standard R functions, the following variables are available
for the user to utilize:
tau – the proposed changepoint location (only available when using "AMOC");
null – the likelihood under the null model of no changepoint (only available when using
"AMOC");
alt – the likelihood under the alternative model of a single changepoint (only available
when using "AMOC");
diffparam – the difference in the number of parameters between the no changepoint
and single changepoint model e.g., for a Normal distribution, 1 for a change in mean or
variance and 2 for a change in both mean and variance;
n – the length of the data.
Thus if one wanted to use a penalty based on the ratio of the lengths of data before and
after the change, then one may use penalty = "Manual", pen.value = "tau / (n - tau)".
Note this is only possible using "AMOC".
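For instance, a hypothetical call using this ratio penalty would be (the simulated vector y is our own illustrative choice):

# Hypothetical manual-penalty call with the formula described above.
library("changepoint")
set.seed(1)
y <- c(rnorm(150), rnorm(250, mean = 1))        # single change at 150
fit <- cpt.mean(y, penalty = "Manual", pen.value = "tau / (n - tau)",
                method = "AMOC")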
The remainder of this section gives a worked example exploring how to identify a change in
mean.
4.1. Example: Changes in mean
We now describe the general structure of a changepoint analysis using the changepoint
package. We begin by demonstrating the various possible stages within a change in mean analysis.
To this end we simulate a dataset (m.data) of length 400 with multiple changepoints at 100,
200, 300. The sequence has four segments and the means for each segment are 0, 1, 0, 0.2.
R> library("changepoint")
R> set.seed(10)
R> m.data <- c(rnorm(100, 0, 1), rnorm(100, 1, 1), rnorm(100, 0, 1),
+ rnorm(100, 0.2, 1))
R> ts.plot(m.data, xlab = "Index")
Imagine that we have been presented with this dataset and are asked to perform a changepoint
analysis. The first question we aim to answer is “Is there a change within the data?”. Our first
choice in answering this question is whether we wish to consider a single change or whether
multiple changes are plausible. From a visual inspection of the data in Figure 1(a), we suspect
multiple changes in mean may exist.
The challenge in multiple changepoint detection is identifying the optimal number and location
of changepoints, as the number of solutions increases rapidly with the size of the data. In this
example where n = 400, we have 399 possible solutions for a single changepoint; for two
changes there are 79401 possible solutions, and this is before taking into account that we do
not know how many changes there are! As such it is clearly desirable to use an efficient method
for searching the large solution space.
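These counts are simply binomial coefficients over the n - 1 = 399 candidate locations, as a quick check in R confirms:

choose(399, 1)  # 399 ways to place one changepoint in n - 1 = 399 slots
choose(399, 2)  # 79401 ways to place two changepoints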
Any of the three search methods could be used to detect these changes. For this example we
will compare the PELT and binary segmentation search methods as this provides a comparison
between exact and approximate algorithms (see Section 2). For now we will assume that the
dataset is independent and Normally distributed and consider an alternative towards the end
of this section.
R> m.pelt <- cpt.mean(m.data, method = "PELT")
R> plot(m.pelt, type = "l", cpt.col = "blue", xlab = "Index",
+ cpt.width = 4)
R> cpts(m.pelt)
[1] 97 192 273 353 362 366
R> m.binseg <- cpt.mean(m.data, method = "BinSeg")
R> plot(m.binseg, type = "l", xlab = "Index", cpt.width = 4)
R> cpts(m.binseg)
[1] 79 99 192 273
In this case, where we use the default SIC penalty, the cpts function returned 6 changepoints
(97, 192, 273, 353, 362, 366) for PELT and 4 changepoints (79, 99, 192, 273) for binary
segmentation. By construction we know that there are three changepoints within the dataset.
We can either believe that there are six/four changes or consider that the method is too
sensitive and try to compensate by increasing the penalty. The choice of appropriate penalty
is still an open question and typically depends on many factors including the size of the
changes and the length of segments, both of which are unknown prior to analysis (see Guyon
and Yao 1999; Lavielle 2005; Birge and Massart 2007). As new approaches to penalty choice
become available we will seek to include them within the changepoint package. In current
practice, the choice of penalty is often assessed by plotting the data and changepoints to see
if they seem reasonable.
Figure 1(b) shows the m.pelt changepoints. Note that there are two changes towards the end
of the dataset which have very small segments. These are plausibly artifacts of the data rather
than true changes in the underlying process. In an effort to remove these seemingly spurious
changepoints we can increase the penalty to 1.5 * log(n) rather than log(n) (SIC). This
change is achieved by changing the penalty type to "Manual" and setting the value argument
to "1.5 * log(n)". Figure 1(d) shows the result which seem more plausible.
R> m.pm <- cpt.mean(m.data, penalty = "Manual", pen.value = "1.5 * log(n)",
+ method = "PELT")
R> plot(m.pm, type = "l", cpt.col = "blue", xlab = "Index", cpt.width = 4)
R> cpts(m.pm)
[1] 97 192 273
On the other hand, if we only consider the changepoints identified by the binary segmentation
algorithm in Figure 1(c) then we may plausibly believe that there are four changes within
the data as the spurious segment is much larger. However, for comparison we also perform
the analysis with the increased penalty and find that the changepoints identified remain the
same.
R> m.bsm <- cpt.mean(m.data, "Manual", pen.value = "1.5 * log(n)",
+ method = "BinSeg")
R> cpts(m.bsm)
[1] 79 99 192 273

Figure 1: Plot of the simulated dataset m.data along with horizontal lines for the underlying
(fitted) mean: (a) m.data; (b) PELT changepoints with default penalty; (c) binary
segmentation changepoints with default penalty; (d) PELT changepoints with manual penalty.
Recall from Section 2 that both the segment neighborhood and PELT algorithms are exact.
Thus, for a linear penalty, the only difference between them is their computational time. A
user can run the commands below on their own computer to identify their personal speedup
for this example.
R> system.time(cpt.mean(m.data, method = "SegNeigh"))
R> system.time(cpt.mean(m.data, method = "PELT"))
On modern computers PELT will return a time of 0.001 or 0.002 seconds for this example,
compared to segment neighborhoods for which the authors have seen times ranging from 0.4
to 1.1 seconds.
As a final note on this example, if the Normal assumption made at the start of the analysis
is questionable then the CUSUM method, which has no distributional assumptions, can be
used by adding the argument test.stat = "CUSUM".
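As a sketch, such a distribution-free analysis of the same data might look as follows; note that penalty choice needs extra care here since likelihood-based criteria such as SIC no longer directly apply, and the changepoints found may differ from the Normal-based analysis.

# Sketch of a distribution-free analysis of m.data using the CUSUM test
# statistic with binary segmentation (results may differ from the Normal one).
m.cusum <- cpt.mean(m.data, test.stat = "CUSUM", method = "BinSeg")
cpts(m.cusum)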
Thus far we have only considered a simulated example. In the next section we apply the
cpt.mean function to some Glioblastoma data previously analyzed by Lai, Johnson,
Kucherlapati, and Park (2005).
4.2. Case study: Glioblastoma
Lai et al. (2005) compare different methods for segmenting array comparative genomic
hybridization (aCGH) data from Glioblastoma multiforme (GBM), a type of brain tumor. These
arrays were developed to identify DNA copy number alteration corresponding to chromosomal
aberrations. High-throughput aCGH data are intensity ratios of diseased vs. control samples
indexed by the location on the genome. Values greater than 1 indicate diseased samples have
additional chromosomes and values less than 1 indicate fewer chromosomes. Detection of
these aberrations can aid future screening and treatments of diseases.
The example we consider is from Figure 4 in Lai et al. (2005); the data are replicated in the
changepoint package for ease. Following Lai et al. (2005) we fit a Normal distribution with a
piecewise constant mean using a likelihood criterion. Figure 2 demonstrates that PELT (with
default penalty) gives the same segmentation as the CGHseg method from Lai et al. (2005).
R> data("Lai2005fig4", package = "changepoint")
R> Lai.default <- cpt.mean(Lai2005fig4[, 5], method = "PELT")
R> plot(Lai.default, pch = 20, col = "grey", cpt.col = "black", type = "p",
+ xlab = "Index")
R> cpts(Lai.default)
[1] 81 85 89 96 123 133
R> coef(Lai.default)
$mean
[1] 0.2468910 4.6699210 0.4495538 4.5902489 0.2079891 4.2913844 0.2291286

Figure 2: Plot of the GBM data along with horizontal lines for the underlying mean.
5. Changes in variance: The cpt.var function
Whilst considerable research effort has been given to the change in mean problem, Chen and
Gupta (1997) observe that the detection of changes in variance has received comparatively
little attention. Much of the work in this area builds on the foundational work of Hinkley
(1970) in the change in mean setting. See for example Hsu (1979), Horvath (1993) and Chen
and Gupta (1997) who extend Hinkley’s ideas to the change in variance setting. Existing
methods within the change in variance literature find it hard to detect subtle changes in
variability, see Killick et al. (2010).
Within the changepoint package all change in variance methods are accessed using the cpt.var
function. The function is structured as follows:
cpt.var(data, penalty, pen.value, know.mean = FALSE, mu = NA, method, Q,
test.stat = "Normal", class, param.estimates)
The data, penalty, pen.value, method, Q, class and param.estimates arguments are the
same as for the cpt.mean function (see Section 4). The three remaining arguments are
interpreted as follows.
know.mean – This logical argument is only required for test.stat = "Normal". If TRUE
then the mean is assumed known and mu is taken as its value. If FALSE and mu = NA
(default value) then the mean is estimated via maximum likelihood. If FALSE and the
value of mu is supplied, mu is not estimated but is counted as an estimated parameter
for decisions.
mu – Only required for test.stat = "Normal". Numerical value of the true mean of
the data (if known). Either single value or vector of length nrow(data). If data is a
matrix and mu is a single value, the same mean is used for each row.
test.stat – The test statistic, i.e., assumed distribution or distribution-free method
for data. Choice of "Normal" or "CSS". The test statistics behind the distributional
options are contained within Chen and Gupta (2000) for the "Normal" option and Chen
and Gupta (1997) for the "CSS" option.
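A hypothetical call illustrating the know.mean and mu arguments above, for a series whose true mean is known to be zero (the simulated data are our own illustrative choice):

# Hypothetical call: fix a known zero mean so it is not estimated.
library("changepoint")
set.seed(1)
y <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 0, sd = 3))
v.fit <- cpt.var(y, know.mean = TRUE, mu = 0, method = "PELT")
cpts(v.fit)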
The remainder of this section is a worked example considering changes in variability within
wind speeds.
5.1. Case study: Irish wind speeds
With the increase of wind-based renewables in the power grid, there is now great interest
in forecasting wind speeds. Often modelers assume a constant dependence structure when
modeling the existing data before producing a forecast. Here we conduct a naive changepoint
analysis of wind speed data which are available in the R package gstat (Pebesma 2004).
The data provided are daily wind speeds from 12 meteorological stations in the Republic of
Ireland. The data have previously been analyzed by several authors including Haslett and
Raftery (1989) and Gneiting, Genton, and Guttorp (2007). These analyses were concerned
with a spatial-temporal model for 11 of the 12 sites. Here we consider a single site, Claremorris,
depicted in Figure 3.
R> data("wind", package = "gstat")
R> ts.plot(wind[, 11], xlab = "Index")
The variability of the data appears smaller in some sections and larger in others; this motivates
a search for changes in variability. Wind speeds are by nature diurnal and thus have a periodic
mean. The change in variance approaches within the cpt.var function require the data to
have a fixed mean value over time and thus this periodic mean must be removed prior to
analysis. Whilst there are a range of options for removing this mean, we choose to take first
differences as this does not require any modeling assumptions. Following this we assume that
the differences follow a Normal distribution with changing variance and thus use the cpt.var
function. Again we compare the analyses provided by the PELT and binary segmentation
algorithms.
R> wind.pelt <- cpt.var(diff(wind[, 11]), method = "PELT")
R> plot(wind.pelt, xlab = "Index")
R> logLik(wind.pelt)
-like -likepen
37328.68 37856.13
R> wind.bs <- cpt.var(diff(wind[, 11]), method = "BinSeg")
Warning message:
In binseg.var.norm(coredata(data), Q, pen.value, know.mean, mu) :
The number of changepoints identified is Q, it is advised to increase Q to
make sure changepoints have not been missed.
R> ncpts(wind.bs)
[1] 5
Note that unlike the PELT algorithm, the binary segmentation algorithm has only found
5 changepoints. This is because we used the default value of the parameter Q = 5, which
caps the number of changepoints identified at 5. A warning message is produced; when
performing an analysis using binary segmentation the returned number of changepoints
should always be checked against Q and the default increased if necessary.
R> wind.bs <- cpt.var(diff(wind[, 11]), method = "BinSeg", Q = 60)
R> plot(wind.bs, xlab = "Index")
R> ncpts(wind.bs)
[1] 8
R> logLik(wind.bs)
-like -likepen
37998.37 38068.69
As we are considering the negative log-likelihood, the smaller value provided by PELT is
preferred. Even when eye-balling the results, it would appear that the PELT segmentation is
more appropriate than that of the binary segmentation analysis; see Figure 3.

Figure 3: (a) Republic of Ireland daily wind speeds; (b) and (c) show the first differences of
(a) with vertical lines depicting the changepoints identified by (b) PELT and (c) binary
segmentation.
6. Changes in mean and variance: The cpt.meanvar function
The changepoint package contains four distributional choices for a change in both the mean
and variance: Exponential, Gamma, Poisson and Normal. The Exponential, Gamma and
Poisson distributional choices only require a change in a single parameter to change both
the mean and the variance. In contrast, the Normal distribution requires a change in two
parameters. The multiple parameter changepoint problem has been considered by many
authors including Horvath (1993) and Picard, Robin, Lavielle, Vaisse, and Daudin (2005).
Each distributional option is available within the cpt.meanvar function which has a similar
structure to the cpt.mean and cpt.var functions from previous sections. The basic call
format is as follows:
cpt.meanvar(data, penalty, pen.value, method, Q, test.stat = "Normal", class,
param.estimates, shape = 1)
The data, penalty, pen.value, method, Q, class and param.estimates arguments are the
same as those described for the cpt.mean function (see Section 4). The remaining arguments
are interpreted as follows.
test.stat – The test statistic, i.e., assumed distribution of data. Choice of "Normal",
"Gamma", "Exponential" or "Poisson".
shape – Value of the known shape parameter required when test.stat = "Gamma".
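For example, a hypothetical call for Gamma distributed (positive) data with a known shape parameter of 1 would be (the simulated data are our own illustrative choice):

# Hypothetical call: Gamma data with known shape parameter 1 and a rate change.
library("changepoint")
set.seed(1)
y <- c(rgamma(100, shape = 1, rate = 1), rgamma(100, shape = 1, rate = 5))
g.fit <- cpt.meanvar(y, test.stat = "Gamma", shape = 1, method = "PELT")
cpts(g.fit)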
Following the format of previous sections we briefly describe a case study using data on notable
inventions / discoveries.
6.1. Case study: Discoveries
This section considers the dataset called discoveries available within the datasets package in
the base distribution of R. The data are the counts of the number of “great” inventions and/or
scientific discoveries in each year from 1860 to 1959. Our approach models each segment as
following a Poisson distribution with its own rate parameter. Again we compare the results
for both PELT and binary segmentation search methods.
R> data("discoveries", package = "datasets")
R> dis.pelt <- cpt.meanvar(discoveries, test.stat = "Poisson",
+ method = "PELT")
R> plot(dis.pelt, cpt.width = 3)
R> cpts.ts(dis.pelt)
[1] 1883 1888 1932 1952
Figure 4: Discoveries dataset with identified changepoints.
R> dis.bs <- cpt.meanvar(discoveries, test.stat = "Poisson",
+ method = "BinSeg")
R> cpts.ts(dis.bs)
[1] 1883 1888 1932 1952
The number and year of the changepoints identified by both methods are the same. Here
we have used the cpts.ts function to return the date of the changepoints rather than their
position within the sequence of data.
7. Summary
The unique contribution of the changepoint package is that the user has the ability to select
the multiple changepoint search method for analysis. The package contains three such
methods: segment neighborhood, binary segmentation and PELT, and this paper has described
and demonstrated some differences between these approaches. The multiple changepoint
search methods are available both for changes in mean and/or variance using distributional
or distribution-free assumptions utilizing both established and novel methods. As such the
changepoint package is useful both for practitioners to implement existing methods and for
researchers to compare the performance of new approaches against the established literature.
Acknowledgments
The authors wish to thank Paul Fearnhead for helpful discussions and encouragement as they
developed this work as well as the editor and anonymous referees for helpful feedback on
earlier versions of this manuscript. R. Killick and I.A. Eckley acknowledge financial support
from Shell Research Limited and the Engineering and Physical Sciences Research Council
(EPSRC).
References
Auger IE, Lawrence CE (1989). “Algorithms for the Optimal Identification of Segment
Neighborhoods.” Bulletin of Mathematical Biology, 51(1), 39–54.

Bai J, Perron P (1998). “Estimating and Testing Linear Models with Multiple Structural
Changes.” Econometrica, 66(1), 47–78.

Birge L, Massart P (2007). “Minimal Penalties for Gaussian Model Selection.” Probability
Theory and Related Fields, 138(1), 33–73.

Chen J, Gupta AK (1997). “Testing and Locating Variance Changepoints with Application
to Stock Prices.” Journal of the American Statistical Association, 92(438), 739–747.

Chen J, Gupta AK (2000). Parametric Statistical Change Point Analysis. Birkhauser.

Davis RA, Lee TC, Rodriguez-Yam GA (2006). “Structural Break Estimation for Nonstationary
Time Series Models.” Journal of the American Statistical Association, 101(473), 223–239.

Eckley IA, Fearnhead P, Killick R (2011). “Analysis of Changepoint Models.” In D Barber,
AT Cemgil, S Chiappa (eds.), Bayesian Time Series Models. Cambridge University Press.

Edwards AWF, Cavalli-Sforza LL (1965). “A Method for Cluster Analysis.” Biometrics, 21(2),
362–375.

Erdman C, Emerson JW (2007). “bcp: An R Package for Performing a Bayesian Analysis
of Change Point Problems.” Journal of Statistical Software, 23(3), 1–13. URL
http://www.jstatsoft.org/v23/i03/.

Erdman C, Emerson JW (2008). “A Fast Bayesian Change Point Analysis for the Segmentation
of Microarray Data.” Bioinformatics, 24(19), 2143–2148.

Gneiting T, Genton MG, Guttorp P (2007). “Geostatistical Space-Time Models, Stationarity,
Separability and Full Symmetry.” In Statistical Methods for Spatio-Temporal Systems, pp.
151–175. Chapman & Hall/CRC.

Gupta AK, Tang J (1987). “On Testing Homogeneity of Variances for Gaussian Models.”
Journal of Statistical Computation and Simulation, 27(2), 155–173.

Guyon X, Yao J (1999). “On the Underfitting and Overfitting Sets of Models Chosen by
Order Selection Criteria.” Journal of Multivariate Analysis, 70(2), 221–249.

Haslett J, Raftery AE (1989). “Space-Time Modelling with Long-Memory Dependence: Assessing
Ireland’s Wind Power Resource.” Journal of the Royal Statistical Society C, 38(1), 1–50.

Hinkley DV (1970). “Inference about the Change-Point in a Sequence of Random Variables.”
Biometrika, 57(1), 1–17.

Horvath L (1993). “The Maximum Likelihood Method of Testing Changes in the Parameters
of Normal Observations.” The Annals of Statistics, 21(2), 671–680.

Hsu DA (1979). “Detecting Shifts of Parameter in Gamma Sequences with Applications to
Stock Price and Air Traffic Flow Analysis.” Journal of the American Statistical Association,
74(365), 31–40.

Iacus SM (2009). sde: Simulation and Inference for Stochastic Differential Equations. R
package version 2.0.10, URL http://CRAN.R-project.org/package=sde.

Killick R, Eckley I, Haynes K (2014). changepoint: An R Package for Changepoint Analysis.
R package version 1.1.5, URL http://CRAN.R-project.org/package=changepoint.

Killick R, Eckley IA, Jonathan P, Ewans K (2010). “Detection of Changes in the Characteristics
of Oceanographic Time-Series using Statistical Change Point Analysis.” Ocean
Engineering, 37(13), 1120–1126.

Killick R, Fearnhead P, Eckley IA (2012a). “Optimal Detection of Changepoints with a
Linear Computational Cost.” Journal of the American Statistical Association, 107(500),
1590–1598.

Killick R, Nam CFH, Aston JAD, Eckley IA (2012b). “changepoint.info: The Changepoint
Repository.” URL http://changepoint.info/.

Lai WR, Johnson MD, Kucherlapati R, Park PJ (2005). “Comparative Analysis of Algorithms
for Identifying Amplifications and Deletions in Array CGH Data.” Bioinformatics, 21(19),
3763–3770.

Lavielle M (2005). “Using Penalized Contrasts for the Change-Point Problem.” Signal
Processing, 85(8), 1501–1510.

Muggeo VMR (2012). cumSeg: Change Point Detection in Genomic Sequences. R package
version 1.1, URL http://CRAN.R-project.org/package=cumSeg.

Nam CFH, Aston JAD, Johansen AM (2012). “Quantifying the Uncertainty in Change
Points.” Journal of Time Series Analysis, 33(5), 807–823.

Page ES (1954). “Continuous Inspection Schemes.” Biometrika, 41(1–2), 100–115.

Pebesma EJ (2004). “Multivariable Geostatistics in S: The gstat Package.” Computers &
Geosciences, 30(7), 683–691.

Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ (2005). “A Statistical Approach for Array
CGH Data Analysis.” BMC Bioinformatics, 6(27), 1–14.

R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Reeves J, Chen J, Wang XL, Lund R, Lu Q (2007). “A Review and Comparison of Changepoint
Detection Techniques for Climate Data.” Journal of Applied Meteorology and Climatology,
46(6), 900–915.

Ross GJ (2013). cpm: Sequential Parametric and Nonparametric Change Detection. R
package version 1.1, URL http://CRAN.R-project.org/package=cpm.

Scott AJ, Knott M (1974). “A Cluster Analysis Method for Grouping Means in the Analysis
of Variance.” Biometrics, 30(3), 507–512.

Sen A, Srivastava MS (1975). “On Tests for Detecting Change in Mean.” The Annals of
Statistics, 3(1), 98–108.

Seshan VE, Olshen A (2008). DNAcopy: DNA Copy Number Data Analysis. R package
version 1.24.0, URL http://www.Bioconductor.org/packages/release/bioc/html/DNAcopy.html.

Silva EG, Teixeira AAC (2008). “Surveying Structural Change: Seminal Contributions and a
Bibliometric Account.” Structural Change and Economic Dynamics, 19(4), 273–300.

Zeileis A, Leisch F, Hornik K, Kleiber C (2002). “strucchange: An R Package for Testing
for Structural Change in Linear Regression Models.” Journal of Statistical Software, 7(2),
1–38. URL http://www.jstatsoft.org/v07/i02/.

Zeileis A, Shah A, Patnaik I (2010). “Testing, Monitoring, and Dating Structural Changes in
Exchange Rate Regimes.” Computational Statistics & Data Analysis, 54(6), 1696–1706.
Affiliation:
Rebecca Killick
Department of Mathematics & Statistics
Lancaster University
LA1 4YF, United Kingdom
E-mail: r.killick@lancs.ac.uk
URL: http://www.lancs.ac.uk/~killick/
Journal of Statistical Software, http://www.jstatsoft.org/, published by the American
Statistical Association, http://www.amstat.org/. Volume 58, Issue 3.
Submitted: 2013-01-10; Accepted: 2014-02-23.
... The changepoint package by Killick and Eckley (2014) considers a variety of test statistics for detecting change-points among which the likelihood ratio. The strucchange package by Zeileis et al. (2002) provides methods for detecting changes in linear regression models. ...
Preprint
Full-text available
This paper reviews the most common situations where one or more regularity conditions which underlie classical likelihood-based parametric inference fail. We identify three main classes of problems: boundary problems, indeterminate parameter problems--which include non-identifiable parameters and singular information matrices--and change-point problems. The review focuses on the large-sample properties of the likelihood ratio statistic, though other approaches to hypothesis testing and connections to estimation may be mentioned in passing. We emphasize analytical solutions and acknowledge software implementations where available. Some summary insight about the possible tools to derivate the key results is given.
... Initially, approaches that enable the reduction of dimensional complexity in timeseries data (Ali et al., 2019) including ARIMA (Autoregressive Integrated Moving Average) (McCleary and Hay, 1980;Box et al., 2008), Discrete Wave Analysis (Percival and Walden, 2000;Abuadbba and Khalil, 2015;Aldrich, 2020) and change point analysis (Killick and Eckley, 2014) were considered. Additionally, Discrete Fourier Transforms (DFT) and Wavelet Transform methods were reviewed as these can detect periodicity, yet neither technique can reveal long-term trends or anomalies (Zhu and Guo, 2017). ...
Thesis
Full-text available
The UK high street is constantly changing and evolving in response to, for example, online sales, out-of-town developments, and economic crises. With over 10 years of hourly footfall counts from sensors across the UK, this study was an opportunity to perform a longitudinal and quantitative investigation to diagnose how these changes are reflected in the changing patterns of pedestrian activity. Footfall provides a recognised performance measure of place vitality. However, through a lack of data availability due to historic manual counting methods, few opportunities to contextualise the temporal patterns longitudinally have existed. This study therefore investigates daily, weekly, and annual footfall patterns, to diagnose the similarities and differences between places as social activity patterns from UK high streets evolve over time. Theoretically, footfall is conceptualised within the framework of Territorology and Assemblage Theory, conceptually underpinning a quantitative approach to represent the collective meso-level (street and town-centre) patterns of footfall (social) activity. To explore the data, the periodic signatures of daily, weekly, and annual footfall are extracted using STL (seasonal trend decomposition using Loess) algorithms and the outputs are then analysed using fuzzy clustering techniques. The analyses successfully identify daily, weekly, and annual periodic patterns and diagnose the varying social activity patterns for different urban place types and how places, both individually and collectively are changing. Footfall is demonstrated to be a performance measure of meso-scale changes in collective social activity. For place management, the fuzzy analysis provides an analytical tool to monitor the annual, weekly, and daily footfall signatures providing an evidence-based diagnostic of how places are changing over time. The place manager is therefore better able to identify place specific interventions that correspond to the usage patterns of visitors and adapt these interventions as behaviours change.
... To understand change points, the package "changepoint" (Killick and Eckley, 2014) was used in (R Core Team, 2022) on the existing data set. We also tested using the package "ecp" (James and Matteson, 2014), specifically, E-Divisive, to perform hierarchical divisive estimation of multiple change points on multivariate data. ...
Article
This study quantifies the air quality impact on population mortality from an actuarial perspective, considering implications to the industry through the application of findings. The study focuses on the increase in mortality from air quality changes due to extreme weather impacts. We conduct an empirical study using monthly Californian climate and mortality data from 1999 to 2019 to determine whether adding PM2.5 as a factor improves forecast excess mortality. Expected mortality is defined using the rolling five-year average of observed mortality for each county. We compared three statistical models, namely a Generalised Linear Model (GLM), a Generalised Additive Model (GAM), and an Extreme Gradient Boosting (XGB) regression model. We find including PM2.5 improves the performance of all three models and that the GAM performs the best in terms of predictive accuracy. Change points are also considered to determine whether significant events trigger changes in mortality over extended periods. Based on several identified change points, some wildfires trigger heightened excess mortality.
... The topology-based methods of pathway analysis include ROntoTools_PE (41), ROntoTools_pDIS (42) and SPIA (43). The statistics analysis methods include the changepoint package (44) and the pROC package (45). The visualizing analysis methods include the pROC package (45) and the Pathview package (34). ...
Article
Full-text available
Colon ascendens stent peritonitis (CASP) surgery induces a leakage of intestinal contents which may cause polymicrobial sepsis related to post-operative failure of remote multi-organs (including kidney, liver, lung and heart) and possible death from systemic syndromes. Mechanisms underlying such phenomena remain unclear. This article aims to elucidate the mechanisms underlying the CASP-model sepsis by analyzing real-world GEO data (GSE24327_A, B and C) generated from mice spleen 12 hours after a CASP-surgery in septic MyD88-deficient and wildtype mice, compared with untreated wildtype mice. Firstly, we identify and characterize 21 KO MyD88-associated signaling pathways, on which true key regulators (including ligands, receptors, adaptors, transducers, transcriptional factors and cytokines) are marked, which were coordinately, significantly, and differentially expressed at the systems-level, thus providing massive potential biomarkers that warrant experimental validations in the future. Secondly, we observe the full range of polymicrobial (viral, bacterial, and parasitic) sepsis triggered by the CASP-surgery by comparing the coordinated up- or down-regulations of true regulators among the experimental treatments born by the three data under study. Finally, we discuss the observed phenomena of “systemic syndrome”, “cytokine storm” and “KO MyD88 attenuation”, as well as the proposed hypothesis of “spleen-mediated immune-cell infiltration”. Together, our results provide novel insights into a better understanding of innate immune responses triggered by the CASP-model sepsis in both wildtype and MyD88-deficient mice at the systems-level in a broader vision. This may serve as a model for humans and ultimately guide formulating the research paradigms and composite strategies for the early diagnosis and prevention of sepsis.
... These periods were compared with the timings of significant changes in mean values for the SPD, based on changepoint analysis. Changepoint analysis was performed using the R package changepoint (v2.2.2: Killick and Eckley, 2014), with the binary segmentation approach (Edwards and Cavalli-Sforza, 1965) used to identify the mean values of the SPD where changes occurred. We used Superposed Epoch Analysis (SEA) to examine the charcoal zscore time series in a 750-year time window before and after each key date. ...
Article
Full-text available
The relative importance of climate change and human activities in influencing regional fire regimes during the Holocene is still a matter of debate. The introduction of agriculture during the Neolithic provides an opportunity to examine the impact of human activities on fire regimes. Here, we examine changes in fire regimes across Iberia between 10,000 and 3500 cal. BP, reconstructed using sedimentary charcoal records. We compare the regional fire history with estimates of changes in population size, reconstructed based on summed probability distributions of radiocarbon dates on archaeological material. We also compare the fire records and population reconstructions with the timing of the onset of agriculture across the region as indicated by archaeological data. For Iberia as a whole, there are two intervals of rapid population increase centred on ca. 7400 and ca. 5400 cal. BP. Periods of rapid population growth, either for the region as a whole or more locally, do not closely align with changes in charcoal accumulation. Charcoal accumulation had already begun to increase ca. 400 years prior to the onset of the Neolithic and continued to increase for ca. 750 years afterwards, indicating that changes in fire are not directly associated with the introduction of agriculture. Similarly, there is no direct relationship between changes in charcoal accumulation and later intervals of rapid population growth. There is also no significant relationship between population size and charcoal accumulation across the period of analysis. Our analyses show that the introduction of agriculture and subsequent increases in population are not directly linked with changes in fire regimes in Iberia and support the idea that changes in fire are largely driven by other factors such as climate.
... further investigate the impact of climate change on watershed processes, we conducted change-point analyses in the time series (1931-2100) of PCA score means of the first principal components (PC1s) that explained the largest variance of the variables (Anderson et al., 2009), to identify the years when PC1s exhibited or are projected to exhibit significant shifts, using the R-package of "changepoint" (Killick and Eckley, 2014). We have selected the binary segmentation ("BinSeg" in R) method. ...
Article · Full-text available
Climate change increasingly affects primary productivity and biogeochemical cycles in forest ecosystems at local and global scales. To predict change in vegetation, soil, and hydrologic processes, we applied an integrated biogeochemical model, Photosynthesis-EvapoTranspiration and BioGeoChemistry (PnET-BGC), to two high-elevation forested watersheds in the southern Appalachians in the US under representative concentration pathway (RCP) 4.5 and RCP 8.5 scenarios. We investigated seasonal variability of the changes from the current climate (1986–2015) to future climate scenarios (2071–2100) for important biogeochemical processes and states; identified change points for biogeochemical variables from 1931 to 2100 that indicate potential regime shifts; and compared the climate change impacts on a lower-elevation watershed (WS18) with those on a higher-elevation watershed (WS27) at the Coweeta Hydrologic Laboratory, North Carolina, United States. We find that gross primary productivity (GPP), net primary productivity (NPP), transpiration, nitrogen mineralization, and streamflow are projected to increase, while soil base saturation and the base cation concentration and acid-neutralizing capacity (ANC) of streamwater are projected to decrease at the annual scale, though with strong seasonal variability, under a changing climate, showing a general trend of acidification of soil and streamwater despite an increase in primary productivity. The predicted changes show distinct contrasts between lower and higher elevations. Climate change is predicted to have a larger impact on soil processes in the lower-elevation watershed and on vegetation processes in the higher-elevation watershed. We also detect five change points in the first principal component of 17 key biogeochemical variables simulated with PnET-BGC between 1931 and 2100, with the last change point projected to occur 20 years earlier under RCP 8.5 (2059 at WS18 and WS27) than under RCP 4.5 (2079 at WS18 and 2074 at WS27) at both watersheds. The change points occurred earlier at WS18 than at WS27 in the 1980s and 2010s, but in the future they are projected to occur earlier at WS27 (2074) than at WS18 (2079) under RCP 4.5, implying that changes in the biogeochemical cycles of vegetation, soil, and streams may be accelerating at the higher-elevation WS27.
Article
Effective species management often requires understanding patterns of movement and habitat use. A common approach in identifying where individuals reside relies upon chemical tracers from the environment that are incorporated into an individual's tissues. For fish, isotopes in their otoliths, specifically the portion of their otolith formed during their larval stage, have been used to identify the natal origin. Complicating this work, however, is the fact that during this life stage, there is a shift in the source of isotopes deposited onto the growing otolith from maternally to environmentally derived. The objective of this study was to identify the portion of the otolith representing this transition to environmentally derived isotopes so as to accurately investigate questions of natal origin for a threatened population of fall Chinook salmon (Oncorhynchus tshawytscha). We exposed developing larvae to four treatments that differed in terms of their water strontium isotope ratio (87Sr/86Sr) and used change‐point analysis of otolith 87Sr/86Sr and strontium to calcium ratio (Sr/Ca) to identify the otolith radius corresponding to the transition to environmentally derived isotopes. Our results indicated this transition occurred, on average, at 132 μm (87Sr/86Sr; ±50 μm standard deviation) and 127 μm (Sr/Ca; ±29 μm) from the otolith core, which corresponded to the developmental time between hatching and exogenous feeding. A substantial proportion of our otoliths (i.e., 61%) did not show convergence between otolith and water 87Sr/86Sr by the end of the 113‐day experiment, which was likely due to the dietary contribution of marine‐based feed. Therefore, we were unable to recommend an otolith radius to target for the purposes of reconstructing natal origin apart from being beyond approximately 130 μm.
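An illustrative sketch (not the authors' code) of locating such a transition as a single change in mean along a core-to-edge transect, with hypothetical radius and sr_ratio measurements:

library(changepoint)

set.seed(3)
radius   <- seq(10, 400, by = 5)                               # microns from otolith core
sr_ratio <- c(rnorm(25, 0.709, 1e-4), rnorm(54, 0.712, 1e-4))  # maternal -> ambient signal

fit <- cpt.mean(sr_ratio, method = "AMOC")  # single most likely changepoint
radius[cpts(fit)]                           # otolith radius at the transition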
Article
The challenge of efficiently identifying anomalies in data sequences is an important statistical problem that now arises in many applications. Although there has been substantial work aimed at making statistical analyses robust to outliers, or point anomalies, there has been much less work on detecting anomalous segments, or collective anomalies, particularly in those settings where point anomalies might also occur. In this article, we introduce collective and point anomalies (CAPA), a computationally efficient approach that is suitable when collective anomalies are characterized by either a change in mean, variance, or both, and distinguishes them from point anomalies. Empirical results show that CAPA has close to linear computational cost as well as being more accurate at detecting and locating collective anomalies than other approaches. We demonstrate the utility of CAPA through its ability to detect exoplanets from light curve data from the Kepler telescope and its capacity to detect machine faults from temperature data.
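A hedged sketch of CAPA as implemented in the R package anomaly (assuming its capa() interface), with one planted collective anomaly and one point anomaly:

library(anomaly)

set.seed(4)
x <- rnorm(500)
x[201:250] <- x[201:250] + 3  # collective anomaly: segment with shifted mean
x[400] <- 8                   # point anomaly

res <- capa(x, type = "meanvar")
collective_anomalies(res)  # start/end of detected anomalous segments
point_anomalies(res)       # locations of detected point anomalies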
Book · Full-text available
Book resulting from the research project entitled IMPACTOS DAS MUDANÇAS CLIMÁTICAS EM EXTREMOS HIDROLÓGICOS (SECAS E CHEIAS) (Impacts of Climate Change on Hydrological Extremes: Droughts and Floods), funded by CAPES and the Agência Nacional de Águas e Saneamento Básico through the call Mudanças do Clima e Recursos Hídricos no. 19/2015. Participating institutions: the graduate programs in Civil Engineering (Water Resources) at the Universidade Federal do Ceará; in Environmental Technology and Water Resources at the Universidade de Brasília; and in Civil and Environmental Engineering, in Management and Regulation of Water Resources, and in Agro-industrial Systems at the Universidade Federal de Campina Grande. Editors: Francisco de Assis de Souza Filho (UFC), Carlos de Oliveira Galvão (UFCG), Dirceu Silveira Reis Junior (UnB). Reviewers: Daniel Antônio Camelo Cid (UFC), Maycon Breno Macena da Silva (UFCG). Overview: Water resources are one of the most important dimensions of climate change. The impacts of climate change on hydrological extremes (droughts and floods) may substantially increase the vulnerability of human populations and hinder social development. Assessing the risk of increased frequency and severity of these events is a necessary first step towards adaptation strategies that make society more resilient to climate variability and change. To analyze this process and propose mitigation measures, the Universidade Federal do Ceará (UFC), the Universidade de Brasília (UnB) and the Universidade Federal de Campina Grande (UFCG) formed a collaboration network with other international institutions and submitted a proposal to the CAPES-ANA call Mudanças do Clima e Recursos Hídricos no. 19/2015. The proposal, entitled "Impactos das Mudanças Climáticas em Extremos Hidrológicos (Secas e Cheias)", received funding, and the research results it supported constitute the chapters of this book. The UFCG, UFC and UnB research groups had collaborated before this project, notably within the Rede Brasileira de Pesquisas sobre Mudanças Climáticas Globais (REDE CLIMA), and the activities developed in this project can be seen in the context of that network. The research was carried out in the graduate programs listed above, and undergraduate students were also involved in the project. Several researchers on the project held scholarships funded by CNPq, CAPES and the Fundação Cearense de Apoio ao Desenvolvimento Científico e Tecnológico (FUNCAP), to whom we are grateful. The book is divided into three parts: (i) climate models and change detection; (ii) impacts of climate change; and (iii) strategies for adaptation to climate change.
Article · Full-text available
This paper introduces ideas and methods for testing for structural change in linear regression models and shows how these have been realized in an R package called strucchange. It features tests from the generalized fluctuation test framework as well as from the F test (Chow test) framework. Extending standard significance tests, it contains methods to fit, plot and test empirical fluctuation processes (such as CUSUM, MOSUM and estimates-based processes) on the one hand, and to compute, plot and test sequences of F statistics with the supF, aveF and expF tests on the other. Thus, it makes powerful tools available to display information about structural changes in regression relationships and to assess their significance. Furthermore, it describes how incoming data can be monitored online.
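A brief sketch of the two frameworks mentioned, as a plausible usage of standard strucchange calls, on a series with a single level shift:

library(strucchange)

set.seed(5)
d <- data.frame(y = c(rnorm(100, 0), rnorm(100, 1.5)))

ocus <- efp(y ~ 1, data = d, type = "OLS-CUSUM")  # empirical fluctuation process
sctest(ocus)                                      # generalized fluctuation test

fs <- Fstats(y ~ 1, data = d)  # sequence of F statistics
sctest(fs, type = "supF")      # supF test for a structural change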
Article · Full-text available
Many time series are characterised by abrupt changes in structure, such as sudden jumps in level or volatility. We consider changepoints to be those time points which divide a dataset into distinct homogeneous segments. In practice the number of changepoints will not be known. The ability to detect changepoints is important for both methodological and practical reasons, including: the validation of an untested scientific hypothesis [27]; monitoring and assessment of safety critical processes [14]; and the validation of modelling assumptions [21]. The development of inference methods for changepoint problems is by no means a recent phenomenon, with early works including [39], [45] and [28]. Increasingly, the ability to detect changepoints quickly and accurately is of interest to a wide range of disciplines. Recent examples of application areas include numerous bioinformatic applications [37, 15], the detection of malware within software [51], network traffic analysis [35], finance [46], climatology [32] and oceanography [34]. In this chapter we describe and compare a number of different approaches for estimating changepoints. For a more general overview of changepoint methods, we refer interested readers to [8] and [11]. The structure of this chapter is as follows. First we introduce the model we focus on. We then describe methods for detecting a single changepoint and methods for detecting multiple changepoints, covering both frequentist and Bayesian approaches.
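As a concrete sketch of the single- versus multiple-changepoint distinction drawn above, using the changepoint package on simulated data:

library(changepoint)

set.seed(6)
x <- c(rnorm(100, 0), rnorm(100, 3), rnorm(100, 1))

cpts(cpt.mean(x, method = "AMOC"))  # at most one changepoint
cpts(cpt.mean(x, method = "PELT"))  # optimal multiple-changepoint segmentation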
Article · Full-text available
The change point model framework introduced in Hawkins, Qiu, and Kang (2003) and Hawkins and Zamba (2005a) provides an effective and computationally efficient method for detecting multiple mean or variance change points in sequences of Gaussian random variables, when no prior information is available regarding the parameters of the distribution in the various segments. It has since been extended in various ways by Hawkins and Deng (2010), Ross, Tasoulis, and Adams (2011), Ross and Adams (2012) to allow for fully nonparametric change detection in non-Gaussian sequences, when no knowledge is available regarding even the distributional form of the sequence. Another extension comes from Ross and Adams (2011) and Ross (2014) which allows change detection in streams of Bernoulli and Exponential random variables respectively, again when the values of the parameters are unknown. This paper describes the R package cpm, which provides a fast implementation of all the above change point models in both batch (Phase I) and sequential (Phase II) settings, where the sequences may contain either a single or multiple change points.
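A hedged sketch of the batch and sequential modes described, assuming the cpm package's detectChangePointBatch() and processStream() interfaces:

library(cpm)

set.seed(7)
x <- c(rnorm(150, 0), rnorm(150, 1))

# Phase I: retrospective detection in a fixed batch
detectChangePointBatch(x, cpmType = "Student")$changePoint

# Phase II: sequential monitoring with in-control average run length ARL0
res <- processStream(x, cpmType = "Student", ARL0 = 500, startup = 20)
res$changePoints  # changepoint locations estimated as the stream is processed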
Article
This article explores testing for and locating multiple variance changepoints in a sequence of independent Gaussian random variables (assuming a known and common mean). This type of problem is very common in applied economics and finance. A binary procedure combined with the Schwarz information criterion (SIC) is used to search for all of the possible variance changepoints in the sequence. The simulated power of the proposed procedure is compared to that of the CUSUM procedure used by Inclán and Tiao to cope with variance changepoints. The SIC and unbiased SIC for this problem are derived. To obtain the percentage points of the SIC criterion, the asymptotic null distribution of a function of the SIC is obtained, and approximate percentage points of the SIC are then tabulated. Finally, the results are applied to weekly stock price data. The unknown but common mean case is also outlined at the end.
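An illustrative analogue of this procedure (a sketch, not the article's own code) using the changepoint package: binary segmentation with the SIC penalty for variance changes in a zero-mean Gaussian sequence.

library(changepoint)

set.seed(8)
x <- c(rnorm(200, 0, 1), rnorm(200, 0, 3), rnorm(200, 0, 1.5))

fit <- cpt.var(x, penalty = "SIC", method = "BinSeg", Q = 5, know.mean = TRUE, mu = 0)
cpts(fit)       # estimated variance changepoints
param.est(fit)  # estimated segment variances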
Article
It is sometimes useful in an analysis of variance to split the treatments into reasonably homogeneous groups. Multiple comparison procedures are often used for this purpose, but a more direct method is to use the techniques of cluster analysis. This approach is illustrated for several sets of data, and a likelihood ratio test is developed for judging the significance of differences among the resulting groups.
Article
A method for investigating the relationships of points in multi-dimensional space is described. Using an analysis of variance technique, the points are divided into the two most-compact clusters, and the process repeated sequentially so that a "tree" diagram is formed. The application of the method to problems of classification is particularly stressed, and numerical examples are given.
Article
In this article a technique for detecting a shift in the scale parameter of a sequence of independent gamma random variables is discussed. Distribution theory and related properties of the test statistic are investigated. Numerical critical points and test powers are tabulated for two specific variables. Other useful techniques are also summarized. The methods are then applied to the analysis of stock-market returns and air traffic flows. These two examples are studied in detail to illustrate the use of the proposed method compared with other available techniques. The empirical examples also highlight the importance of treating stochastic instability in statistical applications.
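A possible modern analogue of this technique (a sketch under stated assumptions, not the article's method) uses the changepoint package, whose gamma test statistic assumes a known, fixed shape parameter and detects a change in scale:

library(changepoint)

set.seed(9)
x <- c(rgamma(150, shape = 2, scale = 1), rgamma(150, shape = 2, scale = 3))

fit <- cpt.meanvar(x, test.stat = "Gamma", method = "AMOC", shape = 2)
cpts(fit)  # estimated location of the scale shift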