Conference Paper

Efficient Algorithms for Sequence Segmentation

Abstract

The sequence segmentation problem asks for a partition of the sequence into k non-overlapping segments that cover all data points such that each segment is as homogeneous as possible. This problem can be solved optimally using dynamic programming in O(n^2 k) time, where n is the length of the sequence. Since sequences in practice are often very long, a quadratic algorithm is not an adequately fast solution. Here, we present an alternative constant-factor approximation algorithm with running time O(n^{4/3} k^{5/3}). We call this algorithm the DNS algorithm. We also consider the recursive application of the DNS algorithm, which results in a faster algorithm (O(n log log n) running time) with an O(log n) approximation factor, and study the accuracy/efficiency tradeoff. Extensive experimental results show that these algorithms outperform other widely used heuristics. The same algorithms can speed up solutions for other variants of the basic segmentation problem while keeping their approximation factors constant. Our techniques can also be used in a streaming setting, with sublinear memory requirements.
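For concreteness, here is a minimal sketch of the classic O(n^2 k) dynamic program the abstract refers to, using squared distance to the segment mean as the homogeneity measure; the function names and the error measure are illustrative choices, not taken from the paper.

```python
import numpy as np

def optimal_segmentation(x, k):
    """Optimal k-segmentation of x under squared error,
    via the classic O(n^2 k) dynamic program."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Prefix sums give the squared error of any segment in O(1):
    # err(i, j) = sum(x[i:j]^2) - (sum(x[i:j]))^2 / (j - i)
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def err(i, j):  # cost of segment x[i:j], 0 <= i < j <= n
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    # E[p][j] = best error of splitting the prefix x[:j] into p segments
    E = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    E[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                c = E[p - 1][i] + err(i, j)
                if c < E[p][j]:
                    E[p][j], back[p][j] = c, i
    # Recover the segment boundaries by walking the back-pointers.
    bounds, j = [], n
    for p in range(k, 0, -1):
        bounds.append(j)
        j = back[p][j]
    return E[k][n], sorted(bounds)

print(optimal_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3))
```

Roughly, the paper's divide-and-segment idea speeds up exactly this recurrence by first segmenting chunks of the input and then segmenting a weighted sequence of chunk representatives.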
... The ego-network sequence segmentation problem is conceptually similar to the well-known problem of time-series segmentation [2], [3] but introduces two challenges: (I) how to represent a sequence of graphs with a summary; and (II) how to construct an optimal summary efficiently. These challenges do not arise in time-series segmentation, as simple statistics such as the median or mean can serve as optimal representatives with respect to popular error measures [2]. ...
... The ego-network sequence segmentation problem is conceptually similar to the well-known problem of time-series segmentation [2], [3] but introduces two challenges: (I) how to represent a sequence of graphs with a summary; and (II) how to construct an optimal summary efficiently. These challenges do not arise in time-series segmentation, as simple statistics such as the median or mean can serve as optimal representatives with respect to popular error measures [2]. The problem is also related to anomaly (a.k.a. ...
... 3. We build on the above results to design two segmentation algorithms: one exact based on dynamic programming [2], [22]; and another based on a top-down heuristic [2]. Both can use any of our algorithms for JM or WJM as a subroutine to construct the summaries. ...
Article
Full-text available
An ego-network is a graph representing the interactions of a node ( ego ) with its neighbors and the interactions among those neighbors. A sequence of ego-networks having the same ego can thus model the evolution of these interactions over time. We introduce the problem of segmenting a sequence of ego-networks into k segments, for any given integer k . Each segment is represented by a summary network, and the goal is to minimize the total loss of representing k segments by k summaries. The problem allows partitioning the sequence into homogeneous segments with respect to the activities or properties of the ego (e.g., to identify time periods when a user acquired different circles of friends in a social network) and to compactly represent each segment with a summary. The main challenge is to construct a summary that represents a collection of ego-networks with minimum loss. To address this challenge, we employ Jaccard Median (JM), a well-known NP-hard problem for summarizing sets, for which, however, no effective and efficient algorithms are known. We develop a series of algorithms for JM offering different effectiveness/efficiency trade-offs: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact and approximation polynomial-time algorithms for minimizing an upper bound of the objective function of JM; and (III) efficient heuristics for JM, which are based on an effective scoring scheme and one of them also on sketching. We also study a generalization of the segmentation problem, in which there may be multiple edges between a pair of nodes in an ego-network. To tackle this problem, we develop a series of algorithms, based on a more general problem than JM, called Weighted Jaccard Median WJM: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming; (II) exact algorithms for minimizing an upper bound of the objective function of WJM; and (III) efficient heuristics, based on the percentiles of edge multiplicities and one of them also on divide-and-conquer. By building upon the above results, we design algorithms for segmenting a sequence of ego-networks. Experiments with 10 real datasets and with synthetic datasets show that our algorithms produce optimal or near-optimal solutions to JM or to WJM, and that they substantially outperform state-of-the-art methods which can be employed for ego-network segmentation.
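To make the Jaccard Median objective concrete, a simple frequency-threshold baseline is sketched below: score each element by how many input sets contain it, and try each frequency threshold as a candidate summary. This is a generic heuristic for illustration only (function names are ours), not one of the paper's algorithms.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard distance between two sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def jm_threshold_heuristic(sets):
    """Candidate summaries are 'elements occurring in >= t sets';
    return the threshold set minimizing total Jaccard distance.
    A simple baseline for the NP-hard Jaccard Median problem."""
    freq = Counter(e for s in sets for e in s)
    best, best_cost = set(), float('inf')
    for t in range(1, len(sets) + 1):
        cand = {e for e, c in freq.items() if c >= t}
        cost = sum(jaccard(cand, s) for s in sets)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

print(jm_threshold_heuristic([{1, 2, 3}, {2, 3, 4}, {2, 3}]))
```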
... Roughly, we shift from a local error view to a global one. Among the many algorithms proposed for sequence segmentation, we have adapted and evaluated the dynamic programming algorithm described by Terzi and Tsaparas (2006), which provides exact solutions and the top-down greedy algorithm that provides approximate solutions. ...
... Sequence segmentation (Terzi and Tsaparas 2006) aims at efficiently summarizing a long sequence by a few key representatives. A typical summary consists of computing a piecewise-constant approximation of the sequence. ...
... When processing large sequences, the quadratic complexity with respect to the sequence length makes this approach impractical. To this end, heuristic approaches have been developed to solve the segmentation problem (Terzi and Tsaparas 2006). ...
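A hedged sketch of a top-down greedy segmentation of the kind these snippets mention: repeatedly insert the single boundary that most reduces total squared error. The error measure and stopping rule here are illustrative assumptions, not the exact procedure of either paper.

```python
import numpy as np

def seg_err(x, i, j):
    """Squared error of fitting x[i:j] by its mean."""
    seg = x[i:j]
    return float(np.sum((seg - seg.mean()) ** 2)) if j > i else 0.0

def top_down_segmentation(x, k):
    """Greedy top-down k-segmentation: k passes, each trying
    every split position inside every current segment."""
    x = np.asarray(x, dtype=float)
    bounds = [0, len(x)]  # segment borders, kept sorted
    while len(bounds) - 1 < k:
        best = None  # (error_delta, new_boundary)
        for a, b in zip(bounds, bounds[1:]):
            base = seg_err(x, a, b)
            for m in range(a + 1, b):
                delta = seg_err(x, a, m) + seg_err(x, m, b) - base
                if best is None or delta < best[0]:
                    best = (delta, m)
        if best is None:  # no segment can be split further
            break
        bounds = sorted(bounds + [best[1]])
    return bounds

print(top_down_segmentation([1, 1, 5, 5, 9, 9], 3))
```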
Article
Full-text available
This paper proposes the sky-signature model, an extension of the signature model of Gautrais et al. (in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Springer, 2017b) to multi-objective optimization. The signature approach considers a sequence of itemsets and, given a number k, returns a segmentation of the sequence into k segments such that the number of items occurring in all segments is maximized. The limitation of this approach is that it requires k to be set manually, which fixes the temporal granularity at which the data is analyzed. The sky-signature model proposed in this paper removes this requirement and allows the results to be examined at multiple levels of granularity, while keeping a compact output. This paper also proposes efficient algorithms to mine sky-signatures, as well as an experimental validation on real data, both from the retail domain and from natural language processing (political speeches).
... The ego-network sequence segmentation problem is conceptually similar to the well-known problem of time-series segmentation [2], [3] but introduces two challenges: (I) how to represent a sequence of graphs with a summary; and (II) how to construct an optimal summary efficiently. These challenges do not arise in time-series segmentation, as statistics such as the median or mean can serve as optimal representatives with respect to popular error measures [2]. ...
... The ego-network sequence segmentation problem is conceptually similar to the well-known problem of time-series segmentation [2], [3] but introduces two challenges: (I) how to represent a sequence of graphs with a summary; and (II) how to construct an optimal summary efficiently. These challenges do not arise in time-series segmentation, as statistics such as the median or mean can serve as optimal representatives with respect to popular error measures [2]. The problem is also related to anomaly detection [4], [5], [6]. ...
... 3. We build on the above results to design two segmentation algorithms: one exact based on dynamic programming [12], [2]; and another based on a top-down heuristic [2]. Both algorithms can use any of our algorithms for the JM problem as a subroutine to construct the summaries. ...
Conference Paper
Full-text available
An ego-network is a graph representing the interactions of a node (ego) with its neighbors and the interactions among those neighbors. A sequence of ego-networks having the same ego can thus model the evolution of these interactions over time. We introduce the problem of segmenting a sequence of ego-networks into k segments, for any given integer k. Each segment is represented by a summary network, and the goal is to minimize the total loss of representing k segments by k summaries. The problem allows partitioning the sequence into homogeneous segments with respect to the activities or properties of the ego (e.g., to identify time periods when a user acquired different circles of friends in a social network) and to compactly represent each segment with a summary. The main challenge is to construct a summary that represents a collection of ego-networks with minimum loss. To address this challenge, we employ Jaccard Median (JM), a well-known NP-hard problem for summarizing sets, for which, however, no effective and efficient algorithms are known. We develop a series of algorithms for JM offering different effectiveness/efficiency trade-offs: (I) an exact exponential-time algorithm, based on Mixed Integer Linear Programming and (II) exact and approximation polynomial-time algorithms for minimizing an upper bound of the objective function of JM. By building upon these results, we design two algorithms for segmenting a sequence of ego-networks that are effective, as shown experimentally.
... MSE compression is also known as adaptive piecewise constant approximation [9] and as segmentation problem [38]. It can be solved exactly via dynamic programming in O(n^2 m) time [6]. ...
... To reduce the computational complexity, several heuristics and approximations for MSE compression have been proposed [9,25,38,43]. ADA compression can also be regarded as a heuristic for MSE compression, since it greedily averages two consecutive elements. ...
Article
Full-text available
Computing a sample mean of time series under dynamic time warping is NP-hard. Consequently, there is an ongoing research effort to devise efficient heuristics. The majority of heuristics have been developed for the constrained sample mean problem that assumes a solution of predefined length. In contrast, research on the unconstrained sample mean problem is underdeveloped. In this article, we propose a generic average-compress (AC) algorithm to address the unconstrained problem. The algorithm alternates between averaging (A-step) and compression (C-step). The A-step takes an initial guess as input and returns an approximation of a sample mean. Then the C-step reduces the length of the approximate solution. The compressed approximation serves as initial guess of the A-step in the next iteration. The purpose of the C-step is to direct the algorithm to more promising solutions of shorter length. The proposed algorithm is generic in the sense that any averaging and any compression method can be used. Experimental results show that the AC algorithm substantially outperforms current state-of-the-art algorithms for time series averaging.
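A toy mock-up of the average-compress alternation may help: below, a plain resample-and-mean A-step and an ADA-style pairwise-merge C-step stand in for the DTW-based components the paper actually uses. Everything here is illustrative, not the authors' implementation.

```python
import numpy as np

def a_step(series, init):
    """Toy A-step: approximate a sample mean of length len(init) by
    linearly resampling every series to that length and averaging.
    The real AC algorithm averages under dynamic time warping."""
    m = len(init)
    grid = np.linspace(0, 1, m)
    resampled = [np.interp(grid, np.linspace(0, 1, len(s)), s) for s in series]
    return np.mean(resampled, axis=0)

def c_step(mean):
    """Toy C-step: shorten the candidate mean by averaging the two
    most similar consecutive elements (ADA-style compression)."""
    d = np.abs(np.diff(mean))
    i = int(np.argmin(d))
    return np.concatenate([mean[:i], [(mean[i] + mean[i + 1]) / 2], mean[i + 2:]])

def average_compress(series, min_len=2, iters=5):
    """Alternate A- and C-steps, letting the compressed result seed
    the next averaging round, as in the generic AC scheme."""
    guess = np.asarray(series[0], dtype=float)
    best = a_step(series, guess)
    while len(best) > min_len and iters > 0:
        best = a_step(series, c_step(best))
        iters -= 1
    return best

series = [np.array([1., 1., 2., 3.]), np.array([1., 2., 2., 3., 3.])]
print(average_compress(series))
```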
... The problem can be approached as a sequence segmentation task, considering the sequence of pages within a large PDF file. Previous studies (Terzi and Tsaparas, 2006;Wei et al., 2018) have examined methods for sequence segmentation. Text segmentation, in particular, involves dividing text into different parts based on topics or context. ...
Article
Purpose
We demonstrate the practical application of machine learning (ML) techniques in document processing, addressing the increasing need for digitalization in the real estate industry and beyond. Our focus lies on identifying efficient algorithms for extracting individual documents from multi-page PDF files. Through the implementation of these algorithms, organizations can accelerate the digitization of paper-based files on a large scale, eliminating the laborious process of one-by-one scanning. Additionally, we showcase ML-powered methods for automating the classification of both digital and digitized documents, thereby simplifying the categorization process.

Design/methodology/approach
We compare two segmentation models that are presented in this paper to analyze the individual pages within a bulk scan, identifying the starting and ending points of each document contained in the PDF. This process involves extracting relevant features from both the textual content and page design elements, such as fonts, layouts and existing page numbers. By leveraging these features, the algorithm accurately splits multi-document PDFs into their respective components. An outlook is provided with a classification code that effectively categorizes the segmented documents into different real estate document classes.

Findings
The case study provides an overview of different ML methods employed in the development of these models while also evaluating their performance across various conditions. As a result, it offers insight into solutions and lessons learned for processing documents in real estate on a case-by-case basis. The findings presented in this study lay the groundwork for addressing this prevalent problem. The methods, for which we provide the code as open source, establish a solid foundation for expediting real estate document processing, enabling a seamless transition from scanning or inbox management to digital storage, ultimately facilitating machine-based information extraction.

Practical implications
The process of digitally managing documents in the real estate industry can be a daunting task, particularly due to the substantial volume of documents involved, whether they are paper-based, digitized or in digital formats. Our approach aims to streamline this often tedious and time-consuming process by offering two models as simplified solutions that encourage companies to embrace much-needed digitization. The methods we present in this context are crucial for digitizing all facets of real estate management, offering significant potential in advancing PropTech business cases. The open-source codes can be trained further by researchers and practitioners with access to large volumes of documents.

Originality/value
This study illustrates effective methods for processing paper-based, digitized and digital files, along with tailored ML models designed to enhance these methods, particularly within the real estate sector. The methods are showcased on two datasets, and lessons learned are discussed.
... These approaches, although fast, are heuristics and have no theoretical guarantees of the approximation quality. A divide-and-segment approach, an approximation algorithm with theoretical guarantees on the approximation quality was given by Terzi and Tsaparas (2006). ...
Preprint
Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer K, and some measure of homogeneity, the task is to split the sequence into K contiguous segments that are maximally homogeneous. A classic approach to find the optimal solution is by using a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence. This makes the algorithm slow for a sequence of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for one-dimensional log-linear models, and by doing so reduce the computational time. We demonstrate empirically that this approach can significantly reduce the computational burden of finding the optimal segmentation.
... For time series, a concept similar to simplification called segmentation has been extensively studied in the area of data mining [11,40,69]. The standard approach for computing exact segmentations is to use dynamic programming which yields a running time of O(n^2). ...
Preprint
The Fréchet distance is a popular distance measure for curves. We study the problem of clustering time series under the Fréchet distance. In particular, we give (1+ε)-approximation algorithms for variations of the following problem with parameters k and ℓ. Given n univariate time series P, each of complexity at most m, we find k time series, not necessarily from P, which we call cluster centers and which each have complexity at most ℓ, such that (a) the maximum distance of an element of P to its nearest cluster center or (b) the sum of these distances is minimized. Our algorithms have running time near-linear in the input size for constant ε, k and ℓ. To the best of our knowledge, our algorithms are the first clustering algorithms for the Fréchet distance which achieve an approximation factor of (1+ε) or better. Keywords: time series, longitudinal data, functional data, clustering, Fréchet distance, dynamic time warping, approximation algorithms.
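For readers unfamiliar with the measure, here is a sketch of the standard dynamic program for the discrete Fréchet distance between two univariate time series; the paper itself targets the continuous Fréchet distance, which is more involved.

```python
from functools import lru_cache

def discrete_frechet(p, q):
    """Discrete Fréchet distance between point sequences p and q:
    the smallest possible maximum step cost over all monotone
    couplings of the two sequences."""
    @lru_cache(maxsize=None)
    def d(i, j):
        cost = abs(p[i] - q[j])
        if i == 0 and j == 0:
            return cost
        if i == 0:
            return max(d(0, j - 1), cost)
        if j == 0:
            return max(d(i - 1, 0), cost)
        # Advance in p, in q, or in both, whichever is cheapest.
        return max(min(d(i - 1, j), d(i - 1, j - 1), d(i, j - 1)), cost)
    return d(len(p) - 1, len(q) - 1)

print(discrete_frechet((0.0, 1.0, 2.0), (0.0, 2.0)))
```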
... These approaches, although fast, are heuristics and have no theoretical guarantees of the approximation quality. A divide-and-segment approach, an approximation algorithm with theoretical guarantees on the approximation quality was given by Terzi and Tsaparas (2006). ...
Preprint
Discovering the underlying structure of a given graph is one of the fundamental goals in graph mining. Given a graph, we can often order vertices in a way that neighboring vertices have a higher probability of being connected to each other. This implies that the edges form a band around the diagonal in the adjacency matrix. Such structure may rise for example if the graph was created over time: each vertex had an active time interval during which the vertex was connected with other active vertices. The goal of this paper is to model this phenomenon. To this end, we formulate an optimization problem: given a graph and an integer K, we want to order graph vertices and partition the ordered adjacency matrix into K bands such that bands closer to the diagonal are more dense. We measure the goodness of a segmentation using the log-likelihood of a log-linear model, a flexible family of distributions containing many standard distributions. We divide the problem into two subproblems: finding the order and finding the bands. We show that discovering bands can be done in polynomial time with isotonic regression, and we also introduce a heuristic iterative approach. For discovering the order we use Fiedler order accompanied with a simple combinatorial refinement. We demonstrate empirically that our heuristic works well in practice.
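The isotonic-regression step mentioned above can be illustrated with the classic pool-adjacent-violators algorithm (PAVA); this generic sketch is not the paper's exact formulation.

```python
def pava(y, w=None):
    """Pool Adjacent Violators: least-squares fit of a
    non-decreasing sequence to y, optionally weighted."""
    w = [1.0] * len(y) if w is None else list(w)
    # Each block holds [weighted mean, total weight, count].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return out

print(pava([1, 3, 2, 4, 3, 5]))  # -> [1, 2.5, 2.5, 3.5, 3.5, 5]
```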
... If the global objective is a sum of individual segment costs, then the problem can be solved with a classic dynamic program approach [6] in O(n^2 k) time. As this may be too slow, speed-up techniques yielding approximation guarantees have been proposed [11,20,21]. If the cost function is based on one-parameter log-linear models, it is possible to speed up the segmentation problem significantly in practice [19], even though the worst-case running time remains O(n^2 k). ...
Preprint
Full-text available
Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single Bernoulli variable, while the second model divides the stored data into two segments and models each segment with its own Bernoulli variable. Finding the optimal split can be done in O(n) time, where n is the number of entries since the last change point. This is too expensive for large n. To combat this we propose an approximation scheme that yields a (1 − ε) approximation in O(ε^{-1} log^2 n) time. The speed-up consists of several steps: first, we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed Bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size O(ε^{-1} log n) for the model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.
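A sketch of the exact O(n) scan that the abstract takes as its starting point: evaluate, for every split, the two-segment Bernoulli log-likelihood against the single-model baseline. The paper's contribution is approximating this scan in polylogarithmic time, which this sketch does not attempt.

```python
import math

def bern_ll(ones, n):
    """Maximized Bernoulli log-likelihood of n bits with 'ones' ones."""
    if n == 0 or ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)

def best_split(bits):
    """Return (split index, log-likelihood ratio) of the best
    two-segment model versus a single Bernoulli model, in O(n)."""
    n = len(bits)
    total = sum(bits)
    base = bern_ll(total, n)
    best_i, best_lr, left = 1, -math.inf, 0
    for i in range(1, n):
        left += bits[i - 1]  # running count of ones in bits[:i]
        lr = bern_ll(left, i) + bern_ll(total - left, n - i) - base
        if lr > best_lr:
            best_i, best_lr = i, lr
    return best_i, best_lr

print(best_split([0, 0, 0, 1, 0, 1, 1, 1]))
```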
Article
Dynamic programming (DP) is a broadly applicable algorithmic design paradigm for the efficient, exact solution of otherwise intractable, combinatorial problems. However, the design of such algorithms is often presented informally in an ad-hoc manner. It is sometimes difficult to justify the correctness of these DP algorithms. To address this issue, this paper presents a rigorous algebraic formalism for systematically deriving DP algorithms, based on semiring polymorphism. We start with a specification, construct a (brute-force) algorithm to compute the required solution which is self-evidently correct because it exhaustively generates and evaluates all possible solutions meeting the specification. We then derive, primarily through the use of shortcut fusion, an implementation of this algorithm which is both efficient and correct. We also demonstrate how, with the use of semiring lifting, the specification can be augmented with combinatorial constraints and through semiring lifting, show how these constraints can also be fused with the derived algorithm. This paper furthermore demonstrates how existing DP algorithms for a given combinatorial problem can be abstracted from their original context and re-purposed to solve other combinatorial problems. This approach can be applied to the full scope of combinatorial problems expressible in terms of semirings. This includes, for example: optimization, optimal probability and Viterbi decoding, probabilistic marginalization, logical inference, fuzzy sets, differentiable softmax, and relational and provenance queries. The approach, building on many ideas from the existing literature on constructive algorithmics, exploits generic properties of (semiring) polymorphic functions, tupling and formal sums (lifting), and algebraic simplifications arising from constraint algebras. We demonstrate the effectiveness of this formalism for some example applications arising in signal processing, bioinformatics and reliability engineering. Python software implementing these algorithms can be downloaded from: http://www.maxlittle.net/software/dppolyalg.zip.
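The semiring idea can be made concrete with a small example: one generic trellis recursion whose plus/times operations are swapped to obtain either the total probability of a sequence of observations or a Viterbi-style best-path score. This toy is far simpler than the paper's derivation framework and is only meant to show the polymorphism.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Semiring:
    plus: Callable   # combines alternative sub-solutions
    times: Callable  # combines parts of one solution
    zero: float      # identity element of plus

def fold(sr, xs):
    acc = sr.zero
    for x in xs:
        acc = sr.plus(acc, x)
    return acc

def chain_dp(init, trans, emit, sr):
    """Generic forward recursion over a trellis.  With (+, *) it sums
    probabilities over all state paths; with (max, *) it scores the
    single best path (Viterbi), using the very same code."""
    states = range(len(init))
    alpha = [sr.times(init[s], emit[0][s]) for s in states]
    for t in range(1, len(emit)):
        alpha = [sr.times(fold(sr, [sr.times(alpha[u], trans[u][s])
                                    for u in states]), emit[t][s])
                 for s in states]
    return fold(sr, alpha)

init = [0.5, 0.5]                             # initial state distribution
trans = [[0.9, 0.1], [0.2, 0.8]]              # transition probabilities
emit = [[0.5, 0.5], [0.7, 0.3], [0.1, 0.9]]   # per-step emission scores

sum_product = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0)
max_product = Semiring(max, lambda a, b: a * b, 0.0)
print(chain_dp(init, trans, emit, sum_product))  # total probability
print(chain_dp(init, trans, emit, max_product))  # best path score
```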
Conference Paper
Full-text available
In this paper, we analyze local search heuristics for the k-median and facility location problems. We define the locality gap of a local search procedure as the maximum ratio of a locally optimum solution (obtained using this procedure) to the global optimum. For k-median, we show that local search with swaps has a locality gap of exactly 5. When we permit p facilities to be swapped simultaneously, the locality gap of the local search procedure is exactly 3+2/p. This is the first analysis of local search for k-median that provides a bounded performance guarantee with only k medians. This also improves the previously known 4-approximation for this problem. For uncapacitated facility location, we show that local search, which permits adding, dropping and swapping a facility, has a locality gap of exactly 3. This improves the 5 bound of Korupolu et al. We also consider a capacitated facility location problem where each facility has a capacity and we are allowed to open multiple copies of a facility. For this problem we introduce a new operation which opens one or more copies of a facility and drops zero or more facilities. We prove that local search which permits this new operation has a locality gap between 3 and 4. We also consider instances where it is not necessary to satisfy every demand. Our algorithms provide the optimum total profit, while stretching the definition of locality by a constant and violating the required demands by a constant. We prove that without this stretch, the problem becomes NP-hard to approximate.
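A sketch of the single-swap local search analyzed above, on one-dimensional points for simplicity (the analysis holds in general metrics); the procedure is shown without any of the locality-gap machinery.

```python
def kmedian_cost(points, medians):
    """Sum of distances from each point to its nearest median."""
    return sum(min(abs(p - m) for m in medians) for p in points)

def local_search_kmedian(points, k):
    """Single-swap local search for k-median: start from an arbitrary
    set of k medians and keep swapping while some swap improves the
    cost.  Per the paper, such local optima are within a factor 5
    of the global optimum (locality gap of single swaps)."""
    medians = set(points[:k])  # assumes the first k points are distinct
    improved = True
    while improved:
        improved = False
        cost = kmedian_cost(points, medians)
        for out in list(medians):
            for inn in points:
                if inn in medians:
                    continue
                cand = (medians - {out}) | {inn}
                c = kmedian_cost(points, cand)
                if c < cost:
                    medians, cost, improved = cand, c, True
    return medians

print(local_search_kmedian([1, 2, 3, 10, 11, 12, 50], 2))
```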
Conference Paper
Full-text available
We study the problem of segmenting a sequence into k pieces so that the resulting segmentation satisfies monotonicity or unimodality constraints. Unimodal functions can be used to model phenomena in which a measured variable first increases to a certain level and then decreases. We combine a well-known unimodal regression algorithm with a simple dynamic-programming approach to obtain an optimal quadratic-time algorithm for the problem of unimodal k-segmentation. In addition, we describe a more efficient greedy-merging heuristic that is experimentally shown to give solutions very close to the optimal. As a concrete application of our algorithms, we describe two methods for testing if a sequence behaves unimodally or not. Our experimental evaluation shows that our algorithms and the proposed unimodality tests give very intuitive results.
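The greedy-merging idea follows the usual bottom-up pattern, sketched here for plain k-segmentation under squared error; the unimodality constraint of the paper is omitted for brevity.

```python
def greedy_merge_segmentation(x, k):
    """Bottom-up segmentation: start with n unit segments and
    repeatedly merge the adjacent pair whose merge costs least."""
    # Each segment is (start, end, sum, sum_of_squares).
    segs = [(i, i + 1, float(v), float(v) ** 2) for i, v in enumerate(x)]

    def err(s):  # squared error of representing a segment by its mean
        a, b, s1, s2 = s
        return s2 - s1 * s1 / (b - a)

    def merged(s, t):
        return (s[0], t[1], s[2] + t[2], s[3] + t[3])

    while len(segs) > k:
        # Find the adjacent pair with the smallest error increase.
        i = min(range(len(segs) - 1),
                key=lambda j: err(merged(segs[j], segs[j + 1]))
                              - err(segs[j]) - err(segs[j + 1]))
        segs[i:i + 2] = [merged(segs[i], segs[i + 1])]
    return [(a, b) for a, b, _, _ in segs]

print(greedy_merge_segmentation([1, 1, 5, 5, 9, 9], 3))
```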
Article
Full-text available
A segmentation algorithm based on the Jensen-Shannon entropic divergence is used to decompose long-range correlated DNA sequences into statistically significant, compositionally homogeneous patches. By adequately setting the significance level for segmenting the sequence, the underlying power-law distribution of patch lengths can be revealed. Some of the identified DNA domains were uncorrelated, but most of them continued to display long-range correlations even after several steps of recursive segmentation, thus indicating a complex multi-length-scaled structure for the sequence. On the other hand, by separately shuffling each segment, or by randomly rearranging the order in which the different segments occur in the sequence, shuffled sequences preserving the original statistical distribution of patch lengths were generated. Both types of random sequences displayed the same correlation scaling exponents as the original DNA sequence, thus demonstrating that neither the internal structure of patches nor the order in which these are arranged in the sequence is critical; therefore, long-range correlations in nucleotide sequences seem to rely only on the power-law distribution of patch lengths. © 1996 The American Physical Society.
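A sketch of the core computation in this style of recursive segmentation: scan all split points of a symbol sequence and choose the one maximizing the Jensen-Shannon divergence between the two halves' symbol compositions. The significance test the paper uses as a stopping rule is omitted.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a symbol-count table."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def js_divergence(left, right):
    """Weighted Jensen-Shannon divergence of two compositions."""
    nl, nr = sum(left.values()), sum(right.values())
    whole = left + right  # Counter addition merges the counts
    return entropy(whole) - (nl * entropy(left) + nr * entropy(right)) / (nl + nr)

def best_js_split(seq):
    """Split point maximizing the divergence between the halves."""
    best_i, best_d = 1, -1.0
    for i in range(1, len(seq)):
        d = js_divergence(Counter(seq[:i]), Counter(seq[i:]))
        if d > best_d:
            best_i, best_d = i, d
    return best_i, best_d

print(best_js_split("AAAATTTT"))  # splits at 4 with maximal divergence
```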
Article
Improving on Our Explanation; Intellectual Impact and Legacy; Further Reading; References
Conference Paper
We present the first linear time (1+ε)-approximation algorithm for the k-means problem for fixed k and ε. Our algorithm runs in O(nd) time, which is linear in the size of the input. Another feature of our algorithm is its simplicity: the only technique involved is random sampling.
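As a loose illustration of the random-sampling flavor only: seed candidate centers by repeated random sampling and keep the best trial. The actual (1+ε)-approximation algorithm uses a much more careful recursive sampling scheme than this.

```python
import random

def kmeans_cost(points, centers):
    """Sum of squared distances to the nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def sampled_kmeans(points, k, trials=50, seed=0):
    """Pick k centers by random sampling, many times, keep the best.
    Illustrates the 'only technique is random sampling' idea."""
    rng = random.Random(seed)
    best, best_cost = None, float('inf')
    for _ in range(trials):
        centers = rng.sample(points, k)
        cost = kmeans_cost(points, centers)
        if cost < best_cost:
            best, best_cost = centers, cost
    return best, best_cost

print(sampled_kmeans([1.0, 1.2, 0.9, 5.0, 5.1, 4.8], 2))
```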
Article
This report indicates the level of computer development and application in each of the thirty countries of Europe, most of which were recently visited by the author.
Article
The availability of genome-wide mRNA expression data for organisms whose genome is fully sequenced provides a unique data set from which to decipher how transcription is regulated by the upstream control region of a gene. A new algorithm is presented which decomposes DNA sequence into the most probable "dictionary" of motifs or words. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter words of various length. This eliminates the need for a separate set of reference data to define probabilities, and genome-wide applications are therefore possible. For the 6,000 upstream regulatory regions in the yeast genome, the 500 strongest motifs from a dictionary of size 1,200 match at a significance level of 15 standard deviations to a database of cis-regulatory elements. Analysis of sets of genes such as those up-regulated during sporulation reveals many new putative regulatory sites in addition to identifying previously known sites.
Article
The existence of whole genome sequences makes it possible to search for global structure in the genome. We consider modeling the occurrence frequencies of discrete patterns (such as starting points of ORFs or other interesting phenomena) along the genome. We use piecewise constant intensity models with a varying number of pieces, and show how a reversible jump Markov chain Monte Carlo (RJMCMC) method can be used to obtain an a posteriori distribution on the intensity of the patterns along the genome. We apply the method to modeling the occurrence of ORFs in the human genome. The results show that the chromosomes consist of 5–35 clearly distinct segments, and that the a posteriori number and length of the segments show significant variation. On the other hand, for the yeast genome the intensity of ORFs is nearly constant. Contact: Marko.Salmenkivi@cs.helsinki.fi, Juha.Kere@biosci.ki.se, Heikki.Mannila@cs.helsinki.fi