Article

A linear space algorithm for computing longest common subsequences

Authors: D. S. Hirschberg

... Mining MLCSs from these sequences is becoming an increasingly important research topic and faces severe challenges. In the past forty years, in order to efficiently tackle the LCS/MLCS problem, various types of LCS/MLCS algorithms [2,3,6,7,11,12,14,15,16,17,19] and tools (e.g., SAMtools, BLAST, Clustal Omega, etc.) have been proposed, which can be divided into two categories: classical dynamic programming and dominant-point-based algorithms. It has been demonstrated that the dominant-point-based LCS/MLCS algorithms have an overwhelming advantage over classical dynamic programming ones because they reduce the size of the search space by orders of magnitude [16]. ...
... where Fig. 1 illustrates the score matrix L of two sequences S1 = ACTAGCTA and S2 = TCAGGTAT over the alphabet Σ = {A, C, G, T} and the process of extracting an LCS = CAGTA from L. To further reduce time and space complexities, various improved dynamic programming LCS/MLCS algorithms [1,3,6] have been proposed. For example, Hirschberg [3] presented a new LCS algorithm based on the divide-and-conquer approach, which reduces the space complexity to O(m + n); however, its time complexity remains O(mn). Masek and Paterson [14] put forward an improved dynamic programming LCS algorithm for two sequences of length n using a fast method of computing edit distance, whose worst-case time complexity is O(n²/log n). ...
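The score matrix L described in the excerpt above is the classical quadratic-space dynamic program. As a point of reference (an illustrative sketch, not code from any of the cited papers), the table and one traceback for the excerpt's example sequences can be computed as follows:

```python
# Classical O(mn)-time, O(mn)-space LCS dynamic program with traceback.
# Function names are illustrative, not taken from the cited works.

def lcs_table(s1: str, s2: str):
    """Build the (len(s1)+1) x (len(s2)+1) score matrix L."""
    m, n = len(s1), len(s2)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L

def lcs_traceback(s1: str, s2: str) -> str:
    """Extract one LCS by walking back from L[m][n]."""
    L = lcs_table(s1, s2)
    i, j, out = len(s1), len(s2), []
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs_traceback("ACTAGCTA", "TCAGGTAT"))  # one LCS of length 5; the excerpt's example is CAGTA
```

Hirschberg's contribution, discussed throughout this page, is that the full O(mn) table above is not needed when only the LCS itself is wanted.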
Conference Paper
Full-text available
Information in various applications is often expressed as character sequences over a finite alphabet (e.g., DNA or protein sequences). In Big Data era, the lengths and sizes of these sequences are growing explosively, leading to grand challenges for the classical NP-hard problem, namely searching for the Multiple Longest Common Subsequences (MLCS) from multiple sequences. In this paper, we first unveil the fact that the state-of-the-art MLCS algorithms are unable to be applied to long and large-scale sequences alignments. To overcome their defects and tackle the longer and large-scale or even big sequences alignments, based on the proposed novel problem-solving model and various strategies, e.g., parallel topological sorting, optimal calculating, reuse of intermediate results, subsection calculation and serialization, etc., we present a novel parallel MLCS algorithm. Exhaustive experiments on the datasets of both synthetic and real-world biological sequences demonstrate that both the time and space of the proposed algorithm are only linear in the number of dominants from aligned sequences, and the proposed algorithm significantly outperforms the state-of-the-art MLCS algorithms, being applicable to longer and large-scale sequences alignments.
... However, straightforward recovery of the edit operations in the sliding window using a BZ sketch takes O(d² log n) bits. To improve on this space complexity, we modify the classical Hirschberg algorithm [17]. Recall that Hirschberg's algorithm is a dynamic programming algorithm that finds the optimal sequence alignment between two strings of length n using O(n log n) bits of space. ...
... In summary, the algorithm stores the following data: ... Proof: The classic Hirschberg algorithm [17], [19] returns the locations of the optimal edit operations between S and T in O(m log m) space. However, if we do not care about the locations of the edit operations for ed(S, T) > d, then we can optimize the space down to O(m + d log m) bits using ideas from [30]. ...
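For orientation, the classical two-row computation of ed(S, T), which uses O(min(|S|, |T|)) working space but does not recover where the edits occur, is sketched below; recovering the locations is exactly what the Hirschberg-style divide and conquer discussed in these excerpts is needed for. This is an assumed illustrative helper, not code from the cited work.

```python
# Two-row edit distance: O(min(|s|, |t|)) space, no recovery of edit locations.

def edit_distance(s: str, t: str) -> int:
    if len(t) > len(s):                 # keep the shorter string along the row
        s, t = t, s
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i] + [0] * len(t)
        for j, ct in enumerate(t, 1):
            curr[j] = min(prev[j] + 1,                 # deletion
                          curr[j - 1] + 1,             # insertion
                          prev[j - 1] + (cs != ct))    # substitution or match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```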
Article
Analyzing patterns in data streams generated by network traffic, sensor networks, or satellite feeds is a challenge for systems in which the available storage is limited. In addition, real data is noisy, which makes designing data stream algorithms even more challenging. Motivated by such challenges, we study algorithms for detecting the similarity of two data streams that can be read in sync. Two strings S, T ∈ Σⁿ form a d-near-alignment if the distance between them in some given metric is at most d. We study the problem of identifying a longest substring of S and T that forms a d-near-alignment under the edit distance, in the simultaneous streaming model. In this model, symbols of strings S and T are streamed at the same time, and the amount of available processing space is sublinear in the length of the strings. We give several algorithms, including an exact one-pass algorithm that uses O(d² + d log n) bits of space. We couple these results with comparable lower bounds.
... This research takes advantage of current high performance architectures to further enhance the time performance of the HT-NGH algorithm. Although pair-wise alignment methods reach optimal accuracy [21][22][23], they are still time-consuming. The NGH algorithm [21] reduces the time and space required to build the alignment. ...
... The Hirschberg algorithm [22] presents a solution to the space problem of the Needleman-Wunsch algorithm, reducing the space requirement to O(min(m, n)). Unlike the Needleman-Wunsch and Smith-Waterman algorithms, the Hirschberg algorithm divides the similarity matrix into two smaller matrices, exploiting the fact that the similarity matrix can be filled from both directions: top down and bottom up. ...
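The split step the excerpt describes (filling the matrix top-down for one half and bottom-up for the other) can be made concrete with a short sketch; the function names are mine and the scoring is plain LCS length rather than a full similarity scheme.

```python
# Hirschberg-style split step: lcs_row returns the last row of the LCS-length
# DP in O(len(b)) space; the split column maximises forward + backward scores.

def lcs_row(a: str, b: str):
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            curr[j] = prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1])
        prev = curr
    return prev

def split_column(a: str, b: str) -> int:
    """Column of b at which some optimal path crosses the middle row of a."""
    mid = len(a) // 2
    fwd = lcs_row(a[:mid], b)                # top half, filled top-down
    bwd = lcs_row(a[mid:][::-1], b[::-1])    # bottom half, filled bottom-up
    return max(range(len(b) + 1), key=lambda j: fwd[j] + bwd[len(b) - j])

print(split_column("ACTAGCTA", "TCAGGTAT"))  # where an optimal path crosses the middle row
```

The recursion then solves the two halves independently, which is what keeps the memory footprint linear.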
Conference Paper
Full-text available
In bioinformatics, pair-wise alignment plays a significant role in sequence comparison by rating the similarities and distances between protein, DeoxyriboNucleic Acid (DNA), and RiboNucleic Acid (RNA) sequences. Sequence comparison is considered a keystone in building distance matrices. Due to the rapid growth of molecular databases, the need for faster sequence comparison and alignment has become a necessity. The impact of high performance computing has increased in the last decade through the provision of many high performance architectures and tools. In this paper we present a parallel shared-memory design for a dynamic programming algorithm named Hash Table-N-Gram-Hirschberg (HT-NGH), an extension of the Hashing-N-Gram-Hirschberg (HNGH) and N-Gram-Hirschberg (NGH) algorithms, to speed up the sequence alignment construction process. The proposed method focuses on the transformation phase of the HT-NGH algorithm, since it takes 10% of HT-NGH's overall run time. The experimental evaluation of the proposed parallel design shows an enhancement in the execution time and speedup without sacrificing accuracy. However, the decomposition method might slightly slow down the proposed algorithm due to the differences in performance between the processing units.
... The lengths of these strings are used to score the subsequence, which is calculated with the global similarity score. This score is converted to a percentage scale and returned [26]–[31]. Fig. 3 describes the ACS algorithm in detail and Algorithm 1 shows the steps needed to implement it. If a song has multiple similar parts, this algorithm takes that into consideration. ...
... Symbolic alignment is performed by an implementation of the Hirschberg algorithm [8], with naive Levenshtein [9] edit costs. The Hirschberg algorithm is well suited for long symbolic alignment tasks as its memory requirements are linear, and its quadratic runtime requirements can be parallelized to a certain degree. ...
Conference Paper
Full-text available
Forced alignment tools such as the Munich Automatic Segmentation System (MAUS) [1] do not scale well with input size. In this paper, we present a preprocessor chunk segmentation tool to combat this problem. It dramatically decreases MAUS's run-time on recordings of duration up to three hours, while also having a slightly positive effect on segmentation accuracy. We hope that this tool will advance the use of non-scientific transcribed recordings, such as audio books or broadcasts, in phonetic research. The chunker tool will be made available as a free web service at the Bavarian Archive for Speech Signals (BAS) [2].
... The Landau-Vishkin algorithm, proposed originally in [13], solves the APM. This algorithm is based on a dynamic programming technique which, at the k-th iteration, obtains the maximal extension of diagonals of the dynamic programming table [6]. Considering k as the number of errors, and i as the i-th diagonal, the dynamic programming technique is based on the recurrence relation L(i, k): ...
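To make the recurrence concrete, here is an illustrative reimplementation of the Landau-Vishkin k-differences scheme sketched in the excerpt (variable names and the reporting convention are mine, not the authors'): for each error budget e, the furthest reachable row on every diagonal is taken from three predecessors and then extended for free while characters match.

```python
# Illustrative Landau-Vishkin sketch: L[(d, e)] = furthest row on diagonal d
# reachable with at most e edits; a match is reported when the whole pattern
# has been consumed (row == m).

def landau_vishkin(pattern: str, text: str, k: int):
    m, n = len(pattern), len(text)
    L, ends = {}, set()
    for e in range(k + 1):
        for d in range(-e, n + 1):
            if e == 0:
                row = 0
            else:
                row = max(L.get((d, e - 1), -1) + 1,      # mismatch (substitution)
                          L.get((d - 1, e - 1), -1),      # extra text character (gap in pattern)
                          L.get((d + 1, e - 1), -1) + 1)  # extra pattern character (gap in text)
            row = min(row, m, n - d)
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1                                   # free extension along the diagonal
            L[(d, e)] = row
            if row == m:
                ends.add(m + d)        # an occurrence with <= e edits ends here
    return sorted(ends)

print(landau_vishkin("annual", "annealing", 2))  # [5, 6, 7]: e.g. "anneal" matches with 1 edit
```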
Article
Full-text available
The approximate pattern matching problem (APM) consists in locating all occurrences of a given pattern P in a text T allowing a specific amount of errors. Due to the character of real applications and the fact that solutions do not have an error-free nature in Computer Science, solving APM is crucial for developing meaningful applications. Recently, Ilie, Navarro and Tinta presented a fast algorithm to solve APM based on the well-known Landau-Vishkin algorithm. However, the amount of available memory limits the usage of their algorithm, since it requires the entire answer array to be in memory. In this article, a practical semi-external memory method to solve APM is presented. The method is based on the direct-comparison variation of Ilie et al.'s algorithm. Performance tests with real data of length up to 1.2 GB showed that the presented method is about 5 times more space-efficient than Ilie et al.'s algorithm and yet has a competitive trade-off regarding time.
... It builds the alignment in O(mn) time and O(mn) space. Furthermore, the Hirschberg algorithm [16], which is the space-saving version of the Needleman-Wunsch and Smith-Waterman algorithms, produces the alignment with O(min(m, n)) space. An enhancement of the Hirschberg algorithm, named N-Gram-Hirschberg (NGH) [17], was proposed for further space and time reduction. ...
... The proposed algorithm outperforms the Smith-Waterman algorithm in terms of the space and time requirements, and it reduces these without sacrificing the sensitivity of the original algorithm. The Hirschberg algorithm [16] offers a solution to the space issue of the Needleman-Wunsch algorithm, reducing the space to O(min(m, n)). Unlike the Needleman-Wunsch and Smith-Waterman algorithms, the Hirschberg algorithm divides the similarity matrix into two parts. ...
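The excerpts above describe the divide step; to make the whole scheme concrete, here is a compact illustrative reimplementation (not code from any of the cited works) of the recursion that recovers an actual LCS while keeping only two DP rows in memory. In Python the string slices copy data, so a production version would pass index ranges instead.

```python
# Hirschberg's divide-and-conquer reconstruction of an LCS, illustrative sketch.

def _lcs_lengths(a: str, b: str):
    """Last row of the LCS-length table for a vs b, in O(len(b)) space."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            curr[j] = prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1])
        prev = curr
    return prev

def hirschberg(a: str, b: str) -> str:
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    left = _lcs_lengths(a[:mid], b)
    right = _lcs_lengths(a[mid:][::-1], b[::-1])
    # column of b where an optimal path crosses the middle row of a
    k = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg(a[:mid], b[:k]) + hirschberg(a[mid:], b[k:])

print(hirschberg("ACTAGCTA", "TCAGGTAT"))  # an LCS of length 5
```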
... A first linear space algorithm was proposed by Yoshifumi Sakai [8]. The space cost of the algorithm of Yang et al. was reduced to linear by a careful application of Hirschberg's divide-and-conquer method [4]. The space complexity of the algorithm of Katriel and Kutz [6] was also reduced from O(nl) to O(m) by using the same divide-and-conquer method of Hirschberg [4]. In this paper, we approach the problem from a new perspective. ...
Article
This paper reformulates the problem of finding a longest common increasing subsequence of the two given input sequences in a very succinct way. An extremely simple linear-space algorithm based on the new formula can find a longest common increasing subsequence of sequences of lengths n and m, respectively, in O(nm) time using only min{n, m} + 1 additional space.
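For readers unfamiliar with the problem, the following is the standard textbook O(nm)-time, O(m)-space dynamic program for the length of a longest common increasing subsequence; it is shown only to make the problem concrete and is not the new formulation proposed in the paper.

```python
# Textbook LCIS length: dp[j] = length of an LCIS of the processed prefix of a
# and of b, with the common subsequence ending at b[j].

def lcis_length(a, b):
    dp = [0] * len(b)
    for x in a:
        best = 0                       # best dp[j'] over earlier j' with b[j'] < x
        for j, y in enumerate(b):
            if y == x:
                dp[j] = max(dp[j], best + 1)
            elif y < x:
                best = max(best, dp[j])
    return max(dp, default=0)

print(lcis_length([3, 1, 4, 1, 5, 9, 2, 6], [1, 4, 5, 9, 2, 6]))  # 4, e.g. 1, 4, 5, 9
```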
... Our overall approach is similar in spirit to the classic divide-and-conquer algorithm by Hirschberg [11] for computing a longest common subsequence of two strings in linear space. Let A be the Thompson NFA (TNFA) for R built according to Thompson's rules [23] (see also Fig. 1) with m states, and let Q be the string of length n. ...
Article
Full-text available
Given a regular expression R and a string Q, the regular expression matching problem is to determine if Q is a member of the language generated by R. The classic textbook algorithm by Thompson [C. ACM 1968] constructs and simulates a non-deterministic finite automaton in O(nm) time and O(m) space, where n and m are the lengths of the string and the regular expression, respectively. Assuming the strong exponential time hypothesis, Backurs and Indyk [FOCS 2016] showed that this result is nearly optimal. However, for most applications determining membership is insufficient and we need to compute how we match, i.e., to identify or replace matches or submatches in the string. Using backtracking we can extend Thompson's algorithm to solve this problem, called regular expression parsing, in the same asymptotic time but with a blow up in space to Ω(nm). Surprisingly, all existing approaches suffer the same or a similar quadratic blow up in space and no known solutions for regular expression parsing significantly improve this gap between matching and parsing. In this paper, we overcome this gap and present a new algorithm for regular expression parsing using O(nm) time and O(n + m) space. To achieve our result, we develop a novel divide and conquer approach similar in spirit to the classic divide and conquer technique by Hirschberg [C. ACM 1975] for computing a longest common subsequence of two strings in quadratic time and linear space. We show how to carefully decompose the problem to handle cyclic interactions in the automaton, leading to a subproblem construction of independent interest. Finally, we generalize our techniques to convert other existing state-set transition algorithms for matching to parsing using only linear space.
... Smith-Waterman requires O(mn) space to optimally align two sequences of lengths m and n. Hirschberg's algorithm [32] can improve the space complexity to linear O(m + n), but is rarely used in practice because of its performance. As a result, heuristics, such as Banded Smith-Waterman [13], X-drop [84] and Myers' bit-vector algorithm [56], that do not guarantee optimal alignments but have linear space and time complexity, have become popular. ...
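The banded heuristics mentioned in the excerpt restrict the DP to a diagonal band. A minimal sketch of that idea, with plain Levenshtein costs rather than the cited Banded Smith-Waterman scoring, is shown below; the result is exact only when the optimal alignment stays inside the band of half-width w.

```python
# Banded edit distance: only cells with |i - j| <= w are computed, giving
# O(w * max(m, n)) time and O(w) space per row. Illustrative only.

def banded_edit_distance(s: str, t: str, w: int):
    m, n = len(s), len(t)
    if abs(m - n) > w:
        return None                        # the end cell lies outside the band
    INF = float("inf")
    prev = {j: j for j in range(min(n, w) + 1)}
    for i in range(1, m + 1):
        curr = {}
        for j in range(max(0, i - w), min(n, i + w) + 1):
            if j == 0:
                curr[j] = i
                continue
            curr[j] = min(prev.get(j, INF) + 1,
                          curr.get(j - 1, INF) + 1,
                          prev.get(j - 1, INF) + (s[i - 1] != t[j - 1]))
        prev = curr
    return prev[n]

print(banded_edit_distance("kitten", "sitting", 2))  # 3
```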
Conference Paper
Genomics is transforming medicine and our understanding of life in fundamental ways. Genomics data, however, is far outpacing Moore's Law. Third-generation sequencing technologies produce 100X longer reads than second generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. Over 1,300 CPU hours are required for reference-guided assembly of the human genome, and over 15,600 CPU hours are required for de novo assembly. This paper describes "Darwin" --- a co-processor for genomic sequence alignment that, without sacrificing sensitivity, provides up to 15,000X speedup over the state-of-the-art software for reference-guided assembly of third-generation reads. Darwin achieves this speedup through hardware/algorithm co-design, trading more easily accelerated alignment for less memory-intensive filtering, and by optimizing the memory system for filtering. Darwin combines a hardware-accelerated version of D-SOFT, a novel filtering algorithm, with a hardware-accelerated version of GACT, a novel high-speed alignment algorithm. GACT generates near-optimal alignments of arbitrarily long genomic sequences using constant memory for the compute-intensive step. Darwin is adaptable, with tunable speed and sensitivity to match emerging sequencing technologies and to meet the requirements of genomic applications beyond read assembly.
... If the student's answer is within the given tolerance (below 0.05), the answer is accepted as correct. The tolerance bound is obtained by dividing the Levenshtein distance [117] by the total number of characters in the name of the concept or relation. ...
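A small sketch of that acceptance rule (hypothetical helper names, assuming the usual two-row Levenshtein computation):

```python
# Answer accepted when Levenshtein distance / length of the target name is
# below the tolerance (0.05 in the excerpt above).

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_accepted(answer: str, target: str, tolerance: float = 0.05) -> bool:
    return levenshtein(answer.lower(), target.lower()) / len(target) < tolerance

print(is_accepted("fotosinthesis", "photosynthesis"))  # False: 3 edits / 14 characters > 0.05
```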
Thesis
Full-text available
This work presents the results of research carried out during the development, implementation and application of a new model for knowledge design and delivery in an intelligent learning management system, the Knowledge Design and Delivery (KD&D) model. The research, development and application of the KD&D model focus on student modelling, in order to determine students' knowledge and adapt to the students' actual level of knowledge. The adaptation is achieved within a cybernetic system model. We have united the treatment of learning, teaching and knowledge testing in intelligent learning management systems with a cybernetic system model. In this connection, two essential aspects of intelligence are highlighted for intelligent learning management systems, related to diagnosing students' knowledge and providing timely help in removing misconceptions and missing conceptions of the domain knowledge. We have created a new platform for intelligent knowledge management. The software prototype CM Tutor demonstrated adaptive acquisition of students' knowledge based on a student modelling procedure in an environment of multi-criteria decision-making mathematical methods and student knowledge stereotypes. The CM Tutor prototype was placed in an experimental setting with the goal of determining the effect size and the quality of the achieved level of development, as well as the participants' sense of satisfaction during learning, teaching and knowledge testing.
... Hirschberg [21] first considered this problem. Based on dynamic programming, in 1990 Huang [22] proposed the first linear-memory algorithm, whose memory requirement is proportional to the sum of the lengths of the sequences being aligned. ...
Article
Full-text available
In this paper we consider the application of dynamic programming in bioinformatics for comparing and aligning DNA sequences. Dynamic programming algorithms are old-fashioned in terms of computational complexity, but in terms of accuracy these algorithms are still irreplaceable.
... The name-nick matching requires a similarity measure; among the possible candidates for such a measure, the length of the Longest Common Subsequence (Hirschberg, 1975) was selected as a trade-off between quality and implementation complexity. The LCS measure is based solely on matching characters (unlike, for instance, the Levenshtein distance), i.e. ...
... Similarity measures between sequences [7] could then be generalized to this formalism. To determine the overlapping part between two trajectories, we generalized the principle of the longest common subsequence (LCS) to this formalism [8]. ...
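The two excerpts above use the LCS length as a similarity score and generalize matching beyond exact equality. A sketch of that idea with a pluggable match predicate (all names and example codes here are mine, not from the cited papers):

```python
# LCS length as a normalised similarity score with a pluggable match predicate,
# so "equality" can be relaxed, e.g. to overlap of event sets as in the excerpt.

def lcs_similarity(a, b, match=lambda x, y: x == y):
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if match(x, y) else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)   # normalised to [0, 1]

# exact matching on characters
print(lcs_similarity("hirschberg75", "hirschberg_1975"))
# relaxed matching on medical episodes modelled as sets of hypothetical event codes
traj1 = [{"diag:K80"}, {"proc:ERCP", "drug:ATB"}, {"proc:chole"}]
traj2 = [{"diag:K80"}, {"drug:ATB"}, {"proc:chole"}, {"followup"}]
print(lcs_similarity(traj1, traj2, match=lambda x, y: bool(x & y)))  # 0.75
```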
Conference Paper
Comparing care trajectories helps improve health services. Medico-administrative databases are useful for automatically reconstructing the patients' history of care. Care trajectories can be compared by determining their overlapping parts. This comparison relies on both semantically-rich representation formalism for care trajectories and an adequate similarity measure. The longest common subsequence (LCS) approach could have been appropriate if representing complex care trajectories as simple sequences was expressive enough. Furthermore, by failing to take into account similarities between different but semantically close medical events, the LCS overestimates differences. We propose a generalization of the LCS to a more expressive representation of care trajectories as sequences of sets. A set represents a medical episode composed by one or several medical events, such as diagnosis, drug prescription or medical procedures. Moreover, we propose to take events' semantic similarity into account for comparing medical episodes. To assess our approach, we applied the method on a care trajectories' sample from patients who underwent a surgical act among three kinds of acts. The formalism reduced calculation time, and introducing semantic similarity made the three groups more homogeneous.
... Algorithms to find the best alignments (the ones having the maximal score) have also been well studied. Since [27] developed a dynamic programming algorithm for global alignment, many improvements or variants have been developed: [16] for a linear space improvement, [31] for local alignment, [15] for affine gap penalty, [25, 2, 20] for fast heuristic local alignment, and many more. A detailed review of LCS algorithms can be found in [4]. ...
Article
The length of the longest common subsequences (LCSs) is often used as a similarity measurement to compare two (or more) random words. Below we study its statistical behavior in mean and variance using a Monte-Carlo approach, from which we then develop a hypothesis testing method for sequence similarity. Finally, theoretical upper bounds are obtained for the Chvátal-Sankoff constant of multiple sequences.
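A toy version of the Monte-Carlo experiment the abstract describes, for two independent uniform random words over a binary alphabet (the asymptotic ratio is the Chvátal-Sankoff constant, not known exactly but roughly 0.81); this sketch is purely illustrative and not the paper's methodology.

```python
import random
import statistics

def lcs_len(a, b):
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def mc_lcs_ratio(n=200, trials=50, alphabet="01", seed=0):
    """Estimate mean and variance of LCS(X, Y)/n for random words of length n."""
    rng = random.Random(seed)
    ratios = [lcs_len([rng.choice(alphabet) for _ in range(n)],
                      [rng.choice(alphabet) for _ in range(n)]) / n
              for _ in range(trials)]
    return statistics.mean(ratios), statistics.variance(ratios)

print(mc_lcs_ratio())  # mean roughly 0.8 for n = 200, plus a small sample variance
```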
... Hirschberg improvement. Hirschberg came up with a linear space algorithm that was an improved version of the pairwise sequence alignment algorithm (Hirschberg, 1975). Given two sequences S of length m and T of length n (assuming that T is longer, i.e. n > m), the Hirschberg algorithm splits sequence T near the middle, resulting in two subsequences Ta and Tb. ...
Article
A multitude of algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in the general context of molecular biology, to support technology development for high-throughput sequencing, numerous applications in genome biology and fundamental research on comparative genomics. The computational complexity of these algorithms has been previously reported in original research papers, yet this often neglected property has not been reviewed previously in a systematic manner and for a wider audience. We provide a review of space and time complexity of key sequence analysis algorithms and highlight their properties in a comprehensive manner, in order to identify potential opportunities for further research in algorithm or data structure optimization. The complexity aspect is poised to become pivotal as we will be facing challenges related to the continuous increase of genomic data on unprecedented scales and complexity in the foreseeable future, when robust biological simulation at the cell level and above becomes a reality.
... For instance, He and von Davier (2016) draw on process data recorded in items from the problem solving in technology-rich environments domain in PIAAC to address how sequences of actions (n-grams) recorded in problem-solving items are related to task performance. Sukkarieh, von Davier, and Yamamoto (2012) used longest common subsequence algorithms (LCS; e.g., Hirschberg, 1975, 1977) in a multilingual environment to compare sequences that test takers selected in a reading task against expert-generated ideal solutions. These methods are worth further exploration to investigate the associations between sequences of actions and CPS skills and to extract sequence patterns for different CPS proficiency levels. ...
Chapter
Full-text available
Collaborative problem solving (CPS) is a critical and necessary skill in educational settings and the workforce. The assessment of CPS in the Programme for International Student Assessment (PISA) 2015 focuses on the cognitive and social skills related to problem solving in collaborative scenarios: establishing and maintaining shared understanding, taking appropriate actions to solve problems, and establishing and maintaining group organization. This chapter draws on measures of the CPS domain in PISA 2015 to address the development and implications of CPS items, challenges, and solutions related to item design, as well as computational models for CPS data analysis in large-scale assessments. Measuring CPS skills is not only a challenge compared to measuring individual skills but also an opportunity to make the cognitive processes in teamwork observable. An example of a released CPS unit in PISA 2015 will be used for the purpose of illustration. This study also discusses future perspectives in CPS analysis using multidimensional scaling, in combination with process data from log files, to track the process of students’ learning and collaborative activities.
... If a sequence " S " is the subsequence of two or more than two known sequences and meanwhile is the longest among all sequences, then it is the LCS of the known sequence [16, 17]. Hirschberg [18] has provided a solution to this problem with the dynamic programming algorithm. We assume that there are two strings X and Y, of which X = {a0, a1, . . . ...
Article
Full-text available
ICD-10 (International Classification of Diseases, 10th revision) is a classification of a disease, symptom, procedure, or injury. Diseases are often described in patients' medical records with free texts, such as terms, phrases and paraphrases, which differ significantly from those used in the ICD-10 classification. This paper presents an improved approach based on the Longest Common Subsequence (LCS) and semantic similarity for automatic Chinese diagnosis mapping, from the disease names given by clinicians to the disease names in ICD-10. LCS refers to the longest string that is a subsequence of every member of a given set of strings. The proposed method of improved LCS in this paper can increase the accuracy of processing in Chinese disease mapping.
... UNIX's diff [40] is a widely used file differencing tool that highlights differences between two documents on the line level. Files are treated as ordered sequences of lines, and the longest common subsequence of lines between files is computed [41, 42]. Besides longest common subsequence, edit distance (also called Levenshtein distance) [42] is also widely adopted by file differencing tools to find lines that are inserted, deleted or substituted between different files. ...
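In Python, a comparable line-level diff is available in the standard library; note that difflib uses Ratcliff/Obershelp-style matching rather than the LCS routine used by UNIX diff, but the output has the same line-oriented form.

```python
# Line-level differencing with the standard library (illustrative file names).

import difflib

old = ["def lcs(a, b):", "    # quadratic space", "    return table(a, b)"]
new = ["def lcs(a, b):", "    # linear space (Hirschberg)", "    return divide_and_conquer(a, b)"]

for line in difflib.unified_diff(old, new, fromfile="v1.py", tofile="v2.py", lineterm=""):
    print(line)
```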
Thesis
Full-text available
Presentation slides are one of the most important tools for today's knowledge workers to present knowledge, exchange information, and discuss ideas for business, education and research purposes. Presentation slide composition is an important job for these presentation composers. To create presentation slides, one common practice is to start from existing slides. One of the primary reasons of slide reuse is to repurpose existing content in existing presentation files for various events, audiences , formats, etc. For example, when many researchers and lectures create new presentation slides, they reuse the lecture notes used in university courses and the reports presented in academic conferences. In business applications, people often combine materials used in previous presentations to create a summary, and modify existing slides in order to present to different audiences. However, browsing these existing files and searching relevant materials is a time-consuming task. It is difficult to remember where all the contents reside. People often remember only some keywords, an image, a diagram or a slide. To this end the search and retrieval method for presentation slide reuse is necessary to develop. Detecting reused materials in presentation slides benefits many presentation-related applications; e.g., assisting composer in tracking changes in multiple versions, understanding existing presentation slides, and assembling existing slides to make new ones, etc. Although the slide retrieval method for reuse and the method to compare different versions of a presentation file have been proposed, they are either based on slide-to-slide comparison or file-to-file comparison. In many cases, only an individual element such as a sentence, a table, an image or a diagram, is copied from one file to another, but overall the slides and the files differ significantly, and thus the reuse element cannot be identified by these methods. Many knowledge workers demonstrate presentations using slide show software such as Microsoft PowerPoint, Keynote, and OpenOffice. Although these tools provide easy ways for slide preparation by inserting texts, images, animations, etc., traditional slide show software lack in the functions of the slide structure and the content support. Many researchers propose slide generation and composition methods for presentation slides. Some of them extract presentation contents from paper, while others based on the outline wizard. But all of the proposed method and system generates presentation slides automatically, that's to say, user have no choice about the structure of the presentations, and cannot participate in the contents and the layouts of the slides. In order to achieve these goals of presentation slides searching, managing and design supporting, this thesis introduces the respective approaches to effective presentation slide reuse support. The fundamental approaches to these researches are to propose content-based element search methods among presentation slides, then provide a
... The third algorithm computes only the length of LCSS which is enough for computing the distance between two sequences. We refer the reader to (Hirschberg, 1975 ) where Hirschberg describes a linear space algorithm for computing the actual LCSS. Algorithm 3 is not ready yet for comparing time series because in the case of time series the values are real numbers and not discrete symbols. ...
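The adaptation the excerpt refers to is usually written as an LCSS recurrence in which two real-valued samples match when they differ by at most a tolerance eps (optionally within a warping window delta). A length-only, two-row sketch with names of my choosing:

```python
# LCSS length for real-valued series: samples match when |x - y| <= eps
# (and |i - j| <= delta if a warping window is given). Length only; recovering
# the matched subsequence itself is where Hirschberg's linear-space idea helps.

def lcss_length(a, b, eps=0.1, delta=None):
    prev = [0] * (len(b) + 1)
    for i, x in enumerate(a):
        curr = [0]
        for j, y in enumerate(b):
            if abs(x - y) <= eps and (delta is None or abs(i - j) <= delta):
                curr.append(prev[j] + 1)
            else:
                curr.append(max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]

print(lcss_length([0.0, 0.1, 0.5, 0.9, 1.0], [0.05, 0.52, 0.88, 1.02, 1.5]))  # 4
```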
... The alignment is performed using a multithreaded and vectorised full dynamic programming algorithm (Needleman & Wunsch, 1970) adapted from SWIPE (Rognes, 2011). Due to the extreme memory requirements of this method when aligning two long sequences, an alternative algorithm described by Hirschberg (1975) and Myers & Miller (1988) is used when the product of the length of the sequences is greater than 25,000,000, corresponding to aligning two 5,000 bp sequences. This alternative algorithm uses only a linear amount of memory but is considerably slower. ...
Article
Full-text available
Background VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use. Methods When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences, a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads. Results VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-ends read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end reads merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0. Discussion VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.
... As the core letters in a word might not be contiguous, the matching made use of the computation of the Longest Common Subsequence (LCS) [16]. As its name suggests, it extracts the longest sequence of letters, either contiguous or not but in the same order, shared by the two words. ...
Article
Full-text available
Assessment of the similarities between texts has been studied for decades from different perspectives and for several purposes. One interesting perspective is morphology. This article reports the results of a study on the assessment of the morphological relatedness between natural language words. The main idea is to adapt a formal string alignment algorithm, namely Needleman-Wunsch's, to accommodate the statistical characteristics of the words in order to approximate how similar the linguistic morphologies of the two words are. The approach is unsupervised from end to end and the experiments show an nDCG reaching 87% and an r-precision reaching 81%.
... The question also appears in algebraic statistics [38]: there the objective function is the tropicalisation of a co-ordinate polynomial of a particular hidden Markov model. The special case α = β = 0 corresponds to the problem of finding longest common subsequence (LCS) of the words η x and η y , which has been intensively studied by computer scientists [6, 22, 29, 32] and mathematicians [2, 8, 18, 23, 27, 28]. On the other hand, the alignment score L ...
Article
Full-text available
We study the sequence alignment problem and its independent version, the discrete Hammersley process with an exploration penalty. We obtain rigorous upper bounds for the number of optimality regions in both models near the soft edge. At zero penalty the independent model becomes an exactly solvable model and we identify cases for which the law of the last passage time converges to a Tracy-Widom law.
... This work exploits string sequence alignment algorithms such as Hirschberg's algorithm (Hirschberg, 1975) and the Ratcliff/Obershelp algorithm (Black, 2004), in the same vein as recent work by Durrett and DeNero (2013) and Nicolai et al. (2015). In these frameworks, the fewest edits required to convert one string to another are considered to be morphological rules (e.g., give → gave: -i+a; kitab → kutub: -i+u -a+u; springen → gesprungen: +g+e -i +u). ...
... OSU The OSU system (King, 2016) also used a pipelined approach. They first extracted sequences of edit operations using Hirschberg's algorithm (Hirschberg, 1975). This reduces the string-to-string mapping problem to a sequence tagging problem. ...
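The pipelines described above reduce a string-to-string mapping (for example an inflection pair) to an edit script. Purely for illustration, Python's difflib.SequenceMatcher (Ratcliff/Obershelp matching, not Hirschberg's algorithm) produces comparable -x/+y style operations for short pairs:

```python
from difflib import SequenceMatcher

def edit_rules(source: str, target: str):
    """Collect the deletions and insertions needed to turn source into target."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, source, target).get_opcodes():
        if tag in ("replace", "delete"):
            ops.append("-" + source[i1:i2])
        if tag in ("replace", "insert"):
            ops.append("+" + target[j1:j2])
    return ops

print(edit_rules("give", "gave"))            # ['-i', '+a']
print(edit_rules("springen", "gesprungen"))  # ['+ge', '-i', '+u']
```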
... Because the lengths of signature sequences may exceed 10,000, in the sequence comparison step we adopt Hirschberg's algorithm [9], whose memory complexity is O(min(m, n)), to avoid running out of memory when computing the LCS. ...
Conference Paper
With the prevailing of smart devices (e.g., smart phone, routers, cameras), more and more programs are ported from traditional desktop platform to embedded hardware with ARM or MIPS architecture. While the compiled binary code differs significantly due to the variety of CPU architectures, these ported programs share the same code base of the desktop version. Thus it is feasible to utilize the program of commodity computer to help understand those cross-compiled binaries and locate functions with similar semantics. However, as instruction sets of different architectures are generally incomparable, it is difficult to conduct a static cross-architecture binary code similarity comparison. To address, we propose a semantic-based approach to fulfill this target. We dynamically extract the signature, which is composed of conditional operations behaviors as well as system call information, from binaries on different platforms with the same manner. Then the similarity of signatures is measured to help identify functions in ported programs. We have implemented the approach in MOCKINGBIRD, an automated analysis tool to compare code similarity between binaries across architectures. MOCKINGBIRD supports mainstream architectures and is able to analyze ELF executables on Linux platform. We have evaluated MOCKINGBIRD with a set of popular programs with cross-compiled versions. The results show our approach is not only effective for dealing with this new issue of cross-architecture binary code comparison, but also improves the accuracy of similarity based function identification due to the utilization of semantic information.
... The classic problem of the search for the longest common string in two symbol sequences has a long history [1,2,6,8,9,11]. In spite of a number of deep and valuable results [3-5,7,10] obtained in algorithm implementations for the problem, it is computationally challenging and an extremely active research field. In brief, the problem we address here is the following. ...
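Note that this excerpt concerns the longest common substring (a contiguous run), which is a different problem from the longest common subsequence discussed elsewhere on this page. For small inputs it has a simple rolling-row DP (sketch below, names mine); for genome-scale data one would use suffix structures or a sparse dictionary of repeats, as the cited work discusses.

```python
# Longest common *substring* (contiguous) via the classical O(|a||b|)-time DP
# with a rolling row; illustrative only.

def longest_common_substring(a: str, b: str) -> str:
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

print(longest_common_substring("GATTACAGATTACA", "CAGATTAGG"))  # CAGATTA
```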
Conference Paper
Full-text available
A new method to identify all sufficiently long repeating nucleotide substrings in one or several DNA sequences is proposed. The method is based on a specific gauge applied to DNA sequences that guarantees identification of the repeating substrings. The method allows the matching substrings to contain a given level of errors. The gauge is based on the development of a heavily sparse dictionary of repeats, thus drastically accelerating the search procedure. Some biological applications illustrate the method.
Chapter
A well-known method to classify anomalous and normal system behavior is clustering of log lines, which effectively allows to learn about the usual system events and their frequencies. However, this approach has been successfully applied mostly for forensic purposes only, where log data dumps are investigated retrospectively. In order to make this concept applicable in real-time, i.e., at the time the log lines are produced and processed in streams, some major extensions to existing approaches are required. Especially distance based clustering approaches usually fail building the required large distance matrices and rely on time-consuming recalculations of the cluster map on every arriving log line. In this setting, real-time classification of log lines and their corresponding events is typically a non-trivial task, for which an incremental clustering approach seems suitable. We introduce a semi-supervised concept for incremental clustering of log data that builds the basis for a novel real-time anomaly detection solution based on log data streams. Its operation is independent from the syntax and semantics of the processed log lines, which makes it generally applicable.
Thesis
This thesis work is part of the 3D NeuroSecure project. It is an investment project that aims to develop a secure collaborative solution for therapeutic innovation, bringing high performance processing (HPC) technology to the biomedical world. This solution will give experts from different fields the opportunity to navigate intuitively in Big Data imaging with access via 3D light terminals. Biomedical data protection against data leaks is of foremost importance. As such, the client environment and communications with the server must be secured. We focused our work on the development of an antimalware solution on the Android OS. We emphasize the creation of new algorithms, methods and tools that carry advantages over the current state of the art, but more importantly that can be used effectively in a production context. This is why what is proposed here is often a compromise between what theoretically can be done and its applicability. Algorithmic and technological choices are motivated by a relation of efficiency and performance results. This thesis contributes to the state of the art in the following areas: Static and dynamic analysis of Android applications, application web crawling. First, to search for malicious activities and vulnerabilities, one needs to design the tools that extract pertinent information from Android applications. It is the basis of any analysis. Furthermore, any classifier or detector is always limited by the informative power of the underlying data. An important part of this thesis is the design of efficient static and dynamic analysis tools for applications, such as a reverse engineering module, a network communication analysis tool, an instrumented Android system, application web crawlers, etc. Neural network initialization, training and anti-saturation techniques. Neural networks are randomly initialized. It is possible to control the underlying random distribution in order to reduce the saturation effect, the training time and the capacity to reach the global minimum. We developed an initialization procedure that enhances the results compared to the state of the art. We also revisited the ADAM algorithm to take into account interdependencies with regularization techniques, in particular Dropout. Last, we use anti-saturation techniques and we show that they are required to correctly train a neural network. An algorithm for collecting the common sequences in a sequence group. We propose a new algorithm for building the Embedding Antichain from the set of common subsequences. It is able to process and represent all common subsequences of a sequence set. It is a tool for solving the Systematic Characterization of Sequence Groups. This algorithm is a new path of research toward the automatic creation of malware family detection rules.
Article
The development of web applications based on the Internet, especially the mobile Internet, is very rapid, and traditional test methods have limitations. This paper proposes a method based on user session similarity and an improved agglomerative clustering algorithm to automatically generate test cases. This approach not only guarantees the validity of the test, but also maintains the order within the user's session. The method gives a definition of the similarity between two URLs, and then uses a dynamic programming algorithm to calculate the similarity between two user sessions. Secondly, a bottom-up agglomerative clustering algorithm is used to cluster similar user sessions. Finally, a new method is used to select representative test cases and remove redundant test cases. The experimental results show that the method in this paper can generate suitable test cases quickly and efficiently.
Conference Paper
Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and hence are not portable and not cache-adaptive. On the other hand, standard cache-oblivious recursive divide-and-conquer algorithms have optimal serial cache complexity but often have low parallelism due to artificial dependencies among subtasks. Recently, we introduced cache-oblivious recursive wavefront (COW) algorithms, which do not have any artificial dependencies, but they are too complicated to develop, analyze, implement, and generalize. Though COW algorithms are based on fork-join primitives, they extensively use atomic operations for ensuring correctness, and as a result, performance guarantees (i.e., parallel running time and parallel cache complexity) provided by state-of-the-art schedulers (e.g., the randomized work-stealing scheduler) for programs with fork-join primitives do not apply. Also, extensive use of atomic locks may result in high overhead in implementation. In this paper, we show how to systematically transform standard cache-oblivious recursive divide-and-conquer algorithms into recursive wavefront algorithms to achieve optimal parallel cache complexity and high parallelism under state-of-the-art schedulers for fork-join programs. Unlike COW algorithms these new algorithms do not use atomic operations. Instead, they use closed-form formulas to compute the time when each divide-and-conquer function must be launched in order to achieve high parallelism without losing cache performance. The resulting implementations are arguably much simpler than implementations of known COW algorithms. We present theoretical analyses and experimental performance and scalability results showing a superiority of these new algorithms over existing algorithms.
Conference Paper
Prediction of disease severity is highly essential for understanding the progression of disease and initiating an alternative path of execution, which is priceless in treatment planning. An online decision support system (ODeSS) is proposed here for stratification of the patients who may need Endoscopic Retrograde Cholangio-Pancreatography (ERCP) and recommend an alternate path of execution. By this an immediate intervention can be avoided. In this study gallstone disease (GSD) whose prevalence is increasing in India is considered. ODeSS is a versatile non-linear information model which clustered the traces based on the duration of its completion. This is a Retrospective analyses of 575 traces. ODeSS applied the technique of longest common subsequence for identifying the sequence of an online execution and discovering to which cluster of variants it may belong. This discovery assist in taking appropriate clinical decision by recommending an alternative path of execution for such cases which may need emergency interventions. ODeSS performance was evaluated using area under receiver operating characteristic curve (area under ROC curve). This showed an accuracy of 0.9653 in prediction. The proposed model was validated using ROC curve in k-fold cross validation. Hence the proposed ODeSS can be used to conduct a non-linear statistical analysis since, the relationships between the predictive variables are not linear. It can be used as a clinical practice to recommend the path of execution. This would assist in better treatment planning, avoiding future complications.
Conference Paper
Locating longest common subsequences is a typical and important problem. The original version of locating longest common subsequences, which stretches a longer alignment between a query and a database sequence, finds all alignments corresponding to the maximal length of common subsequences. However, the original version produces a lot of results, some of which are meaningless in practical applications and give rise to a lot of time overhead. In this paper, we first define longest common subsequences with limited penalty, to compute the longest common subsequences whose penalty values are not larger than a threshold τ. This helps us to find answers with good locality. We focus on the efficiency of this problem. We propose a basic approach for finding longest common subsequences with limited penalty. We further analyze features of longest common subsequences with limited penalty, and based on them we propose a filter-refine approach to reduce the number of candidates. We also adopt suffix arrays to efficiently generate common substrings, which helps in solving the problem. Experimental results on three real data sets show the effectiveness and efficiency of our algorithms.
Conference Paper
This paper presents an efficient and reliable protocol that enables a pipelined transmission using two channels in wireless sensor networks. Nodes in the network form a tree originating from a sink node. A sharable slot is allocated to each tree level and one unique channel is assigned to every other level in the tree. Data transmission is performed from the lowest level to the highest level, allowing two simultaneous transmissions. The level-order data transmission using a sharable slot also increases the reliability of data reception greatly because every node has multiple chances of receiving the same data while it reduces competition for data transmission. Using the simulator Cooja on Contiki OS, we showed that the proposed approach far outperformed the Deluge protocol in terms of completion time and control overhead.
Article
Full-text available
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures.
Article
We present AUTOGEN---an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size, and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to autodiscover efficient algorithms for several well-known problems. Our experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. Also these algorithms are less sensitive to fluctuations of memory and bandwidth compared with their looping counterparts, and their running times and energy profiles remain relatively more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms.
Article
This paper proposes an efficient parallel algorithm for an important class of dynamic programming problems that includes Viterbi, Needleman-Wunsch, Smith-Waterman, and Longest Common Subsequence. In dynamic programming, the subproblems that do not depend on each other, and thus can be computed in parallel, form stages, or wavefronts. The algorithm presented in this paper provides additional parallelism allowing multiple stages to be computed in parallel despite dependences among them. The correctness and the performance of the algorithm relies on rank convergence properties of matrix multiplication in the tropical semiring, formed with plus as the multiplicative operation and max as the additive operation. This paper demonstrates the efficiency of the parallel algorithm by showing significant speedups on a variety of important dynamic programming problems. In particular, the parallel Viterbi decoder is up to 24× faster (with 64 processors) than a highly optimized commercial baseline.
Article
Full-text available
Biological sequence alignment is a very popular application in Bioinformatics, used routinely worldwide. Many implementations of biological sequence alignment algorithms have been proposed for multicores, GPUs, FPGAs and CellBEs. These implementations are platform-specific; porting them to other systems requires considerable programming effort. This article proposes and evaluates MASA, a flexible and customizable software architecture that enables the execution of biological sequence alignment applications with three variants (local, global, and semiglobal) in multiple hardware/software platforms with block pruning, which is able to reduce significantly the amount of data processed. To attain our flexibility goals, we also propose a generic version of block pruning and developed multiple parallelization strategies as building blocks, including a new asynchronous dataflow-based parallelization, which may be combined to implement efficient aligners in different platforms. We provide four MASA aligner implementations for multicores (OmpSs and OpenMP), GPU (CUDA), and Intel Phi (OpenMP), showing that MASA is very flexible. The evaluation of our generic block pruning strategy shows that it significantly outperforms the previously proposed block pruning, being able to prune up to 66.5% of the cells when using the new dataflow-based parallelization strategy.