Publications (34) · 8.59 Total Impact
ABSTRACT: This special issue of Theoretical Computer Science contains extended versions of selected contributions to the 23rd International Symposium on Algorithms and Computation (ISAAC 2012), held in Taipei, Taiwan, on December 19–21, 2012.
Theoretical Computer Science 12/2014; 544:1–2. · 0.49 Impact Factor
ABSTRACT: Human leukocyte antigen (HLA) genes are critical genes involved in important biomedical processes, including organ transplantation, autoimmune diseases and infectious diseases. The gene family contains the most polymorphic genes in humans, and in many cases the difference between two alleles is only a single base-pair substitution. Next-generation sequencing (NGS) technologies can be used for high-throughput HLA typing, but in silico methods are still needed to correctly assign the alleles of a sample. Computer scientists have developed such methods for various NGS platforms, such as Illumina, Roche 454 and Ion Torrent, based on the characteristics of the reads they generate. However, methods for PacBio reads have received less attention, probably owing to their high error rates. The PacBio system has the longest read length among available NGS platforms, and is therefore the only platform capable of placing exon 2 and exon 3 of HLA genes on the same read to unequivocally resolve the ambiguity caused by the "phasing" issue.
BMC Bioinformatics 09/2014; 15(1):296. · 3.02 Impact Factor
Article: GUEST EDITORS' FOREWORD
International Journal of Computational Geometry & Applications 12/2013; 23(6):425–426. · 0.18 Impact Factor
Conference Paper: Preserving Inversion Phylogeny Reconstruction
ABSTRACT: Tractability results are rare in the comparison of gene orders for more than two genomes. Here we present a linear-time algorithm for the small parsimony problem (inferring ancestral genomes given a phylogeny on an arbitrary number of genomes) in the case where gene orders are permutations that evolve by inversions not breaking common gene intervals, and these intervals are organised in a linear structure. We present two examples where this allows the ancestral gene orders to be reconstructed in phylogenies of several γ-Proteobacteria species and Burkholderia strains, respectively. We prove in addition that the large parsimony problem (where the phylogeny is part of the output) remains NP-complete.
Proceedings of the 12th International Conference on Algorithms in Bioinformatics; 09/2012
ABSTRACT: In this paper, we study the palindrome retrieval problem with the input string compressed into run-length encoded form. Given a run-length encoded string rle(T), we show how to preprocess rle(T) to support subsequent queries of the longest palindrome centered at any specified position and having any specified number of mismatches between its arms. We present two algorithms for the problem, both taking time and space polynomial in the compressed string size. Let n denote the number of runs of rle(T) and let k denote the number of mismatches. The first algorithm, devised for small k, identifies the desired palindrome in O(log n + min{k, n}) time with O(n log n) preprocessing time, while the second algorithm achieves O(log² n) query time, independent of k, after O(n² log n)-time preprocessing.
Theoretical Computer Science 05/2012; 432:28–37. · 0.49 Impact Factor
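As a point of reference for the compressed setting above, run-length encoding itself and a naive uncompressed mismatch-tolerant palindrome query can be sketched in a few lines (the helper names are illustrative, not from the paper, and the naive query takes linear rather than logarithmic time):

```python
from itertools import groupby

def rle(text):
    """Compress a string into (character, run-length) pairs."""
    return [(ch, len(list(grp))) for ch, grp in groupby(text)]

def longest_palindrome_at(text, center, k):
    """Naive baseline: arm length of the longest odd-length palindrome
    centered at `center`, tolerating at most k mismatches between arms."""
    mismatches = 0
    arm = 0
    while center - arm - 1 >= 0 and center + arm + 1 < len(text):
        if text[center - arm - 1] != text[center + arm + 1]:
            if mismatches == k:
                break
            mismatches += 1
        arm += 1
    return arm
```

The paper's contribution is answering the same query from rle(T) alone, in time polylogarithmic in the number of runs.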
ABSTRACT: Keyword search is a friendly mechanism for users to identify desired information in XML databases, and LCA is a popular concept for locating the meaningful subtrees corresponding to query keywords. Among all the LCA-based approaches, MaxMatch [9] is the only one that achieves the properties of monotonicity and consistency, by outputting only contributors instead of the whole subtree. Although the MaxMatch algorithm performs efficiently in some cases, there is still room for improvement. In this paper, we first propose to improve its performance by avoiding unnecessary index accesses. We then speed up the process of subset detection, which is a core procedure for determining contributors. The resultant algorithms are called MinMap and MinMap+, respectively. Finally, we analytically and empirically demonstrate the efficiency of our methods. According to our experiments, both algorithms work better than the existing one, and MinMap+ is particularly helpful when the breadth of the tree is large and the number of keywords grows.
SIGMOD Record 01/2011; 40:5–10.
ABSTRACT: A tandem duplication random loss (TDRL) operation duplicates a contiguous segment of genes, followed by the random loss of one copy of each of the duplicated genes. Although the importance of this operation is supported by several recent biological studies, it has rarely been investigated from a theoretical point of view. Of particular interest are sorting TDRLs, i.e. TDRLs that, when applied to a permutation representing a genome, reduce the distance towards another given permutation. The identification of sorting genome rearrangement operations in general is a key ingredient of many algorithms for reconstructing the evolutionary history of a set of species. In this paper we present methods to compute all sorting TDRLs for two given gene orders. In addition, a closed formula for the number of sorting TDRLs is derived and further properties of sorting TDRLs are investigated. It is also shown that the theoretical findings are useful for identifying unique sorting TDRL scenarios for mitochondrial gene orders.
Journal of Discrete Algorithms 01/2011; 9:32–48.
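A TDRL operation as defined above can be sketched directly on a permutation (a minimal illustration with hypothetical parameter names; the paper's machinery for enumerating *sorting* TDRLs is not reproduced here):

```python
import random

def tdrl(perm, i, j, keep_first=None, rng=None):
    """Tandem duplication random loss on the segment perm[i:j]:
    the segment is duplicated in tandem, then exactly one copy of each
    duplicated gene is lost. keep_first maps each gene in the segment to
    True (its first copy survives) or False (its second copy survives);
    if None, the choice is made at random."""
    segment = perm[i:j]
    if keep_first is None:
        rng = rng or random.Random()
        keep_first = {g: rng.random() < 0.5 for g in segment}
    survivors_first = [g for g in segment if keep_first[g]]
    survivors_second = [g for g in segment if not keep_first[g]]
    return perm[:i] + survivors_first + survivors_second + perm[j:]
```

Note that the result is always a permutation again: every gene survives in exactly one of the two copies, in an order interleaving two subsequences of the original segment.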
Conference Paper: Identifying Relevant Matches with NOT Semantics over XML Documents.
ABSTRACT: Keyword search over XML documents has been widely studied in recent years. It allows users to retrieve relevant data from XML documents without learning complicated query languages. SLCA (smallest lowest common ancestor)-based keyword search is a common mechanism to locate the desirable LCAs for the given query keywords, but the conventional SLCA-based keyword search supports AND-only semantics. In this paper, we extend SLCA keyword search to a more general case, where the keyword query can be an arbitrary combination of AND, OR, and NOT operators. We further define the query result based on the monotonicity and consistency properties, and propose an efficient algorithm to find the SLCAs and the relevant matches. Since the keyword query becomes more complex, we also discuss the variations of the monotonicity and consistency properties in our framework. Finally, the experimental results show that the proposed algorithm runs efficiently and gives reasonable query results, as measured by processing time, scalability, precision, and recall.
Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011, Hong Kong, China, April 22–25, 2011, Proceedings, Part I; 01/2011
ABSTRACT: In this paper, we consider a commonly used compression scheme called run-length encoding. We provide both lower and upper bounds for the problems of comparing two run-length encoded strings. Specifically, we prove the 3SUM-hardness of both the wildcard matching problem and the k-mismatch problem with run-length compressed inputs. Given two run-length encoded strings of m and n runs, such a result implies that it is very unlikely to devise an o(mn)-time algorithm for either of them. We then present an in-place algorithm running in O(mn log m) time for their combined problem, i.e. k-mismatch with wildcards. We further demonstrate that if the aim is to report the positions of all the occurrences, there exists a stronger barrier of Ω(mn log m) time, matching the running time of our algorithm. Moreover, our algorithm can be easily generalized to a two-dimensional setting without impairing the time and space complexity.
Journal of Complexity 01/2010; 26:364–374. · 1.22 Impact Factor
Conference Paper: A Fully Compressed Algorithm for Computing the Edit Distance of Run-Length Encoded Strings.
ABSTRACT: In this paper, a commonly used data compression scheme, called run-length encoding, is employed to speed up the computation of the edit distance between two strings. Our algorithm is the first to be "fully compressed," meaning that it runs in time polynomial in the number of runs of both strings. Specifically, given two strings compressed into m and n runs, m ≤ n, we present an O(mn²)-time algorithm for computing the edit distance of the two strings. Our approach also gives the first fully compressed algorithm for approximate matching of a pattern of m runs in a text of n runs in O(mn²) time.
Algorithms - ESA 2010, 18th Annual European Symposium, Liverpool, UK, September 6–8, 2010. Proceedings, Part I; 01/2010
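For contrast with the fully compressed algorithm, the classic uncompressed edit-distance dynamic program runs in time proportional to the product of the string lengths (a standard textbook baseline, not the paper's algorithm):

```python
def edit_distance(s, t):
    """Classic O(|s|·|t|)-time, O(|t|)-space dynamic program for the
    Levenshtein (edit) distance between strings s and t."""
    n = len(t)
    prev = list(range(n + 1))        # distances from s[:0] to every prefix of t
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # delete s[i-1]
                          curr[j - 1] + 1,    # insert t[j-1]
                          prev[j - 1] + cost) # match or substitute
        prev = curr
    return prev[n]
```

When both inputs are long but highly repetitive (few runs), the run-length aware O(mn²) algorithm above can be far faster than this quadratic baseline.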
Conference Paper: Faster Algorithms for Searching Relevant Matches in XML Databases.
ABSTRACT: Keyword search is a friendly mechanism for the end user to identify interesting nodes in XML databases, and SLCA (smallest lowest common ancestor)-based keyword search is a popular concept for locating the desirable subtrees corresponding to the given query keywords. However, it does not evaluate the importance of each node under those subtrees. Liu and Chen proposed a new concept, contributor, to output the relevant matches instead of all the keyword nodes. In this paper, we propose two methods, MinMap and SingleProbe, that improve the efficiency of searching for the relevant matches by avoiding unnecessary index accesses. We analytically and empirically demonstrate the efficiency of our approaches. According to our experiments, both approaches work better than the existing one. Moreover, SingleProbe is generally better than MinMap if the minimum frequency and the maximum frequency of the query keywords are close.
Database and Expert Systems Applications, 21st International Conference, DEXA 2010, Bilbao, Spain, August 30 – September 3, 2010, Proceedings, Part I; 01/2010
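The SLCA notion underlying this line of work can be made concrete with Dewey labels (a brute-force sketch for intuition only; the papers' algorithms avoid enumerating all combinations of keyword matches):

```python
from functools import reduce
from itertools import product

def dewey_lca(a, b):
    """LCA of two nodes given as Dewey labels (tuples of child
    indices): their longest common prefix."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def slca(keyword_lists):
    """Smallest LCAs, given one list of Dewey labels per keyword:
    take the LCA of every combination of one match per keyword, then
    drop any candidate that has another candidate as a proper
    descendant (i.e. as a prefix extension of itself)."""
    candidates = {reduce(dewey_lca, combo) for combo in product(*keyword_lists)}
    return sorted(c for c in candidates
                  if not any(d != c and d[:len(c)] == c for d in candidates))
```

Each surviving node contains every query keyword in its subtree while no descendant of it does, which is exactly the SLCA semantics the abstract refers to.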
Conference Paper: Identifying Approximate Palindromes in Run-Length Encoded Strings.
ABSTRACT: We study the problem of identifying palindromes in compressed strings. The underlying compression scheme is called run-length encoding, which has been extensively studied and widely applied in diverse areas. Given a run-length encoded string rle(T), we show how to preprocess rle(T) to support efficient retrieval of the longest palindrome with a specified center position and a tolerated number of mismatches between its two arms. Let n be the number of runs of rle(T) and k be the tolerated number of mismatches. We present two algorithms for the problem, both with preprocessing time polynomial in the number of runs. The first algorithm, devised for small k, identifies the desired palindrome in O(log n + min{k, n}) time with O(n log n) preprocessing time, while the second algorithm achieves O(log² n) query time, independent of k, after O(n² log n)-time preprocessing.
Algorithms and Computation - 21st International Symposium, ISAAC 2010, Jeju Island, Korea, December 15–17, 2010, Proceedings, Part II; 01/2010
Int. J. Found. Comput. Sci. 01/2010; 21:925–939.
ABSTRACT: We study the problem of finding all maximal approximate gapped palindromes in a string. More specifically, given a string S of length n, a parameter q ≥ 0 and a threshold k > 0, the problem is to identify all substrings in S of the form uvw such that (1) the Levenshtein distance between u and wʳ is at most k, where wʳ is the reverse of w, and (2) v is a string of length q. The best previous work requires O(k²n) time. In this paper, we propose an O(kn)-time algorithm for this problem by utilizing an incremental string comparison technique. It turns out that the core technique actually solves a more general incremental string comparison problem that allows the insertion, deletion, and substitution of multiple symbols.
12/2009: pages 1084–1093;
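The object being searched for can be pinned down with a brute-force checker (illustrative only; it ignores maximality and runs in polynomial but far-from-O(kn) time):

```python
def lev(a, b):
    """Standard Levenshtein distance dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def gapped_palindromes(s, q, k):
    """All (start, end) index pairs of substrings u v w of s such that
    |v| == q, u and w are nonempty, and lev(u, reverse(w)) <= k."""
    n, found = len(s), []
    for i in range(n):
        for mid in range(i + 1, n - q):          # u = s[i:mid], v = s[mid:mid+q]
            for end in range(mid + q + 1, n + 1):  # w = s[mid+q:end]
                u, w = s[i:mid], s[mid + q:end]
                if lev(u, w[::-1]) <= k:
                    found.append((i, end))
    return found
```

For example, in "abXba" with gap length q = 1 and k = 0 mismatches, the arms u = "ab" and w = "ba" around the gap "X" form a gapped palindrome spanning the whole string.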
ABSTRACT: A tandem duplication random loss (TDRL) operation duplicates a contiguous segment of genes, followed by the loss of one copy of each of the duplicated genes. Although the importance of this operation is supported by several recent biological studies, it has rarely been investigated from a theoretical point of view. Of particular interest are sorting TDRLs, i.e. TDRLs that, when applied to a permutation representing a genome, reduce the distance towards another given permutation. The identification of sorting genome rearrangement operations in general is a key ingredient of many algorithms for reconstructing the evolutionary history of a set of species. In this paper we present methods to compute all sorting TDRLs for two given gene orders. In addition, a closed formula for the number of sorting TDRLs is derived and further properties of sorting TDRLs are investigated. It is also shown that the theoretical findings are useful for identifying unique sorting TDRL scenarios for mitochondrial gene orders.
06/2009: pages 301–313;
Conference Paper: Finding All Sorting Tandem Duplication Random Loss Operations.
Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22–24, 2009, Proceedings; 01/2009
Conference Paper: Finding All Approximate Gapped Palindromes.
Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16–18, 2009. Proceedings; 01/2009
Conference Paper: Approximate Matching for Run-Length Encoded Strings Is 3SUM-Hard.
ABSTRACT: In this paper, we consider a commonly used compression scheme called run-length encoding (abbreviated rle). We provide lower bounds for problems of approximately matching two rle strings. Specifically, we show that the wildcard matching and k-mismatches problems for rle strings are 3SUM-hard. For two rle strings of m and n runs, such a result implies that it is very unlikely to devise an o(mn)-time algorithm for either problem. We then propose an O(mn + p log m)-time sweep-line algorithm for their combined problem, i.e. wildcard matching with mismatches, where p ≤ mn is the number of matched or mismatched runs. Furthermore, the problem of aligning two rle strings is also shown to be 3SUM-hard.
Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22–24, 2009, Proceedings; 01/2009
ABSTRACT: The range minimum query problem, RMQ for short, is to preprocess a sequence of real numbers A[1…n] for subsequent queries of the form: "Given indices i, j, what is the index of the minimum value of A[i…j]?" This problem has been shown to be linearly equivalent to the LCA problem, in which a tree is preprocessed for answering the lowest common ancestor of two nodes. It has also been shown that both the RMQ and LCA problems can be solved in linear preprocessing time and constant query time under the unit-cost RAM model. This paper studies a new query problem arising from the analysis of biological sequences. Specifically, we wish to answer queries of the form: "Given indices i and j, what is the maximum-sum segment of A[i…j]?" We establish a linear equivalence relation between RMQ and this new problem. As a consequence, we can solve the new query problem in linear preprocessing time and constant query time under the unit-cost RAM model. We then present alternative linear-time solutions for two other biological sequence analysis problems to demonstrate the utility of the techniques developed in this paper.
Discrete Applied Mathematics 01/2007; 155:2043–2052. · 0.72 Impact Factor
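One standard way to get constant-time RMQ is the sparse-table method; a sketch with O(n log n) preprocessing, not the linear-time construction the abstract refers to:

```python
def build_sparse_table(a):
    """O(n log n)-time preprocessing for O(1) range-minimum queries:
    table[p][i] holds the index of the minimum of a[i : i + 2**p]."""
    n = len(a)
    table = [list(range(n))]          # level 0: each element is its own minimum
    p = 1
    while (1 << p) <= n:
        prev, row = table[-1], []
        for i in range(n - (1 << p) + 1):
            left, right = prev[i], prev[i + (1 << (p - 1))]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        p += 1
    return table

def rmq(a, table, i, j):
    """Index of the minimum of a[i..j] (inclusive), via two
    possibly-overlapping power-of-two blocks covering the range."""
    p = (j - i + 1).bit_length() - 1
    left, right = table[p][i], table[p][j - (1 << p) + 1]
    return left if a[left] <= a[right] else right
```

Every query range is covered by two blocks of length 2ᵖ that may overlap; since minimum is idempotent, the overlap does no harm, which is what makes the constant-time query possible.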
Article: Improved algorithms for the
Theor. Comput. Sci. 01/2006; 362:162–170.
Publication Stats
121 Citations
8.59 Total Impact Points
Institutions

2014

National Chung Hsing University
Taichung, Taiwan


2004–2012

National Taiwan University
Department of Computer Science and Information Engineering
Taipei, Taiwan
