ABSTRACT: The Boolean network can be used as a mathematical model for gene regulatory networks. An attractor, which is a state of a Boolean network repeating itself periodically, can represent a stable stage of a gene regulatory network. It is known that the problem of finding an attractor of the shortest period is NP-hard. In this article, we give a fixed-parameter algorithm for detecting a singleton attractor (SA) for a Boolean network that has only AND and OR Boolean functions of literals and has bounded treewidth k. The algorithm is further extended to detect an SA for a constant-depth nested canalyzing Boolean network with bounded treewidth. We also prove the fixed-parameter intractability of the detection of an SA for a general Boolean network with bounded treewidth. IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences 01/2015; E98.A(1):384–390. DOI:10.1587/transfun.E98.A.384 · 0.23 Impact Factor
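To make the notion of a singleton attractor concrete, here is a minimal brute-force sketch. It is not the fixed-parameter algorithm from the paper: it simply enumerates all states, so it takes exponential time. The three-node network `RULES` and the helper names are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical 3-node network (an assumption for illustration).
# Each node is an AND or OR of literals; a literal is (node_index, negated).
RULES = [
    ("AND", [(1, False), (2, False)]),  # x0' = x1 AND x2
    ("OR",  [(0, False), (2, True)]),   # x1' = x0 OR (NOT x2)
    ("AND", [(0, False), (1, False)]),  # x2' = x0 AND x1
]

def step(state, rules):
    """Apply one synchronous update to a state (a tuple of bools)."""
    nxt = []
    for op, literals in rules:
        # A negated literal flips the value of the node it refers to.
        values = [state[i] != negated for i, negated in literals]
        nxt.append(all(values) if op == "AND" else any(values))
    return tuple(nxt)

def singleton_attractors(rules):
    """List every state s with step(s) == s by exhaustive enumeration."""
    n = len(rules)
    return [s for s in product([False, True], repeat=n) if step(s, rules) == s]

print(singleton_attractors(RULES))
```

A state is a singleton attractor exactly when one update step maps it to itself, which is the condition checked in `singleton_attractors`.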

Theoretical Computer Science 12/2014; 544(4):1–2. DOI:10.1016/j.tcs.2014.06.034 · 0.66 Impact Factor

ABSTRACT: Background
Human leukocyte antigen (HLA) genes are critical genes involved in important biomedical aspects, including organ transplantation, autoimmune diseases and infectious diseases. The gene family contains the most polymorphic genes in humans and the difference between two alleles is only a single base pair substitution in many cases. Next-generation sequencing (NGS) technologies could be used for high-throughput HLA typing, but in silico methods are still needed to correctly assign the alleles of a sample. Computer scientists have developed such methods for various NGS platforms, such as Illumina, Roche 454 and Ion Torrent, based on the characteristics of the reads they generate. However, methods for PacBio reads have received less attention, probably owing to the platform's high error rates. The PacBio system has the longest read length among available NGS platforms, and therefore is the only platform capable of having exon 2 and exon 3 of HLA genes on the same read to unequivocally solve the ambiguity problem caused by the “phasing” issue.
Results
We proposed a new method, BayesTyping1, to assign HLA alleles for PacBio circular consensus sequencing reads using Bayes’ theorem. The method was applied to simulated data of the three loci HLA-A, HLA-B and HLA-DRB1. The experimental results showed its capability to tolerate the disturbance of sequencing errors and external noise reads.
Conclusions
The BayesTyping1 method could overcome the problems of HLA typing using PacBio reads, which mostly arise from sequencing errors of PacBio reads and the divergence of HLA genes, to some extent.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-296) contains supplementary material, which is available to authorized users. BMC Bioinformatics 09/2014; 15(1):296. DOI:10.1186/1471-2105-15-296 · 2.58 Impact Factor
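The general idea of assigning an allele with Bayes' theorem can be sketched as follows. This is not the BayesTyping1 model: the uniform prior, the independent per-base substitution error model, the assumption that reads are already aligned to the allele sequences, and all names (`allele_posteriors`, `allele_1`, `allele_2`) are simplifications introduced here for illustration.

```python
import math

def allele_posteriors(reads, alleles, error_rate=0.1, priors=None):
    """Posterior P(allele | reads) under a toy independent-error model.

    Assumptions (not from the paper): reads are pre-aligned and as long as
    the allele sequences; each base is wrong with probability error_rate,
    spread evenly over the three other bases.
    """
    names = list(alleles)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    log_post = {}
    for name in names:
        total = math.log(priors[name])
        for read in reads:
            for r, a in zip(read, alleles[name]):
                total += math.log(1 - error_rate if r == a else error_rate / 3)
        log_post[name] = total
    # Normalise in log space to avoid underflow with many reads.
    peak = max(log_post.values())
    weights = {name: math.exp(v - peak) for name, v in log_post.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}

# Two hypothetical alleles differing by a single base.
alleles = {"allele_1": "ACGT", "allele_2": "ACGA"}
reads = ["ACGT", "ACGT", "ACGA"]
posteriors = allele_posteriors(reads, alleles)
print(max(posteriors, key=posteriors.get))
```

Even in this toy setting, a single erroneous read does not flip the call, because the evidence from the remaining reads dominates the posterior.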

International Journal of Computational Geometry & Applications 12/2013; 23(06):425–426. DOI:10.1142/S0218195913020032 · 0.08 Impact Factor

ABSTRACT: Tractability results are rare in the comparison of gene orders for more than two genomes. Here we present a linear-time algorithm for the small parsimony problem (inferring ancestral genomes given a phylogeny on an arbitrary number of genomes) in the case where gene orders are permutations that evolve by inversions not breaking common gene intervals, and these intervals are organised in a linear structure. We present two examples where this allows us to reconstruct the ancestral gene orders in phylogenies of several γ-Proteobacteria species and Burkholderia strains, respectively. We prove in addition that the large parsimony problem (where the phylogeny is output) remains NP-complete. Proceedings of the 12th international conference on Algorithms in Bioinformatics; 09/2012

ABSTRACT: In this paper, we study the palindrome retrieval problem with the input string compressed into run-length encoded form. Given a run-length encoded string rle(T), we show how to preprocess rle(T) to support subsequent queries of the longest palindrome centered at any specified position and having any specified number of mismatches between its arms. We present two algorithms for the problem, both taking time and space polynomial in the compressed string size. Let n denote the number of runs of rle(T) and let k denote the number of mismatches. The first algorithm, devised for small k, identifies the desired palindrome in O(log n + min{k, n}) time with O(n log n) preprocessing time, while the second algorithm achieves O(log² n) query time, independent of k, after O(n² log n)-time preprocessing. Theoretical Computer Science 05/2012; 432:28–37. DOI:10.1016/j.tcs.2012.01.023 · 0.66 Impact Factor
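The query described above can be illustrated with a naive baseline. The sketch below decodes the run-length encoding and expands around the center one character at a time, so it runs in time linear in the uncompressed length rather than in the bounds stated in the abstract. The function names are hypothetical, and only centers that fall on a character (odd-length palindromes) are handled.

```python
from itertools import groupby

def rle_encode(text):
    """Run-length encode a string into (character, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(runs):
    """Expand (character, run_length) pairs back into a string."""
    return "".join(ch * count for ch, count in runs)

def longest_palindrome_at(runs, center, k):
    """Longest odd-length palindrome centered at index `center` of the
    decoded string, tolerating at most k mismatches between its arms.
    Naive baseline: works on the decoded string, not on the runs.
    """
    text = rle_decode(runs)
    radius, mismatches = 0, 0
    while center - radius - 1 >= 0 and center + radius + 1 < len(text):
        if text[center - radius - 1] != text[center + radius + 1]:
            if mismatches == k:
                break
            mismatches += 1
        radius += 1
    return text[center - radius:center + radius + 1]

runs = rle_encode("aabcbda")
print(runs)
print(longest_palindrome_at(runs, 3, 0))
print(longest_palindrome_at(runs, 3, 1))
```

With no mismatches allowed the arms stop at the `a`/`d` pair; allowing one mismatch lets the palindrome extend across it.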

ISAAC; 01/2012

ABSTRACT: Keyword search is a friendly mechanism for users to identify desired information in XML databases, and LCA is a popular concept for locating the meaningful subtrees corresponding to query keywords. Among all the LCA-based approaches, MaxMatch [9] is the only one that achieves the properties of monotonicity and consistency, by outputting only contributors instead of the whole subtree. Although the MaxMatch algorithm performs efficiently in some cases, there is still room for improvement. In this paper, we first propose to improve its performance by avoiding unnecessary index accesses. We then speed up the process of subset detection, which is a core procedure for determining contributors. The resultant algorithms are called MinMap and MinMap+, respectively. Finally, we analytically and empirically demonstrate the efficiency of our methods. According to our experiments, our two algorithms work better than the existing one, and MinMap+ is particularly helpful when the breadth of the tree is large and the number of keywords grows. ACM SIGMOD Record 07/2011; 40(1):5–10. DOI:10.1145/2007206.2007208 · 1.05 Impact Factor

ABSTRACT: A tandem duplication random loss (TDRL) operation duplicates a contiguous segment of genes, followed by the random loss of one copy of each of the duplicated genes. Although the importance of this operation is supported by several recent biological studies, it has been investigated only rarely from a theoretical point of view. Of particular interest are sorting TDRLs, which are TDRLs that, when applied to a permutation representing a genome, reduce the distance towards another given permutation. The identification of sorting genome rearrangement operations in general is a key ingredient of many algorithms for reconstructing the evolutionary history of a set of species. In this paper we present methods to compute all sorting TDRLs for two given gene orders. In addition, a closed formula for the number of sorting TDRLs is derived and further properties of sorting TDRLs are investigated. It is also shown that the theoretical findings are useful for identifying unique sorting TDRL scenarios for mitochondrial gene orders. Journal of Discrete Algorithms 03/2011; 9(1):32–48. DOI:10.1016/j.jda.2010.09.006
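The effect of a single TDRL on a gene order can be sketched directly from the definition above. The function name `apply_tdrl` and its parameters are hypothetical; the sketch only applies one operation and does not attempt the enumeration of sorting TDRLs that the paper addresses.

```python
def apply_tdrl(genome, start, end, keep_first):
    """Apply one tandem duplication random loss to a gene order.

    The segment genome[start:end] is duplicated in tandem. Genes listed in
    keep_first survive in the first copy; every other gene of the segment
    survives in the second copy. Relative order inside each copy is kept.
    """
    segment = genome[start:end]
    first_copy = [g for g in segment if g in keep_first]
    second_copy = [g for g in segment if g not in keep_first]
    return genome[:start] + first_copy + second_copy + genome[end:]

# Duplicate the segment 2 3 4 5, keep genes 2 and 4 in the first copy.
print(apply_tdrl([1, 2, 3, 4, 5, 6], 1, 5, {2, 4}))
```

Each duplicated gene appears exactly once in the result, so the output is again a permutation of the same genes.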

ABSTRACT: Keyword search over XML documents has been widely studied in recent years. It allows users to retrieve relevant data from XML documents without learning complicated query languages. SLCA (smallest lowest common ancestor)-based keyword search is a common mechanism to locate the desirable LCAs for the given query keywords, but the conventional SLCA-based keyword search is for AND-only semantics. In this paper, we extend the SLCA keyword search to a more general case, where the keyword query could be an arbitrary combination of AND, OR, and NOT operators. We further define the query result based on the monotonicity and consistency properties, and propose an efficient algorithm to figure out the SLCAs and the relevant matches. Since the keyword query becomes more complex, we also discuss the variations of the monotonicity and consistency properties in our framework. Finally, the experimental results show that the proposed algorithm runs efficiently and gives reasonable query results by measuring the processing time, scalability, precision, and recall. Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011, Hong Kong, China, April 22-25, 2011, Proceedings, Part I; 01/2011
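For reference, the conventional AND-only SLCA semantics mentioned above can be sketched naively. The sketch assumes nodes are identified by Dewey-style labels stored as tuples, with the root as the empty tuple; that encoding and the function names are assumptions made here. It enumerates every combination of matches, so it is exponential in the number of keywords and is not the algorithm proposed in the paper.

```python
from itertools import product

def lca(labels):
    """Lowest common ancestor of Dewey labels: their longest common prefix."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def slca(keyword_matches):
    """Smallest LCAs for an AND query.

    keyword_matches holds one list of Dewey labels per keyword. A candidate
    is the LCA of one match per keyword; an SLCA is a candidate with no
    other candidate strictly below it.
    """
    candidates = {lca(combo) for combo in product(*keyword_matches)}

    def has_descendant(node):
        return any(len(c) > len(node) and c[:len(node)] == node
                   for c in candidates)

    return sorted(c for c in candidates if not has_descendant(c))

# Keyword A matches nodes (0,0) and (1,0); keyword B matches (0,1) and (1,1,0).
print(slca([[(0, 0), (1, 0)], [(0, 1), (1, 1, 0)]]))
```

The root is also a common ancestor in this example, but it is excluded because the subtrees rooted at `(0,)` and `(1,)` already contain both keywords.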

ABSTRACT: In this paper, we consider a commonly used compression scheme called run-length encoding. We provide both lower and upper bounds for the problems of comparing two run-length encoded strings. Specifically, we prove the 3SUM-hardness for both the wildcard matching problem and the k-mismatch problem with run-length compressed inputs. Given two run-length encoded strings of m and n runs, such a result implies that it is very unlikely to devise an o(mn)-time algorithm for either of them. We then present an in-place algorithm running in O(mn log m) time for their combined problem, i.e. k-mismatch with wildcards. We further demonstrate that if the aim is to report the positions of all the occurrences, there exists a stronger barrier of Ω(mn log m) time, matching the running time of our algorithm. Moreover, our algorithm can be easily generalized to a two-dimensional setting without impairing the time and space complexity. Journal of Complexity 08/2010; 26(4):364–374. DOI:10.1016/j.jco.2010.03.003 · 1.50 Impact Factor
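The combined problem, k-mismatch with wildcards, can be stated operationally with a naive baseline. The sketch below decodes both inputs and checks every alignment, so its cost depends on the uncompressed lengths rather than on the run counts m and n. The wildcard symbol `*`, the choice to let a wildcard on either side match anything, and the function names are assumptions made for illustration.

```python
def decode(runs):
    """Expand (character, run_length) pairs into a plain string."""
    return "".join(ch * count for ch, count in runs)

def k_mismatch_with_wildcards(text_runs, pattern_runs, k, wildcard="*"):
    """Return every start position where the pattern matches the text with
    at most k mismatches; a wildcard on either side never counts as one.
    Naive baseline on the decoded strings.
    """
    text, pattern = decode(text_runs), decode(pattern_runs)
    hits = []
    for pos in range(len(text) - len(pattern) + 1):
        mismatches = sum(
            1 for t, p in zip(text[pos:], pattern)
            if t != p and wildcard not in (t, p)
        )
        if mismatches <= k:
            hits.append(pos)
    return hits

text_runs = [("a", 3), ("b", 2), ("a", 2)]      # "aaabbaa"
pattern_runs = [("a", 1), ("*", 1), ("b", 1)]   # "a*b"
print(k_mismatch_with_wildcards(text_runs, pattern_runs, 0))
print(k_mismatch_with_wildcards(text_runs, pattern_runs, 1))
```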

ABSTRACT: In this paper, a commonly used data compression scheme, called run-length encoding, is employed to speed up the computation of edit distance between two strings. Our algorithm is the first to be “fully compressed,” meaning that it runs in time polynomial in the number of runs of both strings. Specifically, given two strings, compressed into m and n runs, m ≤ n, we present an O(mn²)-time algorithm for computing the edit distance of the two strings. Our approach also gives the first fully compressed algorithm for approximate matching of a pattern of m runs in a text of n runs in O(mn²) time. Algorithms - ESA 2010, 18th Annual European Symposium, Liverpool, UK, September 6-8, 2010. Proceedings, Part I; 01/2010
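As a point of comparison for the fully compressed bound, the classical dynamic program for edit distance can be run on the decoded strings. This baseline is not the paper's algorithm: its running time is proportional to the product of the uncompressed lengths, which is exactly the dependence a fully compressed algorithm avoids. The function name is hypothetical.

```python
def edit_distance_rle(runs_a, runs_b):
    """Edit distance (insert, delete, substitute, unit cost each) between
    two run-length encoded strings, computed by decoding them first and
    running the textbook row-by-row dynamic program.
    """
    a = "".join(ch * count for ch, count in runs_a)
    b = "".join(ch * count for ch, count in runs_b)
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]

# "aaabb" versus "aabbb": one substitution suffices.
print(edit_distance_rle([("a", 3), ("b", 2)], [("a", 2), ("b", 3)]))
```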

ABSTRACT: We study the problem of identifying palindromes in compressed strings. The underlying compression scheme is called run-length encoding, which has been extensively studied and widely applied in diverse areas. Given a run-length encoded string rle(T), we show how to preprocess rle(T) to support efficient retrieval of the longest palindrome with a specified center position and a tolerated number of mismatches between its two arms. Let n be the number of runs of rle(T) and k be the tolerated number of mismatches. We present two algorithms for the problem, both with preprocessing time polynomial in the number of runs. The first algorithm, devised for small k, identifies the desired palindrome in O(log n + min{k, n}) time with O(n log n) preprocessing time, while the second algorithm achieves O(log² n) query time, independent of k, after O(n² log n)-time preprocessing. Algorithms and Computation - 21st International Symposium, ISAAC 2010, Jeju Island, Korea, December 15-17, 2010, Proceedings, Part II; 01/2010

ABSTRACT: Keyword search is a friendly mechanism for the end user to identify interesting nodes in XML databases, and the SLCA (smallest lowest common ancestor)-based keyword search is a popular concept for locating the desirable subtrees corresponding to the given query keywords. However, it does not evaluate the importance of each node under those subtrees. Liu and Chen proposed a new concept, contributor, to output the relevant matches instead of all the keyword nodes. In this paper, we propose two methods, MinMap and SingleProbe, that improve the efficiency of searching the relevant matches by avoiding unnecessary index accesses. We analytically and empirically demonstrate the efficiency of our approaches. According to our experiments, both approaches work better than the existing one. Moreover, SingleProbe is generally better than MinMap if the minimum frequency and the maximum frequency of the query keywords are close. Database and Expert Systems Applications, 21st International Conference, DEXA 2010, Bilbao, Spain, August 30 - September 3, 2010, Proceedings, Part I; 01/2010

ABSTRACT: We study the problem of finding all maximal approximate gapped palindromes in a string. More specifically, given a string S of length n, a parameter q ≥ 0 and a threshold k > 0, the problem is to identify all substrings in S of the form uvw such that (1) the Levenshtein distance between u and w^r is at most k, where w^r is the reverse of w, and (2) v is a string of length q. The best previous work requires O(k²n) time. In this paper, we propose an O(kn)-time algorithm for this problem by utilizing an incremental string comparison technique. It turns out that the core technique actually solves a more general incremental string comparison problem that allows the insertion, deletion, and substitution of multiple symbols. 12/2009: pages 1084–1093;
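The definition of an approximate gapped palindrome can be checked directly for one candidate substring. The sketch below only verifies conditions (1) and (2) for given boundaries; it does not enumerate maximal palindromes and does not use the incremental string comparison technique from the paper. The function names and the parameterisation by `u_start`, `u_len` and `w_len` are hypothetical.

```python
def levenshtein(a, b):
    """Textbook edit distance with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def is_approx_gapped_palindrome(s, u_start, u_len, q, w_len, k):
    """Check whether s contains u v w at u_start with |u| = u_len, |v| = q,
    |w| = w_len, and Levenshtein(u, reverse(w)) <= k.
    """
    u = s[u_start:u_start + u_len]
    w_start = u_start + u_len + q
    w = s[w_start:w_start + w_len]
    if len(u) != u_len or len(w) != w_len:
        return False  # the requested substring runs past the end of s
    return levenshtein(u, w[::-1]) <= k

# u = "abc", v = "XX" (q = 2), w = "cda", so reverse(w) = "adc".
print(is_approx_gapped_palindrome("abcXXcda", 0, 3, 2, 3, 0))
print(is_approx_gapped_palindrome("abcXXcda", 0, 3, 2, 3, 1))
```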

ABSTRACT: A tandem duplication random loss (TDRL) operation duplicates a contiguous segment of genes, followed by the loss of one copy of each of the duplicated genes. Although the importance of this operation is supported by several recent biological studies, it has been investigated only rarely from a theoretical point of view. Of particular interest are sorting TDRLs, which are TDRLs that, when applied to a permutation representing a genome, reduce the distance towards another given permutation. The identification of sorting genome rearrangement operations in general is a key ingredient of many algorithms for reconstructing the evolutionary history of a set of species. In this paper we present methods to compute all sorting TDRLs for two given gene orders. In addition, a closed formula for the number of sorting TDRLs is derived and further properties of sorting TDRLs are investigated. It is also shown that the theoretical findings are useful for identifying unique sorting TDRL scenarios for mitochondrial gene orders. 06/2009: pages 301–313;

ABSTRACT: In this paper, we consider a commonly used compression scheme called run-length encoding (abbreviated rle). We provide lower bounds for problems of approximately matching two rle strings. Specifically, we show that the wildcard matching and k-mismatches problems for rle strings are 3SUM-hard. For two rle strings of m and n runs, such a result implies that it is very unlikely to devise an o(mn)-time algorithm for either problem. We then propose an O(mn + p log m)-time sweep-line algorithm for their combined problem, i.e. wildcard matching with mismatches, where p ≤ mn is the number of matched or mismatched runs. Furthermore, the problem of aligning two rle strings is also shown to be 3SUM-hard. Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22-24, 2009, Proceedings; 01/2009

Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings; 01/2009