Publications (200), 147.17 Total impact

Article: misFinder: Identify misassemblies in an unbiased manner using reference and paired-end reads
ABSTRACT: Background: Because of the short read length of high-throughput sequencing data, errors are introduced during genome assembly, which may adversely affect downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and detecting inconsistent features along misassembled sequences. However, the former approach cannot distinguish misassemblies from real structural variations between the target genome and the reference genome, while the latter approach can produce many false positives (correctly assembled sequences reported as misassembled). Results: We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct these errors at their misassembled positions, improving assembly accuracy for downstream analysis. It combines information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Assembly errors and correct assemblies corresponding to structural variations can be detected by comparing the reference genome with the assembled sequence. Different types of assembly errors can then be distinguished within the misassembled sequence by analyzing the aligned paired-end reads, using multiple features derived from coverage and insert-distance consistency to obtain high-confidence error calls. Conclusions: We tested misFinder on both simulated and real paired-end read data, and it gave accurate error calls with very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed both by 1) identifying more true-positive misassemblies with very few false positives and false negatives, and 2) distinguishing correct assemblies corresponding to structural variations from misassembled sequence.
misFinder can be freely downloaded from https://github.com/hitbio/misFinder.
ABSTRACT: Background: Given the global priority of bioenergy production from plant biomass, understanding the fundamental genetic associations underlying carbohydrate metabolism is crucial for developing effective biorefinery processes. Compared with the gut microbiomes of ruminants and wood-feeding insects, knowledge of carbohydrate metabolism in engineered biosystems is limited. Results: In this study, comparative metagenomics coupled with metabolic network analysis was used to study interspecies cooperation and competition among carbohydrate-active microbes in typical units of the wastewater treatment process, including activated sludge and anaerobic digestion. For the first time, sludge metagenomes were shown to contain a rather diverse pool of carbohydrate-active genes (CAGs), comparable to that of rumen microbiota. Overall, CAG composition correlated strongly with microbial phylogenetic structure across sludge types. Gene-centric clustering analysis showed that the carbohydrate pathways of sludge systems were shaped by different environmental factors, including dissolved oxygen and salinity, with the latter having a more determinative influence on phylogenetic composition. Finally, the highly clustered co-occurrence network of CAGs and saccharolytic phenotypes revealed three metabolic modules in which the prevalent populations of Actinomycetales, Clostridiales and Thermotogales, respectively, play significant roles as interaction hubs, while the broad negative co-exclusion correlations observed between anaerobic and aerobic microbes probably indicate a role for niche separation by dissolved oxygen in determining microbial assembly. Conclusions: Sludge microbiomes, which encode a diverse pool of CAGs, are another potential source for effective lignocellulosic biomass breakdown.
But unlike gut microbiomes, in which Clostridiales, Lactobacillales and Bacteroidales play a vital role, the carbohydrate metabolism of sludge systems is built on interspecies cooperation and competition among Actinomycetales, Clostridiales and Thermotogales.
ABSTRACT: The problem of finding k-edge-connected components is a fundamental problem in computer science. Given a graph G = (V, E), the problem is to partition the vertex set V into {V1, V2, …, Vh}, where each Vi is maximal, such that for any two vertices x and y in Vi, there are k edge-disjoint paths connecting them. In this paper, we present an algorithm that solves this problem for all k. The algorithm preprocesses the input graph to construct an auxiliary graph that stores information on the edge-connectivity between every pair of vertices in O(Fn) time, where F is the time complexity of finding the maximum flow between two vertices in G and n = |V|. For any value of k, the k-edge-connected components can then be determined by traversing the auxiliary graph in O(n) time. The input graph can be directed or undirected, and a simple graph or a multigraph. Previous work on this problem mainly focuses on fixed values of k.
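The query step described above can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that the auxiliary graph is a weighted tree in which the edge-connectivity of any two vertices equals the smallest weight on the tree path between them (the property of a Gomory-Hu tree for undirected graphs). The function name and the input layout are hypothetical, and the paper's actual auxiliary graph may differ, especially for directed inputs.

```python
from collections import defaultdict


def k_edge_connected_components(n, tree_edges, k):
    """Return the k-edge-connected components of vertices 0..n-1.

    tree_edges is a list of (u, v, w) triples forming a tree. ASSUMPTION:
    the edge-connectivity of u and v in the original graph equals the
    minimum w on the tree path between them (Gomory-Hu-like property).
    """
    # Keep only tree edges whose weight is at least k.
    adj = defaultdict(list)
    for u, v, w in tree_edges:
        if w >= k:
            adj[u].append(v)
            adj[v].append(u)

    # Connected components of the remaining forest; O(n) since a tree
    # has n - 1 edges.
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        components.append(sorted(comp))
    return components


# Hypothetical auxiliary tree on four vertices.
example_tree = [(0, 1, 3), (1, 2, 1), (2, 3, 2)]
```

With this toy tree, `k_edge_connected_components(4, example_tree, 2)` groups vertices 0 and 1 together and 2 and 3 together, because only the weight-1 edge is dropped. The same tree answers queries for every k without being rebuilt, which is the point of the preprocessing step.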
ABSTRACT: Sequence alignment is a fundamental problem in computational biology, which is also important in theoretical computer science. In this paper, we consider the problem of aligning a set of sequences subject to a given constrained sequence. Given two sequences \(A=a_1a_2\ldots a_n\) and \(B=b_1b_2\ldots b_n\) with a given distance function and a constrained sequence \(C=c_1c_2\ldots c_k\), our goal is to find the optimal sequence alignment of A and B w.r.t. the constraint C. We investigate several variants of this problem. If \(C=c^k\), i.e., all characters in C are the same, the optimal constrained pairwise sequence alignment can be found in \(O(\min \{kn^2,(t-k)n^2\})\) time, where t is the minimum number of occurrences of character c in A and B. If, in the final alignment, the alignment score between any two consecutive constrained characters is upper bounded by some value (a variant called GBCPSA), we give a dynamic programming algorithm with time complexity \(O(kn^4/\log n)\). For constrained center-star sequence alignment (CCSA), we prove that it is NP-hard to achieve the optimal alignment even over the binary alphabet. Furthermore, we show a negative result for CCSA, i.e., there is no polynomial-time algorithm that approximates CCSA within any constant ratio.
ABSTRACT: Metatranscriptomic analysis provides information on how a microbial community reacts to environmental changes. Using next-generation sequencing (NGS) technology, biologists can study a microbial community by sampling short reads from a mixture of mRNAs (metatranscriptomic data). As most microbial genome sequences are unknown, de novo assembly of the mRNAs would seem to be needed. However, NGS reads are short, and mRNAs share many similar regions and differ tremendously in abundance, which makes de novo assembly challenging. The existing assembler IDBA-MT, designed specifically for metatranscriptomic data, performs well only on highly expressed mRNAs. This paper introduces IDBA-MTP, which adopts a novel approach to metatranscriptomic assembly that exploits the existence of a database of millions of known protein sequences associated with mRNAs. Using the protein information effectively is nontrivial given the size of the database and given that different mRNAs might lead to proteins with similar functions (because different amino acids might have similar characteristics). IDBA-MTP employs a similarity measure between mRNAs and protein sequences, dynamic programming techniques and seed-and-extend heuristics to tackle the problem effectively and efficiently. Experimental results show that IDBA-MTP outperforms existing assemblers by reconstructing 14% more mRNAs. Availability: www.cs.hku.hk/~alse/hkubrg/.
ABSTRACT: Since the read lengths of high-throughput sequencing (HTS) technologies are short, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in paired-end and single-end reads when resolving branches. Because they treat all single-end reads whose overlap length exceeds a fixed threshold equally, they fail to give priority to the more reliable long-overlap reads and mix them with relatively short-overlap reads. Moreover, these approaches are not specially designed to handle tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds, using paired-end reads and read overlap sizes ranging from Omax to Omin to resolve gaps and branches. By constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct one using a look-ahead approach. Many difficult-to-resolve branches are due to tandem repeats that are close together in the genome. PERGA detects the different copies of such repeats to resolve these branches, making the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers.
PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
ABSTRACT: Sequence assembly is an important step in bioinformatics studies. With next-generation sequencing (NGS) technology, DNA fragments (reads) can be randomly sampled at high throughput from a DNA or RNA sequence. However, as the positions of the sampled reads are unknown, an assembly process is required to combine overlapping reads and reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing, NGS offers higher throughput, but the reads are shorter and the error rate is higher, which introduces several problems in assembly. Moreover, paired-end reads, which contain more information than single-end reads, can be sampled. Existing assemblers cannot fully utilize this information and fail to assemble longer contigs. In this article, we revisit the major problems of assembling NGS reads from genomic, transcriptomic, metagenomic and metatranscriptomic data. We also describe our IDBA package for solving these problems. The IDBA package adopts several novel ideas in assembly, including the use of multiple k values, local assembly and progressive depth removal. Compared with existing assemblers, IDBA performs better on many simulated and real sequencing datasets.
ABSTRACT: In this paper, we study 1-space bounded 2-dimensional bin packing and square packing. A sequence of rectangular items (square items) arrives one by one, and each item must be packed into a square bin of unit size on its arrival, without any information about future items. When packing items, 90° rotation is allowed. 1-space bounded means that there is only one “active” bin. If the “active” bin cannot accommodate the incoming item, it is closed and a new bin is opened. The objective is to minimize the total number of bins used for packing all items in the sequence. Our contributions are as follows: For 1-space bounded 2-dimensional bin packing, we propose an online packing algorithm with a tight competitive ratio of 5.06. A lower bound of 3.17 on the competitive ratio is proven. Moreover, we study 1-space bounded square packing, where each item is a square with side length no more than 1. A 4.3-competitive algorithm is achieved, and a lower bound of 2.94 on the competitive ratio is given. All these bounds surpass the previously best known results.
ABSTRACT: We consider the problem of aligning a set of sequences subject to a given constrained sequence, which has applications in computational biology. In this paper we show that sequence alignment for two sequences A and B with a given distance function and a constrained sequence of k identical characters (say character c) can be solved in O(min{kn^2, (t − k)n^2}) time, where n is the length of A and B, and t is the minimum number of occurrences of character c in A and B. We also prove that the problem of constrained center-star sequence alignment (CCSA) is NP-hard even over the binary alphabet. Furthermore, for some distance function, we show that no polynomial-time algorithm can approximate the CCSA within any constant ratio.
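To make the O(kn^2) term concrete, here is a minimal sketch of the textbook dynamic program for constrained pairwise alignment with a general constraint string. It is not the refined algorithm for k identical characters that the abstract announces. The unit gap and mismatch costs, the function name and the convention that a constrained column must be an exact match are all assumptions made for illustration.

```python
import math


def constrained_alignment_cost(a, b, c, gap=1, mismatch=1):
    """Minimum alignment cost of a and b such that the characters of c
    appear, in order, in columns where a and b both carry that character.

    D[l][i][j] = best cost of aligning a[:i] with b[:j] when the first l
    characters of c have already been placed. Runs in O(k * n * m) time.
    Returns math.inf if no alignment satisfies the constraint.
    """
    n, m, k = len(a), len(b), len(c)
    D = [[[math.inf] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    D[0][0][0] = 0
    for l in range(k + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if l == 0 and i == 0 and j == 0:
                    continue
                best = math.inf
                if i > 0:  # a[i-1] aligned with a gap
                    best = min(best, D[l][i - 1][j] + gap)
                if j > 0:  # b[j-1] aligned with a gap
                    best = min(best, D[l][i][j - 1] + gap)
                if i > 0 and j > 0:
                    same = a[i - 1] == b[j - 1]
                    # Ordinary (unconstrained) column.
                    best = min(best, D[l][i - 1][j - 1] + (0 if same else mismatch))
                    # Column that consumes constraint character c[l-1].
                    if l > 0 and same and a[i - 1] == c[l - 1]:
                        best = min(best, D[l - 1][i - 1][j - 1])
                D[l][i][j] = best
    return D[k][n][m]
```

For instance, aligning "ab" with "ba" under the constraint "a" forces the two a's into the same column, so both b's must be paired with gaps. An empty constraint reduces the table to ordinary edit distance.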
ABSTRACT: Given a seller with $k$ types of items, $m$ of each, a sequence of users $\{u_1, u_2,\ldots \}$ arrive one by one. Each user is single-minded, i.e., each user is interested only in a particular bundle of items. The seller must set the price and assign some amount of bundles to each user upon his/her arrival. Bundles can be sold fractionally. Each $u_i$ has his/her value function $v_i(\cdot )$ such that $v_i(x)$ is the highest unit price $u_i$ is willing to pay for $x$ bundles. The objective is to maximize the revenue of the seller by setting the price and amount of bundles for each user. In this paper, we first show that a lower bound of the competitive ratio for this problem is $\Omega (\log h+\log k)$, where $h$ is the highest unit price to be paid among all users. We then give a deterministic online algorithm, Pricing, whose competitive ratio is $O(\sqrt{k}\cdot \log h\log k)$. When $k=1$ the lower and upper bounds asymptotically match the optimal result $O(\log h)$.
Article: MetaCluster-TA: Taxonomic annotation for metagenomic data based on assembly-assisted binning
ABSTRACT: Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately because reads (~100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads into longer contigs before annotation. More reads/contigs can then be annotated, as a longer contig (in Kbp) can be aligned to a taxon even if it comes from an unknown species, as long as it contains a conserved region of that taxon. Unfortunately, existing metagenomic assembly tools are not mature enough to produce sufficiently long contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon, and together these reads should cover a significant portion of the genome, alleviating the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction. In this paper, we describe MetaCluster-TA, an assembly-assisted, binning-based annotation tool that relies on the innovative idea of annotating binned reads instead of aligning each read or contig to the taxonomic structure separately. We propose the novel concept of the 'virtual contig' (which can be up to 10 Kb in length) to represent a set of reads, and then represent each cluster as a set of 'virtual contigs' (which together can total up to 1 Mb in length) for annotation.
MetaCluster-TA can outperform the widely used MEGAN4 and can annotate (1) more reads, since the virtual contigs are much longer; (2) more accurately, since each cluster of long virtual contigs contains global information about the sampled genome, which tends to be more accurate than short reads or assembled contigs that contain only local information; and (3) more efficiently, since there are far fewer long virtual contigs to align than short reads. MetaCluster-TA outperforms MetaCluster 5.0 as a binning tool, since binning itself can be more sensitive and precise given long virtual contigs and the binning results can be improved using the reference taxonomic database. MetaCluster-TA can outperform the widely used MEGAN4 and can annotate more reads with higher accuracy and higher efficiency. It also outperforms MetaCluster 5.0 as a binning tool.
ABSTRACT: In this paper, we study the online tree node assignment problem, which is a generalization of the well-studied OVSF code assignment problem. Assigned nodes in a complete binary tree must follow the rule that each leaf-to-root path contains at most one assigned node. At times, it is necessary to swap assigned nodes with unassigned nodes in order to accommodate a new node assignment. The goal is to minimize the number of swaps needed to satisfy a sequence of node assignments and releases. This problem is fundamental not only to OVSF code assignment but also to other applications, such as buddy memory allocation and hypercube subcube allocation. All previous solutions to this problem are based on a sorted and compact configuration, assigning the nodes linearly and level by level and ignoring the intrinsic tree property in their assignments. Our contributions are: (1) the concept of safe assignment, which is proved to be unique for any fixed set of node-assignment requests; (2) an 8-competitive algorithm that maintains the safe assignment; and (3) an improved 6-competitive variant of this algorithm. Our algorithms are simple and easy to implement, and our contributions represent meaningful improvements over recent results.
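The assignment rule stated in the abstract (at most one assigned node on every leaf-to-root path) is equivalent to saying that no assigned node is an ancestor of another. The following sketch checks only that rule; it does not implement the safe assignment or the competitive algorithms of the paper. The heap-style numbering (root 1, children 2i and 2i+1) and the function name are assumptions chosen for illustration.

```python
def is_valid_assignment(assigned):
    """Check the OVSF-style rule on a complete binary tree.

    Nodes use heap numbering (ASSUMPTION): the root is 1 and node i has
    children 2*i and 2*i + 1. The set is valid when no assigned node has
    an assigned proper ancestor, i.e. every leaf-to-root path contains at
    most one assigned node.
    """
    assigned = set(assigned)
    for node in assigned:
        ancestor = node // 2
        while ancestor >= 1:  # walk up to the root
            if ancestor in assigned:
                return False
            ancestor //= 2
    return True
```

For example, `{2, 6, 7}` is a valid configuration, while `{2, 5}` is not because node 5 lies in the subtree of node 2. A new request at a given level can be served without swaps only if some node at that level keeps the set valid, which is where the swap-minimization question begins.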
Conference Paper: PERGA: A paired-end read guided de novo assembler for extending contigs using SVM approach
ABSTRACT: Since the read lengths of high-throughput sequencing (HTS) technologies are short, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in single-end and paired-end reads when resolving branches; for example, the number and positions of the reads supporting each possible extension are not taken into account. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. Instead of using single-end reads to construct contigs, PERGA uses paired-end reads and read overlap size thresholds ranging from Omax to Omin to resolve gaps and branches. Moreover, by constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct one using a look-ahead technique. We evaluated PERGA on both simulated and real Illumina datasets, and it constructed longer and more correct contigs and scaffolds than the other state-of-the-art assemblers IDBA-UD, Velvet, ABySS, SGA and CABOG. Availability: https://github.com/hitbio/PERGA
ABSTRACT: In this paper, we study 1-space bounded multi-dimensional bin packing and hypercube packing. A sequence of items arrives over time; each item is a d-dimensional hyperbox (in bin packing) or hypercube (in hypercube packing), and the length of each side is no more than 1. These items must be packed without overlapping into d-dimensional hypercubes with unit length on each side. In d-dimensional space, any two dimensions i and j define a plane P_ij. When an item arrives, we must pack it into an active bin immediately, without any knowledge of future items, and 90° rotation on any plane P_ij is allowed. The objective is to minimize the total number of bins used for packing all the items in the sequence. In the 1-space bounded variant, there is only one active bin for packing the current item. If the active bin does not have enough space to pack the item, it must be closed and a new active bin is opened. For d-dimensional bin packing, an online algorithm with competitive ratio 4^d is given. Moreover, we consider d-dimensional hypercube packing and give a 2^(d+1)-competitive algorithm. These two results are the first study of 1-space bounded multi-dimensional bin packing and hypercube packing.
ABSTRACT: High-throughput next-generation sequencing technology provides a great opportunity for analyzing metatranscriptomic data. However, the reads produced by these technologies are short, and an assembly step is required to combine the short reads into longer contigs. As there are many repeat patterns in mRNAs from different genomes and the abundance ratio of mRNAs in a sample varies widely, existing assemblers for genomic, transcriptomic and metagenomic data do not work well on metatranscriptomic data and produce chimeric contigs, that is, incorrect contigs formed by merging multiple mRNA sequences. To the best of our knowledge, there is no assembler designed for metatranscriptomic data. In this article, we introduce an assembler called IDBA-MT, which is designed for assembling reads from metatranscriptomic data. IDBA-MT produces far fewer chimeric contigs (a reduction of 50% or more) than existing assemblers such as Oases, IDBA-UD and Trinity.
ABSTRACT: RNA sequencing based on next-generation sequencing technology is effective for analyzing transcriptomes. Like de novo genome assembly, de novo transcriptome assembly does not rely on any reference genome or additional annotation information, but it is more difficult. In particular, isoforms can have very uneven expression levels (e.g. 1:100), which makes it very difficult to identify low-expressed isoforms. One challenge is to remove erroneous vertices/edges with high multiplicity (produced by high-expressed isoforms) in the de Bruijn graph without removing correct ones with not-so-high multiplicity from low-expressed isoforms. Failing to do so results in the loss of low-expressed isoforms or in complicated subgraphs in which transcripts of different genes are mixed together because of erroneous vertices/edges. Contributions: Unlike existing tools, which remove erroneous vertices/edges with multiplicities lower than a global threshold, we use a probabilistic progressive approach to remove them iteratively with local thresholds. This enables us to decompose the graph into disconnected components, each containing a few genes, if not a single gene, while retaining many correct vertices/edges of low-expressed isoforms. Combined with existing techniques, IDBA-Tran is able to assemble both high-expressed and low-expressed transcripts and outperforms existing assemblers in terms of sensitivity and specificity on both simulated and real data. Availability: http://www.cs.hku.hk/~alse/idba_tran. Contact: chin@cs.hku.hk. Supplementary data are available at Bioinformatics online.
ABSTRACT: Metagenomic binning remains an important topic in metagenomic analysis. Existing unsupervised binning methods for next-generation sequencing (NGS) reads do not perform well on (i) samples with low-abundance species or (ii) samples (even with high abundance) that contain many extremely low-abundance species. These two problems are common in real metagenomic datasets, so binning methods that can solve them are desirable. We propose a two-round binning method (MetaCluster 5.0) that aims to identify both low-abundance and high-abundance species in the presence of a large amount of noise from many extremely low-abundance species. In summary, MetaCluster 5.0 uses a filtering strategy to remove noise from the extremely low-abundance species. It separates reads of high-abundance species from those of low-abundance species in two different rounds. To overcome the low coverage of low-abundance species, multiple w values are used to group reads with overlapping w-mers: reads from high-abundance species are grouped with high confidence based on a large w, and binning then expands to low-abundance species using a relaxed (shorter) w. Compared with the recent tools TOSS and MetaCluster 4.0, MetaCluster 5.0 can find more species (especially those with low abundance of, say, 6× to 10×) and can achieve better sensitivity and specificity using less memory and running time. Availability: http://i.cs.hku.hk/~alse/MetaCluster/ Contact: chin@cs.hku.hk.
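The idea of grouping reads that share w-mers, and of relaxing w to reach sparser species, can be illustrated with a small union-find sketch. This is only a toy model under strong assumptions: any single shared w-mer merges two reads, reverse complements are ignored, and the noise filtering and two-round procedure of MetaCluster 5.0 are not modeled. The function name and the example reads are hypothetical.

```python
def group_reads_by_shared_wmers(reads, w):
    """Group read indices so that reads sharing any exact w-mer end up
    in the same group (transitively), using union-find.

    ASSUMPTION: one shared w-mer is enough to merge two reads; a real
    binning tool would apply thresholds and noise filtering.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Find the set representative with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_read_with = {}  # w-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for pos in range(len(read) - w + 1):
            wmer = read[pos:pos + w]
            if wmer in first_read_with:
                ra, rb = find(idx), find(first_read_with[wmer])
                if ra != rb:
                    parent[ra] = rb
            else:
                first_read_with[wmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Hypothetical toy reads.
toy_reads = ["ACGTAC", "GTACGG", "TTTTTT", "CGGAAA"]
```

On these toy reads, w = 4 links only reads 0 and 1 (they share GTAC), while the shorter w = 3 also pulls in read 3 through the shared CGG. That mirrors, in miniature, why a large w gives confident groups and a relaxed w extends them.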
ABSTRACT: Given a seller with m items, a sequence of users {u1, u2, …} arrives one by one, and the seller must set the unit price and assign some items to each user on his/her arrival. Items can be sold fractionally. Each ui has his/her value function vi(⋅) such that vi(x) is the highest unit price ui is willing to pay for x items. The objective is to maximize the revenue by setting the price and number of items for each user. In this paper, we make the following contributions: if the highest value h among all vi(x) is known in advance, we first show that the lower bound of the competitive ratio is ⌊log h⌋/2, and then give an online algorithm with competitive ratio 4⌊log h⌋ + 6; if h is not known in advance, we give an online algorithm with competitive ratio $2\cdot h^{\log^{-1/2} h} + 8\cdot h^{3\log^{-1/2} h}$.
Conference Paper: Phylogenetic Tree Reconstruction with Protein Linkage
ABSTRACT: When reconstructing a phylogenetic tree, one common representation for a species is a binary string indicating the existence of some selected genes/proteins. Up until now, all existing methods have assumed the existence of these genes/proteins to be independent. However, in most cases, this assumption is not valid. In this paper, we consider the reconstruction problem by taking into account the dependency of proteins, i.e. protein linkage. We assume that the tree structure and leaf sequences are given, so we need only to find an optimal assignment to the ancestral nodes. We prove that the Phylogenetic Tree Reconstruction with Protein Linkage (PTRPL) problem is NP-complete for three different versions of linkage distance. We provide an efficient dynamic programming algorithm that solves the general problem in O(4^m · n) and O(4^m · (m + n)) time (compared to the straightforward O(4^m · m · n) and O(4^m · m^2 · n) time algorithms), depending on the version of linkage distance used, where n stands for the number of species and m for the number of proteins, i.e. the length of the binary string. We also argue, by experiments, that trees with higher accuracy can be constructed by using linkage information than by using only Hamming distance to measure the differences between the binary strings, thus validating the significance of linkage information.
Conference Paper: Online Pricing for Multi-type of Items
ABSTRACT: In this paper, we study the problem of online pricing for bundles of items. Given a seller with k types of items, m of each, a sequence of users {u_1, u_2, …} arrives one by one. Each user is single-minded, i.e., each user is interested only in a particular bundle of items. The seller must set the price and assign some amount of bundles to each user upon his/her arrival. Bundles can be sold fractionally. Each u_i has his/her value function v_i(·) such that v_i(x) is the highest unit price u_i is willing to pay for x bundles. The objective is to maximize the revenue of the seller by setting the price and amount of bundles for each user. In this paper, we first show that the lower bound of the competitive ratio for this problem is Ω(log h + log k), where h is the highest unit price to be paid among all users. We then give a deterministic online algorithm, Pricing, whose competitive ratio is O(√k · log h log k). When k = 1 the lower and upper bounds asymptotically match the optimal result O(log h).
Publication Stats
3k  Citations  
147.17  Total Impact Points  
Institutions

2015

Hang Seng Management College
Hong Kong, Hong Kong


1970-2015

The University of Hong Kong
 Department of Computer Science
Hong Kong, Hong Kong


2011

Xiamen University
 Department of Computer Science
Xiamen, Fujian, China


2006

Lands Department of The Government of the Hong Kong Special Administrative Region
Hong Kong, Hong Kong


2005

Japan Advanced Institute of Science and Technology
 School of Information Science
KMQ, Ishikawa, Japan


2004

Memorial University of Newfoundland
Saint John's, Newfoundland and Labrador, Canada


2002

University of Regina
 Department of Computer Sciences
Regina, Saskatchewan, Canada


1995

Peking University
Peping, Beijing, China


1977-1986

University of Alberta
 Department of Computing Science
Edmonton, Alberta, Canada


1981

University of California, San Diego
San Diego, California, United States
