ABSTRACT: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1–10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression.
MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes.
We described the first architecture for reference-based, lossless compression of metagenomic data. The compression scheme proposed offers significantly improved compression ratios as compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline.
The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM. In addition, a VirtualBox virtual machine configured with 4 GB of RAM is provided for users to run a simple demonstration.
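The abstract mentions reference-based compression built on integer compression of read-reference positions. The sketch below is purely illustrative and is not MetaCRAM's actual codec: it shows the common pattern of delta-encoding sorted alignment positions and packing the resulting small gaps with LEB128-style variable-length integers. All function names are hypothetical.

```python
# Illustrative sketch (not MetaCRAM's codec): delta + varint coding of
# sorted read-to-reference alignment positions.

def varint_encode(n):
    """Encode a non-negative integer as LEB128-style base-128 bytes."""
    out = bytearray()
    while True:
        if n < 0x80:
            out.append(n)
            return bytes(out)
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation flag set
        n >>= 7

def varint_decode_stream(data):
    """Decode a concatenated stream of LEB128 varints into integers."""
    values, cur, shift = [], 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values

def compress_positions(positions):
    """Sort positions, delta-encode the gaps, varint-pack each delta."""
    payload, prev = bytearray(), 0
    for p in sorted(positions):
        payload += varint_encode(p - prev)
        prev = p
    return bytes(payload)

def decompress_positions(data):
    """Invert compress_positions: cumulative sums of the decoded deltas."""
    out, prev = [], 0
    for delta in varint_decode_stream(data):
        prev += delta
        out.append(prev)
    return out
```

Because nearby reads align to nearby reference positions, the deltas are small and most gaps fit in a single byte, which is where the compression gain comes from.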
ABSTRACT: Threshold graphs are recursive deterministic network models that capture properties of certain social and economic interactions. One drawback of these graph families is that they have highly constrained generative attachment rules. To mitigate this problem, we introduce a new class of graphs termed Doubly Threshold (DT) graphs which may be succinctly described through vertex weights that govern the existence of edges via two inequalities. One inequality imposes the constraint that the sum of weights of adjacent vertices has to exceed a specified threshold. The second inequality ensures that adjacent vertices have a bounded difference of their weights. We provide a succinct characterization and decomposition of DT graphs and analyze their forbidden induced subgraphs which we compare to those of known social networks. We also present a method for performing vertex weight assignments on DT graphs that satisfy the defining constraints.
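The two defining inequalities translate directly into code. The following minimal sketch builds a DT graph from vertex weights; the exact strictness of the inequalities (≥ vs. >) and the parameter names are assumptions here, with the precise definition given in the paper.

```python
# Minimal sketch of the Doubly Threshold (DT) edge rule, assuming
# non-strict inequalities; parameter names are illustrative.

def dt_edge(w_u, w_v, sum_threshold, diff_bound):
    """Edge exists iff the weight sum clears the threshold AND the
    weight difference stays within the bound."""
    return (w_u + w_v >= sum_threshold) and (abs(w_u - w_v) <= diff_bound)

def dt_graph(weights, sum_threshold, diff_bound):
    """Edge set of the DT graph on vertices 0..n-1 with the given weights."""
    n = len(weights)
    return {(i, j)
            for i in range(n) for j in range(i + 1, n)
            if dt_edge(weights[i], weights[j], sum_threshold, diff_bound)}
```

Note how the second inequality changes the picture relative to classical threshold graphs: two heavy vertices whose weight sum clears the threshold may still be non-adjacent if their weights differ too much.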
ABSTRACT: We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based storage systems and synchronization protocols. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. In addition, WMU sequences used in DNA-based storage systems are required to have balanced compositions of symbols and to be at large mutual Hamming distance from each other. We present a number of constructions for balanced, error-correcting WMU codes using Dyck paths, Knuth's balancing principle, prefix-synchronized codes, and cyclic codes.
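The defining property admits a direct brute-force check. A minimal sketch follows, assuming "sufficiently long" means proper suffixes of length at least some parameter k (the usual convention for mutually uncorrelated codes); the exact parameterization is specified in the paper.

```python
# Illustrative checker for the weakly mutually uncorrelated (WMU)
# property, assuming proper suffixes of length >= k are the ones
# forbidden from appearing as prefixes.

def is_wmu(codewords, k):
    """True iff no proper suffix of length >= k of any codeword is a
    prefix of any codeword (including the codeword itself)."""
    for x in codewords:
        for ell in range(k, len(x)):       # proper suffixes only
            suffix = x[-ell:]
            for y in codewords:
                if y.startswith(suffix):
                    return False
    return True
```

In a DNA-storage context this property prevents spurious address matches: a sequencing read straddling two stored blocks cannot accidentally begin with a valid address.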
ABSTRACT: We introduce the new problem of code design in the Damerau metric. The Damerau metric is a generalization of the Levenshtein distance which also allows for adjacent transposition edits. We first provide constructions for codes that may correct either a single deletion or a single adjacent transposition and then proceed to extend these results to codes that can simultaneously correct a single deletion and multiple adjacent transpositions.
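For readers unfamiliar with the metric, here is a standard dynamic-programming computation of the optimal-string-alignment variant of the Damerau distance (insertions, deletions, substitutions, and adjacent transpositions, with no substring edited twice). This is background illustration only, not a construction from the paper, and the unrestricted Damerau metric can differ on inputs where a transposed pair is edited again.

```python
# Optimal-string-alignment variant of the Damerau distance.

def damerau_osa(a, b):
    """Edit distance with insertions, deletions, substitutions, and
    adjacent transpositions (each substring edited at most once)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
            # adjacent transposition: swap a[i-2]a[i-1] to match b
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

For example, "abcd" and "acbd" are at distance 1 (one adjacent transposition), whereas the plain Levenshtein distance between them is 2.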
ABSTRACT: Motivated by charge balancing constraints for rank modulation schemes, we introduce the notion of balanced permutations and derive the capacity of balanced permutation codes. We also describe simple interleaving methods for permutation code constructions and show that they approach capacity.
ABSTRACT: Motivation:
Cancer genomes exhibit a large number of different alterations that affect many genes in a diverse manner. An improved understanding of the generative mechanisms behind the mutation rules and their influence on gene community behavior is of great importance for the study of cancer.
To expand our capability to analyze combinatorial patterns of cancer alterations, we developed a rigorous methodology for cancer mutation pattern discovery based on a new, constrained form of correlation clustering. Our new algorithm, named C(3) (Cancer Correlation Clustering), leverages mutual exclusivity of mutations, patient coverage, and driver network concentration principles. To test C(3), we performed a detailed analysis on TCGA breast cancer and glioblastoma data and showed that our algorithm outperforms the state-of-the-art CoMEt method in terms of discovering mutually exclusive gene modules and identifying biologically relevant driver genes. The proposed agnostic clustering method represents a unique tool for efficient and reliable identification of mutation patterns and driver pathways in large-scale cancer genomics studies, and it may also be used for other clustering problems on biological graphs.
The source code for the C(3) method can be found at https://github.com/jackhou2/C3. CONTACT: firstname.lastname@example.org (J.M.) and email@example.com (O.M.). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT: We propose a new latent Boolean feature model for complex networks that captures different types of node interactions and network communities. The model is based on a new concept in graph theory, termed the co-intersection representation of a graph, which generalizes the notion of an intersection representation. We describe how to use co-intersection representations to deduce node feature sets and their communities, and proceed to derive several general bounds on the minimum number of features used in co-intersection representations. We also discuss graph families for which exact co-intersection characterizations are possible.
ABSTRACT: We consider the problem of synchronizing coded data in distributed storage networks undergoing insertion and deletion edits. We present modifications of distributed storage codes that allow updates in the parity-check values to be performed with one round of communication at low bit rates and with small storage overhead. Our main contributions are novel protocols for synchronizing frequently updated and semi-static data based on functional intermediary coding involving permutation and Vandermonde matrices.
Article · Dec 2015 · IEEE/ACM Transactions on Networking
ABSTRACT: We provide an overview of current approaches to DNA-based storage system design and accompanying synthesis, sequencing and editing methods. We also introduce and analyze a suite of new constrained coding schemes for both archival and random access DNA storage channels. The mathematical basis of our work is the construction and design of sequences over discrete alphabets that avoid pre-specified address patterns, have balanced base content, and exhibit other relevant substring constraints. These schemes adapt the stored signals to the DNA medium and thereby reduce the inherent error-rate of the system.
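Two of the stated constraints, avoidance of pre-specified address patterns and balanced base content, can be checked with a few lines of code. The sketch below is a simplified illustration, not the paper's coding scheme; the tolerance parameter and function names are assumptions.

```python
# Illustrative checker for two DNA-storage codeword constraints:
# (1) avoid every pre-specified address pattern as a substring,
# (2) keep GC content near-balanced. The gc_tolerance parameter is
# a hypothetical choice, not taken from the paper.

def satisfies_constraints(seq, address_patterns, gc_tolerance=0.1):
    """True iff seq avoids all address patterns and is GC-balanced."""
    if any(pattern in seq for pattern in address_patterns):
        return False                      # would collide with an address
    gc = sum(1 for base in seq if base in "GC") / len(seq)
    return abs(gc - 0.5) <= gc_tolerance  # near 50% GC content
```

Balanced GC content keeps melting temperatures uniform during synthesis and sequencing, while address avoidance ensures random-access primers bind only at true block boundaries.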
ABSTRACT: We introduce a new agnostic clustering method: minimax correlation clustering. Given a graph whose edges are labeled with $+$ or $-$, we wish to partition the graph into clusters while trying to avoid errors: $+$ edges between clusters or $-$ edges within clusters. Unlike classical correlation clustering, which seeks to minimize the total number of errors, minimax clustering instead seeks to minimize the number of errors at the worst vertex, that is, at the vertex with the greatest number of incident errors. This minimax objective function may be seen as a way to enforce individual-level quality-of-partition constraints for vertices in a graph. We study this problem on complete graphs and complete bipartite graphs, proving that the problem is NP-hard on these graph classes and giving polynomial-time constant-factor approximation algorithms. The approximation algorithms rely on LP relaxation and rounding procedures.
ABSTRACT: We continue our study of a new family of asymmetric Lee codes that arise in the design and implementation of emerging DNA-based storage systems and systems which use parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes, and have symbol distances dictated by their underlying binary representation. Our contributions include deriving new bounds for the size of the largest code in this metric based on Delsarte-like linear programming methods and describing new constructions for non-linear asymmetric Lee codes.
ABSTRACT: We consider a new family of codes, termed asymmetric Lee distance codes, that arise in the design and implementation of DNA-based storage systems and systems with parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes; furthermore, symbol confusability is dictated by their underlying binary representation. Our contributions are two-fold. First, we demonstrate that the new distance represents a linear combination of the Lee and Hamming distance and derive upper bounds on the size of the codes under this metric based on linear programming techniques. Second, we propose a number of code constructions which imply lower bounds. 
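The abstract states that the new metric is a linear combination of the Lee and Hamming distances; the weights of that combination are derived in the paper, so the sketch below only implements the two classical components over the quaternary alphabet $\mathbb{Z}_4$, as background illustration.

```python
# The two classical components of the asymmetric Lee metric: the Lee
# distance on Z_q and the Hamming distance. The paper's metric is a
# linear combination of these; its exact weights are not reproduced here.

def lee_distance(x, y, q=4):
    """Sum over positions of the circular distance min(d, q - d) on Z_q."""
    return sum(min((a - b) % q, (b - a) % q) for a, b in zip(x, y))

def hamming_distance(x, y):
    """Number of positions in which the two sequences differ."""
    return sum(a != b for a, b in zip(x, y))
```

On $\mathbb{Z}_4$ the Lee distance charges 2 for confusing a symbol with its antipode (e.g. 0 vs. 2) but only 1 for adjacent symbols, while the Hamming distance charges 1 for any mismatch; weighting the two differently captures asymmetric symbol confusability.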
ABSTRACT: We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh-density archival and rewritable storage applications.
ABSTRACT: We consider the problem of storing and retrieving information from synthetic DNA media. The mathematical basis of the problem is the construction and design of sequences that may be discriminated based on their collection of substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first investigated in the noiseless setting under the name of "Markov type" analysis. Here, we explain the connection between the reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes.
Article · Feb 2015 · IEEE Transactions on Information Theory
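The substring-based discrimination described in the abstract above can be made concrete with a k-mer profile: in the noiseless setting, two sequences are indistinguishable through the channel when their multisets of length-k substrings coincide, and such sequences fall into the same equivalence class. A minimal sketch, with an illustrative function name:

```python
# Illustrative k-mer profile: the multiset of length-k substrings of a
# sequence. Sequences with identical profiles cannot be discriminated
# from their substrings alone (noiseless setting).

from collections import Counter

def kmer_profile(seq, k):
    """Multiset (as a Counter) of all length-k substrings of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def same_equivalence_class(seq1, seq2, k):
    """True iff the two sequences share a length-k substring profile."""
    return kmer_profile(seq1, k) == kmer_profile(seq2, k)
```

Counting these equivalence classes is where the restricted de Bruijn graphs and Ehrhart theory mentioned in the abstract enter: k-mer profiles correspond to edge-multiplicity vectors of Eulerian-path-carrying subgraphs of a de Bruijn graph.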