ABSTRACT: We provide an overview of current approaches to DNA-based storage system design and accompanying synthesis, sequencing and editing methods. We also introduce and analyze a suite of new constrained coding schemes for both archival and random access DNA storage channels. The mathematical basis of our work is the construction and design of sequences over discrete alphabets that avoid pre-specified address patterns, have balanced base content, and exhibit other relevant substring constraints. These schemes adapt the stored signals to the DNA medium and thereby reduce the inherent error rate of the system.
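The two constraints named above — avoiding address patterns and balancing base content — can be checked directly. A minimal sketch (the function names, the tolerance parameter, and the toy address patterns are ours, not from the paper; the actual constrained codes are far more structured):

```python
from itertools import product

def gc_balanced(seq, tol=0.1):
    # G/C fraction within tol of 1/2 ("balanced base content").
    gc = sum(1 for b in seq if b in "GC") / len(seq)
    return abs(gc - 0.5) <= tol

def avoids_patterns(seq, patterns):
    # None of the forbidden address patterns occurs as a substring.
    return not any(p in seq for p in patterns)

# Toy search: all length-8 sequences satisfying both constraints exactly.
addresses = ["AAAA", "TTTT"]
valid = ["".join(s) for s in product("ACGT", repeat=8)
         if gc_balanced("".join(s), tol=0.0)
         and avoids_patterns("".join(s), addresses)]
```

Brute-force enumeration is only feasible at toy lengths; real constructions encode directly into the constrained set.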
ABSTRACT: We introduce a new agnostic clustering method: minimax correlation
clustering. Given a graph whose edges are labeled with $+$ or $-$, we wish to
partition the graph into clusters while trying to avoid errors: $+$ edges
between clusters or $-$ edges within clusters. Unlike classical correlation
clustering, which seeks to minimize the total number of errors, minimax
clustering instead seeks to minimize the number of errors at the worst vertex,
that is, at the vertex with the greatest number of incident errors. This
minimax objective function may be seen as a way to enforce individual-level
quality of partition constraints for vertices in a graph. We study this problem
on complete graphs and complete bipartite graphs, proving that the problem is
NP-hard on these graph classes and giving polynomial-time constant-factor
approximation algorithms. The approximation algorithms rely on LP relaxation
and rounding procedures.
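The minimax objective can be made concrete with a small sketch (the helper names and the edge/cluster encoding are ours, purely illustrative):

```python
def errors_per_vertex(edges, clusters):
    # edges: dict mapping frozenset({u, v}) to '+' or '-';
    # clusters: dict mapping each vertex to its cluster id.
    # An error is a '+' edge cut between clusters or a '-' edge
    # kept inside a cluster; each error charges both endpoints.
    errs = {v: 0 for v in clusters}
    for pair, sign in edges.items():
        u, v = tuple(pair)
        same = clusters[u] == clusters[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            errs[u] += 1
            errs[v] += 1
    return errs

def minimax_objective(edges, clusters):
    # The quantity minimized: error count at the worst vertex.
    return max(errors_per_vertex(edges, clusters).values())

# A triangle with one '-' edge: putting all three vertices in one
# cluster leaves a single '-' error incident to vertices a and c.
edges = {frozenset({"a", "b"}): "+",
         frozenset({"b", "c"}): "+",
         frozenset({"a", "c"}): "-"}
clusters = {"a": 0, "b": 0, "c": 0}
```

Classical correlation clustering would sum the errors; the minimax variant reports only the maximum per vertex.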
ABSTRACT: We continue our study of a new family of asymmetric Lee codes that arise in the design and implementation of emerging DNA-based storage systems and systems that use parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes, and have symbol distances dictated by their underlying binary representation. Our contributions include deriving new bounds for the size of the largest code in this metric based on Delsarte-like linear programming methods and describing new constructions for non-linear asymmetric Lee codes.
ABSTRACT: We consider a new family of codes, termed asymmetric Lee distance codes, that
arise in the design and implementation of DNA-based storage systems and systems
with parallel string transmission protocols. The codewords are defined over a
quaternary alphabet, although the results carry over to other alphabet sizes;
furthermore, symbol confusability is dictated by their underlying binary
representation. Our contributions are two-fold. First, we demonstrate that the
new distance represents a linear combination of the Lee and Hamming distance
and derive upper bounds on the size of the codes under this metric based on
linear programming techniques. Second, we propose a number of code
constructions which imply lower bounds.
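The two components entering the linear combination are standard and easy to compute. A sketch over the quaternary alphabet $\mathbb{Z}_4$ (the generic coefficients `alpha` and `beta` are placeholders; the specific combination defining the asymmetric Lee metric is given in the paper):

```python
def lee_distance(x, y, q=4):
    # Lee distance over Z_q: each coordinate contributes the shorter
    # of the two distances around the cycle of q symbols.
    return sum(min((a - b) % q, (b - a) % q) for a, b in zip(x, y))

def hamming_distance(x, y):
    # Number of coordinates where the words differ.
    return sum(a != b for a, b in zip(x, y))

def combined_distance(x, y, alpha=1, beta=1, q=4):
    # Generic linear combination alpha*Lee + beta*Hamming;
    # illustrative only -- the paper's coefficients differ.
    return alpha * lee_distance(x, y, q) + beta * hamming_distance(x, y)
```

For example, symbols 0 and 3 are Lee-adjacent over $\mathbb{Z}_4$ (distance 1) yet Hamming-distinct, which is exactly the kind of asymmetry in symbol confusability the metric is built to capture.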
ABSTRACT: We describe the first DNA-based storage architecture that enables random
access to data blocks and rewriting of information stored at arbitrary
locations within the blocks. The newly developed architecture overcomes
drawbacks of existing read-only methods that require decoding the whole file in
order to read one data fragment. Our system is based on new constrained coding
techniques and accompanying DNA editing methods that ensure data reliability,
specificity and sensitivity of access, and at the same time provide
exceptionally high data storage capacity. As a proof of concept, we encoded
parts of the Wikipedia pages of six universities in the USA, and selected and
edited parts of the text written in DNA corresponding to three of these
schools. The results suggest that DNA is a versatile medium suitable for both
ultrahigh density archival and rewritable storage applications.
ABSTRACT: We consider the problem of storing and retrieving information from synthetic
DNA media. The mathematical basis of the problem is the construction and design
of sequences that may be discriminated based on their collection of substrings
observed through a noisy channel. This problem of reconstructing sequences from
traces was first investigated in the noiseless setting under the name of
"Markov type" analysis. Here, we explain the connection between the
reconstruction problem and the problem of DNA synthesis and sequencing, and
introduce the notion of a DNA storage channel. We analyze the number of
sequence equivalence classes under the channel mapping and propose new
asymmetric coding techniques to combat the effects of synthesis and sequencing
noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart
theory for rational polytopes.
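The notion of sequence equivalence classes under the channel mapping can be illustrated with substring profiles: distinct sequences sharing the same multiset of length-$\ell$ substrings cannot be told apart from their observed substrings. A toy sketch (the function name is ours):

```python
from collections import Counter

def substring_profile(seq, ell):
    # Multiset of all length-ell substrings of seq.
    return Counter(seq[i:i + ell] for i in range(len(seq) - ell + 1))

# Two distinct DNA strings with identical 2-gram profiles: both yield
# {AC: 2, CA: 2, CC: 1}, so they lie in the same equivalence class
# when only length-2 substrings are observed.
a, b = "ACCACA", "ACACCA"
```

Counting such classes is exactly where the restricted de Bruijn graph and Ehrhart-theoretic machinery of the paper comes in.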
ABSTRACT: We consider the problem of correlation clustering on graphs with constraints
on both the cluster sizes and the positive and negative weights of edges. Our
contributions are twofold: First, we introduce the problem of correlation
clustering with bounded cluster sizes. Second, we extend the region of weight
values for which the clustering may be performed with constant approximation
guarantees in polynomial time and apply the results to the bounded cluster size problem.
ABSTRACT: We consider the problem of assembling a sequence based on a collection of its
substrings observed through a noisy channel. This problem of reconstructing
sequences from traces was first investigated in the noiseless setting under the
name of "Markov type" analysis. Here, we explain the connection between the
reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the
notion of a DNA storage channel. We analyze the number of sequence equivalence
classes under the channel mapping and propose new asymmetric coding techniques
to combat the effects of synthesis noise. In our analysis, we make use of
Ehrhart theory for rational polytopes.
ABSTRACT: We analyze a new group testing system, termed semi-quantitative group
testing, which may be viewed as a concatenation of an adder channel and a
discrete quantizer. Our focus is on non-uniform quantizers with arbitrary
thresholds. For a given choice of parameters for the semi-quantitative group
testing model, we define three new families of sequences capturing the
constraints on the code design imposed by the choice of the thresholds. The
sequences represent extensions and generalizations of $B_h$ and certain types of
super-increasing and lexicographically ordered sequences, and they lead to code
structures amenable for efficient recursive decoding. We describe the decoding
methods and provide an accompanying computational complexity and performance analysis.
ABSTRACT: We propose a new family of distance measures on rankings, derived through an axiomatic approach, that consider the nonuniform relevance of the top and bottom of ordered lists and similarities between candidates. The proposed distance functions include specialized weighted versions of the Kendall $\tau$ distance and the Cayley distance, and are suitable for comparing rankings in a number of applications, including information retrieval and rank aggregation. In addition to proposing the distance measures and providing the theoretical underpinnings for their applications, we also analyze algorithmic and computational aspects of weighted distance-based rank aggregation. We present an aggregation method based on approximating weighted distance measures by a generalized version of Spearman’s footrule distance as well as a Markov chain method inspired by PageRank, where transition probabilities of the Markov chain reflect the chosen weighted distances.
IEEE Transactions on Information Theory 10/2014; 60(10):6417-6439. DOI:10.1109/TIT.2014.2345760
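The idea of a position-weighted footrule distance can be sketched in a few lines (the function name and the particular weight scheme are ours; the paper's axiomatically derived distances are more general):

```python
def weighted_footrule(sigma, pi, w):
    # Position-weighted Spearman footrule: sum_i w[i] * |sigma(i) - pi(i)|.
    # Decreasing weights w emphasize disagreements near the top of the
    # ranked lists, reflecting nonuniform relevance of positions.
    return sum(w[i] * abs(sigma[i] - pi[i]) for i in range(len(sigma)))

# Swapping the top two items (weights 4 and 3) costs more than
# swapping the bottom two (weights 2 and 1) under the same weights.
top_swap = weighted_footrule([0, 1, 2, 3], [1, 0, 2, 3], [4, 3, 2, 1])
bottom_swap = weighted_footrule([0, 1, 2, 3], [0, 1, 3, 2], [4, 3, 2, 1])
```

With uniform weights this reduces to the classical footrule, which is the unweighted baseline the generalized version extends.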
ABSTRACT: We consider the problem of synchronizing data in distributed storage networks
under an edit model that includes deletions and insertions. We present two
modifications of MDS, regenerating, and locally repairable codes that allow
updates in the parity-check values to be performed with one round of
communication at low bit rates and using small storage overhead. Our main
contributions are novel protocols for synchronizing both hot and semi-static
data and protocols for data deduplication applications, based on intermediary
permutation, Vandermonde and Cauchy matrix coding.
ABSTRACT: We present a new family of codes for non-uniformly quantized adder channels. Quantized adder channels are generalizations of group testing models, which were studied under the name of semi-quantitative group testing. We describe nonbinary group testing schemes in which the test matrices are generated by concatenating scaled disjunct codebooks, with the scaling parameters determined through lexicographical ordering constraints. In addition, we propose simple iterative decoding methods for one class of such codes.
ABSTRACT: Motivated by the problem of deducing the structure of proteins using mass-spectrometry, we study the reconstruction of a string from the multiset of its substring compositions. We specialize the backtracking algorithm used for the more general turnpike problem for string reconstruction. Employing well-known results about transience of random walks in ≥ 3 dimensions, we show that the algorithm reconstructs random strings over alphabet size ≥ 4 with high probability in near-optimal quadratic time.
2014 IEEE International Symposium on Information Theory (ISIT); 06/2014
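The input to the reconstruction problem, the multiset of substring compositions, is straightforward to generate; doing so also shows why reconstruction is only possible "up to reversal". A sketch (function names are ours):

```python
from collections import Counter

def composition(s):
    # Order-free composition of a string: its letter counts,
    # as a canonical sorted tuple so compositions are hashable.
    return tuple(sorted(Counter(s).items()))

def composition_multiset(s):
    # Multiset of compositions of all contiguous substrings of s.
    # A string and its reversal always produce the same multiset,
    # so no algorithm can distinguish the two.
    return Counter(composition(s[i:j])
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))
```

Reconstruction asks for the inverse map: recovering `s` (up to reversal) from `composition_multiset(s)` alone.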
ABSTRACT: We present a multiset rank modulation scheme capable of correcting translocation errors, motivated by the fact that compared to permutation codes, multipermutation codes offer higher rates and longer block lengths. We show that the appropriate distance measure for code construction is the Ulam metric applied to equivalence classes of permutations, where each permutation class corresponds to a multipermutation. The paper includes a study of multipermutation codes in the Hamming metric, also known as constant composition codes, due to their use in constructing multipermutation codes in the Ulam metric. We derive bounds on the size of multipermutation codes in both the Ulam metric and the Hamming metric, compute their capacity, and present constructions for codes in the Ulam metric based on permutation interleaving, semi-Latin squares, and resolvable Steiner systems.
2014 IEEE International Symposium on Information Theory (ISIT); 06/2014
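The Ulam distance between two sequences equals the length minus the length of their longest common subsequence, i.e. the minimum number of translocations turning one into the other. A sketch of that computation (applied here directly to sequences; for the multipermutation classes of the paper one additionally minimizes over class representatives, which is not shown):

```python
def lcs_length(p, q):
    # Classic O(mn) dynamic program for the longest common subsequence.
    m, n = len(p), len(q)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if p[i] == q[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def ulam_distance(p, q):
    # n minus the LCS length: minimum number of translocations
    # (delete one symbol, reinsert it elsewhere) mapping p to q.
    return len(p) - lcs_length(p, q)
```

For example, `[2, 3, 4, 1]` is one translocation away from `[1, 2, 3, 4]` (move the 1 to the front), and indeed their LCS has length 3.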
ABSTRACT: We address the problem of computing distances between rankings that take into account similarities between elements. The need for evaluating such distances arises in applications such as machine learning, social sciences, and data storage. The problem may be summarized as follows: Given two rankings and a positive cost function on transpositions that depends on the similarity of the elements involved, find a minimum-cost sequence of transpositions that converts one ranking into the other. Our focus is on costs that may be described via special tree structures and on rankings modeled as permutations. The presented results include a quadratic-time algorithm for finding a minimum-cost transform for a single cycle, and a linear-time, 5/3-approximation algorithm for permutations that contain multiple cycles.
2014 IEEE International Symposium on Information Theory (ISIT); 06/2014
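The unit-cost baseline of this problem is classical: the minimum number of transpositions sorting a permutation is $n$ minus the number of cycles in its cycle decomposition. A sketch of that baseline (with element-dependent transposition costs, as in the paper, the problem becomes substantially harder):

```python
def min_transpositions(sigma):
    # sigma is a permutation of {0, ..., n-1} given as a list.
    # Count cycles by walking each unvisited cycle once; the answer
    # is n minus the cycle count (a k-cycle needs k-1 transpositions).
    n = len(sigma)
    seen = [False] * n
    cycles = 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = sigma[j]
    return n - cycles
```

The 3-cycle `[1, 2, 0]` needs two transpositions, while the identity needs none; the weighted algorithms in the paper refine this cycle-by-cycle view.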
ABSTRACT: We propose a novel group testing method, termed semi-quantitative group testing, motivated by a class of problems arising in genome screening experiments. Semi-quantitative group testing (SQGT) is a (possibly) non-binary pooling scheme that may be viewed as a concatenation of an adder channel and an integer-valued quantizer. In its full generality, SQGT may be viewed as a unifying framework for group testing, in the sense that most group testing models are special instances of SQGT. For the new testing scheme, we define the notion of SQ-disjunct and SQ-separable codes, representing generalizations of classical disjunct and separable codes. We describe several combinatorial and probabilistic constructions for such codes. While for most of these constructions we assume that the number of defectives is much smaller than the total number of test subjects, we also consider the case in which there is no restriction on the number of defectives and they may be as large as the total number of subjects. For the codes constructed in this paper, we describe a number of efficient decoding algorithms. In addition, we describe a belief propagation decoder for sparse SQGT codes for which no other efficient decoder is currently known.
IEEE Transactions on Information Theory 05/2014; 60(8). DOI:10.1109/TIT.2014.2327630
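The "adder channel followed by a quantizer" view of an SQGT test is easy to simulate. A minimal sketch (the function name and the interval-index convention for outcomes are ours):

```python
import bisect

def sq_test_outcome(pool, defectives, thresholds):
    # Adder channel: the number of defective subjects in the pool.
    count = sum(1 for subject in pool if subject in defectives)
    # Quantizer with arbitrary (sorted) thresholds: return the index
    # of the interval the count falls into, so thresholds [1, 3] map
    # counts <= 1 to outcome 0, counts in {2, 3} to 1, counts >= 4 to 2.
    return bisect.bisect_left(thresholds, count) if count not in thresholds \
        else bisect.bisect_left(thresholds, count)

# Classical binary group testing is the special case thresholds = [0]:
# outcome 0 iff the pool contains no defectives.
```

The code design problem is then to choose pools so that the vector of quantized outcomes uniquely identifies the defective set.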
ABSTRACT: We introduce a novel probabilistic group testing framework, termed Poisson group testing, in which the number of defectives follows a right-truncated Poisson distribution. The Poisson model applies to a number of biological testing scenarios, where the subjects are assumed to be ordered based on their arrival times and where the probability of being defective decreases with time. Our main result is an information-theoretic upper bound on the minimum number of tests required to achieve an average probability of detection error asymptotically converging to zero.
Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP’14), Florence, Italy; 05/2014
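The right-truncated Poisson prior on the number of defectives is just a Poisson pmf renormalized over $\{0, \dots, m\}$. A sketch (the function name is ours):

```python
from math import factorial

def truncated_poisson_pmf(k, lam, m):
    # P(K = k) = (lam^k / k!) / sum_{j=0}^{m} (lam^j / j!)  for 0 <= k <= m,
    # i.e. a Poisson(lam) pmf right-truncated at m and renormalized.
    if not 0 <= k <= m:
        return 0.0
    norm = sum(lam ** j / factorial(j) for j in range(m + 1))
    return (lam ** k / factorial(k)) / norm
```

Truncation caps the defective count at the population size while keeping the Poisson shape, which is what makes the arrival-time model tractable.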
ABSTRACT: We introduce a novel algorithm for inference of causal gene interactions, termed CaSPIAN (Causal Subspace Pursuit for Inference and Analysis of Networks), which is based on coupling compressive sensing and Granger causality techniques. The core of the approach is to discover sparse linear dependencies between shifted time series of gene expressions using a sequential list-version of the subspace pursuit reconstruction algorithm and to estimate the direction of gene interactions via Granger-type elimination. The method is conceptually simple and computationally efficient, and it allows for dealing with noisy measurements. Its performance as a stand-alone platform without biological side information was tested on simulated networks, on the synthetic IRMA network in Saccharomyces cerevisiae, and on data pertaining to the human HeLa cell network and the SOS network in E. coli. The results produced by CaSPIAN are compared to the results of several related algorithms, demonstrating significant improvements in inference accuracy of documented interactions. These findings highlight the importance of Granger causality techniques for reducing the number of false-positives, as well as the influence of noise and sampling period on the accuracy of the estimates. In addition, the performance of the method was tested in conjunction with biological side information in the form of sparse "scaffold networks", to which new edges were added using available RNA-seq or microarray data. These biological priors aid in increasing the sensitivity and precision of the algorithm in the small sample regime.
PLoS ONE 03/2014; 9(3):e90781. DOI:10.1371/journal.pone.0090781
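The Granger-causality core of the approach rests on a simple question: does adding lagged values of a candidate source series reduce the error of predicting the target series? A bare-bones least-squares sketch (the function name is ours; CaSPIAN couples this idea with sparse subspace pursuit, which is not shown):

```python
import numpy as np

def granger_residuals(target, sources, lag=1):
    # Regress target[t] on the lag-shifted source series via least
    # squares and return the residual norm; a large drop when a
    # candidate source is included is Granger-style evidence that
    # it helps predict (and may causally influence) the target.
    T = len(target)
    X = np.column_stack([s[:T - lag] for s in sources])
    y = target[lag:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.linalg.norm(y - X @ coef))
```

If `y` is driven by lagged `x`, the residual with `x` as a source is near zero, while an unrelated series leaves most of the variance unexplained.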
ABSTRACT: Motivated by mass-spectrometry protein sequencing, we consider a
simply-stated problem of reconstructing a string from the multiset of its
substring compositions. We show that all strings of length 7, one less than a
prime, or one less than twice a prime, can be reconstructed uniquely up to
reversal. For all other lengths we show that reconstruction is not always
possible and provide sometimes-tight bounds on the largest number of strings
with given substring compositions. The lower bounds are derived by
combinatorial arguments and the upper bounds by algebraic considerations that
precisely characterize the set of strings with the same substring compositions
in terms of the factorization of bivariate polynomials. The problem can be
viewed as a combinatorial simplification of the turnpike problem, and its
solution may shed light on this long-standing problem as well. Using well-known
results on transience of multi-dimensional random walks, we also provide a
reconstruction algorithm that reconstructs random strings over alphabets of
size $\ge4$ in optimal near-quadratic time.