Olgica Milenkovic

University of Illinois, Urbana-Champaign, Urbana, Illinois, United States

Publications (145) · 115.98 Total Impact

  •
    ABSTRACT: We consider the problem of synchronizing coded data in distributed storage networks undergoing insertion and deletion edits. We present modifications of distributed storage codes that allow updates in the parity-check values to be performed with one round of communication at low bit rates and with small storage overhead. Our main contributions are novel protocols for synchronizing frequently updated and semi-static data based on functional intermediary coding involving permutation and Vandermonde matrices.
    Article · Dec 2015 · IEEE/ACM Transactions on Networking
  •
    ABSTRACT: We provide an overview of current approaches to DNA-based storage system design and accompanying synthesis, sequencing and editing methods. We also introduce and analyze a suite of new constrained coding schemes for both archival and random access DNA storage channels. The mathematical basis of our work is the construction and design of sequences over discrete alphabets that avoid pre-specified address patterns, have balanced base content, and exhibit other relevant substring constraints. These schemes adapt the stored signals to the DNA medium and thereby reduce the inherent error-rate of the system.
    Article · Jul 2015
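The constrained coding constraints described above can be illustrated with a minimal sketch. The function name, the set-membership check, and the ±10% balance tolerance are illustrative assumptions, not details from the abstract: a block is admissible if it avoids all pre-specified address patterns and keeps its GC content near one half.

```python
def is_admissible(block, address_patterns, gc_tol=0.1):
    """Check two of the substring constraints mentioned above:
    the block must avoid every pre-specified address pattern and
    have balanced base content (GC fraction within gc_tol of 1/2)."""
    if any(p in block for p in address_patterns):
        return False
    gc_fraction = sum(base in "GC" for base in block) / len(block)
    return abs(gc_fraction - 0.5) <= gc_tol
```

For example, `is_admissible("ACGTGACT", {"AAA"})` accepts a balanced block, while a run of a single base fails the GC-balance check.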
  • Gregory J. Puleo · Olgica Milenkovic
    ABSTRACT: We introduce a new agnostic clustering method: minimax correlation clustering. Given a graph whose edges are labeled with $+$ or $-$, we wish to partition the graph into clusters while trying to avoid errors: $+$ edges between clusters or $-$ edges within clusters. Unlike classical correlation clustering, which seeks to minimize the total number of errors, minimax clustering instead seeks to minimize the number of errors at the worst vertex, that is, at the vertex with the greatest number of incident errors. This minimax objective function may be seen as a way to enforce individual-level quality of partition constraints for vertices in a graph. We study this problem on complete graphs and complete bipartite graphs, proving that the problem is NP-hard on these graph classes and giving polynomial-time constant-factor approximation algorithms. The approximation algorithms rely on LP relaxation and rounding procedures.
    Article · Jun 2015
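The minimax objective described above is straightforward to state in code. The sketch below (function name and edge representation are ours, not from the paper) evaluates, for a candidate partition of a signed graph, the number of errors incident to each vertex and returns the worst value.

```python
def minimax_errors(edges, cluster_of):
    """Worst-vertex error count for a signed-graph partition.

    edges: dict mapping a vertex pair (u, v) to '+' or '-'.
    cluster_of: dict mapping each vertex to its cluster id.
    An error is a '+' edge across clusters or a '-' edge inside one.
    """
    errors = {v: 0 for v in cluster_of}
    for (u, v), sign in edges.items():
        same = cluster_of[u] == cluster_of[v]
        if (sign == '+') != same:      # disagreement charged to both endpoints
            errors[u] += 1
            errors[v] += 1
    return max(errors.values())
```

Minimizing this quantity over partitions, rather than the total error count, is what distinguishes the minimax problem from classical correlation clustering.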
  • Ryan Gabrys · Han Mao Kiah · Olgica Milenkovic
    ABSTRACT: We continue our study of a new family of asymmetric Lee codes that arise in the design and implementation of emerging DNA-based storage systems and systems which use parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes, and have symbol distances dictated by their underlying binary representation. Our contributions include deriving new bounds for the size of the largest code in this metric based on Delsarte-like linear programming methods and describing new constructions for non-linear asymmetric Lee codes.
    Article · Jun 2015
  • Ryan Gabrys · Han Mao Kiah · Olgica Milenkovic
    ABSTRACT: We consider a new family of codes, termed asymmetric Lee distance codes, that arise in the design and implementation of DNA-based storage systems and systems with parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes; furthermore, symbol confusability is dictated by their underlying binary representation. Our contributions are two-fold. First, we demonstrate that the new distance represents a linear combination of the Lee and Hamming distance and derive upper bounds on the size of the codes under this metric based on linear programming techniques. Second, we propose a number of code constructions which imply lower bounds.
    Article · Jun 2015
  •
    ABSTRACT: We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh density archival and rewritable storage applications.
    Article · May 2015 · Scientific Reports
  • Han Mao Kiah · Gregory J. Puleo · Olgica Milenkovic
    ABSTRACT: We consider the problem of storing and retrieving information from synthetic DNA media. The mathematical basis of the problem is the construction and design of sequences that may be discriminated based on their collection of substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first investigated in the noiseless setting under the name of "Markov type" analysis. Here, we explain the connection between the reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes.
    Article · Feb 2015
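The "collection of substrings observed" that the abstract refers to can be modeled, in the noiseless case, as the multiset of length-ℓ substrings of a sequence. A minimal sketch (the function name is ours):

```python
from collections import Counter

def substring_profile(seq, ell):
    """Multiset (profile) of length-ell substrings of seq. Two sequences
    with identical profiles are indistinguishable to a noiseless observer
    who only sees substring counts, and hence fall in the same
    equivalence class of the storage channel."""
    return Counter(seq[i:i + ell] for i in range(len(seq) - ell + 1))
```

Counting how many sequences share a profile is exactly the equivalence-class enumeration problem the abstract analyzes.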
  • Minji Kim · Farzad Farnoud · Olgica Milenkovic
    ABSTRACT: Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold: a) first, based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. Within this context, we quantify the influence of the number of training genes and similarity criteria on the diagnostic quality of the aggregate and perform in-depth cross-validation studies; b) second, we propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-vs-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list, rather than at the bottom, since only top candidates are tested experimentally; c) third, we propose an iterative procedure for gene discovery that operates via successive augmentation of the set of training genes by genes discovered in previous rounds, checked for consistency. Fundamental results from social choice theory, political and computer sciences, and statistics have shown that there exists no consistent, fair and unique way to aggregate rankings. Instead, one has to decide on an aggregation approach using a predefined set of desirable properties for the aggregate. The aggregation methods fall into two categories, score-based and distance-based approaches, each of which has its own drawbacks and advantages.
    This work is motivated by the observation that by merging these two techniques in a computationally efficient manner and incorporating additional constraints, one can ensure that the predictive quality of the resulting aggregation algorithm is very high. We tested HyDRA on a number of gene sets, including Autism, Breast cancer, Colorectal cancer, Endometriosis, Ischemic stroke, Leukemia, Lymphoma, and Osteoarthritis. Furthermore, we performed iterative gene discovery for Glioblastoma, Meningioma and Breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. The methods outperform state-of-the-art software tools such as ToppGene and Endeavour. Despite this finding, we recommend as best practice to take the union of top-ranked items produced by different methods for the final aggregated list. Availability: The HyDRA software may be downloaded from: https://dl.dropboxusercontent.com/u/40200227/HyDRAsoftware.zip CONTACT: mkim158@illinois.edu.
    Article · Nov 2014 · Bioinformatics
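The score-based family of aggregation methods contrasted above can be illustrated with a Borda-style aggregate, a standard textbook method and not the HyDRA algorithm itself:

```python
def borda_aggregate(rankings):
    """Score-based rank aggregation: each item earns points equal to the
    number of items ranked below it in each input ranking; the aggregate
    orders items by total score (alphabetical tie-break for determinism)."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Distance-based methods instead seek a ranking minimizing total distance to the inputs; hybrid schemes such as HyDRA aim to combine the strengths of both families.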
  • Gregory J. Puleo · Olgica Milenkovic
    ABSTRACT: We consider the problem of correlation clustering on graphs with constraints on both the cluster sizes and the positive and negative weights of edges. Our contributions are twofold: First, we introduce the problem of correlation clustering with bounded cluster sizes. Second, we extend the region of weight values for which the clustering may be performed with constant approximation guarantees in polynomial time and apply the results to the bounded cluster size problem.
    Article · Nov 2014 · SIAM Journal on Optimization
  • Han Mao Kiah · Gregory J. Puleo · Olgica Milenkovic
    ABSTRACT: We consider the problem of assembling a sequence based on a collection of its substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first investigated in the noiseless setting under the name of "Markov type" analysis. Here, we explain the connection between the problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis noise. In our analysis, we make use of Ehrhart theory for rational polytopes.
    Article · Oct 2014
  • Amin Emad · Olgica Milenkovic
    ABSTRACT: We analyze a new group testing system, termed semi-quantitative group testing, which may be viewed as a concatenation of an adder channel and a discrete quantizer. Our focus is on non-uniform quantizers with arbitrary thresholds. For a given choice of parameters for the semi-quantitative group testing model, we define three new families of sequences capturing the constraints on the code design imposed by the choice of the thresholds. The sequences represent extensions and generalizations of B_h sequences and certain types of super-increasing and lexicographically ordered sequences, and they lead to code structures amenable to efficient recursive decoding. We describe the decoding methods and provide an accompanying computational complexity and performance analysis.
    Article · Oct 2014 · IEEE Transactions on Information Theory
  • Farzad Farnoud (Hassanzadeh) · Olgica Milenkovic
    ABSTRACT: We propose a new family of distance measures on rankings, derived through an axiomatic approach, that consider the nonuniform relevance of the top and bottom of ordered lists and similarities between candidates. The proposed distance functions include specialized weighted versions of the Kendall τ distance and the Cayley distance, and are suitable for comparing rankings in a number of applications, including information retrieval and rank aggregation. In addition to proposing the distance measures and providing the theoretical underpinnings for their applications, we also analyze algorithmic and computational aspects of weighted distance-based rank aggregation. We present an aggregation method based on approximating weighted distance measures by a generalized version of Spearman’s footrule distance as well as a Markov chain method inspired by PageRank, where transition probabilities of the Markov chain reflect the chosen weighted distances.
    Article · Oct 2014 · IEEE Transactions on Information Theory
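One simple instantiation of a positionally weighted Kendall-type distance, shown here only to illustrate the idea and not the exact axiomatic construction of the paper, charges each discordant pair the average weight of its positions in the reference ranking:

```python
def weighted_kendall(ref, other, weights):
    """Weighted discordant-pair count between two rankings of the same
    items. A pair at positions (i, j) of `ref` that appears in the
    opposite order in `other` contributes (weights[i] + weights[j]) / 2;
    with unit weights this reduces to the classical Kendall tau distance."""
    position = {item: k for k, item in enumerate(other)}
    total = 0.0
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            if position[ref[i]] > position[ref[j]]:
                total += (weights[i] + weights[j]) / 2
    return total
```

Choosing weights that decay with position makes swaps near the top of the list cost more, capturing the nonuniform relevance the abstract describes.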
  •
    ABSTRACT: We consider the problem of synchronizing data in distributed storage networks under an edit model that includes deletions and insertions. We present two modifications of MDS, regenerating and locally repairable codes that allow updates in the parity-check values to be performed with one round of communication at low bit rates and using small storage overhead. Our main contributions are novel protocols for synchronizing both hot and semi-static data and protocols for data deduplication applications, based on intermediary permutation, Vandermonde and Cauchy matrix coding.
    Article · Sep 2014
  • Amin Emad · Olgica Milenkovic
    ABSTRACT: We present a new family of codes for non-uniformly quantized adder channels. Quantized adder channels are generalizations of group testing models, which were studied under the name of semi-quantitative group testing. We describe nonbinary group testing schemes in which the test matrices are generated by concatenating scaled disjunct codebooks, with the scaling parameters determined through lexicographical ordering constraints. In addition, we propose simple iterative decoding methods for one class of such codes.
    Conference Paper · Jul 2014
  •
    ABSTRACT: Motivated by the problem of deducing the structure of proteins using mass-spectrometry, we study the reconstruction of a string from the multiset of its substring compositions. We specialize the backtracking algorithm used for the more general turnpike problem for string reconstruction. Employing well known results about transience of random walks in ≥ 3 dimensions, we show that the algorithm reconstructs random strings over alphabet size ≥ 4 with high probability in near-optimal quadratic time.
    Conference Paper · Jun 2014
  • Farzad Farnoud · Olgica Milenkovic
    ABSTRACT: We present a multiset rank modulation scheme capable of correcting translocation errors, motivated by the fact that compared to permutation codes, multipermutation codes offer higher rates and longer block lengths. We show that the appropriate distance measure for code construction is the Ulam metric applied to equivalence classes of permutations, where each permutation class corresponds to a multipermutation. The paper includes a study of multipermutation codes in the Hamming metric, also known as constant composition codes, due to their use in constructing multipermutation codes in the Ulam metric. We derive bounds on the size of multipermutation codes in both the Ulam metric and the Hamming metric, compute their capacity, and present constructions for codes in the Ulam metric based on permutation interleaving, semi-Latin squares, and resolvable Steiner systems.
    Conference Paper · Jun 2014
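The Ulam metric underlying these constructions equals n minus the length of the longest common subsequence of the two permutations, which counts the minimum number of translocations between them. A sketch via patience-sorting LIS (the function name is ours):

```python
from bisect import bisect_left

def ulam_distance(p, q):
    """Ulam distance between permutations p and q of the same items:
    n minus their longest common subsequence, i.e. the minimum number of
    translocations (delete one symbol, reinsert it anywhere) turning p
    into q. Computed as a longest increasing subsequence after relabeling
    p by each item's position in q."""
    index_in_q = {item: k for k, item in enumerate(q)}
    relabeled = [index_in_q[item] for item in p]
    tails = []                         # patience sorting for LIS
    for value in relabeled:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(p) - len(tails)
```

For multipermutation codes the same idea is applied to equivalence classes of permutations rather than to single permutations.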
  • Lili Su · Farzad Farnoud · Olgica Milenkovic
    ABSTRACT: We address the problem of computing distances between rankings that take into account similarities between elements. The need for evaluating such distances arises in applications such as machine learning, social sciences and data storage. The problem may be summarized as follows: Given two rankings and a positive cost function on transpositions that depends on the similarity of the elements involved, find a smallest cost sequence of transpositions that converts one ranking into another. Our focus is on costs that may be described via special tree structures and on rankings modeled as permutations. The presented results include a quadratic-time algorithm for finding a minimum cost transform for a single cycle; and a linear time, 5/3-approximation algorithm for permutations that contain multiple cycles.
    Conference Paper · Jun 2014
  • Amin Emad · Olgica Milenkovic
    ABSTRACT: We propose a novel group testing method, termed semi-quantitative group testing, motivated by a class of problems arising in genome screening experiments. Semi-quantitative group testing (SQGT) is a (possibly) non-binary pooling scheme that may be viewed as a concatenation of an adder channel and an integer-valued quantizer. In its full generality, SQGT may be viewed as a unifying framework for group testing, in the sense that most group testing models are special instances of SQGT. For the new testing scheme, we define the notion of SQ-disjunct and SQ-separable codes, representing generalizations of classical disjunct and separable codes. We describe several combinatorial and probabilistic constructions for such codes. While for most of these constructions we assume that the number of defectives is much smaller than the total number of test subjects, we also consider the case in which there is no restriction on the number of defectives and they may be as large as the total number of subjects. For the codes constructed in this paper, we describe a number of efficient decoding algorithms. In addition, we describe a belief propagation decoder for sparse SQGT codes for which no other efficient decoder is currently known.
    Article · May 2014 · IEEE Transactions on Information Theory
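The adder-channel-plus-quantizer view of SQGT can be simulated directly. The names and the threshold convention below are illustrative assumptions rather than notation from the paper:

```python
def sqgt_outcome(pool, amount_of, thresholds):
    """One SQGT test: the adder channel sums the amounts contributed by
    the pooled subjects (0 for non-defectives), and the non-uniform
    quantizer reports how many thresholds the sum reaches or exceeds."""
    total = sum(amount_of.get(subject, 0) for subject in pool)
    return sum(total >= t for t in thresholds)
```

With a single threshold of 1 and unit amounts this collapses to classical binary group testing, illustrating the unifying-framework claim in the abstract.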
  • Amin Emad · Olgica Milenkovic
    ABSTRACT: We introduce a novel probabilistic group testing framework, termed Poisson group testing, in which the number of defectives follows a right-truncated Poisson distribution. The Poisson model applies to a number of biological testing scenarios, where the subjects are assumed to be ordered based on their arrival times and where the probability of being defective decreases with time. Our main result is an information-theoretic upper bound on the minimum number of tests required to achieve an average probability of detection error asymptotically converging to zero.
    Conference Paper · May 2014
  • Amin Emad · Olgica Milenkovic
    ABSTRACT: We introduce a novel algorithm for inference of causal gene interactions, termed CaSPIAN (Causal Subspace Pursuit for Inference and Analysis of Networks), which is based on coupling compressive sensing and Granger causality techniques. The core of the approach is to discover sparse linear dependencies between shifted time series of gene expressions using a sequential list-version of the subspace pursuit reconstruction algorithm and to estimate the direction of gene interactions via Granger-type elimination. The method is conceptually simple and computationally efficient, and it allows for dealing with noisy measurements. Its performance as a stand-alone platform without biological side information was tested on simulated networks, on the synthetic IRMA network in Saccharomyces cerevisiae, and on data pertaining to the human HeLa cell network and the SOS network in E. coli. The results produced by CaSPIAN are compared to the results of several related algorithms, demonstrating significant improvements in inference accuracy of documented interactions. These findings highlight the importance of Granger causality techniques for reducing the number of false positives, as well as the influence of noise and sampling period on the accuracy of the estimates. In addition, the performance of the method was tested in conjunction with biological side information in the form of sparse "scaffold networks", to which new edges were added using available RNA-seq or microarray data. These biological priors aid in increasing the sensitivity and precision of the algorithm in the small sample regime.
    Article · Mar 2014 · PLoS ONE

Publication Stats

2k Citations
115.98 Total Impact Points

Institutions

  • 2008-2015
    • University of Illinois, Urbana-Champaign
      • Department of Electrical and Computer Engineering
      • Coordinated Science Laboratory
      Urbana, Illinois, United States
  • 2014
    • Yahoo! Labs
      Sunnyvale, California, United States
  • 2007
    • Texas A&M University
      College Station, Texas, United States
    • University of California, San Diego
      San Diego, California, United States
  • 2003-2007
    • University of Colorado at Boulder
      • Department of Electrical, Computer, and Energy Engineering (ECEE)
      Boulder, Colorado, United States
  • 2006
    • University of Colorado
      Denver, Colorado, United States
  • 2001-2002
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, Michigan, United States
  • 2000
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States