Olgica Milenkovic

University of Massachusetts Amherst, Amherst Center, MA, United States

Publications (135) · 98.03 Total impact

  • Minji Kim, Farzad Farnoud, Olgica Milenkovic
    ABSTRACT: Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold: a) first, based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. Within this context, we quantify the influence of the number of training genes and similarity criteria on the diagnostic quality of the aggregate and perform in-depth cross-validation studies; b) second, we propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-vs-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list, rather than at the bottom, since only top candidates are tested experimentally; c) third, we propose an iterative procedure for gene discovery that operates via successful augmentation of the set of training genes by genes discovered in previous rounds, checked for consistency. Fundamental results from social choice theory, political and computer sciences, and statistics have shown that there exists no consistent, fair and unique way to aggregate rankings. Instead, one has to decide on an aggregation approach using a predefined set of desirable properties for the aggregate. The aggregation methods fall into two categories, score-based and distance-based approaches, each of which has its own drawbacks and advantages. This work is motivated by the observation that, by merging these two techniques in a computationally efficient manner and incorporating additional constraints, one can ensure that the predictive quality of the resulting aggregation algorithm is very high. We tested HyDRA on a number of gene sets, including Autism, Breast cancer, Colorectal cancer, Endometriosis, Ischemic stroke, Leukemia, Lymphoma, and Osteoarthritis. Furthermore, we performed iterative gene discovery for Glioblastoma, Meningioma and Breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. The methods outperform state-of-the-art software tools such as ToppGene and Endeavour. Despite this finding, we recommend, as a best practice, taking the union of top-ranked items produced by different methods for the final aggregated list. Availability: The HyDRA software may be downloaded from: https://dl.dropboxusercontent.com/u/40200227/HyDRAsoftware.zip Contact: mkim158@illinois.edu. © The Author (2014). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
    Bioinformatics (Oxford, England). 11/2014;
  • Gregory J. Puleo, Olgica Milenkovic
    ABSTRACT: We consider the problem of correlation clustering on graphs with constraints on both the cluster sizes and the positive and negative weights of edges. Our contributions are twofold: First, we introduce the problem of correlation clustering with bounded cluster sizes. Second, we extend the region of weight values for which the clustering may be performed with constant approximation guarantees in polynomial time and apply the results to the bounded cluster size problem.
    11/2014;
  • Han Mao Kiah, Gregory J. Puleo, Olgica Milenkovic
    ABSTRACT: We consider the problem of assembling a sequence based on a collection of its substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first investigated in the noiseless setting under the name of "Markov type" analysis. Here, we explain the connection between the problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis noise. In our analysis, we make use of Ehrhart theory for rational polytopes.
    10/2014;
  • Amin Emad, Olgica Milenkovic
    ABSTRACT: We analyze a new group testing system, termed semi-quantitative group testing, which may be viewed as a concatenation of an adder channel and a discrete quantizer. Our focus is on non-uniform quantizers with arbitrary thresholds. For a given choice of parameters for the semi-quantitative group testing model, we define three new families of sequences capturing the constraints on the code design imposed by the choice of the thresholds. The sequences represent extensions and generalizations of Bh and certain types of super-increasing and lexicographically ordered sequences, and they lead to code structures amenable for efficient recursive decoding. We describe the decoding methods and provide an accompanying computational complexity and performance analysis.
    10/2014;
  • F. Farnoud Hassanzadeh, O. Milenkovic
    ABSTRACT: We propose a new family of distance measures on rankings, derived through an axiomatic approach, that consider the nonuniform relevance of the top and bottom of ordered lists and similarities between candidates. The proposed distance functions include specialized weighted versions of the Kendall τ distance and the Cayley distance, and are suitable for comparing rankings in a number of applications, including information retrieval and rank aggregation. In addition to proposing the distance measures and providing the theoretical underpinnings for their applications, we also analyze algorithmic and computational aspects of weighted distance-based rank aggregation. We present an aggregation method based on approximating weighted distance measures by a generalized version of Spearman’s footrule distance as well as a Markov chain method inspired by PageRank, where transition probabilities of the Markov chain reflect the chosen weighted distances.
    IEEE Transactions on Information Theory 10/2014; 60(10):6417-6439. · 2.62 Impact Factor
  • ABSTRACT: We consider the problem of synchronizing data in distributed storage networks under an edit model that includes deletions and insertions. We present two modifications of MDS, regenerating and locally repairable codes that allow updates in the parity-check values to be performed with one round of communication at low bit rates and using small storage overhead. Our main contributions are novel protocols for synchronizing both hot and semi-static data and protocols for data deduplication applications, based on intermediary permutation, Vandermonde and Cauchy matrix coding.
    09/2014;
  • Amin Emad, Olgica Milenkovic
    ABSTRACT: We present a new family of codes for non-uniformly quantized adder channels. Quantized adder channels are generalizations of group testing models, which were studied under the name of semi-quantitative group testing. We describe nonbinary group testing schemes in which the test matrices are generated by concatenating scaled disjunct codebooks, with the scaling parameters determined through lexicographical ordering constraints. In addition, we propose simple iterative decoding methods for one class of such codes.
    IEEE Int. Symp. Inf. Theory (ISIT’14); 07/2014
  • Farzad Farnoud, Olgica Milenkovic
    ABSTRACT: We present a multiset rank modulation scheme capable of correcting translocation errors, motivated by the fact that compared to permutation codes, multipermutation codes offer higher rates and longer block lengths. We show that the appropriate distance measure for code construction is the Ulam metric applied to equivalence classes of permutations, where each permutation class corresponds to a multipermutation. The paper includes a study of multipermutation codes in the Hamming metric, also known as constant composition codes, due to their use in constructing multipermutation codes in the Ulam metric. We derive bounds on the size of multipermutation codes in both the Ulam metric and the Hamming metric, compute their capacity, and present constructions for codes in the Ulam metric based on permutation interleaving, semi-Latin squares, and resolvable Steiner systems.
    2014 IEEE International Symposium on Information Theory (ISIT); 06/2014
  • Lili Su, Farzad Farnoud, Olgica Milenkovic
    ABSTRACT: We address the problem of computing distances between rankings that take into account similarities between elements. The need for evaluating such distances arises in applications such as machine learning, social sciences and data storage. The problem may be summarized as follows: Given two rankings and a positive cost function on transpositions that depends on the similarity of the elements involved, find a smallest cost sequence of transpositions that converts one ranking into another. Our focus is on costs that may be described via special tree structures and on rankings modeled as permutations. The presented results include a quadratic-time algorithm for finding a minimum cost transform for a single cycle; and a linear time, 5/3-approximation algorithm for permutations that contain multiple cycles.
    2014 IEEE International Symposium on Information Theory (ISIT); 06/2014
  • Amin Emad, Olgica Milenkovic
    ABSTRACT: We propose a novel group testing method, termed semi-quantitative group testing, motivated by a class of problems arising in genome screening experiments. Semi-quantitative group testing (SQGT) is a (possibly) non-binary pooling scheme that may be viewed as a concatenation of an adder channel and an integer-valued quantizer. In its full generality, SQGT may be viewed as a unifying framework for group testing, in the sense that most group testing models are special instances of SQGT. For the new testing scheme, we define the notion of SQ-disjunct and SQ-separable codes, representing generalizations of classical disjunct and separable codes. We describe several combinatorial and probabilistic constructions for such codes. While for most of these constructions we assume that the number of defectives is much smaller than the total number of test subjects, we also consider the case in which there is no restriction on the number of defectives and it may be as large as the total number of subjects. For the codes constructed in this paper, we describe a number of efficient decoding algorithms. In addition, we describe a belief propagation decoder for sparse SQGT codes for which no other efficient decoder is currently known.
    IEEE Transactions on Information Theory 05/2014; · 2.62 Impact Factor
  • Amin Emad, Olgica Milenkovic
    ABSTRACT: We introduce a novel probabilistic group testing framework, termed Poisson group testing, in which the number of defectives follows a right-truncated Poisson distribution. The Poisson model applies to a number of biological testing scenarios, where the subjects are assumed to be ordered based on their arrival times and where the probability of being defective decreases with time. Our main result is an information-theoretic upper bound on the minimum number of tests required to achieve an average probability of detection error asymptotically converging to zero.
    Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP’14), Florence, Italy; 05/2014
  • ABSTRACT: Motivated by mass-spectrometry protein sequencing, we consider a simply-stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime, can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with given substring compositions. The lower bounds are derived by combinatorial arguments and the upper bounds by algebraic considerations that precisely characterize the set of strings with the same substring compositions in terms of the factorization of bivariate polynomials. The problem can be viewed as a combinatorial simplification of the turnpike problem, and its solution may shed light on this long-standing problem as well. Using well-known results on transience of multi-dimensional random walks, we also provide a reconstruction algorithm that reconstructs random strings over alphabets of size $\ge4$ in optimal near-quadratic time.
    03/2014;
  • Lili Su, Olgica Milenkovic
    ABSTRACT: We consider the problem of exact synchronization of two rankings at remote locations connected by a two-way channel. Such synchronization problems arise when items in the data are distinguishable, as is the case for playlists, tasklists, crowdvotes and recommender systems rankings. Our model accounts for different constraints on the communication throughput of the forward and feedback links, resulting in different anchoring, syndrome and checksum computation strategies. Edits are assumed to take the form of deletions, insertions, block deletions/insertions, translocations and transpositions. The protocols developed under the given model are order-optimal with respect to genie-aided lower bounds.
    01/2014;
  • Amin Emad, Olgica Milenkovic
    ABSTRACT: We introduce a novel algorithm for inference of causal gene interactions, termed CaSPIAN (Causal Subspace Pursuit for Inference and Analysis of Networks), which is based on coupling compressive sensing and Granger causality techniques. The core of the approach is to discover sparse linear dependencies between shifted time series of gene expressions using a sequential list-version of the subspace pursuit reconstruction algorithm and to estimate the direction of gene interactions via Granger-type elimination. The method is conceptually simple and computationally efficient, and it allows for dealing with noisy measurements. Its performance as a stand-alone platform without biological side-information was tested on simulated networks, on the synthetic IRMA network in Saccharomyces cerevisiae, and on data pertaining to the human HeLa cell network and the SOS network in E. coli. The results produced by CaSPIAN are compared to the results of several related algorithms, demonstrating significant improvements in inference accuracy of documented interactions. These findings highlight the importance of Granger causality techniques for reducing the number of false-positives, as well as the influence of noise and sampling period on the accuracy of the estimates. In addition, the performance of the method was tested in conjunction with biological side information of the form of sparse "scaffold networks", to which new edges were added using available RNA-seq or microarray data. These biological priors aid in increasing the sensitivity and precision of the algorithm in the small sample regime.
    PLoS ONE 01/2014; 9(3):e90781. · 3.53 Impact Factor
  • Farzad Farnoud, Olgica Milenkovic
    ABSTRACT: We address the problem of multipermutation code design in the Ulam metric for novel storage applications. Multipermutation codes are suitable for flash memory where cell charges may share the same rank. Changes in the charges of cells manifest themselves as errors whose effects on the retrieved signal may be measured via the Ulam distance. As part of our analysis, we study multipermutation codes in the Hamming metric, known as constant composition codes. We then present bounds on the size of multipermutation codes and their capacity, for both the Ulam and the Hamming metrics. Finally, we present constructions and accompanying decoders for multipermutation codes in the Ulam metric.
    IEEE Journal on Selected Areas in Communications 12/2013; 32(5). · 3.12 Impact Factor
  • ABSTRACT: We introduce a parallel algorithmic architecture for metagenomic sequence assembly, termed MetaPar, which allows for significant reductions in assembly time and consequently enables the processing of large genomic datasets on computers with low memory usage. The gist of the approach is to iteratively perform read (re)classification based on phylogenetic marker genes and assembler outputs generated from random subsets of metagenomic reads. Once a sufficiently accurate classification within genera is performed, de novo metagenomic assemblers (such as Velvet or IDBA-UD) or reference-based assemblers may be used for contig construction. We analyze the performance of MetaPar on synthetic data consisting of 15 randomly chosen species from the NCBI database through the effective gap and effective coverage metrics.
    11/2013;
  • ABSTRACT: Metagenomics is an emerging field of molecular biology concerned with analyzing the genomes of environmental samples comprising many diverse organisms. Given the nature of metagenomic data, one usually has to sequence the genomic material of all organisms in a batch, leading to a mix of reads coming from different DNA sequences. In deep high-throughput sequencing experiments, the volume of the raw reads is extremely high, frequently exceeding 600 Gb. With an ever-increasing demand for storing such reads for future studies, the issue of efficient metagenomic compression becomes of paramount importance. We present the first known approach to metagenome read compression, termed MCUIUC (Metagenomic Compression at UIUC). The gist of the proposed algorithm is to perform classification of reads based on unique organism identifiers, followed by reference-based alignment of reads for individually identified organisms, and metagenomic assembly of unclassified reads. Once assembly and classification are completed, lossless reference-based compression is performed via positional encoding. We evaluate the performance of the algorithm on moderate-sized synthetic metagenomic samples involving 15 randomly selected organisms and describe future directions for improving the proposed compression method.
    11/2013;
  • Lili Su, Farzad Farnoud, Olgica Milenkovic
    ABSTRACT: We address the problem of computing distances between rankings that take into account similarities between candidates. The need for evaluating such distances is governed by applications as diverse as rank aggregation, bioinformatics, social sciences and data storage. The problem may be summarized as follows. Given two rankings and a positive cost function on transpositions that depends on the similarity of the candidates involved, find a smallest cost sequence of transpositions that converts one ranking into another. Our focus is on costs that may be described via special metric-tree structures and on full rankings modeled as permutations. The presented results include a quadratic-time algorithm for finding a minimum cost transform for simple cycles; and a linear time, 5/3-approximation algorithm for permutations that contain multiple cycles. In addition, for permutations with digraphs represented by non-intersecting cycles embedded in trees, we present a polynomial-time transform algorithm. The proposed methods rely on investigating a newly introduced balancing property of cycles embedded in trees, cycle-merging methods, and shortest-path optimization techniques.
    07/2013;
  • B. Touri, F. Farnoud, A. Nedic, O. Milenkovic
    ABSTRACT: We present a general model for opinion dynamics in a social network together with several possibilities for object selections at times when the agents are communicating. We study the limiting behavior of such dynamics and show that they converge almost surely. We consider some special implications of the convergence result for gossip and top-k selective gossip models. In particular, holds in a general setting. Moreover, we propose an extension of the gossip and top-k selective gossip models, and provide some results for their limiting behavior.
    American Control Conference (ACC), 2013; 01/2013
  • ABSTRACT: In networked systems comprised of many agents, it is often required to reach a common operating point of all agents, termed the network consensus. We consider two iterative methods for reaching a ranking (ordering) consensus over a voter network, where the initial preference of every voter is of the form of a full ranking of candidates. The voters are allowed, one at a time and based on some random scheme, to change their votes to bring them “closer” to the opinions of selected subsets of peers. The first consensus method is based on changing votes one adjacent swap at a time; the second method is based on changing votes via averaging with the votes of peers, potentially leading to many adjacent swaps at a given time. For the first model, we characterize convergence points and conditions for convergence. For the second model, we prove convergence to a global ranking and derive the rate of convergence to this consensus.
    Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on; 01/2013
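
Several abstracts in the list above refer to distances between rankings, in particular the Kendall τ distance and Spearman's footrule. As a rough illustration only, the sketch below computes the classical, unweighted versions of both for full rankings over the same items. The weighted and similarity-aware variants proposed in the papers are not reproduced here, and the function names are hypothetical choices for this example.

```python
from itertools import combinations


def kendall_tau_distance(a, b):
    """Count item pairs that rankings a and b place in opposite order.

    Classical (unweighted) Kendall tau distance; assumes a and b are
    full rankings (permutations) of the same set of items.
    """
    pos_b = {item: i for i, item in enumerate(b)}
    # Express ranking a through positions in b, then count inversions.
    seq = [pos_b[item] for item in a]
    return sum(1 for i, j in combinations(range(len(seq)), 2) if seq[i] > seq[j])


def spearman_footrule(a, b):
    """Sum over items of the absolute difference in rank position."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(abs(pos_a[x] - pos_b[x]) for x in pos_a)


if __name__ == "__main__":
    r1 = ["A", "B", "C", "D"]
    r2 = ["B", "A", "D", "C"]
    print(kendall_tau_distance(r1, r2))  # two discordant pairs: (A, B) and (C, D)
    print(spearman_footrule(r1, r2))     # every item moves by one position
```

The pairwise count is quadratic in the number of items, which keeps the sketch short; a merge-sort based inversion count would be the usual choice for long rankings.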

Publication Stats

1k Citations
98.03 Total Impact Points

Institutions

  • 2010
    • University of Massachusetts Amherst
      • School of Computer Science
      Amherst Center, MA, United States
  • 2008–2010
    • University of Illinois, Urbana-Champaign
      • Department of Electrical and Computer Engineering
      • Coordinated Science Laboratory
      Urbana, IL, United States
  • 2009
    • University of Hawaiʻi at Mānoa
      • Department of Electrical Engineering
      Honolulu, Hawaii, United States
  • 2008–2009
    • Technische Universität München
      • Lehrstuhl für Nachrichtentechnik LNT
      München, Bavaria, Germany
  • 2007
    • Texas A&M University
      College Station, Texas, United States
    • University of California, San Diego
      San Diego, California, United States
    • Rice University
      • Department of Electrical and Computer Engineering
      Houston, Texas, United States
  • 2003–2007
    • University of Colorado at Boulder
      • Department of Electrical, Computer, and Energy Engineering (ECEE)
      Boulder, CO, United States
  • 2006
    • University of Colorado
      Denver, Colorado, United States
  • 2005
    • University of Bristol
      • Department of Electrical and Electronic Engineering
      Bristol, ENG, United Kingdom
  • 2004
    • The University of Arizona
      • Department of Electrical and Computer Engineering
      Tucson, AZ, United States
  • 2001–2003
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, MI, United States
  • 2000–2001
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States
  • 1996–1997
    • Rochester Institute of Technology
      • Department of Electrical and Microelectronic Engineering
      Rochester, NY, United States