Olgica Milenkovic

University of Illinois, Urbana-Champaign, Urbana, Illinois, United States


Publications (159)

  • Source
    Minji Kim · Xiejia Zhang · Jonathan G. Ligo · [...] · Olgica Milenkovic
    ABSTRACT: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and in human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1–10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. To overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA- and FASTQ-format metagenomic read processing and lossless compression. MetaCRAM integrates algorithms for taxonomy identification and assembly, introduces parallel execution methods, and enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, showing 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2–13 percent of the original raw metagenomic file sizes. We describe the first architecture for reference-based, lossless compression of metagenomic data. The proposed compression scheme offers significantly improved compression ratios compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and provides the user with taxonomic and assembly information generated during execution of the compression pipeline. The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM; in addition, a VirtualBox image configured for a 4 GB RAM machine is provided so that users can run a simple demonstration.
    Full-text Article · Dec 2016 · BMC Bioinformatics
  • Ryan Gabrys · Eitan Yaakobi · Olgica Milenkovic
    Conference Paper · Jul 2016
  • S. M. Hossein Tabatabaei Yazdi · Han Mao Kiah · Olgica Milenkovic
    Conference Paper · Jul 2016
  • Son Hoang Dau · Olgica Milenkovic
    Conference Paper · Jul 2016
  • Ryan Gabrys · Olgica Milenkovic
    Conference Paper · Jul 2016
  • Vida Ravanmehr · Sadegh Bolouki · Gregory J. Puleo · Olgica Milenkovic
    ABSTRACT: Threshold graphs are recursive deterministic network models that capture properties of certain social and economic interactions. One drawback of these graph families is that they have highly constrained generative attachment rules. To mitigate this problem, we introduce a new class of graphs termed Doubly Threshold (DT) graphs, which may be succinctly described through vertex weights that govern the existence of edges via two inequalities. One inequality imposes the constraint that the sum of weights of adjacent vertices has to exceed a specified threshold. The second inequality ensures that adjacent vertices have a bounded difference of their weights. We provide a succinct characterization and decomposition of DT graphs and analyze their forbidden induced subgraphs, which we compare to those of known social networks. We also present a method for performing vertex weight assignments on DT graphs that satisfy the defining constraints.
    Article · Mar 2016
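The two DT inequalities above are simple to state in code. A minimal sketch follows; the weight list, threshold T, difference bound D, and the choice of strict versus non-strict inequalities are all illustrative assumptions, not values from the paper:

```python
# Sketch: build the edge set of a Doubly Threshold (DT) graph from vertex
# weights. An edge (u, v) exists iff the weight sum exceeds T and the
# weight difference is bounded by D.
from itertools import combinations

def dt_graph_edges(weights, T, D):
    edges = []
    for u, v in combinations(range(len(weights)), 2):
        if weights[u] + weights[v] > T and abs(weights[u] - weights[v]) <= D:
            edges.append((u, v))
    return edges

print(dt_graph_edges([1, 3, 5, 8], T=6, D=4))  # [(1, 2), (2, 3)]
```

Note how the difference bound D excludes the pair (1, 8) even though its sum clears the threshold; a plain threshold graph would keep that edge.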
  • Source
    ABSTRACT: We introduce the notion of weakly mutually uncorrelated (WMU) sequences, motivated by applications in DNA-based storage systems and synchronization protocols. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. In addition, WMU sequences used in DNA-based storage systems are required to have balanced compositions of symbols and to be at large mutual Hamming distance from each other. We present a number of constructions for balanced, error-correcting WMU codes using Dyck paths, Knuth's balancing principle, prefix synchronized and cyclic codes.
    Full-text Article · Jan 2016
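The defining WMU property can be checked directly by brute force. A minimal sketch, where the overlap cutoff k ("sufficiently long") and the restriction to proper overlaps are illustrative assumptions:

```python
# Sketch: verify weak mutual uncorrelatedness for a set of sequences:
# no (proper) suffix of length >= k of any sequence may equal a prefix
# of the same or another sequence.
def is_wmu(seqs, k):
    for s in seqs:
        for t in seqs:
            # proper overlaps only: strictly shorter than both strings
            for ell in range(k, min(len(s), len(t))):
                if s[-ell:] == t[:ell]:
                    return False
    return True

print(is_wmu(["ACGT", "TTAC"], 2))  # False: suffix "AC" of TTAC is a prefix of ACGT
print(is_wmu(["ACGT", "TTGG"], 2))  # True
```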
  • Source
    Ryan Gabrys · Eitan Yaakobi · Olgica Milenkovic
    ABSTRACT: We introduce the new problem of code design in the Damerau metric. The Damerau metric is a generalization of the Levenshtein distance which also allows for adjacent transposition edits. We first provide constructions for codes that may correct either a single deletion or a single adjacent transposition and then proceed to extend these results to codes that can simultaneously correct a single deletion and multiple adjacent transpositions.
    Full-text Article · Jan 2016
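For intuition on the metric in question, here is the textbook dynamic program for the Damerau-Levenshtein distance (restricted, optimal-string-alignment variant), which charges unit cost for insertions, deletions, substitutions, and adjacent transpositions; this is a standard sketch, not code from the paper:

```python
# Sketch: Damerau-Levenshtein distance (optimal string alignment variant).
def damerau_distance(a, b):
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            # adjacent transposition
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]

print(damerau_distance("ACGT", "ACTG"))  # 1: a single adjacent transposition
print(damerau_distance("kitten", "sitting"))  # 3
```

Note that the plain Levenshtein distance between "ACGT" and "ACTG" would be 2; allowing the transposition edit is precisely what the Damerau metric adds.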
  • Ryan Gabrys · Olgica Milenkovic
    ABSTRACT: Motivated by charge balancing constraints for rank modulation schemes, we introduce the notion of balanced permutations and derive the capacity of balanced permutation codes. We also describe simple interleaving methods for permutation code constructions and show that they approach capacity.
    Article · Jan 2016
  • Jack P. Hou · Amin Emad · Gregory J. Puleo · [...] · Olgica Milenkovic
    ABSTRACT: Motivation: Cancer genomes exhibit a large number of different alterations that affect many genes in a diverse manner. An improved understanding of the generative mechanisms behind the mutation rules and their influence on gene community behavior is of great importance for the study of cancer. Results: To expand our capability to analyze combinatorial patterns of cancer alterations, we developed a rigorous methodology for cancer mutation pattern discovery based on a new, constrained form of correlation clustering. Our new algorithm, named C3 (Cancer Correlation Clustering), leverages mutual exclusivity of mutations, patient coverage, and driver network concentration principles. To test C3, we performed a detailed analysis on TCGA breast cancer and glioblastoma data and showed that our algorithm outperforms the state-of-the-art CoMEt method in terms of discovering mutually exclusive gene modules and identifying biologically relevant driver genes. The proposed agnostic clustering method represents a unique tool for efficient and reliable identification of mutation patterns and driver pathways in large-scale cancer genomics studies, and it may also be used for other clustering problems on biological graphs. Availability: The source code for the C3 method can be found at https://github.com/jackhou2/C3. Contact: jianma@cs.cmu.edu (J.M.) and milenkov@illinois.edu (O.M.). Supplementary information: Supplementary data are available at Bioinformatics online.
    Article · Jan 2016 · Bioinformatics
  • Son Hoang Dau · Olgica Milenkovic
    ABSTRACT: We propose a new latent Boolean feature model for complex networks that captures different types of node interactions and network communities. The model is based on a new concept in graph theory, termed the co-intersection representation of a graph, which generalizes the notion of an intersection representation. We describe how to use co-intersection representations to deduce node feature sets and their communities, and proceed to derive several general bounds on the minimum number of features used in co-intersection representations. We also discuss graph families for which exact co-intersection characterizations are possible.
    Article · Jan 2016
  • ABSTRACT: We consider the problem of synchronizing coded data in distributed storage networks undergoing insertion and deletion edits. We present modifications of distributed storage codes that allow updates in the parity-check values to be performed with one round of communication at low bit rates and with small storage overhead. Our main contributions are novel protocols for synchronizing frequently updated and semi-static data based on functional intermediary coding involving permutation and Vandermonde matrices.
    Article · Dec 2015 · IEEE/ACM Transactions on Networking
  • Source
    S. M. Hossein Tabatabaei Yazdi · Han Mao Kiah · Eva Ruiz Garcia · [...] · Olgica Milenkovic
    ABSTRACT: We provide an overview of current approaches to DNA-based storage system design and accompanying synthesis, sequencing and editing methods. We also introduce and analyze a suite of new constrained coding schemes for both archival and random access DNA storage channels. The mathematical basis of our work is the construction and design of sequences over discrete alphabets that avoid pre-specified address patterns, have balanced base content, and exhibit other relevant substring constraints. These schemes adapt the stored signals to the DNA medium and thereby reduce the inherent error-rate of the system.
    Full-text Article · Jul 2015
  • Gregory J. Puleo · Olgica Milenkovic
    ABSTRACT: We introduce a new agnostic clustering method: minimax correlation clustering. Given a graph whose edges are labeled with $+$ or $-$, we wish to partition the graph into clusters while trying to avoid errors: $+$ edges between clusters or $-$ edges within clusters. Unlike classical correlation clustering, which seeks to minimize the total number of errors, minimax clustering instead seeks to minimize the number of errors at the worst vertex, that is, at the vertex with the greatest number of incident errors. This minimax objective function may be seen as a way to enforce individual-level quality of partition constraints for vertices in a graph. We study this problem on complete graphs and complete bipartite graphs, proving that the problem is NP-hard on these graph classes and giving polynomial-time constant-factor approximation algorithms. The approximation algorithms rely on LP relaxation and rounding procedures.
    Article · Jun 2015
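The minimax objective above is easy to evaluate for a fixed partition, even though optimizing it is NP-hard. A minimal sketch; the signed graph and the two candidate partitions are made-up toy data:

```python
# Sketch: evaluate the minimax correlation-clustering objective of a
# partition: count, at each vertex, its incident errors ("+" edges across
# clusters, "-" edges inside a cluster) and return the worst-vertex count.
def minimax_errors(n, signed_edges, cluster_of):
    errs = [0] * n
    for u, v, sign in signed_edges:
        same = cluster_of[u] == cluster_of[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            errs[u] += 1
            errs[v] += 1
    return max(errs)

edges = [(0, 1, '+'), (1, 2, '-'), (2, 3, '+'), (0, 3, '-')]
print(minimax_errors(4, edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # 0: all signs respected
print(minimax_errors(4, edges, {0: 0, 1: 0, 2: 0, 3: 0}))  # 1: both "-" edges fall inside the single cluster
```

Classical correlation clustering would sum the entries of errs instead of taking the maximum; the minimax variant bounds the damage at any one vertex.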
  • Ryan Gabrys · Han Mao Kiah · Olgica Milenkovic
    ABSTRACT: We continue our study of a new family of asymmetric Lee codes that arise in the design and implementation of emerging DNA-based storage systems and systems which use parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes, and have symbol distances dictated by their underlying binary representation. Our contributions include deriving new bounds for the size of the largest code in this metric based on Delsarte-like linear programming methods and describing new constructions for non-linear asymmetric Lee codes.
    Article · Jun 2015
  • Ryan Gabrys · Han Mao Kiah · Olgica Milenkovic
    ABSTRACT: We consider a new family of codes, termed asymmetric Lee distance codes, that arise in the design and implementation of DNA-based storage systems and systems with parallel string transmission protocols. The codewords are defined over a quaternary alphabet, although the results carry over to other alphabet sizes; furthermore, symbol confusability is dictated by their underlying binary representation. Our contributions are two-fold. First, we demonstrate that the new distance represents a linear combination of the Lee and Hamming distance and derive upper bounds on the size of the codes under this metric based on linear programming techniques. Second, we propose a number of code constructions which imply lower bounds.
    Article · Jun 2015
  • Conference Paper · Jun 2015
  • Source
    S. M. Hossein Tabatabaei Yazdi · Yongbo Yuan · Jian Ma · [...] · Olgica Milenkovic
    ABSTRACT: We describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on new constrained coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile medium suitable for both ultrahigh density archival and rewritable storage applications.
    Full-text Article · May 2015 · Scientific Reports
  • Han Mao Kiah · Gregory J. Puleo · Olgica Milenkovic
    ABSTRACT: We consider the problem of storing and retrieving information from synthetic DNA media. The mathematical basis of the problem is the construction and design of sequences that may be discriminated based on their collection of substrings observed through a noisy channel. This problem of reconstructing sequences from traces was first investigated in the noiseless setting under the name of "Markov type" analysis. Here, we explain the connection between the reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes.
    Article · Feb 2015 · IEEE Transactions on Information Theory
  • Source
    Minji Kim · Farzad Farnoud · Olgica Milenkovic
    ABSTRACT: Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold: a) first, based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. Within this context, we quantify the influence of the number of training genes and similarity criteria on the diagnostic quality of the aggregate and perform in-depth cross-validation studies; b) second, we propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-vs-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list, rather than at the bottom, since only top candidates are tested experimentally; c) third, we propose an iterative procedure for gene discovery that operates via successive augmentation of the set of training genes by genes discovered in previous rounds, checked for consistency. Fundamental results from social choice theory, political and computer sciences, and statistics have shown that there exists no consistent, fair and unique way to aggregate rankings. Instead, one has to decide on an aggregation approach using a predefined set of desirable properties for the aggregate. The aggregation methods fall into two categories, score-based and distance-based approaches, each of which has its own drawbacks and advantages. This work is motivated by the observation that, by merging these two techniques in a computationally efficient manner and incorporating additional constraints, one can ensure that the predictive quality of the resulting aggregation algorithm is very high. We tested HyDRA on a number of gene sets, including Autism, Breast cancer, Colorectal cancer, Endometriosis, Ischemic stroke, Leukemia, Lymphoma, and Osteoarthritis. Furthermore, we performed iterative gene discovery for Glioblastoma, Meningioma and Breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. The methods outperform state-of-the-art software tools such as ToppGene and Endeavour. Despite this finding, we recommend as best practice to take the union of top-ranked items produced by different methods for the final aggregated list. Availability: The HyDRA software may be downloaded from https://dl.dropboxusercontent.com/u/40200227/HyDRAsoftware.zip. Contact: mkim158@illinois.edu.
    Full-text Article · Nov 2014 · Bioinformatics

Publication Stats

3k Citations


  • 2010-2013
    • University of Illinois, Urbana-Champaign
      • Department of Electrical and Computer Engineering
      Urbana, Illinois, United States
  • 2007
    • University of California, San Diego
      San Diego, California, United States
  • 2006
    • University of Colorado
      Denver, Colorado, United States
  • 2003-2005
    • University of Colorado at Boulder
      • Department of Electrical, Computer, and Energy Engineering (ECEE)
      Boulder, Colorado, United States
  • 2001-2002
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, Michigan, United States
  • 1997
    • Kodak
      Rochester, New York, United States