Farzad Farnoud

Farzad Farnoud
  • PhD
  • Associate Professor at University of Virginia

About

104
Publications
7,067
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,250
Citations
Current institution
University of Virginia
Current position
  • Associate Professor
Additional affiliations
August 2016 - present
University of Virginia
Position
  • Professor (Assistant)
September 2006 - August 2008
University of Toronto
Position
  • Research Assistant
June 2013 - August 2016
California Institute of Technology
Position
  • PostDoc Position

Publications

Publications (104)
Article
Full-text available
We propose a new family of distance measures on rankings, derived through an axiomatic approach, that consider the nonuniform relevance of the top and bottom of ordered lists and similarities between candidates. The proposed distance functions include specialized weighted versions of the Kendall (tau ) distance and the Cayley distance, and are suit...
Article
Full-text available
It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a s...
Conference Paper
Full-text available
Broadcast communications is critically important in vehicular networks. Many safety applications need safety warning messages to be broadcast to all vehicles present in an area. Design of a medium access control (MAC) protocol for vehicular networks is an interesting problem because of challenges posed by broadcast traffic, high mobility, high reli...
Article
Full-text available
Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The...
Article
Full-text available
We consider rank modulation codes for flash memories that allow for handling arbitrary charge-drop errors. Unlike classical rank modulation codes used for correcting errors that manifest themselves as swaps of two adjacently ranked elements, the proposed \emph{translocation rank codes} account for more general forms of errors that arise in storage...
Preprint
Linear codes correcting one deletions have rate at most $1/2$. In this paper, we construct linear list decodable codes correcting edits with rate approaching $1$ and reasonable list size. Our encoder and decoder run in polynomial time.
Article
The substring edit error replaces a substring u of x with another string v , where the lengths of u and v are bounded by a given constant k . It encompasses localized insertions, deletions, and substitutions within a window. Codes correcting one substring edit have redundancy at least log n + k . In this paper, we construct codes correcting one sub...
Article
Localized errors, which occur in windows with bounded lengths, are common in a range of applications. Such errors can be modeled as k-substring edits , which replace one substring with another string, both with lengths upper bounded by k . This generalizes errors such as localized deletions or burst substitutions studied in the literature. In t...
Article
Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising technology for satisfying future storage needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current p...
Preprint
Dueling bandits are widely used to model preferential feedback that is prevalent in machine learning applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propo...
Article
The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most t deletions in non-binary sequences. When the alphabet size q is even, we first propose a non-binary code correcting a burs...
Preprint
The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most $t$ deletions in non-binary sequences. We first propose a non-binary code correcting a burst of at most 2 deletions for $q$-ary...
Article
Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more computationally efficient and are thus widely used in large scale storage systems. In this paper, we provide an information-theoretic analysis of the performance of dedupli...
Preprint
Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current p...
Article
Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting two types of errors, namely, short tandem duplicati...
Preprint
In heterogeneous rank aggregation problems, users often exhibit various accuracy levels when comparing pairs of items. Thus a uniform querying strategy over users may not be optimal. To address this issue, we propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons from users and imp...
Article
Full-text available
Antimicrobial susceptibility in Pseudomonas aeruginosa is dependent on a complex combination of host and pathogen-specific factors. Through the profiling of 971 clinical P. aeruginosa isolates from 590 patients and collection of paired patient metadata, we show that antimicrobial resistance is associated with not only patient-centric factors (e.g.,...
Conference Paper
The problem of correcting deletions has recently received significantly increased attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a burst of at most two deletions in non-binary sequences. The problem was first studied for binary sequences by Levenshtein, who presente...
Preprint
Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time efficient and are thus widely used in large scale storage systems. In this paper, we provide an information-theoretic analysis on the performance of deduplication algo...
Article
Because of its high data density and longevity, DNA is emerging as a promising candidate for satisfying increasing data storage needs. Compared to conventional storage media, however, data stored in DNA is subject to a wider range of errors resulting from various processes involved in the data storage pipeline. In this article, we consider correcti...
Preprint
Full-text available
Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting two types of errors, namely, short tandem duplicati...
Preprint
Full-text available
Because of its high data density and longevity, DNA is emerging as a promising candidate for satisfying increasing data storage needs. Compared to conventional storage media, however, data stored in DNA is subject to a wider range of errors resulting from various processes involved in the data storage pipeline. In this paper, we consider correcting...
Article
Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitut...
Preprint
Full-text available
A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writin...
Article
A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writin...
Article
We propose the Heterogeneous Thurstone Model (HTM) for aggregating ranked data, which can take the accuracy levels of different users into account. By allowing different noise distributions, the proposed HTM model maintains the generality of Thurstone's original framework, and as such, also extends the Bradley-Terry-Luce (BTL) model for pairwise co...
Preprint
We propose the Heterogeneous Thurstone Model (HTM) for aggregating ranked data, which can take the accuracy levels of different users into account. By allowing different noise distributions, the proposed HTM model maintains the generality of Thurstone's original framework, and as such, also extends the Bradley-Terry-Luce (BTL) model for pairwise co...
Preprint
Full-text available
Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitut...
Article
Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processe...
Article
We study random string-duplication systems, which we call Pólya string models. These are motivated by a class of mutations that are common in most organisms and lead to an abundance of repeated sequences in their genomes. Unlike previous works that study the combinatorial capacity of string-duplication systems, or in a probabilistic setting, variou...
Conference Paper
We study random string-duplication systems, which we call Pólya string models. These are motivated by a class of mutations that are common in most organisms and lead to an abundance of repeated sequences in their genomes. Unlike previous works that study the combinatorial capacity of string-duplication systems, or in a probabilistic setting, variou...
Article
In this paper, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related through the Hamming distance metric. Upper and lower bounds are derived on the minimum amount of information exchange. Furthermore, expli...
Article
Full-text available
Background Tandem repeat sequences are common in the genomes of many organisms and are known to cause important phenomena such as gene silencing and rapid morphological changes. Due to the presence of multiple copies of the same pattern in tandem repeats and their high variability, they contain a wealth of information about the mutations that have...
Preprint
Full-text available
Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processe...
Preprint
In this work, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related through the Hamming distance metric. Upper and lower bounds are derived on the minimum amount of information exchange. Furthermore, explic...
Preprint
We study random string-duplication systems, which we call P\'olya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact cap...
Article
We address the problem of computing distances between permutations that take into account similarities between elements of the ground set dictated by a graph. The problem may be summarized as follows: Given two permutations and a positive cost function on transpositions that depends on the similarity of the elements involved, find a smallest cost s...
Conference Paper
The majority of the human genome consists of repeated sequences. An important type of repeated sequences common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence AGTCGC, TGTG is a tandem repeat, that may be generated from AGTCTGC by a tandem duplication of length 2. In this work,...
Conference Paper
Duplication mutations play a critical role in the generation of biological sequences. Simultaneously, they have a deleterious effect on data stored using in-vivo DNA data storage. While duplications have been studied both as a sequence-generation mechanism and in the context of error correction, for simplicity these studies have not taken into acco...
Article
Duplication mutations play a critical role in the generation of biological sequences. Simultaneously, they have a deleterious effect on data stored using in-vivo DNA data storage. While duplications have been studied both as a sequence-generation mechanism and in the context of error correction, for simplicity these studies have not taken into acco...
Article
Full-text available
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their subs...
Preprint
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their subs...
Article
We consider the problem of approximate sorting of a data stream (in one pass) with limited internal storage where the goal is not to rearrange data but to output a permutation that reflects the ordering of the elements of the data stream as closely as possible. Our main objective is to study the relationship between the quality of the sorting and t...
Conference Paper
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form x = abc → y = abbc, where x and y are sequences and a, b, and c are their substrings, needed to generate a binary sequence of length n starting from a square-free sequ...
Conference Paper
We study the tandem duplication distance between binary sequences and their roots. This distance is motivated by genomic tandem duplication mutations and counts the smallest number of tandem duplication events that are required to take one sequence to another. We consider both exact and approximate tandem duplications, the latter leading to a combi...
Conference Paper
We study random string-duplication systems, called Pólya string models, motivated by certain random mutation processes in the genome of living organisms. Unlike previous works that study the combinatorial capacity of string-duplication systems, or peripheral properties such as symbol frequency, this work provides exact capacity or bounds on it, for...
Conference Paper
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected...
Article
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected...
Preprint
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected...
Article
Full-text available
Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1–10 GB. Th...
Article
Full-text available
The majority of the human genome consists of repeated sequences. An important type of repeats common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence AGTCTGTGC,TGTG is a tandem repeat, namely, generated from AGTCTGC by a tandem duplication of length 2. In this work, we investigat...
Conference Paper
Mutation processes such as point mutation, insertion, deletion, and duplication (including tandem and interspersed duplication) have an important role in evolution, as they lead to genomic diversity, and thus to phenotypic variation. In this work, we study the expressive power of interspersed duplication, i.e., its ability to generate diversity, vi...
Conference Paper
In this work, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related. In this introductory work, the elements within the symmetric difference are related through the Hamming distance metric. Upper and lower...
Conference Paper
Full-text available
We consider the problem of approximate sorting of a data stream (in one pass) with limited internal storage where the goal is not to rearrange data but to output a permutation that reflects the ordering of the elements of the data stream as closely as possible. Our main ob-jective is to study the relationship between the quality of the sorting and...
Conference Paper
Full-text available
We study the rate-distortion relationship in the set of permutations endowed with the Kendall t-metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case distortion analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For...
Conference Paper
Full-text available
We present a multiset rank modulation scheme capable of correcting translocation errors, motivated by the fact that compared to permutation codes, multipermutation codes offer higher rates and longer block lengths. We show that the appropriate distance measure for code construction is the Ulam metric applied to equivalence classes of permutations,...
Conference Paper
Full-text available
Motivated by the rank modulation scheme for flash memories, we consider an information representation system with relative values (permutations) and study codes for correcting deletions. In contrast to the case of a deletion in a regular (with absolute values) representation system, a deletion in this new paradigm results in a new permutation over...
Conference Paper
Full-text available
Error-correcting codes for permutations have received a considerable attention in the past few years, especially in applications of the rank modulation scheme for flash memories. While several metrics have been studied like the Kendall's τ, Ulam, and Hamming distances, no recent research has been carried for erasures and deletions over permutations...
Article
Full-text available
We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kend...
Article
Full-text available
We address the problem of multipermutation code design in the Ulam metric for novel storage applications. Multipermutation codes are suitable for flash memory where cell charges may share the same rank. Changes in the charges of cells manifest themselves as errors whose effects on the retrieved signal may be measured via the Ulam distance. As part...
Conference Paper
Full-text available
Gene prioritization is a class of methods for discovering genes implicated in the onset and progression of a disease. As candidate genes are ranked based on similarity to known disease genes according to different set of criteria, the overall aggregation of these ranked datasets is a vital step of the prioritization procedure. Aggregation of differ...
Article
Full-text available
We introduce a parallel algorithmic architecture for metagenomic sequence assembly, termed MetaPar, which allows for significant reductions in assembly time and consequently enables the processing of large genomic datasets on computers with low memory usage. The gist of the approach is to iteratively perform read (re)classification based on phyloge...
Conference Paper
Full-text available
We consider the problem of rank aggregation, where the goal is to assemble ordered lists into one consensus order. Our contributions consist of proposing a new family of distance measures that allow for incorporating practical ranking constraints into the aggregation problem formulation; showing how such distance measures arise from a generalizatio...
Article
Full-text available
We address the problem of computing distances between rankings that take into account similarities between candidates. The need for evaluating such distances is governed by applications as diverse as rank aggregation, bioinformatics, social sciences and data storage. The problem may be summarized as follows. Given two rankings and a positive cost f...
Conference Paper
Full-text available
In networked systems comprised of many agents, it is often required to reach a common operating point of all agents, termed the network consensus. We consider two iterative methods for reaching a ranking (ordering) consensus over a voter network, where the initial preference of every voter is of the form of a full ranking of candidates. The voters...
Conference Paper
Full-text available
We propose a new family of algorithms for bounding/approximating the optimal solution of rank aggregation problems based on weighted Kendall distances. The algorithms represent linear programming relaxations of integer programs that involve variables reflecting partial orders of three or more candidates. Our simulation results indicate that the lin...
Article
Full-text available
We present a general model for opinion dynamics in a social network together with several possibilities for object selections at times when the agents are communicating. We study the limiting behavior of such a dynamics and show that this dynamics almost surely converges. We consider some special implications of the convergence result for gossip an...
Article
Full-text available
We consider a classical problem in choice theory -- vote aggregation -- using novel distance measures between permutations that arise in several practical applications. The distance measures are derived through an axiomatic approach, taking into account various issues arising in voting with side constraints. The side constraints of interest include...
Conference Paper
Full-text available
We consider rank modulation codes for flash memories that allow for handling arbitrary charge drop errors. Unlike classical rank modulation codes used for correcting errors that manifest themselves as swaps of two adjacently ranked elements, the proposed translocation codes account for more general forms of errors that arise in storage systems. Tra...
Article
Full-text available
We consider the problem of non-uniform vote aggregation, and in particular, the algorithmic aspects associated with the aggregation process. For a novel class of weighted distance measures on votes, we present two different aggregation methods. The first algorithm is based on approximating the weighted distance measure by Spearman's footrule distan...
Article
Full-text available
We consider the problem of rank aggregation based on new distance measures derived through axiomatic approaches and based on score-based methods. In the first scenario, we derive novel distance measures that allow for discriminating between the ranking process of highest and lowest ranked elements in the list. These distance functions represent wei...
Article
Full-text available
We consider a class of small-sample distribution estimators over noisy channels. Our estimators are designed for repetition channels, and rely on properties of the runs of the observed sequences. These runs are modeled via a special type of Markov chains, termed alternating Markov chains. We show that alternating chains have redundancy that scales...
Article
Full-text available
We consider the problem of finding the minimum cost transposition decomposition of a permutation. In this frame- work, arbitrary non-negative costs are assigned to individual transpositions and the task at hand is to devise polynomial-time, constant-approximation decomposition algorithms. We describe a polynomial-time algorithm based on specialized...
Conference Paper
Full-text available
We address the problem of finding the minimum decomposition of a permutation in terms of transpositions with non-uniform cost. For metric-path costs, we describe exact polynomial-time decomposition algorithms. For extended-metric-path cost functions, we describe polynomial-time constant-approximation decomposition algorithms. Our algorithms rely on...
Article
Full-text available
We address the problem of finding the minimum decomposition of a permutation in terms of transpositions with non-uniform cost. For arbitrary non-negative cost functions, we describe polynomial-time, constant-approximation decomposition algorithms. For metric-path costs, we describe exact polynomial-time decomposition algorithms. Our algorithms repr...
Article
Full-text available
The ¿Japanese¿ theorem is extended to multiple multicast sessions in an arbitrary network to characterize the routing capacity region by the intersection of an infinite collection of halfspaces. An elimination technique is developed to simplify this infinite description into a finite one based upon the shortest routing paths and trees in the netw...

Network

Cited By