
Farzad Farnoud
- PhD
- Associate Professor at University of Virginia
About
104 Publications
7,067 Reads
1,250 Citations
Introduction
Current institution
Additional affiliations
August 2016 - present
September 2006 - August 2008
June 2013 - August 2016
Publications (104)
We propose a new family of distance measures on rankings, derived through an axiomatic approach, that consider the nonuniform relevance of the top and bottom of ordered lists and similarities between candidates. The proposed distance functions include specialized weighted versions of the Kendall τ distance and the Cayley distance, and are suit...
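As a point of reference for the distances named above, the following is a minimal sketch of the classical, unweighted Kendall τ and Cayley distances between two rankings. It does not implement the weighted versions proposed in the paper, whose definitions are cut off in this summary; the function names are illustrative.

```python
from itertools import combinations


def kendall_tau_distance(sigma, pi):
    """Count the item pairs that the two rankings order differently."""
    pos_sigma = {item: rank for rank, item in enumerate(sigma)}
    pos_pi = {item: rank for rank, item in enumerate(pi)}
    return sum(
        1
        for a, b in combinations(sigma, 2)
        if (pos_sigma[a] - pos_sigma[b]) * (pos_pi[a] - pos_pi[b]) < 0
    )


def cayley_distance(sigma, pi):
    """Minimum number of arbitrary transpositions turning sigma into pi.

    Equals n minus the number of cycles of the permutation that maps
    positions in sigma to positions in pi.
    """
    pos_pi = {item: rank for rank, item in enumerate(pi)}
    perm = [pos_pi[item] for item in sigma]
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return len(perm) - cycles
```

For example, reversing a ranking of four candidates changes the order of all six pairs, but can be undone with only two transpositions.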
It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a s...
Broadcast communication is critically important in vehicular networks. Many safety applications need safety warning messages to be broadcast to all vehicles present in an area. Design of a medium access control (MAC) protocol for vehicular networks is an interesting problem because of challenges posed by broadcast traffic, high mobility, high reli...
Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The...
We consider rank modulation codes for flash memories that allow for handling arbitrary charge-drop errors. Unlike classical rank modulation codes used for correcting errors that manifest themselves as swaps of two adjacently ranked elements, the proposed translocation rank codes account for more general forms of errors that arise in storage...
Linear codes correcting one deletion have rate at most $1/2$. In this paper, we construct linear list decodable codes correcting edits with rate approaching $1$ and reasonable list size. Our encoder and decoder run in polynomial time.
The substring edit error replaces a substring u of x with another string v, where the lengths of u and v are bounded by a given constant k. It encompasses localized insertions, deletions, and substitutions within a window. Codes correcting one substring edit have redundancy at least log n + k. In this paper, we construct codes correcting one sub...
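The error model described above can be illustrated with a short sketch. It only applies a single substring edit to a string and checks the length bound k; it is not the code construction from the paper, and the function and parameter names are hypothetical.

```python
def substring_edit(x, start, u_len, v, k):
    """Replace u = x[start:start + u_len] with v, where |u| <= k and |v| <= k."""
    if start < 0 or u_len < 0 or start + u_len > len(x):
        raise ValueError("substring u is out of range")
    if u_len > k or len(v) > k:
        raise ValueError("both u and v must have length at most k")
    return x[:start] + v + x[start + u_len:]


x = "0110100"
deleted = substring_edit(x, 2, 2, "", k=3)         # localized deletion of "10"
inserted = substring_edit(x, 2, 0, "11", k=3)      # localized insertion of "11"
substituted = substring_edit(x, 2, 3, "000", k=3)  # burst substitution of "101"
```

Choosing an empty v gives a localized deletion, an empty u gives a localized insertion, and equal lengths give a burst substitution, which is how the model covers all three cases within one window.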
Localized errors, which occur in windows with bounded lengths, are common in a range of applications. Such errors can be modeled as k-substring edits, which replace one substring with another string, both with lengths upper bounded by k. This generalizes errors such as localized deletions or burst substitutions studied in the literature. In t...
Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising technology for satisfying future storage needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current p...
Yue Wu, Tao Jin, Hao Lou, [...], Quanquan Gu
Dueling bandits are widely used to model preferential feedback that is prevalent in machine learning applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propo...
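To make the Borda score concrete, here is a small sketch that computes it from a known pairwise preference matrix, using the usual definition: the Borda score of an item is its average probability of beating a uniformly chosen other item. The matrix values are made up for illustration, and the sketch does not implement the regret-minimization algorithm from the paper.

```python
def borda_scores(P):
    """Borda score of item i: mean of P[i][j] over all j != i.

    P[i][j] is assumed to be the probability that item i is preferred to item j.
    """
    K = len(P)
    return [
        sum(P[i][j] for j in range(K) if j != i) / (K - 1)
        for i in range(K)
    ]


# Hypothetical preference matrix for three items.
P = [
    [0.5, 0.6, 0.9],
    [0.4, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]
scores = borda_scores(P)
best_item = max(range(len(scores)), key=scores.__getitem__)
```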
The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most t deletions in non-binary sequences. When the alphabet size q is even, we first propose a non-binary code correcting a burs...
The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most $t$ deletions in non-binary sequences. We first propose a non-binary code correcting a burst of at most 2 deletions for $q$-ary...
Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more computationally efficient and are thus widely used in large scale storage systems. In this paper, we provide an information-theoretic analysis of the performance of dedupli...
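The idea of identifying and removing repeats can be sketched as follows. This is a toy illustration that assumes fixed-length chunking, which the summary above does not specify; the deduplication schemes analyzed in the paper may differ, and all names here are hypothetical.

```python
def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-length chunks and store each distinct chunk once.

    Returns the list of unique chunks and, for every chunk of the input,
    a pointer (index) into that list.
    """
    store = []      # unique chunks, in order of first appearance
    index = {}      # chunk -> position in store
    pointers = []   # one entry per input chunk
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        if chunk not in index:
            index[chunk] = len(store)
            store.append(chunk)
        pointers.append(index[chunk])
    return store, pointers


def reconstruct(store, pointers):
    """Rebuild the original data from the unique chunks and pointers."""
    return b"".join(store[p] for p in pointers)


data = b"abcdabcdxyzwabcd"
store, pointers = deduplicate(data)
```

In this example the four input chunks are represented by two stored chunks plus four small pointers, and the original stream is recovered exactly from them.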
Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current p...
Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting two types of errors, namely, short tandem duplicati...
Yue Wu, Tao Jin, Hao Lou, [...], Quanquan Gu
In heterogeneous rank aggregation problems, users often exhibit various accuracy levels when comparing pairs of items. Thus a uniform querying strategy over users may not be optimal. To address this issue, we propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons from users and imp...
Antimicrobial susceptibility in Pseudomonas aeruginosa is dependent on a complex combination of host and pathogen-specific factors. Through the profiling of 971 clinical P. aeruginosa isolates from 590 patients and collection of paired patient metadata, we show that antimicrobial resistance is associated with not only patient-centric factors (e.g.,...
The problem of correcting deletions has recently received significantly increased attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a burst of at most two deletions in non-binary sequences. The problem was first studied for binary sequences by Levenshtein, who presente...
Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time efficient and are thus widely used in large scale storage systems. In this paper, we provide an information-theoretic analysis on the performance of deduplication algo...
Because of its high data density and longevity, DNA is emerging as a promising candidate for satisfying increasing data storage needs. Compared to conventional storage media, however, data stored in DNA is subject to a wider range of errors resulting from various processes involved in the data storage pipeline. In this article, we consider correcti...
Because of its high data density and longevity, DNA is emerging as a promising candidate for satisfying increasing data storage needs. Compared to conventional storage media, however, data stored in DNA is subject to a wider range of errors resulting from various processes involved in the data storage pipeline. In this paper, we consider correcting...
Motivated by mutation processes occurring in in-vivo DNA-storage applications, a channel that mutates stored strings by duplicating substrings as well as substituting symbols is studied. Two models of such a channel are considered: one in which the substitutions occur only within the duplicated substrings, and one in which the location of substitut...
A method for encoding information in DNA sequences is described. The method is based on the precision-resolution framework and is intended to work in conjunction with a recently suggested terminator-free, template-independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writin...
We propose the Heterogeneous Thurstone Model (HTM) for aggregating ranked data, which can take the accuracy levels of different users into account. By allowing different noise distributions, the proposed HTM model maintains the generality of Thurstone's original framework, and as such, also extends the Bradley-Terry-Luce (BTL) model for pairwise co...
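For context, the standard Bradley-Terry-Luce pairwise comparison probability mentioned above can be written in a few lines. This sketch assumes the common log-score parametrization, in which item i beats item j with probability exp(s_i) / (exp(s_i) + exp(s_j)); it does not model user-dependent accuracy and is not the HTM itself.

```python
import math


def btl_win_probability(score_i, score_j):
    """Probability that item i is preferred to item j under the BTL model.

    Scores are log-strengths, so the probability is a logistic function
    of the score difference.
    """
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


p_equal = btl_win_probability(1.0, 1.0)
p_stronger = btl_win_probability(2.0, 0.0)
```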
Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processe...
We study random string-duplication systems, which we call Pólya string models. These are motivated by a class of mutations that are common in most organisms and lead to an abundance of repeated sequences in their genomes. Unlike previous works that study the combinatorial capacity of string-duplication systems, or in a probabilistic setting, variou...
In this paper, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related through the Hamming distance metric. Upper and lower bounds are derived on the minimum amount of information exchange. Furthermore, expli...
Background: Tandem repeat sequences are common in the genomes of many organisms and are known to cause important phenomena such as gene silencing and rapid morphological changes. Due to the presence of multiple copies of the same pattern in tandem repeats and their high variability, they contain a wealth of information about the mutations that have...
In this work, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related through the Hamming distance metric. Upper and lower bounds are derived on the minimum amount of information exchange. Furthermore, explic...
We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact cap...
We address the problem of computing distances between permutations that take into account similarities between elements of the ground set dictated by a graph. The problem may be summarized as follows: Given two permutations and a positive cost function on transpositions that depends on the similarity of the elements involved, find a smallest cost s...
The majority of the human genome consists of repeated sequences. An important type of repeated sequence common in the human genome is the tandem repeat, where identical copies appear next to each other. For example, in the sequence AGTCTGTGC, TGTG is a tandem repeat that may be generated from AGTCTGC by a tandem duplication of length 2. In this work,...
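The tandem duplication in the example above (a duplication of length 2 applied to AGTCTGC) can be reproduced with a short sketch. The function name and argument order are illustrative choices, not taken from the paper.

```python
def tandem_duplicate(s, start, length):
    """Insert a copy of s[start:start + length] immediately after itself."""
    if length <= 0 or start < 0 or start + length > len(s):
        raise ValueError("duplicated block is out of range")
    block = s[start:start + length]
    return s[:start + length] + block + s[start + length:]


# Duplicating the block "TG" at positions 4-5 of AGTCTGC.
result = tandem_duplicate("AGTCTGC", 4, 2)
```

Here the block TG is copied next to itself, producing AGTCTGTGC, which contains the tandem repeat TGTG.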
Duplication mutations play a critical role in the generation of biological sequences. Simultaneously, they have a deleterious effect on data stored using in-vivo DNA data storage. While duplications have been studied both as a sequence-generation mechanism and in the context of error correction, for simplicity these studies have not taken into acco...
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $x = abc \to y = abbc$, where $x$ and $y$ are sequences and $a$, $b$, and $c$ are their subs...
We consider the problem of approximate sorting of a data stream (in one pass) with limited internal storage where the goal is not to rearrange data but to output a permutation that reflects the ordering of the elements of the data stream as closely as possible. Our main objective is to study the relationship between the quality of the sorting and t...
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form x = abc → y = abbc, where x and y are sequences and a, b, and c are their substrings, needed to generate a binary sequence of length n starting from a square-free sequ...
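The notion of a root can be illustrated by running the operation x = abc → y = abbc in reverse until no square bb remains. The sketch below is a simple greedy procedure under that reading: the number of steps it reports is only an upper bound on the tandem duplication distance to the root, not the minimum studied in the paper, and the helper names are hypothetical.

```python
def find_square(s):
    """Return (start, length) of some substring of the form bb, or None."""
    n = len(s)
    for length in range(1, n // 2 + 1):
        for start in range(0, n - 2 * length + 1):
            if s[start:start + length] == s[start + length:start + 2 * length]:
                return start, length
    return None


def deduplicate_to_root(s):
    """Repeatedly replace a square bb by b until the string is square-free.

    Returns the square-free string and the number of deduplication steps used.
    """
    steps = 0
    while (square := find_square(s)) is not None:
        start, length = square
        s = s[:start + length] + s[start + 2 * length:]
        steps += 1
    return s, steps


root_a, steps_a = deduplicate_to_root("0011")
root_b, steps_b = deduplicate_to_root("010101")
```

Both example strings reduce to the square-free binary string 01.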
We study the tandem duplication distance between binary sequences and their roots. This distance is motivated by genomic tandem duplication mutations and counts the smallest number of tandem duplication events that are required to take one sequence to another. We consider both exact and approximate tandem duplications, the latter leading to a combi...
We study random string-duplication systems, called Pólya string models, motivated by certain random mutation processes in the genome of living organisms. Unlike previous works that study the combinatorial capacity of string-duplication systems, or peripheral properties such as symbol frequency, this work provides exact capacity or bounds on it, for...
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected...
Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1–10 GB. Th...
The majority of the human genome consists of repeated sequences. An important type of repeats common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence AGTCTGTGC, TGTG is a tandem repeat, namely, generated from AGTCTGC by a tandem duplication of length 2. In this work, we investigat...
Mutation processes such as point mutation, insertion, deletion, and duplication (including tandem and interspersed duplication) have an important role in evolution, as they lead to genomic diversity, and thus to phenotypic variation. In this work, we study the expressive power of interspersed duplication, i.e., its ability to generate diversity, vi...
In this work, we consider the problem of synchronizing two sets of data where the size of the symmetric difference between the sets is small and, in addition, the elements in the symmetric difference are related. In this introductory work, the elements within the symmetric difference are related through the Hamming distance metric. Upper and lower...
We consider the problem of approximate sorting of a data stream (in one pass) with limited internal storage where the goal is not to rearrange data but to output a permutation that reflects the ordering of the elements of the data stream as closely as possible. Our main objective is to study the relationship between the quality of the sorting and...
We study the rate-distortion relationship in the set of permutations endowed with the Kendall τ-metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case distortion analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For...
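As a brief illustration of the second metric named above, the Chebyshev distance between two permutations is the largest displacement of any single entry. The sketch assumes permutations are given as equal-length lists of ranks; it says nothing about the rate-distortion results themselves.

```python
def chebyshev_distance(sigma, pi):
    """Maximum over positions i of |sigma[i] - pi[i]|."""
    if sorted(sigma) != sorted(pi):
        raise ValueError("inputs must be permutations of the same set")
    return max((abs(a - b) for a, b in zip(sigma, pi)), default=0)


worst_case = chebyshev_distance([1, 2, 3, 4], [4, 3, 2, 1])
adjacent_swap = chebyshev_distance([1, 2, 3], [2, 1, 3])
```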
We present a multiset rank modulation scheme capable of correcting translocation errors, motivated by the fact that compared to permutation codes, multipermutation codes offer higher rates and longer block lengths. We show that the appropriate distance measure for code construction is the Ulam metric applied to equivalence classes of permutations,...
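For ordinary permutations, the Ulam distance equals the length n minus the length of the longest common subsequence, which is also the minimum number of translocations (moving one element to a new position) needed to turn one permutation into the other. The sketch below computes it through a longest increasing subsequence; it covers permutations only, not the multipermutation equivalence classes considered in the paper.

```python
from bisect import bisect_left


def ulam_distance(sigma, pi):
    """n minus the length of the longest common subsequence of sigma and pi."""
    pos_in_pi = {item: rank for rank, item in enumerate(pi)}
    # Relabel sigma by positions in pi; common subsequences become increasing runs.
    seq = [pos_in_pi[item] for item in sigma]
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for value in seq:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(seq) - len(tails)


# Moving the element 1 from the front to the back is a single translocation.
one_move = ulam_distance([1, 2, 3, 4, 5], [2, 3, 4, 5, 1])
```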
Motivated by the rank modulation scheme for flash memories, we consider an information representation system with relative values (permutations) and study codes for correcting deletions. In contrast to the case of a deletion in a regular (with absolute values) representation system, a deletion in this new paradigm results in a new permutation over...
Error-correcting codes for permutations have received considerable attention in the past few years, especially in applications of the rank modulation scheme for flash memories. While several metrics have been studied, such as the Kendall τ, Ulam, and Hamming distances, no recent research has been carried out for erasures and deletions over permutations...
We study the rate-distortion relationship in the set of permutations endowed with the Kendall Tau metric and the Chebyshev metric. Our study is motivated by the application of permutation rate-distortion to the average-case and worst-case analysis of algorithms for ranking with incomplete information and approximate sorting algorithms. For the Kend...
We address the problem of multipermutation code design in the Ulam metric for novel storage applications. Multipermutation codes are suitable for flash memory where cell charges may share the same rank. Changes in the charges of cells manifest themselves as errors whose effects on the retrieved signal may be measured via the Ulam distance. As part...
Gene prioritization is a class of methods for discovering genes implicated in the onset and progression of a disease. As candidate genes are ranked based on similarity to known disease genes according to different sets of criteria, the overall aggregation of these ranked datasets is a vital step of the prioritization procedure. Aggregation of differ...
We introduce a parallel algorithmic architecture for metagenomic sequence assembly, termed MetaPar, which allows for significant reductions in assembly time and consequently enables the processing of large genomic datasets on computers with low memory usage. The gist of the approach is to iteratively perform read (re)classification based on phyloge...
We consider the problem of rank aggregation, where the goal is to assemble ordered lists into one consensus order. Our contributions consist of proposing a new family of distance measures that allow for incorporating practical ranking constraints into the aggregation problem formulation; showing how such distance measures arise from a generalizatio...
We address the problem of computing distances between rankings that take into account similarities between candidates. The need for evaluating such distances is governed by applications as diverse as rank aggregation, bioinformatics, social sciences and data storage. The problem may be summarized as follows. Given two rankings and a positive cost f...
In networked systems comprised of many agents, it is often required to reach a common operating point of all agents, termed the network consensus. We consider two iterative methods for reaching a ranking (ordering) consensus over a voter network, where the initial preference of every voter is of the form of a full ranking of candidates. The voters...
We propose a new family of algorithms for bounding/approximating the optimal solution of rank aggregation problems based on weighted Kendall distances. The algorithms represent linear programming relaxations of integer programs that involve variables reflecting partial orders of three or more candidates. Our simulation results indicate that the lin...
We present a general model for opinion dynamics in a social network together with several possibilities for object selections at times when the agents are communicating. We study the limiting behavior of such dynamics and show that the dynamics converge almost surely. We consider some special implications of the convergence result for gossip an...
We consider a classical problem in choice theory, vote aggregation, using novel distance measures between permutations that arise in several practical applications. The distance measures are derived through an axiomatic approach, taking into account various issues arising in voting with side constraints. The side constraints of interest include...
We consider rank modulation codes for flash memories that allow for handling arbitrary charge drop errors. Unlike classical rank modulation codes used for correcting errors that manifest themselves as swaps of two adjacently ranked elements, the proposed translocation codes account for more general forms of errors that arise in storage systems. Tra...
We consider the problem of non-uniform vote aggregation, and in particular, the algorithmic aspects associated with the aggregation process. For a novel class of weighted distance measures on votes, we present two different aggregation methods. The first algorithm is based on approximating the weighted distance measure by Spearman's footrule distan...
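For reference, the classical (unweighted) Spearman's footrule distance sums, over all candidates, the absolute difference between their positions in the two votes. The sketch below shows only this standard quantity; the weighted distance and the aggregation algorithms from the paper are not reproduced here, and the function name is illustrative.

```python
def spearman_footrule(sigma, pi):
    """Sum over items of |position in sigma - position in pi|."""
    pos_sigma = {item: rank for rank, item in enumerate(sigma)}
    pos_pi = {item: rank for rank, item in enumerate(pi)}
    if pos_sigma.keys() != pos_pi.keys():
        raise ValueError("both votes must rank the same candidates")
    return sum(abs(pos_sigma[item] - pos_pi[item]) for item in pos_sigma)


footrule = spearman_footrule(list("abc"), list("cba"))
```

The footrule is known to lie between the Kendall τ distance and twice that distance (the Diaconis-Graham inequality), which is why it is a natural approximation target.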
We consider the problem of rank aggregation based on new distance measures derived through axiomatic approaches and based on score-based methods. In the first scenario, we derive novel distance measures that allow for discriminating between the ranking process of highest and lowest ranked elements in the list. These distance functions represent wei...
We consider a class of small-sample distribution estimators over noisy channels. Our estimators are designed for repetition channels, and rely on properties of the runs of the observed sequences. These runs are modeled via a special type of Markov chains, termed alternating Markov chains. We show that alternating chains have redundancy that scales...
We consider the problem of finding the minimum cost transposition decomposition of a permutation. In this framework, arbitrary non-negative costs are assigned to individual transpositions and the task at hand is to devise polynomial-time, constant-approximation decomposition algorithms. We describe a polynomial-time algorithm based on specialized...
We address the problem of finding the minimum decomposition of a permutation in terms of transpositions with non-uniform cost. For metric-path costs, we describe exact polynomial-time decomposition algorithms. For extended-metric-path cost functions, we describe polynomial-time constant-approximation decomposition algorithms. Our algorithms rely on...
We address the problem of finding the minimum decomposition of a permutation in terms of transpositions with non-uniform cost. For arbitrary non-negative cost functions, we describe polynomial-time, constant-approximation decomposition algorithms. For metric-path costs, we describe exact polynomial-time decomposition algorithms. Our algorithms repr...
The "Japanese" theorem is extended to multiple multicast sessions in an arbitrary network to characterize the routing capacity region by the intersection of an infinite collection of halfspaces. An elimination technique is developed to simplify this infinite description into a finite one based upon the shortest routing paths and trees in the netw...