## About

135 Publications

12,629 Reads

5,150 Citations

## Publications

Publications (135)

It is natural to generalize the $k$-Server problem by allowing each request to specify not only a point $p$, but also a subset $S$ of servers that may serve it. To attack this generalization, we focus on uniform and star metrics. For uniform metrics, the problem is equivalent to a generalization of Paging in which each request specifies not only a...

Huang and Wong (Acta Inform 21(1):113–123, 1984) proposed a polynomial-time dynamic-programming algorithm for computing optimal generalized binary split trees. We show that their algorithm is incorrect. Thus, it remains open whether such trees can be computed in polynomial time. Spuler (Optimal search trees using two-way key comparisons, PhD thesis...

We present a simple $O(n^4)$-time algorithm for computing optimal search trees with two-way comparisons. The only previous solution to this problem, by Anderson et al., has the same running time but is significantly more complicated and is restricted to the variant where only successful queries are allowed. Our algorithm extends directly to solve t...

Modern NoSQL database systems use log-structured merge (LSM) storage architectures to support high write throughput. LSM architectures aggregate writes in a mutable MemTable (stored in memory), which is regularly flushed to disk, creating a new immutable file called an SSTable. Some of the SSTables are chosen to be periodically merged—replaced with...

Search trees are commonly used to implement access operations to a set of stored keys. If this set is static and the probabilities of membership queries are known in advance, then one can precompute an optimal search tree, namely one that minimizes the expected access cost. For a non-key query, a search tree can determine its approximate location b...

Data-structure dynamization is a general approach for making static data structures dynamic. It is used extensively in geometric settings and in the guise of so-called merge (or compaction) policies in big-data databases such as Google Bigtable and LevelDB (our focus). Previous theoretical work is based on worst-case analyses for uniform inputs --...

We study the problem of selecting control clones in DNA array hybridization experiments. The problem arises in the OFRG method for analyzing microbial communities. The OFRG method performs classification of rRNA gene clones using binary fingerprints created from a series of hybridization experiments, where each experiment consists of hybridizing a...

This paper gives poly-logarithmic-round, distributed D-approximation algorithms for covering problems with submodular cost and monotone covering constraints (Submodular-cost Covering). The approximation ratio D is the maximum number of variables in any constraint. Special cases include Covering Mixed Integer Linear Programs (CMIP), and Weighted Ver...

In this Web 2.0 era, there is an ever increasing number of customer reviews, which must be summarized to help consumers effortlessly make informed decisions. Previous work on reviews summarization has simplified the problem by assuming that aspects (e.g., “display”) are independent of each other and that the opinion for each aspect in a review is B...

Huang and Wong [5] proposed a polynomial-time dynamic-programming algorithm for computing optimal generalized binary split trees. We show that their algorithm is incorrect. Thus, it remains open whether such trees can be computed in polynomial time. Spuler [11, 12] proposed modifying Huang and Wong's algorithm to obtain an algorithm for a different...

We consider the problem of political redistricting: given the locations of people in a geographical area (e.g. a US state), the goal is to decompose the area into subareas, called districts, so that the populations of the districts are as close as possible and the districts are "compact" and "contiguous," to use the terms referred to in most US sta...

We propose a method for redistricting, decomposing a geographical area into subareas, called districts, so that the populations of the districts are as close as possible and the districts are compact and contiguous. Each district is the intersection of a polygon with the geographical area. The polygons are convex and the average number of sides per...

In 1971, Knuth gave an \(O(n^2)\)-time algorithm for the classic problem of finding an optimal binary search tree. Knuth’s algorithm works only for search trees based on 3-way comparisons, but most modern computers support only 2-way comparisons (\(<\), \(\le \), \(=\), \(\ge \), and \(>\)). Until this paper, the problem of finding an optimal searc...

The Joint Replenishment Problem (JRP) is a fundamental optimization problem in supply-chain management, concerned with optimizing the flow of goods from a supplier to retailers. Over time, in response to demands at the retailers, the supplier ships orders, via a warehouse, to the retailers. The objective is to schedule these orders...

Over the past decade, time series clustering has become an increasingly important research topic in the data mining community. Most existing methods for time series clustering rely on distances calculated from the entire raw data using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant...

We study the following problem: given a set of keys and access probabilities,
find a minimum-cost binary search tree that uses only 2-way comparisons ($=, <,
\le$) at each node. We give the first polynomial-time algorithm when both
successful and unsuccessful queries are allowed, settling a long-standing open
question.
Our algorithm relies on a new...

We describe nearly linear-time approximation algorithms for explicitly given
mixed packing/covering and facility-location linear programs. The algorithms
compute $(1+\epsilon)$-approximate solutions in time $O(N \log(N)/\epsilon^2)$,
where $N$ is the number of non-zeros in the constraint matrix. We also describe
parallel variants taking time $O(\te...

We initiate the formal study of the online stack-compaction policies used by
big-data NoSQL databases such as Google Bigtable, Hadoop HBase, and Apache
Cassandra. We propose a deterministic policy, show that it is optimally
competitive, benchmark it against Bigtable's default policy, and suggest five
interesting open problems.

Can one choose a good Huffman code on the fly, without knowing the underlying
distribution? Online Slot Allocation (OSA) models this and similar problems:
There are n slots, each with a known cost. There are n items. Requests for
items are drawn i.i.d. from a fixed but hidden probability distribution p.
After each request, if the item, i, was not p...

We give a short proof that any comparison-based n^(1-epsilon)-approximation
algorithm for the 1-dimensional Traveling Salesman Problem (TSP) requires
Omega(n log n) comparisons.
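
For context on why the bound above is about sorting: on a line, an optimal closed tour can be found by sorting the points, which costs O(n log n) comparisons. The sketch below illustrates that upper bound only; it assumes the tour is closed (returns to its start) and that points are plain numbers. The function name `tsp_1d` is illustrative, not from the paper.

```python
from typing import List, Sequence, Tuple


def tsp_1d(points: Sequence[float]) -> Tuple[List[int], float]:
    """Return (visiting order as indices, closed-tour length) for points on a line.

    Visiting the points in sorted order and then returning to the start
    sweeps the interval [min, max] exactly twice, which is optimal.
    """
    if not points:
        return [], 0.0
    # Sorting is the only comparison-based step: O(n log n) comparisons.
    order = sorted(range(len(points)), key=points.__getitem__)
    length = 2 * (points[order[-1]] - points[order[0]])
    return order, length


order, length = tsp_1d([5, 1, 9, 3])
```

The lower bound in the abstract says that, for comparison-based algorithms, even a much weaker approximation cannot avoid this sorting-like cost.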

The Joint Replenishment Problem (JRP) is a fundamental optimization problem in supply-chain management, concerned with optimizing the flow of goods over time from a supplier to retailers. Over time, in response to demands at the retailers, the supplier sends shipments, via a warehouse, to the retailers. The objective is to schedule shipments to min...

The \emph{file caching} problem is defined as follows. Given a cache of size
$k$ (a positive integer), the goal is to minimize the total retrieval cost for
the given sequence of requests to files. A file $f$ has size $size(f)$ (a
positive integer) and retrieval cost $cost(f)$ (a non-negative number) for
bringing the file into the cache. A \emph{mis...

Given a satisfiable 3-SAT formula, how hard is it to find an assignment to
the variables that has Hamming distance at most n/2 to a satisfying assignment?
More generally, consider any polynomial-time verifier for any NP-complete
language. A d(n)-Hamming-approximation algorithm for the verifier is one that,
given any member x of the language, output...

Minimum-weight triangulation (MWT) is NP-hard. It has a polynomial-time
constant-factor approximation algorithm, and a variety of effective polynomial-
time heuristics that, for many instances, can find the exact MWT. Linear
programs (LPs) for MWT are well-studied, but previously no connection was known
between any LP and any approximation algorith...

This paper gives poly-logarithmic-round, distributed δ-approximation algorithms for covering problems with submodular cost and monotone covering constraints (Submodular-cost Covering). The approximation ratio δ is the maximum number of variables in any constraint. Special cases include Covering Mixed Integer Linear Programs (CMIP), and Weighted Ver...

Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community. Despite th...

We consider the problem of choosing Euclidean points to maximize the sum of their weighted pairwise distances, when each point is constrained to a ball centered at the origin. We derive a dual minimization problem and show strong duality holds (i.e., the resulting upper bound is tight) when some locally optimal configuration of points is affinely i...

We present efficient distributed δ-approximation algorithms for fractional packing and maximum weighted b-matching in hypergraphs, where δ is the maximum number of packing constraints in which a variable appears (for maximum weighted b-matching, δ is the maximum edge degree; for graphs, δ = 2). (a) For δ = 2 the algorithm runs in O(log m) rounds in exp...

This paper describes a greedy Δ-approximation algorithm for monotone covering, a generalization of many fundamental NP-hard covering problems. The approximation ratio Δ is the maximum number of variables on which any constraint depends. (For example, for vertex cover, Δ is 2.) The algor...

With fully directional communications, nodes must track the positions of their neighbors so that communication with these neighbors is feasible when needed. The tracking process introduces an overhead, which increases with the number of discovered neighbors. The overhead can be reduced if nodes maintain only a subset of their neighbors; however, this m...

The paper presents distributed and parallel δ-approximation algorithms for covering problems, where δ is the maximum number of variables on which any constraint depends (for example, δ = 2 for vertex cover). Specific results include the following. For weighted vertex cover, the first distributed 2-approximation algorithm taking O(log n) rounds and the first...

This paper describes a simple greedy D-approximation algorithm for any
covering problem whose objective function is submodular and non-decreasing, and
whose feasible region can be expressed as the intersection of arbitrary (closed
upwards) covering constraints, each of which constrains at most D variables of
the problem. (A simple example is Vertex...

In the k-median problem we are given sets of facilities and customers, and distances between them. For a given set F of facilities, the cost of serving a customer u is the minimum distance between u and a facility in F. The goal is to find a set F of k facilities that minimizes the sum, over all customers, of their service costs.
Following the wor...

We give an approximation algorithm for packing and covering linear programs
(linear programs with non-negative coefficients). Given a constraint matrix
with n non-zeros, r rows, and c columns, the algorithm computes feasible primal
and dual solutions whose costs are within a factor of 1+eps of the optimal cost
in time O((r+c)log(n)/eps^2 + n).
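
For reference, one common normal form of a packing linear program and its covering dual is sketched below. The symbols $A$, $b$, $c$, $x$, $y$ are notation assumed here for illustration and are not taken from the abstract; all entries of $A$, $b$, and $c$ are non-negative.

```latex
\[
\text{(packing)}\quad \max\{\, c^{\mathsf T} x \;:\; Ax \le b,\ x \ge 0 \,\}
\qquad\qquad
\text{(covering)}\quad \min\{\, b^{\mathsf T} y \;:\; A^{\mathsf T} y \ge c,\ y \ge 0 \,\}
\]
```

By LP duality both programs have the same optimal value OPT, so a feasible pair $(x, y)$ with $b^{\mathsf T} y \le (1+\epsilon)\, c^{\mathsf T} x$ certifies that each solution is within a factor $1+\epsilon$ of OPT.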

We give an approximation algorithm for packing and covering linear programs (linear programs with non-negative coefficients). Given a constraint matrix with n non-zeros, r rows, and c columns, the algorithm (with high probability) computes feasible primal and dual solutions whose costs are within a factor of 1+ε of OPT (the opt...

Dimension attributes in data warehouses are typically hierarchical, and a variety of OLAP applications (such as point-of-sales analysis and decision support) call for summarizing the measure attributes in fact tables along the hierarchies of these attributes. For example, the total sales at different stores can be summarized hierarchically by geogr...

We start with definitions given by Plotkin, Shmoys, and Tardos [16]. Given $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and a polytope $P \subseteq \mathbb{R}^n$, the fractional packing problem is to find an $x \in P$ such that $Ax \le b$ if such an $x$ exists. An $\epsilon$-approximate solution to this problem is an $x \in P$ such that $Ax \le (1+\epsilon)b$. An $\epsilon$-relaxed decision procedure always finds an $\epsilon$-approxima...

Dimension attributes in data warehouses are typically hierarchical (e.g., geographic locations in sales data, URLs in Web traffic logs). OLAP tools are used to summarize the measure attributes (e.g., total sales) along a dimension hierarchy, and to characterize changes (e.g., trends and anomalies) in a hierarchical summary over time. When the number...

With directional antennas, it is extremely important that a node maintains information with regard to the positions of its neighbors. This would allow the node to "track" the neighbors as they move; otherwise, a node will have to resort to either omnidirectional or circular directional transmissions (or receptions) fairly often. This can be overhe...

We consider the following variant of Huffman coding in which the costs of the letters, rather than the probabilities of the words, are non-uniform: Given an alphabet of unequal-length letters, find a minimum-average-length prefix-free set of n codewords over the alphabet. We show new structural properties of such codes, leading to an O(n log² r) ti...

Our objective in this paper is to design topology control algorithms such that (i) nodes have low degree and (ii) paths in the network have few hops. Low node degree is desirable in networks equipped with smart antennas and to reduce access contention. Short paths are desirable for minimizing communication delays and for better robustness to channe...

The Reverse Greedy algorithm (RGreedy) for the k-median problem works as follows. It starts by placing facilities on all nodes. At each step, it removes a facility to minimize the resulting total distance from the customers to the remaining facilities. It stops when k facilities remain. We prove that, if the distance function is metric, then the ap...
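
A minimal Python sketch of the Reverse Greedy procedure as described above. It assumes every node is both a customer and a candidate facility, that distances are given as a nested dictionary `dist[u][f]`, that `k >= 1`, and that ties are broken by input order. The names `total_cost` and `reverse_greedy` are illustrative and not taken from the paper; this naive version recomputes costs from scratch and makes no attempt at efficiency.

```python
from typing import Dict, Hashable, Iterable, List, Set

Node = Hashable
Dist = Dict[Node, Dict[Node, float]]


def total_cost(dist: Dist, customers: Iterable[Node], facilities: Set[Node]) -> float:
    """Sum over customers of the distance to the nearest open facility."""
    return sum(min(dist[u][f] for f in facilities) for u in customers)


def reverse_greedy(dist: Dist, nodes: List[Node], k: int) -> Set[Node]:
    """Start with a facility on every node; repeatedly drop the facility
    whose removal leaves the smallest total cost, until k remain."""
    facilities = set(nodes)
    while len(facilities) > k:
        # Candidates in input order, so ties are broken deterministically.
        candidates = [f for f in nodes if f in facilities]
        to_remove = min(
            candidates,
            key=lambda f: total_cost(dist, nodes, facilities - {f}),
        )
        facilities.remove(to_remove)
    return facilities


# Example: four points on a line, forming two well-separated pairs.
points = [0, 1, 10, 11]
dist = {u: {v: abs(u - v) for v in points} for u in points}
chosen = reverse_greedy(dist, points, k=2)
```

On this instance the sketch keeps one facility in each pair, which is what one would expect from a sensible 2-median solution.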

Following Mettu and Plaxton [22, 21], we study oblivious algorithms for the k-medians problem. Such an algorithm produces an incremental sequence of facility sets. We give improved algorithms, including a (24+ε)-competitive deterministic polynomial algorithm and a 2e ≈ 5.44-competitive randomized non-polynomial algorithm. Our approach is similar to...

The multiway-cut problem is, given a weighted graph and k >= 2 terminal
nodes, to find a minimum-weight set of edges whose removal separates all the
terminals. The problem is NP-hard, and even NP-hard to approximate within
1+delta for some small delta > 0.
Calinescu, Karloff, and Rabani (1998) gave an algorithm with performance
guarantee 3/2-1/k, b...

Large surveys using multiobject spectrographs require automated methods for deciding how to efficiently point observations and how to assign targets to each pointing. The Sloan Digital Sky Survey (SDSS) will observe around 10^6 spectra from targets distributed over an area of about 10,000 deg², using a multiobject fiber spectrograph that can simulta...

Consider the following file caching problem: in response to a sequence of requests for files, where each file has a specified size and retrieval cost, maintain a cache of files of total size at most some specified k so as to minimize the total retrieval cost. Specifically, when a requested file is not in the cache, bring it into the cache and pay...

A generalization of the Seidel-Entringer-Arnold method for calculating the alternating permutation numbers (or secant-tangent numbers) leads to a new operation on integer sequences, the Boustrophedon transform.
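
The abstract does not spell out the transform, so the following is a sketch based on the commonly stated triangle recurrence: row n starts with the input term a_n, each further entry adds the entry to its left to an entry of the previous row read in the opposite direction, and the last entry of row n is the output term b_n. The function name `boustrophedon` is illustrative.

```python
from typing import List, Sequence


def boustrophedon(a: Sequence[int]) -> List[int]:
    """Boustrophedon transform of a finite integer sequence.

    Triangle rule (rows indexed from 0):
        T[n][0] = a[n]
        T[n][k] = T[n][k-1] + T[n-1][n-k]   for 1 <= k <= n
    Output term: b[n] = T[n][n].
    """
    b: List[int] = []
    prev: List[int] = []
    for n, a_n in enumerate(a):
        row = [a_n]
        for k in range(1, n + 1):
            # The previous row is read right to left as k increases.
            row.append(row[k - 1] + prev[n - k])
        b.append(row[-1])
        prev = row
    return b


# Input 1, 0, 0, ... yields the alternating permutation
# (secant-tangent) numbers mentioned in the abstract.
zigzag = boustrophedon([1, 0, 0, 0, 0, 0, 0])  # [1, 1, 1, 2, 5, 16, 61]
```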

The problem considered is the following. Given a graph with edge weights
satisfying the triangle inequality, and a degree bound for each vertex, compute
a low-weight spanning tree such that the degree of each vertex is at most its
specified bound. The problem is NP-hard (it generalizes Traveling Salesman
(TSP)). This paper describes a network-flow...

This report presents notes from the first eight lectures of the class Many
Models of Complexity taught by Laszlo Lovasz at Princeton University in the
fall of 1990. The topic is evasiveness of graph properties: given a graph
property, how many edges of the graph an algorithm must check in the worst case
before it knows whether the property holds.

Congestion control in the current Internet is accomplished mainly by TCP/IP. To understand the macroscopic network behavior that results from TCP/IP and similar end-to-end protocols, one main analytic technique is to show that the protocol maximizes some global objective function of the network traffic. Here we analyze a particular end-to-end,...

The goal of the Sloan Digital Sky Survey is "to map in detail one-quarter of
the entire sky, determining the positions and absolute brightnesses of more
than 100 million celestial objects". The survey will be performed by taking
"snapshots" through a large telescope. Each snapshot can capture up to 600
objects from a small circle of the sky. Th...

Two common objectives for evaluating a schedule are the makespan, or schedule
length, and the average completion time. This short note gives improved bounds
on the existence of schedules that simultaneously optimize both criteria. In
particular, for any rho> 0, there exists a schedule of makespan at most 1+rho
times the minimum, with average comple...

In this paper we introduce the notion of approximate data structures, in which a small amount of error is tolerated in the output. Approximate data structures trade error of approximation for faster operation, leading to theoretical and practical speedups for a wide variety of algorithms. We give approximate variants of the van Emde Boas data struc...

Von Neumann's Min-Max Theorem guarantees that each player of a zero-sum matrix game has an optimal mixed strategy. This paper gives an elementary proof that each player has a near-optimal mixed strategy that chooses uniformly at random from a multiset of pure strategies of size logarithmic in the number of pure strategies available to the opponent....

The parametric shortest path problem is to find the shortest paths in a graph where the edge costs are of the form w_ij+lambda, where each w_ij is constant and lambda is a parameter that varies. The problem is to find shortest path trees for every possible value of lambda. The minimum-balance problem is to find a "weighting" of the vertices so that...

Pattern-matching-based document-compression systems (e.g. for faxing) rely on
finding a small set of patterns that can be used to represent all of the ink in
the document. Finding an optimal set of patterns is NP-hard; previous
compression schemes have resorted to heuristics. This paper describes an
extension of the cross-entropy approach, used pre...

Given matrices A and B and vectors a, b, c and d, all with non-negative entries, we consider the problem of computing $\min\{c^T x : x \in \mathbb{Z}^n_+,\ Ax \ge a,\ Bx \le b,\ x \le d\}$. We give a bicriteria-approximation algorithm that, given ε∈(0,1], finds a solution of cost O(ln(m)/ε²) times optimal, meeting the covering constraints (Ax⩾a) and multiplicity constraints (x⩽d), and satisfying Bx⩽(1+ε)b+β, where β...

We give a polynomial-time approximation scheme for the generalization of
Huffman Coding in which codeword letters have non-uniform costs (as in
Morse code, where the dash is twice as long as the dot). The algorithm
computes a (1+epsilon)-approximate solution in time O(n + f(epsilon)
log^3 n), where n is the input size.

Congestion control in the current Internet is accomplished mainly by TCP/IP. To understand the macroscopic network behavior that results from TCP/IP and similar end-to-end protocols, one main analytic technique is to show that the protocol maximizes some global objective function of the network traffic. We analyze a particular end-to-end MIMD (...

This paper gives a simple linear-time algorithm that, given a weighted
digraph, finds a spanning tree that simultaneously approximates a
shortest-path tree and a minimum spanning tree. The algorithm provides a
continuous trade-off: given the two trees and epsilon > 0, the
algorithm returns a spanning tree in which the distance between any
vertex and...

In this paper we give a natural probability distribution of fractional packing instances such that, for an instance chosen at random, with probability 1 − o(1) any Dantzig-Wolfe-type ε-relaxed procedure must make at

In the standard Huffman coding problem, one is given a set of words and for each word a positive frequency. The goal is to encode each word w as a codeword c(w) over a given alphabet. The encoding must be prefix free (no codeword is a prefix of any other) and should minimize the weighted average codeword size Σ_w freq(w) |c(w)|. T...

We describe sequential and parallel algorithms that approximately solve linear programs with no negative coefficients (aka mixed packing and covering problems). For explicitly given problems, our fastest sequential algorithm returns a solution satisfying all constraints within a 1±ε factor in O(md log(m)/ε²) time, where m is the...

It is well known that every 2-edge-connected graph can be oriented so that the resulting

We report on implementation and a modest experimental evaluation of a recently introduced

We study the general (non-metric) facility-location and weighted k-medians problems, as well as the fractional facility-location and unweighted k-medians problems. We describe a natural randomized rounding scheme and use it to derive approximation algorithms for all of these problems. For facility location and weighted k-medians, the respective alg...

Randomized rounding is a standard method, based on the probabilistic method,
for designing combinatorial approximation algorithms. In Raghavan's seminal
paper introducing the method (1988), he writes: "The time taken to solve the
linear program relaxations of the integer programs dominates the net running
time theoretically (and, most likely, in pr...

In this problem, the input is a sequence of requests for files, given on-line (one at a time). Each file has a non-negative size and a non-negative retrieval cost. The problem is to decide which files to keep in a fixed-size cache so as to minimize the sum of the retrieval costs for files that are not in the cache when requested. The problem arises...

Weighted caching is a generalization of paging in which the cost to

We give an efficient deterministic parallel approximation algorithm for the minimum-weight

The Sloan Digital Sky Survey (SDSS) will observe around 10^6 spectra from targets distributed over an area of about 10,000 square degrees, using a multi-object fiber spectrograph which can simultaneously observe 640 objects in a circular field-of-view (referred to as a "tile") 1.49 degrees in radius. No two fibers can be placed closer than 55'' d...

The MEG (minimum equivalent graph) problem is, given a directed graph, to
find a small subset of the edges that maintains all reachability relations
between nodes. The problem is NP-hard. This paper gives a proof that, for
graphs where each directed cycle has at most three edges, the MEG problem is
equivalent to maximum bipartite matching, and ther...