Scalable Techniques for Clustering the Web
Extended Abstract
Taher H. Haveliwala
Stanford University
taherh@db.stanford.edu
Aristides Gionis
Stanford University
gionis@db.stanford.edu
Piotr Indyk
Stanford University
indyk@db.stanford.edu
ABSTRACT
Clustering is one of the most crucial techniques for dealing with the massive amount of information present on the web. Clustering can either be performed once offline, independent of search queries, or performed online on the results of search queries. Our offline approach aims to efficiently cluster similar pages on the web, using the technique of Locality-Sensitive Hashing (LSH), in which web pages are hashed in such a way that similar pages have a much higher probability of collision than dissimilar pages. Our preliminary experiments on the Stanford WebBase have shown that the hash-based scheme can be scaled to millions of urls.
1. INTRODUCTION
Clustering, or finding sets of related pages, is currently one
of the crucial web-related information-retrieval problems.
Various forms of clustering are required in a wide range of
applications, including finding mirrored web pages, detect-
ing copyright violations, and reporting search results in a
structured way. With an estimated 1 billion pages current-
ly accessible on the web [19], the design of highly scalable
clustering algorithms is required.
Recently, there has been considerable work on web clustering. The approaches can be roughly divided into two categories^1:

Offline clustering, in which the entire web crawl data is used to precompute sets of pages that are related according to some metric. Published work on very-large-scale offline clustering has dealt mainly with a metric that provides a syntactic notion of similarity (initiated by Broder et al. [5], see also [9]), where the goal is to find pairs or clusters of web pages which are nearly identical.

Online clustering, in which clustering is done on the results of search queries, according to topic. Recent work has included both link-based (initiated by Dean and Henzinger [8]) and text-based (see Zamir and Etzioni [18]) methods.

^1 As far as the published results are concerned. Several major search engines, including AOL, Excite, Google and Infoseek, offer the “find related web pages” option, but the details of their algorithms are not publicly available.
Although the syntactic approach for finding duplicates has been tried offline on a large portion of the web, it cannot be used when the form of documents is not an issue (e.g., when two pages, one devoted to “automobiles” and the other focused on “cars,” are considered similar). The approaches taken by [5, 9, 17] cannot scale to the case where we are looking for similar, as opposed to almost identical, documents. Computing the document-document similarity matrix essentially requires processing the self-join of the relation DOCS(doc, word) on the word attribute, and counting the number of words each pair of documents has in common. The syntactic clustering algorithms [5, 9], on the other hand, use shingles or sentences rather than words to reduce the size of the self-join; this of course only allows for copy-detection-type applications. Although clever algorithms exist which do not require the self-join to be represented explicitly [9], their running time is still proportional to the size of the self-join, which could contain as many as 0.4 × 10^13 tuples if we are looking for similar, rather than identical, documents^2. Assuming a processing power of, say, 0.5 × 10^6 pairs per second^3, the running time of those algorithms could easily exceed 90 days.
Online methods based on link structure and text have been applied successfully to finding pages on related topics. Unfortunately, the text-based methods are not in general scalable to an offline clustering of the whole web. The link-based methods, on the other hand, suffer from the usual drawbacks of collaborative filtering techniques:

At least a few pages pointing to two pages are necessary in order to provide evidence of similarity between the two. This prevents search engines from finding the relations early in a page’s life, e.g., when the page is first crawled. Pages are discovered to be similar only when a sufficient number of people cocite them. By that time, the web pages are likely to already have been included in one of the popular web directories (e.g., Yahoo! or Open Directory), making the discovery less attractive.

Link-based methods are sensitive to specific choices made by the authors of web pages; for example, some people use (and point to) CNN weather information, while others prefer MSNBC, in which case there might be no “bridge” between these two pages.

^2 Our estimate was obtained using the sketching approach of [5]; see Section 3 for more details of that technique. Unfortunately, the relation is too large to compute its exact size.
^3 Estimate based on the experiment in [17], page 45.
We describe an ongoing project at Stanford whose goal is to build a scalable, offline clustering tool that overcomes the limitations of the above approaches, allowing topical clustering of the entire web. Our approach uses the text information from the pages themselves and from the pages pointing to them. In the simplest scenario, we might want to find web pages that share many similar words.

We use algorithms based on Locality-Sensitive Hashing (LSH), introduced by Indyk and Motwani [14]^4. The basic idea is to hash the web pages in a way such that pages which are similar, according to metrics we will discuss later, have a much higher probability of collision than pages which are very different. We show that using LSH allows us to circumvent the “self-join bottleneck” and make web clustering possible in a matter of a few days on modest hardware, rather than months if the aforementioned techniques were used.
In Section 2 we describe our representation of documents.
In Section 3 we discuss our similarity measure and provide
evidence of its validity. In Section 4 we show how we use
LSH techniques to efficiently find pairs of similar documents,
and in Section 5 we use the set of similar pairs to generate
clusters. We present our initial timing results in Section 6
and finally end with future work in Section 7.
2. BAG GENERATION
We now describe our representation of a given document. The most common representation used in the IR community is based on the vector space model, in which each document is treated as an n-dimensional vector, where dimension i represents the frequency of term i [16]. Because our similarity metric, described further in Section 3, is based on the intersection and union of multisets, we will use an equivalent characterization in which each document doc_u is represented by a bag B_u = {(w^u_1, f^u_1), ..., (w^u_k, f^u_k)}, where the w^u_i are the words present in the bag and the f^u_i are the corresponding frequencies. We consider two options for choosing which terms are assigned to a document's bag. In the first strategy, we take the obvious approach and say that B_u for doc_u is given by the multiset of words appearing in doc_u. Our second approach is to define B_u to be the union of all anchor-windows referring to doc_u. We will define anchor-windows in detail in Section 2.2, but briefly, term_j appears in B_u for each occurrence of term_j near a hyperlink to doc_u. We chose to drop both low-frequency and high-frequency words from all bags. For both the content-based and the anchor-based approach, we have the option of applying one of the commonly used variants of TFIDF scaling; we scale each word's frequency according to tfidf^u_i = f^u_i × log(N / df_i), where N is the number of documents, df_i is the overall document frequency of word i, and f^u_i is as before [16]. Finally, we normalize the frequencies within each bag so that all frequencies sum to a fixed number (100 in our implementation).
^4 A similar algorithm has been independently discovered by Broder [3].
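For concreteness, the following sketch shows how the TFIDF scaling and the normalization to 100 described above could be applied to a single bag. It is an illustration only, not the authors' implementation; the function names, the dictionary-based bag representation, and the use of the natural logarithm are our assumptions.

import math

def tfidf_bag(raw_bag, df, n_docs, total=100.0):
    """Scale the raw frequencies f^u_i of one bag by TFIDF and normalize them to sum to `total`.

    raw_bag -- dict mapping word -> raw frequency f^u_i for one document/url
    df      -- dict mapping word -> overall document frequency df_i
    n_docs  -- N, the total number of documents
    """
    scaled = {w: f * math.log(n_docs / df[w])
              for w, f in raw_bag.items() if df.get(w, 0) > 0}
    norm = sum(scaled.values())
    if norm == 0:
        return {}
    return {w: total * s / norm for w, s in scaled.items()}

# Example with hypothetical counts:
# tfidf_bag({"car": 3, "automobile": 1}, {"car": 1000, "automobile": 200}, n_docs=10**6)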
2.1 Content-Based Bags
The generation of content based bags is straightforward.
We scan through the web repository, outputting normalized
word-occurrence frequencies for each document in turn. The
following three heuristics are used to improve the quality of
the word bags generated:
All HTML comments, Javascript code, and non-alphabetic
characters are removed. All HTML tags are removed,
although image ‘alt’ text is preserved.
A custom stopword list containing roughly 750 terms
is used. Roughly 300 terms were taken from a popular
indexing engine. We then inspected the 500 highest
frequency terms from a sample of the repository and
included all words except for names of companies and
other terms we subjectively decided could be meaning-
ful.
The well known Porter’s stemming algorithm is used
to remove word endings [15].
Note that the content-based bags never need to be stored to disk; rather, the stream of bags is piped directly to the min-hash generator described in Section 3.
2.2 Anchor-Based Bags
The use of page content for clustering is problematic for several reasons. Oftentimes the top-level pages in a site contain mostly navigational links and image maps, and may not contain content useful for clustering [1]. Different pages use different styles of writing, leading to the well-known linguistic problems of polysemy and synonymy [16].

One way to alleviate these problems is to define the bag representing a document to be the multiset of occurrences of words near a hyperlink to the page. When pages are linked to, the anchor text, as well as the text surrounding the link, henceforth referred to collectively as anchor-windows, are often succinct descriptions of the page [1, 6]. The detrimental effects of synonymy in particular are reduced, since the union of all anchor-windows will likely contain most variations of words strongly indicative of the target page's content.

Also note that with the anchor-based approach, more documents can be clustered: because it is currently not feasible to crawl the entire web, for any web-crawl repository of size n, references to more than n pages will be made in anchors.
We now discuss the generation of anchor-based bags. We sequentially process each document in the web repository using the same heuristics given in Section 2.1, except that instead of outputting a bag of words for the current document, we output bag fragments for each url to which the document links. Each bag fragment consists of the anchor text of the url's link, as well as a window of words immediately preceding and immediately following the link. The issue of what window size yields the best results is still being investigated; initial experiments led us to use a window of size 8 (before and after the anchor text) for the results presented in this paper. In addition to anchor-windows, we generate an additional bag fragment for doc_u consisting solely of the words in the title of doc_u.

As they are generated, we write the bag fragments to one of M on-disk buckets, based on a hash of the url, where M is chosen such that a single bucket can fit in main memory. In our case, M = 256. After all bag fragments are generated, we sort (in memory) each bucket, collapse the bag fragments for a given url, apply TFIDF scaling as discussed in Section 2, and finally normalize the frequencies to sum to 100.
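The sketch below illustrates this fragment-generation step: it emits an anchor-window bag fragment for each link on a page and assigns each target url to one of the M on-disk buckets. The tokenized-page representation, the link triples, and the md5-based bucket hash are assumptions made for the example; the paper specifies only the window size of 8 and M = 256.

import hashlib
from collections import Counter

M = 256       # number of on-disk buckets
WINDOW = 8    # words kept before and after each anchor text

def bucket_of(url):
    """Map a url to one of the M buckets (the paper only says 'a hash of the url')."""
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % M

def anchor_fragments(tokens, links):
    """Yield (target_url, bag fragment) pairs for one processed page.

    tokens -- the page's words after the cleaning heuristics of Section 2.1
    links  -- list of (start, end, target_url) triples, where tokens[start:end]
              is the anchor text of a hyperlink (a hypothetical representation)
    """
    for start, end, url in links:
        window = (tokens[max(0, start - WINDOW):start]   # words before the anchor
                  + tokens[start:end]                     # the anchor text itself
                  + tokens[end:end + WINDOW])             # words after the anchor
        yield url, Counter(window)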
The remainder of our discussion will be limited to the use
of anchor-based bags for representing documents.
3. SIMILARITY MEASURE
The key idea of our approach is to create a small signature for each url, to ensure that similar urls have similar signatures. Recall that each url doc_u is represented as a bag B_u = {(w^u_1, f^u_1), ..., (w^u_k, f^u_k)}. For each pair of urls u and v, we define their similarity as sim(u, v) = |B_u ∩ B_v| / |B_u ∪ B_v|. The extension of the operation of intersection (union, resp.) from sets to bags is defined by taking as the resulting frequency of a word w the minimum (maximum, resp.) of the frequencies of w in the two bags to be intersected (merged, resp.).
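A direct implementation of this bag similarity, given here only to make the definition concrete (the dictionary representation of bags is an assumption), is:

def bag_similarity(bag_u, bag_v):
    """Generalized Jaccard similarity sim(u, v) = |B_u ∩ B_v| / |B_u ∪ B_v| on bags.

    Bags are dicts mapping word -> frequency; intersection takes the minimum
    frequency of each word, union the maximum.
    """
    words = set(bag_u) | set(bag_v)
    inter = sum(min(bag_u.get(w, 0), bag_v.get(w, 0)) for w in words)
    union = sum(max(bag_u.get(w, 0), bag_v.get(w, 0)) for w in words)
    return inter / union if union else 0.0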
Before discussing how we can efficiently find similar docu-
ments, we provide evidence suggesting that the above simi-
larity metric applied to anchor-based bags as defined in Sec-
tion 2.2 provides intuitive and useful results.
For all of our experiments, we used the first 12 million pages
of the Stanford WebBase repository, on a crawl performed
in January 1999 [11]. The 12 million pages led to the gen-
eration of anchor-based bags for 35 million urls.
We tested our approach to defining document-document sim-
ilarity as follows. We gathered all urls contained at the sec-
ond level of the Yahoo! hierarchy. We randomly chose 20
of the Yahoo! urls, and found the 10 nearest-neighbors for
each among our collection of 35 million urls, using the simi-
larity measure defined above. To find the neighbors for each
of the 20 urls, we simply scan through our bags and keep
track of the 10 nearest-neighbors for each. Of course this
brute force method will not work when we wish to discover
pairwise similarities among all 35 million urls in our collec-
tion; we will discuss in detail in Section 4 how to use LSH
to do this efficiently.
Note that we never utilize Yahoo’s classifications; we simply
use Yahoo! as a source of query urls. By inspecting the sets
of neighbors for each of the Yahoo! urls, we can qualitatively
judge how well our measure of document-document similar-
ity is performing. Substantial work remains in both measur-
ing and improving the quality of our similarity measure; a
quantitative comparison of how quality is affected based on
our parameters (i.e., adjusting anchor-window sizes, using
other IDF variants, using page content, etc...) is beyond the
scope of our current presentation, but is an important part
of our ongoing work. Our initial results suggest that using
anchor-windows is a valid technique for judging the simi-
larity of documents. We list seven of the nearest-neighbor
sets below. The basic topics of the sets are, respectively:
(1) English language studies (2) Dow Jones Index (3) roller-
coasters (4) food (5) French national institutes (6) headline
news, and (7) pets. In each set, the query url from Yahoo!
is first, followed by its 10 nearest neighbors.
----------------------
1:
eserver.org
www.links2go.com/go/humanitas.ucsb.edu
eng.hss.cmu.edu
www.rci.rutgers.edu/~wcd/engweb1.htm
www.mala.bc.ca/~mcneil/template.htx
www.links2go.com/more/humanitas.ucsb.edu
www.teleport.com/~mgroves
www.ualberta.ca/~englishd/litlinks.htm
www.links2go.com/add/humanitas.ucsb.edu
english-www.hss.cmu.edu/cultronix
sunsite.unc.edu/ibic/guide.html
----------------------
2:
www.dowjones.com
bis.dowjones.com
bd.dowjones.com
businessdirectory.dowjones.com
www.djinteractive.com/cgi-bin/NewsRetrieval
www.dow.com
www.motherjones.com
www.yahoo.com/Business
rave.ohiolink.edu/databases/login/abig
www.bankhere.com/personal/service/cssurvey/1,1695,,00.html
www.gamelan.com/workbench/y2k/y2k_052998.html
----------------------
3:
www.casinopier-waterworks.com
www.cite-espace.com
www.rollercoaster.com/census/blands_park
world.std.com/~fun/clp.html
www2.storylandnh.com/storyland
www.storylandnh.com
www.rollercoaster.com/census/funtown_pier.html
www.wwtravelsource.com/newhampshire.htm
www.rollercoaster.com/census/casino_pier.html
www.dinosaurbeach.com
www.usatoday.com/life/travel/leisure/1998/t1228tw.htm
----------------------
4:
www.foodchannel.com
www.epicurious.com/a_home/a00_home/home.html
www.gourmetworld.com
www.foodwine.com
www.cuisinenet.com
www.kitchenlink.com
www.yumyum.com
www.menusonline.com
www.snap.com/directory/category/0,16,-324,00.html
www.ichef.com
www.home-canning.com
----------------------
5:
www.insee.fr
www.ined.fr
www.statistik-bund.de/e_home.htm
www.ineris.fr
cri.ensmp.fr/dp
www.ping.at/patent/index.htm
www.inist.fr
www.inrp.fr
www.industrie.gouv.fr
www.inpi.fr
www.adit.fr
----------------------
6:
www.nando.net/nt/world
www.cnn.com/WORLD/index.html
www.oneworld.org/news/index.html
www.iht.com
www2.nando.net/nt/world
www.rferl.org
www.politicsnow.com
www.cfn.cs.dal.ca/Media/TodaysNews/TodaysNews.html
www.csmonitor.com
www.herald.com
www.pathfinder.com/time/daily
----------------------
7:
www.petnewsinc.com
www.petchannel.com/petindustry/print/vpn/main.htm
www.pettribune.com
www.nwf.org/rrick
www.petchannel.com/reptiles
www.petsandvets.com.au
www.moorshead.com/pets
www.ecola.com/news/magazine/animals
www.thevivarium.com
www.petlifeweb.com
www.menagerie.on.ca
4. LOCALITY SENSITIVE HASHING
To describe our algorithms, let us assume for a moment that B_u, as defined in Section 2, is a set instead of a bag. For this case, it is known that there exists a family H of hash functions (see [5]) such that for each pair of pages u, v we have Pr[mh(u) = mh(v)] = sim(u, v), where the hash function mh is chosen at random from the family H. The family H is defined by imposing a random order on the set of all words and then representing each url u by the smallest (according to that random order) element from B_u. In practice, it is quite inefficient to generate a fully random permutation of all words. Therefore, Broder et al. [5] use a family of random linear functions of the form h(x) = ax + b mod P; we use the same approach (see Broder et al. [4] and Indyk [13] for the theoretical background of this technique).

A simple observation is that the notion of a min-wise independent family of hash functions can be extended naturally from sets to bags. This is done by replacing each bag B = {(w_1, f_1), ..., (w_k, f_k)} by the set S = {w_1:1, ..., w_1:f_1, ..., w_k:1, ..., w_k:f_k}, where w_i:j denotes the concatenation of the word w_i with the number j. It is easy to see that for any two bags B_u and B_v we have |B_u ∩ B_v| = |S_u ∩ S_v| and |B_u ∪ B_v| = |S_u ∪ S_v|.

After flattening each bag B_u to the set S_u, a Min-Hash signature (MH-signature) can be computed as min_w {h(w) | w ∈ S_u}, where h(·) is a random linear function as described above. Such an MH-signature has the desired property that the same value indicates similar urls. However, the method is probabilistic, and therefore both false positives and false negatives are likely to occur. In order to reduce these inaccuracies, we apply the Locality-Sensitive Hashing (LSH) technique introduced by Indyk and Motwani [14]. According to the LSH scheme, we generate m MH-signatures for each url, and compute an LSH-signature by concatenating k of these MH-signatures. Since unrelated pages are unlikely to agree on all k MH-signatures, using an LSH-signature decreases the number of false positives but, as a side effect, increases the number of false negatives. In order to reduce the latter effect, l different LSH-signatures are extracted for each url. In that way, it is likely that two related urls agree on at least one of their LSH-signatures^5.
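The following sketch illustrates the flattening of a bag into the set S_u and the computation of m MH-signatures with random linear functions. It is a simplified stand-in for the authors' implementation: the choice of the prime P, the md5-based mapping of strings to integers, and the seeding are assumptions.

import hashlib
import random

P = 2**61 - 1   # a large prime for the linear functions h(x) = (a*x + b) mod P (assumed value)

def make_hash_fns(m, seed=0):
    """Draw the coefficients (a, b) of m random linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(m)]

def flatten(bag):
    """Replace each (word w_i, frequency f_i) by the elements w_i:1, ..., w_i:f_i.

    Assumes integer frequencies (the normalized bags sum to 100)."""
    return {f"{w}:{j}" for w, f in bag.items() for j in range(1, int(f) + 1)}

def to_int(s):
    """Map a flattened element to an integer (stable, unlike Python's built-in hash)."""
    return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")

def mh_signatures(bag, hash_fns):
    """One MH-signature per linear function: the minimum of h over the flattened set S_u."""
    elems = [to_int(x) for x in flatten(bag)]    # assumes a non-empty bag
    return [min((a * x + b) % P for x in elems) for a, b in hash_fns]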
The above discussion motivates our algorithm:

In the first step, url bags are scanned and m MH-signatures are extracted from each url. This is very easy to implement with one pass over the url bags. This is the only information about urls used by the rest of the algorithm.

In the second step, the algorithm generates LSH-signatures and outputs similar pairs of urls according to these LSH-signatures.
This second step is done as follows:

Algorithm: ExtractSimilarPairs

Do l times:
    Generate k distinct random indices, each from the interval {1, ..., m}
    For each url u:
        Create an LSH-signature for u by concatenating the
        MH-signatures selected by the k indices
    Sort all urls by their LSH-signatures
    For each run of urls with matching LSH-signatures:
        Output all pairs

The output pairs are written to disk.
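An in-memory sketch of ExtractSimilarPairs is shown below; for illustration it groups urls by LSH-signature with a dictionary instead of the external sort used in the actual implementation, and it yields candidate pairs instead of writing them to disk.

import random
from itertools import combinations

def extract_similar_pairs(mh, m, k, l, seed=0):
    """mh maps url -> list of its m MH-signatures; yields candidate pairs (u, v)."""
    rng = random.Random(seed)
    for _ in range(l):                                  # "Do l times"
        idx = rng.sample(range(m), k)                   # k distinct random indices in {1..m}
        buckets = {}
        for url, sigs in mh.items():
            lsh_sig = tuple(sigs[i] for i in idx)       # concatenate k MH-signatures
            buckets.setdefault(lsh_sig, []).append(url)
        for urls in buckets.values():                   # runs of matching LSH-signatures
            for u, v in combinations(urls, 2):
                yield u, v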
To enhance the quality of our results and reduce false positives, we perform a post-filtering stage on the pairs produced by the ExtractSimilarPairs algorithm. During this stage, each pair (u, v) is validated by checking whether the urls u and v agree on a fraction of their MH-signatures which is at least as large as the desired similarity level (say 20%). If the condition does not hold, the pair is discarded.

^5 For a more formal analysis of the LSH technique see [14, 10, 7].
The implementation of the filtering stage requires a linear scan over the pairs, assuming that all m MH-signatures for all urls fit in main memory. If this is not the case, more passes over the pair file might be needed. Notice that this step is the most main-memory-intensive part of our algorithm. In our actual implementation we used two additional techniques to reduce the memory requirements. The first is to keep in memory only one byte of each MH-signature. The second is to validate the pairs on fewer than m MH-signatures. Both techniques introduce statistical error.
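A sketch of this post-filtering stage is given below, assuming all MH-signatures fit in memory; unlike the actual implementation, it compares the full-width signatures rather than one byte of each.

def filter_pairs(candidate_pairs, mh, threshold=0.20):
    """Keep a candidate pair only if the fraction of agreeing MH-signatures
    is at least the desired similarity level (here 20%)."""
    for u, v in candidate_pairs:
        su, sv = mh[u], mh[v]
        agree = sum(a == b for a, b in zip(su, sv)) / len(su)
        if agree >= threshold:
            yield u, v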
Implementation choices: We chose to represent each MH-signature with w = 3 bytes. For each url we extract m = 80 MH-signatures, which leads to a space requirement of 240 bytes per url. By picking k = 3, the probability that two unrelated urls end up having the same LSH-signature is low; e.g., the probability that two urls with disjoint bags collide is at most 1/2^{8wk} = 1/2^{48}, which guarantees a very small number of false positives (for about 20 million urls). On the other hand, since we look for pairs with similarity at least 20%, a fixed pair of urls with similarity 20% gets the same MH-signature with probability 2/10, and the same LSH-signature with probability (2/10)^k = 1/125. In order to ensure that the pair would finally be discovered, that is, to ensure a small probability of false negatives, we have to take about 125 different LSH-signatures (l = 125).
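As a back-of-the-envelope check of these parameter choices, the following snippet reproduces the false-negative estimate above, under the simplifying assumption that the MH-signatures of a 20%-similar pair agree independently with probability 0.2:

k, l, sim = 3, 125, 0.20

p_lsh = sim ** k                 # a 20%-similar pair shares a given LSH-signature: (2/10)^3 = 1/125
p_found = 1 - (1 - p_lsh) ** l   # ...and shares at least one of the l LSH-signatures

print(p_lsh, p_found)            # 0.008 and roughly 0.63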
5. CLUSTERING
The set of similar document pairs S, generated by the algorithm discussed above, must then be sorted. Note that each pair appears twice, both as (u, v) and as (v, u). Sorting the pairs data, which is close to 300 GB for 20 million urls, is the most expensive step of our procedure. After the sort step, we can efficiently build an index over the pairs so that we can respond to “What's Related” type queries: given a query for document u, we can return the set {v | (u, v) ∈ S}.

We can proceed further by using the set of similar pairs, which represents the document-document similarity matrix, to group pages into flat clusters. The clustering step allows a more compact final representation than document pairs, and would be necessary for creating hierarchies^6.
To form flat clusters, we use a variant of the C-LINK algorithm due to Hochbaum and Shmoys [12], which we call CENTER. The idea of the algorithm is as follows. We can think of the similar pairs generated earlier as edges in a graph (the nodes correspond to urls). Our algorithm partitions the graph in such a way that in each cluster there is a center node and all other nodes in the cluster are “close enough” to the center. For our purposes, “close enough” means that there is an edge in the graph; that is, there is a pair found in the previous phase that contains the node and its center.
CENTER can be implemented very efficiently. The algorithm performs a sequential scan over the sorted pairs. The first time that a node u appears in the scan, it is marked as a cluster center. All subsequent nodes v that appear in pairs of the form (u, v) are marked as belonging to the cluster of u and are not considered again.

^6 We have not yet explored generating hierarchies.
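A compact sketch of CENTER over the sorted pair stream follows; for clarity it keeps the cluster assignments in an in-memory dictionary, and the names are ours.

def center_clustering(sorted_pairs):
    """Single scan of the CENTER heuristic over the sorted similar pairs.

    sorted_pairs -- iterable of (u, v) pairs sorted by u, with each pair
                    appearing in both orders, as produced by the previous phase
    Returns a dict mapping each clustered url to its cluster center.
    """
    assignment = {}
    for u, v in sorted_pairs:
        if u not in assignment:
            assignment[u] = u                 # first appearance: u becomes a center
        if assignment[u] == u and v not in assignment:
            assignment[v] = u                 # v joins u's cluster and is not considered again
    return assignment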
6. RESULTS
We discuss only the anchor-window approach here, although
the content based approach requires a similar running time.
As discussed in Section 3, our dataset consists of 35 million
urls whose anchor-bags were generated from 12 million web
pages. For the timing results presented here, we applied our
LSH-based clustering technique to a subset of 20 million
urls.
The timing results of the various stages are given in Table 1. We ran the first four steps of the experiment on a dual Pentium II 300 MHz machine with 512 MB of memory. The last two steps were performed on a dual Pentium II 450 MHz machine with 1 GB of memory. The timings for the last two steps are estimated from the time needed to generate the first 10,000 clusters, which is as many as we can inspect manually.
Algorithm step             No. CPUs   Time
bag generation             2          23 hours
bag sorting                2          4.7 hours
MH-signature generation    1          26 hours
pair generation            1          16 hours
filtering                  1          83 hours
sorting                    1          107 hours
CENTER                     1          18 hours

Table 1: Timing results
Developing an effective way to measure cluster quality when
the dataset consists of tens of millions of urls is an extremely
challenging problem. As discussed more formally in [14, 10,
7], the LSH technique has probabilistic guarantees as to how
well nearest-neighbors are approximated. Thus the initial
results for the quality of exact nearest-neighbors that we
described in Section 3 are indicative of our clustering quality.
We are currently investigating techniques to analyze more
thoroughly the overall clustering quality given the scale of
our input.
7. FUTURE WORK
We are actively developing the techniques we have intro-
duced in this paper. We plan to integrate our clustering
mechanism with the Stanford WebBase to facilitate user
feedback on our cluster quality, allowing us to both mea-
sure quality and make further enhancements. We also plan
to experiment with a hybrid approach to clustering, by first
using standard supervised classification algorithms to pre-
classify our set of urls into several hundred classes, and then
applying LSH-based clustering to the urls of each of the resulting classes. This would help desensitize our algorithm to word ambiguity, while still allowing for the generation of fine-grained clusters in a scalable fashion.
8. REFERENCES
[1] E. Amitay, “Using common hypertext links to identify the best phrasal description of target web documents”, SIGIR'98 (Workshop on Hypertext Information Retrieval for the Web).
[2] A. Broder, “On the resemblance and containment of
documents”, SEQUENCES’98, p. 21-29.
[3] A. Broder, “Filtering near-duplicate documents”,
FUN’98.
[4] A. Broder, M. Charikar, A. Frieze, M.
Mitzenmacher, “Min-wise independent
permutations”, STOC’98.
[5] A. Broder, S. Glassman, M. Manasse, G. Zweig,
“Syntactic clustering of the Web”, WWW6, p.
391-404, 1997.
[6] M. Craven, D. DiPasquo, D. Freitag, A. McCallum,
T. Mitchell, K. Nigam and S. Slattery, “Learning to
Extract Symbolic Knowledge from the World Wide
Web”, AAAI’98
[7] E. Cohen, M. Datar, S. Fujiware, A. Gionis, P.
Indyk, R. Motwani, J. Ullman, and C. Yang,
“Finding interesting associations without support
pruning”, ICDE’2000.
[8] J. Dean, M. Henzinger, “Finding related web pages in the world wide web”, WWW8, p. 389-401, 1999.
[9] M. Fang, H. Garcia-Molina, R. Motwani, N. Shivakumar and J. Ullman, “Computing iceberg queries efficiently”, VLDB'98.
[10] A. Gionis, P. Indyk, R. Motwani, “Similarity search
in high dimensions via hashing”, VLDB’99.
[11] J. Hirai, S. Raghavan, H. Garcia-Molina, A.
Paepcke, “WebBase: A repository of web pages”,
WWW9
[12] D. Hochbaum, D. Shmoys, “A best possible
heuristic for the k-center problem”, Mathematics of
Operations Research, 10(2):180-184, 1985.
[13] P. Indyk, “A small minwise independent family of
hash functions”, SODA’99.
[14] P. Indyk, R. Motwani, “Approximate nearest
neighbor: Towards removing the curse of
dimensionality”, STOC’98.
[15] M. Porter “An algorithm for suffix stripping”,
Program 14(3):130-137, 1980.
[16] G. Salton, M. J. McGill “Introduction to Modern
Information Retrieval”, McGraw-Hill Publishing
Company, New York, NY, 1983.
[17] N. Shivakumar, “Detecting Digital Copyright
Violations on the Internet” Ph.D. thesis, Stanford
University, 1999.
[18] O. Zamir, O. Etzioni “Web document clustering: A
feasibility demonstration”, SIGIR’98.
[19] “Inktomi WebMap”,
http://www.inktomi.com/webmap/