ArticlePDF Available

Use of Locality Sensitive Hashing (LSH) in NLP

Authors:

Abstract and Figures

Locality Sensitive Hashing (LSH) is a powerful technique utilized in Natural Language Processing (NLP) for grouping similar words based on their word embeddings. This paper provides an overview of LSH and its application in NLP tasks, focusing on the organization of similar words in translation tasks. Traditional hashing techniques often fail to guarantee that similar words will be assigned to the same bucket, making LSH a valuable solution. By employing LSH, similar words can be effectively mapped to the same or nearby buckets, enabling the creation of embedding matrices that facilitate language translation. Understanding the principles of LSH in NLP is essential for efficient language mapping and improving translation accuracy.
Content may be subject to copyright.
Hashing is a technique used to assign values to a key, employing various hashing algorithms.
One might wonder about the purpose of hashing in NLP. Well, when you're attempting to
translate one language to another, let's say Nepali to English, and you're using word embeddings
for both languages, there is a need for organizing similar words before training a model to
identify them. This involves placing similar words from both languages into the same list,
enabling the creation of a matrix that maps embeddings from one language to their
corresponding words in the other language.
To achieve this, hashing is utilized to group similar words from both languages into the same
bucket based on their word embeddings. However, using traditional hashing techniques does not
guarantee that similar words will be assigned to the same bucket. To address this challenge,
Locality Sensitive Hashing (LSH) can be employed. LSH facilitates the determination of
similarity between data points and ensures that they are mapped to the same or nearby buckets,
thereby accomplishing the grouping of similar words effectively.
Figure 1: Traditional hashing technique
Here's a step-by-step explanation of the method:
1. Plane Definition: The first step is to define a plane in the space. A plane can be
represented by a normal vector and a point on the plane, or by an equation that
describes the plane. The normal vector indicates the direction perpendicular to
the plane.
2. Position Calculation: For each data point, the position with respect to the plane
is determined. This can be done by calculating the signed distance between the
data point and the plane. The sign of the distance indicates whether the point is
on one side or the other side of the plane.
Figure 2: Calculation of hash value with respect to three planes
3. Hashing: Based on the calculated position, the data points are then hashed into
buckets. Typically, a hash function is used to map the position to a hash code or bucket.
Data points with similar positions or on the same side of the plane are more likely to be
mapped to the same or nearby buckets.
We can use multiple planes to get a single hash value. Let's take a look at the following
example:
Figure 3: Calculation of final hash value
4. Let's consider an example to illustrate how this hashing function works. Suppose
we have 3 planes, and a data point has the following positions relative to the
planes: h0 = 1 (on the same side of the first plane), h1 = 1 (on the other side of
the second plane), h2 = 0 (on the same side of the third plane). Using the
hashing function, the hash code for this data point would be:
2^0 *1 + 2^1 *1 + 2^2 * 0 = 2 + 1 + 0 = 3
So, the data point would be hashed to bucket number 3.
The advantage of this hashing function is that it generates distinct hash codes for
different combinations of positions, allowing similar data points with the same or
nearby positions to be mapped to the same or nearby buckets. By adjusting the
weights (powers of 2), you can control the importance of each position value in
the final hash code.
5. Indexing and Retrieval: The hashed data points are stored in a data structure,
such as a hash table or hash map, where each bucket contains the data points
that hashed to the same code. During retrieval, given a query data point, its
position with respect to the planes is calculated, and the search is performed
within the bucket(s) corresponding to the hashed position.
This method leverages the concept of separating data points based on their position
relative to the plane. Points that lie on the same side of the plane are considered more
similar to each other, and therefore, they are likely to be hashed into the same buckets.
It's important to note that the specific implementation and choice of hash functions may
vary depending on the requirements and characteristics of the data. The design of the
hash functions aims to maximize the probability of mapping similar data points to the
same buckets while maintaining an acceptable level of collision (different data points
being mapped to the same bucket).
This approach can be particularly useful for certain types of data and similarity search
tasks where separating data points based on their position to a plane is meaningful for
capturing similarity.
ResearchGate has not been able to resolve any citations for this publication.
ResearchGate has not been able to resolve any references for this publication.