BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs
Saurabh Gupta
Amazon Web Services
gsaur@amazon.com
Vineet Khare
Amazon Web Services
vkhare@amazon.com
ABSTRACT
Word2Vec is a popular algorithm used for generating dense vector representations of words in large corpora using unsupervised learning. The resulting vectors have been shown to capture semantic relationships between the corresponding words and are used extensively for many downstream natural language processing (NLP) tasks like sentiment analysis, named entity recognition and machine translation. Most open-source implementations of the algorithm have been parallelized for multi-core CPU architectures, including the original C implementation by Mikolov et al. [1] and FastText [2] by Facebook. A few other implementations have attempted to leverage GPU parallelization, but at the cost of accuracy and scalability. In this work, we present BlazingText, a highly optimized implementation of word2vec in CUDA that can leverage multiple GPUs for training. BlazingText can achieve a training speed of up to 43M words/sec on 8 GPUs, which is a 9x speedup over 8-threaded CPU implementations, with minimal effect on the quality of the embeddings.
CCS CONCEPTS
• Computing methodologies → Neural networks; Natural language processing;
KEYWORDS
Word embeddings, Word2Vec, Natural Language Processing, Machine Learning, CUDA, GPU
ACM Reference format:
Saurabh Gupta and Vineet Khare. 2017. BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs. In Proceedings of MLHPC'17: Machine Learning in HPC Environments, Denver, CO, USA, November 12–17, 2017, 5 pages.
https://doi.org/10.1145/3146347.3146354
1 INTRODUCTION
Word2Vec aims to represent each word as a vector in a low-dimensional embedding space such that the geometry of the resulting vectors captures word semantic similarity through the cosine similarity of the corresponding vectors, as well as more complex relationships through vector subtraction, such as vec("King") - vec("Queen") + vec("Woman") ≈ vec("Man").
This idea has enabled many Natural Language Processing (NLP) algorithms to achieve better performance [3, 4].
The optimization in word2vec is done using Stochastic Gradient Descent (SGD), which solves the problem iteratively; at each step, it picks a pair of words: an input word and a target word, either from its window or a random negative sample. It then computes the gradients of the objective function with respect to the two chosen words, and updates the word representations of the two words based on the gradient values. The algorithm then proceeds to the next iteration with a different word pair being chosen.
One of the main issues with SGD is that it is inherently sequential;
since there is a dependency between the update from one iteration
and the computation in the next iteration (they may happen to touch
the same word representations), each iteration must potentially wait
for the update from the previous iteration to complete. This does
not allow us to use the parallel resources of the hardware.
To address the above issue, word2vec uses Hogwild [5], a scheme where different threads process different word pairs in parallel and ignore any conflicts that may arise in the model update phases. In theory, this can reduce the rate of convergence of the algorithm as compared to a sequential run. However, the Hogwild approach has been shown to work well in cases where updates across threads are unlikely to touch the same word; and indeed, for large vocabulary sizes, conflicts are relatively rare and convergence is not typically affected.
The success of the Hogwild approach for word2vec on multi-core architectures makes this algorithm a good candidate for exploiting GPUs, which provide orders of magnitude more parallelism than CPUs. In this paper, we propose an efficient parallelization technique for accelerating word2vec using GPUs.
GPU acceleration using deep learning frameworks is not a good choice for accelerating word2vec [6]. These frameworks are often suitable for "deep networks" where the computation is dominated by heavy operations like convolutions and large matrix multiplications. On the other hand, word2vec is a relatively shallow network, as each training step consists of an embedding lookup, gradient computation and finally weight updates for the word pair under consideration. The gradient computation and updates involve small dot products and thus don't benefit from the use of the cuDNN [7] or cuBLAS [8] libraries.
The limitations of deep learning frameworks led us to explore the CUDA C++ API. We design the training algorithm from scratch to utilize CUDA multi-threading capabilities optimally, without hurting the output accuracy by over-exploiting GPU parallelism. Finally, to scale out BlazingText to process a text corpus at several million words/sec, we demonstrate the possibility of using multiple GPUs to perform data-parallel training, which is one of the main contributions of our work. We benchmark BlazingText against
commonly available open-source tools to illustrate its superior
performance.
The rest of the paper is organized as follows. In Sec. 2, we describe the original word2vec model. In Sec. 3, we review the existing approaches to accelerate word2vec using GPUs or multi-node CPUs. We describe our efforts using the CUDA C++ API in Sec. 4. Experimental results on the Text8 and One Billion Words Benchmark datasets are presented in Sec. 5, followed by conclusion and future work in Sec. 6.
2 WORD2VEC MODEL
Word2vec represents each word $w$ in a vocabulary $V$ as a low-dimensional dense vector $v_w$ in an embedding space $\mathbb{R}^D$. It attempts to learn the continuous word vectors $v_w$, $\forall w \in V$, from a training corpus such that the spatial distance between words describes their similarity, e.g., the closer two words are in the embedding space, the more similar they are semantically and syntactically. Inspired by the distributional hypothesis (Harris, 1954), these representations are trained to predict words appearing in the context of a given word. Under this hypothesis, two distinct model architectures, Continuous Bag-Of-Words (CBOW) and Skip-Gram with Negative Sampling (SGNS), are proposed in word2vec [1]. The objective of CBOW is to predict a word given its context, whereas Skipgram tries to predict the context given a word. In practice, Skipgram gives better performance and is described below.
More formally, given a large training corpus represented as a sequence of words $w_1, w_2, \ldots, w_T$, the objective of the skipgram model is to maximize the log-likelihood

$$\sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t)$$
where $T$ is the length of the corpus and the context $C_t$ is the set of indices of words surrounding word $w_t$. The probability of observing a context word $w_c$ given $w_t$ will be parameterized using the aforementioned word vectors. For now, let us consider that we are given a scoring function $s$ which maps pairs of (word, context) to scores in $\mathbb{R}$. One possible choice to define the probability of a context word is the softmax:

$$p(w_c \mid w_t) = \frac{\exp(s(w_t, w_c))}{\sum_{j=1}^{W} \exp(s(w_t, j))}$$

where $W$ is the vocabulary size. However, such a model is not adapted to our case, as it implies that, given a word $w_t$, we only predict one context word $w_c$.
The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position $t$, we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:

$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in N_{t,c}} \log\left(1 + e^{s(w_t, n)}\right)$$
where $N_{t,c}$ is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$, we can re-write the objective as:

$$\sum_{t=1}^{T} \sum_{c \in C_t} \left[ \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, n)) \right]$$
A natural parameterization for the scoring function $s$ between a word $w_t$ and a context word $w_c$ is to use word vectors. Let us define for each word $w$ in the vocabulary two vectors $u_w$ and $v_w$ in $\mathbb{R}^D$. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors $u_{w_t}$ and $v_{w_c}$, corresponding to words $w_t$ and $w_c$ respectively. Then the score can be computed as the dot product between word and context vectors, $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$.
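
Differentiating $\ell$ gives $\partial \ell(s) / \partial s = \sigma(s) - 1$ for a positive pair and $\sigma(s)$ for a negative one, where $\sigma$ is the sigmoid, so both cases collapse into a single update rule with a binary label. The following C++ sketch of one SGNS update (our own illustration; the names are not from the paper's code) shows how small each step is: one dot product and two vector updates of dimension $D$.

    #include <cmath>

    // One SGNS update for a single (input, target) pair.
    // label = 1.0f for a true context word, 0.0f for a negative sample.
    void sgns_update(float* u_wt, float* v_w, int dim, float label, float lr) {
        float s = 0.0f;                        // s(w_t, w) = u_{w_t}^T v_w
        for (int d = 0; d < dim; ++d) s += u_wt[d] * v_w[d];
        float sigma = 1.0f / (1.0f + std::exp(-s));
        float g = lr * (label - sigma);        // -dl/ds scaled by the learning rate
        for (int d = 0; d < dim; ++d) {
            float u_old = u_wt[d];
            u_wt[d] += g * v_w[d];             // update the input (word) vector
            v_w[d]  += g * u_old;              // update the output (context) vector
        }
    }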
3 RELATED WORK
While some existing word2vec systems are limited to running on a single-machine CPU, there have been some efforts in accelerating word2vec using multi-core CPU nodes for distributed data-parallel training. These include the training systems available in Apache Spark MLLib [9] and Deeplearning4j [10]. These systems rely on the reduce operation to synchronize the model between all executors after every iteration. Broadcasting all the word vectors across nodes limits the scalability, as typical network bandwidths are an order of magnitude lower than CPU memory bandwidths. Moreover, the model accuracy plummets as more nodes are employed to increase the throughput.
The work done by Ji et al. [11] demonstrates strong scalability on CPU nodes by using a scheme based on minibatching and shared negative samples. Their approach converts level-1 BLAS operations into level-3 BLAS matrix multiply operations, hence efficiently leveraging the vectorized multiply-add instructions of modern architectures. However, they still don't leverage GPUs, and their implementation scales well on Intel BDW and Intel KNL processors, which can be much more expensive than GPUs and are not yet provided by the major cloud services platforms. Using their idea, we shared the negative samples across a minibatch and used the highly optimized cuBLAS level-3 matrix multiplication kernels, but due to the small size of the matrices being multiplied, the overhead of CUDA kernel launches drastically reduced the performance and scalability.
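
For reference, the batched formulation looks roughly as follows (a sketch under our own naming, not the authors' code): sharing one set of S negative samples across a minibatch of B input words turns B x S individual dot products into a single small GEMM, here expressed with cuBLAS.

    #include <cublas_v2.h>

    // Computes scores = inputs * negatives^T, with all matrices row-major:
    // inputs is B x D (one row per minibatch word), negatives is S x D (one
    // row per shared negative sample), scores is B x S. cuBLAS is
    // column-major, so we request scores^T = negatives * inputs^T, which
    // leaves the row-major B x S result in `scores`.
    void batch_scores(cublasHandle_t handle, const float* inputs,
                      const float* negatives, float* scores,
                      int B, int S, int D) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    S, B, D,
                    &alpha, negatives, D, inputs, D,
                    &beta, scores, S);
    }

In our setting the per-call matrices were small (e.g., D = 100), so the kernel-launch overhead dominated, which is the effect noted above.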
Several other works [
12
,
13
] have tried to utilize deep learning
libraries like TensorFlow, Keras and Theano, but show even slower
performance compared to the FastText CPU implementation. Since
an iteration of word2vec is not very compute intensive, a large batch
size is needed to fully utilize the GPU. However, mini-batching of
training data reduces the rate of convergence dramatically and with
a batch size of one, training becomes extremely slow.
4 GPU PARALLELIZATION USING CUDA
Since the GPU architecture provides orders of magnitude more parallelism than CPU cores, word2vec seems to be a good fit for acceleration using GPUs, as the algorithm itself exhibits a good amount of parallelism, as exploited by asynchronous SGD or Hogwild. However, as more parallelism is used, different threads might conflict with each other at the time of reading and updating word vectors, resulting in a huge accuracy drop. Thus, careful consideration
should be given to managing the trade-off between the level of parallelism and synchronization.
Deep learning frameworks don't provide the fine-grained control over GPU scalability and parallelism that the CUDA C++ API offers. This necessitates refactoring the algorithm design significantly, enabling maximum parallel throughput while enforcing enough synchronization to prevent a slump in accuracy. We explored the following two approaches for implementing the algorithm using CUDA.
4.1 One Thread Block per word
The original word2vec implementation processes a sentence sequentially, i.e., for each center word $w_t$ it considers all words in the window of size $ws$ around that center word as target words, which means that all the word vectors in $[w_{t-ws}, w_{t+ws}]$ will get updated in one step. Similarly, in the next step, for the center word $w_{t+1}$, all the vectors in the range $[w_{t-ws+1}, w_{t+ws+1}]$ will be updated. This implies that when processing a sentence, a word vector can be modified up to $2ws + 1$ times, and ideally each consecutive step should use the updated vectors that were modified by the previous step.
Designing a CUDA program from scratch requires us to make decisions about the structure of the grid and thread blocks. Threads within the same thread block can synchronize with each other, but threads belonging to different thread blocks cannot communicate with each other, making the thread blocks independent of each other. In this approach, we choose to allocate one thread block per word, with the number of threads within a block being a multiple of 32, for warp-related efficiency. As the number of threads within a block is close to the vector dimension (usually 100), each thread maps to a vector dimension and does an element-wise multiplication. These individual products are then efficiently summed using a reduce kernel to compute the dot product between any two given vectors. Several parallel reduction techniques were explored, and finally a highly optimized, completely unrolled reduce kernel was employed.
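
The kernel below is our reconstruction of this pattern, not the authors' code: each thread multiplies one dimension of the two vectors into shared memory, and a tree reduction produces the score (the fully unrolled variant mentioned above trades this loop for explicit warp-level steps). It assumes blockDim.x is a power of two no larger than 128.

    __global__ void dot_per_block(const float* u, const float* v,
                                  float* out, int dim) {
        __shared__ float partial[128];        // assumes blockDim.x <= 128
        int tid = threadIdx.x;
        // Element-wise product, one dimension per thread; pad the rest with zeros.
        partial[tid] = (tid < dim) ? u[tid] * v[tid] : 0.0f;
        __syncthreads();
        // Shared-memory tree reduction; assumes blockDim.x is a power of two.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0) *out = partial[0];      // thread 0 writes the dot product
    }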
This algorithm design exploits parallelism to its peak, as each word is processed independently by a thread block, and the threads within a thread block synchronize only when executing the reduce operation. However, the approach suffers from a major drawback that significantly undermines the quality of the embeddings. Different thread blocks can modify the word vectors independently with no synchronization, which can be very detrimental to the accuracy, since each vector can be updated up to $2ws + 1$ times as the window slides over the sentence. This results in a large number of thread overwrites and stale reads.
4.2 One Thread Block per sentence
As discussed in the previous section, updating word vectors without any synchronization makes full use of CUDA parallelism but degrades the accuracy of embeddings due to race conditions. As the window slides over the sentence, the updated vectors of previous words should be used when processing the following word. Thus, to address the sequential dependency, this approach maps each sentence to a CUDA thread block and, like the previous approach, maps each dimension of the vector to a thread. Different thread blocks thus process sentences in parallel, while within a sentence the thread block loops over the words to update the vectors serially. This approach may still lead to some race conditions, but since different sentences do not have a lot of words in common, it does not cause many convergence problems in practice.
Since the text corpus can be too large to reside in GPU memory, data is streamed from disk to the GPU, and several sentences are batched to amortize the cost of data transfer between the CPU and GPU. To settle the trade-off between accuracy and throughput, the optimal number of thread blocks, i.e., sentences processed in parallel, is chosen empirically. Processing more sentences concurrently increases the throughput but results in an accuracy drop, as the probability of different thread blocks updating the same word vector increases.
Several other optimizations are used to make this approach more efficient. If the kernel execution time is less than the data transfer time, then the GPU will sit idle waiting for the next batch of sentences. To avoid this, we overlap data transfer and kernel execution on the GPU by using multiple CPU threads and CUDA streams, which allow data to be transferred to the GPU while it is busy running kernels, thus leaving no scope for idle GPU time. We use multiple CPU threads to read data from disk and prepare the next batch of sentences, which is concurrently transferred to the GPU using multiple CUDA streams.
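
A minimal sketch of this overlap (illustrative only; train_batch, the buffer sizes and the launch configuration are placeholders, not the paper's API): two CUDA streams ping-pong so that the copy for one batch runs while the kernel for the other executes.

    #include <cuda_runtime.h>

    // Placeholder standing in for the per-sentence training kernel.
    __global__ void train_batch(const int* batch, float* model) { /* ... */ }

    void stream_training(int num_batches, size_t batch_bytes, float* d_model) {
        int* h_batch[2]; int* d_batch[2]; cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost(&h_batch[i], batch_bytes);  // pinned memory enables async copies
            cudaMalloc(&d_batch[i], batch_bytes);
            cudaStreamCreate(&stream[i]);
        }
        for (int b = 0; b < num_batches; ++b) {
            int s = b % 2;                             // ping-pong between the two streams
            // A real implementation would wait on stream[s] before refilling
            // h_batch[s] from disk (the paper uses CPU threads for this part).
            cudaMemcpyAsync(d_batch[s], h_batch[s], batch_bytes,
                            cudaMemcpyHostToDevice, stream[s]);
            train_batch<<<64, 128, 0, stream[s]>>>(d_batch[s], d_model);
        }
        cudaDeviceSynchronize();
    }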
We use this approach for all our experiments described in Section
5.
4.3 Distributed training on Multiple GPUs
Scaling BlazingText on a multi-GPU distributed system is critical because training on some of the biggest datasets in the industry, which can be of the order of several terabytes [14], can still take a couple of days on a single GPU. To scale out BlazingText, we explore different paradigms of distributed training: model parallelism and data parallelism. Since the vector dimensions are usually small, the dot products are not that big, and thus distributing different dimensions of the vectors will not provide much performance gain. Hence, we use data parallelism, where we divide the dataset equally into $N$ shards when using $N$ GPUs. The model parameters, i.e., the input and output vectors for all the words in the vocabulary, are replicated on each GPU; each device then independently processes the data partition it owns and updates its local model, periodically synchronizing the local model with all other $N - 1$ GPUs.
The main issue to be addressed in data parallelism is efficient model synchronization between the GPUs. This is handled efficiently by using NVIDIA's NCCL library [15], which provides an AllReduce method that handles the peer-to-peer data transfer between the GPUs in an optimized way based on the topology of the GPU network. If GPUs are not connected via the same PCIe switch or the same IOH chip, then data is transferred through the host CPU. As the model parameters' size can be of the order of hundreds of MBs, synchronization can be costly, and thus it is very important to determine the right synchronization interval. More frequent synchronization will result in better convergence but will slow down training, and vice-versa. For simplicity, we choose to synchronize after every epoch and leave the exploration of more efficient synchronization schemes for future work.
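
The per-epoch step then reduces to one collective per parameter buffer. A sketch of what this looks like with NCCL (our own illustration; averaging the replicas is an assumption we make here, since the paper does not specify how the local models are combined):

    #include <nccl.h>
    #include <cuda_runtime.h>

    // Scales every parameter by 1/num_gpus after the sum-reduce, yielding
    // the average of the replicas.
    __global__ void scale(float* p, size_t n, float inv) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] *= inv;
    }

    // Called once per device/rank at the end of an epoch.
    void sync_model(float* d_params, size_t count, int num_gpus,
                    ncclComm_t comm, cudaStream_t stream) {
        // In-place AllReduce: every GPU ends up with the element-wise sum.
        ncclAllReduce(d_params, d_params, count, ncclFloat, ncclSum, comm, stream);
        int blocks = (int)((count + 255) / 256);
        scale<<<blocks, 256, 0, stream>>>(d_params, count, 1.0f / num_gpus);
        cudaStreamSynchronize(stream);
    }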
5 EXPERIMENTS
We optimize BlazingText with the techniques described above for single-GPU and multi-GPU systems. In all our experiments, we map each sentence to a thread block. In this section, we report the throughput (in million words/sec) and the accuracy of the learned embeddings on standard word similarity and word analogy test sets. We benchmark BlazingText against the FastText CPU implementation (without subword embeddings).
Hardware: All our experiments are performed on an AWS p2.8xlarge GPU instance, which has 8 NVIDIA K80 GPUs and an Intel Xeon CPU E5-2686 v4 @ 2.30GHz with 16 cores (32 threads).
Software: BlazingText has been written in C++ using the CUDA compiler NVCC v8.0.
Training corpora: We train our models on two different corpora: (1) the Text8 dataset [16] of 17 million words from Wikipedia that is widely used for word embedding demos, and (2) the One Billion Words benchmark dataset [17].
Test sets: The learned embeddings are evaluated on word similarity and word analogy tasks. For word similarity, we use WS-353 [18], which is one of the most popular test datasets used for this purpose. It contains word pairs together with human-assigned similarity judgments. The word representations are evaluated by ranking the pairs according to their cosine similarities and measuring Spearman's rank correlation coefficient with the human judgments. For word analogy, we use the Google analogy dataset [1], which contains word analogy questions. A question is correctly answered only if the algorithm selects the word that is exactly the same as the correct word in the question.
Hyperparameters: For all our experiments, we report the results using both CBOW and Skipgram algorithms (with negative sampling) and FastText's default parameter settings (dim = 100, window size = 5, sampling threshold = 1e-4, initial learning rate = 0.05). We use 20 epochs for the Text8 dataset and 10 epochs for the One Billion Words benchmark.
5.1 Throughput
Figures 1 and 2 show the throughput, measured in million words/sec, of BlazingText on GPUs and FastText on CPUs, scaling across multiple GPUs and CPU cores, for Skipgram and CBOW respectively. When scaling to multiple GPUs, our implementation achieves near-linear speedup. Using 8 GPUs, Skipgram achieves 13.2 million words/sec while CBOW delivers about 42.5 million words/sec, which is more than a 3x speedup over 32-threaded FastText. As evident from Table 1, scaling across multiple GPUs has minimal effect on accuracy, thus highlighting the effectiveness of our implementation and its efficient use of the multi-GPU architecture. As there is a trade-off between throughput and accuracy, we can further increase the throughput by lowering the synchronization frequency, as long as the drop in accuracy is acceptable.

Figure 1: Skipgram throughput

Figure 2: CBOW throughput
5.2 Accuracy
We evaluate the models trained with the FastText CPU implementation and with our implementation (using a varying number of GPUs), and report their predictive performance on the word similarity and word analogy tasks in Table 1. To make sure that our implementation generalizes well, we used two different corpora for training the models: Text8 and the 1B words benchmark.
To nullify the effects of randomness due to GPU parallelism, we run each experiment multiple times (n=10) and report the mean results. As can be seen from Table 1, BlazingText on multiple GPUs shows performance similar to FastText on CPUs. When using up to 4 GPUs, the performance is almost identical, and in some cases even better than the CPU version.
Table 1: Spearman's rank correlation coefficient between model scores and human judgement on the WS-353 dataset for word similarity. For the word analogy task, we report the accuracies on the Google analogy dataset.

                                                          # GPUs (BlazingText)                        # CPUs (FastText)
Training Corpus     Task        Model       1     2     3     4     5     6     7     8         8     16    32
Text8               Similarity  Skipgram    .716  .713  .710  .707  .703  .699  .695  .694      .707  .706  .700
                                CBOW        .711  .708  .705  .689  .683  .681  .679  .675      .694  .689  .690
                    Analogy     Skipgram    .327  .324  .321  .311  .299  .297  .296  .295      .329  .330  .326
                                CBOW        .321  .329  .320  .299  .295  .289  .285  .281      .323  .326  .325
1 Billion word      Similarity  Skipgram    .659  .658  .656  .653  .655  .659  .651  .650      .660  .659  .656
benchmark                       CBOW        .609  .607  .599  .598  .601  .604  .598  .597      .610  .607  .608
                    Analogy     Skipgram    .301  .305  .299  .299  .298  .297  .295  .289      .300  .302  .301
                                CBOW        .299  .296  .295  .290  .289  .289  .287  .288      .311  .314  .312
As expected, the predictive performance decreases as more GPUs are used, but stays within acceptable limits. For similarity-based evaluation, using 8 GPUs gives up to just 2% worse accuracy, but more than a 3x speedup over 32 CPU threads.
For analogy-based evaluation, the differences between multiple GPUs and CPUs are more conspicuous. In our experiments, our primary focus was on scalability, and thus we did not increase the synchronization frequency when scaling across more GPUs. Increasing the synchronization frequency can maintain comparable accuracy, but will take a toll on scalability, leading to sub-linear scaling. However, depending on the end application, one can decide the trade-off and tune the frequency accordingly.
6 CONCLUSION
In this work, we present BlazingText: a high-performance distributed word2vec implementation that leverages the massive parallelism provided by modern GPUs. We carefully exploit GPU parallelism to strike the right balance between throughput and accuracy. The proposed implementation achieves near-linear scalability across multiple GPUs and can process tens of millions of words per second while maintaining accuracy comparable to open-source CPU-based implementations like FastText. Our experiments on different datasets demonstrate good predictive performance and the generalization of our techniques.
As for future work, we plan to improve our model synchronization strategy and learning rate scheduling. As the vectors associated with more frequent words are updated more often, we want to match the model synchronization frequency to word frequency, syncing the vectors of common words more frequently by using a sub-model synchronization scheme. For learning rate scheduling, we want to explore techniques like Adam [19] to improve the convergence rate when scaling across multiple GPUs.
REFERENCES
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS), 2013.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
[3] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In HLT-NAACL, 2016.
[4] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[5] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011.
[6] Simon Pavlik. Gensim word2vec on CPU faster than word2vec Keras on GPU. https://rare-technologies.com/gensim-word2vec-on-cpu-faster-than-word2veckeras-on-gpu-incubator-student-blog/, 2016. Accessed: 2017-08-01.
[7] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[8] cuBLAS: Dense linear algebra on GPUs. https://developer.nvidia.com/cublas, 2015. Accessed: 2017-08-01.
[9] Spark MLlib word2vec. https://spark.apache.org/docs/latest/mllib-feature-extraction.html, 2016. Accessed: 2017-08-01.
[10] Deeplearning4j: Introduction to word2vec. https://deeplearning4j.org/word2vec, 2015. Accessed: 2017-08-01.
[11] Shihao Ji, Nadathur Satish, Sheng Li, and Pradeep Dubey. Parallelizing word2vec in multi-core and many-core architectures. Nov 2016.
[12] Word2vec using TensorFlow. https://github.com/tensorflow/models/tree/master/tutorials/embedding, 2015. Accessed: 2017-08-01.
[13] Word2vec-keras-in-gensim. https://github.com/niitsuma/word2vec-keras-in-gensim, 2016. Accessed: 2017-08-01.
[14] ICWSM 2011 Spinn3r dataset. http://www.icwsm.org/2011/data.php. Accessed: 2017-08-01.
[15] Fast multi-GPU collectives with NCCL. https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl/, 2017. Accessed: 2017-08-01.
[16] Text8 dataset. http://mattmahoney.net/dc/text8.zip. Accessed: 2017-08-01.
[17] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013.
[18] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pages 406–414, New York, NY, USA, 2001. ACM.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based an adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also ap- propriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.