Software Framework for Topic Modelling with Large Corpora
Radim Řehůřek and Petr Sojka
Natural Language Processing Laboratory
Masaryk University, Faculty of Informatics
Botanická 68a, Brno, Czech Republic
{xrehurek,sojka}@fi.muni.cz
Abstract
Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector
Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their
scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document
streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement
several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes
them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design,
so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the
usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.
1. Introduction
“Controlling complexity is the essence of computer programming.”
Brian Kernighan (Kernighan and Plauger, 1976)
The Vector Space Model (VSM) is a proven and powerful
paradigm in NLP, in which documents are represented as
vectors in a high-dimensional space. The idea of representing text documents as vectors dates back to the early 1970s and the SMART system (Salton et al., 1975). The original
concept has since then been criticised, revised and improved
on by a multitude of authors (Wong and Raghavan, 1984;
Deerwester et al., 1990; Papadimitriou et al., 2000) and
became a research field of its own. These efforts seek to ex-
ploit both explicit and implicit document structure to answer
queries about document similarity and textual relatedness.
Connected to this goal is the field of topical modelling (see
e.g. (Steyvers and Griffiths, 2007) for a recent review of
this field). The idea behind topical modelling is that texts
in natural languages can be expressed in terms of a limited
number of underlying concepts (or topics), a process which
both improves efficiency (new representation takes up less
space) and eliminates noise (transformation into topics can
be viewed as noise reduction). A topical search for related
documents is orthogonal to the more well-known “fulltext”
search, which would match particular words, possibly com-
bined through boolean operators.
Research on topical models has recently picked up pace,
especially in the field of generative topic models such as La-
tent Dirichlet Allocation (Blei et al., 2003), their hierarchical
extensions (Teh et al., 2006), topic quality assessment and
visualisation (Chang et al., 2009; Blei and Lafferty, 2009).
In fact, it is our observation that the research has rather got-
ten ahead of applications—the interested public is only just
catching up with Latent Semantic Analysis, a method which
is now more than 20 years old (Deerwester et al., 1990). We
attribute reasons for this gap between research and practice
partly to inherent mathematical complexity of the inference
algorithms, partly to high computational demands of most
methods and partly to the lack of a “sandbox” environment,
which would enable practitioners to apply the methods to
their particular problem on real data, in an easy and hassle-
free manner. The research community has recognised these
challenges and a lot of work has been done in the area of
accessible NLP toolkits in the past couple of years; our con-
tribution here is one such step in the direction of closing the
gap between academia and ready-to-use software packages.¹
Existing Systems
The goal of this paper is somewhat orthogonal to much of
the previous work in this area. As an example of another
possible direction of applied research, we cite (Elsayed et
al., 2008). While their work focuses on how to compute
pair-wise document similarities from individual document
representations in a scalable way, using Apache Hadoop
and clusters of computers, our work here is concerned with
how to scalably compute these document representations
in the first place. Although both steps are necessary for a
complete document similarity pipeline, the scope of this
paper is limited to constructing topical representations, not
answering similarity queries.
There exist several mature toolkits which deal with Vec-
tor Space Modelling. These include NLTK (Bird and
Loper, 2004), Apache’s UIMA and ClearTK (Ogren et al.,
2008), Weka (Frank et al., 2005), OpenNLP (Baldridge
et al., 2002), Mallet (McCallum, 2002), MDP (Zito et al.,
2008), Nieme (Maes, 2009), Gate (Cunningham, 2002), Orange (Demšar et al., 2004) and many others.
These packages generally do a very good job at their in-
tended purpose; however, from our point of view, they also
suffer from one or more of the following shortcomings:
No topical modelling. Packages commonly offer supervised learning functionality (i.e. classification); topic inference is an unsupervised task.
Models do not scale. The package requires that the whole corpus be present in memory before the inference of topics takes place, usually in the form of a sparse term-document matrix.
Target domain not NLP/IR. The package was created with physics, neuroscience, image processing, etc. in mind. This is reflected in the choice of terminology as well as emphasis on different parts of the processing pipeline.
The Grand Unified Framework. The package covers a broad range of algorithms, approaches and use case scenarios, resulting in complex interfaces and dependencies. From the user's perspective, this is very desirable and convenient. From the developer's perspective, this is often a nightmare—tracking code logic requires major effort and interface modifications quickly cascade into a large set of changes.

¹ Interest in the field of document similarity can also be seen from the significant number of requests for a VSM software package which periodically crop up in various NLP mailing lists. Other indicators of interest are tutorials aimed at business applications; see web search results for “SEO myths and LSI” for an interesting treatment of Latent Semantic Indexing marketing.
In fact, we suspect that the last point is also the reason why
there are so many packages in the first place. For a developer
(as opposed to a user), the entry level learning curve is so
steep that it is often simpler to “roll your own” package
rather than delve into intricacies of an existing, proven one.
2. System Design
“Write programs that do one thing and do it well. Write programs
to work together. Write programs to handle text streams, because
that is a universal interface.”
Doug McIlroy (McIlroy et al., 1978)
Our choices in designing the proposed framework are a
reflection of these perceived shortcomings. They can be
explicitly summarised as follows:
Corpus size independence. We want the package to be able to detect topics based on corpora which are larger than the available RAM, in accordance with the current trends in NLP (see e.g. (Kilgarriff and Grefenstette, 2003)).
Intuitive API. We wish to minimise the number of method names and interfaces that need to be memorised in order to use the package. The terminology is NLP-centric.
Easy deployment. The package should work out-of-the-box on all major platforms, even without root privileges and without any system-wide installations.
Cover popular algorithms. We seek to provide novel, scalable implementations of algorithms such as TF-IDF, Latent Semantic Analysis, Random Projections or Latent Dirichlet Allocation.
We chose Python as the programming language, mainly be-
cause of its straightforward, compact syntax, multiplatform
nature and ease of deployment. Python is also suitable for
handling strings and boasts a fast, high quality library for
numerical computing, numpy, which we use extensively.
Core interfaces
As mentioned earlier, the core concept of our framework is
document streaming. A corpus is represented as a sequence
of documents and at no point is there a need for the whole
corpus to be stored in memory. This feature is not an after-
thought on lazy evaluation, but rather a core requirement
for our application and as such reflected in the package
philosophy. To ensure transparent ease of use, we define
corpus to be any iterable returning documents:
>>> for document in corpus:
...     pass
In turn, a document is a sparse vector representation of its constituent fields (such as terms or topics), again realised as a simple iterable:²
>>> for fieldId, fieldValue in document:
...     pass
This is a deceptively simple interface; while a corpus is
allowed to be something as simple as
>>> corpus = [[(1, 0.8), (8, 0.6)]]
this streaming interface also subsumes loading/storing matri-
ces from/to disk (e.g. in the Matrix Market (Boisvert et al.,
1996) or SVMlight (Joachims, 1999) format), and allows for
constructing more complex real-world IR scenarios, as we
will show later. Note the lack of package-specific keywords,
required method names, base class inheritance etc. This is
in accordance with our main selling points: ease of use and
data scalability.
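For instance, a corpus too large to fit in RAM can be streamed straight from disk. The following is a minimal sketch of ours (the class name, the file path and the one-document-per-line format are purely illustrative, not part of the framework): documents are parsed and yielded lazily, so memory usage stays constant regardless of corpus size:
>>> class MyStreamedCorpus:
...     """One document per line on disk, as whitespace-separated fieldId:weight pairs."""
...     def __init__(self, path):
...         self.path = path
...     def __iter__(self):
...         with open(self.path) as infile:
...             for line in infile:
...                 # parse and yield one sparse document at a time; the file
...                 # is never loaded into memory as a whole
...                 yield [(int(i), float(w)) for i, w in
...                        (pair.split(":") for pair in line.split())]
>>> corpus = MyStreamedCorpus("/tmp/corpus.txt")  # hypothetical path
Such an object can be passed anywhere the framework expects a corpus, for example to the models shown below.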
Needless to say, both corpora and documents are not re-
stricted to these interfaces; in addition to supporting itera-
tion, they may (and usually do) contain additional methods
and attributes, such as internal document ids, means of visu-
alisation, document class tags and whatever else is needed
for a particular application.
The second core interface is transformations. Where a corpus represents data, a transformation represents the process of translating documents from one vector space into another (such as from a TF-IDF space into an LSA space). Realisation in Python is through the dictionary [ ] mapping notation and is again quite intuitive:
>>> from gensim.models import LsiModel
>>> lsi = LsiModel(corpus, numTopics = 2)
>>> lsi[new_document]
[(0, 0.197), (1, -0.056)]
>>> from gensim.models import LdaModel
>>> lda = LdaModel(corpus, numTopics = 2)
>>> lda[new_document]
[(0, 1.0)]
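Because transformed documents are again plain sparse vectors, transformations compose naturally. The following sketch is ours (we assume a TfidfModel class following the same conventions as the models above, since TF-IDF is listed among the implemented methods); it maps documents first into TF-IDF space and then into the LSA space, without ever materialising the intermediate corpus:
>>> from gensim.models import TfidfModel, LsiModel
>>> tfidf = TfidfModel(corpus)  # term counts -> TF-IDF space
>>> class TfidfCorpus:  # re-iterable lazy wrapper over the streamed corpus
...     def __iter__(self):
...         for doc in corpus:
...             yield tfidf[doc]
>>> lsi = LsiModel(TfidfCorpus(), numTopics=2)  # TF-IDF space -> LSA space
>>> lsi[tfidf[new_document]]  # a list of (topicId, weight) pairs, as before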
² In terms of the underlying VSM, which is essentially a sparse field-document matrix, this interface effectively abstracts away from both the number of documents and the number of fields. We note, however, that the abstraction focus is on the number of documents, not fields. The number of terms and/or topics is usually carefully chosen, with unwanted token types removed via document frequency thresholds and stoplists. The hypothetical use case of introducing new fields in a streaming fashion does not come up as often in NLP.
2.1. Novel Implementations
While an intuitive interface is important for software adop-
tion, it is of course rather trivial and useless in itself. We
have therefore implemented some of the popular VSM meth-
ods, two of which we will describe here in greater detail.
Latent Semantic Analysis, LSA.
Developed in the late 1980s at Bell Laboratories (Deerwester et al., 1990), this method
gained popularity due to its solid theoretical background
and efficient inference of topics. The method exploits co-
occurrence between terms to project documents into a low-
dimensional space. Inference is done using linear algebra
routines for truncated Singular Value Decomposition (SVD)
on the sparse term-document matrix, which is usually first
weighted by some TF-IDF scheme. Once the SVD has been
completed, it can be used to project new documents into the
latent space, in a process called folding-in.
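To make the folding-in step concrete, here is a small numpy sketch of our own (an illustration of the underlying linear algebra, not the framework's code): given the truncated left singular vectors U (terms × topics) and the singular values s, a new sparse document is projected into the latent space by computing U transposed times the document vector and rescaling by the inverse singular values:
import numpy as np

def fold_in(document, U, s):
    # document: sparse list of (termId, weight) pairs
    # U: (num_terms, num_topics) left singular vectors; s: (num_topics,) singular values
    dense = np.zeros(U.shape[0])
    for term_id, weight in document:
        dense[term_id] = weight
    return (U.T @ dense) / s  # coordinates of the document in the latent space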
Since linear algebra routines have always been the front
runner of numerical computing (see e.g. (Press et al., 1992)),
some highly optimised packages for sparse SVD exist. For
example, PROPACK and SVDPACK are both based on the
Lanczos algorithm with smart reorthogonalizations, and
both are written in FORTRAN (the latter also has a C-
language port called SVDLIBC). Lightning fast as they are,
adapting the FORTRAN code is rather tricky once we hit the
memory limit for representing sparse matrices directly in
memory. For this and other reasons, research has gradually
turned to incremental algorithms for computing SVD, in
which the matrix is presented sequentially—an approach
equivalent to our document streaming. This problem refor-
mulation is not trivial and only recently have there appeared
practical algorithms for incremental SVD.
Within our framework, we have implemented Gorrell’s
Generalised Hebbian Algorithm (Gorrell, 2006), a stochas-
tic method for incremental SVD. However, this algorithm
proved much too slow in practice and we also found its inter-
nal parameters hard to tune, resulting in convergence issues.
We have therefore also implemented Brand’s algorithm for
fast incremental SVD updates (Brand, 2006). This algorithm
is much faster and contains no internal parameters to tune.³
To the best of our knowledge, our pure Python (numpy) im-
plementation is the only publicly available implementation
of LSA that does not require the term-document matrix to
be stored in memory and is therefore independent of the
corpus size.⁴ Together with our straightforward document
streaming interface, this in itself is a powerful addition to
the set of publicly available NLP tools.
³ This algorithm actually comes from the field of image processing rather than NLP. Singular Value Decomposition, which is at the heart of LSA, is a universal data compression/noise reduction technique and has been successfully applied to many application domains.
⁴ This includes completely ignoring the right singular vectors during SVD computations, as the left vectors together with singular values are enough to determine the latent space projection for new documents.

Latent Dirichlet Allocation, LDA.
LDA is another topic modelling technique based on the bag-of-words paradigm and word-document counts (Blei et al., 2003). Unlike Latent Semantic Analysis, LDA is a fully generative model, where documents are assumed to have been generated according to a per-document topic distribution (with a Dirichlet prior) and a per-topic word distribution. In practice, the goal is of course not generating random documents through these distributions, but rather inferring the distributions from observed documents. This can be accomplished by variational Bayes approximations (Blei et al., 2003) or by Gibbs sampling (Griffiths and Steyvers, 2004). Both of these approaches are incremental in spirit, so that our implementation (again, in pure Python with numpy, and again the only one of its kind that we know of) “only” had to abstract away from the original notations and implicit corpus-size allocations to be made truly memory independent. Once the distributions have been obtained, it is possible to assign topics to new, unseen documents through our transformation interface.
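As a toy illustration of this generative story (a sketch of the model itself with made-up sizes, unrelated to the framework's inference code), a document is produced by drawing a topic mixture from the Dirichlet prior and then drawing a topic and a word for every token position:
import numpy as np

rng = np.random.default_rng(0)
num_topics, vocab_size, doc_length = 2, 10, 20
alpha = np.full(num_topics, 0.1)  # Dirichlet prior over per-document topic mixtures
word_dists = rng.dirichlet(np.ones(vocab_size), size=num_topics)  # per-topic word distributions

theta = rng.dirichlet(alpha)  # this document's topic mixture
topics = rng.choice(num_topics, size=doc_length, p=theta)  # one topic per token position
words = [rng.choice(vocab_size, p=word_dists[z]) for z in topics]  # words drawn from those topics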
2.2. Deployment
The framework is heavily documented and is available from http://nlp.fi.muni.cz/projekty/gensim/. This website contains sections which describe the framework and provide usage tutorials, as well as installation instructions.
The framework is open sourced and distributed under an
OSI-approved LGPL license.
3. Application of the Framework
“An idea that is developed and put into action is more important
than an idea that exists only as an idea.”
Hindu Prince Gautama Siddharta, the founder of Buddhism,
563–483 B.C.
3.1. Motivation
Many digital libraries today are starting to offer browsing features based on pairwise document content similarity. For collections of hundreds of thousands of documents, computation of similarity scores is a challenge (Elsayed et al., 2008). We
have faced this task during the project of The Digital Mathe-
matics Library DML-CZ (Sojka, 2009). The emphasis was
not on developing new IR methods for this task, although
some modifications were obviously necessary—such as an-
swering the question of what constitutes a “token”, which
differs between mathematics and the more common English
ASCII texts.
With the collection’s growth and a steady feed of new papers,
lack of scalability appeared to be the main issue. This drove
us to develop our new document similarity framework.
3.2. Data
As of today, the corpus contains over 61,293 fulltext documents for a total of about 270 million tokens. There are mathematical papers from the Czech Digital Mathematics Library DML-CZ, http://dml.cz (22,991 papers), from the NUMDAM repository, http://numdam.org (17,636 papers), and from the math part of arXiv, http://arxiv.org/archive/math (20,666 papers). After filtering out word types that either appear fewer than five times in the corpus (mostly OCR errors) or in more than one half of the documents (stop words), we are left with 315,167 distinct word types. Although this is by no means an exceptionally big corpus, it already prohibits storing the sparse term-document matrices in main memory, ruling out most available VSM software systems.
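A minimal sketch of this kind of pruning (a helper of our own, not the framework's API), applied to an iterable of tokenised documents:
from collections import defaultdict

def prune_vocabulary(tokenized_docs, min_count=5, max_doc_fraction=0.5):
    term_freq = defaultdict(int)  # total number of occurrences per word type
    doc_freq = defaultdict(int)   # number of documents containing the word type
    num_docs = 0
    for doc in tokenized_docs:
        num_docs += 1
        for token in doc:
            term_freq[token] += 1
        for token in set(doc):
            doc_freq[token] += 1
    # keep word types occurring at least min_count times overall and
    # in at most max_doc_fraction of all documents
    return {t for t in term_freq
            if term_freq[t] >= min_count and doc_freq[t] <= max_doc_fraction * num_docs}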
3.3. Results
We have tried several VSM approaches to representing doc-
uments as vectors: term weighting by TF-IDF, Latent Se-
mantic Analysis, Random Projections and Latent Dirichlet
Allocation. In all cases, we used the cosine measure to
assess document similarity.
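For completeness, a small sketch of the cosine measure over the sparse document vectors used throughout the framework (again an illustration of ours, not the code used for the experiments):
import math

def cosine_similarity(doc1, doc2):
    # doc1, doc2: sparse documents as lists of (fieldId, weight) pairs
    d1, d2 = dict(doc1), dict(doc2)
    dot = sum(weight * d2.get(field_id, 0.0) for field_id, weight in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0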
When evaluating data scalability, one of our two main design
goals (together with ease of use), we note that memory usage is
now dominated by the transformation models themselves.
These in turn depend on the vocabulary size and the number
of topics (but not on the training corpus size). With 315,167
word types and 200 latent topics, both LSA and LDA models
take up about 480 MB of RAM.
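As a rough sanity check of this figure (assuming, on our part, that each model stores a dense 200-topic projection over the full vocabulary in double precision): 315,167 word types × 200 topics × 8 bytes ≈ 504 million bytes, i.e. roughly 480 MB in binary megabytes, which matches the observed footprint.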
Although evaluation of the quality of the obtained similari-
ties is not the subject of this paper, it is of course of utmost
practical importance. Here we note that it is notoriously
hard to evaluate the quality, as even the preferences among different types of similarity are subjective (match of main topic, or subdomain, or specific wording/plagiarism) and depend on the motivation of the reader. For this reason, we have decided to present all the computed similarities to our library users at once; see e.g. http://dml.cz/handle/10338.dmlcz/100785/SimilarArticles. At the
present time, we are gathering feedback from mathemati-
cians on these results and it is worth noting that the frame-
work proposed in this paper makes such side-by-side com-
parison of methods straightforward and feasible.
4. Conclusion
We believe that our framework makes an important step in
the direction of current trends in Natural Language Process-
ing and fills a practical gap in existing software systems. We
have argued that the common practice, where each novel
topical algorithm gets implemented from scratch (often in-
venting, unfortunately, yet another I/O format for its data in
the process) is undesirable. We have analysed the reasons
for this practice and hypothesised that this is partly due to the
steep API learning curve of existing IR frameworks.
Our framework makes a conscious effort to make parsing,
processing and transforming corpora into vector spaces as
intuitive as possible. It is platform independent and requires
no compilation or installations past Python+numpy. As an
added bonus, the package provides ready implementations of
some of the popular IR algorithms, such as Latent Semantic
Analysis and Latent Dirichlet Allocation. These are novel,
pure-Python implementations that make use of modern state-
of-the-art iterative algorithms. This enables them to work
over practically unlimited corpora, which no longer need to
fit in RAM.
We believe this package is useful to topic modelling experts
in implementing new algorithms as well as to the general
NLP community, who are eager to try out these algorithms but often find the task of translating the original implementations (not to say the original articles!) to their needs quite daunting.
Future work will include comparison of the usefulness of
different topical models to the users of our Digital Math-
ematical Library, as well as further improving the range,
efficiency and scalability of popular topic modelling meth-
ods.
Acknowledgments
We acknowledge the support of grant MUNI/E/0084/2009 of
the Rector of Masaryk University program for PhD students’
research. Partial support of grants by EU #250503 CIP-ICT-
PSP EuDML and by the Ministry of Education of CR within
the Centre of basic research LC536 is acknowledged, too.
We would also like to thank the anonymous reviewer for pro-
viding us with additional pointers and valuable comments.
5. References
J. Baldridge, T. Morton, and G. Bierner. 2002. The
OpenNLP maximum entropy package. Technical report.
http://maxent.sourceforge.net/.
Steven Bird and Edward Loper. 2004. NLTK: The Natural
Language Toolkit. Proceedings of the ACL demonstration
session, pages 214–217.
David M. Blei and John D. Lafferty. 2009. Visualizing
Topics with Multi-Word Expressions. Arxiv preprint
http://arxiv.org/abs/0907.1013.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003.
Latent Dirichlet Allocation. The Journal of Machine
Learning Research, 3:993–1022.
R. F. Boisvert, R. Pozo, and K.A. Remington. 1996. The
matrix market formats: Initial design. Technical report,
Applied and Computational Mathematics Division, NIST.
Matthew Brand. 2006. Fast low-rank modifications of the
thin singular value decomposition. Linear Algebra and its
Applications, 415(1):20–30, May. http://dx.doi.org/10.1016/j.laa.2005.07.021.
Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean
Gerrish, and David M. Blei. 2009. Reading Tea Leaves:
How Humans Interpret Topic Models. volume 31, Van-
couver, British Columbia, CA.
Hamish Cunningham. 2002. GATE, a General Architecture
for Text Engineering. Computers and the Humanities,
36(2):223–254. http://gate.ac.uk/.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer,
and R. Harshman. 1990. Indexing by Latent Semantic
Analysis. Journal of the American society for Information
science, 41(6):391–407.
J. Demšar, B. Zupan, G. Leban, and T. Curk. 2004. Orange:
From experimental machine learning to interactive data
mining. White Paper, Faculty of Computer and Informa-
tion Science, University of Ljubljana.
Tamer Elsayed, Jimmy Lin, and Douglas W. Oard. 2008.
Pairwise Document Similarity in Large Collections with
MapReduce. In HLT ’08: Proceedings of the 46th Annual
Meeting of the Association for Computational Linguis-
tics on Human Language Technologies, pages 265–268,
Morristown, NJ, USA. Association for Computational
Linguistics.
E. Frank, M. A. Hall, G. Holmes, R. Kirkby, B. Pfahringer,
and I. H. Witten. 2005. Weka: A machine learning work-
bench for data mining. Data Mining and Knowledge Dis-
covery Handbook: A Complete Guide for Practitioners
and Researchers, pages 1305–1314.
G. Gorrell. 2006. Generalized Hebbian algorithm for in-
cremental Singular Value Decomposition in Natural Lan-
guage Processing. In Proceedings of 11th Conference of
the European Chapter of the Association for Computa-
tional Linguistics (EACL), Trento, Italy, pages 97–104.
T. L. Griffiths and M. Steyvers. 2004. Finding scientific
topics. Proceedings of the National Academy of Sciences,
101(Suppl 1):5228.
Thorsten Joachims. 1999. SVMLight: Support Vector
Machine. http://svmlight.joachims.org/, University of Dortmund.
Brian W. Kernighan and P. J. Plauger. 1976. Software Tools.
Addison-Wesley Professional.
Adam Kilgarriff and Gregory Grefenstette. 2003. Introduc-
tion to the Special Issue on the Web as Corpus. Computa-
tional Linguistics, 29(3):333–347.
Francis Maes. 2009. Nieme: Large-Scale Energy-Based
Models. The Journal of Machine Learning Research,
10:743–746. http://jmlr.csail.mit.edu/papers/volume10/maes09a/maes09a.pdf.
A. K. McCallum. 2002. MALLET: A Machine Learning
for Language Toolkit. http://mallet.cs.umass.edu.
M. D. McIlroy, E. N. Pinson, and B. A. Tague. 1978. UNIX
Time-Sharing System: Foreword. The Bell System Techni-
cal Journal, 57(6 (part 2)), July/August.
P. V. Ogren, P. G. Wetzler, and S. J. Bethard. 2008. ClearTK:
A UIMA toolkit for statistical natural language process-
ing. Towards Enhanced Interoperability for Large HLT
Systems: UIMA for NLP, page 32.
C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vem-
pala. 2000. Latent semantic indexing: A probabilistic
analysis. Journal of Computer and System Sciences,
61(2):217–235.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P.
Flannery. 1992. Numerical recipes in C. Cambridge
Univ. Press, Cambridge MA, USA.
Gerard Salton, A. Wong, and C. S. Yang. 1975. A vector
space model for automatic indexing. Communications of
the ACM, 18(11):620.
Petr Sojka. 2009. An Experience with Building Digital
Open Access Repository DML-CZ. In Proceedings of
CASLIN 2009, Institutional Online Repositories and Open
Access, 16th International Seminar, pages 74–78, Teplá Monastery, Czech Republic. University of West Bohemia, Pilsen, CZ.
Mark Steyvers and Tom Griffiths, 2007. Probabilistic Topic
Models, pages 427–446. Psychology Press, February.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal,
and David M. Blei. 2006. Hierarchical Dirichlet Pro-
cesses. Journal of the American Statistical Association,
101(476):1566–1581.
S. K. M. Wong and V. V. Raghavan. 1984. Vector space
model of information retrieval: a reevaluation. In Pro-
ceedings of the 7th annual international ACM SIGIR
conference on Research and development in information
retrieval, pages 167–185. British Computer Society, Swin-
ton, UK.
T. Zito, N. Wilbert, L. Wiskott, and P. Berkes. 2008. Mod-
ular toolkit for Data Processing (MDP): a Python data
processing framework. Frontiers in Neuroinformatics, 2.
http://mdp-toolkit.sourceforge.net/.