Article

Advances in Pre-Training Distributed Word Representations

Authors: Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin

Abstract

Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.


... In the field of natural language processing (NLP), the use of pre-trained embeddings has become a widely adopted strategy for improving performance across a range of tasks [27]. Word embeddings are dense vector representations of words that capture semantic and syntactic relationships and are typically trained on large text corpora using unsupervised learning algorithms such as GloVe, FastText, and Word2Vec [27]. These pre-trained models capture contextual information and produce embeddings that are more expressive and robust than static word embeddings. ...
... Many studies have used LSTMs and shown that they outperform conventional methods. LSTMs handle long data sequences better than plain RNNs because LSTM cells contain nodes with self-recurrent connections [29] [27]. The LSTM computation involves three key components: the forget gate, the input gate, and the output gate. Figure 2 shows the LSTM architecture. ...
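For concreteness, the setup these excerpts describe (frozen pre-trained FastText vectors feeding a Keras LSTM sentiment classifier) can be sketched as follows; this is a minimal illustration rather than the cited study's code, and the vector file, toy word index, and hyperparameters are assumptions.

```python
# Hedged sketch: pre-trained FastText vectors wired into a Keras LSTM classifier.
# File name, the toy word index, and all hyperparameters are illustrative.
import numpy as np
import tensorflow as tf
from gensim.models import KeyedVectors

word_index = {"produk": 1, "bagus": 2, "kecewa": 3}   # hypothetical tokenizer output
dim = 300

# Load pre-trained vectors (e.g. the Indonesian cc.id.300.vec release) and copy
# them into an embedding matrix aligned with the tokenizer's indices.
kv = KeyedVectors.load_word2vec_format("cc.id.300.vec")
emb = np.zeros((len(word_index) + 1, dim))
for word, idx in word_index.items():
    if word in kv:
        emb[idx] = kv[word]

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        len(word_index) + 1, dim,
        embeddings_initializer=tf.keras.initializers.Constant(emb),
        trainable=False),                            # keep pre-trained vectors frozen
    tf.keras.layers.LSTM(128),                       # forget/input/output gates live here
    tf.keras.layers.Dense(3, activation="softmax"),  # negative / neutral / positive
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```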
Article
Full-text available
The product review dataset is a rapidly growing and interesting source of data to explore. The increase in the number of internet users and customer shopping habits through online stores has a significant impact on the growth of product review data, especially for online stores in Indonesia, such as Tokopedia. The sample used in this study comprised 1,079 reviews. This research aims to evaluate the performance of three types of pre-trained word embeddings, namely FastText, GloVe, and Word2Vec, in the Long Short-Term Memory (LSTM) model for sentiment classification of product reviews on Tokopedia. An automated sentiment classification system is needed to process many product reviews, making it easier for sellers to know what consumers think of their products. This research contributes by evaluating the impact of various pre-trained word embeddings on the performance of LSTM models in sentiment classification tasks. In addition, this research also aims to measure the effectiveness of LSTM models combined with multiple pre-trained word embeddings. By implementing a deep learning architecture, computers can learn and recognize contextual data stored in review sentences. The research was conducted in three stages: model selection, layer setup, and hyperparameter optimization, featuring in-depth testing of the deep learning architecture used and the appropriate combination of layers and parameters to obtain high sentiment classification performance. The experimental results show that FastText with LSTM provides the best performance with 85.08% accuracy, followed by Word2Vec with 84.62% accuracy and GloVe with 83.04% accuracy. The main contribution of this research is to present an in-depth test of the product review dataset and provide a deep learning architecture along with a combination of layers and parameters that has the best performance in recognizing sentiment on the product review dataset. This architecture achieves higher performance than the BERT method with CNN and BiLSTM layers.
... The model was fine-tuned using 200,000 pairs of personality items from several open datasets and is publicly available on Hugging Face (dwulff/mpnet-personality). We provide a detailed description and systematic comparisons with other embeddings in the Supplementary Information [15, 20, 22-24]. Second, we obtained scale embeddings by averaging the embeddings for the items corresponding to each of the 459 scales in IPIP (Fig. 2b). ...
... Most concern the constructs 'organization' (35, 10.3%), 'humour/playfulness' (24, 7.1%), 'toughness' (23, 6.8%), 'industriousness/perseverance/persistence' (22, 6.5%), 'friendliness' (20, 5.9%) ...
... Specifically, we fine-tuned MPNet [20], a lightweight sentence transformer model based on the BERT architecture. We then compared this to other transformer models, including mixedbread (mxbai-embed-large-v1) [38], OpenAI's latest model (text-embedding-3-large) [39], and models relying on other architectures [15, 24]. The results for these models and associated comparisons are reported in the Supplementary Information. ...
Article
Full-text available
Taxonomic incommensurability denotes the difficulty in comparing scientific theories due to different uses of concepts and operationalizations. To tackle this problem in psychology, here we use language models to obtain semantic embeddings representing psychometric items, scales and construct labels in a vector space. This approach allows us to analyse different datasets (for example, the International Personality Item Pool) spanning thousands of items and hundreds of scales and constructs and show that embeddings can be used to predict empirical relations between measures, automatically detect taxonomic fallacies and suggest more parsimonious taxonomies. These findings suggest that semantic embeddings constitute a powerful tool for tackling taxonomic incommensurability in the psychological sciences.
... In addition to answering the research questions, we compare BERTweet.BR model's predictive performance to a fastText-based classifier [31] as a baseline result. Experiments showed that our model consistently outperforms mBERT, BERTimbau, XLM-R, and XLM-T in most cases and the static word embeddings from fastText [31] in all the experiments. In brief, we highlight the main contributions of this work: ...
... We used the LR implementation from scikit-learn [38]. Static embeddings from fastText [31] are obtained for whole sentences after normalization. Given a tweet, we get a single vector representation with the get_sentence_vector method on top of the 300-dimensional Portuguese pre-trained word vectors shared by the fastText team. ...
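A minimal sketch of this kind of fastText baseline, assuming the official fasttext Python package and the public Portuguese model file cc.pt.300.bin; the example tweets and classifier settings are illustrative.

```python
# Sketch of the fastText baseline described above: one 300-d vector per tweet
# via get_sentence_vector, fed into a logistic regression classifier.
import fasttext
from sklearn.linear_model import LogisticRegression

ft = fasttext.load_model("cc.pt.300.bin")   # pre-trained Portuguese vectors

def embed(tweets):
    # fastText expects a single line of text per call (no newlines).
    return [ft.get_sentence_vector(t.replace("\n", " ")) for t in tweets]

X_train = embed(["que produto ótimo", "péssimo atendimento"])   # toy examples
y_train = [1, 0]
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.predict(embed(["gostei muito"])))
```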
Article
Full-text available
Recent advancements in neural language models have been primarily centered around English, with limited focus on the more than seven thousand other languages. This includes Portuguese, which, despite being the sixth most spoken language globally, has markedly fewer neural-based linguistic resources available when compared to English. Notably, Portuguese speakers compose one of the most active groups of Twitter users; however, no pre-trained language model in Portuguese tweets is extensively studied in the literature. Besides the language, tweets-based pre-trained models must account for the cultural code, informal linguistic style, code-switching, and the limited number of characters. This manuscript tackles this gap by introducing BERTweet.BR, the first public large-scale pre-trained model specific to the Brazilian Portuguese tweets domain. BERTweet.BR has the same architecture of BERTweetbase, a BERT-based English-tweets model, and was trained from scratch following the RoBERTa pre-training procedure on a 100-M Portuguese tweets corpus. On the sentiment analysis task, experiments show that BERTweet.BR outperforms three multilingual Transformers and BERTimbau, a monolingual general-domain Brazilian Portuguese language model. We release our model in the transformers library aiming at promoting future research in analytical tasks for Portuguese tweets. The BERTweet.BR code, experimental results, and related documentation are publicly available on Github.
... We did this separately for complete scene descriptions as well as for the similarity of used nouns, verbs or adjectives. To derive a quantitative estimate of semantic similarity, we determined the cosine similarity of description embeddings using SentenceBERT [28] for complete descriptions, and the Gensim library [29] and a pretrained fastText model [30, 31] for pairs of words (see methods for details). We then averaged the resulting difference values across scenes for each pair of observers to compute Description Dissimilarity Matrices (DDMs). ...
... The assignment was subsequently checked manually, and isolated assignment errors were corrected. Moreover, we estimated the semantic similarity of all possible word pairs of the respective scene descriptions by using the Gensim library [29] and the pretrained fastText model [30], which is also available for German [31], and calculating similarity values using cosine similarity (ranging from -1 to 1; the higher the value, the higher the contextual similarity of the words). Finally, we identified the individual words that a subject used to describe a scene and compared the semantic similarity between the words used by this subject and their respective comparison partner. ...
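As a rough sketch of this word-pair computation (not the authors' code), Gensim can load a pre-trained fastText release and return cosine similarities directly; the German model file and the word pairs below are assumptions.

```python
# Hedged sketch: cosine similarity between word pairs with Gensim and the
# pre-trained German fastText vectors (file name and word pairs assumed).
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("cc.de.300.bin")

for a, b in [("Hund", "Katze"), ("Hund", "Auto")]:
    # similarity() returns cosine similarity in [-1, 1];
    # higher values indicate higher contextual similarity.
    print(a, b, round(float(wv.similarity(a, b)), 3))
```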
Article
Full-text available
Do different people looking at the same scene perceive individual versions of what’s in front of them? If perception is individual, which mechanisms mediate our particular view of the world? Recent findings have shown systematic observer differences in gaze, but it is unclear whether individual fixation biases translate to divergent impressions of the same scene. Here, we find systematic differences in the scene descriptions individual observers provide for identical complex scenes. Crucially, observer differences in fixation patterns predicted pairwise differences in scene descriptions, particularly the use of nouns, even for out-of-sample images. Part of this could be explained by the individual tendency to fixate text and people predicting corresponding description references. Our results strongly suggest that subjective scene perception is shaped by individual gaze.
... In the first step, we generated a common ground-truth network based on a pre-trained fastText word embedding model [27]. The fastText embedding has been trained on 600 billion tokens of the Common Crawl and provides word embedding vectors for 2 million words [26]. (2) Simulating behavioral data, varying the study design parameters: cue set type, cue set size, number of responses, and response type. ...
... have been shown to accurately predict human behavior [26]. From the fastText embedding, we extracted vectors of 13,486 words that are cues or frequent responses (n ≥ 30) in the English Small World of Words (SWOW) free association study data [17]. ...
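The general idea of deriving a ground-truth network from embedding vectors might look like the sketch below; the cue list, vocabulary limit, similarity threshold, and vector file are assumptions rather than the authors' settings.

```python
# Illustrative sketch (not the authors' exact pipeline): derive a word-word
# similarity network from pre-trained fastText vectors for a list of cue words.
import numpy as np
from gensim.models import KeyedVectors

cues = ["dog", "cat", "car", "road", "music"]          # stand-in for SWOW cues
kv = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec", limit=200_000)

vecs = np.stack([kv[w] for w in cues if w in kv])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
sim = vecs @ vecs.T                                    # cosine similarity matrix

adj = (sim > 0.4).astype(int)                          # threshold into a network
np.fill_diagonal(adj, 0)
print(adj)
```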
Preprint
Full-text available
Accurately capturing individual differences in semantic networks is fundamental to advancing our mechanistic understanding of semantic memory. Past empirical attempts to construct individual-level semantic networks from behavioral paradigms may be limited by data constraints. To assess these limitations and propose improved designs for the measurement of individual semantic networks, we conducted a recovery simulation investigating the psychometric properties underlying estimates of individual semantic networks obtained from two different behavioral paradigms: free associations and relatedness judgment tasks. Our results show that successful inference of semantic networks is achievable, but they also highlight critical challenges. Estimates of absolute network characteristics are severely biased, such that comparisons between behavioral paradigms and different design configurations are often not meaningful. However, comparisons within a given paradigm and design configuration can be accurate and generalizable when based on designs with moderate numbers of cues, moderate numbers of responses, and cue sets including diverse words. Ultimately, our results provide insights that help evaluate past findings on the structure of semantic networks and design new studies capable of more reliably revealing individual differences in semantic networks.
... To overcome the above limitations, in this paper, we propose DistilLog, a simple yet effective log-based anomaly detection method for resource-constrained scenarios. DistilLog utilizes a pre-trained word2vec model [22] to represent log event templates as semantic vectors, incorporated with the PCA dimensionality reduction algorithm to minimize computational and storage requirements. Then, Knowledge Distillation is applied to optimize the size and running time of a Gated Recurrent Unit (GRU) model while maintaining high detection accuracy for deployment in resource-constrained scenarios. ...
... 1) Word Vectorization: After preprocessing, DistilLog transforms each log template into vector features. To this end, we adopt a word2vec [22] model, namely FastText [40]. Specifically, we first remove all variable marks in a log template (i.e., the "< * >" token). ...
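A rough illustration of this vectorization step (not DistilLog's implementation): strip the placeholder tokens, average pre-trained word vectors per template, and reduce the result with PCA. The vector file, example templates, and component count are assumptions.

```python
# Sketch: log templates -> averaged fastText word vectors -> PCA reduction.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

kv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", limit=100_000)

templates = [
    "Receiving block <*> src <*> dest <*>",   # "<*>" marks masked variables
    "Deleting block <*> file <*>",
]

def template_vector(t):
    words = [w for w in t.replace("<*>", " ").split() if w in kv]
    return np.mean([kv[w] for w in words], axis=0) if words else np.zeros(kv.vector_size)

X = np.stack([template_vector(t) for t in templates])
X_small = PCA(n_components=2).fit_transform(X)   # tiny n_components for the demo
print(X_small.shape)
```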
... While these techniques can improve model robustness, especially for simpler classification tasks, their random nature may distort semantic nuance [42]. More sophisticated augmentation involves embeddings like Word2Vec, GloVe, or transformer-based vectors, enabling semantically aware replacements that preserve context and meaning [43][44][45]. ...
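Embedding-based augmentation of the kind mentioned here can be sketched as nearest-neighbour replacement in a pre-trained vector space; the model choice, replacement probability, and example sentence are assumptions.

```python
# Hedged sketch: replace some tokens with near neighbours in an embedding space.
import random
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

def augment(tokens, p=0.3):
    out = []
    for tok in tokens:
        if tok in wv and random.random() < p:
            # pick one of the closest words as a semantically aware replacement
            out.append(random.choice([w for w, _ in wv.most_similar(tok, topn=5)]))
        else:
            out.append(tok)
    return out

print(augment(["the", "movie", "was", "wonderful"]))
```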
Article
Full-text available
Emotion classification in natural language processing (NLP) has recently witnessed significant advancements. However, class imbalance in emotion datasets remains a critical challenge, as dominant emotion categories tend to overshadow less frequent ones, leading to biased model predictions. Traditional techniques, such as undersampling and oversampling, offer partial solutions. More recently, synthetic data generation using large language models (LLMs) has emerged as a promising strategy for augmenting minority classes and improving model robustness. In this study, we investigate the impact of synthetic data augmentation on German-language emotion classification. Using an imbalanced dataset, we systematically evaluate multiple balancing strategies, including undersampling overrepresented classes and generating synthetic data for underrepresented emotions using a GPT-4–based model in a few-shot prompting setting. Beyond enhancing model performance, we conduct a detailed linguistic analysis of the synthetic samples, examining their lexical diversity, syntactic structures, and semantic coherence to determine their contribution to overall model generalization. Our results demonstrate that integrating synthetic data significantly improves classification performance, particularly for minority emotion categories, while maintaining overall model stability. However, our linguistic evaluation reveals that synthetic examples exhibit reduced lexical diversity and simplified syntactic structures, which may introduce limitations in certain real-world applications. These findings highlight both the potential and the challenges of synthetic data augmentation in emotion classification. By providing a comprehensive evaluation of balancing techniques and the linguistic properties of generated text, this study contributes to the ongoing discourse on improving NLP models for underrepresented linguistic phenomena.
... Figure 22 shows the performance of encoding already tokenized abstracts from DBLP [103], capped at a maximum of 1,000 tokens. All plots show the total execution time of 10 repetitions of performing word embedding with word2vec [70] embeddings trained on Wikipedia. The first row represents word embedding only, while the second row adds a fully-connected neural network layer with ReLU activation on the embedded outputs. ...
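What is being benchmarked can be pictured with a small PyTorch sketch: an embedding lookup alone versus the same lookup followed by a fully-connected ReLU layer. The vocabulary size, dimensions, and random token batch are illustrative assumptions.

```python
# Minimal sketch of the two benchmarked configurations described above.
import torch
import torch.nn as nn

vocab_size, dim, hidden = 50_000, 300, 128
tokens = torch.randint(0, vocab_size, (8, 1000))   # batch of abstracts, capped at 1,000 tokens

embedding = nn.Embedding(vocab_size, dim)          # stand-in for pre-trained word2vec weights
dense = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())

with torch.no_grad():
    e = embedding(tokens)          # configuration 1: embedding only
    h = dense(e)                   # configuration 2: adds the FC + ReLU layer
print(e.shape, h.shape)
```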
Preprint
Full-text available
Data-centric ML pipelines extend traditional machine learning (ML) pipelines -- of feature transformations and ML model training -- by outer loops for data cleaning, augmentation, and feature engineering to create high-quality input data. Existing lossless matrix compression applies lightweight compression schemes to numeric matrices and performs linear algebra operations such as matrix-vector multiplications directly on the compressed representation but struggles to efficiently rediscover structural data redundancy. Compressed operations are effective at fitting data in available memory, reducing I/O across the storage-memory-cache hierarchy, and improving instruction parallelism. The applied data cleaning, augmentation, and feature transformations provide a rich source of information about data characteristics such as distinct items, column sparsity, and column correlations. In this paper, we introduce BWARE -- an extension of AWARE for workload-aware lossless matrix compression -- that pushes compression through feature transformations and engineering to leverage information about structural transformations. Besides compressed feature transformations, we introduce a novel technique for lightweight morphing of a compressed representation into workload-optimized compressed representations without decompression. BWARE shows substantial end-to-end runtime improvements, reducing the execution time for training data-centric ML pipelines from days to hours.
... The feature maps were then fed into an SVM classifier to produce the final prediction. The proposed model utilized the FastText [43] word embeddings for word representation. ...
Article
Full-text available
This work introduces a novel deep learning method for Arabic sentiment analysis, arguing that reading the entire input sequence is not always necessary. Many texts can be accurately classified without processing all input tokens. The method employs a reinforcement learning agent that selects relevant tokens using a selection policy network. Instead of predicting sentiment polarity from the entire input, the model focuses only on tokens chosen by the policy network. To empirically evaluate the proposed method, experiments were carried out on three Arabic sentiment analysis datasets: Large Arabic Book Reviews (LABR), Hotels Arabic Reviews Data (HARD), and Arabic Sentiment Tweets Dataset (ASTD). The results demonstrate a significant improvement in Arabic sentiment classification with the selective reading method, achieving state-of-the-art accuracy while using only a fraction of the tokens. However, the approach introduces additional computational cost due to the reinforcement learning component, and its scalability to larger datasets might require further optimization.
... FastText is another NLP method developed by Facebook AI Research that efficiently learns word and sentence representations. Unlike Word2Vec, FastText considers character n-grams, making it especially effective for morphologically rich languages and for handling out-of-vocabulary words [30][31][32]. Similar to Word2Vec, FastText can use either the SG or CBOW architecture but differs in its use of sub-word embeddings. ...
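A minimal Gensim sketch of this distinction, with toy sentences and hyperparameters chosen only for illustration: min_n and max_n control the character n-gram range, and sg switches between skip-gram and CBOW.

```python
# Toy sketch: FastText builds vectors from character n-grams, so even an
# out-of-vocabulary word receives an embedding. All values are illustrative.
from gensim.models import FastText

sentences = [["satellites", "report", "anomalies"],
             ["operators", "classify", "failures"]]
model = FastText(sentences, vector_size=100, window=5, min_count=1,
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 min_n=3, max_n=6)  # character n-gram range for subwords

# "anomaly" never appears in the corpus, but its n-grams overlap "anomalies":
print(model.wv["anomaly"][:5])
```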
Article
Full-text available
Satellite reliability is critical to ensuring uninterrupted operations in aerospace systems, where anomalies can lead to mission failures and significant economic losses. Existing anomaly classification methods often lack scalability, interpretability, and adaptability to diverse datasets. This study introduces the Trade-Space Exploration Machine Learning (TSE-ML) framework, a comprehensive pipeline for satellite anomaly classification that optimizes preprocessing, transformation, normalization, and machine learning stages. Leveraging a Seradata dataset spanning 66 years and 4,455 satellite records, the framework systematically evaluates four data cleaning methods, four data transformation techniques, five normalization strategies, and seven machine learning algorithms across 480 configurations. The optimal configuration, comprising Iterative Imputation, FastText, Robust Scaling, and Decision Tree, achieved the highest testing accuracy of 95.74% with competitive computational efficiency. The Decision Tree model delivered superior accuracy and provided interpretability, revealing critical factors influencing satellite anomalies, such as Age Since Launch, Design Life, and Orbit Category. Stratified 5-fold cross-validation ensured robustness and generalizability of the results. The TSE-ML framework’s transparency and high performance enable actionable insights for improving satellite design, operational planning, and anomaly mitigation. Future research will focus on real-time anomaly detection, integrating satellite telemetry data, and extending the framework to other space applications. This study establishes a robust, interpretable foundation for advancing anomaly classification in aerospace engineering, addressing the dual challenges of reliability and operational efficiency.
... In addition, originality -measured by the use of specific and unusual vocabulary -and redundancy in discourse -measured by the repetition of common words -were assessed using specific dictionaries. Part of Speech (POS) tagging determined the occurrence of adjectives and verb tense, while semantic analysis benefited from word representation techniques such as Word2Vec [46] and Global Vectors for Word Representation (GloVe) [47]. Finally, sentiment analysis techniques measured the polarity of speech by classifying it as positive, negative, or neutral. ...
Preprint
Full-text available
In the context of the rapid dissemination of multimedia content, identifying disinformation on social media platforms such as TikTok represents a significant challenge. This study introduces a hybrid framework that combines the computational power of deep learning with the interpretability of fuzzy logic to detect suspected disinformation in TikTok videos. The methodology is comprised of two core components: a multimodal feature analyser that extracts and evaluates data from text, audio, and video; and a multimodal disinformation detector based on fuzzy logic. These systems operate in conjunction to evaluate the suspicion of spreading disinformation, drawing on human behavioural cues such as body language, speech patterns, and text coherence. Two experiments were conducted: one focusing on context-specific disinformation and the other on the scalability of the model across broader topics. For each video evaluated, high-quality, comprehensive, well-structured reports are generated, providing a detailed view of the disinformation behaviours.
... The maximum number of words in each row of the dataset is 19, and the embedding dimension used is 300. This study uses pre-trained FastText word embeddings [25]. FastText is commonly used for sentence classification and word representation tasks and is more efficient and faster than Word2Vec and GloVe. ...
Article
Data labeling is a critical aspect of sentiment analysis that requires assigning labels to text data to reflect the sentiment expressed. Traditional methods of data labeling involve manual annotation by human annotators, which can be both time-consuming and costly when handling large volumes of text data. Automation of the data labeling process can be achieved through the utilization of lexicon resources, which consist of pre-labeled dictionaries or databases of words and phrases with sentiment information. The contribution of this study is an evaluation of the performance of lexicon resources in document labeling. The evaluation aims to provide insight into the accuracy of using lexicon resources and inform future research. In this study, a publicly available dataset was utilized and labeled as negative, neutral, and positive. To generate new labels, lexicon resources such as VADER, AFINN, SentiWordNet, and Liu & Hu were employed. An LSTM model was then trained using the newly generated labels. The performance of the trained model was evaluated by testing it on data that had been manually labeled. The study found that manual labeling led to the highest accuracy: 0.79, 0.80, and 0.80 for training, validation, and testing respectively. This is likely due to the manual creation of test data labels, enabling the model to learn and capture balanced patterns. Models using the VADER and AFINN lexicon resources had lower accuracies of 0.54 and 0.56, SentiWordNet reached 0.49, and the Liu & Hu model had the lowest testing score of 0.26. Our research indicates that lexicon resources alone are not sufficient for sentiment data labeling, as they depend on pre-defined dictionaries and may not fully capture the context of words within a sentence; manual labeling is therefore necessary to complement lexicon-based methods and achieve better results.
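For concreteness, lexicon-based labeling of the kind evaluated in this study can be sketched with NLTK's VADER implementation; the thresholds and example sentences are assumptions.

```python
# Illustrative sketch of lexicon-based labeling with VADER (thresholds assumed).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def lexicon_label(text, pos=0.05, neg=-0.05):
    score = sia.polarity_scores(text)["compound"]
    return "positive" if score >= pos else "negative" if score <= neg else "neutral"

print(lexicon_label("Great product, fast delivery"))
print(lexicon_label("Terrible quality, very disappointed"))
```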
... These large language models (LLMs) have found increasing applications in educational and psychometric research, especially for analyzing psychological scales and identifying redundancies (e.g., Bezirhan and von Davier 2023; Ma et al. 2024; Urban et al. 2024). At the core of these approaches is the use of text embeddings, which map words, phrases, or sentences into high-dimensional vectors to capture semantic relationships (Bojanowski et al. 2017; Mikolov et al. 2018; Pennington et al. 2014; Petukhova et al. 2024). This has enabled researchers to assess conceptual overlaps by comparing the semantic structures of scale items, revealing redundancies that align with theoretical models of non-overlapping concepts (Hernandez and Nie 2022; Rosenbusch et al. 2020; Arnulf et al. 2024; Wulff and Mata 2024, preprint). ...
Article
Full-text available
As psychological research progresses, the issue of concept overlap becomes increasingly evident, adding to participant burden and complicating data interpretation. This study introduces an Embedding-based Semantic Analysis Approach (ESAA) for detecting redundancy in psychological concepts, which are operationalized through their respective scales, using natural language processing techniques. The ESAA utilizes OpenAI's text-embedding-3-large model to generate high-dimensional semantic vectors (i.e., embeddings) of scale items and applies hierarchical clustering to group semantically similar items, revealing potential redundancy. Three preliminary experiments evaluated the ESAA's ability to (1) identify semantically similar items, (2) differentiate semantically distinct items, and (3) uncover overlap between scales of concepts known for redundancy issues. Additionally, comparative analyses assessed the ESAA's robustness and incremental validity against advanced chatbots based on GPT-4. The results demonstrated that the ESAA consistently produced stable outcomes and outperformed all evaluated chatbots. As an objective approach for analyzing relationships between concepts operationalized as scales, the ESAA holds promise for advancing research on theory refinement and scale optimization.
... The log HAL frequency for prime words was 8.70 (SD = 1.94), and for the target words was 10.50 (SD = 1.28). To assess the semantic relatedness between prime and target words, we used the Gensim Python library [25] to access the pre-trained word embedding model "fasttext-wiki-news-subwords-300" [26] and compute cosine similarity. This model represents words as dense vectors in a 300-dimensional space, capturing semantic relationships based on their co-occurrence patterns in a large text dataset. ...
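A minimal sketch of this relatedness computation using Gensim's downloader API and the same pre-trained model; the word pairs are illustrative.

```python
# Sketch of the prime-target relatedness computation described above.
import gensim.downloader as api

wv = api.load("fasttext-wiki-news-subwords-300")   # ~1 GB download on first use

pairs = [("doctor", "nurse"), ("doctor", "carrot")]
for prime, target in pairs:
    print(prime, target, round(float(wv.similarity(prime, target)), 3))
```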
Article
Full-text available
Consumer-grade EEG devices, such as the InteraXon Muse 2 headband, present a promising opportunity to enhance the accessibility and inclusivity of neuroscience research. However, their effectiveness in capturing language-related ERP components, such as the N400, remains underexplored. This study thus aimed to investigate the feasibility of using the Muse 2 to measure the N400 effect in a semantic relatedness judgment task. Thirty-seven participants evaluated the semantic relatedness of word pairs while their EEG was recorded using the Muse 2. Single-trial ERPs were analyzed using robust Yuen t-tests and hierarchical linear modeling (HLM) to assess the N400 difference between semantically related and unrelated target words. ERP analyses indicated a significantly larger N400 effect in response to unrelated word pairs over the right frontal electrode. Additionally, dependability estimates suggested acceptable internal consistency for the N400 data. Overall, these findings illustrate the capability of the Muse 2 to reliably measure the N400 effect, reinforcing its potential as a valuable tool for language research. This study highlights the potential of affordable, wearable EEG technology to expand access to brain research by offering an affordable and portable way to study language and cognition in diverse populations and settings.
... Static embeddings (Mikolov et al. 2013; Pennington et al. 2014; Mikolov et al. 2018) are obtained based on the co-occurrence of adjacent words and have a fixed representation for the words, not taking into account their context. ...
Article
Full-text available
Emotion Recognition in Conversations (ERC) is a key step towards successful human–machine interaction. While the field has seen tremendous advancement in the last few years, new applications and implementation scenarios present novel challenges and opportunities. These range from leveraging the conversational context, speaker, and emotion dynamics modelling, to interpreting common sense expressions, informal language, and sarcasm, addressing challenges of real-time ERC, recognizing emotion causes, different taxonomies across datasets, multilingual ERC, and interpretability. This survey starts by introducing ERC, elaborating on the challenges and opportunities of this task. It proceeds with a description of the emotion taxonomies and a variety of ERC benchmark datasets employing such taxonomies. This is followed by descriptions comparing the most prominent works in ERC with explanations of the neural architectures employed. Then, it provides advisable ERC practices towards better frameworks, elaborating on methods to deal with subjectivity in annotations and modelling and methods to deal with the typically unbalanced ERC datasets. Finally, it presents systematic review tables comparing several works regarding the methods used and their performance. Benchmarking these works highlights resorting to pre-trained Transformer Language Models to extract utterance representations, using Gated and Graph Neural Networks to model the interactions between these utterances, and leveraging Generative Large Language Models to tackle ERC within a generative framework. This survey emphasizes the advantage of leveraging techniques to address unbalanced data, the exploration of mixed emotions, and the benefits of incorporating annotation subjectivity in the learning phase.
Conference Paper
In the realm of recommendation systems, achieving real-time performance in embedding similarity tasks is often hindered by the limitations of traditional Top-K sparse matrix-vector multiplication (SpMV) methods, which suffer from high latency due to inefficient memory access patterns. This paper identifies these critical gaps and introduces AccelES, a novel approach that significantly enhances the efficiency of Top-K SpMV. Our method employs a two-stage calculation scheme: the first stage utilizes a compact, low-bit dataset to quickly identify the most relevant entries, while the second stage performs full-precision calculations solely on this pruned subset, thereby minimizing computational overhead. Furthermore, AccelES incorporates innovative matrix representations, Ultra-CSR and Random-CSR, which optimize memory bandwidth utilization. Experimental results demonstrate that AccelES accelerates performance, surpassing state-of-the-art FPGA, GPU, and CPU solutions by factors of 3.4×, 2.5×, and 153.3×, respectively, under controlled conditions. These advancements not only enhance processing speed but also significantly improve real-time performance in recommendation systems, establishing AccelES as a pivotal contribution to the field of Top-K sparse matrix-vector multiplication.
Article
Full-text available
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and understanding the factors that can modify it. This analysis helps in early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive literature that bridges the gap between both fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI paradigms, namely, classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents word embedding and language model applications across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance of 39 word embedding and 67 language model based predictive pipelines, as well as top-performing traditional sequence encoding-based predictors and their performance across 44 DNA sequence analysis tasks.
Conference Paper
Full-text available
This work presents a proposal for adapting Large Language Models (LLMs) to the unsupervised task of Topic Modeling (TM). Our proposal consists of three stages: document summarization, characterization of topics, and definition of topics. We instantiated our proposal with two LLMs, one open-source (Llama3) and the other proprietary (GPT 3.5), comparing them with four state-of-the-art (SOTA) strategies in TM. Our results demonstrated that the approach is very promising, having been able to define topics as coherent as SOTA strategies but still with room for improvement in terms of organizational structure.
Article
The internet is inundated with information, and information retrieval systems often fail to provide optimal results to the user. To meet this challenge, query expansion techniques have emerged as a game-changer, significantly improving information retrieval results. Recently, semantic query expansion techniques have attracted increased interest among researchers because they offer more pertinent and practical results, allowing users to retrieve more meaningful and useful information from the web. Currently, few research works provide a comprehensive review of semantic query expansion, and those that do rarely give a full view of recent advances, diverse data applications, and practical challenges. A deeper review is therefore needed to explain these advances and give researchers concrete insights for future development. This article presents a comprehensive review of query expansion methods, with particular emphasis on semantic approaches. It overviews recent frameworks developed between 2015 and 2024 and reviews the limitations of each approach. Further, it discusses challenges inherent in the semantic query expansion field and identifies future research directions. The article argues that the linguistic approach is the most effective and flexible direction for researchers to follow, while the ontology approach better suits domain-specific search applications; developments in the ontology field may thus open new perspectives for semantic query expansion. Moreover, by employing artificial intelligence (AI) and making the most of the query context without relying on user intervention, improvements toward the optimal expanded query can be achieved.
Chapter
The notion of ‘construction’ is the cornerstone upon which linguistic theories examine the intricate system of form-meaning pairings that underpin our everyday communicative endeavors.
Article
Full-text available
The exponential growth of textual data has heightened the importance of efficient text classification, a fundamental natural language processing task that assigns predefined categories to documents. This task can follow flat classification, where categories are equally structured, or hierarchical classification, which organizes categories in multi-level structures and presents additional complexities. While extensive research has advanced text classification for English, studies on Arabic text classification remain limited, particularly in hierarchical contexts. The unique features of Arabic, such as its rich morphology, diverse dialects, and syntactic complexity, pose significant challenges. This survey provides a comprehensive review of Arabic text classification by examining data sources, preprocessing steps, and feature extraction techniques, ranging from traditional methods like Bag of Words and TF-IDF to modern approaches such as neural embeddings (e.g., Word2Vec) and transformer-based models like BERT. Additionally, it explores classification techniques, from machine learning algorithms (e.g., SVM, Random Forest) to deep learning models (e.g., CNN, RNN, LSTM, GPT), and evaluates performance through metrics such as precision, recall, and F1-score. This survey aims to guide future research and innovation in Arabic text classification by addressing current advancements and challenges.
Article
Recently, many studies have created adversarial samples to enrich the diversity of training data and improve text classification performance by reducing the loss incurred in neural network training. However, existing studies have focused solely on adding perturbations to the input, such as text sentences and embedded representations, resulting in adversarial samples that are very similar to the original ones. Such adversarial samples cannot significantly improve the diversity of training data, which restricts the potential for improved classification performance. To alleviate this problem, in this paper, we extend the diversity of generated adversarial samples based on the fact that adding different perturbations at different layers of a neural network has different effects. We propose a novel neural network with a perturbation strategy (PTNet), which generates adversarial samples by adding perturbations to the intrinsic representation of each hidden layer of the neural network. Specifically, we design two different ways to perturb each hidden layer: (1) directly adding a fixed-threshold perturbation; (2) adding the perturbation via adversarial training. Through the above settings, we can obtain more perturbed intrinsic representations of hidden layers and use them as new adversarial samples, thus improving the diversity of the augmented training data. We validate the effectiveness of our approach on six text classification datasets and demonstrate that it improves the classification ability of the model. In particular, classification accuracy improved by an average of 1.79% on the sentiment analysis task and by 3.2% on the question classification task compared to the BERT baseline.
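As an illustration only (not the PTNet implementation), perturbing a hidden representation in the two described ways could look like the following PyTorch sketch; the layer sizes, perturbation magnitude, and labels are assumptions.

```python
# Illustrative sketch: perturb a hidden-layer representation either with
# fixed-magnitude noise or with a gradient-based (adversarial-style) offset.
import torch
import torch.nn as nn

hidden = torch.randn(4, 128, requires_grad=True)   # stand-in hidden representation
head = nn.Linear(128, 2)
loss = nn.functional.cross_entropy(head(hidden), torch.tensor([0, 1, 0, 1]))

# (1) fixed-threshold perturbation
noisy = hidden + 0.01 * torch.randn_like(hidden)

# (2) adversarial-training-style perturbation along the loss gradient
grad = torch.autograd.grad(loss, hidden)[0]
adversarial = hidden + 0.01 * grad.sign()
print(noisy.shape, adversarial.shape)
```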
Article
Full-text available
Both -ity and -ness are frequent and productive suffixes in English that fulfill the same core function: turning adjectives into nouns that denote the state or quality of whatever the adjective denotes. This well-known affix rivalry raises two core questions: 1. What determines the choice between -ity and -ness for a given base word? 2. Are the two affixes synonymous? For the first question, previous work has focused on morphological and phonological properties of the bases, but not their semantics. For question 2, the literature fails to give a convincing answer, with some studies, faced with doublets like ethnicity/ethnicness, arguing for a semantic difference, but most assuming synonymy. Using pretrained distributional vectors, I show empirically first that the semantics of the bases plays a major role in affix selection and second that the two affixes induce similar meaning shifts.
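One way to probe this with pre-trained distributional vectors, sketched under assumptions rather than reproducing the paper's setup, is to compare the offset vectors the two suffixes induce.

```python
# Hedged illustration: compare the meaning shift induced by -ity vs. -ness as
# offset vectors in a pre-trained embedding space (model and word pairs assumed).
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # small pre-trained space for the demo

def shift(base, derived):
    v = wv[derived] - wv[base]
    return v / np.linalg.norm(v)

ity = shift("scarce", "scarcity")
ness = shift("aware", "awareness")
# Cosine similarity between the two shift directions: high values would
# suggest the affixes induce similar semantic shifts.
print(float(ity @ ness))
```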
Article
Microblogging platforms have been increasingly used by the public in crisis situations, enabling more participatory crisis communication between the official response channels and the affected community. However, the sheer volume of crisis-related messages on social media can make it challenging for officials to find pertinent information and understand the public’s perception of evolving risks. To address this issue, crisis informatics researchers have proposed a variety of technological solutions, but there has been limited examination of the cognitive and perceptual processes and subsequent responses of the affected population. Yet, this information is critical for the crisis response officials to gauge public’s understanding of the event, their perception of event-related risk, and perception of incident response and recovery efforts, in turn enabling the officials to craft crisis communication messaging more effectively. Taking cues from the Protective Action Decision Model, we conceptualize a metric —resonance+ — that prioritizes the cognitive and perceptual processes of the affected population, quantifying shifts in collective attention and information exposure for each tweet. Based on resonance+, we develop a principled, scalable pipeline that recommends content relating to people’s cognitive and perceptual processes. Our results suggest that resonance+ is generalizable across different types of natural hazards. We have also demonstrated its applicability for near-real time scenarios. According to the feedback from the target users, the local public information officers (PIOs) in emergency management, the messages recommended by our pipeline are useful in their tasks of understanding public perception and finding hopeful narratives, potentially leading to more effective crisis communications.
Chapter
As data sets grow in size, determining the content of different types of shared text continues to be a difficult task. The task is even more difficult when the documents are short and noisy. Examples include social media posts, product reviews, clinician notes, and open-ended survey responses. This chapter discusses the emergence of a new class of topic models, topic-noise models. Topic-noise models are generative and treat topics and noise as separate distributions over words. In other words, the model assumes that noise exists and it must be modeled as well. The chapter begins by defining and presenting different topic-noise algorithms, both unsupervised and semi-supervised. We then present the similarities and differences between topic-noise models and well-known topic models like LDA, highlighting when each model performs best. Finally, we consider how these models can be advanced by employing large language models and other auxiliary data sources. In an era when noise can no longer be ignored, topic-noise models offer an important alternative to traditional topic models. They may eventually be the next step in the evolution of topic models.
Article
Customer journey mapping (CJM) is a product and service design method that is widely used by design researchers and practitioners. It tracks the customer's or user's interactions with products and services during experiences and maps out significant changes in those experiences. While CJM has been praised for its ability to understand customer experiences from their perspectives (i.e., empathy building), it suffers from limitations such as limited sample sizes and cognitive biases, mostly due to its reliance on traditional ethnographic methods such as interviews, observations, and surveys. To address these issues, this paper presents an approach to performing CJM with mobile applications (apps) and quantitatively analyzing the gathered data. By dividing CJM into data collection and analytics stages, challenges in each stage are tackled separately. In the data collection stage, a custom-built mobile app is designed to gather information on customer experiences. Then, a two-step data analysis approach is developed to gain insights into those experiences. To demonstrate the approach's effectiveness in tracking customer experiences, it was applied to the errand experiences of students during the pandemic, and the results were compared with those from a parallel study using traditional CJM approaches. The findings demonstrated the feasibility of the proposed approach in performing CJM and showed that additional insights can be obtained through the proposed approach. This work provides a more effective and objective CJM approach for designers to understand their customers, thereby contributing toward the broader effort to develop reliable empathy-building design methods.
Article
Full-text available
Self-admitted technical debts (SATDs) refer to solutions in software development that select suboptimal implementations to meet current requirements and are intentionally introduced and documented by developers. SATDs in issue-tracking systems are a complement to those within source code comments. The effective identification of SATDs is crucial for software quality assurance and maintenance. Current studies focus on whether issue sections contain debt, but overlook specific SATD types. Meanwhile, they lack solutions for the challenge that SATD features are hard to learn due to the scarcity of instances containing SATDs. To address these problems, we propose a novel weighted prompt tuning method to identify SATDs, called WPTD. Specifically, WPTD employs weighted prompt tuning to adapt the model with few-shot samples when training data are insufficient. Moreover, to improve the performance of the model, WPTD constructs an SATD verbalizer by extracting keywords through mutual information and refining it with prior contextual information. Furthermore, it improves SATD representation by extracting weights using the chi-square method and integrating them into the text. Finally, to reduce bias, WPTD computes the average score of results as the final predicted distribution. We conduct comprehensive experiments on seven projects and the results show that our method significantly outperforms baseline approaches. In addition, we summarize project-specific keywords, which can help developers better understand SATDs.
Article
Full-text available
The expanding Arabic user base presents a unique opportunity for researchers to tap into vast online Arabic resources. However, the lack of reliable Arabic word embedding models and the limited availability of Arabic corpora poses significant challenges. This paper addresses these gaps by developing and evaluating Arabic word embedding models trained on diverse Arabic corpora, investigating how varying hyperparameter values impact model performance across different NLP tasks. To train our models, we collected data from three distinct sources: Wikipedia, newspapers, and 32 Arabic books, each selected to capture specific linguistic and contextual features of Arabic. By using advanced techniques such as Word2Vec and FastText, we experimented with different hyperparameter configurations, such as vector size, window size, and training algorithms (CBOW and skip-gram), to analyze their impact on model quality. Our models were evaluated using a range of NLP tasks, including sentiment analysis, similarity tests, and an adapted analogy test designed specifically for Arabic. The findings revealed that both the corpus size and hyperparameter settings had notable effects on performance. For instance, in the analogy test, a larger vocabulary size significantly improved outcomes, with the FastText skip-gram models excelling in accurately solving analogy questions. For sentiment analysis, vocabulary size was critical, while in similarity scoring, the FastText models achieved the highest scores, particularly with smaller window and vector sizes. Overall, our models demonstrated strong performance, achieving 99% and 90% accuracies in sentiment analysis and the analogy test, respectively, along with a similarity score of 8 out of 10. These results underscore the value of our models as a robust tool for Arabic NLP research, addressing a pressing need for high-quality Arabic word embeddings.
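The kind of hyperparameter sweep described here can be sketched with Gensim; the toy corpus and the two configurations below are for illustration only.

```python
# Sketch of a Word2Vec/FastText hyperparameter sweep (toy Arabic corpus;
# the actual study used Wikipedia, newspaper, and book corpora).
from gensim.models import Word2Vec, FastText

corpus = [["كتاب", "جديد", "ممتاز"], ["خبر", "صحيفة", "اليوم"]]

configs = [
    dict(vector_size=100, window=3, sg=0),   # CBOW, small window
    dict(vector_size=300, window=5, sg=1),   # skip-gram, larger window
]
for cfg in configs:
    w2v = Word2Vec(corpus, min_count=1, **cfg)
    ft = FastText(corpus, min_count=1, **cfg)
    print(cfg, len(w2v.wv), len(ft.wv))
```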
Chapter
Full-text available
In recent years, the problem of misinformation on the web has become widespread across languages, countries and various social media platforms. One problem central to stopping the spread of misinformation is identifying claims and prioritising them for fact-checking. Although there has been much work on automated claim detection from text recently, the role of images and their variety still need to be explored. As posts and content shared on social media are often multimodal, it has become crucial to view the problem of misinformation and fake news from a multimodal perspective. In this chapter, first, we present an overview of existing claim detection methods and their limitations; second, we present a unimodal approach to identify check-worthy claims; third, and lastly, we introduce a dataset that takes both the image and text into account for detecting claims and benchmark recent multimodal models on the task.
Article
Full-text available
Gendered disinformation undermines women's rights, democratic principles, and national security by worsening societal divisions through authoritarian regimes' intentional weaponization of social media. Online misogyny represents a harmful societal issue, threatening to transform digital platforms into environments that are hostile and inhospitable to women. Despite the severity of this issue, efforts to persuade digital platforms to strengthen their protections against gendered disinformation are frequently ignored, highlighting the difficult task of countering online misogyny in the face of commercial interests. This growing concern underscores the need for effective measures to create safer online spaces, where respect and equality prevail, ensuring that women can participate fully and freely without the fear of harassment or discrimination. This study addresses the challenge of detecting misogynous content in bilingual (English and Italian) online communications. Utilizing FastText word embeddings and explainable artificial intelligence techniques, we introduce a model that enhances both the interpretability and the accuracy of detecting misogynistic language. To conduct an in-depth analysis, we implemented a range of experiments ranging from classic machine learning methodologies and conventional deep learning approaches to recent transformer-based models incorporating both language-specific and multilingual capabilities. This paper enhances methodologies for detecting misogyny by incorporating incremental learning on cutting-edge datasets containing tweets and posts from different sources such as Facebook, Twitter, and Reddit, with our proposed approach outperforming existing approaches on these datasets in metrics such as accuracy, F1-score, precision, and recall. This process involved refining hyperparameters, employing optimization techniques, and utilizing generative configurations. By implementing Local Interpretable Model-agnostic Explanations (LIME), we further elucidate the rationale behind the model's predictions, enhancing understanding of its decision-making process.
Article
Full-text available
Early detection of accidents and rescue are of paramount importance in the reduction of fatalities. Social media data, which has evolved to become an important source of sharing information, plays a great role in building machine learning-based models for classifying posts related to accidents. Since the context of the word "accident" is difficult to determine in a posting, various works in the literature have developed better classifiers for predicting whether a posting is actually related to an accident. However, an ensemble of classifiers is known to provide better performance than the basic models. Therefore, in this direction, we present a novel weighted majority voting-based ensemble approach for context classification of tweets (WM-ECCT) to detect whether tweets are related or unrelated to road accidents. For the proposed ensemble model, the weighting scheme is based on the principle of the false prediction to true prediction ratio. Also, the proposed model uses the multi-inducer technique and bootstrap sampling to reduce misclassification rates. Moreover, we propose a context-aware labeling approach for the annotation of tweets into related and unrelated categories. Experiments conducted reveal that the proposed ensemble model outperforms the different standalone machine learning and ensemble models on various performance measures.
Article
The rapid growth of biomedical publications has presented significant challenges in the field of information retrieval. Most existing work focuses on document retrieval given explicit queries. However, in real applications such as curated biomedical database maintenance, explicit queries are missing. In this paper, we propose a two-step model for biomedical information retrieval in the case that only a small set of example documents is available without explicit queries. Initially, we extract keywords from the observed documents using large pre-trained language models and biomedical knowledge graphs. These keywords are then enriched with domain-specific entities. Information retrieval techniques can subsequently use the collected entities to rank the documents. Following this, we introduce an iterative Positive-Unlabeled learning method to classify all unlabeled documents. Experiments conducted on the PubMed dataset demonstrate that the proposed technique outperforms the state-of-the-art positive-unlabeled learning methods. The results underscore the effectiveness of integrating large language models and biomedical knowledge graphs in improving zero-shot information retrieval performance in the biomedical domain.
Article
Full-text available
Social media has emerged as a dominant platform where individuals freely share opinions and communicate globally. Its role in disseminating news worldwide is significant due to its easy accessibility. However, the increase in the use of these platforms presents severe risks for potentially misleading people. Our research aims to investigate different techniques within machine learning, deep learning, and ensemble learning frameworks in Arabic fake news detection. We integrated FastText word embeddings with various machine learning and deep learning methods. We then leveraged advanced transformer-based models, including BERT, XLNet, and RoBERTa, optimizing their performance through careful hyperparameter tuning. The research methodology involves utilizing two Arabic news article datasets, AFND and ARABICFAKETWEETS datasets, categorized into fake and real subsets and applying comprehensive preprocessing techniques to the text data. Four hybrid deep learning models are presented: CNN-LSTM, RNN-CNN, RNN-LSTM, and Bi-GRU-Bi-LSTM. The Bi-GRU-Bi-LSTM model demonstrated superior performance regarding the F1 score, accuracy, and loss metrics. The precision, recall, F1 score, and accuracy of the hybrid Bi-GRU-Bi-LSTM model on the AFND Dataset are 0.97, 0.97, 0.98, and 0.98, and on the ARABICFAKETWEETS dataset are 0.98, 0.98, 0.99, and 0.99 respectively. The study’s primary conclusion is that when spotting fake news in Arabic, the Bi-GRU-Bi-LSTM model outperforms other models by a significant margin. It significantly aids the global fight against false information by setting the stage for future research to expand fake news detection to multiple languages.
Article
Non-Terrestrial Network (NTN)-enabled Internet of Things (IoT) extends connectivity to remote and underserved areas, enhances network reliability and coverage, and supports diverse IoT applications in challenging environments such as rural, maritime, and disaster-stricken regions. As an emerging and fast-evolving IoT scheme, NTN-enabled IoT requires extensive evaluation of connectivity, performance, and security to ensure effective deployment in real-world scenarios. Since conducting tests in remote and diverse environments is logistically challenging and costly, we propose a Generative Artificial Intelligence (GAI)-based synthetic traffic generation framework that facilitates comprehensive traffic analysis and performance evaluation. The framework employs a GAI model to learn traffic patterns from historical data and generate synthetic traffic. Our approach includes an embedding-based model for representing network flow attributes and a Conditional Generative Adversarial Network (CGAN) for generating traffic flows. Considering both source-destination information and statistical features yields a more comprehensive characterization of traffic flows. Simulation results demonstrate that the proposed approach generates high-quality traffic that conforms to the real data distribution and preserves clear differences between applications.
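As a purely illustrative picture of the CGAN component, the sketch below conditions both the generator and the discriminator on an application label via an embedding. Layer sizes, the flow-feature dimensionality, and the conditioning scheme are assumptions rather than the paper's architecture.

```python
# Hedged sketch of a conditional GAN over flow feature vectors.
import tensorflow as tf

def build_generator(noise_dim=32, n_classes=5, flow_dim=16):
    z = tf.keras.Input(shape=(noise_dim,))
    label = tf.keras.Input(shape=(1,), dtype="int32")
    emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_classes, 8)(label))
    h = tf.keras.layers.Concatenate()([z, emb])
    h = tf.keras.layers.Dense(128, activation="relu")(h)
    out = tf.keras.layers.Dense(flow_dim)(h)          # synthetic flow features
    return tf.keras.Model([z, label], out)

def build_discriminator(n_classes=5, flow_dim=16):
    flow = tf.keras.Input(shape=(flow_dim,))
    label = tf.keras.Input(shape=(1,), dtype="int32")
    emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_classes, 8)(label))
    h = tf.keras.layers.Concatenate()([flow, emb])
    h = tf.keras.layers.Dense(128, activation="relu")(h)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(h)   # real vs. synthetic
    return tf.keras.Model([flow, label], out)
```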
Article
Full-text available
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Article
Full-text available
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling the frequent words we obtain a significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
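Two of the tricks named in this abstract are usually implemented as simple closed-form expressions. The sketch below uses the commonly cited formulas for frequent-word subsampling and data-driven phrase scoring; the threshold values (t = 1e-5, delta = 5) are typical defaults, not prescriptions from this summary.

```python
# Hedged sketch of frequent-word subsampling and phrase scoring.
import math

def keep_probability(word_count, total_tokens, t=1e-5):
    # Probability of *keeping* a token of a frequent word during training;
    # very frequent words are discarded more aggressively.
    f = word_count / total_tokens
    return min(1.0, math.sqrt(t / f)) if f > 0 else 1.0

def phrase_score(count_bigram, count_a, count_b, delta=5):
    # Bigrams scoring above a chosen threshold are merged into single tokens
    # (e.g. "air_canada"); delta discounts very rare word pairs.
    return (count_bigram - delta) / (count_a * count_b)
```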
Article
Full-text available
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
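As a usage-level illustration (not taken from the paper), both proposed architectures are available in the gensim library, where the sg flag switches between CBOW and Skip-gram. This assumes gensim >= 4.0 (where the dimensionality parameter is called vector_size); the toy corpus and parameter values are placeholders.

```python
# Hedged sketch: training CBOW and Skip-gram models over a tokenized corpus.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["dogs", "chase", "cats"]]          # toy corpus; use real text

cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, min_count=1)
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

# nearest neighbours in the learned vector space
print(skipgram.wv.most_similar("cat", topn=3))
```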
Article
Full-text available
Research into corpus-based semantics has focused on the development of ad hoc models that treat single tasks, or sets of closely related tasks, as unrelated challenges to be tackled by extracting different kinds of distributional information from the corpus. As an alternative to this “one task, one model” approach, the Distributional Memory framework extracts distributional information once and for all from the corpus, in the form of a set of weighted word-link-word tuples arranged into a third-order tensor. Different matrices are then generated from the tensor, and their rows and columns constitute natural spaces to deal with different semantic problems. In this way, the same distributional information can be shared across tasks such as modeling word similarity judgments, discovering synonyms, concept categorization, predicting selectional preferences of verbs, solving analogy problems, classifying relations between word pairs, harvesting qualia structures with patterns or example pairs, predicting the typical properties of concepts, and classifying verbs into alternation classes. Extensive empirical testing in all these domains shows that a Distributional Memory implementation performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against our implementations of several state-of-the-art methods. The Distributional Memory approach is thus shown to be tenable despite the constraints imposed by its multi-purpose nature.
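A toy sketch of the central data structure may help: weighted word-link-word tuples are collected once and then "matricized" into different two-way views, each suited to different tasks. The tuples and weights below are invented for illustration.

```python
# Hedged sketch of a third-order tuple store and two of its matricizations.
from collections import defaultdict

# (word1, link, word2) -> association weight
tuples = {("soldier", "use", "gun"): 8.3,
          ("teacher", "use", "book"): 4.1,
          ("soldier", "with", "gun"): 6.0}

# W1 x LW2 view: rows are words, columns are <link, word> pairs
# (suited to word-similarity style tasks)
w1_lw2 = defaultdict(dict)
for (w1, link, w2), weight in tuples.items():
    w1_lw2[w1][(link, w2)] = weight

# W1W2 x L view: rows are word pairs, columns are links
# (suited to relation-classification style tasks)
w1w2_l = defaultdict(dict)
for (w1, link, w2), weight in tuples.items():
    w1w2_l[(w1, w2)][link] = weight

print(dict(w1_lw2["soldier"]))
print(dict(w1w2_l[("soldier", "gun")]))
```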
Article
Full-text available
Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case -- multitask learning with hundreds of thousands of tasks.
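For readers unfamiliar with the technique analysed here, the following is a minimal sketch of the hashing trick: each feature is hashed to a bucket in a fixed-size vector, and a second hash supplies a sign so inner products remain unbiased in expectation. Python's built-in hash is used only for brevity (it is salted per process); a production system would use a stable hash function.

```python
# Hedged sketch of feature hashing for a bag-of-words document.
import numpy as np

def hash_features(token_counts, dim=2**18):
    vec = np.zeros(dim)
    for token, count in token_counts.items():
        idx = hash(token) % dim                          # bucket index
        sign = 1 if hash("sign:" + token) % 2 == 0 else -1
        vec[idx] += sign * count                          # signed accumulation
    return vec

doc = {"word": 3, "embedding": 1, "hashing": 2}
print(hash_features(doc, dim=16))
```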
Article
Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however been less successful: several attempts at learning unsupervised sentence representations have not reached sufficiently strong performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference dataset can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features that can then be transferred to other tasks, our work indicates the suitability of natural language inference for transfer learning to other NLP tasks.
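A minimal sketch of the general recipe described here: a shared BiLSTM-with-max-pooling sentence encoder trained on a 3-way NLI objective, combining the premise and hypothesis vectors as [u, v, |u - v|, u * v]. Layer sizes and the surrounding pipeline are assumptions, not the authors' released configuration.

```python
# Hedged sketch of an NLI-trained sentence encoder (Keras).
import tensorflow as tf

def build_nli_model(vocab_size=20000, embed_dim=300, max_len=40):
    # Shared encoder: BiLSTM followed by max pooling over time.
    tokens = tf.keras.Input(shape=(max_len,))
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(256, return_sequences=True))(x)
    sent_vec = tf.keras.layers.GlobalMaxPooling1D()(x)
    encoder = tf.keras.Model(tokens, sent_vec)   # reusable for transfer tasks

    premise = tf.keras.Input(shape=(max_len,))
    hypothesis = tf.keras.Input(shape=(max_len,))
    u, v = encoder(premise), encoder(hypothesis)
    abs_diff = tf.keras.layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([u, v])
    product = tf.keras.layers.Multiply()([u, v])
    feats = tf.keras.layers.Concatenate()([u, v, abs_diff, product])
    h = tf.keras.layers.Dense(512, activation="relu")(feats)
    out = tf.keras.layers.Dense(3, activation="softmax")(h)  # entail/neutral/contradict
    model = tf.keras.Model([premise, hypothesis], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model, encoder
```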
Article
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models for learning such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
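A minimal sketch of the character n-gram decomposition: the boundary markers and the 3-to-6 range follow the common description of the method, and the vector lookup table stands in for trained parameters.

```python
# Hedged sketch: a word vector as the sum of its character n-gram vectors.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    token = f"<{word}>"                      # boundary markers
    grams = [token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    return grams + [token]                   # the full word is kept as one unit

def word_vector(word, ngram_vectors, dim=100):
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>', ..., '<where>']
```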
Article
Recent trends suggest that neural-network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.
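One example of a design choice that transfers to count-based models is context-distribution smoothing applied to a plain PPMI co-occurrence matrix. The sketch below is illustrative only; alpha = 0.75 is the commonly reported setting and the toy matrix is invented.

```python
# Hedged sketch: positive PMI with context-distribution smoothing.
import numpy as np

def smoothed_ppmi(cooc, alpha=0.75):
    total = cooc.sum()
    p_w = cooc.sum(axis=1) / total                 # word marginals
    ctx = cooc.sum(axis=0) ** alpha                # smoothed context counts
    p_c = ctx / ctx.sum()
    p_wc = cooc / total
    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w[:, None] * p_c[None, :]))
    return np.maximum(pmi, 0.0)                    # keep only positive associations

cooc = np.array([[10., 2., 0.], [3., 7., 1.], [0., 1., 5.]])
print(smoothed_ppmi(cooc))
```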
Article
Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-the-art method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones.
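For concreteness, here is a minimal sketch of the noise-contrastive estimation objective the abstract refers to: the model learns to separate the observed (context, word) pair from k samples drawn from a noise distribution, using only unnormalised scores. Variable names are illustrative.

```python
# Hedged sketch of the per-example NCE loss for a log-bilinear word model.
import numpy as np

def nce_loss(score_true, scores_noise, log_p_noise_true, log_p_noise_samples, k):
    # score_* are the model's unnormalised log-scores s(word, context);
    # log_p_noise_* are log-probabilities under the noise (e.g. unigram) dist.
    def log_sigmoid(x):
        return -np.logaddexp(0.0, -x)              # numerically stable
    loss = -log_sigmoid(score_true - np.log(k) - log_p_noise_true)
    loss -= np.sum(log_sigmoid(-(scores_noise - np.log(k) - log_p_noise_samples)))
    return loss

print(nce_loss(score_true=2.0,
               scores_noise=np.array([-1.0, 0.5]),
               log_p_noise_true=np.log(0.01),
               log_p_noise_samples=np.log([0.02, 0.05]),
               k=2))
```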
Article
It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf's law observed in natural languages such as English. The facts that the frequency of occurrence of a word is almost an inverse power law function of its rank, and that the exponent of this inverse power law is very close to 1, are largely due to the transformation from a word's length to its rank, which stretches an exponential function into a power law function.
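The claim is easy to reproduce with a few lines of simulation: "words" produced by typing random letters and spaces already show a near power-law rank-frequency curve. The alphabet size and text length below are arbitrary choices.

```python
# Hedged sketch: rank-frequency statistics of random "monkey typing" text.
import random
from collections import Counter

random.seed(0)
alphabet = "abcde "                       # 5 letters plus the space character
text = "".join(random.choice(alphabet) for _ in range(200000))
freqs = Counter(text.split())

ranked = freqs.most_common()
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        word, count = ranked[rank - 1]
        print(rank, word, count)          # frequency falls roughly as 1/rank
```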
UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems
Han, L., Kashyap, A. L., Finin, T., Mayfield, J., and Weese, J. (2013). UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, June.