Conference Paper

Bag of Tricks for Efficient Text Classification

Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov

... Once the word-embeddings were extracted, the next stage consisted of sentence classification. For this, we explored four methods: Deep Convolutional Neural Networks [13] with or without pre-trained word-embeddings at the input layer, FastText [23], Support Vector Machines (SVM), and k-Nearest Neighbors (kNN). ...
... FastText [20,23] for supervised learning is a computationally efficient method that starts with an embedding layer mapping the vocabulary indices into d dimensions (or, alternatively, using pre-trained word vectors). It then adds a global average pooling layer, which averages the embeddings of all the words in the sentence. ...
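This excerpt describes the whole supervised fastText pipeline: an embedding table, average pooling over the words of a sentence, and a linear classifier on top. A minimal sketch of that architecture is shown below, assuming PyTorch; the vocabulary size, dimension d, and class count are arbitrary placeholders, and this is an illustration of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FastTextLikeClassifier(nn.Module):
    """Embedding table -> mean pooling over words -> linear classifier."""
    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        # EmbeddingBag with mode="mean" averages the embeddings of all tokens
        # in a sentence, i.e. the global average pooling described above.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        pooled = self.embedding(token_ids, offsets)  # (batch, embed_dim)
        return self.fc(pooled)                       # unnormalized class scores

# Toy usage: two sentences packed into one flat index tensor with offsets.
model = FastTextLikeClassifier(vocab_size=10_000, embed_dim=100, num_classes=4)
tokens = torch.tensor([1, 5, 7, 2, 9, 3])  # word indices of both sentences
offsets = torch.tensor([0, 3])             # sentence boundaries in `tokens`
logits = model(tokens, offsets)            # shape: (2, 4)
```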
Preprint
Interventional cancer clinical trials are generally too restrictive, and some patients are often excluded on the basis of comorbidity, past or concomitant treatments, or the fact that they are over a certain age. The efficacy and safety of new treatments for patients with these characteristics are, therefore, not defined. In this work, we built a model to automatically predict whether short clinical statements were considered inclusion or exclusion criteria. We used protocols from cancer clinical trials that were available in public registries from the last 18 years to train word-embeddings, and we constructed a dataset of 6M short free-texts labeled as eligible or not eligible. A text classifier was trained using deep neural networks, with pre-trained word-embeddings as inputs, to predict whether or not short free-text statements describing clinical information were considered eligible. We additionally analyzed the semantic reasoning of the word-embedding representations obtained and were able to identify equivalent treatments for a type of tumor analogous with the drugs used to treat other tumors. We show that representation learning using deep neural networks can be successfully leveraged to extract the medical knowledge from clinical trial protocols for potentially assisting practitioners when prescribing treatments.
... Recent work in natural language processing has made great strides by representing word relationships in a corpus not as networks or topical clusters but as vectors in a dense, continuous, high-dimensional space (Mikolov, Yih, and Zweig 2013; Pennington, Socher, and Manning 2014; Joulin et al. 2016). These vector space models, known collectively as word embeddings, have attracted widespread interest among computer scientists and computational linguists due to their ability to capture complex semantic relations between words. ...
... Furthermore, when an adequate amount of text is available, contemporary word embedding approaches are able to distill much more semantic information and more complex semantic relations than previous methods of computational text analysis. It is also important to note that word embedding models' algorithms have received intensive attention in computer science and natural language processing communities, and are being continuously improved, with more recent algorithms demonstrating advances in their ability to successfully model semantic relations with smaller bodies of text by leveraging subword information (Joulin et al. 2016). As word embedding models become more widely used in the social scientific community, we expect them to enable the application of new data to classic sociological questions, and the productive identification and discovery of novel social and cultural problems. ...
Preprint
We demonstrate the utility of a new methodological tool, neural-network word embedding models, for large-scale text analysis, revealing how these models produce richer insights into cultural associations and categories than possible with prior methods. Word embeddings represent semantic relations between words as geometric relationships between vectors in a high-dimensional space, operationalizing a relational model of meaning consistent with contemporary theories of identity and culture. We show that dimensions induced by word differences (e.g. man - woman, rich - poor, black - white, liberal - conservative) in these vector spaces closely correspond to dimensions of cultural meaning, and the projection of words onto these dimensions reflects widely shared cultural connotations when compared to surveyed responses and labeled historical data. We pilot a method for testing the stability of these associations, then demonstrate applications of word embeddings for macro-cultural investigation with a longitudinal analysis of the coevolution of gender and class associations in the United States over the 20th century and a comparative analysis of historic distinctions between markers of gender and class in the U.S. and Britain. We argue that the success of these high-dimensional models motivates a move towards "high-dimensional theorizing" of meanings, identities and cultural processes.
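To make the projection step described above concrete, here is a hedged sketch that loads pretrained vectors in word2vec format (the file name vectors.bin is a placeholder) and projects words onto a dimension induced by a word-pair difference; it illustrates the general technique, not the authors' exact pipeline.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any pretrained vectors in word2vec format would do.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def project(word: str, pole_a: str, pole_b: str) -> float:
    """Cosine between a word vector and the axis induced by pole_a - pole_b."""
    axis = wv[pole_a] - wv[pole_b]
    v = wv[word]
    return float(np.dot(v, axis) / (np.linalg.norm(v) * np.linalg.norm(axis)))

# Positive values lean toward pole_a, negative toward pole_b.
print(project("nurse", "woman", "man"))
print(project("yacht", "rich", "poor"))
```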
... Moreover, a model for detecting misogyny in tweets has recently been described (Frenda et al., 2019), as well as a neural architecture based on BERT, which carried out the first work on multi-label classification for sexism detection (Parikh, 2019). Furthermore, there are in-depth analyses of sexist postings on the social networking site Twitter/X (called 'tweets') that classified them as 'hostile', 'benevolent' or 'others' (Jha & Mamidi, 2017), using Support Vector Machines (SVM), sequence-to-sequence models and FastText classifiers (Joulin et al., 2016). Sharifirad and Matwin (2019) also used different types of word embeddings combined with LSTM and naive Bayes models to detect harassment of different classes. ...
Article
Full-text available
Sexism against women remains an entrenched problem, manifested in contemporary cultural production worldwide. Since cultural production can be understood as both a mirror for and a reflection of the society where it is inserted, the persistence of sexism in music might rather represent how sexist our society is. The present work aims to analyze the evolution of sexism towards women among the most listened to music lyrics during the past six decades in Spain. To perform a large-scale analysis, automatic text classification based on manually labeled training data is used to categorize music lyrics as sexist or non-sexist. The findings show that sexism has always been present in song lyrics in Spain, and its presence has increased considerably in the music made available through streaming platforms over the last decade. This research has the potential to help detect, monitor, and mitigate sexist biases, while also advancing the automation of some aspects of content analysis within the realm of cultural studies.
... This methodology identified the similarities among the documents. The authors in [41] described the importance of linear models in word representation learning. Refs. ...
Article
Full-text available
The impact of artificial intelligence (AI) on English language learning has become the center of attention in the past few decades. This study, with its potential to transform English language instruction and offer various instructional approaches, provides valuable insights and knowledge. To fully grasp the potential advantages of AI, more research is needed to improve, validate, and test AI algorithms and architectures. Grammatical notations provide a word’s information to the readers. If a word’s images are properly extracted and categorized using a CNN, it can help non-native English speakers improve their learning habits. The classification of parts of speech into different grammatical notations is the major problem that non-native English learners face. This situation stresses the need to develop a computer-based system using a machine learning algorithm to classify words into proper grammatical notations. A convolutional neural network (CNN) was applied to classify English words into nine classes: noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, and interjection. A simulation of the selected model was performed in MATLAB. The model achieved an overall accuracy of 97.22%. The CNN showed 100% accuracy for pronouns, determiners, verbs, adverbs, and prepositions; 95% for nouns, adjectives, and conjunctions; and 90% for interjections. The significant results (p < 0.0001) of the chi-square test supported the use of the CNN by non-native English learners. The proposed approach is an important source of word classification for non-native English learners by putting the word image into the model. This not only helps beginners in English learning but also helps in setting standards for evaluating documents.
... The model's performance was evaluated on TweetEval [24], a unified benchmark for seven Twitter classification tasks (Emoji, Emotion, Hate, Irony, Offensive, Sentiment, and Stance). TimeLM-19 and TimeLM-21 [25] were compared with other models such as RoBERTa-Base [10], BERTweet [26] (pretrained on 900 million tweets), and older models such as FastText [27], SVM [28], and BLSTM [29]. Although BERTweet achieved the best average results across the seven tasks, TimeLM-21 performed better in 6 out of 7 tasks. ...
Article
Full-text available
In the digital era, social media platforms have seen a substantial increase in the volume of online comments. While these platforms provide users with a space to express their opinions, they also serve as fertile ground for the proliferation of hate speech. Hate comments can be categorized into various types, including discrimination, violence, racism, and sexism, all of which can negatively impact mental health. Among these, sexism poses a significant challenge due to its various forms and the difficulty in defining it, making detection complex. Nevertheless, detecting and preventing sexism on social networks remains a critical issue. Recent studies have leveraged language models such as transformers, known for their ability to capture the semantic nuances of textual data. In this study, we explore different transformer models, including multiple versions of RoBERTa (A Robustly Optimized BERT Pretraining Approach), to detect sexism. We hypothesize that combining a sentiment-focused language model with models specialized in sexism detection can improve overall performance. To test this hypothesis, we developed two approaches. The first involved using classical transformers trained on our dataset, while the second combined embeddings generated by transformers with a Long Short-Term Memory (LSTM) model for classification. The probabilistic outputs of each approach were aggregated through various voting strategies to enhance detection accuracy. The LSTM with embeddings approach improved the F1-score by 0.2% compared to the classical transformer approach. Furthermore, the combination of both approaches confirms our hypothesis, achieving a 1.6% improvement in the F1-score in each case. We determined that an F1 score of over 0.84 effectively measures sexism. Additionally, we constructed our own dataset to train and evaluate the models.
... FastText is another method developed by Joulin et al. [20] and further improved [21] by introducing an extension of the continuous skipgram model. In this approach, the concept of word embedding differs from Word2Vec or GloVe, where words are represented by vectors. ...
Article
Full-text available
This study introduces a method for the improvement of word vectors, addressing the limitations of traditional approaches like Word2Vec or GloVe through introducing into embeddings richer semantic properties. Our approach leverages supervised learning methods, with shifts in vectors in the representation space enhancing the quality of word embeddings. This ensures better alignment with semantic reference resources, such as WordNet. The effectiveness of the method has been demonstrated through the application of modified embeddings to text classification and clustering. We also show how our method influences document class distributions, visualized through PCA projections. By comparing our results with state-of-the-art approaches and achieving better accuracy, we confirm the effectiveness of the proposed method. The results underscore the potential of adaptive embeddings to improve both the accuracy and efficiency of semantic analysis across a range of NLP.
... As lyrics are part of the input, this is a reasonable assumption. For instance, there is 100% accuracy on JamendoLyrics Multi-Lang using [28]. If unannotated symbols are present in the audio and mistakenly used as negatives, our method still works as the majority of examples are correct. ...
Preprint
Lyrics alignment gained considerable attention in recent years. State-of-the-art systems either re-use established speech recognition toolkits, or design end-to-end solutions involving a Connectionist Temporal Classification (CTC) loss. However, both approaches suffer from specific weaknesses: toolkits are known for their complexity, and CTC systems use a loss designed for transcription which can limit alignment accuracy. In this paper, we use instead a contrastive learning procedure that derives cross-modal embeddings linking the audio and text domains. This way, we obtain a novel system that is simple to train end-to-end, can make use of weakly annotated training data, jointly learns a powerful text model, and is tailored to alignment. The system is not only the first to yield an average absolute error below 0.2 seconds on the standard Jamendo dataset but it is also robust to other languages, even when trained on English data only. Finally, we release word-level alignments for the JamendoLyrics Multi-Lang dataset.
... Then, we generate vector representations of texts using Word2Vec. We selected this model because it has been the seed for all word embedding models, and it is the most widely used model, despite the existence of newer and very successful word embedding models such as fastText [37,38,39], BERT [40], Swivel [41] and ELMo [42]. Afterwards, clustering analysis is performed in the embedded space to find latent topics. ...
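A rough sketch of the pipeline this excerpt outlines — train Word2Vec on tokenized tweets, average word vectors per tweet, and cluster in the embedded space to surface latent topics — is given below. The toy tweets, vector size, and cluster count are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Toy stand-in for the tokenized tweet collection.
tweets = [
    ["traffic", "downtown", "this", "morning"],
    ["concert", "tonight", "in", "the", "park"],
    ["heavy", "traffic", "on", "the", "highway"],
    ["free", "concert", "next", "weekend"],
]

# Train Word2Vec on the tweets (min_count=1 only because this corpus is tiny).
w2v = Word2Vec(sentences=tweets, vector_size=50, window=5, min_count=1, workers=2)

def tweet_vector(tokens):
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([tweet_vector(t) for t in tweets])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster id per tweet; clusters approximate latent topics
```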
Preprint
With the increasing use of the Internet and mobile devices, social networks are becoming the most used media to communicate citizens' ideas and thoughts. This information is very useful to identify communities with common ideas based on what they publish in the network. This paper presents a method to automatically detect city communities based on machine learning techniques applied to a set of tweets from Bogotá's citizens. An analysis was performed on a collection of 2,634,176 tweets gathered from Twitter over a period of six months. Results show that the proposed method is an interesting tool to characterize a city population based on machine learning methods and text analytics.
... iii) fastText: A simple yet efficient baseline for text classification based on a linear model with a rank constraint and a fast loss approximation. Experiments show that fastText typically produces results on par with sophisticated deep learning classifiers (Grave et al. 2017). iv) Convolutional Neural Network (CNN): We use a single-layer CNN model trained on top of word vectors as proposed by Kim (2014). ...
Preprint
We propose a simple, yet effective, approach towards inducing multilingual taxonomies from Wikipedia. Given an English taxonomy, our approach leverages the interlanguage links of Wikipedia followed by character-level classifiers to induce high-precision, high-coverage taxonomies in other languages. Through experiments, we demonstrate that our approach significantly outperforms the state-of-the-art, heuristics-heavy approaches for six languages. As a consequence of our work, we release presumably the largest and the most accurate multilingual taxonomic resource spanning over 280 languages.
... URL filtering: We excluded from our dataset any documents whose URL domain ended with ".pt", which refers to Portuguese from Portugal, as the language style in those documents might differ significantly from that used in Brazilian Portuguese web pages. Additionally, we used FastText [Joulin et al. 2016b, Joulin et al. 2016a] as an additional language verification method to ensure that only Portuguese documents were included in our corpus. Document segmentation into passages: Following language verification, we segmented the documents into approximately 1,000-character segments and assessed the percentage of line-break (\n) occurrences within each segment, removing those with more than 20%. ...
Conference Paper
We present Quati, a dataset specifically designed for evaluating Information Retrieval (IR) systems for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of frequently accessed Brazilian Portuguese websites, which ensures a representative and relevant corpus. To label the query–document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. Our annotation methodology is described, enabling the cost-effective creation of similar datasets for other languages, with an arbitrary number of labeled documents per query. As a baseline, we evaluate a diverse range of open-source and commercial retrievers. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati, and all scripts at https://github.com/unicamp-dl/quati.
... FastText classifier: FastText [Jou+16] is an efficient text classification library designed for fast and scalable text classification, particularly suitable for large-scale datasets. FastText relies on a simple shallow neural network architecture that enables rapid training and inference. ...
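For concreteness, a minimal supervised-training sketch with the fastText Python bindings follows; the file names and hyperparameters are placeholders, and the training file is assumed to use fastText's __label__ format.

```python
import fasttext

# train.txt is assumed to hold one example per line in fastText's
# "__label__<class> <text>" format, e.g. "__label__spam win money now".
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,          # learning rate
    epoch=10,        # passes over the training data
    wordNgrams=2,    # word bigrams in addition to unigrams
)

# Single prediction: returns the top label and its probability.
labels, probs = model.predict("free tickets click here", k=1)
print(labels[0], probs[0])

# Precision/recall at 1 on a held-out file in the same format.
n_examples, precision_at_1, recall_at_1 = model.test("valid.txt")
```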
Preprint
Full-text available
We investigate biases in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar filtering and deduplication steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that popular pretraining datasets have their own unique biases or fingerprints. Those biases remain even when the text is rewritten with LLMs. Moreover, these biases propagate through training: Random sequences generated by models trained on those datasets can be classified well by a classifier trained on the original datasets.
... Therefore, we leveraged: (a) a pre-trained model for language identification (lid.176.ftz [15,16]), and (b) a pre-trained model for sentiment analysis to infer the reviews' sentiment (Twitter RoBERTa Base Sentiment model [2]). Thus, the English reviews were filtered, and their sentiments were inferred. ...
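A small sketch of the language-filtering step described here, using the lid.176.ftz model named in the excerpt; the confidence threshold and the example reviews are illustrative assumptions, and the sentiment step is omitted.

```python
import fasttext

# Pre-trained language identification model mentioned above.
lid = fasttext.load_model("lid.176.ftz")

def is_english(text: str, threshold: float = 0.8) -> bool:
    """Keep a review only if the top predicted language is English."""
    labels, probs = lid.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold

reviews = ["Great hotel, friendly staff", "Habitación limpia y céntrica"]
english_reviews = [r for r in reviews if is_english(r)]
```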
Chapter
The proliferation of textual data on the Internet presents a unique opportunity for institutions and companies to monitor public opinion about their services and products. Given the rapid generation of such data, the text stream mining setting, which handles sequentially arriving, potentially infinite text streams, is often more suitable than traditional batch learning. While pre-trained language models are commonly employed for their high-quality text vectorization capabilities in streaming contexts, they face challenges adapting to concept drift—the phenomenon where the data distribution changes over time, adversely affecting model performance. Addressing the issue of concept drift, this study explores the efficacy of seven text sampling methods designed to fine-tune language models, thereby mitigating performance degradation selectively. We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions. Our evaluation, focused on Macro F1-score and elapsed time, employs two text stream datasets and an incremental SVM classifier to benchmark performance. Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification, demonstrating that larger sample sizes correlate with improved macro F1 scores. Notably, our proposed WordPieceToken ratio sampling method significantly enhances performance with the identified loss functions, surpassing baseline results.
... The local learning rate η is set to 0.005 for the 4-layer CNN and 0.01 for ResNet-18. For text classification tasks, the AG News [25] dataset with fastText [26] is used, and the local learning rate for fastText is set to η = 0.01, with other settings the same as image classification tasks. ...
Preprint
Full-text available
Recently, Federated Learning (FL) has gained popularity for its privacy-preserving and collaborative learning capabilities. Personalized Federated Learning (PFL), building upon FL, aims to address the issue of statistical heterogeneity and achieve personalization. Personalized-head-based PFL is a common and effective PFL method that splits the model into a feature extractor and a head, where the feature extractor is collaboratively trained and shared, while the head is locally trained and not shared. However, retaining the head locally, although achieving personalization, prevents the model from learning global knowledge in the head, thus affecting the performance of the personalized model. To solve this problem, we propose a novel PFL method called Federated Learning with Aggregated Head (FedAH), which initializes the head with an Aggregated Head at each iteration. The key feature of FedAH is to perform element-level aggregation between the local model head and the global model head to introduce global information from the global model head. To evaluate the effectiveness of FedAH, we conduct extensive experiments on five benchmark datasets in the fields of computer vision and natural language processing. FedAH outperforms ten state-of-the-art FL methods in terms of test accuracy by 2.87%. Additionally, FedAH maintains its advantage even in scenarios where some clients drop out unexpectedly. Our code is open-accessed at https://github.com/heyuepeng/FedAH.
... Unlike rule-based methods, which primarily filter out explicit noise from raw content, quality evaluation approaches offer greater robustness and flexibility. These approaches leverage models such as logistic regression [19], BERT [20], FastText [21], and others to calculate probability scores for each text. Based on these scores, texts are classified as positive or negative according to a predefined threshold. ...
Preprint
During the development of large language models (LLMs), pre-training data play a critical role in shaping LLMs' capabilities. In recent years several large-scale and high-quality pre-training datasets have been released to accelerate the research of LLMs, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC and others. However, as LLMs continue to evolve, focus has increasingly shifted to domain-specific capabilities and safety concerns, making those previous coarse-grained texts insufficient for meeting training requirements. Furthermore, fine-grained information, such as quality, domain and toxicity, is becoming increasingly important in building powerful and reliable LLMs for various scenarios. To address these challenges, in this paper we propose a new tool-chain called MDFG-tool for constructing large-scale and high-quality Chinese datasets with multi-dimensional and fine-grained information. First, we employ manually crafted rules to discard explicit noisy texts from raw contents. Second, the quality evaluation model, domain classifier, and toxicity evaluation model are well-designed to assess the remaining cleaned data respectively. Finally, we integrate these three types of fine-grained information for each text. With this approach, we release the largest, high-quality and fine-grained Chinese text ChineseWebText2.0, which consists of 3.8TB and each text is associated with a quality score, domain labels, a toxicity label and a toxicity score, facilitating the LLM researchers to select data based on various types of fine-grained information. The data, codes and the tool-chain are available on this website https://github.com/CASIA-LM/ChineseWebText-2.0
... Figure 2 shows the comparison on the R8 dataset. [Table excerpt: per-dataset accuracies for PV-DM [21] (0.5207 / 0.4492 / 0.5947), fastText [22] (0.9613 / 0.9281 / 0.7514), SWEM [23] (0.9532 / 0.9294 / 0.7665), LEAM [24] (0.9331 / 0.9184 / 0.7695), Text-MGNN ES [25] (0.9739 / 0.942 / 0.7746), TF-IDF + LR [26] (0.9374 / 0.8695 / 0.7459), and CNN [27].] Table 3 displays the results for accuracy on the AGNews dataset in comparison with various models in a classification task; BERT+LIC+CL is the second most accurate model with an accuracy of 91.15, which is notably lower than PS, and BERT+LIC+HNM+CL follows closely behind, with an accuracy of 91.08. ...
Article
Full-text available
Text classification is a pivotal task within natural language processing (NLP), aimed at assigning semantic labels to text sequences. Traditional methods of text representation often fall short in capturing intricacies in contextual information, relying heavily on manual feature extraction. To overcome these limitations, this research work presents the sequential attention fusion architecture (SAFA) to enhance feature extraction. SAFA combines deep long short-term memory (LSTM) and a multi-head attention mechanism (MHAM). This model efficiently preserves data, even for longer phrases, while enhancing local attribute understanding. Additionally, we introduce a unique attention mechanism that optimizes data preservation, a crucial element in text classification. The paper also outlines a comprehensive framework, incorporating convolutional layers and pooling techniques, designed to improve feature representation and enhance classification accuracy. The model's effectiveness is demonstrated through 2-dimensional convolution processes and advanced pooling, significantly improving prediction accuracy. This research not only contributes to the development of more accurate text classification models but also underscores the growing importance of NLP techniques.
... We used the open-source library fasttext (Joulin et al., 2016), which covers 176 languages in the ISO code format, to detect the language of each top-level and embedded Tweet with text content. Code-switching and Tweets with unidentifiable languages (e.g., very short ones, emojis or links only, etc.) were ignored. ...
Article
Open Source Intelligence (OSINT) refers to intelligence efforts based on freely available data. It has become a frequent topic of conversation on social media, where private users or networks can share their findings. Such data is highly valuable in conflicts, both for gaining a new understanding of the situation as well as for tracking the spread of misinformation. In this paper, we present a method for collecting such data as well as a novel OSINT dataset for the Russo-Ukrainian war drawn from Twitter between January 2022 and July 2023. It is based on an initial search of users posting OSINT and a subsequent snowballing approach to detect more. The final dataset contains almost 2 million Tweets posted by 1040 users. We also provide some first analyses and experiments on the data, and make suggestions for its future usage.
... • FastText (Joulin et al., 2017): It is a text classification library, which can be used to intuitively demonstrate whether there are distributional differences between member data and non-member data. We train FastText with the member data and the non-member data of M, and use it to directly assess whether a code snippet is member or non-member data. ...
... We use fastText text classification (Joulin et al., 2016) for all experiments. FastText is a CPU-based library for efficient learning of word representations and sentence classification. ...
... 1. FastText (Joulin et al., 2016): FastText is a library released by Facebook which uses bag of words and bag of n-grams as features for text classification. It relies on capturing partial information about the local word order efficiently. ...
... • To select the most similar segments from a large corpus: MTUOC-corpuscombination. • To preprocess the parallel corpora to train the systems: For some cleaning operations the language of the segments should be detected. As Asturian, Aranese and Aragonese are underrepresented in available language detection models, we decided to develop our own language detection model using fasttext (Joulin et al., 2016). We trained a model able to detect the following languages: Aragonese, Aranese, Asturian, Catalan, English, French, Galician, Occitan, Portuguese and Spanish. ...
... Early approaches treat the ERD task as a simple single-sentence emotion recognition task, with no consideration of the historical information (Joulin et al., 2016; Chen et al., 2016; Yang et al., 2016; Chatterjee et al., 2019). ...
... Baselines We compare our models to multiple existing state-of-the-art text classification methods including TF-IDF+LR, fastText (Joulin et al., 2016), CNN (Le and Mikolov, 2014), LSTM (Liu et al., 2016), PTE (Tang et al., 2015), BERT (Devlin et al., 2018), TextGCN (Yao et al., 2019) and TextGAT. ...
... Other text representations could have been adopted as input to IS methods, such as static embeddings (e.g., FastText [32]) or contextualized embeddings built by transformer architectures (whether by forwarding documents through a fine-tuned model or via a zero-shot approach). However, as previously demonstrated in [3,11,14]: (i) static embeddings can slow down classification methods significantly; and (ii) using contextualized embeddings directly as IS input is inefficient and ineffective, probably because it sacrifices sparsity. ...
Article
Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially when dealing with ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computationally and financially costly, requiring substantial energy consumption and contributing to carbon dioxide emissions. This article focuses on advancing the state-of-the-art (SOTA) on instance selection (IS) – a range of document filtering techniques designed to select the most representative documents for the sake of training. The objective is to either maintain or enhance classification effectiveness while reducing the total training (fine-tuning) processing time. In our prior research, we introduced the E2SC framework, a redundancy-oriented IS method focused on transformers and large datasets – currently the state-of-the-art in IS. Nonetheless, important research questions remained unanswered in our previous work, mostly due to E2SC's sole emphasis on redundancy. In this article, we take our research a step further by proposing biO-IS – an extended bi-objective instance selection solution, a novel IS framework aimed at simultaneously removing redundant and noisy instances from the training set. biO-IS estimates redundancy based on scalable, fast, and calibrated weak classifiers and captures noise with the support of a new entropy-based step. We also propose a novel iterative process to estimate near-optimum reduction rates for both steps. Our extended solution is able to reduce the training sets by 41% on average (up to 60%) while maintaining effectiveness in all tested datasets, with speedup gains of 1.67x on average (up to 2.46x). No other baseline, not even our previous SOTA solution, was capable of achieving results with this level of quality, considering the tradeoff among training reduction, effectiveness, and speedup. To ensure reproducibility, our documentation, code, and datasets can be accessed on GitHub – https://github.com/waashk/bio-is.
Article
Social media platforms have become central arenas for public discourse, enabling the exchange of ideas and information among diverse user groups. However, the rise of echo chambers, where individuals reinforce their existing beliefs through repeated interactions with like-minded users, poses significant challenges to the democratic exchange of ideas and the potential for polarization and information disorder. This paper presents a comparative analysis of the main metrics that have been proposed in the literature for echo chamber detection, with a focus on their application in a cross-platform scenario constituted by the two major social media platforms, i.e., Twitter (now renamed X) and Reddit. The echo chamber detection metrics considered encompass network analysis, content analysis, and hybrid solutions. The findings of this work shed light on the unique dynamics of echo chambers present on the two social media platforms, while also highlighting the strengths and limitations of various metrics employed to identify them, and their transversality to the different social graph modeling and domains considered.
Preprint
Recurrent neural networks have shown remarkable success in modeling sequences. However, low-resource situations still adversely affect the generalizability of these models. We introduce a new family of models, called Lattice Recurrent Units (LRU), to address the challenge of learning deep multi-layer recurrent models with limited resources. LRU models achieve this goal by creating distinct (but coupled) flows of information inside the units: a first flow along the time dimension and a second flow along the depth dimension. It also offers symmetry in how information can flow horizontally and vertically. We analyze the effects of decoupling three different components of our LRU model: Reset Gate, Update Gate and Projected State. We evaluate this family of new LRU models on computational convergence rates and statistical efficiency. Our experiments are performed on four publicly-available datasets, comparing with Grid-LSTM and Recurrent Highway networks. Our results show that LRU has better empirical computational convergence rates and statistical efficiency values, along with learning more accurate language models.
Article
In the digital security environment, the obfuscation and encryption of malicious scripts are primary attack methods used to evade detection. These scripts—easily spread through websites, emails, and file downloads—can be automatically executed on users' systems, posing serious security threats. To overcome the limitations of signature-based detection methods, this study proposed a methodology for real-time detection of obfuscated and encrypted malicious scripts using ML/DL models with feature optimization techniques. The obfuscated script datasets were analyzed to identify their unique characteristics, classified into 16 feature sets, to evaluate the optimal features for the best detection accuracy. Although the detection accuracy on these datasets was below 20% when tested with commercial antivirus services, the experimental results using ML and DL models demonstrated that the proposed light gradient boosting model (LGBM) could achieve the best detection accuracy and processing speed. The LGBM outperformed other artificial intelligence models by achieving 97% accuracy and the minimum processing time in the decoded, obfuscated, and encrypted dataset cases.
Article
Full-text available
Large language models (LLMs) have transformed Natural Language Processing (NLP) by enabling robust text generation and understanding. However, their deployment in sensitive domains like healthcare, finance, and legal services raises critical concerns about privacy and data security. This paper proposes a comprehensive framework for embedding trust mechanisms into LLMs to dynamically control the disclosure of sensitive information. The framework integrates three core components: User Trust Profiling, Information Sensitivity Detection, and Adaptive Output Control. By leveraging techniques such as Role-Based Access Control (RBAC), Attribute-Based Access Control (ABAC), Named Entity Recognition (NER), contextual analysis, and privacy-preserving methods like differential privacy, the system ensures that sensitive information is disclosed appropriately based on the user's trust level. By focusing on balancing data utility and privacy, the proposed solution offers a novel approach to securely deploying LLMs in high-risk environments. Future work will focus on testing this framework across various domains to evaluate its effectiveness in managing sensitive data while maintaining system efficiency.
Article
Semantic textual analysis is a natural language processing task that has enjoyed several research contributions towards solving diverse real-life problems. Vector comparison is a core subtask in semantic textual similarity analysis. A plethora of solutions, including recent state-of-the-art transformer-based pre-trained language models for transfer learning, have focused on using only cosine similarity for embedding evaluation in downstream tasks and ignored other vector comparison methods. To investigate the relative performance of some such ignored measures, this work proposes novel adaptations for the soft cosine and extended cosine vector measures. We investigate their performance against the conventional cosine measure, distance-weighted cosine, vector similarity measure, and negative Manhattan and Euclidean distances on downstream semantic textual similarity tasks, under the same conditions, for the first time in the literature. Adopting the transformer-based Universal Sentence Encoder, SBERT, SRoBERTa, SimCSE, and ST5 for text encoding, the performances of the adapted measures are evaluated on diverse real-world datasets using Pearson, Spearman, accuracy and F1 evaluation metrics. Results obtained show that the adapted measures significantly surpass previously reported state-of-the-art cosine similarity-based correlations in several test cases considered.
Article
Full-text available
An increasing number of people are suffering from depression due to rising chronic stress levels. With the advent of Web 2.0, individuals are more inclined to express their emotions on social media, offering new opportunities for depression prediction. Researchers have developed various single-modal methods for early-stage depression prediction. Recently, multimodal social media data has been utilized to enhance the accuracy of depression detection methods. These methods primarily extract multidimensional information such as text, language, and images from social media users, integrating these diverse modes to assess the risk or severity of depression. This approach significantly improves the precision of depression prediction. However, the research is still in its early stages, with challenges such as limited datasets and many areas requiring further improvement. To aid researchers in better understanding and refining multimodal approaches, we conducted a review that summarizes emerging research directions in using multimodal techniques for depression prediction on social media. Additionally, this review compares different depression detection methods, datasets, and the various modalities used in multimodal approaches, analyzing their strengths and limitations. Finally, it offers suggestions for future research.
Conference Paper
Full-text available
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
Article
Full-text available
This article demonstrates that we can apply deep learning to text understanding from character-level inputs all the way up to abstract text concepts, using temporal convolutional networks (ConvNets). We apply ConvNets to various large-scale datasets, including ontology classification, sentiment analysis, and text categorization. We show that temporal ConvNets can achieve astonishing performance without the knowledge of words, phrases, sentences and any other syntactic or semantic structures with regard to a human language. Evidence shows that our models can work for both English and Chinese.
Article
Full-text available
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
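The same LIBLINEAR solver is exposed through scikit-learn, so a hedged sketch of the kind of large-scale linear text classification it targets can look like the following; the toy corpus stands in for the large sparse datasets the abstract refers to.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC  # backed by the LIBLINEAR solver

docs = ["cheap pills online", "meeting moved to friday",
        "win a free prize now", "lunch at noon tomorrow"]
labels = [1, 0, 1, 0]  # toy labels: 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(docs)  # large sparse matrix in practice
clf = LinearSVC(C=1.0).fit(X, labels)
print(clf.predict(X[:2]))
```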
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Article
Recent trends suggest that neural-network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.
Article
Document classification tasks were primarily tackled at word level. Recent research that works with character-level inputs shows several benefits over word-level approaches, such as natural incorporation of morphemes and better handling of rare words. We propose a neural network architecture that utilizes both convolution and recurrent layers to efficiently encode character inputs. We validate the proposed model on eight large-scale document classification tasks and compare it with character-level convolution-only models. It achieves comparable performance with far fewer parameters.
Article
We created the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) in 2014 as part of the Yahoo Webscope program, which is a reference library of interesting and scientifically useful datasets. The YFCC100M is the largest public multimedia collection ever released, with a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all uploaded to Flickr between 2004 and 2014 and published under a CC commercial or noncommercial license. The dataset is distributed through Amazon Web Services as a 12.5GB compressed archive containing only metadata. However, as with many datasets, the YFCC100M is constantly evolving; over time, we have released and will continue to release various expansion packs containing data not yet in the collection; for instance, the actual photos and videos, as well as several visual and aural features extracted from the data, have already been uploaded to the cloud, ensuring the dataset remains accessible and intact for years to come. The YFCC100M dataset overcomes many of the issues affecting existing multimedia datasets in terms of modalities, metadata, licensing, and, principally, volume.
Conference Paper
Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, features used and task/dataset. We show that: (i) the inclusion of word bigram features gives consistent gains on sentiment analysis tasks; (ii) for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds); (iii) a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. Based on these observations, we identify simple NB and SVM variants which outperform most published results on sentiment analysis datasets, sometimes providing a new state-of-the-art performance level.
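The SVM variant using NB log-count ratios as feature values (item iii above) can be sketched in a few lines. The sketch below assumes binary labels and binarized bag-of-words counts and is an illustration of the idea rather than the paper's exact experimental setup.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

def nb_log_count_ratio(Xb, y, alpha=1.0):
    """NB log-count ratio r over binarized features Xb and binary labels y."""
    p = alpha + np.asarray(Xb[y == 1].sum(axis=0)).ravel()
    q = alpha + np.asarray(Xb[y == 0].sum(axis=0)).ravel()
    return np.log((p / p.sum()) / (q / q.sum()))

# Toy document-term counts (4 docs, 3 terms) with binary sentiment labels.
X = csr_matrix(np.array([[2, 0, 1], [0, 1, 0], [1, 0, 3], [0, 2, 0]]))
y = np.array([1, 0, 1, 0])

Xb = (X > 0).astype(np.float64)   # binarized presence features (a common choice)
r = nb_log_count_ratio(Xb, y)     # one ratio per term
X_nb = Xb.multiply(r).tocsr()     # features rescaled by the NB ratios
clf = LinearSVC(C=1.0).fit(X_nb, y)
```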
Article
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
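A compact sketch of the procedure this abstract describes — build a term-document matrix, reduce it with a truncated SVD, fold queries in as pseudo-documents, and rank documents by cosine — follows; the toy corpus and the two-component SVD stand in for the roughly 100 factors mentioned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors worried about the markets",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # document-term matrix
svd = TruncatedSVD(n_components=2)   # ~100 factors in the article; 2 for this toy corpus
X_lsa = svd.fit_transform(X)         # documents as vectors of factor weights

query = vec.transform(["falling stock prices"])
q_lsa = svd.transform(query)         # query folded in as a pseudo-document
scores = cosine_similarity(q_lsa, X_lsa).ravel()
print(scores.argsort()[::-1])        # documents ranked by cosine with the query
```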
Conference Paper
We describe a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense (grammatically and semantically) using a language model. The entire network is trained jointly on all these tasks using weight-sharing, an instance of multitask learning. All the tasks use labeled data except the language model, which is learnt from unlabeled text and represents a novel form of semi-supervised learning for the shared tasks. We show how both multitask learning and semi-supervised learning improve the generalization of the shared tasks, resulting in state-of-the-art performance.
Conference Paper
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method, called WSABIE, both outperforms several baseline methods and is faster and consumes less memory.
Article
An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.
Article
We present a system and a set of techniques for learning linear predictors with convex losses on terascale datasets, with trillions of features (the number of features here refers to the number of non-zero entries in the data matrix), billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, up to our knowledge, the most scalable and efficient linear learning system reported in the literature (as of 2011 when our experiments were conducted). We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.
Conference Paper
Maximum entropy models are considered by many to be one of the most promising avenues of language modeling research. Unfortunately, long training times make maximum entropy research difficult. We present a speedup technique: we change the form of the model to use classes. Our speedup works by creating two maximum entropy models, the first of which predicts the class of each word, and the second of which predicts the word itself. This factoring of the model leads to fewer nonzero indicator functions, and faster normalization, achieving speedups of up to a factor of 35 over one of the best previous techniques. It also results in typically slightly lower perplexities. The same trick can be used to speed training of other machine learning techniques, e.g. neural networks, applied to any problem with a large number of outputs, such as language modeling
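In formula form, the class-based factoring described here is (a hedged restatement, with c(w) denoting the single class assigned to word w):

    P(w | h) = P(c(w) | h) * P(w | c(w), h)

Normalizing the two factors touches on the order of |C| + |V_c(w)| outputs instead of |V|; with classes of roughly comparable size, this sum is smallest when |C| is near the square root of the vocabulary size, which is where the large reported speedups in normalization come from.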
Conference Paper
The representation of documents and queries as vectors in a high-dimensional space is well-established in information retrieval. The author proposes that the semantics of words and contexts in a text be represented as vectors. The dimensions of the space are words and the initial vectors are determined by the words occurring close to the entity to be represented, which implies that the space has several thousand dimensions (words). This makes the vector representations (which are dense) too cumbersome to use directly. Therefore, dimensionality reduction by means of a singular value decomposition is employed. The author analyzes the structure of the vector representations and applies them to word sense disambiguation and thesaurus induction
Text categorization with support vector machines: Learning with many relevant features
  • Thorsten Joachims
Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, 10th European Conference on Machine Learning, pages 137-142, Chemnitz, Germany. Springer Berlin Heidelberg.
Convolutional neural networks for sentence classification
  • Yoon Kim
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751, Doha, Qatar, October. Association for Computational Linguistics.
A comparison of event models for naive bayes text classification
  • Andrew Mccallum
  • Kamal Nigam
Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for naive bayes text classification. In AAAI workshop on learning for text categorization, pages 41-48, Madison, USA.
Strategies for training large scale neural network language models
  • Tomáš Mikolov
  • Anoop Deoras
  • Daniel Povey
  • Lukáš Burget
  • Jan Černocký
Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011. Strategies for training large scale neural network language models. In Workshop on Automatic Speech Recognition and Understanding, pages 196–201, Waikoloa, USA. IEEE.
Very deep convolutional networks for natural language processing
  • Alexis Conneau
  • Holger Schwenk
  • Loïc Barrault
  • Yann Lecun
Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781.