Article

Text Categorization with Support Vector Machines: Learning with Many Relevant Features

Author: Thorsten Joachims

... To classify textual data, there are automated approaches in the literature, such as Naive Bayes (Raschka 2014), Support Vector Machines (Joachims 2005), and Deep Learning (Minaee et al. 2021). Since we lack a training set, non-generalizable learning approaches from the literature (Raschka 2014; Joachims 2005; Minaee et al. 2021) are not practical in our case. However, Large Language Models (LLMs) have shown strong generalization to diverse downstream tasks (OpenAI 2023; Zheng et al. 2023), and manual classification of individual models (25,866 models) is not feasible in our context. ...
Article
Full-text available
Business process management entails a multi-billion-dollar industry that is founded on modeling business processes to analyze, understand, improve, and automate them. Business processes consist of a set of interconnected activities that an organization follows to achieve its goals and objectives. While the existence of business process models in open source has been reported in the literature, there is little work in characterizing their landscape. This paper presents the first characterization of business process models in open source, particularly on GitHub. The landscape is formed by 25,866 business process models across 4,954 repositories, with 16% of the repositories belonging to organizations. We discover that models belong to at least 16 domains including traditional software, machine learning, sales, business services, and financial services. These models are created using at least 28 different tools. Our exploration into cloning among the models shows that about 90% of all models are clones of each other. Application domains such as machine learning, traditional software, and business services demonstrate a higher occurrence of clones while in another dimension, clones are found across more repositories owned by industry as compared to those owned by academia. Also, contrary to code clones, we find that the majority of process model cloning occurs across multiple repositories. While our study acts as a precursor for future efforts to develop effective modeling practices in the field of business processes, it also emphasizes the need to address cloning and its implications in the context of reuse, maintenance, and modeling approaches.
... SVM is often regarded as the classifier with the highest accuracy in text classification problems, including sentiment analysis (Basari et al., 2013). In his paper, Joachims (2005) provides theoretical and experimental evidence that SVMs are well suited to text data. One reason is that text data typically involves a very large number of features (more than 10,000), and SVMs tend not to depend on the dimensionality of the data. ...
Article
Abstract: The Indonesian national football team has often failed to compete in major international tournaments. Public sentiment toward the national team's achievements, as expressed on Twitter, can be used as one way to assess the development of football in Indonesia. This study aims to evaluate the performance of the Support Vector Machine (SVM) method with Word2Vec and FastText word-embedding feature extraction for sentiment analysis related to the Indonesian national football team. The data consist of tweets about the team's participation in the AFF Cup in 2018, 2020, and 2022, collected by crawling. The SVM method begins with preprocessing and feature-extraction stages; feature extraction in this study uses Word2Vec and FastText word embeddings. The SVM method achieved a best accuracy of 84%, precision of 82%, recall of 81%, and an F1 score of 81%. FastText performed slightly better than Word2Vec for feature extraction in SVM-based sentiment analysis; the difference is that FastText can recognize words that do not appear in the corpus, whereas Word2Vec cannot. The best model was obtained using FastText word embeddings with the Skip-gram model. Keywords: sentiment analysis, Support Vector Machine (SVM), Twitter, FastText, Word2Vec.
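The abstract does not spell out the full pipeline, so the sketch below only illustrates the embedding-averaging plus SVM idea it describes; the embedding table is a random stand-in, where a real pipeline would load trained Word2Vec or FastText vectors.

```python
# Minimal sketch: averaged word embeddings fed to an SVM, in the spirit of
# the Word2Vec/FastText + SVM pipeline described above. The embedding table
# is a random stand-in for trained vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
vocab = {"menang": 0, "kalah": 1, "hebat": 2, "buruk": 3, "timnas": 4}
emb = rng.normal(size=(len(vocab), 50))  # stand-in for trained embeddings

def tweet_vector(tokens):
    """Average the embeddings of in-vocabulary tokens (zeros if none)."""
    vecs = [emb[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(emb.shape[1])

tweets = [["timnas", "hebat", "menang"], ["timnas", "buruk", "kalah"]]
labels = [1, 0]  # 1 = positive, 0 = negative
X = np.array([tweet_vector(t) for t in tweets])

clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict([tweet_vector(["menang", "hebat"])]))
```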
... Phrases are classified as named entities if the two feature vectors are similar. • Machine Learning Based: The benefit of SVMs is that they can deal with high-dimensional feature spaces and dense feature vectors, making them ideally suited to text classification tasks [32]. Roberts et al. [33] have proposed a model that uses support vector machines with manual features to extract anatomical location data from radiology reports. ...
Preprint
Biomedical Named Entity Recognition (NER) is a fundamental task of Biomedical Natural Language Processing for extracting relevant information from biomedical texts, such as clinical records, scientific publications, and electronic health records. Conventional approaches for biomedical NER mainly use traditional machine learning techniques, such as Conditional Random Fields and Support Vector Machines, or deep learning-based models like Recurrent Neural Networks and Convolutional Neural Networks. Recently, Transformer-based models, including BERT, have been used in the domain of biomedical NER and have demonstrated remarkable results. However, these models are often based on word-level embeddings, limiting their ability to capture character-level information, which is effective in biomedical NER due to the high variability and complexity of biomedical texts. To address these limitations, this paper proposes a hybrid approach that integrates the strengths of multiple models: fine-tuned BERT to provide contextualized word embeddings, a pre-trained multi-channel CNN to capture character-level information, followed by a BiLSTM + CRF for sequence labelling and modelling dependencies between the words in the text. In addition, we also propose an enhanced labelling method as part of pre-processing to improve the identification of an entity's beginning word and thus improve the identification of multi-word entities, a common challenge in biomedical NER. By integrating these models and the pre-processing method, our proposed model effectively captures both contextual information and detailed character-level information. We evaluated our model on the benchmark i2b2/2010 dataset, achieving an F1-score of 90.11. These results illustrate the proficiency of our proposed model in performing biomedical Named Entity Recognition.
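In contrast to the transformer hybrid above, classic SVM-based NER of the kind the excerpt cites works on hand-crafted token features; a minimal sketch follows, with toy features and labels that are illustrative rather than those of the cited works.

```python
# Illustrative sketch: token-level hand-crafted features with a linear SVM,
# in the spirit of classic SVM-based NER. Features, sentence, and BIO labels
# are toy examples, not those of the cited works.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_capitalized": t[0].isupper(),
        "has_digit": any(c.isdigit() for c in t),
        "suffix3": t[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }

sent = ["Aspirin", "reduced", "pain", "in", "the", "left", "knee"]
tags = ["B-DRUG", "O", "O", "O", "O", "B-LOC", "I-LOC"]  # toy BIO labels
X = [token_features(sent, i) for i in range(len(sent))]

model = make_pipeline(DictVectorizer(), LinearSVC()).fit(X, tags)
print(model.predict([token_features(sent, 0)]))
```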
... This technique is widely used in NLP tasks because it emphasizes the most informative tokens while minimizing common but meaningless ones. Several studies, including [27], have demonstrated TF-IDF's effectiveness in text classification tasks. ...
Article
Full-text available
Software development is significantly impeded by flaky tests, which intermittently pass or fail without requiring code modifications, resulting in a decline in confidence in automated testing frameworks. Code smells (i.e., in test cases or production code) are a primary cause of test flakiness. In order to ascertain the prevalence of test smells, researchers and practitioners have examined numerous programming languages; however, prior experiments were typically conducted in isolation, each focusing on a single programming language. Across a variety of programming languages, such as Java, Python, C++, Go, and JavaScript, this study examines the predictive accuracy of a variety of machine learning classifiers in identifying flaky tests. We compare the performance of classifiers such as Random Forest, Decision Tree, Naive Bayes, Support Vector Machine, and Logistic Regression in both single-language and cross-language settings. In order to ascertain the impact of linguistic diversity on the flakiness of test cases, models were trained on a single language and subsequently tested on a variety of languages. The key findings indicate that Random Forest and Logistic Regression consistently outperform other classifiers in terms of accuracy, adaptability, and generalizability, particularly in cross-language environments. Additionally, the investigation contrasts our findings with those of previous research, exhibiting enhanced precision and accuracy in the identification of flaky tests as a result of meticulous classifier selection. We conducted a thorough statistical analysis, which included t-tests, to assess the importance of classifier performance differences in terms of accuracy and F1-score across a variety of programming languages. This analysis emphasizes the substantial discrepancies between classifiers and their effectiveness in detecting flaky tests. The datasets and experiment code utilized in this study are accessible through an open source GitHub repository to facilitate reproducibility. Our results emphasize the effectiveness of probabilistic and ensemble classifiers in improving the reliability of automated testing, despite certain constraints, including the potential biases introduced by language-specific structures and dataset variability. This research provides developers and researchers with practical insights that can be applied to the mitigation of flaky tests in a variety of software environments.
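The excerpt above credits TF-IDF with emphasizing informative tokens over common ones; a minimal, self-contained sketch of that behavior (the documents are illustrative):

```python
# Minimal TF-IDF sketch: rare, informative tokens receive higher weights
# than tokens that appear in every document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "test passed on the ci server",
    "test failed intermittently on the ci server",
    "test passed after retry on the ci server",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# In document 1, 'intermittently' (unique to it) outweighs ubiquitous 'test'.
weights = dict(zip(vec.get_feature_names_out(), X.toarray()[1]))
print(sorted(weights.items(), key=lambda kv: -kv[1])[:3])
```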
... Among these opportunities is the possibility of using classifiers other than SVMs. SVMs were chosen due to their ability to perform well with high-dimensional feature spaces; they are also particularly suited to handling imbalanced datasets, as they maximize the margin between classes rather than optimizing for overall accuracy [20,21]. This characteristic is crucial for our application, where capillary stalling events are significantly outnumbered by normal blood flow events. ...
Article
Full-text available
Capillary stalling has emerged as an important mechanistic and potential therapeutic target in mouse models of several neurological disorders. Time-series optical coherence tomography angiography (OCTA) has been used to rapidly detect capillary stalling over hundreds of capillaries in 3D and can be used to study this phenomenon in a research setting. However, existing methods for quantifying capillary stalls are labor-intensive, prone to errors, and may be limited by their reliance on 2D representations of inherently 3D data. To address these limitations, we developed a computational approach based on a support vector machine (SVM) trained on engineered features pertaining to OCTA time-series data. When evaluated with 4-fold cross-validation, the final classifier achieved a receiver operating characteristic (ROC) area under the curve (AUC) of 0.978 (baseline: 0.5) and a precision-recall (PR) AUC of 0.700 (baseline: 0.013). It also reduced the amount of time required to annotate from 1 hour to 22 minutes per dataset and detected an average of 8.1 stalling segments in each dataset that were missed by expert annotations, which amounted to 26% of all stalling segments. To demonstrate the utility of our tool, we measured the morphological properties of capillaries and found that stalling segments are significantly smaller in diameter, more tortuous, and longer than non-stalling segments. These findings highlight the algorithm's potential to uncover morphological patterns associated with stalling and facilitate comparative studies across experimental conditions. To support further research, the tool is freely available as open-source software for use by the scientific community.
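A minimal sketch of the imbalance-aware setup described above, assuming scikit-learn: an SVM with balanced class weights scored by ROC AUC and precision-recall AUC; the synthetic data stands in for the engineered OCTA time-series features.

```python
# Sketch: SVM with balanced class weights on a heavily skewed synthetic set,
# evaluated by ROC AUC and PR AUC as in the study above.
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# ~1.3% positives, echoing the 0.013 PR baseline quoted in the abstract.
X, y = make_classification(n_samples=2000, weights=[0.987], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(class_weight="balanced", probability=True).fit(Xtr, ytr)
scores = clf.predict_proba(Xte)[:, 1]
print("ROC AUC:", roc_auc_score(yte, scores))
print("PR AUC :", average_precision_score(yte, scores))
```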
... Random forests, which belong to the family of ensemble learning techniques, are quite useful in supervised learning due to their high accuracy and efficiency in handling large datasets with many features [10]. Support vector machines (SVMs) are highly efficient for classification and regression, particularly in handling numerous features and complex nonlinear relationships between features and targets [11]. Further mathematical treatments describe how SVMs, guided by their support vectors, maximize the full width of the margin that separates the classes [12]. ...
Article
Full-text available
OBJECTIVE: Obesity is a global health problem. The aim is to analyze the effectiveness of machine learning models in predicting obesity classes and to determine which model performs best in obesity classification. METHODS: We used a dataset with 2,111 individuals categorized into seven groups based on their body mass index, ranging from average weight to class III obesity. Our classification models were trained and tested using demographic information like age, gender, and eating habits without including height and weight variables. RESULTS: The study demonstrated that when trained on demographic information, machine learning can classify body mass index. The random forest model provided the highest performance scores among all the classification models tested in this research. CONCLUSION: Machine learning methods have the potential to be used more extensively in the classification of obesity and in more effective efforts to combat obesity.
... Polynomial functions, RBF functions, hyperbolic tangents, and other suitable functions can be chosen as the kernel function [13]. Table 1 presents a literature review of classification techniques used for different languages. ...
Article
In today's environment, where data are easily accessible, plagiarism is a pervasive problem; hence, a system for identifying and controlling it is crucial. Numerous approaches exist for this purpose across a variety of languages, but they are insufficient for literature written in the Marathi language. Plagiarism detection is a critical aspect of maintaining academic integrity and ensuring the originality of content in various languages. Detecting plagiarism in languages with relatively little computational research, such as Marathi, presents unique challenges due to their complex linguistic structure, syntax, and morphology. This paper explores a machine learning-based approach for efficient plagiarism detection specifically tailored to the Marathi language. We introduce a machine learning-based plagiarism detection method in this research study, utilising Naive Bayes, SVM, and artificial neural network learning techniques. The SVM experiments showed an average accuracy of 90%, while the Naive Bayes experiments showed an average accuracy of 71%. Experiments employing a neural network for Marathi-language plagiarism detection reported an average accuracy of 95%. The results demonstrate that the proposed approach can effectively detect plagiarism in Marathi texts, offering a promising tool for researchers, educators, and content creators to uphold content authenticity and originality.
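The kernel choices named in the excerpt above map directly onto standard SVM implementations; a brief, illustrative comparison on toy data (dataset and scores are not from the cited works):

```python
# The kernels mentioned above correspond to scikit-learn's SVC options:
# polynomial, RBF, and sigmoid (hyperbolic tangent).
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ["poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale")
    print(kernel, cross_val_score(clf, X, y, cv=5).mean())
```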
... One of the earliest and most influential contributions to this field was presented by Joachims [32], who introduced a model-building approach using Support Vector Machines (SVMs) combined with high-dimensional sparse text representations. This pioneering work laid the foundation for subsequent research in hierarchical text classification, emphasizing the importance of efficient feature representation in handling large and complex datasets. ...
Article
Full-text available
Hierarchical classification, which organizes items into structured categories and subcategories, has emerged as a powerful solution for handling large and complex datasets. However, traditional flat classification approaches often overlook the hierarchical dependencies between classes, leading to suboptimal predictions and limited interpretability. This paper addresses these challenges by proposing a novel integration of tree-based models with hierarchical-aware split criteria through adjusted entropy calculations. The proposed method calculates entropy at multiple hierarchical levels, ensuring that the model respects the taxonomic structure during training. This approach aligns statistical optimization with class semantic relationships, enabling more accurate and coherent predictions. Experiments conducted on real-world datasets structured according to the GS1 Global Product Classification (GPC) system demonstrate the effectiveness of our method. The proposed model was applied using tree-based ensemble methods combined with the newly developed hierarchy-aware metric Penalized Information Gain (PIG). PIG was implemented with level-wise entropy adjustments, assigning greater weight to higher hierarchical levels to maintain the taxonomic structure. The model was trained and evaluated on two real-world datasets based on the GS1 Global Product Classification (GPC) system. The final dataset included approximately 30,000 product descriptions spanning four hierarchical levels. An 80-20 train–test split was used, with model hyperparameters optimized through 5-fold cross-validation and Bayesian search. The experimental results showed a 12.7% improvement in classification accuracy at the lowest hierarchy level compared to traditional flat classification methods, with significant gains in datasets featuring highly imbalanced class distributions and deep hierarchies. The proposed approach also increased the F1 score by 12.6%. Despite these promising results, challenges remain in scaling the model for very large datasets and handling classes with limited training samples. Future research will focus on integrating neural networks with hierarchy-aware metrics, enhancing data augmentation to address class imbalance, and developing real-time classification systems for practical use in industries such as retail, logistics, and healthcare.
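The abstract names the Penalized Information Gain metric but does not give its exact formula; the sketch below shows only the core ingredient it describes: entropy computed per hierarchy level, with larger weights at higher levels (the weights here are arbitrary assumptions).

```python
# Hedged sketch of a hierarchy-aware criterion in the spirit of the
# Penalized Information Gain described above. The exact PIG formula is not
# given; here entropy is computed per hierarchy level and higher levels
# receive larger weights (the weighting scheme is an assumption).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def level_weighted_entropy(paths, weights=(0.5, 0.3, 0.2)):
    """paths: label paths like ('Food', 'Dairy', 'Milk'); level 0 is the root."""
    return sum(w * entropy([p[lvl] for p in paths])
               for lvl, w in enumerate(weights))

samples = [("Food", "Dairy", "Milk"), ("Food", "Dairy", "Cheese"),
           ("Food", "Produce", "Apples"), ("Health", "OTC", "Aspirin")]
print(level_weighted_entropy(samples))
```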
... ML algorithms, including logistic regression (LR), naïve Bayes (NB), stochastic gradient descent (SGD), and convolutional neural networks (CNN), are used to assign sentiment to the reviews [20,21]. LR, a binary classification method, can also perform multi-class classification, while NB, based on Bayes' theorem, takes a Gaussian, Bernoulli, or multinomial form depending on the nature of the data [22-25]. Elmurngi et al. (2018) [26], after removing stop words, performed attribute or feature selection to identify a subset of relevant features for model construction [27,28]. ...
Article
Full-text available
In the era of digital commerce, understanding consumer opinions has become crucial for businesses aiming to tailor their products and services effectively. This study investigates acoustic quality diagnostics of the latest generation of AirPods. From this perspective, the work examines consumer sentiment using text mining and sentiment analysis techniques applied to product reviews, focusing on Amazon’s AirPods reviews. Using the naïve Bayes classifier, a probabilistic machine learning approach grounded in Bayes’ theorem, this research analyzes textual data to classify consumer reviews as positive or negative. Data were collected via web scraping, following ethical guidelines, and preprocessed to ensure quality and relevance. Textual features were transformed using term frequency-inverse document frequency (TF-IDF) to create input vectors for the classifier. The results reveal that naïve Bayes provides satisfactory performance in categorizing sentiment, with metrics such as accuracy, sensitivity, specificity, and F1-score offering insight into the model’s effectiveness. Key findings highlight the divergence in consumer perception across ratings, identifying sentiment drivers such as noise cancellation quality and product integration. These insights underline the potential of sentiment analysis in enabling companies to address consumer concerns, improve offerings, and optimize business strategies. The study concludes that such methodologies are indispensable for leveraging consumer feedback in the rapidly evolving digital marketplace.
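A minimal sketch of the naïve Bayes plus TF-IDF pipeline the abstract describes, with illustrative reviews in place of the scraped Amazon data:

```python
# Sketch: TF-IDF features feeding a multinomial naive Bayes sentiment
# classifier, as in the pipeline described above. Reviews are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["noise cancellation is superb", "great integration with my phone",
           "battery died within a week", "sound is muffled and tinny"]
labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(reviews, labels)
print(model.predict(["superb sound, great battery"]))
```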
... NN, a computing system modeled after the human brain's mesh-like network of interconnected neurons, is robust for classification and applies to a variety of domains, including webpage filtering (Lee et al., 2002; Lippman, 1987). SVM, an ML algorithm that minimizes structural risk in classification, has been successfully used for text categorization (Joachims, 1998) and webpage classification (Glover et al., 2002). Each stakeholder page is represented as a feature vector with 987 structural content features (binary variables indicating whether lexicon terms appear in the page title, extended anchor text, and full text) and 1297 textual content features (the frequency of occurrence of the selected features). ...
Article
Full-text available
Cryptocurrency is the most innovative financial and technological breakthrough of this generation. Investment in cryptocurrency grew from USD 11.18 billion in December 2016 to USD 2.147 trillion in April 2024; however, the rationality of investor exuberance is uncertain. This paper explores stakeholders' perceptions of cryptocurrency using a machine-learning approach based on artificial intelligence (AI). In particular, we employ a lexicon-based emotion-detection sentiment analysis to investigate stakeholder perceptions, using 2.3 million open-source data points. We divide the findings into positive, neutral, and negative stakeholder perception pillars based on factors such as trustworthiness in cryptocurrency, motives, cryptocurrency awareness and knowledge, ownership, socioeconomic characteristics of users, and usage. Our analysis reveals that 51 percent of the stakeholders have a positive perception of cryptocurrency, whereas 40 percent have a neutral perception and 9 percent a negative perception. After identifying the perceptions, we investigate the relationship between cryptocurrency prices and stakeholder perceptions using the autoregressive distributed lag (ARDL) framework with time-series data from August 2017 to July 2023. The long- and short-term results confirm that positive and negative perceptions have statistically significant effects on cryptocurrency prices. Individual investors comprise the largest share of those with a positive perception, as 54 percent have a positive view of cryptocurrency. Institutional investors, however, have the largest share of those with a neutral perception because of the lack of a well-established regulatory framework for cryptocurrency. Still, 39 percent of institutional investors hold a positive perception, a sign of a growing trend, as they are among the major investor groups with an interest in investing in crypto. Other stakeholders, such as the government, academia, and other miscellaneous groups, have a negative perception. Our results demonstrate that cryptocurrency has affected social change, social inclusion, and sustainability. Moreover, our findings offer social insights into crypto stakeholders' perceptions, informing the design of strategies to promote cryptocurrency and the establishment of a sustainable crypto ecosystem.
... Data-driven models, like machine learning, are becoming more popular in various fields, from language models (Brown et al., 2020) to facial detection and handwriting recognition (Cortes & Vapnik, 1995; Joachims, 1998). In multiphase flow applications, machine learning methods are being used for flow pattern detection (Al-Naser et al., 2016; Wang & Zhang, 2009) and for predicting pressure gradient, flow pattern, and holdup (Ghasemi & Rasheed, 2023; Kanin et al., 2019; Quintino et al., 2019). ...
... The multinomial variant is the most common for text categorisation problems, so this one will also be of high interest [14]. Support Vector Machines (SVM) have also been used in the past for text classification purposes with positive results, so these will be incorporated as well [15]. Previous work tends to avoid computationally expensive approaches such as K-Nearest Neighbours and Random Forests due to the size of the datasets used, but since the EmoBank dataset is not that large, it is worth investigating those models as well. ...
Preprint
Full-text available
Sentiment analysis of text plays a crucial role in various fields, particularly in marketing and customer service industries, where understanding subjective information from text data is essential. While existing sentiment analysis tools often focus on binary classifications of positive or negative sentiment, this study delves into the possibility of representing emotions using multiple dimensions. By exploring Ekman's six basic emotions and the Valence-Arousal-Dominance (VAD) structure, this research aims to investigate whether using more than one dimension to classify emotions is useful. Two datasets, Bag-of-Words and EmoBank, are analyzed, with EmoBank providing VAD values for 10,000 English sentences. Research questions focus on optimizing textual sentiment prediction and evaluating the utility of multi-dimensional emotion classification. Experimental investigations involve data pre-processing, model selection, and sampling tests to address dataset limitations and dependencies between variables. Findings suggest the potential for building more nuanced sentiment prediction models, with implications for improving sentiment analysis accuracy and understanding human emotions in text data.
... We performed question category identification using the Sequential Minimal Optimization (SMO) (Platt, 1998) method for Support Vector Machines (SVM) (Hearst, Dumais, Osman, Platt & Scholkopf, 1998). The kernel used for the SVM was polynomial of order 3. SVMs have been shown to outperform other existing methods (naïve Bayes, k-NN, and decision trees) in text categorization (Joachims, 1998). Their advantages are robustness and the elimination of the need for feature selection and parameter tuning. ...
Article
With the proliferation of social media into our daily lives, online communities have become an important platform for collaborative learning and education. To connect users with varying knowledge levels and increase the net learning throughput, these communities often follow a question-answer based approach. Understanding what drives attention to help-seeking questions can reduce the number of questions that go unnoticed or remain unanswered by the community. In this paper we discuss an important feature that affects the activity of the community, namely the community norms. We present a machine learning based trigger-driven feedback model that functions by (i) differentiating between help-seeking questions and follow-up posts, i.e., posts that are part of an ongoing discussion, and (ii) applying a dynamic intervention scheme to help improve question formulation. Our findings show that adhering to the community norms significantly increases the chance of eliciting a response.
... • R8 and R52 [37] are two well-established Reuters datasets widely used for news classification. R8 consists of 8 classes, including categories such as crude, grain, and trade, while R52 expands this to 52 classes, covering a broader range of topics. ...
Article
Full-text available
With the development of the information age, the emergence of massive amounts of text data has made effective text classification a critical challenge. Traditional classification methods often underperform or incur high computational costs when dealing with heterogeneous data, limited labeled data, or domain-specific data. To address these challenges, this paper proposes a novel text classification model, GZclassifier, designed to improve both accuracy and efficiency. GZclassifier employs two distinct compressors to handle information data and calculates distances in parallel to facilitate classification. This dual-compressor approach enhances the model’s ability to manage diverse and sparse data effectively. We conducted extensive experimental evaluations on a range of public datasets, including those with few-shot learning scenarios, to assess the proposed method’s performance. The results demonstrate that our model significantly outperforms traditional methods in terms of classification accuracy, robustness, and computational efficiency. The GZclassifier’s ability to handle limited labeled data and domain-specific contexts highlights its potential as an efficient solution for real-world text classification tasks. This study not only advances the field of text classification but also showcases the model’s practical applicability and benefits in various text processing scenarios.
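The GZclassifier's internals are only partly described in the abstract; the sketch below shows the broader compressor-distance idea such methods build on, classifying a document by its normalized compression distance (NCD) to labeled examples, with gzip as a single stand-in compressor:

```python
# Sketch of compressor-based classification: assign a document the label of
# its nearest labeled example under normalized compression distance (NCD).
# This is the general idea such classifiers build on, not the paper's
# exact dual-compressor method.
import gzip

def ncd(a: bytes, b: bytes) -> float:
    ca, cb = len(gzip.compress(a)), len(gzip.compress(b))
    cab = len(gzip.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

train = [(b"wheat prices rose on grain exports", "grain"),
         (b"crude futures slipped as opec met", "crude"),
         (b"grain harvest beat expectations", "grain")]

query = b"opec ministers discussed crude output"
label = min(train, key=lambda t: ncd(query, t[0]))[1]  # 1-NN by NCD
print(label)
```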
... The multinomial variant is the most common for text categorisation problems, so this one will also be of high interest [14]. Support Vector Machines (SVM) have also been used in the past for text classification purposes with positive results, so these will be incorporated as well [15]. Previous work tends to avoid computationally expensive approaches such as K-Nearest Neighbours and Random Forests due to the size of the datasets used, but since the EmoBank dataset is not that large, it is worth investigating those models as well. ...
Conference Paper
Full-text available
Sentiment analysis of text plays a crucial role in various fields, particularly in marketing and customer service industries, where understanding subjective information from text data is essential. While existing sentiment analysis tools often focus on binary classifications of positive or negative sentiment, this study delves into the possibility of representing emotions using multiple dimensions. By exploring Ekman's six basic emotions and the Valence-Arousal-Dominance (VAD) structure, this research aims to investigate whether using more than one dimension to classify emotions is useful. Two datasets, Bag-of-Words and EmoBank, are analyzed, with EmoBank providing VAD values for 10,000 English sentences. Research questions focus on optimizing textual sentiment prediction and evaluating the utility of multi-dimensional emotion classification. Experimental investigations involve data pre-processing, model selection, and sampling tests to address dataset limitations and dependencies between variables. Findings suggest the potential for building more nuanced sentiment prediction models, with implications for improving sentiment analysis accuracy and understanding human emotions in text data.
... The multi-label classification problem has gained much importance due to its rapidly increasing application areas. The application areas of multi-label classification include, but are not limited to, text categorization (Gonçalves and Quaresma 2003; Joachims 1998; Luo and Zincir-Heywood 2005; Tikk and Biró 2003; Yu et al. 2005), bioinformatics (Elisseeff and Weston 2001; Min-Ling and Zhi-Hua 2005), medical diagnosis (Karali and Pirnat 1991), image/scene and video categorization (Shen et al. 2003), genomics, map labeling (Zhu and Poon 1999), marketing, multimedia, emotion, and music categorization. In recent years, multi-label classification has drawn increased research attention due to the realization of the omnipresence of multi-label prediction tasks in several areas (Tsoumakas et al. 2010). ...
Preprint
In this paper, a high-speed online neural network classifier based on extreme learning machines for multi-label classification is proposed. In multi-label classification, each input data sample belongs to one or more of the target labels. Traditional binary and multi-class classification, where each sample belongs to only one target class, forms a subset of multi-label classification. Multi-label classification problems are far more complex than binary and multi-class classification problems, as both the number of target labels and each of the target labels corresponding to each input sample are to be identified. The proposed work exploits the high-speed nature of extreme learning machines to achieve real-time multi-label classification of streaming data. A new threshold-based online sequential learning algorithm is proposed for high-speed classification of streaming multi-label data. The proposed method is evaluated on six different datasets from application domains such as multimedia, text, and biology. The Hamming loss, accuracy, training time, and testing time of the proposed technique are compared with those of nine state-of-the-art methods. Experimental studies show that the proposed technique outperforms the existing multi-label classifiers in terms of performance and speed.
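A short sketch of the threshold-based multi-label prediction and the Hamming-loss metric named above; the score matrix is a stand-in for any model's per-label outputs (for example, an ELM's output layer):

```python
# Sketch: threshold per-label scores into multi-label predictions and
# evaluate with Hamming loss. Scores stand in for any model's outputs.
import numpy as np
from sklearn.metrics import hamming_loss

scores = np.array([[0.9, 0.2, 0.7],    # per-label scores for 3 samples
                   [0.1, 0.8, 0.4],
                   [0.6, 0.5, 0.1]])
y_true = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0]])

y_pred = (scores >= 0.5).astype(int)   # labels above threshold are assigned
print("Hamming loss:", hamming_loss(y_true, y_pred))
```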
... In recent years, the problem of multi-label classification has been gaining much importance, motivated by increasing application areas such as text categorization [1-5], marketing, music categorization, emotion, genomics, medical diagnosis [6], and image and video categorization. The recent realization of the omnipresence of multi-label prediction tasks in real-world problems has drawn increased research attention [7]. ...
Preprint
In this paper a high-speed neural network classifier based on extreme learning machines for the multi-label classification problem is proposed and discussed. Multi-label classification is a superset of traditional binary and multi-class classification problems. The proposed work extends the extreme learning machine technique to adapt to multi-label problems. As opposed to the single-label problem, both the number of labels a sample belongs to and each of those target labels are to be identified for multi-label classification, resulting in increased complexity. The proposed high-speed multi-label classifier is applied to six benchmark datasets comprising different application areas such as multimedia, text, and biology. The training time and testing time of the classifier are compared with those of the state-of-the-art methods. Experimental studies show that for all six datasets, our proposed technique has faster execution speed and better performance, thereby outperforming all the existing multi-label classification methods.
... With the increasing availability of text documents in electronic form, it is of great importance to label their contents with a predefined set of thematic categories in an automatic way, which is also known as automated Text Categorization. In recent decades, a growing number of advanced machine learning algorithms have been developed to address this challenging task by formulating it as a classification problem [1]-[5]. Commonly, an automatic text classifier is built with a learning process from a set of prelabeled documents. ...
Preprint
Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on Information Theory, which aims to rank the features by their discriminative capacity for classification. We first revisit two information measures: Kullback-Leibler divergence and Jeffreys divergence for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination (MD) and MD-χ² methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
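A hedged sketch of divergence-based feature ranking in this spirit: each binary feature is scored by the Jeffreys divergence (symmetrized KL) between its class-conditional Bernoulli distributions. The paper's JMH divergence generalizes such measures to multiple classes; the probabilities below are illustrative.

```python
# Sketch: rank binary features by Jeffreys divergence (symmetrized KL)
# between their class-conditional distributions. Higher score = more
# discriminative. Probabilities are illustrative toy values.
import numpy as np

def jeffreys(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) * np.log(p / q)))

# P(feature present | class), smoothed away from 0 and 1.
p_pos = np.array([0.70, 0.40, 0.05])   # per-feature rates, positive class
p_neg = np.array([0.10, 0.38, 0.60])   # per-feature rates, negative class

for i, (a, b) in enumerate(zip(p_pos, p_neg)):
    score = jeffreys([a, 1 - a], [b, 1 - b])  # Bernoulli distributions
    print(f"feature {i}: {score:.3f}")
```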
... Although these methods are effective in representing local discriminative features, they lack directional information. Even though bag-of-words (BoW) [9] and bag-of-features [10] are effective in resolving this issue, the amount of structural information they capture still falls short. ...
Preprint
Biologically inspired model (BIM) for image recognition is a robust computational architecture which has attracted widespread attention. BIM can be described as a four-layer structure based on the mechanisms of the visual cortex. Although the performance of BIM for image recognition is robust, it selects patches at random, which is unguided and results in a heavy computing burden. To address this issue, we propose a novel patch selection method with oriented Gaussian-Hermite moments (PSGHM), and we enhance BIM based on the proposed PSGHM, named PBIM. In contrast to the conventional BIM, which randomly selects patches within the feature representation layers processed by multi-scale Gabor filter banks, the proposed PBIM uses PSGHM to extract a small number of representative features while offering promising distinctiveness. To show the effectiveness of the proposed PBIM, experimental studies on object categorization are conducted on the CalTech05, TU Darmstadt (TUD), and GRAZ01 databases. Experimental results demonstrate that the performance of PBIM is a significant improvement over that of the conventional BIM.
... This setting, where the input data correspond to a set of class labels instead of one, is called multi-label classification. Initially, the application of multi-label classification focused primarily on text categorization [1-5] and medical diagnosis [6]. But the recent realization of the omnipresence of multi-label prediction tasks in real-world problems has drawn more and more research attention to this domain [7]. ...
Preprint
In this paper, an Extreme Learning Machine (ELM) based technique for multi-label classification problems is proposed and discussed. In multi-label classification, each input data sample belongs to one or more class labels. Traditional binary and multi-class classification problems are a subset of the multi-label problem, with the number of labels corresponding to each sample limited to one. The proposed ELM-based multi-label classification technique is evaluated on six different benchmark multi-label datasets from domains such as multimedia, text, and biology. A detailed comparison of the results is made against nine state-of-the-art techniques using five different evaluation metrics. The nine methods are chosen from different categories of multi-label methods. The comparative results show that the proposed Extreme Learning Machine based multi-label classification technique is a better alternative than the existing state-of-the-art methods for multi-label problems.
... SVM classifiers built by the SVM algorithm have been applied to credit risk analysis [3], medical diagnostics [4], handwritten character recognition [5], text categorization [6], information extraction [7], pedestrian detection [8], face detection [9], etc. ...
Preprint
Full-text available
The problem of developing an SVM classifier based on modified particle swarm optimization is considered. This algorithm carries out a simultaneous search for the kernel function type, the values of the kernel function parameters, and the value of the regularization parameter of the SVM classifier. Such an SVM classifier provides high-quality data classification. The idea of particle «regeneration» forms the basis of the modified particle swarm optimization algorithm: some particles change their kernel function type to the one corresponding to the particle with the best classification accuracy. The proposed particle swarm optimization algorithm reduces the time required to develop the SVM classifier. The results of experimental studies confirm the efficiency of this algorithm.
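The modified PSO itself is not reproduced here; as a simplified stand-in, the sketch below searches the same space the abstract describes (kernel function type, kernel parameters, and the regularization parameter) with randomized search:

```python
# Simplified stand-in for the paper's PSO: randomized search over kernel
# type, kernel parameters, and regularization parameter C for an SVM.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
space = {
    "kernel": ["rbf", "poly", "sigmoid"],
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-3, 1e1),
    "degree": [2, 3, 4],                # used only by the polynomial kernel
}
search = RandomizedSearchCV(SVC(), space, n_iter=25, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```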
... Some of these methods can also be used for text categorization, which can be considered a multi-class classification problem. Relevance of features is a major concern in designing feature selection methods [7][8]. For example, several well-recognized feature selection methods have been developed considering entropic relevance, such as document frequency, information gain [9], mutual information [10], and the χ² statistic. ...
Preprint
In this paper, we present a new wrapper feature selection approach based on Jensen-Shannon (JS) divergence, termed feature selection with maximum JS-divergence (FSMJ), for text categorization. Unlike most existing feature selection approaches, the proposed FSMJ approach is based on real-valued features which provide more information for discrimination than binary-valued features used in conventional approaches. We show that the FSMJ is a greedy approach and the JS-divergence monotonically increases when more features are selected. We conduct several experiments on real-life data sets, compared with the state-of-the-art feature selection approaches for text categorization. The superior performance of the proposed FSMJ approach demonstrates its effectiveness and further indicates its wide potential applications on data mining.
... The first dataset (called Ohsumed) contains medical abstracts from the MEDLINE database. Following [16], we consider the 13,929 unique abstracts among the first 20,000 abstracts. The task is to classify the documents into 23 cardiovascular disease categories. ...
Preprint
Manually labeling documents is tedious and expensive, but it is essential for training a traditional text classifier. In recent years, a few dataless text classification techniques have been proposed to address this problem. However, existing works mainly center on single-label classification problems, that is, each document is restricted to belonging to a single category. In this paper, we propose a novel Seed-guided Multi-label Topic Model, named SMTM. With a few seed words relevant to each category, SMTM conducts multi-label classification for a collection of documents without any labeled document. In SMTM, each category is associated with a single category-topic which covers the meaning of the category. To accommodate multi-labeled documents, we explicitly model the category sparsity in SMTM by using a spike-and-slab prior and a weak smoothing prior. That is, without any threshold tuning, SMTM automatically selects the relevant categories for each document. To incorporate the supervision of the seed words, we propose a seed-guided biased GPU (i.e., generalized Polya urn) sampling procedure to guide the topic inference of SMTM. Experiments on two public datasets show that SMTM achieves better classification accuracy than state-of-the-art alternatives and even outperforms supervised solutions in some scenarios.
... Text categorization is one of the central problems in text mining and information retrieval: the task of classifying documents by the words they contain. Several machine learning algorithms have been developed for text classification, e.g., decision trees (J-48) [1], k-nearest neighbor (KNN) [2], support vector machines (SVM) [3], and random forests (RF) [4]. These text classifiers give acceptable accuracy on high-dimensional data such as text. ...
Preprint
Text categorization (TC) is the task of automatically organizing a set of documents into a set of pre-defined categories. Over the last few years, increased attention has been paid to the use of documents in digital form, which makes text categorization a challenging issue. The most significant problem in text categorization is its huge number of features. Most of these features are redundant, noisy, and irrelevant, causing overfitting with most classifiers. Hence, feature extraction is an important step to improve the overall accuracy and performance of text classifiers. In this paper, we provide an overview of using principal component analysis (PCA) as a feature extraction method with various classifiers. It was observed that classifier performance improved after using PCA to reduce the dimensionality of the data. Experiments are conducted on three UCI data sets: Classic03, CNAE-9, and DBWorld e-mails. We compare the classification performance of using PCA with popular and well-known text classifiers. Results show that using PCA encouragingly enhances classification performance for most of the classifiers.
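A minimal sketch of PCA-style feature extraction ahead of a text classifier; because TF-IDF matrices are sparse, TruncatedSVD is used here as the usual drop-in for PCA, a substitution on this sketch's part rather than necessarily the paper's exact choice:

```python
# Sketch: reduce TF-IDF features with TruncatedSVD (the sparse-friendly
# analogue of PCA) before a classifier. Documents are illustrative.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["graph algorithms and data structures", "deep learning for vision",
        "sorting and searching algorithms", "convolutional networks for images"]
labels = ["cs-theory", "ml", "cs-theory", "ml"]

model = make_pipeline(TfidfVectorizer(),
                      TruncatedSVD(n_components=2, random_state=0),
                      LogisticRegression()).fit(docs, labels)
print(model.predict(["image recognition with neural networks"]))
```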
Article
Full-text available
INTRODUCTION The current standard electronic (e‐)phenotype for identifying patients with Alzheimer's disease and related dementias (ADRD) from medical claims data yields suboptimal diagnostic accuracy. This study leveraged artificial intelligence (AI)–based text‐classification methods to improve the identification of patients with dementia due to ADRD using clinical notes from electronic health records (EHRs). METHODS EHR data for patients aged ≥ 64 (N = 4000) from an academic medical center were used. The cohort included 1000 patients with ADRD per the Chronic Conditions Warehouse (CCW) algorithm for ADRD (i.e., at least one ADRD International Classification of Diseases, Tenth Revision [ICD‐10] code) and 3000 matched controls without ADRD (i.e., no CCW codes). We trained several AI‐based text‐classification models, including bag‐of‐words models, deep learning, and large language models (LLMs), to make ADRD determinations from clinical notes. The performance of each model was evaluated against "gold standard" manual chart review. RESULTS A foundational LLM derived from Llama 2 demonstrated superior performance in identifying patients with ADRD (area under the curve [AUC] = 0.9534, F1 score 0.8571) compared to both the current standard CCW algorithm (AUC = 0.8482, F1 score 0.8323, although only the AUC difference was statistically significant) and other AI‐based models. Several of the AI‐based models, including convolutional neural networks, also outperformed the CCW algorithm. DISCUSSION These findings highlight the potential of AI‐based text‐classification methods to optimize the automated identification of patients with ADRD using rich EHR data. However, the success of this approach depends on the quality of clinical notes, and more work is needed to refine and validate these methods across more diverse data sets. Highlights: The current e‐phenotype for patients with Alzheimer's disease and related dementias (ADRD) in electronic health records has suboptimal diagnostic accuracy. The study used artificial intelligence (AI)–based text classification methods to improve the detection of patients with ADRD. AI‐based models, including convolutional neural networks, outperformed the Chronic Conditions Warehouse algorithm.
Article
In an effort to mitigate occupational hazards and promote proactive safety measures in industries, this study explores the application of ensemble learning and natural language processing (NLP) techniques to analyze the potential accident severity of hazards in a workplace. Even though the use of machine learning models based on reactive data is well-established in the domain of safety, the development of models using proactive data combining text reports and categorical features for predicting potential accident severity is comparatively new. Based on the road safety data collected through a Fatality Risk Control Programme (FRCP) initiative in an integrated steel plant in India, this study focuses on classifying accidents into different classes of severity. Dealing with unstructured texts and class-imbalanced data poses a significant challenge. In order to address the imbalance of classes of the target variable in the dataset, the Synthetic Minority Over-sampling Technique (SMOTE) was applied. Insights from text data were extracted through NLP techniques, which were then used to develop a dataset with diverse features by incorporating categorical features. An ensemble model is developed by employing six prediction algorithms: Decision Tree, Random Forest, Naive Bayes, Support Vector Machine, Extreme Gradient Boosting or XGBoost, and Adaptive Boosting or AdaBoost. A soft voting ensemble was developed utilizing bagging learning and probabilistic aggregation approaches to yield an improved robust classification. Finally, the comparative importance of features is assessed through the Leave-One-Covariate-Out (LOCO) methodology. By integrating these techniques, the study presents a novel approach to anticipate accident severity beforehand, allowing authorities to take proactive interventions for improved workplace safety.
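A compact sketch of the SMOTE rebalancing and soft-voting aggregation described above, assuming the imbalanced-learn package is installed; the real pipeline also includes NLP features extracted from incident text and additional base learners:

```python
# Sketch: rebalance a skewed dataset with SMOTE, then aggregate several
# base learners with soft (probability) voting. Data are synthetic.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)  # rebalance classes

ensemble = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("rf", RandomForestClassifier()),
                ("nb", GaussianNB())],
    voting="soft",                      # aggregate predicted probabilities
).fit(X_res, y_res)
print(ensemble.predict(X[:5]))
```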
Article
Full-text available
This paper presents a comprehensive review of loss functions and performance metrics in deep learning, highlighting key developments and practical insights across diverse application areas. We begin by outlining fundamental considerations in classic tasks such as regression and classification, then extend our analysis to specialized domains like computer vision and natural language processing including retrieval-augmented generation. In each setting, we systematically examine how different loss functions and evaluation metrics can be paired to address task-specific challenges such as class imbalance, outliers, and sequence-level optimization. Key contributions of this work include: (1) a unified framework for understanding how losses and metrics align with different learning objectives, (2) an in-depth discussion of multi-loss setups that balance competing goals, and (3) new insights into specialized metrics used to evaluate modern applications like retrieval-augmented generation, where faithfulness and context relevance are pivotal. Along the way, we highlight best practices for selecting or combining losses and metrics based on empirical behaviors and domain constraints. Finally, we identify open problems and promising directions, including the automation of loss-function search and the development of robust, interpretable evaluation measures for increasingly complex deep learning tasks. Our review aims to equip researchers and practitioners with clearer guidance in designing effective training pipelines and reliable model assessments for a wide spectrum of real-world applications.
Article
Full-text available
This article explores the classification of the Sino-Tibetan Horpa dialects, an under-researched linguistic cluster located within the Tibetan Autonomous Region and Sichuan Province in China. The study highlights the critical need to document and preserve these dialects, some of which are endangered. The research addresses the absence of comprehensive information in Russian linguistic sources and aims to systematize the available data on Horpa dialects, analyze their geographical distribution, and determine their linguistic status relative to other Chinese dialects. The research objectives include sorting and presenting information on Horpa dialects using visual aids and contributing to further studies on Sino-Tibetan languages and dialects from synchronic and diachronic perspectives. Additionally, strategies for preserving endangered languages are proposed. The Horpa dialect cluster represents a unique linguistic and cultural phenomenon deserving of further investigation. By systematically classifying the dialects and presenting new findings in Russian, this research enriches the field of theoretical and historical linguistics and provides a foundation for future studies and preservation efforts.
Article
Topic models have been successfully applied to information classification and retrieval. The difficulty in applying these technologies successfully lies in selecting the appropriate number of topics for a given corpus. Selecting too few topics can result in information loss and topic omission, known as underfitting. Conversely, an excess of topics can introduce noise and complexity, resulting in overfitting. Therefore, this article considers the inter-class distance and proposes a new method to determine the number of topics based on clustering results, named the average inter-class distance change rate (AICDR). AICDR employs Ward's method to calculate inter-class distances, then calculates the average inter-class distance for different numbers of topics, and determines the optimal number of topics based on the average distance change rate. Experiments show that the number of topics determined by AICDR is more in line with the true classification of datasets, with high inter-class distance and low inter-class similarity, avoiding the phenomenon of topic overlap. AICDR is a technique predicated on clustering results to select the optimal number of topics and has strong adaptability to various topic models.
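AICDR's exact change-rate computation is not reproduced here; the sketch below shows the ingredients the abstract names, Ward-linkage clustering and the average inter-class distance (here, between cluster centroids) as the candidate number of topics varies:

```python
# Sketch: Ward-linkage clustering of document vectors and the average
# inter-class (centroid) distance for varying cluster counts, the raw
# ingredients behind an AICDR-style change-rate criterion.
import numpy as np
from scipy.cluster.hierarchy import fcluster, ward
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
docs = np.vstack([rng.normal(c, 0.3, size=(30, 5)) for c in (0, 2, 4)])

Z = ward(docs)
for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")
    centroids = np.array([docs[labels == c].mean(axis=0)
                          for c in range(1, k + 1)])
    print(k, pdist(centroids).mean())   # average inter-class distance
```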
Article
This research examined a novel mobile application designed to provide proactive mental health support by analyzing the user's conversations and recommending interventions accordingly. Employing sentiment analysis of the user's recorded discussions with designated social contacts (parents, siblings, partner), the application identifies indicators of potential mental health issues. A personalized chatbot then interacts with the user, offering feedback based on the sentiment analysis and engaging in positive conversation to uplift the user's mood. Additionally, the system monitors the user's application activities and chatbot interaction patterns, detecting atypical behaviors for further feedback or prompting emergency alerts to pre-defined contacts. The research employed a two-phased approach: an initial pilot study with simulated data to refine the sentiment analysis and chatbot algorithms, followed by a validation study with a limited user group, utilizing actual conversation recordings. Analysis of the pilot data showed promising accuracy in identifying negative sentiments, while the validation study demonstrated a significant improvement in positive engagement and self-reported well-being among participants. Overall, the findings suggest that this multi-faceted approach using sentiment analysis and conversational AI holds potential for early detection and proactive intervention in mental health issues, justifying further investigation and refinement for broader implementation.
Conference Paper
Often, a study or research process requires the analysis of large volumes of information in the form of unstructured text. This task consumes a large amount of the time and resources of the human experts in charge of it. For this reason, there is great interest in developing automatic systems to support these activities by applying Natural Language Processing and Machine Learning techniques. The work presented here is part of the CIDMEFEO project, developed in collaboration with the Instituto Nacional de Estadística (INE). Our work focuses on the development of a text classification prototype for the identification and labeling of the different economic activities performed by Spanish companies.
Article
Inferring contextual information such as demographics from historical transactions is valuable to public agencies and businesses. Existing methods are data-hungry and do not work well when the available records of transactions are sparse. We consider here specifically inference of demographic information using limited historical grocery transactions from a few random trips that a typical business or public service organization may see. We propose a novel method called DemoMotif to build a network model from heterogeneous data and identify subgraph patterns (i.e., motifs) that enable us to infer demographic attributes. We then design a novel motif context selection algorithm to find specific node combinations significant to certain demographic groups. Finally, we learn representations of households using these selected motif instances as context, and employ a standard classifier (e.g., SVM) for inference. For evaluation purposes, we use three real-world consumer datasets, spanning different regions and time periods in the U.S. We evaluate the framework for predicting three attributes: ethnicity, seniority of household heads, and presence of children. Extensive experiments and case studies demonstrate that DemoMotif is capable of inferring household demographics using only a small number (e.g., fewer than 10) of random grocery trips, significantly outperforming the state-of-the-art.
Article
Full-text available
Aim: Effective management strategies for conserving biodiversity and mitigating the impacts of global change rely on access to comprehensive and up-to-date biodiversity data. However, manual search, retrieval, evaluation, and integration of this information into databases present a significant challenge to keeping pace with the rapid influx of large amounts of data, hindering its utility in contemporary decision-making processes. Automating these tasks through advanced algorithms holds immense potential to revolutionize biodiversity monitoring. Innovation: In this study, we investigate the potential for automating the retrieval and evaluation of biodiversity data from the Dryad and Zenodo repositories. We have designed an evaluation system based on various criteria, including the type of data provided and its spatio-temporal range, and applied it to manually assess the relevance for biodiversity monitoring of datasets retrieved through an application programming interface (API). We evaluated a supervised classification to identify potentially relevant datasets and investigated the feasibility of automatically ranking their relevance. Additionally, we applied the same approach to a scientific literature source, using data from Semantic Scholar for reference. Our evaluation centers on the database utilized by a national biodiversity monitoring system in Quebec, Canada. Main conclusions: We retrieved 89 (55%) relevant datasets for our database, showing the value of automated dataset search in repositories. Additionally, we find that scientific publication sources offer broader temporal coverage and can serve as conduits guiding researchers toward other valuable data sources. Our automated classification system showed moderate performance in detecting relevant datasets (with an F-score up to 0.68) and signs of overfitting, emphasizing the need for further refinement. A key challenge identified in our manual evaluation is the scarcity and uneven distribution of metadata in the texts, especially pertaining to spatial and temporal extents. Our evaluative framework, based on predefined criteria, can be adopted by automated algorithms for streamlined prioritization, and we make our manually evaluated data publicly available, serving as a benchmark for improving classification techniques.
Chapter
This chapter is concerned with schemes for evaluating text categorization systems. We adopt the two measures, recall and precision, which are used for evaluating information retrieval systems, and integrate them into the F1 measure. A text categorization task is decomposed into binary classifications, and the F1 measure is applied to each binary classification. There are two schemes for averaging the F1 measures that correspond to the binary classifications: micro-averaging and macro-averaging. In this chapter, we describe the text collection for evaluating text categorization systems, the evaluation measures, and the schemes for comparing two approaches.
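For concreteness, the standard definitions behind these averaging schemes, for a task decomposed into k binary classifications with per-class counts TP_i, FP_i, and FN_i:

```latex
% Per-class precision, recall, and F1:
\[
P_i = \frac{TP_i}{TP_i + FP_i}, \qquad
R_i = \frac{TP_i}{TP_i + FN_i}, \qquad
F1_i = \frac{2\,P_i R_i}{P_i + R_i}
\]
% Macro-averaging takes the mean of per-class F1 scores; micro-averaging
% pools the counts first and then computes a single F1:
\[
F1_{\text{macro}} = \frac{1}{k}\sum_{i=1}^{k} F1_i, \qquad
F1_{\text{micro}} = \frac{2\,P_\mu R_\mu}{P_\mu + R_\mu},
\quad\text{where } P_\mu = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)},
\ R_\mu = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}
\]
```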
Chapter
This chapter is concerned with some machine learning algorithms which are used as typical approaches to text categorization. Machine learning is a computation paradigm where a model is defined automatically from sample examples called training examples. Machine learning is divided into supervised learning and unsupervised learning, and supervised learning algorithms are used as approaches to text categorization. We mention KNN (K-nearest neighbor) and SVM (support vector machine) as typical approaches to the task. In this chapter, we describe these machine learning algorithms with respect to their learning and classification processes.
Article
Full-text available
Graph neural networks (GNNs) have emerged as a powerful tool for effectively mining and learning from graph-structured data, with applications spanning numerous domains. However, most research focuses on static graphs, neglecting the dynamic nature of real-world networks where topologies and attributes evolve over time. By integrating sequence modeling modules into traditional GNN architectures, dynamic GNNs aim to bridge this gap, capturing the inherent temporal dependencies of dynamic graphs for a more authentic depiction of complex networks. This paper provides a comprehensive review of the fundamental concepts, key techniques, and state-of-the-art dynamic GNN models. We present the mainstream dynamic GNN models in detail and categorize models based on how temporal information is incorporated. We also discuss large-scale dynamic GNNs and pre-training techniques. Although dynamic GNNs have shown superior performance, challenges remain in scalability, handling heterogeneous information, and lack of diverse graph datasets. The paper also discusses possible future directions, such as adaptive and memory-enhanced models, inductive learning, and theoretical analysis.