Article

BoosTexter: A Boosting-based System for Text Categorization


Abstract

This work focuses on algorithms which learn from examples to perform multiclass text and speech categorization tasks. Our approach is based on a new and improved family of boosting algorithms. We describe in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks. We present results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks. We conclude by describing the application of our system to automatic call-type identification from unconstrained spoken customer responses.
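
For readers who want to see the mechanics behind this family of algorithms, below is a minimal numpy sketch of an AdaBoost.MH-style boosting loop with one-term decision stumps, in the spirit of BoosTexter's weak hypotheses. The stump form, parameters, and names are illustrative simplifications, not the paper's exact implementation.

```python
import numpy as np

def adaboost_mh(X, Y, T=20):
    """Minimal AdaBoost.MH-style loop with one-term decision stumps.
    X: (n, d) binary term-presence matrix; Y: (n, k) labels in {-1, +1}."""
    n, d = X.shape
    k = Y.shape[1]
    D = np.full((n, k), 1.0 / (n * k))   # distribution over (example, label) pairs
    ensemble = []
    for _ in range(T):
        best_r, best = -1.0, None
        for j in range(d):               # try a stump on each term
            present = X[:, j].astype(bool)
            w1 = (D * Y)[present].sum(axis=0)    # per-label weighted agreement, term present
            w0 = (D * Y)[~present].sum(axis=0)   # ... and term absent
            r = np.abs(w1).sum() + np.abs(w0).sum()
            if r > best_r:
                best_r = r
                best = (j, np.where(w1 >= 0, 1.0, -1.0), np.where(w0 >= 0, 1.0, -1.0))
        j, c1, c0 = best
        r = min(best_r, 1.0 - 1e-10)             # keep alpha finite
        alpha = 0.5 * np.log((1 + r) / (1 - r))
        h = np.where(X[:, [j]].astype(bool), c1, c0)   # (n, k) stump outputs
        D *= np.exp(-alpha * Y * h)              # boost the pairs we still get wrong
        D /= D.sum()
        ensemble.append((alpha, j, c1, c0))
    return ensemble

def predict(ensemble, X):
    """H(x, l) = sign of the alpha-weighted vote of the stumps."""
    F = sum(a * np.where(X[:, [j]].astype(bool), c1, c0) for a, j, c1, c0 in ensemble)
    return np.sign(F)
```

Each round reweights the (document, label) pairs so that later stumps concentrate on the terms and labels the current ensemble still gets wrong.
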


... Generally, this strategy provides the most efficient solution for multi-label problems. Boosting-based methods such as AdaBoost [20] and the ensemble of classifier chains [21] [3] are examples of this strategy. ...
... The graph Laplacian L is obtained as L = D − W, where D and W are two (L + U) × (L + U) matrices. W is the symmetric weight matrix, which can be constructed according to (20); D is the diagonal degree matrix whose elements are defined ...
... It is imperative that the weights assigned to the edges of the graph be smooth without any abrupt changes, since the weights reflect the similarity between instances. Equation (20) implies that if two instances are connected through a high weight edge, these two instances share the same labels. ...
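
As a concrete illustration of this construction, here is a small numpy sketch that builds L = D − W, assuming a Gaussian-kernel weight matrix as one common smooth choice for the edge weights (the cited equation (20) may differ):

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Build L = D - W over the (L+U) labeled and unlabeled instances in X.
    W holds Gaussian-kernel edge weights (one smooth choice; Eq. (20) in the
    cited paper may differ) and D is the diagonal degree matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                             # no self-loops
    D = np.diag(W.sum(axis=1))                           # D_ii = sum_j W_ij
    return D - W
```
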
Article
Full-text available
In the machine learning jargon, multi-label classification refers to a task where multiple mutually non-exclusive class labels are assigned to a single instance. Generally, the lack of sufficient labeled training data demanded by a classification task is met by an approach known as semi-supervised learning. This type of learning extracts the decision rules of classification by utilizing both labeled and unlabeled data. Regarding multi-label data, however, current semi-supervised learning methods are unable to classify them accurately. Therefore, with the goal of generalizing the state-of-the-art semi-supervised approaches to multi-label data, this paper proposes a novel two-stage method for multi-label semi-supervised classification. The first stage determines the label(s) of the unlabeled training data by means of a smooth graph constructed using the manifold regularization. In the second stage, thanks to the capability of the twin support vector machine to relax the requirement that hyperplanes should be parallel in classical SVM, we employ it to establish a multi-label classifier called LP-MLTSVM. In the experiments, this classifier is applied on benchmark datasets. The simulation results substantiate that compared to the existing multi-label classification algorithms, LP-MLTSVM shows superior performance in terms of the Hamming loss, average precision, coverage, ranking loss, and one-error metrics.
... We adopted six evaluation metrics (Hamming Loss, Ranking Loss, One Error, Coverage, Average Precision, and Micro F1) [47] to evaluate the classification performance. Except for the last two, which achieve better performance when the values are larger, the remaining metrics obtain better performances when values are smaller. ...
... We considered five evaluation metrics [47]: Hamming Loss, One Error, Coverage, Ranking Loss, and Average Precision. Except for Average Precision, which performs better when larger, the others perform better when smaller. ...
Article
Full-text available
Multi-label classification deals with the determination of instance-label associations for unseen instances. Although many margin-based approaches have been delicately developed, the uncertain classification of instances with smaller separation margins remains unsolved. The intuitionistic fuzzy set is an effective tool for characterizing the concept of uncertainty, yet it has not been examined for multi-label cases. This paper proposes a novel model called intuitionistic fuzzy three-way label enhancement (IFTWLE) for multi-label classification. IFTWLE combines label enhancement with an intuitionistic fuzzy set under the framework of three-way decisions. For unseen instances, we generate pseudo-labels for label uncertainty evaluation from a logical label-based model. An intuitionistic fuzzy set-based instance selection principle seamlessly bridges logical label learning and numerical label learning. The principle is developed hierarchically. At the label level, membership and non-membership functions are defined pairwise to measure the local uncertainty and generate candidate uncertain instances. After upgrading to the instance level, we select instances from the candidates for label enhancement, whereas the remaining ones are left unchanged. To the best of our knowledge, this is the first attempt to combine logical label learning with numerical label learning into a unified framework for minimizing classification uncertainty. Extensive experiments demonstrate that, with the selectively reconstructed label importance, IFTWLE is statistically superior to state-of-the-art multi-label classification algorithms in terms of classification accuracy. The computational complexity of the algorithm is O(n²mk), where n, m, and k denote the number of unseen instances, the number of labels, and the average label-specific feature size, respectively.
... AdaBoost.MH and AdaBoost.MR (Schapire and Singer, 2000 [21]) are two extensions of AdaBoost (Freund and Schapire, 1997 [9]) for multi-label classification. Both apply AdaBoost to weak classifiers of the form H : X × L → R. In AdaBoost.MH, if the sign of the weak classifiers' output is positive for a new example x and a label l, we consider that the example can be labeled with l, whereas if it is negative the example is not labeled with l. ...
... This section presents the different metrics that have been proposed in the literature. Schapire and Singer (2000, [21]) consider the "Hamming loss", defined as follows: ...
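
The snippet truncates the definition; the Hamming loss of Schapire and Singer is usually written as below, where m is the number of examples, Y_i the true label set of x_i, h(x_i) the predicted label set, Δ the symmetric difference, and |𝓛| the number of labels:

```latex
\mathrm{HLoss}(h) \;=\; \frac{1}{m}\sum_{i=1}^{m}
\frac{\lvert\, h(x_i)\,\Delta\, Y_i \,\rvert}{\lvert \mathcal{L} \rvert}
```
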
Article
Full-text available
This is the French translation of "Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3), 1-13. 10.4018/jdwm.2007070101"
... Variations of AdaBoost were developed for multi-class problems (Freund & Schapire, 1997), multi-label problems (Schapire & Singer, 2000), regression problems (Drucker, 1997; Solomatine & Shrestha, 2004), learning-to-rank problems (Cohen et al., 1999; Xu & Li, 2007; Wu et al., 2010) and, finally, label ranking (Dery & Shmueli, 2020; Plaia et al., 2021). ...
Article
Label Ranking (LR) is an emerging non-standard supervised classification problem with practical applications in different research fields. The Label Ranking task aims at building preference models that learn to order a finite set of labels based on a set of predictor features. One of the most successful approaches to tackling the LR problem consists of using decision tree ensemble models, such as bagging, random forest, and boosting. However, these approaches, coming from the classical unweighted rank correlation measures, are not sensitive to label importance. Nevertheless, in many settings, failing to predict the ranking position of a highly relevant label should be considered more serious than failing to predict a negligible one. Moreover, an efficient classifier should be able to take into account the similarity between the elements to be ranked. The main contribution of this paper is to formulate, for the first time, a more flexible label ranking ensemble model which encodes the similarity structure and a measure of the individual label importance. Precisely, the proposed method consists of three item-weighted versions of the AdaBoost boosting algorithm for label ranking. The predictive performance of our proposal is investigated both through simulations and applications to three real datasets.
... As for task formulations, most existing methods model ID as a text classification problem and SF as a sequence tagging problem. Early works handled ID and SF separately (Schapire and Singer, 2000; Haffner et al., 2003; Raymond and Riccardi, 2007; Tur et al., 2011; Ravuri and Stolcke, 2015; Kim et al., 2017; Wang et al., 2022b). However, current approaches prefer to model them jointly, taking the high correlation between them into account, which leads to substantial improvements. ...
Preprint
Multi-intent Spoken Language Understanding has great potential for widespread implementation. Jointly modeling Intent Detection (ID) and Slot Filling (SF) provides a channel to exploit the correlation between intents and slots. However, current approaches tend to formulate these two sub-tasks differently, which leads to two issues: 1) it hinders models from effective extraction of shared features; 2) rather complicated structures are involved to enhance expressive power while harming the interpretability of frameworks. In this work, we describe a Prompt-based Spoken Language Understanding (PromptSLU) framework that intuitively unifies the two sub-tasks into the same form through a common pre-trained Seq2Seq model. In detail, ID and SF are completed by concisely filling the utterance into task-specific prompt templates as input, and sharing an output format of key-value pair sequences. Furthermore, variable intents are predicted first and then naturally embedded into prompts to guide slot-value pair inference from a semantic perspective. Finally, inspired by prevalent multi-task learning, we introduce an auxiliary sub-task that helps to learn relationships among the provided labels. Experimental results show that our framework outperforms several state-of-the-art baselines on two public datasets.
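
To make the unified prompt idea concrete, here is a toy sketch of how both sub-tasks can share one seq2seq input/output format; the template wordings and field names are invented for illustration, not PromptSLU's actual prompts.

```python
def build_prompts(utterance: str) -> dict:
    """Toy illustration of casting ID and SF into one seq2seq format.
    The template wordings are invented, not PromptSLU's actual prompts."""
    return {
        "intent_detection": f"detect intents: {utterance}",
        "slot_filling": f"fill slots: {utterance}",
    }

def parse_kv_output(text: str) -> dict:
    """Parse a 'key=value; key=value' output sequence into a dict."""
    pairs = (p.split("=", 1) for p in text.split(";") if "=" in p)
    return {k.strip(): v.strip() for k, v in pairs}

# e.g. a decoded model output covering both sub-tasks:
print(parse_kv_output("intent=book_flight; from=Boston; to=Denver"))
```
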
... Algorithm adaptation approaches extend and customize single-label machine learning algorithms in order to handle multi-label data directly. Several adaptations of traditional learning algorithms have been proposed in the literature, such as boosting (AdaBoost.MH) [4], support vector machines (RankSVM) [5], multi-label k-nearest neighbors (ML-kNN) [6] and neural networks [7]. On the other hand, problem transformation methods transform a multi-label learning problem into a series of single-label problems which already have well-established resolution methods. ...
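
As a concrete example of the problem transformation strategy mentioned above, here is a minimal binary relevance sketch in scikit-learn, which trains one independent binary classifier per label; the data and model choice are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 20))                      # toy feature matrix
Y = (rng.random((100, 4)) > 0.7).astype(int)   # toy 4-label indicator matrix

# Binary relevance: one independent binary classifier per label.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X[:5])                      # (5, 4) indicator predictions
```
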
Article
Full-text available
In this paper, we introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections in the presence of complex and structured label vocabularies with high inter-label correlation. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm which uses a large autoencoder trained to map the large label space to a reduced size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal in a large portion of the MEDLINE biomedical document collection which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments we propose and evaluate several document representation approaches and different label autoencoder configurations.
... For each dataset, we introduce the example count (#n), the feature dimensionality (#f), the label count (#q), the average number of associated labels per instance (#c) and the domain. We adopt six evaluation metrics (Hamming Loss, Ranking Loss, One Error, Coverage, Average Precision, and Micro F1) [53] to evaluate the classification performance. The last two achieve better performance if the values are large, and the remaining metrics obtain better performance if the values are small. ...
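
For illustration, here is a small numpy sketch of two of the ranking-based metrics listed above, following their common definitions (conventions vary slightly across papers, e.g. 0- versus 1-based coverage):

```python
import numpy as np

def one_error(scores, Y):
    """Fraction of instances whose top-ranked label is not relevant."""
    top = scores.argmax(axis=1)
    return 1.0 - Y[np.arange(len(Y)), top].mean()

def coverage(scores, Y):
    """Average depth in the ranking needed to cover every relevant label
    (0-based; assumes each instance has at least one relevant label)."""
    order = (-scores).argsort(axis=1)
    ranks = order.argsort(axis=1)            # rank of each label, per instance
    return np.mean([ranks[i][Y[i] == 1].max() for i in range(len(Y))])
```
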
Article
Full-text available
Data representation is of significant importance in minimizing multi-label ambiguity. While most researchers intensively investigate label correlation, the research on enhancing model robustness is preliminary. Low-quality data is one of the main reasons that model robustness degrades. Aiming at the cases with noisy features and missing labels, we develop a novel method called robust global and local label correlation (RGLC). In this model, subspace learning reconstructs intrinsic latent features immune from feature noise. The manifold learning ensures that outputs obtained by matrix factorization are similar in the low-rank latent label if the latent features are similar. We examine the co-occurrence of global and local label correlation with the constructed latent features and the latent labels. Extensive experiments demonstrate that the classification performance with integrated information is statistically superior over a collection of state-of-the-art approaches across numerous domains. Additionally, the proposed model shows promising performance on multi-label when noisy features and missing labels occur, demonstrating the robustness of multi-label classification.
... Some research studies have focused on improving or explaining the robustness of the quality assurance system, such as iterative input [49] to improve the quality of the text summarization system, and the combination of decision trees and enhancement techniques [50] to improve the classification accuracy. Qin et al. [51] proposed a Machine Natural Language Parser (MParser) to address the semantic interoperability problem between users and computers. ...
Article
Full-text available
A natural language processing system can realize effective communication between human and computer with natural language. Because its evaluation method relies on a large amount of labeled data and human judgment, how to systematically evaluate its quality remains a challenging task. In this article, we use metamorphic testing technology to evaluate natural language processing systems from the user's perspective, to help users better understand the functionalities of these systems and then select the appropriate natural language processing system according to their specific needs. We define three metamorphic relation patterns, each focusing on characteristics of a different aspect of natural language processing. On this basis, we define seven metamorphic relations and choose three tasks (text similarity, text summarization, and text classification) to evaluate the quality of the system, with Chinese as the target language. We extend the defined abstract metamorphic relations to these tasks, generating seven specific metamorphic relations for each task. Then, we judge whether the metamorphic relations are satisfied for each task and use them to evaluate the quality and robustness of the natural language processing system without reference outputs. We further apply the metamorphic test to three mainstream natural language processing systems (the BaiduCloud API, AliCloud API, and TencentCloud API) on the PAWS-X, LCSTS, and THUCNews datasets. The experiments reveal the advantages and disadvantages of each system, and the results further show that the metamorphic test can effectively test natural language processing systems without annotated data.
... One of the simplest classification algorithms is logistic regression (LR), which has been addressed in most data mining domains ([23,14,10]). Other techniques include ensemble-based learning techniques such as boosting and bagging, which have been used mainly for query learning strategies and text analysis ([30,52]). ...
Article
Full-text available
The advent of Industry 4.0 provides new opportunities to improve the maintenance of production equipment from both the technical and managerial perspective. In this paper, we propose a contribution in the direction of predictive maintenance of machine tools based on the integration of a text mining algorithm with the cyber-physical system of a manufacturing industry. The system performs its analysis starting from data stored in log files maintained by a machine tool returning an alert about a future potential machine failure. Log files, produced by part programs running on the machine control system, record the status of execution parameters taken by key sensors or derived by the control system during the part program execution. Historical data are collected by means of Digital Twin technologies and then analyzed using computational linguistic techniques so that we can predict a machine failure in the imminent future starting from data collected in the past. The paper first describes a new scheme for the classification of maintenance approaches. Then, starting from the proposed cyber-physical system model, an algorithm for predictive maintenance based on text mining technology is integrated in it. The implemented tool supports the maintenance manager in making the most appropriate decisions about the scheduling of maintenance activities when there is an alert about a possible machine failure.
... Algorithm Adaptation: this approach adapts traditional classification algorithms to work directly on multi-label data, without resorting to a transformation method. One such algorithm is AdaBoost.MH [6], a boosting algorithm for multi-label problems aimed at minimizing the Hamming loss. In ML-kNN (multi-label kNN) [11], statistical information such as prior and posterior probabilities is obtained from the training data. ...
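
A minimal sketch of the statistics ML-kNN gathers from training data might look as follows; the smoothing constant and neighbor count are illustrative defaults.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mlknn_statistics(X, Y, k=10, s=1.0):
    """Sketch of what ML-kNN estimates from training data: smoothed label
    priors, plus the per-instance counts of neighbors carrying each label
    (the raw material for its posterior estimates). k and s are illustrative."""
    n, q = Y.shape
    prior = (s + Y.sum(axis=0)) / (2 * s + n)              # P(label present)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop the point itself
    neighbor_counts = Y[idx].sum(axis=1)                   # (n, q), values in [0, k]
    return prior, neighbor_counts
```
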
Article
Full-text available
Nowadays, multi-label classification can be considered one of the important challenges in classification, as instances are assigned more than one class label. Ensemble learning is a supervised learning process in which several classifiers are trained to obtain a better solution for a given problem. Feature reduction can be used to improve classification accuracy by incorporating class label information into Principal Component Analysis (PCA). In this paper, a stacked ensemble learning method with class-augmented PCA (CA PCA) is proposed for the classification of multi-label data (SEMML). First, the dimensionality reduction step is applied; then the classifiers to be applied to the original training dataset are chosen, and finally the stacking method is applied. The results of the experiments conducted show that our proposed method works better than the existing methods.
... Hamming loss [30] computes the fraction of labels incorrectly predicted by the classifier h. It is a loss metric, i.e., a classifier that only produces correct outputs would achieve a value of zero for it. ...
Article
Full-text available
Multi-label classification is the task of inferring the label sets of unseen instances using knowledge obtained through the analysis of a set of training examples with known label sets. In this paper, a multi-label classifier fusion ensemble approach named decision templates for ensemble of classifier chains is presented, which is derived from the decision templates method. The proposed method estimates two decision templates per class, one representing the presence of the class and the other representing its absence, based on the same examples used for training the set of classifiers. For each unseen instance, a new decision profile is created, and the similarity between the decision templates and the decision profile determines the resulting label set. The method is incorporated into a traditional multi-label classification algorithm: the ensemble of classifier chains. Empirical evidence indicates that the proposed decision templates adaptation can improve performance over the traditionally used combining schemes, especially for domains with a large number of instances available, improving the performance of an already high-performing multi-label learning method.
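
A toy numpy sketch of the two-templates-per-class idea is shown below; it uses Euclidean similarity and assumes every label occurs both present and absent in training, which is a simplification of the paper's method.

```python
import numpy as np

def fit_decision_templates(profiles, Y):
    """profiles: (n, c) decision profiles (stacked classifier outputs);
    Y: (n, q) binary labels. Two templates per label: the mean profile
    where the label is present and where it is absent (assumes both
    cases occur in training)."""
    pos = np.stack([profiles[Y[:, l] == 1].mean(axis=0) for l in range(Y.shape[1])])
    neg = np.stack([profiles[Y[:, l] == 0].mean(axis=0) for l in range(Y.shape[1])])
    return pos, neg

def predict_dt(profile, pos, neg):
    """Assign each label whose 'present' template is closer (Euclidean)."""
    return (np.linalg.norm(pos - profile, axis=1)
            < np.linalg.norm(neg - profile, axis=1)).astype(int)
```
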
... It is more challenging than multi-class classification with a single label. Multi-label learning has attracted attention from machine learning and related communities and has been widely applied to diverse problems such as multi-label text categorization [29,35,42], image annotation [1,8], web mining [22], and tag recommendation [21,37]. In addition, many recent multi-label classification algorithms have also been applied in the multi-view research domain [18,23,49,53]. ...
Article
Full-text available
Multi-label classification is very common in practical applications. Compared with multi-class classification, multi-label classification has a larger label space, and thus the annotation of multi-label instances is typically more time-consuming. It is therefore significant to develop active learning methods for multi-label classification problems. In addition, multi-view learning is increasingly popular: it treats data from different views discriminatively and integrates information from all the views effectively. Introducing multi-view methods into active learning can further enhance its performance when processing multi-view data. In this paper, we propose multi-view active learning methods for multi-label classification. The proposed methods are developed based on the conditional Bernoulli mixture model, an effective model for multi-label classification. To design the active selection criteria, we consider selecting instances that are both informative and representative. From the informative perspective, the least confidence and the entropy of the predicted results are employed. From the representative perspective, clustering results on the unlabeled data are exploited. Particularly for multi-view active learning, novel multi-view prediction methods are designed to make the final prediction, and view consistency is additionally considered in the selection criteria. Finally, we demonstrate the effectiveness of the proposed methods through experiments on real-world datasets.
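
As an illustration of the informativeness criterion, here is a minimal sketch that scores unlabeled instances by the summed binary entropy of their predicted label probabilities and selects the most uncertain batch (representativeness and view consistency are omitted):

```python
import numpy as np

def select_most_uncertain(probs, batch=10):
    """probs: (n, q) per-label probabilities on the unlabeled pool.
    Score each instance by its summed binary entropy and return the
    indices of the 'batch' most uncertain ones."""
    p = np.clip(probs, 1e-12, 1 - 1e-12)
    H = -(p * np.log(p) + (1 - p) * np.log(1 - p)).sum(axis=1)
    return np.argsort(-H)[:batch]
```
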
Article
Multi-label learning aims to solve classification problems where instances are associated with a set of labels. In reality, it is generally easy to acquire unlabeled data but expensive or time-consuming to label them, and this situation becomes more serious in multi-label learning as an instance needs to be annotated with several labels. Hence, semi-supervised multi-label learning approaches emerge as they are able to exploit unlabeled data to help train predictive models. This work proposes a novel approach called Self-paced Multi-label Co-Training (SMCT). It leverages the well-known co-training paradigm to iteratively train two classifiers on two views of a dataset and communicate one classifier’s predictions on unlabeled data to augment the other’s training set. As pseudo labels may be false in iterative training, self-paced learning is integrated into SMCT to rectify false pseudo labels and avoid error accumulation. Concretely, the multi-label co-training model in SMCT is formulated as an optimization problem by introducing latent weight variables of unlabeled instances. It is then solved via an alternative convex optimization algorithm. Experimental evaluations are carried out based on six benchmark multi-label datasets and three metrics. The results demonstrate that SMCT is very competitive in each setting when compared with five state-of-the-art methods.
Article
Full-text available
The limited spatial resolution of hyperspectral (Hx) images corrupts the spectral information of pure materials and their distribution in an image. The accuracy of characterising or classifying the soil using Hx or Mx images decreases when surfaces are covered by vegetation: in the presence of vegetation, a single pixel can be labelled as either vegetation or a specific soil type. In this context, we have studied the usefulness of the multi-label classification (MLC) approach to classify soil colour in the presence of vegetation cover. We evaluated its performance on airborne Hx (Airborne Visible InfraRed Imaging Spectrometer - Next Generation, AVIRIS-NG) images acquired over the Berambadi catchment, Karnataka, India. The potential of MLC to classify soil types using simulated Sentinel-2 images (Sen-S) was also explored. The surface soil colour in the Berambadi catchment was classified into two soil types ("black" and "red" soils). The proposed MLC approach consists of (1) simulating the mixed spectra of vegetation, red soil, black soil and non-photosynthetic vegetation (NPV) using a linear mixture model (LMM) and a bilinear mixture model (BLM) to generate a well-balanced calibration dataset, and (2) labelling each pixel with multiple classes using MLC approaches. The performances of classical and deep-neural-network (DNN) based MLC models were compared to identify the best-performing model. Our results showed significant performance for the cost-sensitive multi-label embedding (CLEMS) model when applied to both AVIRIS-NG (OA=97%) and Sen-S (OA=93%) images. The proposed method requires a limited number of ground-truth samples and is operationally practical for large Hx and Mx images.
Chapter
State-of-the-art Spoken Language Understanding models of Spoken Dialog Systems achieve remarkable results on benchmark corpora thanks to the winning combination of pretraining on large collections of out-of-domain data with contextual Transformer representations and fine-tuning on in-domain data. On average, performance is almost perfect on benchmark datasets such as ATIS. However, some phenomena, such as unseen events or ambiguities, can greatly affect this performance. They are the major sources of errors in real-life deployed systems, although they are not necessarily equally represented in benchmark corpora. This paper aims to predict and characterize error-prone utterances and to explain what makes a given corpus more or less challenging. After training such a predictor on benchmark corpora from various languages and domains, we confront it with a new corpus collected from a French deployed vocal assistant with different distributional properties. We show that the predictor can highlight challenging utterances and explain the main complexity factors, even though this corpus was collected in a completely different setting.
Article
Non-coding RNAs (ncRNAs) play an important role in revealing the mechanisms of human disease for anti-tumor and anti-virus substances. Detecting the subcellular locations of ncRNAs is a necessary step in studying them. Traditional biochemical methods are time-consuming and labor-intensive, and computational methods can help detect the location of ncRNAs on a large scale. However, many models do not consider the correlation information among the multiple subcellular localizations of an ncRNA. This study proposes a radial basis function neural network based on shared subspace learning (RBFNN-SSL), which extracts the structure shared across multiple labels. To evaluate its performance, our classifier is tested on three ncRNA datasets, where it achieves better experimental results.
Article
Multi-label learning is a growing field in machine learning research. Many applications address instances that simultaneously belong to many categories, which cannot be disregarded if optimal results are desired. Among the many algorithms developed for multi-label learning, the multi-label k-nearest neighbor method is among the most successful. However, in a difficult classification task, such as multi-label learning, a challenge that arises in the k-nearest neighbor approach is the assignment of the appropriate value of k. Although a suitable value might be obtained using cross-validation, it is unlikely that the same value will be optimal for the whole space spanned by the training set. It is evident that different regions of the feature space would have different distributions of instances and labels that would require different values of k. The very complex boundaries among the many present labels make the necessity of local k values even more important than in the case with a single-label k-nearest neighbor. We present a simple yet powerful approach for setting a local value of k. We associate a potentially different k with every prototype and obtain the best value of that k by optimizing the criterion consisting of the local effect of the different k values in the neighborhood of the prototype. The proposed method has a fast training stage, as it only uses the neighborhood of each training instance to set the local k value. The complexity of the proposed method in terms of the testing time is similar to that of the standard multi-label k-nearest neighbor approach. Experiments performed on a set of 20 problems show that not only does our proposed method significantly outperform the standard multi-label k-nearest neighbor rule but also the locally adaptive multi-label k-nearest neighbor method can benefit from a local k value.
Article
Full-text available
Boosting is an ensemble learning method that combines a set of weak learners into a strong learner to minimize training errors. The AdaBoost algorithm, a typical boosting algorithm, combines weak learners or predictors into a strong predictor to solve classification problems. With its remarkable usability and effectiveness, AdaBoost has been widely used in many fields, such as face recognition, speech enhancement, natural language processing, and network intrusion detection. In large-scale enterprise network environments, more and more companies have begun to build trustworthy networks to defend effectively against hacker attacks. However, since trustworthy networks use trusted flags to verify the legitimacy of network requests, they cannot effectively identify abnormal behaviors in network data packets. This paper applies the AdaBoost algorithm to anomaly intrusion detection in trustworthy networks to improve the defense capability against network attacks. The method uses a simple decision tree as the base weak learner and uses AdaBoost to combine multiple weak learners into a strong learner by re-weighting the samples. The paper uses real data from a trustworthy network for experimental verification. The experimental results show that the average precision of the AdaBoost-based network anomaly detection method is above 0.999, indicating a significant detection effect on both abnormal network attacks and normal network access. The proposed method can therefore effectively improve the security of trustworthy networks.
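
The described setup, AdaBoost over shallow decision trees for binary anomaly detection, can be sketched with scikit-learn as follows; the synthetic data stands in for the private network traffic and all parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data standing in for the (private) network traffic.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AdaBoost over decision stumps, as the paper describes.
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200)
clf.fit(X_tr, y_tr)
print("precision:", precision_score(y_te, clf.predict(X_te)))
```
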
Article
Full-text available
The resistance variant faults (RVFs) observed in a mine ventilation system can severely restrict safe mine production. Herein, a machine learning model based on the multi-label k-nearest neighbor (ML-KNN) algorithm is proposed to solve the problem of rapidly and accurately diagnosing RVFs that occur at multiple locations within the mine ventilation system. The air volume passing through all the branches of the ventilation network, including the residual branches, was used as the diagnostic model input after the occurrence of multiple faults, whereas the label vector of the fault locations was used as the model's output. In total, seven evaluation indicators and 1800 groups of randomly simulated faults at typical locations in a production mine with 153 nodes and 223 branches were considered to evaluate the feasibility of the proposed model for diagnosing multiple fault locations and to verify its generalization ability. After ten-fold cross-validation of the training sets containing 1600 groups of fault instances, the diagnostic accuracy of the model tested with the air volume of all 223 branches and with the air volume of the 71 residual branches as input was 73.6% and 72.3%, respectively. To further evaluate the diagnostic performance of the model, 200 groups of multiple fault instances that were not included in the training were tested. The accuracy of the fault location diagnosis was 76.5% and 73.5%, and the diagnosis time was 9.9 s and 12.16 s, for the multiple fault instances with all 223 branches' air volume and with the 71 residual branches' air volume as observation characteristics, respectively. The data show that the ML-KNN-based machine learning model performs well in diagnosing multiple resistance variant fault locations in a mine ventilation system: the diagnosis can be carried out with either all branches' air volume or the residual branches' air volume as model input, the average diagnostic accuracy is higher than 70%, and the average diagnosis time is less than one minute. Hence, the proposed model's diagnostic accuracy and speed can meet the engineering requirements for diagnosing multiple fault locations in a real ventilation system in the field, and the model can effectively replace personnel in discovering ventilation system failures, laying a good foundation for the construction of intelligent ventilation systems.
Article
Feature selection has long been recognized as an important preprocessing technique to reduce dimensionality and improve the performance of regression and classification tasks. The class of sequential forward feature selection methods based on Mutual Information (MI) is widely used in practice, mainly due to its computational efficiency and independence from the specific classifier. A recent work introduced a theoretical framework for this class of methods which explains the existing proposals as approximations to an optimal target objective function. This framework made clear the advantages and drawbacks of each proposal. Methods accounting for the redundancy of candidate features using a maximization function and considering the so-called complementary effect are among the best ones. However, they still penalize complementarity, which is an important drawback. This paper proposes the Decomposed Mutual Information Maximization (DMIM) method, which keeps the good theoretical properties of the best methods proposed so far but overcomes the complementarity penalization by applying the maximization separately to the inter-feature and class-relevant redundancies. DMIM was extensively evaluated and compared with other methods, both theoretically and using two synthetic scenarios and 20 publicly available real datasets applied to specific classifiers. Our results show that DMIM achieves better classification performance than the remaining forward feature selection methods based on MI.
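
To illustrate the class of methods discussed (a generic mRMR-style member of it, not DMIM itself), here is a minimal sequential forward selection sketch based on mutual information:

```python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mi_forward_selection(X, y, n_select=5):
    """Greedy forward selection: add the feature maximizing relevance
    I(X_j; y) minus its mean redundancy with the already-selected set.
    This is an mRMR-style criterion, not DMIM's decomposed objective."""
    relevance = mutual_info_classif(X, y)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = mutual_info_regression(X[:, selected], X[:, j]).mean()
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```
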
Article
In many real-world application domains, e.g., text categorization and image annotation, objects naturally belong to more than one class label, giving rise to the multi-label learning problem. The performance of multi-label learning greatly relies on the quality of the available features, whereas the data generally involve many irrelevant, redundant, and even noisy features. This fact has led to a surge of research on feature selection methods that select significant features for multi-label learning. Nevertheless, most previous approaches suffer from the deficiency that label-specific features are not taken into account, and they are also inefficient in exploiting labeling information such as local label correlations. Moreover, these methods lack interpretability: they can only find one feature subset for all labels and cannot show how features relate to different labels. Based on this, we present a new group-preserving label-specific feature selection (GLFS) framework for multi-label learning, which simultaneously considers the features shared by the labels in the same group and the specific features owned by each label when executing feature selection. In addition, we learn label-group and instance-group correlations to exploit labeling information, and make collaborative use of them to improve model generalization. Extensive experiments validate the advantages of the proposed GLFS method.
Article
Full-text available
In this paper, a novel distance-based multilabel classification algorithm is proposed. The proposed algorithm combines k-nearest neighbors (kNN) with neighborhood classifier (NC) to impose double constraints on the quantity and distance of the neighbors. In short, the radius constraint is introduced in the kNN model to improve the classification accuracy, and the quantity constraint k is added in the NC model to speed up computing. From the neighbors with the double constraints, the probabilities for each label are estimated by the Bayesian rule, and the classification judgment is made according to the probabilities. Experimental results show that the proposed algorithm has slight advantages over similar algorithms in calculation speed and classification accuracy.
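
A minimal sketch of the double constraint, capping the neighborhood both by count k and by radius before estimating per-label probabilities, might look like this; the 1-NN fallback and Laplace smoothing are illustrative choices, not the paper's exact rules.

```python
import numpy as np

def predict_double_constraint(X_train, Y_train, x, k=15, radius=1.0, s=1.0):
    """Keep at most k nearest neighbors AND only those within `radius`,
    then estimate smoothed per-label probabilities from the survivors."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]              # quantity constraint (speed)
    kept = nearest[d[nearest] <= radius]     # distance constraint (accuracy)
    if kept.size == 0:                       # fall back to the single nearest
        kept = nearest[:1]
    p = (s + Y_train[kept].sum(axis=0)) / (2 * s + kept.size)
    return (p > 0.5).astype(int)
```
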
Article
Multi-label classification is a challenging issue in the data science community due to the ambiguity of label semantics. Existing studies mainly focus on improving label association with logical labels, but performance suffers from the threshold setting. Although label distribution learning gains superior discrimination, the expense of collecting large-scale fine-grained numerical labels is intolerable. To address the uncertainty of logical label semantics, we propose a novel model called three-way decisions with label enhancement (3WDLE). For unseen instances, we implement a trisecting-acting-outcome framework. In the trisecting stage, an uncertainty measure called the global uncertain-prone degree partitions these instances into uncertain and certain regions, where the trisecting procedure is completed from the label level to the instance level by leveraging the distributions of pseudo-label information. In the acting stage, instances recognized as belonging to certain regions directly take the results generated by label-specific learning, whereas the remaining ones are reclassified by conducting selective label enhancement. The enriched knowledge generated by the label enhancement module is learnt on trustworthy instances only. In the outcome stage, we adopt five evaluation metrics to evaluate the classification performance from the perspectives of both labels and instances. In this way, three-way decisions provide a systematic methodology for dealing with uncertainty in multi-label classification, combining logical label learning and numerical label learning into a unified framework to optimize the performance of the multi-label classification model. Extensive experiments demonstrate the superiority of 3WDLE over state-of-the-art multi-label classifiers that use logical labels only.
Article
Full-text available
Multi-label feature selection has gained significant attention over the past decades. However, most existing algorithms lack interpretability and fail to uncover the underlying causal mechanisms. As is well known, the Markov blanket (MB) is a key concept in Bayesian networks, which can be used to represent the local causal structure of a variable and the selected optimal features for multi-label feature selection. To select causal features for multi-label learning, in this paper the Parents and Children (PC) of each label are discovered via the Hiton method. Then, we distinguish parents from children and search for the Spouses (SP) of each label based on neighborhood conditional mutual information. Moreover, the equivalent-information phenomenon in multi-label datasets can cause some features to be ignored; a conditional independence test metric is designed to retrieve these ignored features. In addition, we search for common features between relevant labels and label-specific features for a single label. Finally, we propose a Multi-label Causal Feature Selection with Neighbourhood Mutual Information algorithm, called MCFS-NMI. To verify the performance of MCFS-NMI, we compare it with five well-established multi-label feature selection algorithms on six datasets. Experimental results show that the proposed algorithm achieves highly competitive performance against all comparing algorithms.
Article
Learning paths are curated sequences of resources organized in a way that a learner has all the prerequisite knowledge needed to achieve their learning goals. In this article, we systematically map the techniques and algorithms that are needed to create such learning paths automatically. We focus on open educational resources (OER), though a similar approach can be used with other types of learning objects. Our method of mapping goes through three passes of selected literature. First, we selected all articles mentioning OER and machine learning from IEEE, SCOPUS, and ACM. This resulted in a set of 347 papers after removing duplicates. Of these, 13 were selected as relating to learning paths and their references and citations were identified and organized into eight categories identified in this article (metadata, linked data, recommendation systems, concept maps, knowledge graphs, classification, and learning paths). After identifying these topics, a manual review was conducted resulting in the final set of 112 papers. This article combines the found categories into three steps for learning path creation, which are then discussed in detail. These steps are as follows: 1) concept extraction; 2) relationship mapping; and 3) path creation. Current research relates primarily to enhancing concept extraction and relationship mapping. We identify directions for potential future research that focus on automatically augmenting previously created learning paths in accordance with the changing needs of learners.
Chapter
When a new request comes in for existing software, determining whether there will be reuse and where the new requests map into the existing design are important problems. Since this process is done manually by developers in our working context, it depends on experience and domain knowledge; moreover, it is an error-prone and time-consuming process due to the human factor. The main purpose of this study is to correctly predict which new requests in the System Design Document (SDD) match which feature set in the existing software's Software Requirement Specification (SRS) document. We treat the feature mapping problem between SDD items and SRS requirements as a multi-label multi-class classification problem. Zemberek, a Turkish natural language processing library, is used for preprocessing and feature extraction on the SRS document of the existing software and on three SDD documents of different systems to which this software will be delivered. The features extracted from the SRS document are categorized under a certain number of feature topics using the LDA algorithm. The FastText algorithm and the AdaBoost-based classifier ICSIBoost are used to decide which of the topics from the SRS document represents a feature in the SDD document, and the predictions are compared with topics determined manually by experts. ICSIBoost achieves 67% to 90% precision in topic predictions, whereas the FastText algorithm does not meet our expectations on small and imbalanced data. Keywords: Software product line, Topic modeling, Feature mapping, Turkish NLP, Multi-label multi-class classification.
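
The topic-extraction step can be sketched with scikit-learn's LDA as follows; the toy English requirement sentences stand in for the Turkish SRS text that the real pipeline preprocesses with Zemberek.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy English stand-ins for the Turkish SRS requirement sentences.
docs = ["user shall log in with password",
        "system shall export report as pdf",
        "user shall reset password by email",
        "system shall print report summary"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mixture = lda.fit_transform(counts)   # per-requirement topic proportions
```
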
Article
Full-text available
Context: There are many datasets for training and evaluating models to detect web attacks, labeling each request as normal or attack. Web attack protection tools must provide additional information on the type of attack detected, in a clear and simple way. Objectives: This paper presents a new multi-label dataset for classifying web attacks based on the CAPEC classification, a new way of extracting features based on ASCII values, and an evaluation of several combinations of models and algorithms. Methods: Using a new way to extract features, by computing the average of the ASCII values of the characters in each field that composes a web request, several combinations of algorithms (LightGBM and CatBoost) and multi-label classification models are evaluated to provide a complete CAPEC classification of the web attacks that a system is suffering. The training and test data come from the new SR-BH 2020 multi-label dataset. Results: Computing the average of the ASCII values of the characters that make up a web request shows its usefulness for numeric encoding and feature extraction. The new SR-BH 2020 multi-label dataset allows the training and evaluation of multi-label classification models, also allowing the CAPEC classification of the various attacks that a web system is undergoing. The combination of the two-phase model with the MultiOutputClassifier module of the scikit-learn library, together with the CatBoost algorithm, shows its superiority in classifying attacks in the different criticality scenarios. Conclusion: Experimental results indicate that the combination of machine learning algorithms and multi-phase models leads to improved prediction of web attacks. Also, the use of a multi-label dataset is suitable for training learning models that provide information about the type of attack.
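
The described feature extraction is simple enough to sketch directly; the field names below are illustrative, not the dataset's actual schema.

```python
def ascii_feature(field: str) -> float:
    """Average ASCII code point of the characters in one request field,
    as in the described feature extraction."""
    return sum(ord(c) for c in field) / len(field) if field else 0.0

# One numeric feature per field; the field names are illustrative only.
request = {"method": "GET", "path": "/login", "body": "user=admin'--"}
features = [ascii_feature(v) for v in request.values()]
```
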
Article
Full-text available
Machine learning is a field composed of various pillars. Traditionally, supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL) are the dominating learning paradigms that have inspired the field since the 1950s. Based on these, thousands of different methods have been developed during the last seven decades and used in nearly all application domains. Recently, however, other learning paradigms have been gaining momentum which complement and extend the above learning paradigms significantly. These are multi-label learning (MLL), semi-supervised learning (SSL), one-class classification (OCC), positive-unlabeled learning (PUL), transfer learning (TL), multi-task learning (MTL), and one-shot learning (OSL). The purpose of this article is a systematic discussion of these modern learning paradigms and their connection to the traditional ones. We discuss each of the learning paradigms formally by defining key constituents and paying particular attention to the data requirements, allowing an easy connection to applications; that is, we assume a data-driven perspective. This perspective also allows a systematic identification of relations between the individual learning paradigms in the form of a learning-paradigm graph (LP-graph). Overall, the LP-graph establishes a taxonomy among 10 different learning paradigms. This article is categorized under: Technologies > Machine Learning; Application Areas > Science and Technology; Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining.
Article
Multi-view multi-label learning tasks often appear in various critical data classification scenarios. Each training sample has multiple heterogeneous data views associated with multiple labels in this learning framework simultaneously. Nevertheless, most existing methods do not consider that a single view cannot fully predict all unknown labels caused by non-aligned views, which leads to insufficient consideration of the relationship between the features and labels of each view, and the learning effect is not ideal. In this paper, we develop a novel method that uses view-specific labels and label-feature dependence maximization. Concretely, we first assume that each view and its corresponding label space have a smooth local structure. In this way, the view-specific label learning model is constructed, enhancing the consistency and complementarity of label space information. Then, multiple multi-label classifiers are constructed by maximizing label-feature dependence. Finally, the linear classification model is extended to the nonlinear, and the prediction stage is combined with the contribution weight of each view. The results of several benchmark datasets show that our proposed method is significantly more effective than the state-of-the-art methods.
Article
Multi-Label Classification (MLC) assumes that each instance belongs to a set of labels, unlike traditional classification, where each instance corresponds to a unique value of a class variable. Calibrated Label Ranking (CLR) is an MLC algorithm that determines a ranking of labels for a given instance by considering a binary classifier for each pair of labels. In this way, it exploits pairwise label correlations. Furthermore, CLR alleviates the class imbalance problem that usually arises in MLC because, in this domain, very few instances often belong to a label. In order to build the binary classifiers in CLR, a standard classification algorithm must be employed. The decision tree method C4.5 has been widely used in this field. In this research, we show that a recently proposed version of C4.5 based on imprecise probabilities, known as Credal C4.5, is more appropriate than C4.5 for handling the binary classification tasks in CLR. Experimental results reveal that Credal C4.5 outperforms C4.5 when both methods are used in CLR, and that the difference is more statistically significant as the label noise level increases.
Article
Multilabel causal feature selection, as a well-known and effective approach to dealing with high-dimensional multilabel data, is a popular topic. A number of causal feature selection algorithms have achieved great success in classification and prediction tasks. However, in many practical applications the descriptive information is collected from different data sources, and few studies focus on causal variable discovery in multisource environments because of the complex causal relationships involved. To address these problems, we propose a causal feature selection framework for multisource environments. Firstly, we mine the causal mechanism with respect to the class attribute under the assumption that only a single data source is included. Secondly, by utilizing the concept of causal invariance from causal inference, we formulate the problem of causal feature selection with multiple data sources as a search for an invariant set across data sources. In addition, we give the upper and lower bounds of the causal invariant set. Finally, we design a novel multisource multilabel causal feature selection (MMCFS) algorithm. To verify its effectiveness, we compare it with 12 feature selection methods on synthetic datasets. Experimental results show that the classification performance of MMCFS is highly competitive against the other comparing algorithms.
Article
Multi-label classification (MLC) has recently attracted increasing interest in the machine learning community. Several studies provide surveys of methods and datasets for MLC, and a few provide empirical comparisons of MLC methods. However, they are limited in the number of methods and datasets considered. This paper provides a comprehensive empirical investigation of a wide range of MLC methods on a wealth of datasets from different domains. More specifically, our study evaluates 26 methods on 42 benchmark datasets using 20 evaluation measures. The evaluation methodology used meets the highest literature standards for designing and conducting large-scale, time-limited experimental studies. First, the methods were selected based on their use in the community to ensure a balanced representation of methods across the MLC taxonomy of methods within the study. Second, the datasets cover a wide range of complexity and application domains. The selected evaluation measures assess the predictive performance and efficiency of the methods. The results of the analysis identify RFPCT, RFDTBR, ECCJ48, EBRJ48, and AdaBoost.MH as the best-performing methods across the spectrum of performance measures. Whenever a new method is introduced, it should be compared with different subsets of MLC methods selected according to relevant (and possibly different) evaluation criteria.
Chapter
The Republic of Tatarstan has always been a pioneer among the regions of Russia in the development of innovative and digital technologies. Tatarstan is changing and adapting to remain relevant in the healthcare, economic, and transport areas. The challenges of modern transport systems cannot be solved with strategies and tools from the past, and intelligent transport systems (ITS) are being actively introduced to overcome them. This article describes the development of ITS in the Republic of Tatarstan. Issues related to existing technologies, priority areas of development, and the implementation of a unified ITS environment are considered. Much attention is paid to the ITS subsystems being developed to ensure the safety and comfort of the residents of the Republic.
Article
Full-text available
Multi-label feature selection is essential in many big data applications and plays a significant role in processing high-dimensional data. However, existing online stream feature selection methods ignore the existence of missing labels. Inspired by the neighborhood rough set, which does not require prior knowledge of the feature space, we propose a novel online multi-label stream feature selection algorithm called OFS-Mean. We define a neighborhood relationship that can automatically select an appropriate number of neighbors. Without any prior knowledge of the feature space or its parameters, the algorithm's performance is improved by real-time online prediction of missing labels based on the similarity between an instance and its neighbors. The proposed OFS-Mean divides the feature selection process into two stages, online feature importance evaluation and online redundancy update, to screen important features. With the support of the neighborhood rough set, OFS-Mean can adapt to various types of datasets, improving the algorithm's generalization ability. In the experiments, a similarity test is used to verify the prediction results, and a comparison with a traditional semi-supervised feature selection method, selecting the same number of features, achieves ideal results.
Article
Accurate detection of passable regions in images is important for ensuring the safe navigation of unmanned surface vehicles, especially in inland waterways with irregular waterlines and various obstacles. However, existing methods are susceptible to environmental changes and produce high false-positive rates (FPRs) for confusable textures and complex edge details. We therefore propose a collision-free waterway segmentation network based on deep learning, which yields pixel-level classification results. The segmentation accuracy for indistinguishable textures is improved by learning the context dependency of features through a modified context prior, and the detailed refinement of waterlines and small obstacles is achieved via an asymmetric encoder–decoder structure. To learn the features of waterways as comprehensively as possible, data integration and data augmentation are performed on three public datasets. In addition, a new annotated urban waterway dataset, the Dasha River dataset, is proposed and made publicly available. The proposed model is tested and cross-validated on multiple inland and maritime water segmentation datasets; the results show that it outperforms the current state of the art, with a pixel accuracy (PA) of 97.43% and an FPR of 1.37%.
Article
Full-text available
Understanding users' requirements is essential to developing an effective AI service system, in which users' requirement expressions can be resolved into intent detection and slot filling tasks. In much of the literature, the two tasks are treated as independent and achieve satisfactory performance. Recently, many researchers have found that intent detection and slot filling can benefit each other, since they always appear together in a sentence and may include shared information. Most existing joint models employ encoder-decoder structures and capture the cross-impact between the two tasks by concatenating hidden-state information from two encoders, which ignores the dependencies among slot tags within a specific intent. In this paper, we propose a novel Double-Bi-LSTM-CRF model (DBLC), which can fit the dependencies among hidden slot tags while considering the cross-impact between intent detection and slot filling. We also design and implement an intention chatbot for the tourism area, which can assist users in completing a travel plan through human-computer interaction. Extensive experiments show that our DBLC achieves state-of-the-art results on the benchmark ATIS, SNIPS, and multi-domain datasets.
Article
Full-text available
Classifier chains are an effective technique for modeling label dependencies in multi-label classification. However, the method requires a fixed, static order of the labels. While in theory any order is sufficient, in practice this order has a substantial impact on the quality of the final prediction. Dynamic classifier chains denote the idea that, for each instance to classify, the order in which the labels are predicted is chosen dynamically. The complexity of a naïve implementation of such an approach is prohibitive, because it would require training a sequence of classifiers for every possible permutation of the labels. To tackle this problem efficiently, we propose a new approach based on random decision trees which can dynamically select the label ordering for each prediction. We show empirically that a dynamic selection of the next label improves over the use of a static ordering under an otherwise unchanged random decision tree model. In addition, we also demonstrate an alternative approach based on extreme gradient boosted trees, which allows for a more target-oriented training of dynamic classifier chains. Our results show that this variant outperforms random decision trees and other tree-based multi-label classification methods. More importantly, the dynamic selection strategy considerably speeds up both training and prediction.
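For orientation, the static classifier chain that the dynamic variants above improve on can be sketched as follows (a minimal illustration using scikit-learn logistic regression; the function names and the fixed order are our assumptions, not the authors' code):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_chain(X, Y, order):
    # One binary classifier per label; earlier labels in `order`
    # are appended as extra input features (true labels at train time).
    models, features = [], X
    for j in order:
        clf = LogisticRegression(max_iter=1000).fit(features, Y[:, j])
        models.append(clf)
        features = np.hstack([features, Y[:, [j]]])
    return models

def predict_chain(models, order, x):
    # At prediction time, earlier *predicted* labels feed later classifiers.
    features = x.reshape(1, -1)
    preds = np.zeros(len(order))
    for m, j in zip(models, order):
        preds[j] = m.predict(features)[0]
        features = np.hstack([features, [[preds[j]]]])
    return preds

A dynamic chain would replace the fixed order with a per-instance choice of which label to predict next, which is what the random-decision-tree and boosted-tree variants make tractable.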
Article
Multi-label image recognition is a basic and challenging task in the computer vision and multimedia fields. Graph Convolutional Networks (GCNs) are often used to learn multi-label semantic features and multi-label dependencies. Although the label semantic features in GCNs can represent the global image visual content well, they are rarely applied to local image regions. Therefore, we use GCNs to learn global and local features at the same time and strike a balance between them. In this paper, we present a multi-scale semantic attention model, MS-SGA-GCN, comprising three main modules (MS, SGA, and GCN) for multi-label image recognition. The Multi-Scale (MS) module utilizes feature maps of different sizes to obtain global features and provides strong generalization capability. The Semantic-Guided Attention (SGA) module applies the label embeddings learned by GCNs to guide the generation of cross-modality class-specific attention maps, which can discover the locations of semantically related regions for each label. Experiments show that our model achieves classification accuracies of 83.4% on MS-COCO and 94.2% on PASCAL VOC2007, competitive with other mainstream models.
Article
In Multi-Label Classification (MLC), Classifier Chains (CC) are considered simple and effective methods to exploit correlations between labels. A CC trains a binary classifier per label, in which the previous labels, according to an established order, are used as additional features. The label order strongly influences the performance of the CC, and so far there is no way to determine the optimal order. In this work, a new label ordering method based on label correlations is proposed. It uses a non-parametric model based on imprecise probabilities to estimate the correlations between pairs of labels. It then employs a greedy procedure that, to insert labels into the chain, considers the correlations among the candidate labels and the ones already inserted, as well as the correlations between the candidate labels and those not yet inserted. We argue that our proposal presents some advantages over the correlation-based label ordering methods for CC developed so far, and we show experimentally that it achieves better results than those methods.
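The greedy insertion procedure can be pictured with a much simpler stand-in (our sketch orders labels by plain Pearson correlations instead of the paper's imprecise-probability estimates; it assumes no constant label columns, and all names are ours):

import numpy as np

def greedy_label_order(Y):
    # Start from the label most correlated with the rest, then repeatedly
    # append the remaining label with the highest total |correlation| to
    # the labels already in the chain.
    q = Y.shape[1]
    corr = np.abs(np.corrcoef(Y, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    order = [int(np.argmax(corr.sum(axis=1)))]
    remaining = set(range(q)) - set(order)
    while remaining:
        nxt = max(remaining, key=lambda j: corr[j, order].sum())
        order.append(nxt)
        remaining.remove(nxt)
    return order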
Article
This study proposes XGBoost with correlation-based and knowledge-based functions (CK-XGBoost) to design the flotation backbone process according to the natural properties of a copper mine. The decision-making of the flotation backbone process of a copper mine is a multi-label problem with high feature dimension and a small instance set. The proposed CK-XGBoost selects label-specific features and trains a binary XGBoost for each label, and the information of other labels is used to assist classifier modeling through the correlation-based and knowledge-based functions. The correlation-based function utilizes the Pearson correlation coefficient between labels in the training set, so that the predicted values of two strongly positively correlated labels are similar and the predicted values of two strongly negatively correlated labels are contrary. The knowledge-based function utilizes domain knowledge, which is independent of the distribution of the training set. The experimental results demonstrate the significant superiority of the proposed CK-XGBoost in the decision-making of the copper flotation backbone process.
Article
Multi-label classification (MLC) is one of the challenging tasks in computer vision, where it confronts high-dimensional problems in both the output label and input feature spaces. This paper proposes solving MLC through multi-output residual embedding (MoRE), which learns an appropriate distance metric by analyzing the residuals between input and output spaces. Unlike traditional MLC paradigms that learn relationships between label space and feature space, our approach further learns a low-rank structure in the residuals between input and output spaces, and encodes this residual projection to achieve dimension reduction in label space, enhancing performance on high-dimensional MLC tasks. Furthermore, considering the label correlations between instances and their neighbors, multiple residuals of instances' neighbors are also incorporated into the model to learn a more appropriate distance metric in the same way. Overall, with residual embedding learned from instances and their neighbors, the obtained metric captures a more appropriate low-rank structure in label space to handle the high-dimensional problem in MLC. Experimental results on several datasets, such as Cal500, Corel5k, Bibtex, Delicious, Tmc2007, 20ng, Mirflickr, and Rcv1s1, demonstrate the excellent predictive performance of MoRE relative to state-of-the-art methods such as LMMO-kNN, M3MDC, KRAM, SEEM, CPLST, CSSP, and FaIE.
Conference Paper
Online social networking (OSN) platforms are in very common use today. These sites allow their users to post messages or write comments in an area called the user wall. However, OSNs provide users with very little control over the content of messages posted on their walls. In this paper, the proposed system gives the user the ability to control the content of posted messages. The messages are classified into categories such as Sexual, Political, Religious, and Vulgar. Based on the output of the classification, filtering rules decide whether messages are posted on the user walls. The classification is done using two different classifiers, a Radial Basis Probabilistic Neural Network and a Radial Basis Function Network. A comparative performance study shows the effectiveness of the Radial Basis Probabilistic Neural Network, which produces more accurate results than the latter classifier. The proposed system also recommends filtering rules to users using a collaborative filtering technique.
Article
Text classification has been widely explored in natural language processing. In this article, we propose a novel adaptive dense ensemble model (AdaDEM) for text classification, which includes a local ensemble stage (LES) and a global dense ensemble stage (GDES). To strengthen the classification ability and robustness of the enhanced layer, we propose a selective ensemble model based on enhanced attention convolutional neural networks (EnCNNs). To increase the diversity of the ensemble system, these EnCNNs are generated in two ways: 1) from different sample subsets and 2) with different granularity kernels. An evaluation criterion that considers both accuracy and diversity is then proposed in the LES to obtain effective integration results. Furthermore, to make better use of information flow, we develop an adaptive dense ensemble structure with multiple enhanced layers in the GDES to mitigate the issue that there may be redundant or invalid enhanced layers in the cascade structure. We conducted extensive experiments against state-of-the-art methods on multiple real-world datasets, including long and short texts, verifying the effectiveness and generality of our method.
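The accuracy-plus-diversity selection step admits a compact generic sketch (ours, not the paper's: the greedy loop, the pairwise-disagreement diversity measure, and the trade-off weight alpha are all illustrative assumptions):

import numpy as np

def select_ensemble(preds, y, k=5, alpha=0.7):
    # preds: list of prediction vectors, one per candidate classifier.
    # Greedily add the candidate with the best blend of accuracy and
    # mean disagreement with the members already chosen.
    chosen, pool = [], list(range(len(preds)))
    while pool and len(chosen) < k:
        def score(i):
            acc = np.mean(preds[i] == y)
            if not chosen:
                return acc
            div = np.mean([np.mean(preds[i] != preds[j]) for j in chosen])
            return alpha * acc + (1 - alpha) * div
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen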
Article
Multi-label text classification aims at assigning more than one class to a given text document, which makes the task both more ambiguous and more challenging. The ambiguity comes from the fact that several labels in the prescribed label set are often semantically close to each other, making a clear demarcation between them difficult. As a consequence, any machine-learning-based approach to multi-label classification needs to define its feature space with features beyond linguistic or semi-linguistic ones, so that the semantic closeness between the labels is also taken into account. The present work describes a feature extraction scheme in which the training document set and the prescribed label set are intertwined in a novel way to capture the ambiguity meaningfully. In particular, experiments were conducted using Topic Modeling and Fuzzy C-Means clustering, which measure the underlying uncertainty with probability-based and membership-based measures, respectively. Several nonparametric hypothesis tests establish the effectiveness of the features obtained through Fuzzy C-Means clustering in multi-label classification. A new algorithm is proposed for training the system for multi-label classification using the above set of features.
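The membership-based uncertainty measure can be made concrete with the standard fuzzy C-means membership formula (our sketch; cluster centers are assumed to come from a separate clustering run, and this is not the authors' pipeline):

import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # u[i, c] = 1 / sum_j (d_ic / d_ij)^(2/(m-1)), the classic FCM
    # membership of instance i in cluster c; rows sum to 1 and can be
    # used directly as uncertainty-aware features.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                 # avoid division by zero
    power = 2.0 / (m - 1.0)
    ratio = (d[:, :, None] / d[:, None, :]) ** power
    return 1.0 / ratio.sum(axis=2)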
Article
Full-text available
This dissertation introduces a new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks. The Concept Learning model emphasizes the role of manual and automated feature selection and classifier formation in text classification. It enables drawing on results from statistics and machine learning to explain the effectiveness of alternate representations of text, and specifies desirable characteristics of text representations. The use of syntactic parsing to produce indexing phrases has been widely investigated as a possible route to better text representations. Experiments with syntactic phrase indexing, however, have never yielded significant improvements in text retrieval performance. The Concept Learning model suggests that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics. The application of term clustering to this representation, to improve its statistical properties while retaining its desirable meaning properties, is proposed. Standard term clustering strategies from information retrieval (IR), based on co-occurrence of indexing terms in documents or groups of documents, were tested on a syntactic indexing phrase representation. In experiments using a standard text retrieval test collection, small effectiveness improvements were obtained. As a means of evaluating representation quality, a text retrieval test collection introduces a number of confounding factors. In contrast, the text categorization task allows much cleaner determination of text representation properties. In preparation for the use of text categorization to study text representation, a more effective and theoretically well-founded probabilistic text categorization algorithm was developed, building on work by Maron, Fuhr, and others. Text categorization experiments supported a number of predictions of the Concept Learning model about properties of phrasal representations, including dimensionality properties not previously measured for text representations. However, in carefully controlled experiments using syntactic phrases produced by Church's stochastic bracketer, in conjunction with reciprocal nearest neighbor clustering, term clustering was found to produce essentially no improvement in the properties of the phrasal representation. New cluster analysis approaches are proposed to remedy the problems found in traditional term clustering methods.
Article
Full-text available
We study online learning algorithms that predict by combining the predictions of several subordinate prediction algorithms, sometimes called "experts." These simple algorithms belong to the multiplicative weights family of algorithms. The performance of these algorithms degrades only logarithmically with the number of experts, making them particularly useful in applications where the number of experts is very large. However, in applications such as text categorization, it is often natural for some of the experts to abstain from making predictions on some of the instances. We show how to transform algorithms that assume that all experts are always awake to algorithms that do not require this assumption. We also show how to derive corresponding loss bounds. Our method is very general, and can be applied to a large family of online learning algorithms. We also give applications to various prediction models including decision graphs and "switching" experts.
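A bare-bones rendering of this "sleeping experts" idea (our simplification: abstaining experts simply keep their weights unchanged, whereas the paper's transformation handles renormalization more carefully; beta and all names are illustrative):

import numpy as np

def sleeping_predict(weights, advice, awake):
    # Combine only the awake experts' advice (values in [0, 1]);
    # assumes at least one expert is awake this round.
    w = weights * awake
    return np.dot(w, advice) / w.sum()

def sleeping_update(weights, advice, awake, outcome, beta=0.8):
    # Multiplicative update on awake experts only; sleepers keep weight.
    losses = np.abs(advice - outcome)
    return weights * np.where(awake, beta ** losses, 1.0)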
Conference Paper
Full-text available
A new boosting algorithm of Freund and Schapire is used to improve the performance of decision trees constructed using the information ratio criterion of Quinlan's C4.5 algorithm. This boosting algorithm iteratively constructs a series of decision trees, each decision tree being trained and pruned on examples that have been filtered by previously trained trees. Examples that have been incorrectly classified by the previous trees in the ensemble are resampled with higher probability to give a new probability distribution for the next tree in the ensemble to train on. Results from optical character recognition (OCR) and knowledge discovery and data mining problems show that, in comparison to single trees, or to trees trained independently, or to trees trained on subsets of the feature space, the boosting ensemble is much better.
Conference Paper
Full-text available
We are interested in the problem of understanding fluently spoken language. In particular, we consider people's responses to the open-ended prompt of "How may I help you?". We then further restrict the problem to classifying and automatically routing such a call, based on the meaning of the user's response. Thus, we aim at extracting a relatively small number of semantic actions from the utterances of a very large set of users who are not trained in the system's capabilities and limitations. In this paper, we describe the main components of our speech understanding system: the large vocabulary recognizer and the language understanding module performing the call-type classification. In particular, we propose automatic algorithms for selecting phrases from a training corpus in order to enhance the prediction power of the standard word n-gram. The phrase language models are integrated into stochastic finite state machines which outperform standard word n-gram language models. From the speech recognizer output we recognize and exploit automatically acquired salient phrase fragments to make a call-type classification. This system is evaluated on a database of 10,000 fluently spoken utterances collected from interactions between users and human agents.
Conference Paper
Full-text available
We are interested in providing automated services via natural spoken dialog systems. Many issues arise when such systems are targeted at large populations of non-expert users. In this paper, we describe an experimental vehicle for exploring these issues: automatically routing calls based on a user's fluently spoken response to open-ended prompts such as "How may I help you?". A spoken dialog system for call-routing has been constructed, with subsequent processing for information retrieval and form-filling. To enable experimental evaluations, a database of 10,000 fluently spoken transactions between customers and human agents has been generated. We report preliminary experimental results for that database.
Article
Full-text available
The authors' adaptive resampling approach surpasses previous decision-tree performance and validates the effectiveness of small, pooled local dictionaries. They demonstrate their approach using the Reuters-21578 benchmark data and a real-world customer e-mail routing system.
Article
Full-text available
Uncertainty sampling methods iteratively request class labels for training instances whose classes remain uncertain despite the previously labeled instances. These methods can greatly reduce the number of instances that an expert must label. One problem with this approach is that the classifier best suited for an application may be too expensive to train or use during the selection of instances. We test the use of one classifier (a highly efficient probabilistic one) to select examples for training another (the C4.5 rule induction program). Despite being chosen by this heterogeneous approach, the uncertainty samples yielded classifiers with lower error rates than random samples ten times larger.
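A single round of this heterogeneous scheme might look roughly as follows (our sketch, with scikit-learn's MultinomialNB standing in for the cheap probabilistic selector, a decision tree standing in for C4.5, count-valued features assumed, and the full iterative loop collapsed to one pass):

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

def uncertainty_sample(X_labeled, y_labeled, X_pool, y_pool, n_pick=100):
    # Cheap probabilistic selector scores the unlabeled pool; the
    # instances nearest 0.5 probability are the most uncertain ones.
    nb = MultinomialNB().fit(X_labeled, y_labeled)
    p = nb.predict_proba(X_pool)[:, 1]          # binary task assumed
    uncertain = np.argsort(np.abs(p - 0.5))[:n_pick]
    # Train the expensive classifier on the newly labeled sample.
    X_train = np.vstack([X_labeled, X_pool[uncertain]])
    y_train = np.concatenate([y_labeled, y_pool[uncertain]])
    return DecisionTreeClassifier().fit(X_train, y_train)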
Article
Full-text available
This paper examines the use of inductive learning to categorize natural language documents into predefined content categories. Categorization of text is of increasing importance in information retrieval and natural language processing systems. Previous research on automated text categorization has mixed machine learning and knowledge engineering methods, making it difficult to draw conclusions about the performance of particular methods. In this paper we present empirical results on the performance of a Bayesian classifier and a decision tree learning algorithm on two text categorization data sets. We find that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives. The stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization. However, even this algorithm is aided by an initial prefiltering of features, confirming the results...
Chapter
A significant problem in many information filtering systems is the dependence on the user for the creation and maintenance of a user profile, which describes the user's interests. NewsWeeder is a netnews-filtering system that addresses this problem by letting the user rate his or her interest level for each article being read (1-5), and then learning a user profile based on these ratings. This paper describes how NewsWeeder accomplishes this task, and examines the alternative learning methods used. The results show that a learning algorithm based on the Minimum Description Length (MDL) principle was able to raise the percentage of interesting articles shown to users from 14% to 52% on average. Further, this approach significantly outperformed (by 21%) one of the most successful techniques in Information Retrieval (IR), term-frequency/inverse-document-frequency (tf-idf) weighting.
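For reference, the tf-idf weighting that NewsWeeder is compared against is, in its plainest variant, weight(t, d) = tf(t, d) * log(N / df(t)); the toy example below is ours:

import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Term frequency times log inverse
    # document frequency, one weight dict per document.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["boosting", "text"], ["text", "filtering", "text"]]
print(tfidf(docs))   # "text" appears everywhere, so its weight is 0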
Article
Recent research in machine learning has been concerned with scaling up to large data sets. Since information retrieval is a domain where such data sets are widespread, it provides an ideal application area for machine learning. This paper studies the ability of symbolic learning algorithms to perform a text categorization task. This ability depends on both text representation and feature filtering. We present a unified view of text categorization systems, focusing on the selection of features. A new selection technique, SCAR, is proposed for k-DNF (disjunctive normal form) learners and evaluated on the Reuters financial data set. Even though our experimental results do not outperform earlier approaches, they give rise to promising perspectives.
Article
A number of techniques have been studied for the automatic assignment of controlled subject headings and classifications from free indexing. These techniques involve the automatic manipulation and truncation of the free-index phrases assigned to a document and the use of a manually-constructed thesaurus and automatically-generated dictionaries together with statistical ranking and weighting methods. These are based on the use of a statistically-generated ‘adhesion coefficient’ which reflects the degree of association between the free-indexing terms, the controlled subject headings, and the classifications. By the analysis of a large sample of manually-indexed documents the system generates dictionaries of free-language and controlled-language terms together with their associated classifications and adhesion coefficients. Having learnt from the manually-indexed documents the system uses these dictionaries in the subsequent automatic classification procedure. The accuracy and cost-effectiveness of the automatically-assigned subject headings and classifications has been compared with that of the manual system. The results were encouraging and the costs comparable to those of a manual system.
Article
Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be used to categorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. Despite this, we show for one data set that fax quality images can be categorized with nearly the same accuracy as the original text. Further, the categorization system can be trained on noisy OCR output, without need for the true text of any image, or for editing of OCR output. The use of a vector space classifier and training method robust to large feature sets, combined with discarding of low frequency OCR output strings, are the keys to our approach.
Article
This paper describes experimental results on using Winnow and Weighted-Majority based algorithms on a real-world calendar scheduling domain. These two algorithms have been highly studied in the theoretical machine learning literature. We show here that these algorithms can be quite competitive practically, outperforming the decision-tree approach currently in use in the Calendar Apprentice system in terms of both accuracy and speed. One of the contributions of this paper is a new variant on the Winnow algorithm (used in the experiments) that is especially suited to conditions with string-valued classifications, and we give a theoretical analysis of its performance. In addition we show how Winnow can be applied to achieve a good accuracy/coverage tradeoff and explore issues that arise such as concept drift. We also provide an analysis of a policy for discarding predictors in Weighted-Majority that allows it to speed up as it learns.
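For orientation, the classic binary-feature Winnow update that the paper's string-valued variant generalizes can be sketched as follows (our single-pass illustration with the standard threshold and promotion factor; all names are ours):

import numpy as np

def winnow_train(X, y, alpha=2.0):
    # X: (n, d) 0/1 features; y: 0/1 labels. Promote weights of active
    # features on false negatives, demote them on false positives.
    n, d = X.shape
    w = np.ones(d)
    theta = d                      # standard threshold: number of features
    for xi, yi in zip(X, y):       # one pass; repeat for more epochs
        pred = 1 if np.dot(w, xi) >= theta else 0
        if pred == 0 and yi == 1:
            w[xi == 1] *= alpha    # promotion
        elif pred == 1 and yi == 0:
            w[xi == 1] /= alpha    # demotion
    return w, theta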
Article
We are interested in providing automated services via natural spoken dialog systems. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. There are many issues that arise when such systems are targeted for large populations of non-expert users. In this paper, we focus on the task of automatically routing telephone calls based on a user's fluently spoken response to the open-ended prompt of "How may I help you?". We first describe a database generated from 10,000 spoken transactions between customers and human agents. We then describe methods for automatically acquiring language models for both recognition and understanding from such data. Experimental results evaluating call-classification from speech are reported for that database. These methods have been embedded within a spoken dialog system, with subsequent processing for information retrieval and form-filling.
Conference Paper
Since October 1985, the automatic indexing system AIR/PHYS has been used in the input production of the physics database of the Fachinformationszentrum Karlsruhe, West Germany. The texts to be indexed are abstracts written in English, and the system of descriptors is prescribed. For the application of the AIR/PHYS system, a large-scale dictionary containing more than 600,000 word–descriptor and phrase–descriptor relations has been developed. Most of these relations have been obtained by means of statistical and heuristic methods; in consequence, the relation system is rather imperfect, and the indexing system therefore needs some fault-tolerating features. An appropriate indexing approach and the corresponding structure of the AIR/PHYS system are described. Finally, the conditions of the application as well as problems of further development are discussed.
Conference Paper
We describe the results of extensive machine learning experiments on large collections of Reuters’ English and German newswires. The goal of these experiments was to automatically discover classification patterns that can be used for assignment of topics to the individual newswires. Our results with the English newswire collection show a very large gain in performance as compared to published benchmarks, while our initial results with the German newswires appear very promising. We present our methodology, which seems to be insensitive to the language of the document collections, and discuss issues related to the differences in results that we have obtained for the two collections.
Article
In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone–Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in ℝⁿ. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.
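The boosting algorithm derived in the second part is AdaBoost; its binary form reduces to the short loop below (a sketch under our assumptions: labels in {-1, +1}, and weak_learn is a placeholder for any learner that accepts a distribution D over examples and returns a callable hypothesis):

import numpy as np

def adaboost(X, y, weak_learn, rounds=50):
    # Maintain a distribution D over examples, reweight toward mistakes,
    # and vote hypotheses with weights 0.5 * ln((1 - eps) / eps).
    n = len(y)
    D = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        h = weak_learn(X, y, D)          # returns h with h(X) in {-1, +1}
        pred = h(X)
        eps = D[pred != y].sum()
        if eps == 0 or eps >= 0.5:       # perfect or no-better-than-chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        D *= np.exp(-alpha * y * pred)   # upweight mistakes
        D /= D.sum()
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))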
Article
Recent developments in the storage, retrieval, and manipulation of large text files are described. The text analysis problem is examined, and modern approaches leading to the identification and retrieval of selected text items in response to search requests are discussed.
Article
The boosting algorithm AdaBoost, developed by Freund and Schapire, has exhibited outstanding performance on several benchmark problems when using C4.5 as the "weak" algorithm to be "boosted." Like other ensemble learning approaches, AdaBoost constructs a composite hypothesis by voting many individual hypotheses. In practice, the large amount of memory required to store these hypotheses can make ensemble methods hard to deploy in applications. This paper shows that by selecting a subset of the hypotheses, it is possible to obtain nearly the same levels of performance as the entire set. The results also provide some insight into the behavior of AdaBoost.
Article
In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric, called the correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. Empirical results indicate that our approach outperforms the best published results on this Reuters collection. In particular, our new feature selection method yields considerable improvement. We also investigate the usability of our automated learning approach by actually developing a system that categorizes texts into a tree of categories. We compare the accuracy of our learning approach to a rule-based, expert system approach that uses a text categorization shell built by Carnegie Group. Although our automated learning approach still gives a lower accuracy, by appropriately incorporating a set of manually chosen words to use as features, the combined, semi-automated approach yields accuracy close to the rule-based approach...
Article
Abstraction, Inductive Learning and Probabilistic Assumptions. Norbert Fuhr and Ulrich Pfeifer, University of Dortmund, Dortmund, Germany. We show that former approaches in probabilistic information retrieval are based on one or two of the three concepts of abstraction, inductive learning and probabilistic assumptions, and we propose a new approach which combines all three concepts. This approach is illustrated for the case of indexing with a controlled ...
Article
Breiman's bagging and Freund and Schapire's boosting are recent methods for improving the predictive power of classifier learning systems. Both form a set of classifiers that are combined by voting, bagging by generating replicated bootstrap samples of the data, and boosting by adjusting the weights of training instances. This paper reports results of applying both techniques to a system that learns decision trees and testing on a representative collection of datasets. While both approaches substantially improve predictive accuracy, boosting shows the greater benefit. On the other hand, boosting also produces severe degradation on some datasets. A small change to the way that boosting combines the votes of learned classifiers reduces this downside and also leads to slightly better results on most of the datasets considered.
Article
Many existing rule learning systems are computationally expensive on large noisy datasets. In this paper we evaluate the recently proposed rule learning algorithm IREP on a large and diverse collection of benchmark problems. We show that while IREP is extremely efficient, it frequently gives error rates higher than those of C4.5 and C4.5rules. We then propose a number of modifications resulting in an algorithm RIPPERk that is very competitive with C4.5rules with respect to error rates, but much more efficient on large samples. RIPPERk obtains error rates lower than or equivalent to C4.5rules on 22 of 37 benchmark problems, scales nearly linearly with the number of training examples, and can efficiently process noisy datasets containing hundreds of thousands of examples.
Article
One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the ...
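The margin definition quoted above translates directly into code (our sketch of the normalized, weighted form used in this line of analysis; votes, weights, and all names are illustrative):

import numpy as np

def voting_margins(votes, weights, y):
    # votes[t, i]: label predicted by hypothesis t on example i.
    # Margin of example i: (weight voting for the correct label) minus
    # (largest weight voting for any one incorrect label), normalized.
    T, n = votes.shape
    labels = np.unique(votes)
    total = weights.sum()
    margins = np.empty(n)
    for i in range(n):
        tally = {c: weights[votes[:, i] == c].sum() for c in labels}
        correct = tally.get(y[i], 0.0)
        wrong = max((v for c, v in tally.items() if c != y[i]), default=0.0)
        margins[i] = (correct - wrong) / total
    return margins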
Article
This paper describes a new technique for solving multiclass learning problems by combining Freund and Schapire's boosting algorithm with the main ideas of Dietterich and Bakiri's method of error-correcting output codes (ECOC). Boosting is a general method of improving the accuracy of a given base or "weak" learning algorithm. ECOC is a robust method of solving multiclass learning problems by reducing to a sequence of two-class problems. We show that our new hybrid method has the advantages of both: like ECOC, our method only requires that the base learning algorithm work on binary-labeled data; like boosting, we prove that the method comes with strong theoretical guarantees on the training and generalization error of the final combined hypothesis, assuming only that the base learning algorithm performs slightly better than random guessing. Although previous methods were known for boosting multiclass problems, the new method may be significantly faster and require less programming effort in cr...
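The ECOC half of the hybrid can be sketched in isolation (a rough illustration of Dietterich and Bakiri's reduction, not the combined boosting method; code, learn_binary, and the 0/1 codewords are our assumptions):

import numpy as np

def ecoc_train(X, y, code, learn_binary):
    # code[c] is the 0/1 codeword for class c; one binary learner per bit.
    # learn_binary(X, bits) returns a predictor mapping X to {0, 1}.
    return [learn_binary(X, code[y, b]) for b in range(code.shape[1])]

def ecoc_predict(models, code, Xq):
    # Decode by nearest codeword in Hamming distance.
    out = np.stack([m(Xq) for m in models], axis=1)        # (n, bits)
    dists = np.abs(out[:, None, :] - code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)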
Article
A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant, and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.
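Used directly as a classifier, Rocchio in its simplest form keeps one centroid per class and assigns documents by cosine similarity (our sketch; it drops the relevance-feedback weighting terms, and assumes the rows of X are tf-idf vectors):

import numpy as np

def rocchio_train(X, y, num_classes):
    # One prototype per class: the normalized centroid of its documents.
    protos = np.stack([X[y == c].mean(axis=0) for c in range(num_classes)])
    return protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-12)

def rocchio_predict(protos, Xq):
    # Assign each query document to the most cosine-similar prototype.
    Xn = Xq / (np.linalg.norm(Xq, axis=1, keepdims=True) + 1e-12)
    return (Xn @ protos.T).argmax(axis=1)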
Article
The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. Existing classification schemes which ignore the hierarchical structure and treat the topics as separate classes are often inadequate in text classification, where there is a large number of classes and a huge number of relevant features needed to distinguish between them. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to utilize more complex (probabilistic) models, without encountering many of the standard computational and robustness difficulties.
Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 546–551.
Margineantu, D. D., & Dietterich, T. G. (1997). Pruning adaptive boosting. In Machine Learning: Proceedings of the Fourteenth International Conference, pp. 211–218.
Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Machine Learning: Proceedings of the Fourteenth International Conference, pp. 143–151.
Lewis, D. (1992). Representation and learning in information retrieval. Ph.D. thesis, Technical Report 91-93, Computer Science Department, University of Massachusetts at Amherst.
Lewis, D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Machine Learning: Proceedings of the Eleventh International Conference.
Ittner, D. J., Lewis, D. D., & Ahn, D. D. (1995). Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pp. 301–315. Las Vegas, NV: ISRI, University of Nevada, Las Vegas.
Breiman, L. (1998). Arcing classifiers. The Annals of Statistics, 26(3), 801–849.
van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, London.