Conference Paper

News Articles Classification Using Random Forests and Weighted Multimodal Features

Abstract

This research investigates the problem of news article classification. The classification is performed using N-gram textual features extracted from text and visual features generated from one representative image. The application domain is news articles written in English that belong to four categories: Business-Finance, Lifestyle-Leisure, Science-Technology and Sports, downloaded from three well-known news websites (BBC, Reuters, and TheGuardian). Various classification experiments have been performed with the Random Forests machine learning method using N-gram textual features and visual features from a representative image. Using the N-gram textual features alone led to much better accuracy results (84.4%) than using the visual features alone (53%). However, the use of both N-gram textual features and visual features led to slightly better accuracy results (86.2%). The main contribution of this work is the introduction of a news article classification framework based on Random Forests and multimodal features (textual and visual), as well as the late fusion strategy that makes use of Random Forests' operational capabilities.
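To make the fusion idea concrete, here is a minimal late-fusion sketch in the spirit of the framework described above: one Random Forest per modality, with class probabilities combined by a weighted average. It assumes scikit-learn and uses random stand-in feature matrices; the per-modality weights are illustrative placeholders, not the weighting scheme derived in the paper.

```python
# Sketch of late fusion with two Random Forests, one per modality.
# X_text (n-gram frequency vectors) and X_visual (image descriptors) are random
# stand-ins; the fusion weights below are illustrative, not the paper's scheme.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def late_fusion_predict(rf_text, rf_visual, X_text, X_visual, w_text=0.7, w_visual=0.3):
    """Weighted average of the two forests' class-probability outputs."""
    probs = w_text * rf_text.predict_proba(X_text) + w_visual * rf_visual.predict_proba(X_visual)
    return probs.argmax(axis=1)

# Toy data standing in for the real features (4 news categories).
rng = np.random.default_rng(0)
y = rng.integers(0, 4, size=400)
X_text = rng.random((400, 195))    # e.g. 100 unigrams + 50 bigrams + 30 trigrams + 15 four-grams
X_visual = rng.random((400, 320))  # e.g. concatenated MPEG-7 descriptor values

Xt_tr, Xt_te, Xv_tr, Xv_te, y_tr, y_te = train_test_split(
    X_text, X_visual, y, test_size=0.25, random_state=0)

rf_text = RandomForestClassifier(n_estimators=500, random_state=0).fit(Xt_tr, y_tr)
rf_visual = RandomForestClassifier(n_estimators=500, random_state=0).fit(Xv_tr, y_tr)

y_pred = late_fusion_predict(rf_text, rf_visual, Xt_te, Xv_te)
print("late-fusion accuracy:", accuracy_score(y_te, y_pred))
```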
... Alvari et al. [5] used a semi-supervised SVM to detect such Web pages, and Ahmadi et al. [4] proposed to identify pornographic Web pages using a decision tree with meta-features. However, classifying text mostly centers on the idea of counting the frequencies with which terms of a lexicon appear in the text to form a feature vector and applying those feature vectors to train a classifier. Kohonen self-organizing neural network [49,50], nearest neighbor [19], Bayesian [32], naïve Bayesian [34,53,101], SVM [34,83], and random forest [56] are among the applied Web page classifiers based on term-frequency feature vectors. Liparas et al. [56] applied the frequencies of the 100 most frequent unigrams, 50 most frequent bigrams, 30 most frequent trigrams, and 15 most frequent four-grams to classify the textual content of news article Web pages into four categories, using random forest. ...
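As a rough illustration of this kind of term-frequency N-gram representation, the sketch below builds feature vectors from the most frequent N-grams of each order, following the 100/50/30/15 split quoted above; the toy corpus and preprocessing are purely illustrative.

```python
# Sketch: term-frequency vectors from the most frequent N-grams of each order,
# following the 100/50/30/15 split quoted above (corpus is illustrative).
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "shares rallied after the earnings report",
    "the team won the championship final",
    "new smartphone chip promises faster speeds",
    "easy weekend recipes for the whole family",
]

def top_ngram_features(corpus, n, k):
    """Keep only the k most frequent n-grams of a single order n."""
    vec = CountVectorizer(ngram_range=(n, n), max_features=k)
    counts = vec.fit_transform(corpus)
    return vec, counts

parts = [top_ngram_features(docs, n, k)
         for n, k in [(1, 100), (2, 50), (3, 30), (4, 15)]]
X = hstack([counts for _, counts in parts])   # combined term-frequency feature matrix
print(X.shape)
```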
... Li et al. [53] used naïve Bayesian to classify Web pages into ten different subjects based on term-frequency feature vectors extracted from the title and main text. ...
... Among Web page image recognition methods, for purposes other than pornography detection, Liparas et al. [56] applied five MPEG-7 visual descriptors, with a total of 320 features, in a random forest to classify news article Web images into four categories. The MPEG-7 standard specifies a set of descriptors, each capturing a different aspect of human perception, i.e. color, texture, and shape. ...
Article
Full-text available
The explosive growth of the amount of information on the Internet has made Web page classification essential for Web information management, retrieval, and integration, Web page indexing, topic-specific Web crawling, topic-specific information extraction models, advertisement removal, filtering out unwanted, futile, or harmful contents, and parental control systems. Owing to the recent staggering growth of performance and memory space in computing machines, along with specialization of machine learning models for text and image classification, many researchers have begun to target the Web page classification problem. Yet, automatic Web page classification remains at its early stages because of its complexity, diversity of Web pages’ contents (images of different sizes, text, hyperlinks, etc.), and its computational cost. This paper not only surveys the proposed methodologies in the literature, but also traces their evolution and portrays different perspectives toward this problem. Our study investigates the following: (a) metadata and contextual information surrounding the terms are mostly ignored in textual content classification, (b) the structure and distribution of text in HTML tags and hyperlinks are understudied in textual content classification, (c) measuring the effectiveness of features in distinguishing among Web page classes or measuring the contribution of each feature in the classification accuracy is a prominent research gap, (d) image classification methods rely heavily on computationally intensive and problem-specific analyses for feature extraction, (e) semi-supervised learning is understudied, despite its importance in Web page classification because of the massive amount of unlabeled Web pages and the high cost of labeling, (f) deep learning, convolutional and recurrent networks, and reinforcement learning remain underexplored but intriguing for Web page classification, and last but not least (g) developing a detailed testbed along with evaluation metrics and establishing standard benchmarks remain a gap in assessing Web page classifiers.
... We have also compared with the published dataset and classifiers [29], [42]. The performance is also compared with other promising techniques in the literature [14], [28], [36], [23] for text classification. These techniques include RCNN-LSTM, the state-of-the-art Deep Neural Learning (DNL), Naive Bayes (NB), the Bag-of-words (BoW) model, Decision Tree (DT), Random Forest (RF), DPLSA, LLDA, and SemiLDA. ...
... Furthermore, we adopted the best algorithms suggested by Hindle et al. [14] to classify large change commits into five categories, and Soliman et al. [35] to classify architectural discussions. We also explore Naive Bayes (NB), Decision Trees (DT), and Random Forest (RF) [23], [14], [35] for our dataset with the WEKA [13] tool utilizing word-to-vector features [27]. Among them, the most promising classifiers, such as NB and DT, have less than 55% F1. ...
Conference Paper
Full-text available
Causes of software architectural change are classified as perfective, preventive, corrective, and adaptive. Change classification is used to promote common approaches for addressing similar changes, produce appropriate design documentation for a release, construct a developer’s profile, form a balanced team, support code review, etc. However, automated architectural change classification techniques are in their infancy, perhaps due to the lack of a benchmark dataset and the need for extensive human involvement. To address these shortcomings, we present a benchmark dataset and a text classifier for determining the architectural change rationale from commit descriptions. First, we explored source code properties for change classification independent of project activity descriptions and found poor outcomes. Next, through extensive analysis, we identified the challenges of classifying architectural change from text and proposed a new classifier that uses concept tokens derived from the concept analysis of change samples. We also studied the sensitivity of change classification of various types of tokens present in commit messages. The experimental outcomes employing 10-fold and cross-project validation techniques with five popular open-source systems show that the F1 score of our proposed classifier is around 70%. The precision and recall are mostly consistent among all categories of change and more promising than competing methods for text classification.
... They have been used in various fields for multi-modal learning. Some examples include classification of Alzheimer's disease [29], automatic job-candidate screening based on video CVs [30], and news article classification [31]. The features used depend on the problem at hand: [29] uses, amongst others, MRI volumes and voxel-based FDG-PET signal intensities. ...
... The features used depend on the problem at hand: [29] uses, amongst others, MRI volumes and voxel-based FDG-PET signal intensities. On the other hand, the job-candidate screening task used videos [30], while the news article classification used n-gram textual features and a representative image [31]. This holds even in the case of missing/incomplete data. ...
Preprint
Full-text available
For robots to operate in a three dimensional world and interact with humans, learning spatial relationships among objects in the surroundings is necessary. Reasoning about the state of the world requires inputs from many different sensory modalities including vision ($V$) and haptics ($H$). We examine the problem of desk organization: learning how humans spatially position different objects on a planar surface according to organizational ''preference''. We model this problem by examining how humans position objects given multiple features received from vision and haptic modalities. However, organizational habits vary greatly between people both in structure and adherence. To deal with user organizational preferences, we add an additional modality, ''utility'' ($U$), which informs on a particular human's perceived usefulness of a given object. Models were trained as generalized (over many different people) or tailored (per person). We use two types of models: random forests, which focus on precise multi-task classification, and Markov logic networks, which provide an easily interpretable insight into organizational habits. The models were applied to both synthetic data, which proved to be learnable when using fixed organizational constraints, and human-study data, on which the random forest achieved over 90% accuracy. Over all combinations of $\{H, U, V\}$ modalities, $UV$ and $HUV$ were the most informative for organization. In a follow-up study, we gauged participants' preference of desk organizations by a generalized random forest organization vs. by a random model. On average, participants rated the random forest models as 4.15 on a 5-point Likert scale compared to 1.84 for the random model.
... Combination of textual and visual modalities has proved to be beneficial in a multitude of NLP tasks. [17] achieves consistent performance increases on classification of web articles by using N-gram textual features and visual descriptors. [4] presents early and late fusion techniques for combining visual features obtained from pixel intensity distributions and textual features obtained with bag-of-words model for page classification. ...
Preprint
Page-level analysis of documents has been a topic of interest in digitization efforts, and multimodal approaches have been applied to both classification and page stream segmentation. In this work, we focus on capturing finer semantic relations between pages of a multi-page document. To this end, we formalize the task as semantic parsing of interpage relations and we propose an end-to-end approach for interpage dependency extraction, inspired by the dependency parsing literature. We further design a multi-task training approach to jointly optimize for page embeddings to be used in segmentation, classification, and parsing of the page dependencies using textual and visual features extracted from the pages. Moreover, we also combine the features from two modalities to obtain multimodal page embeddings. To the best of our knowledge, this is the first study to extract rich semantic interpage relations from multi-page documents. Our experimental results show that the proposed method increased LAS by 41 percentage points for semantic parsing, increased accuracy by 33 percentage points for page stream segmentation, and 45 percentage points for page classification over a naive baseline.
... Random forest [13,14,25] is an ensemble method for learning multiple decision trees. Random forests are used for various problems such as detection, classification, and regression. ...
Article
Full-text available
With the proliferation of mobile devices, the numbers of social media users and online news articles are rapidly increasing, and text information online is accumulating as big data. As spatio-temporal information becomes more important, research on extracting spatio-temporal information from online text data and utilizing it for event analysis is being actively conducted. However, if spatio-temporal information that does not describe the core subject of a document is extracted, it is rather difficult to guarantee the accuracy of core event analysis. Therefore, it is important to extract spatio-temporal information that describes the core topic of a document. In this study, spatio-temporal information describing the core topic of a document is defined as ‘representative spatio-temporal information’, and documents containing representative spatio-temporal information are defined as ‘representative spatio-temporal documents’. We propose a character-level Convolutional Neural Network (CNN)-based document classifier to classify representative spatio-temporal documents. To train the proposed CNN model, 7400 training documents were constructed for representative spatio-temporal documents. The experimental results show that the proposed CNN model outperforms traditional machine learning classifiers and existing CNN-based classifiers.
... (4), where p_i represents the probability of the i-th class, n is the number of classes, and A_i represents the i-th feature. The final step is the prediction of the sample by taking the majority prediction from all trees, also known as majority voting [7], [19], [20]. ...
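As a small illustration of the majority-voting step described in this excerpt, the sketch below collects one vote per tree and takes the most frequent class. Note that scikit-learn's RandomForestClassifier averages class probabilities rather than counting hard votes, so the sketch extracts the individual trees explicitly.

```python
# Sketch of hard majority voting over the individual trees of a forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each tree casts one vote (a class index) per sample; the majority class wins.
votes = np.stack([tree.predict(X).astype(int) for tree in forest.estimators_])
majority_idx = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
y_vote = forest.classes_[majority_idx]

print("agreement with forest.predict():", np.mean(y_vote == forest.predict(X)))
```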
... These feature vectors are applied to train the classifier to classify web pages. For example, Liparas et al. [16] classified the web pages of news articles into four categories using the frequencies of unigrams, bigrams, trigrams, and four-grams as feature vectors with a random forest classifier. ...
Article
Full-text available
Internet technologies are emerging very fast nowadays, due to which web pages are generated exponentially. Web page categorization is required for searching and exploring relevant web pages based on users' queries and is a tedious task. The majority of web page categorization techniques ignore semantic features and the contextual knowledge of the web page. This paper proposes a web page categorization method that categorizes web pages based on semantic features and contextual knowledge. Initially, the GloVe model is applied to capture the semantic features of the web pages. Thereafter, a stacked Bidirectional Long Short-Term Memory (BiLSTM) network with symmetric structure is applied to extract the contextual and latent symmetry information from the semantic features for web page categorization. The performance of the proposed model has been evaluated on the publicly available WebKB dataset. The proposed model shows superiority over the existing state-of-the-art machine learning and deep learning methods.
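A compact sketch of the general pipeline described in this abstract (pretrained word vectors feeding a stacked bidirectional LSTM classifier), written with Keras. The embedding matrix is a random stand-in for real GloVe vectors, and the layer sizes and sequence length are placeholders rather than the paper's configuration.

```python
# Sketch: stacked bidirectional LSTM over pretrained word embeddings for page categorization.
# The "GloVe" matrix here is random; in practice it would be loaded from real GloVe vectors.
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len, n_classes = 10_000, 100, 200, 4
glove_matrix = np.random.rand(vocab_size, embed_dim)  # stand-in for pretrained GloVe weights

embedding = layers.Embedding(vocab_size, embed_dim, trainable=False)
model = models.Sequential([
    layers.Input(shape=(max_len,)),
    embedding,
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),  # first BiLSTM layer
    layers.Bidirectional(layers.LSTM(32)),                         # second, stacked BiLSTM layer
    layers.Dense(n_classes, activation="softmax"),
])
embedding.set_weights([glove_matrix])  # load the pretrained (here: random stand-in) vectors
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```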
... It used the sliding window method for fast scanning of images. Liparas et al. [16] developed a hybrid system which combined visual and textual features for web page categorization. An N-gram model was applied to extract textual features from the text. ...
Article
Full-text available
Due to the explosive growth of multimedia content on the World Wide Web (WWW), searching, retrieving, and recommending information becomes a challenging task. Visual information on web pages is advantageous for web mining tasks and can be used as features to categorize the web pages. In this paper, a novel framework is proposed to categorize web pages based on multimedia features, specifically images, using machine learning techniques. The deep convolutional neural network VGG-19 is utilized to determine the feature vectors of images. Transfer learning is implemented to reduce the computational cost of the proposed framework. The effectiveness of the proposed framework is demonstrated by comparing its performance with two handcrafted image descriptor methods: Fisher Vector (FV) and Vector of Locally Aggregated Descriptors (VLAD). The proposed framework has achieved a classification accuracy of 86%.
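A brief sketch of the kind of transfer learning described above: Keras' pretrained VGG-19 is used as a frozen feature extractor and a lightweight classifier is trained on the extracted vectors. The image file names and the choice of downstream classifier are hypothetical placeholders.

```python
# Sketch: pretrained VGG-19 as a fixed feature extractor for web page images,
# with a lightweight classifier on top (freezing the conv weights is what keeps
# the training cost low). File names and labels below are hypothetical.
import numpy as np
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.linear_model import LogisticRegression

extractor = VGG19(weights="imagenet", include_top=False, pooling="avg")  # 512-d output

def image_features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x, verbose=0)[0]

# Hypothetical labelled page images; replace with a real dataset.
paths = ["business.jpg", "sports.jpg", "science.jpg", "lifestyle.jpg"]
labels = [0, 1, 2, 3]
X = np.stack([image_features(p) for p in paths])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```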
... The OOB error estimate is the averaged prediction error for each training case, using only the trees that do not include that case in their bootstrap sample. For more details on the RF algorithm and its underlying notions, we refer for example to [26] and [32]. ...
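The out-of-bag estimate described in this excerpt is available directly in scikit-learn; a minimal sketch on synthetic data:

```python
# Sketch: the out-of-bag (OOB) error is estimated from predictions made by the
# trees that did NOT see a given training case in their bootstrap sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=300, oob_score=True, bootstrap=True,
                            random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)   # 1 - OOB error, no separate validation set needed
```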
Article
Full-text available
Sleep disorders are medical disorders of a subject's sleep architecture and, based on their severity, they can interfere with mental, emotional and physical functioning. The most common ones are insomnia, narcolepsy, sleep apnea, bruxism, etc. There is an increased risk of developing sleep disorders in the elderly, such as insomnia, periodic leg movements, rapid eye movement (REM) behaviour disorders, sleep-disordered breathing, etc. Consequently, their accurate diagnosis and classification are important steps towards an early stage treatment that could save the life of a patient. The Electroencephalographic (EEG) signal is the most sensitive and important biosignal, as it is able to capture the brain activity related to sleep. In this study, we attempt to analyse EEG sleep activity via complementary cross-frequency coupling (CFC) estimates, which further feed a classifier, aiming to discriminate sleep disorders. We adopted an open EEG database with recordings that were grouped into seven sleep disorders and a healthy control. The EEG brain activity from common sensors has been analysed with two basic types of cross-frequency coupling (CFC). Finally, a Random Forest (RF) classification model was built on CFC patterns, which were extracted from non-cyclic alternating pattern (CAP) epochs. Our RF CFC model achieved a 74% multiclass accuracy. Both types of CFC, phase-to-amplitude coupling (PAC) and amplitude-amplitude coupling (AAC), contribute to the accuracy of the RF model, thus supporting their complementary information. CFC patterns, in conjunction with the RF classifier, proved a valuable biomarker for the classification of sleep disorders.
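For readers unfamiliar with cross-frequency coupling, the sketch below computes a generic mean-vector-length estimate of phase-amplitude coupling for one synthetic EEG channel; it illustrates the kind of CFC feature that can feed a Random Forest, not the exact estimator used in the cited study.

```python
# Sketch: a mean-vector-length estimate of phase-amplitude coupling (PAC) between
# a low-frequency phase band and a higher-frequency amplitude band of one channel.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def bandpass(x, lo, hi, fs, order=4):
    b, a = butter(order, [lo, hi], btype="band", fs=fs)
    return filtfilt(b, a, x)

def pac_mvl(x, fs, phase_band=(4, 8), amp_band=(30, 45)):
    phase = np.angle(hilbert(bandpass(x, *phase_band, fs)))
    amp = np.abs(hilbert(bandpass(x, *amp_band, fs)))
    return np.abs(np.mean(amp * np.exp(1j * phase)))   # mean vector length

fs = 256
t = np.arange(0, 30, 1 / fs)
eeg = np.sin(2 * np.pi * 6 * t) + 0.5 * np.random.randn(t.size)   # toy channel
print("PAC (theta phase, low-gamma amplitude):", pac_mvl(eeg, fs))
# Per-channel PAC/AAC values like this one can be stacked into the feature
# vector that feeds the Random Forest classifier.
```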
Article
In the era of globalization, student placement is a very challenging issue for all educational institutions. For engineering institutions, placement is a key factor in maintaining a good ranking in the university as well as with other national and international ranking agencies. In this paper, we propose a few supervised machine learning classifiers which may be used to predict the placement of a student in the IT industry based on their academic performance in class Tenth, class Twelfth, Graduation, and backlogs to date in Graduation. We also compare the results of the different proposed classifiers. The parameters used to compare and analyze the results of the developed classifiers are accuracy score, percentage accuracy score, confusion matrix, heatmap, and classification report. The classification report generated by the developed classifiers consists of precision, recall, F1-score, and support. The classification algorithms Support Vector Machine, Gaussian Naive Bayes, K-Nearest Neighbor, Random Forest, Decision Tree, Stochastic Gradient Descent, Logistic Regression, and Neural Network are used to develop the classifiers. All the developed classifiers are also tested on new data that are excluded from the dataset used in the experiment.
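A condensed sketch of this kind of classifier comparison with scikit-learn, reporting the metrics mentioned above; the dataset here is synthetic and merely stands in for the academic-performance features.

```python
# Sketch: compare several classifiers on the same (synthetic) placement dataset
# and print accuracy, the confusion matrix, and the classification report.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for Tenth/Twelfth/Graduation marks and backlog counts.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM": SVC(), "GaussianNB": GaussianNB(), "kNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(), "DecisionTree": DecisionTreeClassifier(),
    "SGD": SGDClassifier(), "LogReg": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=1000),
}
for name, clf in models.items():
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, "accuracy:", accuracy_score(y_te, y_pred))
    print(confusion_matrix(y_te, y_pred))
    print(classification_report(y_te, y_pred))
```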
Article
Full-text available
In this paper, we devise an approach for identifying and classifying contents of interest related to geographic communities from news article streams. We first conduct a short study of related works, and then present our approach, which consists of 1) filtering out content irrelevant to communities and 2) classifying the remaining relevant news articles. Using a confidence threshold, the filtering and classification tasks can be performed in one pass using the weights learned by the same algorithm. We use Bayesian text classification, and because of the substantial empirical class imbalance in Web-crawled corpora, we test several approaches: Naïve Bayes, Complementary Naïve Bayes, use of {1,2,3}-grams, and use of oversampling. We find in our test experiment on Japanese prefectures that 3-gram CNB with oversampling is the most effective approach in terms of precision, while retaining acceptable training and testing time.
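A compact sketch of the best-reported setup (n-gram features with Complement Naive Bayes plus naive oversampling of the minority class); the toy corpus, the resampling details, and the confidence-threshold use are illustrative assumptions.

```python
# Sketch: {1,2,3}-gram features + Complement Naive Bayes, with simple random
# oversampling of the minority class (the combination reported best above).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.utils import resample

docs = ["local festival draws record crowds", "prefecture opens new rail line",
        "mayor announces budget plan", "typhoon warning issued for the coast"]
labels = np.array([0, 0, 0, 1])            # toy, imbalanced labels

vec = CountVectorizer(ngram_range=(1, 3))  # uni-, bi-, and tri-grams
X = vec.fit_transform(docs)

# Naive oversampling: resample the minority class up to the majority size.
minority = np.where(labels == 1)[0]
extra = resample(minority, replace=True, n_samples=2, random_state=0)
idx = np.concatenate([np.arange(len(labels)), extra])

clf = ComplementNB().fit(X[idx], labels[idx])
probs = clf.predict_proba(vec.transform(["prefecture council meets today"]))
# A confidence threshold on these probabilities can filter out irrelevant articles.
print(probs)
```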
Conference Paper
Full-text available
In this paper, we investigate a specific area of document classification in which the documents come as a flow over time. Moreover, the exact number of classes of documents to deal with is not known from the beginning and can evolve over time. To be able to perform the classification task in such a setting, we need specific classifiers that are able to perform incremental learning and change their modelling over time. More specifically, we focus our study on SVM approaches, known to perform well, and for which incremental (i-SVM) procedures exist. Nevertheless, most of them are only able to deal with a fixed number of classes. So we designed a new incremental learning procedure based on one-class SVMs. It is able to improve its classification accuracy over time, with the arrival of new labeled data, without performing any complete retraining. Moreover, when instances arrive with a previously unknown label (the appearance of a new class), the training procedure is able to modify the classifier model to recognize this new kind of document. To investigate this area, while waiting to collect document images as a flow, we first experimented on the Optical Recognition of Handwritten Digits Data Set. These experiments show that our incremental approach is able to perform, at each point in time, as well as a static one-class classifier fully retrained using all previously seen data, and to model new incoming classes very quickly and efficiently.
Article
Full-text available
Unlabeled documents vastly outnumber labeled documents in text classification. For this reason, semi-supervised learning is well suited to the task. Representing text as a combination of unigrams and bigrams has not shown consistent improvements compared to using unigrams in supervised text classification. Therefore, a natural question is whether this finding extends to semi-supervised learning, which provides a different way of combining multiple representations of data. In this paper, we investigate this question experimentally by running two semi-supervised algorithms, Co-Training and Self-Training, on several text datasets. Our results do not indicate improvements by combining unigrams and bigrams in semi-supervised text classification. In addition, they suggest that this fact may stem from a strong "correlation" between unigrams and bigrams.
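A short sketch of a Self-Training setup over a combined unigram+bigram representation, using scikit-learn's SelfTrainingClassifier (unlabeled rows are marked with -1); the corpus and threshold are illustrative.

```python
# Sketch: semi-supervised Self-Training on a unigram+bigram representation.
# Unlabeled documents are marked with label -1, as scikit-learn expects.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.naive_bayes import MultinomialNB

docs = ["stocks fell sharply", "the match ended in a draw",
        "quarterly profits rose", "the striker scored twice",
        "markets rally on rate news", "coach praises the defence"]
labels = np.array([0, 1, 0, 1, -1, -1])   # last two documents are unlabeled

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)  # unigrams + bigrams
model = SelfTrainingClassifier(MultinomialNB(), threshold=0.6).fit(X, labels)
print(model.predict(X[-2:]))
```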
Conference Paper
Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness. However, its performance is often degraded because it does not model text well, and by inappropriate feature selection and the lack of reliable confidence scores. We address these problems and show that they can be solved by some simple corrections. We demonstrate that our simple modifications are able to improve the performance of Naive Bayes for text classification significantly.
Conference Paper
This paper proposes an improved random forest algorithm for image classification. This algorithm is particularly designed for analyzing very high dimensional data with multiple classes, of which image data is a well-known representative. A novel feature weighting method and a tree selection method are developed and synergistically serve to make the random forest framework well suited to classifying image data with a large number of object categories. With the new feature weighting method for subspace sampling and the tree selection method, we can effectively reduce subspace size and improve classification performance without increasing the error bound. Experimental results on image datasets with diverse characteristics have demonstrated that the proposed method can generate a random forest model with higher performance than the random forests generated by Breiman's method.
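The sketch below illustrates the general idea of weighted feature-subspace sampling for a forest: features with higher relevance scores are more likely to enter a tree's subspace. It is a generic illustration using chi-square scores as weights, not the exact weighting or tree-selection procedure proposed in the cited paper.

```python
# Sketch of weighted feature-subspace sampling: trees draw their feature subsets
# with probabilities proportional to a relevance score (here, chi-square).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=200, n_informative=20, random_state=0)
Xs = MinMaxScaler().fit_transform(X)            # chi2 needs non-negative inputs
weights, _ = chi2(Xs, y)
p = weights / weights.sum()                     # sampling probability per feature

rng = np.random.default_rng(0)
subspace_size, n_trees = 20, 50
trees, subspaces = [], []
for _ in range(n_trees):
    feats = rng.choice(X.shape[1], size=subspace_size, replace=False, p=p)
    boot = rng.integers(0, len(y), size=len(y))              # bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X[boot][:, feats], y[boot]))
    subspaces.append(feats)

# Majority vote over the weighted-subspace trees.
votes = np.stack([t.predict(X[:, f]) for t, f in zip(trees, subspaces)]).astype(int)
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy:", (pred == y).mean())
```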
Conference Paper
Neural Networks such as RBFN and BPNN have been widely studied in the area of network intrusion detection, with the purpose of detecting a variety of network anomalies (e.g., worms, malware). In real-world applications, however, the performance of these neural networks is dynamic regarding the use of different datasets. One of the reasons is that there are redundant features in the dataset. To mitigate this issue, in this paper, we propose an approach of combining Neural Networks with Random Forest to improve the accuracy of detecting network intrusions. In particular, we design an intelligent anomaly detection system that uses the Random Forest algorithm in the process of feature selection and selects an appropriate algorithm in an adaptive way. In the evaluation, we conducted two major experiments using the KDD1999 dataset and a real dataset, respectively. The experimental results indicate that Random Forest can enhance the performance of Neural Networks by identifying important and closely related features and that our developed system can select a better algorithm intelligently.
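A brief sketch of the combination described here: a Random Forest ranks features by importance and only the top-ranked ones feed a neural network classifier. The synthetic data and the top-15 cut-off are placeholders, not the cited system's configuration.

```python
# Sketch: Random Forest importance-based feature selection feeding an MLP.
# Synthetic data stands in for KDD-style traffic records; top-15 is illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=41, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(
    SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0),
                    threshold=-np.inf, max_features=15),  # keep the 15 most important features
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
model.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```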
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
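The strength/correlation relationship summarized in this abstract is captured by Breiman's well-known upper bound on the forest's generalization error, restated here with s denoting the strength of the individual trees and ρ̄ the mean correlation between them:

```latex
% Breiman (2001): upper bound on the forest's generalization error PE*
% in terms of the mean correlation \bar{\rho} between trees and the strength s
% of the individual trees.
\[
  \mathrm{PE}^{*} \;\le\; \frac{\bar{\rho}\,\bigl(1 - s^{2}\bigr)}{s^{2}}
\]
```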