Conference Paper

News Articles Classification Using Random Forests and Weighted Multimodal Features

Abstract and Figures

This research investigates the problem of news article classification. The classification uses N-gram textual features extracted from the article text and visual features generated from one representative image. The application domain is news articles written in English that belong to four categories (Business-Finance, Lifestyle-Leisure, Science-Technology, and Sports), downloaded from three well-known news websites (BBC, Reuters, and The Guardian). Various classification experiments were performed with the Random Forests machine learning method. Using the N-gram textual features alone led to much better accuracy (84.4%) than using the visual features alone (53%), while combining both kinds of features improved accuracy slightly further (86.2%). The main contribution of this work is a news article classification framework based on Random Forests and multimodal features (textual and visual), together with a late fusion strategy that makes use of the operational capabilities of Random Forests.
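The abstract does not spell out how the late fusion weights are obtained, so the following is only a minimal sketch of weighted late fusion in general. It assumes each modality-specific model outputs one probability per class and that a per-class weight is available for each modality; the probabilities and weights below are hypothetical values chosen for illustration, not results from the paper.

```python
from typing import List

CLASSES = ["Business-Finance", "Lifestyle-Leisure", "Science-Technology", "Sports"]


def late_fusion(p_text: List[float], p_visual: List[float],
                w_text: List[float], w_visual: List[float]) -> List[float]:
    """Fuse per-class probabilities from a textual and a visual model.

    For every class the two weights are normalised to sum to 1, then the
    weighted average of the two probabilities is taken.
    """
    fused = []
    for pt, pv, wt, wv in zip(p_text, p_visual, w_text, w_visual):
        fused.append((wt * pt + wv * pv) / (wt + wv))
    return fused


def predict(fused: List[float]) -> str:
    """Return the class name with the highest fused score."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return CLASSES[best]


# Hypothetical model outputs for one article (not taken from the paper).
p_text = [0.40, 0.10, 0.35, 0.15]
p_visual = [0.20, 0.20, 0.50, 0.10]
# Hypothetical per-class weights favouring the stronger textual modality.
w_text = [0.8, 0.8, 0.8, 0.8]
w_visual = [0.2, 0.2, 0.2, 0.2]

fused = late_fusion(p_text, p_visual, w_text, w_visual)
predicted = predict(fused)
```

With these made-up numbers the textual model alone would pick Business-Finance, while the fused scores favour Science-Technology, which shows how a weaker modality can still tip a close decision.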
... Based on the above studies, the problem identified as follows: "Web pages contain various information, and categorizing them is a complex problem". Some web page categorization methods exploit text features with keywords only without considering the context of web pages [7,8]. Deep-learning-based web page categorization is a hot and new research area for efficiently categorizing web pages with improved performance [9][10][11]. ...
... Supervised-learning algorithms mainly focused on feature-selection and feature-extraction methods to categorize web pages. Earlier researchers used BOW and TF-IDF techniques to categorize web pages with a traditional machine-learning algorithm [7,8]. In this section, we discuss various research work, providing state-of-the-art proposed work. ...
... Liparas et al. [8] proposed a web page categorization model based on a random forest classifier applied to categorize news articles into four categories. These categories are Business-Finance, Lifestyle-Leisure, Science-Technology, and Sports. ...
Article
The World Wide Web has revolutionized the way we live, causing the number of web pages to increase exponentially. The web provides access to a tremendous amount of information, so it is difficult for internet users to locate accurate and useful information on the web. In order to categorize pages accurately based on the queries of users, methods of categorizing web pages need to be developed. The text content of web pages plays a significant role in the categorization of web pages. If a word’s position is altered within a sentence, causing a change in the interpretation of that sentence, this phenomenon is called polysemy. In web page categorization, the polysemy property causes ambiguity and is referred to as the polysemy problem. This paper proposes a fine-tuned model to solve the polysemy problem, using contextual embeddings created by the symmetry multi-head encoder layer of the Bidirectional Encoder Representations from Transformers (BERT). The effectiveness of the proposed model was evaluated by using the benchmark datasets for web page categorization, i.e., WebKB and DMOZ. Furthermore, the experiment series also fine-tuned the proposed model’s hyperparameters to achieve 96.00% and 84.00% F1-Scores, respectively, demonstrating the proposed model’s importance compared to baseline approaches based on machine learning and deep learning.
... The advantage of random forest is that it can work very well on large amounts of data. In addition, random forests can estimate features that are important in the classification process and provide experimental methods to detect correlations between features [7]. ...
... Prediction results from the random forest are obtained by taking the most frequent result across the decision trees (voting for classification), as shown in Figure 1. For a random forest consisting of N trees, Equation (1) is used to predict the class label of a case through voting [7]. ...
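Equation (1) is not reproduced in this excerpt, so the sketch below only illustrates the usual majority-vote rule, in which the predicted class is the one chosen by the largest number of trees. The three toy "trees" are hypothetical hand-written rules standing in for trained decision trees.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Tree = Callable[[Dict[str, int]], str]


def forest_predict(trees: Sequence[Tree], x: Dict[str, int]) -> str:
    """Return the class label that receives the most votes from the trees.

    Ties are resolved in favour of the label that was voted for first.
    """
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each looks at one word count.
toy_trees = [
    lambda x: "Sports" if x["goal"] > 0 else "Business-Finance",
    lambda x: "Sports" if x["match"] > 0 else "Science-Technology",
    lambda x: "Business-Finance" if x["market"] > 0 else "Sports",
]

article = {"goal": 2, "match": 1, "market": 1}
label = forest_predict(toy_trees, article)
```

Here two of the three toy trees vote for Sports, so that label wins the vote.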
Article
News articles on the internet grow rapidly and in large volumes, so they need to be grouped into categories for easy access. Classification is one method for grouping news articles, and random forest, which is built on decision trees, is one such classification method. This research applies random forest to classify news articles into six categories: business, entertainment, health, politics, sport, and news. The data are Cable News Network (CNN) articles from 2011 to 2022. The data are textual and large in volume, so careful handling is needed to avoid overfitting and underfitting. Random forest suits these data because the algorithm works very well on large amounts of data. However, random forest is difficult to interpret if the combination of parameters is not appropriate for the data. Therefore, hyperparameter optimization is needed to find the best combination of parameters. This research uses the search cross-validation (SearchCV) method to optimize the random forest hyperparameters by testing the combinations one by one and validating each. The resulting classifier assigns news articles to the six categories with an accuracy of 0.81 on training and 0.76 on testing.
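The search procedure described in this abstract (try each parameter combination, validate it, keep the best) can be sketched without any library as an exhaustive grid search with k-fold cross-validation. This is an illustration under stated assumptions, not the authors' setup: a tiny 1-D nearest-neighbour classifier stands in for the random forest, and the data, the `n_neighbors` grid, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(X, y, fit, grid, k=3):
    """Score every parameter combination by mean k-fold accuracy."""
    names = sorted(grid)
    folds = k_fold_indices(len(X), k)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            hits = sum(model(X[i]) == y[i] for i in held_out)
            scores.append(hits / len(held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:  # keep the first best combination
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit_knn(X_train, y_train, n_neighbors=1):
    """Toy stand-in model: majority label of the nearest training points."""
    def predict(x):
        nearest = sorted(zip(X_train, y_train), key=lambda p: abs(p[0] - x))
        labels = [label for _, label in nearest[:n_neighbors]]
        return Counter(labels).most_common(1)[0][0]
    return predict


# Hypothetical, well-separated 1-D data with the two classes interleaved.
X = [0, 10, 1, 11, 2, 12, 3, 13, 4, 14, 5, 15]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

best_params, best_score = grid_search_cv(X, y, fit_knn, {"n_neighbors": [1, 3, 5]})
```

Scoring each combination on held-out folds rather than on the training data is what lets the search notice overfitting, which is the concern the abstract raises.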
... In total, 26 distinct combinations were identified using data of six different modalities. [188], [262], [265], [273], [274], [277], [294], [300], [376], [379], [393], Video & Audio & Sensor [252], [296], [409], Video & Audio [153], [171], [229], [253], [255], [266], [275], [281], [287], [292], [295], [298], [315], Video & Text [199], Video & Sensor [271], Video & Signal [250], [283], Image & Audio & Text [111], [175], [204], [213], [216], [263], [264], [288], [340], [396], [406], [249], [334], [335], Image & Audio [45], [177], [182], [184], [186], [189], [192], [208], [233], [234], [237]- [239], [244], [285], [338], [412], Image & Text [47], [53] [124], [156], [161], [164], [170], [179], [181], [185], [193], [197], [201], [211], [219], [223], [251], [286], [290], [293], [306], [309], [314], [336], [342], [355]- [358], [360]- [364], [370], [371], [374], [375], [385], [408], [413], [415], [ [125], [127], [187], [206], [241], [326], [365], [395], [411], Image & Numerical [62], [75], [119], [126], [167], [313], [331], [353], [405], [410], Audio & Text & Sensor [384], Audio & Text [180], [282], [377], [391], [392], Text & Signal [109], Text & Numerical [304], [349], Sensor & Signal [240], [242], [258], [389], Sensor & Numerical [183], Signal & Numerical [205], [257], [260], [318]. Figure 10 displays the extracted information related to each modality and data type with the links between them. ...
... A total of 212 articles related to fusion learning were encountered. Of 155 articles, 99 were model-agnostic, where 62 pertained to early [55], [56], [58], [59], [62], [ [119], [120], [133], [141], [142], [166], [173], [207], [213], [240], [242], [250], [252], [254], [258], [259], [270], [271], [280], [282], [299], [303], [305]-[307], [313], [320], [324], [326], [330], [334], [337], [347], [349], [357], [359], [364], [367], [381], [382], [384], [391], [393], [397], [405], [406], 23 pertained to late [114], [127], [136], [153], [161], [167], [174], [180], [181], [218], [241], [256], [264], [276], [279], [308], [322], [360], [365], [366], [387], [389], [392] and 14 pertained to hybrid [182], [189], [253], [296], [310]-[312], [315], [316], [318], [319], [323], [325], [380]. In all, 56 model-based studies were discovered, with 46 relating to ...
Article
Multimodal machine learning (MML) is an appealing multidisciplinary research area in which heterogeneous data from multiple modalities and machine learning (ML) are combined to solve critical problems. Research works usually use data from a single modality, such as images, audio, text, or signals. However, real-world problems have become more critical, and handling them with multiple modalities of data instead of a single modality can significantly help in finding solutions. ML algorithms play an essential role in developing MML models through parameter tuning. This paper reviews recent advancements in the challenges of MML, namely representation, translation, alignment, fusion, and co-learning, and presents the gaps and challenges. A systematic literature review (SLR) was applied to define the progress and trends on those challenges in the MML domain. In total, 1032 articles were examined in this review to extract features such as source, domain, application, and modality. This research article will help researchers understand the current state of MML and navigate the selection of future research directions.
... According to Jonathan (2021), Random Forest has an internal mechanism that provides an estimate of its own generalization error, commonly called the out-of-bag (OOB) error estimate. The formulation for an RF consisting of N trees is stated as follows (Liparas, 2014): ...
Article
Many studies have applied data mining to student academic performance in order to find the best-performing classification algorithm, but few have examined how the attributes of high-dimensional data relate to the data labels used in modeling. This study compares the accuracy improvement of the classification algorithms Naive Bayes, C4.5, Random Forest, and Logistic Regression when optimized with several feature selection algorithms, namely Chi-Square, CFS, Information Gain, and ANOVA. The dataset comprises 2663 records and is split using 5-fold cross-validation, after which algorithm performance is evaluated with a confusion matrix. The results show that Chi-Square optimization gives the largest improvement in classification accuracy, with an average accuracy increase of 2.45%. Meanwhile, comparing the classification algorithms on student performance prediction data shows Random Forest to be the best classifier, with an accuracy of 94.5%, precision of 95%, recall of 94%, and F1-score of 94%.
... Prediction results from Random Forest are obtained through the highest results from each decision tree (voting for classification and average for regression). Random Forest has an internal mechanism that provides an estimate of its generalization error called the out-of-bag (OOB) error estimate [26], [27]. A visualization of how the RF algorithm works is presented in Fig 2. ...
Article
Tourism and urban areas experienced rapid development at the beginning of the 21st century. This condition is caused by natural, cultural, and artificial tourist destinations and adequate infrastructure support. Tourist destinations in urban areas add to urbanization because, apart from being centers of government, trade, and industry, these areas also attract tourists. Monitoring the development of urban tourism is carried out in the cities of Denpasar and Bali, as well-known destinations at the world level. The development of the urban area can be detected through multi-temporal and multispectral remote sensing imagery in combination with machine learning technology. This study aims to determine the spatial distribution of urban tourism development from 2013 to 2021. This study uses remote sensing and machine learning methods with the Random Forest (RF) algorithm on Google Earth Engine (GEE) cloud computing. The RF algorithm is a non-parametric classification algorithm that is widely applied in remote sensing data classification because of its insensitivity to excessive noise and training data and its good performance. The material used is Landsat 8 imagery, specifically from the Operational Land Imager (OLI) sensor. The results showed that integrating remote sensing, GEE cloud computing, and machine learning, especially the RF algorithm, effectively monitors urban tourism expansion. The overall accuracy of the RF model with simple training data is above 90%. We found that within nine years, 20.23 km² of vegetated land was converted into urban area. For this reason, special attention is needed from the government to make regulations on spatial planning and control over land conversion so that there will still be green open spaces in the future.
... For example, news is categorized in the infotainment category, while based on the content of the news or the words contained in it, the news should be categorized in the politics category. Journalists and news monitoring companies (media monitoring companies) often face problems identifying topics in a very large number of news articles around the world [1]. Errors in categorizing or classifying information/news can also occur because the method used is still manual by reading the entire article to find the main topic. ...
... The remaining one-third is classified using the constructed trees and is used to test their performance. The OOB error estimate is the average prediction error for each training case y, using only the trees that did not include y in their bootstrap sample (Dimitris, et al., 2014). As one of the comparison methods, the GBT algorithm is an accurate machine learning algorithm for predicting a target variable by combining the estimates of a set of simpler, weaker models so that a more accurate final prediction is formed (Chen & Guestrin, 2016). ...
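The out-of-bag estimate described in the excerpt above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the cited authors' code: a 1-D threshold stump stands in for a full decision tree, the data are hypothetical, and each case is scored by the majority vote of only those trees whose bootstrap sample did not contain it.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Toy base learner standing in for a decision tree.

    Assumes class 0 values lie below class 1 values and places the
    threshold midway between the two classes in the sample.
    """
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:  # the bootstrap sample held a single class
        only = ys[0]
        return lambda x: only
    threshold = (max(zeros) + min(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def oob_error(xs, ys, n_trees=25, seed=0):
    """Fraction of training cases misclassified by their out-of-bag trees."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag cases get a vote
                votes[i][tree(xs[i])] += 1
    # Cases that were in every bag have no OOB vote and are skipped.
    scored = [(v.most_common(1)[0][0], ys[i]) for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no case was out-of-bag for any tree")
    return sum(pred != true for pred, true in scored) / len(scored)


# Hypothetical, well-separated training data.
xs = list(range(10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
err = oob_error(xs, ys)
```

Because every case is judged only by trees that never saw it, the estimate behaves like a built-in validation score and needs no separate held-out set.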
Article
Solar power plants (PLTS) have become the most popular solution and are deployed in many countries. However, connecting PLTS to the electricity transmission grid creates problems for grid operators because the energy they produce fluctuates. The factors that potentially affect this fluctuation are meteorological and weather parameters. One mitigation step is to predict the power output of the PLTS. This study proposes a power production prediction method consisting of data preprocessing, application of a regression model, and definition of test scenarios. The historical PLTS data come from a SCADA system, cover one year, and consist of power output, radiation, ambient temperature, equipment temperature, and wind speed. The processed data are then modeled using the Random Forest Regression (RFR) algorithm. During modeling, the scenarios vary several settings, such as repair of missing records, data normalization, and production filtering. Evaluation is carried out by comparing the performance of each algorithm together with its combination of scenarios. The experimental results show that RFR performs well, with an R² of 0.9679 and an RMSE of 0.0438. Choosing the right scenario is shown to improve RFR accuracy by 2.90%.
Chapter
The growing availability of data on the Internet in recent years has produced an enormous amount of data and research in the field of Artificial Intelligence and Machine Learning. Computational power has also increased dramatically in the past few years, which has led to further advances in Artificial Intelligence research and its applications. Mizo, a low-resource language, has also begun to emerge alongside these advances. Using news articles collected from the two biggest Mizo-language news outlets, Vanglaini and The Aizawl Post, this paper presents an approach to classifying news by category. The paper tests several supervised machine learning classification methods and, for most of the models tested, reports higher accuracy than has been obtained for other low-resource languages. Multinomial Naive Bayes gives an accuracy of 96%, the highest among the models compared.
Article
In this paper, we devise an approach for identifying and classifying contents of interest related to geographic communities from news articles streams. We first conduct a short study on related works, and then present our approach, which consists in 1) filtering out contents irrelevant to communities and 2) classifying the remaining relevant news articles. Using a confidence threshold, the filtering and classification tasks can be performed in one pass using the weights learned by the same algorithm. We use Bayesian text classification, and because of important empiric class imbalance in Web-crawled corpora, we test several approaches: Naïve Bayes, Complementary Naïve Bayes, use of {1,2,3}-Grams, and use of oversampling. We find out in our testing experiment on Japanese prefectures that 3-gram CNB with oversampling is the most effective approach in terms of precision, while retaining acceptable training time and testing time.
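As a small illustration of the {1,2,3}-gram features mentioned in this abstract, the sketch below extracts word n-grams from a sentence. It assumes whitespace-delimited text, which would not hold for the Japanese corpus the authors study (that would need a tokenizer first); the function name and the example sentence are hypothetical.

```python
from typing import List, Sequence


def word_ngrams(text: str, n_values: Sequence[int] = (1, 2, 3)) -> List[str]:
    """Return all word n-grams of the requested sizes, in order of size."""
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = word_ngrams("Stocks rally on earnings")
```

For this four-word sentence the result holds four unigrams, three bigrams, and two trigrams; counts of such n-grams are what a Bayesian text classifier would then consume as features.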
Conference Paper
In this paper, we investigate a specific area of document classification in which documents arrive as a flow over time. Moreover, the exact number of document classes is not known from the beginning and may evolve over time. To perform classification in this setting, we need classifiers that can learn incrementally and change their model over time. More specifically, we focus on SVM approaches, which are known to perform well and for which incremental (i-SVM) procedures exist. Nevertheless, most of them can only deal with a fixed number of classes. We therefore designed a new incremental learning procedure based on one-class SVMs. It improves its classification accuracy over time as new labeled data arrive, without any complete retraining. Moreover, when instances arrive with a previously unknown label (the appearance of a new class), the training procedure modifies the classifier model to recognize this new kind of document. Pending the collection of document images as a flow, we ran first experiments on the Optical Recognition of Handwritten Digits Data Set. These experiments show that our incremental approach is able to perform, at each time step, as well as a static one-class classifier fully retrained on all previously seen data, and to model new incoming classes quickly and efficiently.
Article
Unlabeled documents vastly outnumber labeled documents in text classification. For this reason, semi-supervised learning is well suited to the task. Representing text as a combination of unigrams and bigrams has not shown consistent improvements compared to using unigrams in supervised text classification. Therefore, a natural question is whether this finding extends to semi-supervised learning, which provides a different way of combining multiple representations of data. In this paper, we investigate this question experimentally, running two semi-supervised algorithms, Co-Training and Self-Training, on several text datasets. Our results do not indicate improvements by combining unigrams and bigrams in semi-supervised text classification. In addition, they suggest that this fact may stem from a strong "correlation" between unigrams and bigrams.
Conference Paper
Naive Bayes is often used in text classification applications and experiments because of its simplicity and effectiveness. However, its performance is often degraded because it does not model text well, and because of inappropriate feature selection and the lack of reliable confidence scores. We address these problems and show that they can be solved by some simple corrections. We demonstrate that our simple modifications are able to improve the performance of Naive Bayes for text classification significantly.
Conference Paper
This paper proposes an improved random forest algorithm for image classification. This algorithm is particularly designed for analyzing very high dimensional data with multiple classes whose well-known representative data is image data. A novel feature weighting method and tree selection method are developed and synergistically served for making random forest framework well suited to classify image data with a large number of object categories. With the new feature weighting method for subspace sampling and tree selection method, we can effectively reduce subspace size and improve classification performance without increasing error bound. Experimental results on image datasets with diverse characteristics have demonstrated that the proposed method could generate a random forest model with higher performance than the random forests generated by Breiman's method.
Conference Paper
Neural Networks such as RBFN and BPNN have been widely studied in the area of network intrusion detection, with the purpose of detecting a variety of network anomalies (e.g., worms, malware). In real-world applications, however, the performance of these neural networks is dynamic regarding the use of different datasets. One of the reasons is that there are some redundant features for the dataset. To mitigate this issue, in this paper, we propose an approach of combining Neural Networks with Random Forest to improve the accuracy of detecting network intrusions. In particular, we design an intelligent anomaly detection system that uses the algorithm of Random Forest in the process of feature selection and selects an appropriate algorithm in an adaptive way. In the evaluation, we conducted two major experiments using the KDD1999 dataset and a real dataset respectively. The experimental results indicate that Random Forest can enhance the performance of Neural Networks by identifying important and closely related features and that our developed system can select a better algorithm intelligently.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
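The dependence on strength and correlation mentioned in this abstract is usually summarized by an upper bound on the generalization error. The form below is how the bound is commonly stated for this paper; the symbols are introduced here for illustration, with PE* the generalization error of the forest, s > 0 the strength of the individual trees (their expected margin), and ρ̄ the mean correlation between trees.

```latex
PE^{*} \le \frac{\bar{\rho}\,\left(1 - s^{2}\right)}{s^{2}}
```

Read this way, the bound tightens when the trees are individually stronger (larger s) or less correlated with one another (smaller ρ̄), which is the motivation for randomizing the features considered at each split.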