Figure - uploaded by José A. Sáez
Notation used.

Source publication
Article
Full-text available
This paper presents the first review of noise models in classification covering both label and attribute noise. Their study reveals the lack of a unified nomenclature in this field. In order to address this problem, a tripartite nomenclature based on the structural analysis of existing noise models is proposed. Additionally, a revision of their cur...

Context in source publication

Context 1
... second cause involves discrepancies in the names of the noise models, as well as in the terminology related to them [28,29]. In order to delve into this aspect, the notation compiled in Table 1 is used, which is also employed in the remainder of this work. Let D be a classification dataset composed of n samples x_i (i ∈ {1, . . . ...

Citations

... However, despite the proposals for encoding amino acid categories, one of the most widely used methods in the literature is one-hot encoding or its derivatives [9], [11], [14]. These techniques can increase data complexity by creating multiple new attributes, which could potentiate other problems, such as information loss or noise in the data [6], [34]. Dealing with these additional variables may also negatively impact the interpretability and classification performance of the models built. ...
Article
Full-text available
Categorical attributes are common in many classification tasks, presenting certain challenges as the number of categories grows. This situation can affect data handling, negatively impacting the building time of models, their complexity and, ultimately, their classification performance. In order to mitigate these issues, this research proposes a novel preprocessing technique for grouping attribute categories in classification datasets. This approach combines the exact representation of the association between categorical values in a Euclidean space, clustering methods and attribute quality metrics to group similar attribute categories based on their contribution to the classification task. To estimate its effectiveness, the proposal is evaluated within the context of HIV-1 protease cleavage site prediction, where each attribute represents an amino acid that can take multiple possible values. The results obtained on HIV-1 real-world datasets show a significant reduction in the number of categories per attribute, with an average reduction percentage ranging from 74% to 81%. This reduction leads to simplified data representations and improved classification performances compared to not preprocessing. Specifically, improvements of up to 0.07 in accuracy and 0.19 in geometric mean are observed across different datasets and classification algorithms. Additionally, extensive simulations on synthetic datasets with varied characteristics are carried out, providing consistent and reliable results that validate the robustness of the proposal. These findings highlight the capability of the developed method to enhance cleavage prediction, which could potentially contribute to understanding viral processes and developing targeted therapeutic strategies.
... However, despite the efforts to minimize errors, human error remains a potential source of inaccuracies in the data [52] given the dependence on data entered by nursing professionals. In terms of assumptions, the study considers that the collected data accurately represent the patient population under investigation, without significant bias or missing information. ...
Article
Full-text available
Pressure ulcers carry a significant risk in clinical practice. This paper proposes a practical and interpretable approach to estimate the risk levels of pressure ulcers using decision tree models. In order to address the common problem of imbalanced learning in nursing classification datasets, various oversampling configurations are analyzed to improve the data quality prior to modeling. The decision trees built are based on three easily identifiable and clinically relevant pressure ulcer risk indicators: mobility, activity, and skin moisture. Additionally, this research introduces a novel tabular visualization method to enhance the usability of the decision trees in clinical practice. Thus, the primary aim of this approach is to provide nursing professionals with valuable insights for assessing the potential risk levels of pressure ulcers, which could support their decision-making and allow, for example, the application of suitable preventive measures tailored to each patient’s requirements. The interpretability of the models proposed and their performance, evaluated through stratified cross-validation, make them a helpful tool for nursing care in estimating the pressure ulcer risk level.
... In order to study the impact of noise on the above datasets, two types of noise (regressand noise and attribute noise) are considered separately, the introduction of which is based on widely used approaches to simulate noise in classification data (Sáez, 2022, 2023). For both types of noise, to inject a noise level x% into an evapotranspiration dataset, x% of the samples of each affected variable are selected and their values are replaced by other random ones within the domain. ...
Article
In smart agriculture, the accurate prediction of evapotranspiration plays a crucial role in optimizing water usage and maximizing crop yield. However, the increasing adoption of IoT sensor technologies has resulted in the accumulation of large amounts of data, which are frequently contaminated by noise and pose a significant challenge to extract reliable knowledge through data modeling. This research addresses the problem of noisy IoT sensor data and its impact on evapotranspiration prediction, an essential aspect of agricultural practices. The effect of noise on sensor variables and evapotranspiration is extensively analyzed by simulating different noise levels in evapotranspiration datasets collected from various agricultural areas in Spain, enabling a comprehensive evaluation of its impact on the performance of data science models. Despite the potential consequences of this type of error, a noise preprocessing stage, which is necessary to improve data quality prior to modeling, is often overlooked in the existing literature in this field. In order to address this challenge, this paper proposes the usage of regression noise filters as an approach to mitigate the detrimental effects of noisy IoT sensor data on evapotranspiration prediction. Additionally, we introduce the rgnoisefilt R package, which offers a practical and efficient implementation of noise filtering techniques for regression datasets, providing a valuable solution for handling noisy data in smart agriculture applications. The experimental results obtained emphasize the negative impacts of noise on evapotranspiration prediction performance and highlight the importance of an appropriate data treatment to mitigate system deterioration. Furthermore, the findings of this research emphasize the efficacy of the regression noise filters implemented in the rgnoisefilt software, enhancing the performance of the models built and providing a valuable tool for improving data quality in smart agriculture.
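The noise injection procedure described in the citing excerpt above (select a fraction of the samples of each affected variable and replace their values with random ones within the domain) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `inject_noise`, the list-of-rows data layout, the uniform draw between each variable's observed minimum and maximum, and the fixed seed are all assumptions made for illustration.

```python
import random


def inject_noise(data, columns, level, seed=0):
    """Return a noisy copy of `data` (a list of rows of floats).

    For each column index in `columns`, a fraction `level` of the rows is
    selected at random and the value in that column is replaced by a value
    drawn uniformly from the observed domain [min, max] of the column.
    All names and the uniform-domain choice are illustrative assumptions.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in data]  # copy so the input stays untouched
    n = len(noisy)
    k = round(level * n)  # number of samples to corrupt per variable
    for col in columns:
        values = [row[col] for row in data]
        lo, hi = min(values), max(values)
        for i in rng.sample(range(n), k):
            noisy[i][col] = rng.uniform(lo, hi)
    return noisy


# Toy dataset: 20 samples, two variables; only variable 1 is corrupted.
data = [[float(i), float(10 * i)] for i in range(20)]
noisy = inject_noise(data, columns=[1], level=0.2)
changed = sum(a[1] != b[1] for a, b in zip(data, noisy))
```

Passing the index of the output variable in `columns` would correspond to regressand noise, and passing input variable indices to attribute noise; in this sketch the two cases differ only in which columns are listed.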
... Sáez et al. [16] provide more details and explanations about these methods. Another recent comprehensive review of different methods of adding noise to the class variable, the attribute variables, or both in combination is given by Sáez [28]. Sáez has also presented an R package called noisemodel [29]. ...
Article
Full-text available
Class noise is a common issue that affects the performance of classification techniques on real-world data sets. Class noise appears when the class variable of a data set contains incorrect class labels. In the case of noisy data, the robustness of classification techniques against noise could be more important than the performance results on noise-free data sets. The decision tree method is one of the most popular techniques for classification tasks. The C4.5, CART, and random forest (RF) algorithms are considered to be three of the most used algorithms in decision trees. The aim of this paper is to reach conclusions on which decision tree algorithm is better to use for building decision trees in terms of its performance and robustness against class noise. In order to achieve this aim, we study and compare the performance of the models when applied to class variables with noise. The results obtained indicate that the RF algorithm is more robust to data sets with a noisy class variable than the other algorithms.
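As an illustration of what class noise looks like in practice, the following minimal Python sketch corrupts a fraction of the labels by replacing each selected label with a different class chosen uniformly at random. This is one common way of simulating label noise and is assumed here for illustration only; the abstract above does not state which noise model the study used, and the function name `flip_labels` and its parameters are hypothetical.

```python
import random


def flip_labels(labels, level, seed=0):
    """Return a copy of `labels` with a fraction `level` of them corrupted.

    Each selected label is replaced by a different class drawn uniformly
    from the remaining classes (an assumed, illustrative noise model).
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("at least two classes are needed to flip labels")
    noisy = list(labels)
    k = round(level * len(labels))  # number of labels to corrupt
    for i in rng.sample(range(len(labels)), k):
        alternatives = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


labels = ["a"] * 10 + ["b"] * 10
noisy_labels = flip_labels(labels, level=0.1)
n_flipped = sum(x != y for x, y in zip(labels, noisy_labels))
```

Because every replacement is drawn from the other classes, the number of corrupted labels equals the rounded noise level times the number of samples, which makes the effective noise level easy to control when comparing classifiers.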
... In future work, we plan to study within the clinical setting whether these models lead to significant reductions in PUs cases in patients at risk by assigning the most appropriate support surfaces in each case. Furthermore, it is interesting to delve into other aspects of data preprocessing in order to further improve the models created, in addition to the class imbalance that has been addressed, such as overlapping among classes or noise [37]. ...
Preprint
Full-text available
Pressure ulcers carry a significant risk in clinical practice and require effective preventive measures. This paper proposes a practical and interpretable approach to estimate the risk levels of pressure ulcers using decision tree models. In order to address the common problem of imbalanced learning in nursing classification datasets, various oversampling configurations are analyzed to improve data quality prior to modeling. The decision trees built are based on three easily identifiable and clinically relevant pressure ulcer risk indicators: mobility, activity and skin moisture. Their analysis allows nursing professionals to predict the risk levels of pressure ulcer and make informed decisions about patient care. Additionally, this research introduces a novel tabular visualization method to enhance the usability of the decision trees in clinical practice. The approach proposed aims to support nursing professionals in making timely decisions regarding the appropriate preventive interventions according to the risk levels of pressure ulcers, thus improving patient outcomes and healthcare costs. The usefulness and effectiveness of the models presented make them a valuable resource for nursing care in the prevention of pressure ulcers.
... Later on, several studies aimed to provide a more homogeneous view of corruption, often referred to as noise or distribution shift. However, their frameworks typically relied on corruption-invariant assumptions of the marginal or conditional probabilities, and the extent of exhaustiveness in representing all possible corruption models within their framework is merely conjectured, or not considered [15,16,17,18]. ...
Preprint
Full-text available
Corruption is frequently observed in collected data and has been extensively studied in machine learning under different corruption models. Despite this, there remains a limited understanding of how these models relate such that a unified view of corruptions and their consequences on learning is still lacking. In this work, we formally analyze corruption models at the distribution level through a general, exhaustive framework based on Markov kernels. We highlight the existence of intricate joint and dependent corruptions on both labels and attributes, which are rarely touched by existing research. Further, we show how these corruptions affect standard supervised learning by analyzing the resulting changes in Bayes Risk. Our findings offer qualitative insights into the consequences of "more complex" corruptions on the learning problem, and provide a foundation for future quantitative comparisons. Applications of the framework include corruption-corrected learning, a subcase of which we study in this paper by theoretically analyzing loss correction with respect to different corruption instances.
... 11,34 However, noise affecting real-world data and, particularly, chemical datasets, is usually not quantifiable, and its characteristics are unknown. In order to test the effectiveness of these methods in a controlled environment, noise models 35 have been proposed to introduce errors into datasets in a supervised way. The utility of noise models in the field of noisy data research is clearly reflected by their wide use in the specialized literature, as they are frequently employed in dozens of papers published each year in this area. ...
... This paper presents the noisemodel R package, which provides the implementation of a total of 72 noise models for classification data found in the specialized literature. 35 It includes methods to introduce label noise, attribute noise, and both in combination. 10,41 The software is built following an S3 formulation of methods (with default and formula class alternatives), allowing the models to be applied in a unified and user-friendly way. ...
... For a practical recommendation on the potential noise model to be used based on the knowledge of the problem treated, the characteristics of the available data, and the noise to be simulated, the reader may consult Sáez. 35 3 | USAGE OF THE NOISEMODEL PACKAGE ...
Article
Full-text available
Classification datasets created from chemical processes can be affected by errors, which impair the accuracy of the models built. This fact highlights the importance of analyzing the robustness of classifiers against different types and levels of noise to know their behavior against potential errors. In this context, noise models have been proposed to study noise‐related phenomenology in a controlled environment, allowing errors to be introduced into the data in a supervised manner. This paper introduces the noisemodel R package, which contains the first extensive implementation of noise models for classification datasets, proposing it as a support tool to analyze the impact of errors related to chemical data. It provides 72 noise models found in the specialized literature that allow errors to be introduced in different ways in classes and attributes. Each of them is properly documented and referenced, unifying their results through a specific S3 class, which benefits from customized print, summary and plot methods. The usage of the package is illustrated through four application examples considering real‐world chemical datasets, where errors are prone to occur. The software presented will help to deepen the understanding of the problem of noisy chemical data, as well as to develop new robust algorithms and noise preprocessing methods properly adapted to different types of errors in this scenario.
Article
Class imbalance learning is a challenging task in machine learning applications. To balance training data, traditional class imbalance learning approaches, such as class resampling or reweighting, are commonly applied in the literature. However, these methods can have significant limitations, particularly in the presence of noisy data, missing values, or when applied to advanced learning paradigms like semi-supervised or federated learning. To address these limitations, this paper proposes a novel and theoretically-ensured latent Feature Rectification method for clAss iMbalance lEarning (FRAME). The proposed FRAME can automatically learn multiple centroids for each class in the latent space and then perform class balancing. Unlike data-level methods, FRAME balances features in the latent space rather than the original space. Compared to algorithm-level methods, FRAME can distinguish different classes based on distance without the need to adjust the learning algorithms. Through latent feature rectification, FRAME can effectively mitigate the effect of noise and missing values without worrying about structural variations in the data. In order to accommodate a wider range of applications, this paper extends FRAME to the following three main learning paradigms: fully-supervised learning, semi-supervised learning, and federated learning. Extensive experiments on 10 binary-class datasets demonstrate that FRAME achieves performance competitive with state-of-the-art methods and is robust to noise and missing values.