Figure 1 - uploaded by David Martin Ward Powers
Illustration of ROC Analysis. The main diagonal represents chance, with parallel isocost lines representing equal cost-performance. Points above the diagonal represent performance better than chance; those below, worse than chance. For a single good (dotted=green) system, AUC is the area under the curve (the trapezoid between the green line and x=[0,1]). The perverse (dashed=red) system shown is the same (good) system with class labels reversed.
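The geometry in the caption can be checked numerically. The sketch below is illustrative only: it assumes a system summarised by a single operating point (FPR, TPR), joins it to (0,0) and (1,1) with straight lines, and treats "class labels reversed" as inverting every prediction, which maps (FPR, TPR) to (1-FPR, 1-TPR). The function names and the example point (0.2, 0.8) are hypothetical and are not taken from the figure.

```python
def single_point_auc(tpr: float, fpr: float) -> float:
    """Area under the ROC polyline (0,0) -> (fpr, tpr) -> (1,1)."""
    # Triangle under the first segment, trapezoid under the second.
    left = fpr * tpr / 2.0
    right = (1.0 - fpr) * (tpr + 1.0) / 2.0
    return left + right


def reverse_labels(tpr: float, fpr: float) -> tuple[float, float]:
    """Operating point of the same system with every prediction inverted
    (an assumption about what 'labels reversed' means here)."""
    return 1.0 - tpr, 1.0 - fpr


# Hypothetical operating point, chosen only for illustration.
good_tpr, good_fpr = 0.8, 0.2

auc_good = single_point_auc(good_tpr, good_fpr)
auc_perverse = single_point_auc(*reverse_labels(good_tpr, good_fpr))

# For a single point the area simplifies to (1 + TPR - FPR) / 2,
# so the reversed system's area is 1 minus the good system's area.
print(f"good AUC     = {auc_good:.3f}")
print(f"perverse AUC = {auc_perverse:.3f}")
```

Under these assumptions the two areas always sum to 1, which matches the picture of the perverse system as the good system reflected through the centre of the chance diagonal.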

Source publication
Article
Full-text available
Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Using these measures a system that performs worse in the objective sense of Informedness, can appear...

Citations

... Even the SWIR 1 and SWIR 2 regions also include a large built-up class has been chosen to assess the performance of the classifier. F1 has been defined as the harmonic mean of recall and precision values (Powers 2020). A good F1 score is also indicative of good classification performance. ...
Article
Full-text available
Processing of hyperspectral remote sensing datasets poses challenges in terms of computational expense pertaining to data redundancy. As such, band selection becomes indispensable to address redundancy while preserving the optimal spectral information. This paper proposes a novel architecture using Genetic Algorithm (GA) optimizing technique with Random Forest (RF) classifier for efficient band selection with the Hyperspectral Precursor of the Application Mission (PRISMA) dataset. The optimal bands are BLUE (λ=492.69 nm), NIR (λ=959.52 nm), and SWIR 1 (λ=1626.78 nm). This paper also involves an application of the selected bands to accurately identify and quantify built-up pixels by means of a new spectral index named Hyperspectral Imagery-based Built-up Index (HIBI). The proposed index was used to map built-up pixels in six cities around the world namely Jaipur, Varanasi, Delhi, Tokyo, Moscow and Jakarta to establish its robustness. This analysis shows that the proposed index has an accuracy of 94.02%, higher than all the other indices considered for this study. Moreover, the spectral separability analysis also establishes the efficiency of the proposed index to differentiate built-up pixels from spectrally similar land use or land cover classes.
... During each cross-validation, we presented the classic accuracy, precision, recall, and F1 score [51]. Notably, as our task was multiclass classification, we used the macro-average [52] to give a fairer value that emphasizes the rare category and calculated the four indices mentioned above. A macro-average will compute the result of each class independently, and then calculate the mean result of all the classes. ...
Article
Full-text available
Objective: Doctors, nowadays, primarily use auditory-perceptual evaluation, such as the grade, roughness, breathiness, asthenia, and strain scale, to evaluate voice quality and determine the treatment. However, the results predicted by individual physicians often differ, because of subjective perceptions, and diagnosis time interval, if the patient's symptoms are hard to judge. Therefore, an accurate computerized pathological voice quality assessment system will improve the quality of assessment. Method: This study proposes a self_attention-based system, with a deep learning technology, named self_attention-based bidirectional long-short term memory (SA BiLSTM). Different pitches [low, normal, high], and vowels [/a/, /i/, /u/], were added into the proposed model, to make it learn how professional doctors evaluate the grade, roughness, breathiness, asthenia, and strain scale, in a high dimension view. Results: The experimental results showed that the proposed system provided higher performance than the baseline system. More specifically, the macro average of the F1 score, presented as decimal, was used to compare the accuracy of classification. The (G, R, and B) of the proposed system were (0.768±0.011, 0.820±0.009, and 0.815±0.009), which is higher than the baseline systems: deep neural network (0.395±0.010, 0.312±0.019, 0.321±0.014) and convolution neural network (0.421±0.052, 0.306±0.043, 0.3250±0.032) respectively. Conclusions: The proposed system, with SA BiLSTM, pitches, and vowels, provides a more accurate way to evaluate the voice. This will be helpful for clinical voice evaluations and will improve patients' benefits from voice therapy.
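The macro-average described in the excerpt above (score each class independently, then take the unweighted mean) might be sketched as follows. This is a minimal illustration, not the citing authors' code: the helper names and the toy labels are hypothetical, and a class with no true positives is assigned an F1 of 0 by convention.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as 'positive' and all others as 'negative'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # convention assumed here for undefined F1
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical imbalanced example: class "a" dominates, "b" is always missed.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "a", "c"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the accuracy stays high because the frequent class is handled well, while the macro-average drops because the missed rare class counts as much as the frequent one.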
... These measures are computed based on the measures of the confusion matrix. First, recall or sensitivity is the amount of real positive values that are accurately labeled as positive, whereas precision is the predictive positive values or confidence of a model [49]. Likewise, the harmonic mean of sensitivity and recall is referred to as F-measure [50]. ...
Article
Full-text available
In numerous perilous cases, a quick medical decision is needed for the early detection of chronic diseases to avoid austere consequences that may be fatal. Chronic kidney disease (CKD) is a prevalent disease that presents a variety of challenges, including soaring costs for intervention, urgency, and, more importantly, difficulty in early detection of the disease. The current study carries out a prediction-based method that helps in detecting and diagnosing CKD patients which enables a fast and accurate decision-making process at the early stage. A combination of preprocessing and feature selection methods was developed; additionally, several prediction models, such as K-nearest neighbor (KNN), support vector machine (SVM), random forest (RF), and bagging, were trained based on the processed dataset. The performance evaluation shows higher reliability of all models in terms of accuracy, precision, sensitivity, F-measure, specificity, and area under the curve (AUC) score. Specifically, KNN outperformed with an accuracy of 99.50%, sensitivity of 99.2%, precision of 100%, specificity of 98.7%, and F-measure and AUC score of 99.6%. The experimental results of KNN show the best fitted model compared to the existing state-of-the-art methods. Moreover, the reduced feature set proves that just a few clinical tests are enough to detect CKD, resulting in diagnosis cost reduction.
... In this preliminary evaluation, the collected data are divided based on the data collection technologies we utilized (force platform, EMGs, or acceleration data) into three different datasets, as shown in Table 3. Subsequently, for each dataset type (i.e., A, B, and C), we applied the data-processing pipeline defined in Section 4.3 and shown in Figure 10. In particular, the data, in segmentation and feature representation forms (segmented in time windows of 150 ms, taken as the hypothetical duration of the APA movement phase), were used to train three different machine learning models, whose performances in recognizing the movement phases of Figure 5 were measured in terms of accuracy (A), precision (P), recall (R), and F1-score (F1), defined as follows [57]: ...
Article
Full-text available
In medicine and sport science, postural evaluation is an essential part of gait and posture correction. There are various instruments for quantifying the postural system’s efficiency and determining postural stability which are considered state-of-the-art. However, such systems present many limitations related to accessibility, economic cost, size, intrusiveness, usability, and time-consuming set-up. To mitigate these limitations, this project aims to verify how wearable devices can be assembled and employed to provide feedback to human subjects for gait and posture improvement, which could be applied for sports performance or motor impairment rehabilitation (from neurodegenerative diseases, aging, or injuries). The project is divided into three parts: the first part provides experimental protocols for studying action anticipation and related processes involved in controlling posture and gait based on state-of-the-art instrumentation. The second part provides a biofeedback strategy for these measures concerning the design of a low-cost wearable system. Finally, the third provides algorithmic processing of the biofeedback to customize the feedback based on performance conditions, including individual variability. Here, we provide a detailed experimental design that distinguishes significant postural indicators through a conjunct architecture that integrates state-of-the-art postural and gait control instrumentation and a data collection and analysis framework based on low-cost devices and freely accessible machine learning techniques. Preliminary results on 12 subjects showed that the proposed methodology accurately recognized the phases of the defined motor tasks (i.e., rotate, in position, APAs, drop, and recover) with overall F1-scores of 89.6% and 92.4%, respectively, concerning subject-independent and subject-dependent testing setups.
... By considering white color or 1 as positive and black color or 0 as negative in the BW image, TP is when the model correctly predicts the positive class and TN is when the model correctly predicts the negative class. FP is when the model incorrectly predicts the positive class and FN is when the model incorrectly predicts the negative class [25,26]. Figure 7 shows mentioned concept as a visual. ...
... Figure 8 illustrates, Comparison of results on an MRI brain tumor sample by different algorithms. Now, Accuracy [25,26] is the percent of pixels that are correctly classified as (2): ...
... Precision [25,26] means how many of those predicted objects had matching ground truth annotation and is calculated as (3): ...
Preprint
Full-text available
Accurate detection of brain tumors could save lots of lives and increasing the accuracy of this binary classification even as much as a few percent has high importance. Neural Gas Networks (NGN) is a fast, unsupervised algorithm that could be used in data clustering, image pattern recognition, and image segmentation. In this research, we used the metaheuristic Firefly Algorithm (FA) for image contrast enhancement as pre-processing and NGN weights for feature extraction and segmentation of Magnetic Resonance Imaging (MRI) data on two brain tumor datasets from the Kaggle platform. Also, tumor classification is conducted by Support Vector Machine (SVM) classification algorithms and compared with a deep learning technique plus other features in train and test phases. Additionally, NGN tumor segmentation is evaluated by famous performance metrics such as Accuracy, F-measure, Jaccard, and more versus ground truth data and compared with traditional segmentation techniques. The proposed method is fast and precise in both tasks of tumor classification and segmentation compared with other methods. A classification accuracy of 95.14 % and segmentation accuracy of 0.977 is achieved by the proposed method.
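The pixel-level counting described in the excerpts above (white or 1 as positive, black or 0 as negative) can be sketched on two small binary masks. The equations numbered (2) and (3) are elided in the excerpts, so the accuracy and precision formulas below are the standard textbook definitions and only assumed to match them; the masks and function name are hypothetical.

```python
def confusion_counts(pred_mask, truth_mask):
    """Count TP, TN, FP, FN over two equally sized binary (0/1) masks."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred_mask, truth_mask):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1  # predicted positive, truly positive
            elif p == 0 and t == 0:
                tn += 1  # predicted negative, truly negative
            elif p == 1 and t == 0:
                fp += 1  # predicted positive, truly negative
            else:
                fn += 1  # predicted negative, truly positive
    return tp, tn, fp, fn


# Hypothetical 2x3 ground-truth and predicted segmentation masks.
truth = [[0, 1, 1],
         [0, 1, 0]]
pred = [[0, 1, 0],
        [1, 1, 0]]

tp, tn, fp, fn = confusion_counts(pred, truth)
accuracy = (tp + tn) / (tp + tn + fp + fn)      # share of correctly classified pixels
precision = tp / (tp + fp) if tp + fp else 0.0  # share of predicted positives that are correct
print(tp, tn, fp, fn, round(accuracy, 3), round(precision, 3))
```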
... This progression score is our assessment of the relatedness. 61,75 and, in general, strongly depends on the specific problem under investigation. Here we discuss the practical meaning of the performance indicators we used to compare the ML algorithms. ...
... • Precision Precision is defined as the ratio between true positives and positives 61 . In our case, we predict that a number of products will be competitively exported by some countries; these are the positives. ...
... By using mP@k we quantify the correctness of our possible recommendations of k products, on average, for a country. • Recall Recall is defined as the ratio between true positives and the sum of true positives and false negatives or, in other words, the total number of products that a country will export after years 61 . So a high recall is associated with a low number of false negatives, that is, if we predict that a country will not start exporting a product, that country will usually not export that product. ...
Article
Full-text available
Economic complexity methods, and in particular relatedness measures, lack a systematic evaluation and comparison framework. We argue that out-of-sample forecast exercises should play this role, and we compare various machine learning models to set the prediction benchmark. We find that the key object to forecast is the activation of new products, and that tree-based algorithms clearly outperform both the quite strong auto-correlation benchmark and the other supervised algorithms. Interestingly, we find that the best results are obtained in a cross-validation setting, when data about the predicted country was excluded from the training set. Our approach has direct policy implications, providing a quantitative and scientifically tested measure of the feasibility of introducing a new product in a given country.
... Furthermore, definitions of the different evaluation metrics used to evaluate the performance of the machine learning-based approaches for driving behavior estimations are explained. Commonly used metrics are ACC, DR, precision, FAR, AUC, and F1 score which compares the estimated and actual behavior for similarities (Powers, 2011). ...
... Estimation of lane-changing behaviors is either prediction or recognition of lane-changing behaviors. As for the metrics, ACC, DR, precision, and FAR are defined using true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values given by Powers (2011). ...
... Higher ACC, DR, and precision values and low FAR values indicate better estimation performance of the approaches. A commonly known property of ACC is that it performs poorly with a non-balanced data set (López et al., 2013), while DR and precision provide a better evaluation for non-balanced data sets (Powers, 2011). The drawbacks of DR and precision are that DR does not consider FP values, while precision does not consider FN values (Sharath and Mehran, 2021). ...
Article
Full-text available
A major aspect in the development of advanced driving assistance systems (ADASs) is the research in developing human driving behavior prediction and recognition models. Recent contributions focus on developing these models for estimating different driving behaviors like lane or speed change. Thus, the models are incorporated into the ADAS to generate warnings and hints for safe maneuvers. Driving behavior recognition and prediction models are generally developed based on machine learning (ML) algorithms and are proven to generate accurate estimations. Previous review research contributions tend to focus on ML-based models for the prediction and recognition of speed change, trajectory change, and even driving styles. Due to high number of driving errors occurring during a lane change, a state-of-art review of different ML-based models for lane-changing behavior prediction and recognition is helpful to present a comparison between different models in terms of structure, influencing input variables, and performance. This enables the integration of the most efficient model for the development of ADASs to avoid accidents during a lane change. First, definitions and terms related to the model’s task and evaluation metrics used to evaluate the model’s performance are described to improve the readability. Then, the different input variables of the models affecting the lane-changing behaviors are presented. Next, a review of the models developed based on well-known approaches, such as artificial neural network (ANN), hidden Markov model (HMM), and support vector machine (SVM), using different input variables is given. Three lane-changing behaviors are focused on here: left/right lane change and lane keeping. The advantages and disadvantages of the different ML models with a comparison are summarized as well. Finally, the improvements required in the future are discussed.
... Precision is responsible for measuring the relevancy of the generated prediction results. A recall is responsible for measuring the correctly classified RUW tweets and the F1-Score is a weighted combination of recall and precision scores [73] ...
Article
Full-text available
The beginning of this decade brought utter international chaos with the COVID-19 pandemic and the Russia-Ukraine war (RUW). The ongoing war has been building pressure across the globe. People have been showcasing their opinions through different communication media, of which social media is the prime source. Consequently, it is important to analyze people's emotions toward the RUW. This paper therefore aims to provide the framework for automatically classifying the distinct societal emotions on Twitter, utilizing the amalgamation of Emotion Robustly Optimized Bidirectional Encoder Representations from the Transformers Pre-training Approach (Emoroberta) and machine-learning (ML) techniques. This combination shows the originality of our proposed framework, i.e., Russia-Ukraine War emotions (RUemo), in the context of the RUW. We have utilized the Twitter dataset related to the RUW available on Kaggle.com. The RUemo framework can extract the 27 distinct emotions of Twitter users that are further classified by ML techniques. We have achieved 95% of testing accuracy for multilayer perceptron and logistic regression ML techniques for the multiclass emotion classification task. Our key finding indicates that: First, 81% of Twitter users in the survey show a neutral position toward RUW; second, there is evidence of social bots posting RUW-related tweets; third, other than Russia and Ukraine, users mentioned countries such as Slovakia and the USA; and fourth, the Twitter accounts of the Ukraine President and the US President are also mentioned by Twitter users. Overall, the majority of tweets describe the RUW in key terms related more to Ukraine than to Russia.
... Other statistical measures used for evaluating wetland classification performance are the overall accuracy (OA), producers' accuracy (PA), and users' accuracy (UA), all calculated from the error matrices classes [52][53][54]. The formula for calculating UA, PA, and OA is as follows: ...
Article
Full-text available
Wetlands are a valuable ecosystem that provides various services to flora and fauna. This study developed and compared deep and shallow learning models for wetland classification across the climatically dynamic landscape of Alberta’s Parkland and Grassland Natural Region. This approach to wetland mapping entailed exploring multi-temporal (combination of spring/summer and fall months over 4 years—2017 to 202) and multisensory (Sentinel 1 and 2 and Advanced Land Observing Satellite, ALOS) data as input in the predictive models. This input image consisted of S1 dual-polarization vertical-horizontal bands, S2 near-infrared and shortwave infrared bands, and ALOS-derived topographic wetness index. The study explored the ResU-Net deep learning (DL) model and two shallow learning models, namely random forest (RF) and support vector machine (SVM). We observed a significant increase in the average F1-score of the ResNet model prediction (0.82) compared to SVM and RF prediction of 0.69 and 0.69, respectively. The SVM and RF models showed a significant occurrence of mixed pixels, particularly marshes and swamps confused for upland classes (such as agricultural land). Overall, it was evident that the ResNet CNN predictions performed better than the SVM and RF models. The outcome of this study demonstrates the potential of the ResNet CNN model and exploiting open-access satellite imagery to generate credible products across large landscapes. Graphical Abstract
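The overall, producer's, and user's accuracies mentioned in the excerpt above are read off an error matrix. The formula itself is elided in the excerpt, so the sketch below uses the conventional remote-sensing definitions and assumes that rows hold the classified (map) classes and columns hold the reference classes; the function name and the 2x2 matrix are hypothetical.

```python
def accuracy_from_error_matrix(matrix):
    """Return (overall accuracy, user's accuracies, producer's accuracies).

    Assumed layout: matrix[i][j] = samples classified as class i
    whose reference class is j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = [matrix[i][i] for i in range(n)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]

    overall = sum(diagonal) / total
    # User's accuracy: correct / everything mapped as that class (row total).
    users = [d / rt if rt else 0.0 for d, rt in zip(diagonal, row_totals)]
    # Producer's accuracy: correct / everything truly in that class (column total).
    producers = [d / ct if ct else 0.0 for d, ct in zip(diagonal, col_totals)]
    return overall, users, producers


# Hypothetical two-class error matrix (e.g. wetland vs. upland).
error_matrix = [[40, 10],
                [5, 45]]

oa, ua, pa = accuracy_from_error_matrix(error_matrix)
print(f"OA = {oa:.2f}")
print("UA =", [round(x, 3) for x in ua])
print("PA =", [round(x, 3) for x in pa])
```

With this layout, user's accuracy corresponds to per-class precision and producer's accuracy to per-class recall; transposing the matrix swaps the two.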