Figure 5 - uploaded by David Martin Ward Powers
Illustration of significance and Cramer's V. 110 Monte Carlo simulations with 11 stepped expected Informedness levels (red), with Bookmaker-estimated Informedness (red dots), Markedness (green dots) and Correlation (blue dots), with significance (p+1) calculated using G², χ², and Fisher estimates, and (skewed) Cramer's V Correlation estimates calculated from both G² and χ². Here K=4, N=128, X=1.96, α=β=0.05.
Source publication
Commonly used evaluation measures, including Recall, Precision, F-Measure and Rand Accuracy, are biased and should not be used without a clear understanding of the biases and a corresponding identification of chance or base-case levels of the statistic. Using these measures, a system that performs worse in the objective sense of Informedness can appear...
Context in source publication
Context 1
... the lower degree-of-freedom model. The error introduced by the Cramer's V (K−1 degrees of freedom) approximation to significance from G² or χ² can be viewed in two ways. If we start with a G² or χ² estimate, as intended by Cramer, we can test the accuracy of the estimate versus the true correlation, markedness and informedness, as illustrated in Fig. 5. Note that Cramer's V underestimates association for high levels of informedness, whilst it is reasonably accurate for lower levels. If we use (53) to (55) to estimate significance from the empirical association measures, we will thus underestimate significance under conditions of high association, viz. the test ...
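Equations (53) to (55) are not reproduced in this excerpt, so the following is only a minimal sketch of the standard Cramer's V computation from either the χ² or the G² statistic of a contingency table, using SciPy; the 4×4 table is an arbitrary illustration chosen to match the K=4, N=128 of Fig. 5, not data from the paper.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table, statistic="pearson"):
    """Cramer's V from the chi-squared (default) or G^2 statistic.

    Standard definition: V = sqrt(stat / (N * (K - 1))), where
    K - 1 = min(rows, cols) - 1 is the lower degree of freedom.
    """
    table = np.asarray(table, dtype=float)
    lam = None if statistic == "pearson" else "log-likelihood"  # G^2
    stat, p, dof, expected = chi2_contingency(table, correction=False, lambda_=lam)
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(stat / (n * k)), p

# Arbitrary 4x4 contingency table with N = 128 (matching Fig. 5's K and N).
table = [[20, 4, 4, 4], [4, 20, 4, 4], [4, 4, 20, 4], [4, 4, 4, 20]]
v_chi2, p_chi2 = cramers_v(table)                    # from chi-squared
v_g2, p_g2 = cramers_v(table, statistic="g2")        # from G^2
print(f"V(chi2) = {v_chi2:.3f} (p = {p_chi2:.2e}), V(G2) = {v_g2:.3f}")
```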
Similar publications
Evaluating the accuracy (i.e., estimating the sensitivity and specificity) of new diagnostic tests without the presence of a gold standard is of practical importance and has been the subject of intensive study for several decades. Existing methods use 2 or more diagnostic tests under several basic assumptions and then estimate the accuracy parameters vi...
In this research work, the latest version 7 (V7) of the Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) 3B42 product is evaluated over Greece at five different temporal scales: 3, 6, 12, 24 and 48 hrs. Evaluation consists of assessing the performance of the product in reproducing observed, non-zero precipit...
We propose a novel method to forecast corporate earnings which combines the accuracy of analysts' forecasts with the unbiasedness of a mechanical model. Our choice of variables is driven by recent insights from the earnings forecasts literature and the resulting model outperforms all analyzed methods in terms of accuracy, bias, and earnings respons...
The polychoric instrumental variable (PIV) approach is a recently proposed method to fit a confirmatory factor analysis model with ordinal data. In this paper, we first examine the small-sample properties of the specification tests for testing the validity of instrumental variables (IVs). Second, we investigate the effects of using different number...
In this paper we present a study on the credibility of lay users' evaluations of health-related web content. We investigate the differences between their approach and the approach of medical experts, analyse whether we can increase their accuracy using a simple support system, and explore the effectiveness of the wisdom-of-crowds approach. We find t...
Citations
... Therefore, performance evaluation metrics were used for the binary classification problem. The performance evaluation metrics used are defined below (Powers, 2020; Warrens, 2008). ...
... This part evaluates the performance of traditional approaches to algorithms for sexual harassment sentiment classification. It is based on four measures: precision, accuracy, recall, and the F-score [25,26]. All four measures are defined by the following equations: ...
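The equations are cut off in the excerpt above; for reference, here is a minimal sketch of the standard definitions of these four measures from binary confusion-matrix counts, with the general F-beta form included since the abstract below reports F0.5. This is the textbook formulation, not necessarily the cited paper's exact one.

```python
def binary_metrics(tp, fp, fn, tn, beta=1.0):
    """Standard binary-classification measures from confusion-matrix counts.

    beta < 1 weights precision more heavily (e.g. beta = 0.5 gives F0.5).
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many are found
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, f"f{beta:g}": f_beta}

print(binary_metrics(tp=90, fp=10, fn=5, tn=95, beta=0.5))  # toy counts
```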
Due to advances in technology, social media has become the most popular medium for spreading news. Many messages are published on social media sites such as Facebook, Twitter, and Instagram. Social media platforms also provide opportunities to express opinions, and social phenomena such as hate, offensive language, racism, sexual content, and all forms of verbal violence have increased strikingly. These behaviors not only affect specific countries, groups, or societies but also extend beyond them into people's daily lives. This study examines sexual content and harassment discourse in Arabic social media in order to build an accurate system for detecting sexual harassment expressions. The dataset was collected from Twitter posts for the classification task. A deep learning model was developed as a classification system to identify sexual speech, using Bidirectional Long Short-Term Memory (BiLSTM) and a Temporal Convolutional Network (TCN) with word embeddings and FastText pre-trained on an Arabic language model. The proposed TCN-BiLSTM model was compared with Extreme Gradient Boosting (XGBoost). On the CASH dataset, the TCN-BiLSTM model obtained an accuracy rate of 96.65% and an F0.5 value of 0.969. The implementation of XGBoost using word embeddings resulted in an accuracy rate of 92.56% and an F0.5 value of 0.925. Findings and manual interpretation showed that different text representation methods combined with various deep learning algorithms readily achieve higher classification performance on complex sentences. This strategy is helpful for languages that are morphologically difficult to study, like Arabic, Turkish, and Lithuanian.
... MCC is used as a balanced measure of the quality of classifications even if the classes are of different sizes [68]; it reflects the correlation between the prediction and the ground truth. A maximum score of 1 indicates perfect matching between the segmented voxels and their ground truth [69]. The ROC area, also known as the AUC (area under the ROC curve), is also used to evaluate the segmentation method [70]. ...
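As a companion to the excerpt's description of MCC, here is a sketch of the standard binary MCC formula; note that the multi-class generalization sometimes used in segmentation work differs from this.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient: -1 is total disagreement,
    0 is chance-level, +1 is perfect prediction."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # conventionally 0 when a margin is empty

print(mcc(tp=90, fp=10, fn=5, tn=95))
```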
The application of machine learning and computer vision in microtomography provides new opportunities to directly analyze the microstructural evolutions of strain-hardening cementitious composites (SHCC) under tensile load, especially the strain-hardening process. For the first time, a state-of-the-art machine-learning pipeline combined with digital volume correlation for automated microtomography segmentation analysis (MSA) was developed to separate different components and quantify the in-situ 3D morphological properties of the fibers and pore networks imaged with in-situ synchrotron X-ray computed microtomography. Strain localization and crack initiation were observed around the interconnected pores where strain localized instead of the weakest cross-section defined by the fiber distribution and porosity. Fibers reinforced the crack planes through fiber debonding, bridging, bending, stretching, and orientation redistribution, which contributed to the crack width control and ductility of SHCC in the experiment. This work is essential to understand the progressive damage mechanisms of SHCC and help refine the characterization, modeling, and design of the composite using a bottom-up approach.
... Since we cannot display the raw signal of the two kinds of data for all records, we need some indicators to objectively evaluate the consistency of the data. Following Sterr's work [17], we extracted the sleeping parts of each record and calculated the RSP of each frequency band, including delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), sigma (12–16 Hz), and beta (16–30 Hz), with the total frequency range from 0.5 Hz to 30 Hz. Due to differences in the recording locations and the hardware, we do not expect the RSPs of each record of LM and PSG to be strictly equal, but they should follow the same trend as the participant's overnight sleep state changes. ...
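The relative-spectral-power computation described above might look roughly like the following sketch using Welch's method; the band edges come from the excerpt, but the segment length and integration details are assumptions, not details from the cited study.

```python
import numpy as np
from scipy.signal import welch

# Band edges in Hz, as given in the excerpt.
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
         "sigma": (12, 16), "beta": (16, 30)}

def relative_spectral_power(eeg, fs):
    """Relative power of each band over 0.5-30 Hz from Welch's PSD estimate."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(4 * fs))  # 4 s segments (assumed)
    in_range = (freqs >= 0.5) & (freqs <= 30)
    total = np.trapz(psd[in_range], freqs[in_range])
    return {name: np.trapz(psd[(freqs >= lo) & (freqs <= hi)],
                           freqs[(freqs >= lo) & (freqs <= hi)]) / total
            for name, (lo, hi) in BANDS.items()}

# Toy usage: 60 s of synthetic noise sampled at 256 Hz.
rsp = relative_spectral_power(np.random.randn(60 * 256), fs=256)
print(rsp)
```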
... In order to evaluate the performance of the LM automatic sleep staging algorithm, we calculated the Accuracy, F1-Score [20], and Kappa score of the sleep staging results against the ground truth as evaluation metrics. These metrics are computed as follows: ...
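The formulas are truncated in the excerpt. Accuracy and F1 follow the definitions sketched earlier; assuming the "Kappa score" refers to Cohen's kappa, a minimal sketch is:

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix: observed agreement
    corrected for the agreement expected by chance from the margins."""
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example with a strongly diagonal 5-class matrix.
print(cohens_kappa(np.eye(5) * 40 + 2))
```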
Background:
Polysomnography (PSG) is the gold standard for sleep monitoring and diagnosis, yet it is difficult to use in home environments. This study evaluated the performance of a wearable electroencephalographic (EEG) device, the LANMAO Sleep Recorder, in EEG recording and sleep staging by comparing it with PSG.
Method:
The sleep of 7 Chinese adults was recorded concurrently with PSG and LANMAO devices. First, we validated the consistency of the raw signals using relative spectral power and the Pearson correlation coefficient. Second, we evaluated the performance of the automated sleep staging algorithm integrated in the LANMAO device by comparing it with staging by experts.
Results:
The Pearson correlation coefficient between the relative spectral power of multiple frequency bands during the sleep stages ranged from 0.7613 to 0.8816, with the strongest correlation observed for delta waves (r=0.8816). The overall F1-Score of the automated sleep staging algorithm was 84.03%, with individual F1-Scores for each class as follows: Wake: 93.67%, REM: 87.23%, Light Sleep: 72.10%, and Deep Sleep: 82.82%.
Conclusion:
The results suggest that the EEG recorded by the LANMAO Sleep Recorder is precise and valid, and its automated sleep staging algorithm can perform sleep staging with high accuracy. Therefore, in specific scenarios such as the home environment, LANMAO devices can serve as a promising alternative to PSG for sleep monitoring.
Keywords:
automatic sleep staging; PSG; forehead; EEG; wearable; machine learning
... To evaluate the success of the model, we consider its overall accuracy, recall, and precision (Junker et al., 1999; Powers, 2020). Accuracy is the percentage of patches correctly identified (i.e., as beaver-dammed or not beaver-dammed) in the landscape, recall is the percentage of known beaver dams that were correctly identified (i.e., what percentage of the manually mapped dams did the model find), and precision is the percentage of patches identified as beaver dams that were actually beaver dams (i.e., if the model predicts something is a beaver dam, what percentage of the time is it correct). ...
Beavers are ecosystem engineers that create and maintain riparian wetland ecosystems in a variety of ecologic, climatic, and physical settings. Despite the large-scale implications of ongoing beaver conservation and range expansion, relatively few landscape-scale studies have been conducted, due in part to the significant time required to manually locate beaver dams at scale. To address this need, we developed EEAGER, an image recognition machine learning model that detects beaver complexes in aerial and satellite imagery. We developed the model in the western United States using 13,344 known beaver dam locations and 56,728 nearby locations without beaver dams. Performance was assessed in twelve held-out evaluation polygons of known beaver occupancy but previously unmapped dam locations. These polygons represented regions similar to the training data as well as more novel landscape settings. Our model performed well overall (accuracy = 98.5%, recall = 63.03%, precision = 25.83%) in these areas, with stronger performance in regions similar to where the model had been trained. We favored recall over precision, which results in a more complete catalog of beaver dams found but also a higher incidence of false positives to be manually removed during quality control. These results have far-reaching implications for monitoring of beaver-based river restoration, as well as potential applications in detecting other complex landforms.
... To identify informative reviews, we manually classified a fraction of reviews (see Section V) and used them to train a Naive Bayes classifier (following [18]). This setup resulted in the F1 score (the harmonic mean of precision and recall [35]) of 0.82, calculated as the average of ten 10-fold cross-validation runs. ...
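The described setup (a Naive Bayes classifier trained on manually labeled reviews, scored by the F1 averaged over ten 10-fold cross-validation runs) might look roughly like the following scikit-learn sketch; the toy data and pipeline details are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled reviews: 1 = informative, 0 = uninformative.
texts = (["app crashes when I open settings", "please bring back the old layout"] * 10
         + ["love it", "five stars, great app"] * 10)
labels = [1] * 20 + [0] * 20

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Ten repetitions of 10-fold cross-validation, each with a different shuffle,
# averaging the F1 score as in the excerpt.
scores = [cross_val_score(model, texts, labels,
                          cv=StratifiedKFold(10, shuffle=True, random_state=run),
                          scoring="f1").mean()
          for run in range(10)]
print(f"mean F1 over 10 runs: {np.mean(scores):.2f}")
```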
Evolving software with an increasing number of features is harder to understand and thus harder to use. Software release planning has been concerned with planning these additions. Moreover, software of increasing size takes more effort to maintain. In the domain of mobile apps, too much functionality can easily impact usability, maintainability, and resource consumption. Hence, it is important to understand the extent to which the law of continuous growth applies to mobile apps. Previous work showed that the deletion of functionality is common and sometimes driven by user reviews. However, it is not known whether these deletions are visible or important to the app users. In this study, we performed a survey with 297 mobile app users to understand the significance of functionality deletion for them. Our results showed that for the majority of users, the deletion of features corresponds with negative sentiments, changes in usage, and even churn. Motivated by these preliminary results, we propose RADIATION, which takes user reviews as input and recommends whether any functionality should be deleted from an app's User Interface (UI). We evaluate RADIATION using historical data and by surveying developers' opinions. From the analysis of 190,062 reviews from 115 randomly selected apps, we show that RADIATION can recommend functionality deletion with an average F-Score of 74% when sufficiently many negative user reviews suggest it.
... The performance of the developed model was evaluated using accuracy, recall, precision, and F1-scores [55,56]. The model attained excellent performance for 4-class (COVID-19, COPD, HF, and normal) classification using the db1 database (Table 4), 2-class (COVID-19 and non-COVID) classification with the db2 database (Table 5), and 4-class (COVID-19, viral pneumonia, lung opacity, and healthy) classification with the db3 database (Table 6). kNN and SVM classifiers yielded the best overall accuracy rates for chest CT and CXR images, respectively. ...
COVID‐19, chronic obstructive pulmonary disease (COPD), heart failure (HF), and pneumonia can lead to acute respiratory deterioration. Prompt and accurate diagnosis is crucial for effective clinical management. Chest X‐ray (CXR) and chest computed tomography (CT) are commonly used for confirming the diagnosis, but they can be time‐consuming and biased. To address this, we developed a computationally efficient deep feature engineering model called Hybrid‐Patch‐Alex for automated COVID‐19, COPD, and HF diagnosis. We utilized one CXR dataset and two CT image datasets, including a newly collected dataset with four classes: COVID‐19, COPD, HF, and normal. Our model employed a hybrid patch division method, transfer learning with pre‐trained AlexNet, iterative neighborhood component analysis for feature selection, and three standard classifiers (k‐nearest neighbor, support vector machine, and artificial neural network) for automated classification. The model achieved high accuracy rates of 99.82%, 92.90%, and 97.02% on the respective datasets, using kNN and SVM classifiers.
... These criteria are all obtained from the confusion matrix, and there are other measures as well [46]. Consider that we train the model with n data points but, in the end, report the accuracy as a single number; such a summary cannot reflect every aspect of the model's performance. ...
... Therefore, this makes us focus more on optimizing that evaluation criterion and not on optimizing the accuracy of the model [46]. ...
... After training, the testing dataset, containing new and unseen texts, is used to assess the model's generalization ability by assigning a text type to each text. Since we work only with balanced datasets, i.e., each type has the same number of texts as its counterpart, the accuracy measure (Powers, 2011) is suitable: ...
Our work aims to evaluate the strength of the association between function words and several text types: novels, poems, academic articles, reviews and blog posts, and the accuracy of their classification into these categories. The principal conclusion is that the types of texts are distinguishable based only on the function words, either by vocabulary or by vocabulary diversity. Such findings may impact techniques of authorship attribution based on function words and text clustering techniques, since some function words add information about the text types/genres, in addition to content words.
... For every target user, recommendation methods output an ordered list of k items. Therefore, to evaluate their performance, we apply the following metrics, commonly found in the literature [20,38,41,42,46], where r is the set of relevant items contained in the recommendation list of size k: ...
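The metric list itself is truncated in the excerpt. Precision@k and recall@k are the most common list-based metrics in this literature, and under the excerpt's notation they can be sketched as follows; the exact metric set used by the paper may differ.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k list."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / len(relevant)

# Toy usage with hypothetical item ids.
recs = ["i3", "i7", "i1", "i9", "i4"]
liked = {"i1", "i4", "i8"}
print(precision_at_k(recs, liked, k=5), recall_at_k(recs, liked, k=5))
```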
Traditionally, recommender systems use collaborative filtering or content-based approaches based on ratings and item descriptions. However, this information is unavailable in many domains and applications, and recommender systems can only tackle the problem using information about interactions or implicit knowledge. Within this scenario, this work proposes a novel approach based on link prediction techniques over graph structures that exclusively considers interactions between users and items to provide recommendations. We present and evaluate two alternative recommendation methods, one item-based and one user-based, that apply the edge weight, common neighbours, Jaccard neighbours, Adamic/Adar, and Preferential Attachment link prediction techniques. This approach has two significant advantages, which constitute the novelty of our proposal. First, it is suitable for minimal-knowledge scenarios where explicit data such as ratings or preferences are not available; yet, as our evaluation demonstrates, it outperforms state-of-the-art techniques that use a similar level of interaction knowledge. Second, our approach has another relevant feature regarding one of the most significant concerns in current artificial intelligence research: the recommendation methods presented in this paper are easily interpretable by the users, improving their trust in the recommendations.