Why you should use chance-correct measures rather than F-measure. Why F1-score is never a good idea.

Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without clear understanding of the biases, and corresponding identification of chance or base-case levels of the statistic. Using these measures, a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance (Informedness), and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.

Over the last decade there has been increasing concern about the biases embodied in traditional evaluation methods for Natural Language Processing/Learning, particularly methods borrowed from Information Retrieval. Without knowledge of the Bias and Prevalence of the contingency being tested, or equivalently the expectation due to chance, the simple conditional probabilities Recall, Precision and Accuracy are not meaningful as evaluation measures, either individually or in combinations such as F-factor.
The existence of bias in NLP measures leads to the ‘improvement’ of systems by increasing their bias, such as the practice of improving tagging and parsing scores by always assigning the most common value (e.g. treating water as always a Noun) rather than attempting to discover the correct one. In this paper, we analyze both biased and unbiased measures theoretically, characterizing the precise relationships between all these measures.
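The most-common-value pathology can be made concrete with a short sketch (all counts below are invented for illustration): an always-Noun tagger on a noun-heavy test set earns near-perfect Recall and a high F1, yet its chance-corrected Informedness is exactly zero.

```python
# Hypothetical counts: a tagger that always outputs the majority tag "Noun"
# on a test set of 100 tokens, 80 of which really are nouns.
tp, fn = 80, 0   # all true nouns tagged Noun
fp, tn = 20, 0   # all non-nouns also tagged Noun

recall = tp / (tp + fn)                              # 1.0
precision = tp / (tp + fp)                           # 0.8 (= Prevalence)
f1 = 2 * precision * recall / (precision + recall)   # ~0.889

# Chance-corrected view: Informedness = Recall + Inverse Recall - 1
inverse_recall = tn / (tn + fp)                      # 0.0
informedness = recall + inverse_recall - 1           # 0.0: no better than chance

print(f1, informedness)
```

The biased tagger looks strong under F1 precisely because F1 never consults the True Negative cell, while Informedness exposes that no actual discrimination took place.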


Receiver Operating Characteristics (ROC) analysis has increasingly been advocated as a mechanism for evaluating classifiers, particularly when the precise conditions and costs of deployment are not known. Area Under the Curve (AUC) is then used as a single figure for comparing how good two methods or algorithms are. Additional support for ROC AUC is cited in its equivalence to the non-parametric Wilcoxon rank-sum test, but we show that this is in general misleading, and that use of AUC implicitly makes theoretical assumptions that are not well met in practice. This paper advocates two ROC-related measures that separate out two specific types of goodness that are wrapped up in ROC AUC, which we call Consistency (Con) and Certainty (Cert). We treat primarily the dichotomous two-class case, but also discuss the generalization to multiple classes.
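The rank-statistic reading of AUC can be sketched in a few lines (scores invented for illustration): AUC equals the fraction of positive–negative pairs in which the positive is scored higher, with ties counting half, which is the Mann-Whitney U statistic rescaled.

```python
# Hypothetical classifier scores for positive and negative examples.
pos = [0.9, 0.8, 0.55, 0.4]
neg = [0.7, 0.5, 0.3, 0.2]

# Count pairwise "wins" of positives over negatives; ties count 0.5.
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))   # 13/16 = 0.8125 for these scores
print(auc)
```

This is exactly why AUC inherits the assumptions of a rank test: it sees only the ordering of scores, not their calibration or the costs of deployment.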

Receiver Operating Characteristics (ROC) curves have recently received considerable attention as mechanisms for comparing different algorithms or experimental results. However, the common approach of comparing Area Under the Curve has come in for some criticism, with alternatives such as Area Under Kappa and the H-measure being proposed as replacement measures. These measures have their own idiosyncrasies, however, and neglect certain advantages of ROC analysis that do not carry over to the proposed approaches. In addition, they suffer from the general desire for a one-size-fits-all measure as opposed to a Pareto approach: that is, we want a single number to optimize rather than a suite of numbers to trade off. The starting point for all of this research is the inadequacy and bias of traditional measures such as Accuracy, Recall, Precision and F-measure; these should never be used singly or as a group, and ROC analysis is a very good alternative to them if used correctly, treated as a graph rather than boiled down to a single number. Other measures that seek to remove the bias in traditional measures include Krippendorff's Alpha, Scott's Pi, and Powers' Informedness and Markedness, as well as a great many variant Kappa statistics. The original Kappa statistics were intrinsically dichotomous, but the family has been well generalized to allow for multiple classes and multiple sources of labels. We discuss the proper and improper use of ROC curves, the issues with AUC and Kappa, and make a recommendation as to the approach to take in comparing experimental algorithms or other kinds of tests.
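The chance-correction idea shared by the Kappa family can be illustrated with Cohen's Kappa on a 2×2 contingency table (counts below are illustrative only): observed agreement is discounted by the agreement expected if prediction and truth were independent.

```python
# Hypothetical 2x2 contingency table.
tp, fp, fn, tn = 40, 10, 20, 30
n = tp + fp + fn + tn

accuracy = (tp + tn) / n                       # observed agreement, 0.7
# Expected agreement under independence of prediction and truth:
p_yes = ((tp + fp) / n) * ((tp + fn) / n)      # both say "yes" by chance
p_no = ((fn + tn) / n) * ((fp + tn) / n)       # both say "no" by chance
expected = p_yes + p_no                        # 0.5 here

kappa = (accuracy - expected) / (1 - expected) # 0.4 for these counts
print(kappa)
```

Accuracy of 0.7 shrinks to a Kappa of 0.4 once the chance level is removed, which is the core move all the chance-corrected alternatives share.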

It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under the Curve and variants of Kappa, has been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but leave Powers Kappa unchanged. For most performance evaluation purposes the latter is thus most appropriate, whilst for comparison of behaviour Matthews Correlation is recommended.
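For the dichotomous case, Matthews Correlation is exactly the geometric mean of Informedness and Markedness, which a few lines verify (the contingency counts are invented for illustration):

```python
import math

# Hypothetical 2x2 contingency table.
tp, fp, fn, tn = 40, 10, 20, 30

informedness = tp / (tp + fn) - fp / (fp + tn)   # Recall + Inv. Recall - 1
markedness = tp / (tp + fp) - fn / (fn + tn)     # Precision + Inv. Precision - 1

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# MCC is the geometric mean of the two chance-corrected measures.
assert abs(mcc - math.sqrt(informedness * markedness)) < 1e-12
print(informedness, markedness, mcc)
```

This identity is why Matthews Correlation serves as a single summary when both directions of prediction matter, while Informedness and Markedness remain available separately when they need to be traded off.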

Cross-Validation (CV) is the primary mechanism used in Machine Learning to control generalization error in the absence of sufficiently large quantities of marked-up (tagged or labelled) data to undertake independent testing, training and validation (including early stopping, feature selection, parameter tuning, boosting and/or fusion). Repeated Cross-Validation (RCV) is used to try to further improve the accuracy of our performance estimates, including compensating for outliers. Typically a Machine Learning researcher will then compare a new target algorithm against a wide range of competing algorithms on a wide range of standard datasets. The combination of many training folds, many CV repetitions, many algorithms and parameterizations, and many training sets adds up to a very large number of data points to compare, and a massive multiple-testing problem quadratic in the number of individual test combinations. Research in Machine Learning sometimes involves basic significance testing, or provides confidence intervals, but seldom addresses the multiple-testing problem whereby the assumption of p
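The scale of the problem, and one standard remedy, can be sketched as follows (the significance level, number of tests and p-values are all invented for illustration): with many pairwise comparisons the family-wise false-positive rate balloons, and a Holm–Bonferroni step-down correction reins it in.

```python
# Hypothetical setting: 10 algorithms compared pairwise -> 45 tests at alpha.
alpha, m = 0.05, 45
family_wise = 1 - (1 - alpha) ** m   # P(at least one spurious "win"), ~0.9
print(family_wise)

# Holm-Bonferroni step-down on the four smallest p-values (the remaining
# 41 are assumed larger): compare the i-th smallest against alpha / (m - i).
pvals = [0.0004, 0.009, 0.02, 0.04]
rejected = []
for i, p in enumerate(sorted(pvals)):
    if p > alpha / (m - i):
        break                        # stop at the first non-significant test
    rejected.append(p)
print(rejected)
```

Under naive testing all four p-values look significant at 0.05; after correction only the strongest result survives, which is the point of the multiple-testing critique above.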

In the evaluation of models, theories, information retrieval systems, learning systems and neural networks we must deal with the ubiquitous contingency matrix of decisions versus events. In general this is manifested as the result matrix for a series of experiments aimed at predicting or labeling a series of events. The classical evaluation techniques come from Information Retrieval, using Recall and Precision as measures. These are now applied well beyond this field, but unfortunately they have fundamental flaws, are frequently abused, and can prefer substandard models. This paper proposes a well-principled evaluation technique that better takes into account the negative effect of an incorrect result, and is directly quantifiable as the probability that an informed decision was made rather than a random guess. It furthermore generalizes a number of well-known dichotomous techniques including Kappa, ROC AUC, Youden, Gini and DeltaP, as discussed in more detail in follow-on papers.
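Some of the claimed equivalences can be checked directly at a single operating point (counts invented for illustration): Informedness coincides with Youden's J, and with the Gini form 2·AUC − 1 of a one-threshold ROC "curve".

```python
# Hypothetical 2x2 contingency table.
tp, fp, fn, tn = 40, 10, 20, 30

tpr = tp / (tp + fn)                      # Sensitivity / Recall
fpr = fp / (fp + tn)                      # Fallout
informedness = tpr - fpr                  # = Youden's J (= DeltaP')

single_point_auc = (tpr + (1 - fpr)) / 2  # ROC with one threshold
gini = 2 * single_point_auc - 1

assert abs(informedness - gini) < 1e-12   # the dichotomous measures coincide
print(informedness)
```

At a single threshold these statistics are arithmetic rearrangements of one another; they only diverge once a full score distribution or multiple classes enter the picture.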

The F-measure or F-score is one of the most commonly used “single number” measures in Information Retrieval, Natural Language Processing and Machine Learning, but it is based on a mistake, and the flawed assumptions render it unsuitable for use in most contexts! Fortunately, there are better alternatives…
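One flaw is easy to exhibit in code (counts invented for illustration): F1 is completely blind to True Negatives, so two systems with identical tp/fp/fn but very different tn score identically, while a chance-corrected measure separates them.

```python
def f1(tp, fp, fn, tn):
    """F1 as harmonic mean of Precision and Recall; note tn never appears."""
    return 2 * tp / (2 * tp + fp + fn)

def informedness(tp, fp, fn, tn):
    """Chance-corrected: TPR - FPR, which does consult tn."""
    return tp / (tp + fn) - fp / (fp + tn)

easy = (10, 10, 10, 10)     # negatives are half the data
hard = (10, 10, 10, 1000)   # negatives dominate

print(f1(*easy), f1(*hard))                      # identical: 0.5 and 0.5
print(informedness(*easy), informedness(*hard))  # 0.0 vs ~0.49
```

The first system is guessing at chance level and the second is genuinely discriminating, yet F1 cannot tell them apart; this is the structural mistake the alternatives correct.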

Evaluation often aims to reduce the correctness or error characteristics of a system down to a single number, but that always involves trade-offs. Another way of dealing with this is to quote two numbers, such as Recall and Precision, or Sensitivity and Specificity. But it can also be useful to see more than this, and a graphical approach can explore sensitivity to cost, prevalence, bias, noise, parameters and hyper-parameters. Moreover, most techniques are implicitly based on two balanced classes, and our ability to visualize graphically is intrinsically two-dimensional, yet we often want to visualize in a multiclass context. We review the dichotomous approaches relating to Precision, Recall and ROC, as well as the related LIFT chart, exploring how they handle unbalanced and multiclass data, and deriving new probabilistic and information-theoretic variants of LIFT that help deal with the issues associated with handling multiple and unbalanced classes.
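The basic LIFT chart behind these variants can be sketched briefly (scores and labels invented for illustration): rank items by classifier score, then report how much richer in positives each top-k slice is than the overall prevalence.

```python
# Hypothetical (score, label) pairs; label 1 marks a true positive.
scored = [(0.95, 1), (0.9, 1), (0.8, 0), (0.7, 1), (0.6, 0),
          (0.5, 0), (0.4, 1), (0.3, 0), (0.2, 0), (0.1, 0)]
scored.sort(key=lambda t: t[0], reverse=True)   # rank by score, best first

labels = [y for _, y in scored]
prevalence = sum(labels) / len(labels)          # 0.4 here

lift = []
hits = 0
for k, y in enumerate(labels, start=1):
    hits += y
    lift.append((hits / k) / prevalence)        # precision@k over prevalence

print(lift)   # decays toward 1.0 over the full ranking
```

With these data the curve starts at 2.5 (the top-ranked item is a positive, so precision@1 is 1 against a prevalence of 0.4) and necessarily ends at 1.0, and it is this curve's sensitivity to class imbalance that the probabilistic and information-theoretic variants address.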