
The truth of the F-measure
Yutaka Sasaki
Research Fellow
School of Computer Science, University of Manchester
MIB, 131 Princess Street, Manchester, M1 7DN
Yutaka.Sasaki@manchester.ac.uk
October 26, 2007
Abstract
More than 15 years have passed since the F-measure was first introduced to evaluation tasks of information extraction technology at the Fourth Message Understanding Conference (MUC-4) in 1992. Recently, I sometimes see confusion about the definition of the F-measure, which seems to be triggered by a lack of background knowledge about how the F-measure was derived. Since I was not involved in the process of introducing or devising the F-measure, I might not be the best person to explain this, but I hope this note will be of some help to those who are wondering what the F-measure really is. This introduction is devoted to providing brief but sufficient information on the F-measure.
1 Overview
Definition of the F-measure
The F-measure is defined as the harmonic mean of precision (P) and recall (R):¹

$$ F = \frac{2PR}{P + R}. $$

¹ In biomedicine, precision is called positive predictive value (PPV) and recall is called sensitivity, but to my knowledge, there is nothing corresponding to the F-measure in that domain.
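For readers who prefer code, here is a minimal Python sketch of this definition (the function name and the zero-division convention are mine, not part of the original definition):

```python
def f_measure(p: float, r: float) -> float:
    """F-measure: the harmonic mean of precision p and recall r."""
    if p + r == 0.0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * p * r / (p + r)
```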
If you are satisfied with this definition and need no further information,
that’s it. However, if you are deeply interested in the definition of the F-
measure, you should recap the definitions of the arithmetic and harmonic
means.
Arithmetic and harmonic means
The arithmetic mean $A$ (an average in the usual sense) and the harmonic mean $H$ are defined as follows:

$$ A = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \dots + x_n), $$

$$ H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\frac{1}{x_1} + \dots + \frac{1}{x_n}}. $$
When $x_1 = P$ and $x_2 = R$, $A$ and $H$ will be:

$$ A = \frac{1}{2}(P + R), $$

$$ H = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2}{\frac{P + R}{PR}} = \frac{2PR}{P + R}. $$
The harmonic mean is more intuitive than the arithmetic mean when
computing a mean of ratios.
Suppose that you have a fingerprint recognition system whose precision and recall are 1.0 and 0.2, respectively. Intuitively, the overall performance of the system should be judged very low, because the system covers only 20% of the registered fingerprints, which makes it almost useless.
The arithmetic mean of 1 and 0.2 is 0.6, whereas their harmonic mean is

$$ \frac{2 \cdot 1 \cdot \frac{2}{10}}{1 + \frac{2}{10}} = \frac{4}{12} = \frac{1}{3}. $$

As you see in this example, the harmonic mean (0.333...) is a more reasonable score than the arithmetic mean (0.6).
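This contrast is easy to reproduce; a small Python sketch (variable names are mine):

```python
# Arithmetic vs. harmonic mean for precision 1.0 and recall 0.2.
p, r = 1.0, 0.2
arithmetic = (p + r) / 2        # 0.6
harmonic = 2 / (1 / p + 1 / r)  # 0.333... (= 1/3)
print(arithmetic, harmonic)
```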
2 Derivation of the F-measure
Some researchers call the F-measure defined in the previous section the F1-measure. What is the 1 in F1?

The full definition of the F-measure is given as follows [Chinchor, 1992]:

$$ F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} \quad (0 \le \beta \le +\infty). $$

$\beta$ is a parameter that controls the balance between P and R. When $\beta = 1$, $F_1$ becomes equivalent to the harmonic mean of P and R. If $\beta > 1$, F becomes more recall-oriented, and if $\beta < 1$, it becomes more precision-oriented; e.g., $F_0 = P$.
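A small Python sketch of $F_\beta$ and its behaviour at the extremes may help (my own illustration; the function name f_beta is an assumption, not notation from [Chinchor, 1992]):

```python
def f_beta(p: float, r: float, beta: float) -> float:
    """General F-measure with balance parameter beta."""
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

p, r = 0.5, 0.8
print(f_beta(p, r, 1.0))    # 0.615..., the harmonic mean of p and r
print(f_beta(p, r, 0.0))    # 0.5, exactly p
print(f_beta(p, r, 100.0))  # 0.799..., approaches r as beta grows
```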
While it seems that van Rijsbergen did not define the formula of the F-measure per se, the origin of the definition of the F-measure is van Rijsbergen's E (effectiveness) function [van Rijsbergen, 1979]:

$$ E = 1 - \frac{1}{\alpha\frac{1}{P} + (1 - \alpha)\frac{1}{R}}, $$

where $\alpha = \frac{1}{\beta^2 + 1}$.
Let's remove $\alpha$ using $\beta$:

$$ E = 1 - \frac{1}{\frac{1}{\beta^2+1}\cdot\frac{1}{P} + \left(1 - \frac{1}{\beta^2+1}\right)\cdot\frac{1}{R}} = 1 - \frac{PR}{\frac{1}{\beta^2+1}R + \frac{\beta^2+1-1}{\beta^2+1}P} = 1 - \frac{(\beta^2 + 1)PR}{R + \beta^2 P}. $$

Now you see that

$$ E = 1 - F_\beta. $$
Note that $F$ rises if R or P gets better, whereas $E$ becomes smaller if R or P improves. This seems to be the reason why F is more commonly used than E.
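The identity $E = 1 - F_\beta$ can be spot-checked numerically; a sketch reusing the f_beta function from the previous section:

```python
def effectiveness(p: float, r: float, beta: float) -> float:
    """van Rijsbergen's E function with alpha = 1/(beta**2 + 1)."""
    alpha = 1 / (beta ** 2 + 1)
    return 1 - 1 / (alpha / p + (1 - alpha) / r)

p, r, beta = 0.5, 0.8, 2.0
assert abs(effectiveness(p, r, beta) - (1 - f_beta(p, r, beta))) < 1e-12
```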
Some people use $\alpha$ as the parameter of F:

$$ F_\alpha = \frac{1}{\alpha\frac{1}{P} + (1 - \alpha)\frac{1}{R}} \quad (0 \le \alpha \le 1). $$
There is nothing wrong with this definition of F, but using it might cause unnecessary confusion, because $F_{\alpha=0.5} = F_{\beta=1}$. Note that the commonly used notation $F_1$ means $F_{\beta=1}$, not $F_{\alpha=1}$.
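The potential for confusion is easy to demonstrate; a sketch, again reusing f_beta from above (the name f_alpha is mine):

```python
def f_alpha(p: float, r: float, alpha: float) -> float:
    """F parameterised by alpha: a weighted harmonic mean of p and r."""
    return 1 / (alpha / p + (1 - alpha) / r)

p, r = 0.5, 0.8
# F_{alpha=0.5} coincides with F_{beta=1} ...
assert abs(f_alpha(p, r, 0.5) - f_beta(p, r, 1.0)) < 1e-12
# ... whereas F_{alpha=1} is simply p, not the usual F1 score.
print(f_alpha(p, r, 1.0))  # 0.5
```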
3 Further investigation of β

Still, some of you may not be sure why $\beta^2$ is used instead of $\beta$ in $\alpha = \frac{1}{\beta^2+1}$. The best way to understand this is to read Chapter 7 of van Rijsbergen's masterpiece [van Rijsbergen, 1979]. However, let me try to explain the reason.

$\beta$ is the parameter that controls the weighting between P and R. Formally, $\beta$ is defined as follows:

$$ \beta = R/P, \quad \text{where} \quad \frac{\partial E}{\partial P} = \frac{\partial E}{\partial R}. $$

The motivation behind this condition is that, at the point where the gradients of E w.r.t. P and R are equal, the ratio of R to P should be the desired ratio $\beta$.
Please recall that E is defined as follows:

$$ E = 1 - \frac{1}{\alpha\frac{1}{P} + (1 - \alpha)\frac{1}{R}} = 1 - \frac{PR}{\alpha R + (1 - \alpha)P}. $$
Now we calculate $\frac{\partial E}{\partial P}$ and $\frac{\partial E}{\partial R}$. By the quotient rule for derivatives, $(f/g)' = (f'g - fg')/g^2$. For conciseness, let $g = \alpha R + (1 - \alpha)P$:

$$ \frac{\partial E}{\partial P} = -\frac{R(\alpha R + (1 - \alpha)P) - PR(1 - \alpha)}{g^2}, $$

$$ \frac{\partial E}{\partial R} = -\frac{P(\alpha R + (1 - \alpha)P) - PR\alpha}{g^2}. $$

Then $\frac{\partial E}{\partial P} = \frac{\partial E}{\partial R}$ is equivalent to:

$$ R(\alpha R + (1 - \alpha)P) - PR(1 - \alpha) = P(\alpha R + (1 - \alpha)P) - PR\alpha, $$
which can be simplified to:

$$ \alpha R^2 = (1 - \alpha)P^2. $$

As $\beta = R/P$, we can replace $R$ with $\beta P$:²

$$ \alpha\beta^2 P^2 = (1 - \alpha)P^2, $$
$$ \alpha\beta^2 = 1 - \alpha, $$
$$ \alpha(\beta^2 + 1) = 1, $$
$$ \alpha = \frac{1}{\beta^2 + 1}. \qquad (1) $$

² In van Rijsbergen's book, $\beta = P/R$, but I believe this is a typing error.
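If you want to verify this derivation mechanically, here is a sketch using the sympy library (my own check, not part of the original note):

```python
import sympy as sp

P, R, alpha, beta = sp.symbols("P R alpha beta", positive=True)

# van Rijsbergen's E with a generic weight alpha.
E = 1 - P * R / (alpha * R + (1 - alpha) * P)

# dE/dP = dE/dR simplifies to alpha*R**2 = (1 - alpha)*P**2; substituting
# R = beta*P and solving for alpha should recover equation (1).
condition = sp.Eq(alpha * R ** 2, (1 - alpha) * P ** 2).subs(R, beta * P)
print(sp.solve(condition, alpha))  # expect [1/(beta**2 + 1)]

# Sanity check: at R = beta*P and alpha = 1/(beta**2 + 1), the two partial
# derivatives of E coincide.
opt = {R: beta * P, alpha: 1 / (beta ** 2 + 1)}
assert sp.simplify(sp.diff(E, P).subs(opt) - sp.diff(E, R).subs(opt)) == 0
```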
4 End Note
There is one thing that remains unsolved: why the F-measure is called F. A personal communication with David D. Lewis several years ago revealed that when the F-measure was introduced at MUC-4, the name was accidentally selected as a consequence of regarding a different F function in van Rijsbergen's book as the definition of the "F-measure".
Finally, if you have any comments, please contact me by email.
References
[Chinchor, 1992] Nancy Chinchor, MUC-4 Evaluation Metrics, in Proc. of the Fourth Message Understanding Conference, pp. 22–29, 1992. http://www.aclweb.org/anthology-new/M/M92/M92-1002.pdf

[van Rijsbergen, 1979] C. J. van Rijsbergen, Information Retrieval, London: Butterworths, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.html