© 2007 Y. Sasaki, Version: 26th October, 2007
The truth of the F-measure
Yutaka Sasaki
Research Fellow
School of Computer Science, University of Manchester
MIB, 131 Princess Street, Manchester, M1 7DN
Yutaka.Sasaki@manchester.ac.uk
October 26, 2007
Abstract
More than 15 years have passed since the F-measure was first
introduced to the evaluation of information extraction technology
at the Fourth Message Understanding Conference (MUC-4) in 1992.
Recently, I have sometimes seen confusion about the definition of the
F-measure, which seems to stem from a lack of background knowledge
about how the F-measure was derived. Since I was not involved in the
introduction or devising of the F-measure, I might not be the best
person to explain this, but I hope this note will be of some help to
those who are wondering what the F-measure really is. This
introduction is devoted to providing brief but sufficient information on
the F-measure.
1 Overview
Definition of the F-measure
The F-measure is defined as a harmonic mean of precision (P) and recall
(R):[1]
\[ F = \frac{2PR}{P + R}. \]
[1] In biomedicine, precision is called positive predictive value (PPV) and recall is called
sensitivity, but to my knowledge, there is nothing corresponding to the F-measure in that
domain.
If you are satisfied with this definition and need no further information,
that's it. However, if you are deeply interested in the definition of the
F-measure, you should first review the definitions of the arithmetic and
harmonic means.
Arithmetic and harmonic means
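The arithmetic mean A (an average in the usual sense) and the harmonic mean
H are defined as follows.
\[ A = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \dots + x_n). \]
\[ H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\frac{1}{x_1} + \dots + \frac{1}{x_n}}. \]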
When x_1 = P and x_2 = R, A and H will be:
\[ A = \frac{1}{2}(P + R). \]
\[ H = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2}{\frac{P + R}{PR}} = \frac{2PR}{P + R}. \]
The harmonic mean is more intuitive than the arithmetic mean when
computing a mean of ratios.
Suppose that you have a fingerprint recognition system whose precision
and recall are 1.0 and 0.2, respectively. Intuitively, the overall performance of
the system should be rated very low, because the system covers only 20% of the
registered fingerprints, which means it is almost useless.
The arithmetic mean of 1 and 0.2 is 0.6, whereas their harmonic mean is
\[ \frac{2 \cdot 1 \cdot \frac{2}{10}}{1 + \frac{2}{10}} = \frac{4}{12} = \frac{1}{3}. \]
As you see in this example, the harmonic mean (0.333...) is a more
reasonable score than the arithmetic mean (0.6).
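The fingerprint example can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names arithmetic_mean and harmonic_mean are chosen here for illustration and do not come from any particular library.

```python
def arithmetic_mean(values):
    """Arithmetic mean A = (x1 + ... + xn) / n."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """Harmonic mean H = n / (1/x1 + ... + 1/xn); every value must be positive."""
    return len(values) / sum(1.0 / x for x in values)


# Precision and recall of the fingerprint example in the text.
precision, recall = 1.0, 0.2

a = arithmetic_mean([precision, recall])
h = harmonic_mean([precision, recall])
print(f"arithmetic mean = {a:.3f}, harmonic mean = {h:.3f}")
```

The harmonic mean is pulled towards the smaller of the two values, which is why the low recall dominates the score.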
2 Derivation of the F-measure
Some researchers call the F-measure defined in the previous section the
F1-measure. What does the 1 in F1 stand for?
The full definition of the F-measure is given as follows [Chinchor, 1992]:
\[ F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \quad (0 \le \beta \le +\infty). \]
β is a parameter that controls the balance between P and R. When β = 1,
F_1 is equivalent to the harmonic mean of P and R. If β > 1,
F becomes more recall-oriented, and if β < 1, it becomes more
precision-oriented, e.g., F_0 = P.
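To see the effect of β concretely, here is a small sketch in Python. The function name f_beta, the sample values P = 0.8 and R = 0.4, and the convention of returning 0 when the denominator is zero are all assumptions made for this illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        # Assumed convention for this sketch: score 0 when the formula is undefined.
        return 0.0
    return (beta ** 2 + 1.0) * precision * recall / denominator


p, r = 0.8, 0.4  # illustrative values with P > R

f0 = f_beta(p, r, beta=0.0)      # beta = 0 reduces to P
f_half = f_beta(p, r, beta=0.5)  # precision-oriented
f1 = f_beta(p, r, beta=1.0)      # harmonic mean of P and R
f2 = f_beta(p, r, beta=2.0)      # recall-oriented
f_large = f_beta(p, r, beta=1e6) # very large beta approaches R

print(f0, f_half, f1, f2, f_large)
```

Because P is larger than R in this example, the precision-oriented score lies above F_1 and the recall-oriented score lies below it.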
While it seems that van Rijsbergen did not define the formula of the
F-measure per se, the origin of the definition of the F-measure is van
Rijsbergen's E (effectiveness) function [van Rijsbergen, 1979]:
\[ E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}, \]
where α = 1/(β² + 1).
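Let's eliminate α by expressing it in terms of β.
\[ E = 1 - \frac{1}{\frac{1}{\beta^2 + 1} \frac{1}{P} + \left(1 - \frac{1}{\beta^2 + 1}\right) \frac{1}{R}} \]
\[ = 1 - \frac{PR}{\frac{1}{\beta^2 + 1} R + \frac{\beta^2 + 1 - 1}{\beta^2 + 1} P} \]
\[ = 1 - \frac{(\beta^2 + 1) P R}{R + \beta^2 P}. \]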
Now you see that
\[ E = 1 - F_\beta. \]
Note that F rises if R or P gets better, whereas E becomes smaller if R or
P improves. This seems to be the reason why F is more commonly used than E.
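The identity E = 1 − F_β can also be checked numerically. The following sketch assumes the names e_measure and f_beta and the sample values P = 0.8 and R = 0.4, none of which appear in the original sources; it is a sanity check, not a proof.

```python
def f_beta(precision, recall, beta):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    return (beta ** 2 + 1.0) * precision * recall / (beta ** 2 * precision + recall)


def e_measure(precision, recall, alpha):
    """van Rijsbergen's E = 1 - 1 / (alpha / P + (1 - alpha) / R)."""
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)


p, r = 0.8, 0.4  # illustrative values; both must be non-zero
gaps = []
for beta in (0.5, 1.0, 2.0):
    alpha = 1.0 / (beta ** 2 + 1.0)  # the substitution used in the derivation
    e = e_measure(p, r, alpha)
    f = f_beta(p, r, beta)
    gaps.append(abs(e - (1.0 - f)))
    print(f"beta={beta}: E={e:.4f}, 1-F={1.0 - f:.4f}")
```

For each β tried, the two columns should agree up to floating-point rounding.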
Some people use α as a parameter of F.
\[ F_\alpha = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \quad (0 \le \alpha \le 1). \]
There is nothing wrong with this definition of F, but using it
might cause unnecessary confusion because F_{α=0.5} = F_{β=1}. Note
that the commonly used notation F_1 means F_{β=1}, not F_{α=1}, which is simply P.
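The difference between the two parameterizations is easy to demonstrate. In this sketch the function name f_alpha and the sample values are assumptions made for illustration.

```python
def f_alpha(precision, recall, alpha):
    """F_alpha = 1 / (alpha / P + (1 - alpha) / R), with 0 <= alpha <= 1."""
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


p, r = 0.8, 0.4  # illustrative values; both must be non-zero

harmonic = 2.0 * p * r / (p + r)   # F with beta = 1
f_a_half = f_alpha(p, r, 0.5)      # alpha = 0.5 corresponds to beta = 1
f_a_one = f_alpha(p, r, 1.0)       # alpha = 1 ignores recall entirely

print(f"F(beta=1)    = {harmonic:.4f}")
print(f"F(alpha=0.5) = {f_a_half:.4f}")
print(f"F(alpha=1)   = {f_a_one:.4f}")
```

Reading "F1" as F_{α=1} would therefore report the precision alone rather than the harmonic mean.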
3 Further investigation into β
Still, some of you may not be sure why β² is used instead of β in
α = 1/(β² + 1).
The best way to understand this is to read Chapter 7 of van Rijsbergen's
masterpiece [van Rijsbergen, 1979]. However, let me try to explain the
reason.
β is the parameter that controls the weighting between P and R.
Formally, β is defined as follows:
\[ \beta = R/P, \quad \text{where} \quad \frac{\partial E}{\partial P} = \frac{\partial E}{\partial R}. \]
The motivation behind this condition is that at the point where the
gradients of E w.r.t. P and R are equal, the ratio of R to P should be
the desired ratio β.
Please recall that E is defined as follows:
\[ E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = 1 - \frac{PR}{\alpha R + (1 - \alpha) P}. \]
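Now we calculate ∂E/∂P and ∂E/∂R. By the quotient rule for the derivative
of a quotient of two functions, (f/g)′ = (f′g − fg′)/g². For conciseness, let
g = αR + (1 − α)P.
\[ \frac{\partial E}{\partial P} = -\frac{R(\alpha R + (1 - \alpha) P) - PR(1 - \alpha)}{g^2}. \]
\[ \frac{\partial E}{\partial R} = -\frac{P(\alpha R + (1 - \alpha) P) - PR\alpha}{g^2}. \]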
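Then, ∂E/∂P = ∂E/∂R is equivalent to:
\[ R(\alpha R + (1 - \alpha) P) - PR(1 - \alpha) = P(\alpha R + (1 - \alpha) P) - PR\alpha, \]
which, since the left-hand side expands to αR² and the right-hand side to
(1 − α)P², can be simplified to:
\[ \alpha R^2 = (1 - \alpha) P^2. \]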
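As β = R/P, we can replace R with βP.[2]
\[ \alpha \beta^2 P^2 = (1 - \alpha) P^2. \]
\[ \Rightarrow \alpha \beta^2 = 1 - \alpha. \]
\[ \Rightarrow \alpha (\beta^2 + 1) = 1. \]
\[ \Rightarrow \alpha = \frac{1}{\beta^2 + 1}. \tag{1} \]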
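Equation (1) can be sanity-checked numerically: with α = 1/(β² + 1), the two partial derivatives of E should coincide at any point where R = βP. The sketch below approximates the derivatives by central differences; the function names, the step size h, and the sample values β = 2 and P = 0.3 are assumptions made for this illustration.

```python
def e_measure(p, r, alpha):
    """E = 1 - P * R / (alpha * R + (1 - alpha) * P)."""
    return 1.0 - (p * r) / (alpha * r + (1.0 - alpha) * p)


def partials(p, r, alpha, h=1e-6):
    """Central-difference approximations of dE/dP and dE/dR."""
    d_p = (e_measure(p + h, r, alpha) - e_measure(p - h, r, alpha)) / (2.0 * h)
    d_r = (e_measure(p, r + h, alpha) - e_measure(p, r - h, alpha)) / (2.0 * h)
    return d_p, d_r


beta = 2.0
alpha = 1.0 / (beta ** 2 + 1.0)  # Equation (1)
p = 0.3
r = beta * p                     # a point on the line R = beta * P

dE_dP, dE_dR = partials(p, r, alpha)

# Closed form from the derivation: dE/dP = -alpha * R^2 / g^2.
g = alpha * r + (1.0 - alpha) * p
analytic = -alpha * r ** 2 / g ** 2

print(dE_dP, dE_dR, analytic)
```

Choosing α = 1/(β + 1) instead would, in general, make the two derivatives differ at R = βP, which is one way to see why β² appears.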
4 End Note
There is one thing that remains unexplained, which is why the F-measure is
called F. A personal communication with David D. Lewis several years ago
revealed that when the F-measure was introduced to MUC-4, the name was
chosen by accident, as a consequence of mistaking a different F function
in van Rijsbergen's book for the definition of the "F-measure".
Finally, if you have any comments, please contact me by email.
References
[Chinchor, 1992] Nancy Chinchor, MUC-4 Evaluation Metrics, in Proc.
of the Fourth Message Understanding Conference, pp. 22–29, 1992.
http://www.aclweb.org/anthology-new/M/M92/M92-1002.pdf
[van Rijsbergen, 1979] C. J. van Rijsbergen, Information Retrieval, London:
Butterworths, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.html
[2] In van Rijsbergen's book, β = P/R, but I believe this is a typing error.