On the robustness of the h-index

Jerome K Vanclay

Journal Article: 02/2007;

Abstract

The h-index (Hirsch, 2005) is robust, remaining relatively unaffected by errors in the long tails of the citations-rank distribution, such as typographic errors that short-change frequently-cited papers and create bogus additional records. This robustness, and the ease with which h-indices can be verified, support the use of a Hirsch-type index over alternatives such as the journal impact factor. These merits of the h-index apply to both individuals and to journals.

Source: arXiv

Journal of the American Society for Information Science and Technology, in press
(manuscript JASIST-2006-05-0133.R1, accepted 23 October 2006)

On the robustness of the h-index
Jerome K. Vanclay
School of Environmental Science and Management
Southern Cross University, Lismore NSW 2480, Australia
Tel +61 2 6620 3147, Fax +61 2 6621 2669, JVanclay@scu.edu.au



Introduction
Despite well-recognised flaws (e.g., Jennings 1998; Seglen 1997), the ISI journal impact factor
(JIF, the mean number of citations per paper) continues to have a major influence on scientific
endeavour (Bordons et al 2002; Monastersky 2005). Hirsch (2005) proposed an alternative, the
h-index, that has been shown to be effective (Bornmann & Daniel 2005; Oppenheim 2006) and
consistent with other metrics (Cronin and Meho 2006). Although initially proposed for individual
scientists, others have suggested extensions of the h-index to teams and journals (e.g. Braun et al
2005). However, some of the statistical properties of these metrics have not received sufficient
attention. Hirsch’s h-index avoids several problems with the JIF, including censorship (in the
statistical sense of truncating data contributing to the numerator or denominator; Butler and
Visser 2006), errors (Lange 2002; Gehanno 2005), manipulation (Agrawal 2005; Karandikar and
Sunder 2003; Mannino 2005; Monastersky 2005) and sensitivity to long-tailed distributions
(Redner 1998).

League tables usually show impact factors in neat columns with counts of total citations, total
publications, and inferred impacts. Sadly, these data are not as precise as they may appear
(Garfield 2005; Bensman in press). The total number of citations may be affected by error,
manipulation, and by the selection of journals and articles that contribute to the count. The total
number of publications may also be influenced by censorship (Are editorials included in the
published output of a journal? Is ‘grey literature’ included in the count of an individual’s
output?). Thus both the number of citations and the number of publications are likely to be
approximate and often biased, with the result that the inferred impact factor may include
considerable error. These problems of censorship and manipulation are likely to be greatest in
both tails of the distribution. For instance, some ‘highly-cited’ articles may be mentioned in the
media or other influential avenues not seen by ISI, while conversely, an arbitrary decision to
include (or exclude) contributions in the grey literature may inflate the tally of an individual’s
total output. Hirsch’s h-index avoids many of these issues by ignoring the long tails of the
distribution, and focussing on the ‘middle part’ of the Zipf plot of number of citations versus
ranked paper number (Hirsch 2005; Fig. 1).

[Figure: citations (vertical axis, 1 to 1000) plotted against ranked publications (horizontal axis, 1 to 100), both on logarithmic scales]

Figure 1. Citations accruing to the author’s publications, including self-citations and ‘grey’
publications (conference proceedings, etc). The solid line is a power curve Y = aX^b and the dashed
line indicates h-index 14.

Hirsch’s h-index has two further advantages: it is an integer, so it avoids the false impression of
precision conveyed by the three decimal places in the ISI impact factor, and it is much easier to
verify than most alternatives. If disputed, it may be difficult to reliably verify the total number of
citations or an index based on the mean number of citations per publication (e.g., the JIF).
However, a disputed h-index is easy to check. Most of the publications of a journal
or individual receive many more or many fewer than the n citations contributing to an h-index of n, so
verifying the index involves checking the citations accruing to just a few publications ranked
higher than n (e.g., with n-1 citations). Such checks simply need to establish whether typographic
errors or other factors may have concealed one or two citations associated with these ‘threshold’
publications, allowing the index to rise to n+1 after these anomalies are redressed. The great
majority of errors (and distortions) in citation databases lie in the long tails, and tend not to affect
the h-index greatly. It is a relatively simple matter to check the citations accruing to one or two
publications, in contrast to the challenge of verifying the total number of citations and
publications.
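
As a minimal sketch of this robustness (not part of the original analysis), the following Python fragment computes an h-index and the mean citations per publication for a hypothetical citation record, then repeats the calculation after a simulated typographic error that short-changes the most-cited paper and creates bogus additional records. The function name h_index and all citation counts are illustrative assumptions, not data from this study.

```python
from statistics import mean


def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


# Hypothetical citation counts for one author (made up for illustration).
true_counts = [120, 45, 30, 22, 15, 9, 8, 6, 6, 3, 1, 0]

# Simulated database error: 20 citations to the top paper are mis-keyed.
# 15 are lost entirely and 5 land in two bogus extra records (3 and 2).
recorded_counts = [100] + true_counts[1:] + [3, 2]

print("h-index:", h_index(true_counts), "->", h_index(recorded_counts))
print("mean citations per publication:",
      round(mean(true_counts), 2), "->", round(mean(recorded_counts), 2))
```

In this made-up example the h-index is 7 both before and after the error, whereas the mean citations per publication falls, because citations are lost and two spurious records enter the denominator.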
Approach and Methods
The robustness of the h-index is illustrated with my own publication record. Table 1 shows the
raw data obtained from two service providers (see Bakkalbasi et al 2006 for a comparison of
these and other service providers): from Google Scholar (GS) by searching for ‘author:j-vanclay’
(http://scholar.google.com/scholar?q=author%3Aj-vanclay), and from ISI’s Web of Science
(WoS) by searching for “VANCLAY J*”. A naive interpretation of these raw data (including
self-citations) suggests an h-index of 11 and 12 respectively, or 13 if based on the larger of the
two counts for each publication (Table 1). Both these databases contain some obvious errors (for
instance, 3 entries without author-tags in GS, and typographic errors in WoS that generated
erroneous duplicates not shown in Table 1). Correcting these obvious errors indicated h-indices of
13, 12 and 14 respectively (Table 2).

Table 1. Raw citation data retrieved (on 15 May 2006) from Google Scholar (GS) and ISI Web of Science
(WoS), including self-citations, truncated at rank 20. Emboldened row indicates the h-index for the column
based on Max(GS,WoS).
Rank   GS  WoS  Max  Publication                              Date  Vol  Page
   1  172   96  172  Book: Modelling Forest Growth and Yield  1994
   2   57   50   57  Forest Science                           1995   41     7
   3   53   53   53  Ecological Modelling                     1997   98     1
   4   35   41   41  Forest Ecology and Management            1995   71   267
   5   40    -   40  Report: A Sustainable Forest Future      1999
   6    -   40   40  Forest Ecology and Management            1991   42   143
   7   29   36   36  Forest Ecology and Management            1989   27   245
   8   30   32   32  Forest Ecology and Management            1995   71   251
   9   26   10   26  Forest Ecology and Management            2003  172   229
  10   15   19   19  Journal of Tropical Forest Science       1991    4    59
  11    9   17   17  Forest Ecology and Management            1992   54   257
  12   15   16   16  Forest Science                           1991   37  1656
  13   13   12   13  Ambio                                    1993   22   225
  14    -   13   13  Forest Ecology and Management            2001  150    27
  15   11    6   11  Forest Ecology and Management            1994   69   299
  16   11    8   11  Canadian Journal of Forest Research      1992   22  1235
  17   10    2   10  Agroforestry Forum                       1998    9    47
  18   10    9   10  Forest Ecology and Management            1997   94   149
  19    6    9    9  Photogramm. Eng. and Remote Sensing      1990   56  1383
  20    -    7    7  Forest Ecology and Management            2001  150    79
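
The h-indices quoted for the raw data can be reproduced from the GS and WoS columns of Table 1. The sketch below is illustrative only: the helper names are assumptions introduced here, blank cells are read as missing records (None), and the single count in rows 5, 6, 14 and 20 is assigned to GS or WoS by comparison with Table 2, which is itself an assumption about the table layout.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each.

    Blank table cells (None) are ignored.
    """
    ranked = sorted((c for c in citations if c is not None), reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


def larger(a, b):
    """Larger of two counts, ignoring a blank (None) cell."""
    return max(x for x in (a, b) if x is not None)


# Raw counts from Table 1, in rank order (rows 1 to 20).
gs = [172, 57, 53, 35, 40, None, 29, 30, 26, 15,
      9, 15, 13, None, 11, 11, 10, 10, 6, None]
wos = [96, 50, 53, 41, None, 40, 36, 32, 10, 19,
       17, 16, 12, 13, 6, 8, 2, 9, 9, 7]

# The Max column: larger of the two counts for each publication.
max_counts = [larger(g, w) for g, w in zip(gs, wos)]

print("GS:", h_index(gs))           # 11
print("WoS:", h_index(wos))         # 12
print("Max:", h_index(max_counts))  # 13
```

Because Table 1 is truncated at rank 20 and every omitted publication has fewer citations than those shown, the truncation does not alter any of the three values.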
Table 2 includes corrections for all 20 entries, but most of these corrections have no bearing on
the resulting h-index, and it is normally necessary to effect corrections only to entries ranked
higher (i.e., with fewer citations) than the preliminary h-index. Table 2 assumes that the larger of
the two citation counts is a good approximation of the total, but this may not always be so, and it
is prudent to examine the union of the two sets of citations. This need be done only for a few
cases. Most of the entries in Table 2 already exceed the estimated h-index, and a further increase
in the citation count will have no bearing on the estimate. Similarly, many publications with low
citation counts (Figure 1) are unlikely to reach the h-index. Thus it is prudent to check citations
only for those publications with rank larger than the interim h-index, and for which the sum of the
two citation counts is less than the rank of the publication.

Table 2. Citations accruing to top 20 publications after correcting obvious errors. Corrections shown in
bold. Italics indicate one publication that increased substantially in rank.
      GS            WoS
Rank  Raw  Correct  Raw  Correct  Max  Publication                Date  Vol  Page
   1  172      177   96      142  177  Book: Modelling ...        1994
   2   57       58   50       52   58  Forest Science             1995   41     7
   3   53       53   53       53   53  Ecological Modelling       1997   98     1
   4   35       41   41       42   42  For. Ecol. Manage.         1995   71   267
   5   40       42    -        -   42  Report: A Sustainable ...  1999
   6    -       27   40       40   40  For. Ecol. Manage.         1991   42   143
   7   29       30   36       36   36  For. Ecol. Manage.         1989   27   245
   8   30       30   32       32   32  For. Ecol. Manage.         1995   71   251
   9   26       26   10       10   26  For. Ecol. Manage.         2003  172   229
  10   15       15   19       19   19  J. Trop. For. Sci.         1991    4    59
  11    -       19   13       13   19  For. Ecol. Manage.         2001  150    27
  12    9        9   17       17   17  For. Ecol. Manage.         1992   54   257
  13   15       15   16       16   16  Forest Science             1991   37  1656
  14   13       14   12       13   14  Ambio                      1993   22   225
  15    -       11    7        8   11  For. Ecol. Manage.         2001  150    79
  16   11       11    8        8   11  Can. J. Forest Res.        1992   22  1235
  17   11       11    6        7   11  For. Ecol. Manage.         1994   69   299
  18   10       11    2        2   11  Agroforestry Forum         1998    9    47
  19   10       10    9       10   10  For. Ecol. Manage.         1997   94   149
  20    6        6    9        9    9  Photogramm. Eng. Rem. S.   1990   56  1383
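
To illustrate how few of the corrections in Table 2 bear on the result, the following sketch (an illustration under stated assumptions, not the procedure used to prepare the table) recomputes the h-index from the raw and corrected columns and lists the ranks whose correction crosses the final threshold. Blank cells are read as None, and the function and variable names are hypothetical.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted((c for c in citations if c is not None), reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


def larger(a, b):
    """Larger of two counts, ignoring a blank (None) cell."""
    return max(x for x in (a, b) if x is not None)


# Columns of Table 2, in the rank order of the table (rows 1 to 20).
gs_raw = [172, 57, 53, 35, 40, None, 29, 30, 26, 15,
          None, 9, 15, 13, None, 11, 11, 10, 10, 6]
gs_corr = [177, 58, 53, 41, 42, 27, 30, 30, 26, 15,
           19, 9, 15, 14, 11, 11, 11, 11, 10, 6]
wos_raw = [96, 50, 53, 41, None, 40, 36, 32, 10, 19,
           13, 17, 16, 12, 7, 8, 6, 2, 9, 9]
wos_corr = [142, 52, 53, 42, None, 40, 36, 32, 10, 19,
            13, 17, 16, 13, 8, 8, 7, 2, 10, 9]

raw_max = [larger(g, w) for g, w in zip(gs_raw, wos_raw)]
corr_max = [larger(g, w) for g, w in zip(gs_corr, wos_corr)]
h_raw, h_corr = h_index(raw_max), h_index(corr_max)

# Ranks whose Max value changed at all, and ranks whose correction lifted
# the entry from below the corrected h-index to at least that value.
pairs = list(enumerate(zip(raw_max, corr_max), start=1))
changed = [rank for rank, (r, c) in pairs if r != c]
decisive = [rank for rank, (r, c) in pairs if r < h_corr <= c]

print("corrected h-indices (GS, WoS, Max):",
      h_index(gs_corr), h_index(wos_corr), h_corr)
print("Max h-index, raw -> corrected:", h_raw, "->", h_corr)
print("ranks with a changed Max value:", changed)
print("ranks whose correction crosses the threshold:", decisive)
```

Under this reading of the table, eight of the twenty Max values change, but only the entries at ranks 11 and 14 move from below 14 citations to 14 or more, which is what lifts the index from 13 to 14.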