Journal of the American Society for Information Science and Technology

Published by Wiley

Online ISSN: 1532-2890


Print ISSN: 1532-2882


Figure 1: Citation distributions differ by level of aggregation, as well as time period. Citations for all papers in these sets are tabulated at the end of 2006. The left panel is on double logarithmic scales, and the right panel is the same data on linear scale. A, Distribution of the number of citations to all scientific articles indexed in the Web of Science between 1955 and 2006. B, Distribution of citations to all scientific articles published in calendar year 2000. Note that the tail decays faster, for high impact papers have not yet had enough time to accumulate citations. C, Distribution of citations to all scientific articles published in year 2000 in the field of “Hematology.” Note that the median number of citations is significantly (p < 0.001) higher in hematology than the median number of citations overall. D, Distribution of citations to all scientific articles published in year 2000 in the journal Circulation (which is classified in the field of “Hematology”). The median number of citations to papers published in Circulation is significantly higher (p < 0.001) than the median number of citations to papers in hematology. At the aggregation level of D, we find that almost all of the data is consistent with a discrete lognormal distribution. Thus, the global distribution A is likely a mixture of discrete lognormal distributions.
Figure 2: Goodness of fit of latent normal model to journal citation distribution data. A, Comparison of the model to data for articles published in the steady state period (1970–1998) for the journal Circulation. We cannot reject hypothesis H1 (p1=0.4). B, Plot of residuals against the independent variable, n, for the journal Circulation. For journals where hypothesis H1 cannot be rejected (p > 0.05), the residuals are uncorrelated with the number of citations. C, Comparison of the model to data for articles published in the steady state period (1996–1998) of the journal Science. In this case, p1 is near zero, indicating that we can reject hypothesis H1 with high confidence. D, Plot of residuals against the independent variable, n, for the journal Science. For journals where hypothesis H1 is rejected (p < 0.05), the residuals are correlated. In this particular case, the model under-predicts the number of uncited articles. Whereas for the “true” model of all journal citation distributions, we would expect 5% (109) of the journals to yield p < 0.05, we observe that 10% (229) of the journals yield p < 0.05. For the purpose of comparison, the dashed lines in A and C indicate best fit normal distributions.
Figure 3: Hypothesis H1 tends to be rejected more often for high volume journals. We divide journals in our analysis by quintile according to two attributes: number of papers per annum, Npa, and the best fit mean of the discrete lognormal model, μDLN. Most (70%) of the journals for which hypothesis H1 is rejected are in the top quintile of Npa. This is to be expected to some extent, since the test has more statistical power to detect even small deviations from the model for larger N (see Appendix). However, among the journals in the top quintile of Npa, we find that 57% of the journals for which hypothesis H1 is rejected are in the top quintile of μDLN. That is, large, highly-cited journals are the most likely to fail the test.
Figure 4: Hypothesis H3 is rejected for seven journals. In the set of journals for which hypothesis H1 is rejected at αFDR, some tests fail because the parameters of the discrete lognormal distribution actually vary slightly in time. Panel A shows the mean of the discrete lognormal distribution as a function of time for The Astrophysical Journal (Ap. J.). The error bars are intended to show the “width” of the distribution, or the standard deviation σDLN, as opposed to the estimated error. For The Astrophysical Journal, none of the individual years are inconsistent with the discrete lognormal model, ∀y∈{1958, 1959, … 1988}: p > 0.05. However, when the data from all years in the steady state period (shaded) are aggregated, p is low enough for hypothesis H1 to be rejected with high confidence. In Panel B, we see that for Journal of the American Chemical Society (JACS) there are 4 years out of 20 for which p < 0.05. This number of rejections is sufficient to reject hypothesis H3 at p3 < 0.001. Thus, the time varying mean is not sufficient to explain the deviations from the model expectations for JACS. For purposes of estimating the ultimate number of citations that papers will receive, the heuristic method for determining a “steady state” is adequate. However, when the number of papers is large enough for the test to be very sensitive, we see that the distribution is actually a mixture of discrete lognormal distributions with a time-varying mean.
Figure 5: Figure A1. Power analysis of our testing procedure. We draw data from a mixture of two normal distributions with different means but identical variance. 25% (A), 50% (B), and 75% (C) of the points in these samples were drawn from the normal with the larger mean. Nsample is the number of data points in each sample considered and Δμ is the difference in the means of the distributions. The power of the testing procedure depends on both the specific nature of the deviation from the null hypothesis and the number of points in the dataset. It is clear that a mixture of lognormal distributions separated by only one standard deviation is essentially impossible to detect.
Statistical Validation of a Global Model for the Distribution of the Ultimate Number of Citations Accrued by Papers Published in a Scientific Journal

July 2010


156 Reads



A central issue in evaluative bibliometrics is the characterization of the citation distribution of papers in the scientific literature. Here, we perform a large-scale empirical analysis of journals from every field in Thomson Reuters' Web of Science database. We find that only 30 of the 2,184 journals have citation distributions that are inconsistent with a discrete lognormal distribution at the rejection threshold that controls the False Discovery Rate at 0.05. We find that large, multidisciplinary journals are over-represented in this set of 30 journals, leading us to conclude that, within a discipline, citation distributions are lognormal. Our results strongly suggest that the discrete lognormal distribution is a globally accurate model for the distribution of "eventual impact" of scientific papers published in a single-discipline journal in a single year that is sufficiently removed from the present date.

Adapting Semantic Natural Language Processing Technology to Address Information Overload in Influenza Epidemic Management

December 2010


76 Reads


Graciela Rosemblat





The explosion of disaster health information results in information overload among response professionals. The objective of this project was to determine the feasibility of applying semantic natural language processing (NLP) technology to address this overload. The project characterizes concepts and relationships commonly used in disaster health-related documents on influenza pandemics, as the basis for adapting an existing semantic summarizer to the domain. Methods include human review and semantic NLP analysis of a set of relevant documents. This is followed by a pilot test in which two information specialists use the adapted application for a realistic information-seeking task. According to the results, the ontology of influenza epidemic management can be described via a manageable number of semantic relationships that involve concepts from a limited number of semantic types. Test users demonstrate several ways to engage with the application to obtain useful information. This suggests that existing semantic NLP algorithms can be adapted to support information summarization and visualization in influenza epidemics and other disaster health areas. However, additional research is needed in the areas of terminology development (as many relevant relationships and terms are not part of existing standardized vocabularies), NLP, and user interface design.

Comparing a Rule-Based Versus Statistical System for Automatic Categorization of MEDLINE Documents According to Biomedical Specialty

December 2009


55 Reads

Automatic document categorization is an important research problem in Information Science and Natural Language Processing. Many applications, including Word Sense Disambiguation and Information Retrieval in large collections, can benefit from such categorization. This paper focuses on automatic categorization of documents from the biomedical literature into broad discipline-based categories. Two different systems are described and contrasted: CISMeF, which uses rules based on human indexing of the documents by the Medical Subject Headings® (MeSH®) controlled vocabulary in order to assign metaterms (MTs), and Journal Descriptor Indexing (JDI), which is based on human categorization of about 4,000 journals and statistical associations between journal descriptors (JDs) and textwords in the documents. We evaluate and compare the performance of these systems against a gold standard of humanly assigned categories for one hundred MEDLINE documents, using six measures selected from trec_eval. The results show that for five of the measures, performance is comparable, and for one measure, JDI is superior. We conclude that these results favor JDI, given the significantly greater intellectual overhead involved in human indexing and maintaining a rule base for mapping MeSH terms to MTs. We also note a JDI method that associates JDs with MeSH indexing rather than textwords; it may be worthwhile to investigate whether this JDI method (statistical) and CISMeF (rule based) might be combined and then evaluated to show whether they are complementary to one another.

Word Sense Disambiguation by Selecting the Best Semantic Type Based on Journal Descriptor Indexing: Preliminary Experiment

January 2006


123 Reads

An experiment was performed at the National Library of Medicine® (NLM®) in word sense disambiguation (WSD) using the Journal Descriptor Indexing (JDI) methodology. The motivation is the need to solve the ambiguity problem confronting NLM's MetaMap system, which maps free text to terms corresponding to concepts in NLM's Unified Medical Language System® (UMLS®) Metathesaurus®. If the text maps to more than one Metathesaurus concept at the same high confidence score, MetaMap has no way of knowing which concept is the correct mapping. We describe the JDI methodology, which is ultimately based on statistical associations between words in a training set of MEDLINE® citations and a small set of journal descriptors (assigned by humans to journals per se) assumed to be inherited by the citations. JDI is the basis for selecting the best meaning that is correlated to UMLS semantic types (STs) assigned to ambiguous concepts in the Metathesaurus. For example, the ambiguity transport has two meanings: "Biological Transport" assigned the ST Cell Function and "Patient transport" assigned the ST Health Care Activity. A JDI-based methodology can analyze text containing transport and determine which ST receives a higher score for that text, which then returns the associated meaning, presumed to apply to the ambiguity itself. We then present an experiment in which a baseline disambiguation method was compared to four versions of JDI in disambiguating 45 ambiguous strings from NLM's WSD Test Collection. Overall average precision for the highest-scoring JDI version was 0.7873 compared to 0.2492 for the baseline method, and average precision for individual ambiguities was greater than 0.90 for 23 of them (51%), greater than 0.85 for 24 (53%), and greater than 0.65 for 35 (79%). On the basis of these results, we hope to improve performance of JDI and test its use in applications.

Natural Language Processing Versus Content-Based Image Analysis for Medical Document Retrieval

January 2009


74 Reads

One of the most significant recent advances in health information systems has been the shift from paper to electronic documents. While research on automatic text and image processing has taken separate paths, there is a growing need for joint efforts, particularly for electronic health records and biomedical literature databases. This work aims at comparing text-based versus image-based access to multimodal medical documents using state-of-the-art methods of processing text and image components. A collection of 180 medical documents containing an image accompanied by a short text describing it was divided into training and test sets. Content-based image analysis and natural language processing techniques are applied individually and combined for multimodal document analysis. The evaluation consists of an indexing task and a retrieval task based on the "gold standard" codes manually assigned to corpus documents. The performance of text-based and image-based access, as well as combined document features, is compared. Image analysis proves more adequate for both the indexing and retrieval of the images. In the indexing task, multimodal analysis outperforms both independent image and text analysis. This experiment shows that text describing images can be usefully analyzed in the framework of a hybrid text/image retrieval system.

Extending SemRep to the Public Health Domain

October 2013


72 Reads

We describe the use of a domain-independent methodology to extend a natural language processing (NLP) application, SemRep (Rindflesch, Fiszman, & Libbus, 2005), based on the knowledge sources afforded by the Unified Medical Language System (UMLS®) (Humphreys, Lindberg, Schoolman, & Barnett, 1998) to support the area of health promotion within the public health domain. Public health professionals require good information about successful health promotion policies and programs that might be considered for application within their own communities. Our effort seeks to improve access to relevant information for the public health profession, to help those in the field remain an information-savvy workforce. NLP and semantic techniques hold promise to help public health professionals navigate the growing ocean of information by organizing and structuring this knowledge into a focused public health framework paired with a user-friendly visualization application as a way to summarize results of PubMed searches in this field of knowledge.

Emphasizing Social Features in Information Portals: Effects on New Member Engagement

November 2011


83 Reads

Many information portals are adding social features with hopes of enhancing the overall user experience. Invitations to join and welcome pages that highlight these social features are expected to encourage use and participation. While this approach is widespread and seems plausible, the effect of providing and highlighting social features remains to be tested. We studied the effects of emphasizing social features on users' response to invitations, their decisions to join, their willingness to provide profile information, and their engagement with the portal's social features. The results of a quasi-experiment found no significant effect of social emphasis in invitations on receivers' responsiveness. However, users receiving invitations highlighting social benefits were less likely to join the portal and provide profile information. Social emphasis in the initial welcome page for the site also was found to have a significant effect on whether individuals joined the portal, how much profile information they provided and shared, and how much they engaged with social features on the site. Unexpectedly, users who were welcomed in a social manner were less likely to join and provided less profile information; they also were less likely to engage with social features of the portal. This suggests that even in online contexts where social activity is an increasingly common feature, highlighting the presence of social features may not always be the optimal presentation strategy.

Studying PubMed usages in the field for complex problem solving: Implications for tool design

May 2013


52 Reads

Many recent studies on MEDLINE-based information seeking have shed light on scientists' behaviors and associated tool innovations that may improve efficiency and effectiveness. Few if any studies, however, examine scientists' problem-solving uses of PubMed in actual contexts of work and corresponding needs for better tool support. Addressing this gap, we conducted a field study of novice scientists (14 upper level undergraduate majors in molecular biology) as they engaged in a problem solving activity with PubMed in a laboratory setting. Findings reveal many common stages and patterns of information seeking across users as well as variations, especially variations in cognitive search styles. Based on findings, we suggest tool improvements that both confirm and qualify many results found in other recent studies. Our findings highlight the need to use results from context-rich studies to inform decisions in tool design about when to offer improved features to users.

Figure 3: 724 journals related in their citing patterns with cosine ≥0.3 and with factor loading ≤ −0.1 or ≥0.1, colored in accordance with the 12-factor solution (Kamada & Kawai, 1989).
Figure 4: Visualization of the 12 main dimensions of 565 (of the 1,157) journals included in the A&HCI 2008; factor loadings ≥0.2 or ≤ −0.2.
Figure 5: The distribution of Humanities PhD graduates in the United States in 2008. Source. U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, Integrated Postsecondary Data System; accessed via the National Science Foundation's online integrated science and engineering resources data system, WebCASPAR (at
Figure 6: Departmental structure of UCLA Humanities section (according to shared faculty among teaching programs). Source. UCLA General Catalog 2010–2011, available at
Twelve factors distinguished by factor analysis (Varimax; SPSS v18.0).
The Structure of the Arts & Humanities Citation Index: A Mapping on the Basis of Aggregated Citations Among 1,157 Journals

December 2011


1,516 Reads

Using the Arts & Humanities Citation Index (A&HCI) 2008, we apply mapping techniques previously developed for mapping journal structures in the Science and Social Science Citation Indices. Citation relations among the 110,718 records were aggregated at the level of 1,157 journals specific to the A&HCI, and we ask whether a cognitive structure can be reconstructed and visualized from these journal structures. Both cosine-normalization (bottom up) and factor analysis (top down) suggest a division into approximately twelve subsets. The relations among these subsets are explored using various visualization techniques. However, we were not able to retrieve this structure using the ISI Subject Categories, including the 25 categories which are specific to the A&HCI. We discuss options for validation, such as against the categories of the Humanities Indicators of the American Academy of Arts and Sciences and the panel structure of the European Reference Index for the Humanities (ERIH), and compare our results with the curriculum organization of the Humanities Section of the College of Letters and Sciences of UCLA as an example of institutional organization.

The decline in the concentration of citations, 1900-2007

April 2009


116 Reads

This paper challenges recent research (Evans, 2008) reporting that the concentration of cited scientific literature increases with the online availability of articles and journals. Using Thomson Reuters' Web of Science, the present paper analyses changes in the concentration of citations received (two- and five-year citation windows) by papers published between 1900 and 2005. Three measures of concentration are used: the percentage of papers that received at least one citation (cited papers); the percentage of papers needed to account for 20, 50 and 80 percent of the citations; and, the Herfindahl-Hirschman index. These measures are used for four broad disciplines: natural sciences and engineering, medical fields, social sciences, and the humanities. All these measures converge and show that, contrary to what was reported by Evans, the dispersion of citations is actually increasing.
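The three concentration measures named in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names and the sample citation counts are hypothetical, and the Herfindahl-Hirschman index is assumed here to be the sum of squared citation shares over individual papers.

```python
def share_cited(citations):
    """Fraction of papers that received at least one citation."""
    return sum(1 for c in citations if c > 0) / len(citations)


def share_of_papers_needed(citations, fraction):
    """Smallest fraction of papers (most cited first) that accounts
    for at least `fraction` of all citations."""
    ranked = sorted(citations, reverse=True)
    total = sum(ranked)
    running = 0
    for k, c in enumerate(ranked, start=1):
        running += c
        if running >= fraction * total:
            return k / len(ranked)
    return 1.0


def herfindahl_hirschman(citations):
    """Sum of squared citation shares (assumed per-paper shares)."""
    total = sum(citations)
    return sum((c / total) ** 2 for c in citations)


# Hypothetical citation counts for ten papers (illustration only).
sample = [50, 30, 10, 5, 5, 0, 0, 0, 0, 0]

print(share_cited(sample))
print(share_of_papers_needed(sample, 0.8))
print(herfindahl_hirschman(sample))
```

Under these definitions, a lower Herfindahl-Hirschman value and a larger share of papers needed to reach a given percentage of citations both indicate that citations are more widely dispersed.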

Statistical Tests and Research Assessments: A comment on Schneider (2012)

June 2013


467 Reads

In a recent presentation at the 17th International Conference on Science and Technology Indicators, Schneider (2012) criticised the proposal of Bornmann, de Moya Anegon, and Leydesdorff (2012) and Leydesdorff and Bornmann (2012) to use statistical tests in order to evaluate research assessments and university rankings. We agree with Schneider's proposal to add statistical power analysis and effect size measures to research evaluations, but disagree that these procedures would replace significance testing. Accordingly, effect size measures were added to the Excel sheets that we make available online for testing performance differences between institutions in the Leiden Ranking and the SCImago Institutions Ranking.

An empirical investigation of the g-index for 26 physicists in comparison with the h-Index, the A-index, and the R-index

July 2008


111 Reads

Hirsch has introduced the h-index to quantify an individual's scientific research output by the largest number h of a scientist's papers that received at least h citations. In order to take into account the highly skewed frequency distribution of citations, Egghe proposed the g-index as an improvement of the h-index. I have worked out 26 practical cases of physicists from the Institute of Physics at Chemnitz University of Technology and compare the h and g values. It is demonstrated that the g-index discriminates better between different citation patterns. This can also be achieved by evaluating Jin's A-index which reflects the average number of citations in the h-core and interpreting it in conjunction with the h-index. h and A can be combined into the R-index to measure the h-core's citation intensity. I have also determined the A and R values for the 26 data sets. For a better comparison, I utilize interpolated indices. The correlations between the various indices as well as with the total number of papers and the highest citation counts are discussed. The largest Pearson correlation coefficient is found between g and R. Although the correlation between g and h is relatively strong, the arrangement of the data set is significantly different, depending on whether they are put into order according to the values of either h or g.
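The four indices compared in this abstract can be computed from a list of citation counts as in the sketch below. It follows the commonly cited definitions (h: largest rank whose paper has at least that many citations; g: largest rank g whose top g papers together have at least g² citations; A: mean citations in the h-core; R: square root of the total citations in the h-core). The function names and sample data are hypothetical, g is capped at the number of papers as an assumption, and the interpolated indices mentioned in the abstract are not implemented.

```python
import math


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers have at least g^2 citations
    in total (assumption: g cannot exceed the number of papers)."""
    ranked = sorted(citations, reverse=True)
    running, g = 0, 0
    for rank, c in enumerate(ranked, start=1):
        running += c
        if running >= rank * rank:
            g = rank
    return g


def a_index(citations):
    """Average number of citations of the papers in the h-core."""
    h = h_index(citations)
    if h == 0:
        return 0.0
    return sum(sorted(citations, reverse=True)[:h]) / h


def r_index(citations):
    """Square root of the total citations in the h-core, i.e. sqrt(h * A)."""
    h = h_index(citations)
    return math.sqrt(sum(sorted(citations, reverse=True)[:h]))


# Hypothetical citation record (illustration only).
record = [10, 8, 5, 4, 3]
print(h_index(record), g_index(record), a_index(record), r_index(record))
```

For this record the h-core holds four papers with 27 citations in total, while all five papers together reach the g² threshold, which illustrates how g can register highly cited papers that h ignores.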

Co-Occurrence Matrices and their Applications in Information Science: Extending ACA to the Web Environment

October 2006


1,455 Reads

Co-occurrence matrices, such as co-citation, co-word, and co-link matrices, have been used widely in the information sciences. However, confusion and controversy have hindered the proper statistical analysis of this data. The underlying problem, in our opinion, involved understanding the nature of various types of matrices. This paper discusses the difference between a symmetrical co-citation matrix and an asymmetrical citation matrix as well as the appropriate statistical techniques that can be applied to each of these matrices, respectively. Similarity measures (like the Pearson correlation coefficient or the cosine) should not be applied to the symmetrical co-citation matrix, but can be applied to the asymmetrical citation matrix to derive the proximity matrix. The argument is illustrated with examples. The study then extends the application of co-occurrence matrices to the Web environment where the nature of the available data and thus data collection methods are different from those of traditional databases such as the Science Citation Index. A set of data collected with the Google Scholar search engine is analyzed using both the traditional methods of multivariate analysis and the new visualization software Pajek that is based on social network analysis and graph theory.
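The step of deriving a proximity matrix from an asymmetrical citation matrix with the cosine can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the matrix orientation (rows are cited units, columns are citing documents), the function names, and the sample values are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def proximity_matrix(citation_matrix):
    """Pairwise cosine similarities between the rows of an
    asymmetrical (rows x columns) citation matrix."""
    return [[cosine(r, s) for s in citation_matrix] for r in citation_matrix]


# Hypothetical matrix: 3 cited units (rows) by 3 citing documents (columns).
citation_matrix = [
    [1, 0, 2],
    [2, 0, 4],
    [0, 3, 0],
]

for row in proximity_matrix(citation_matrix):
    print([round(x, 3) for x in row])
```

The resulting matrix is symmetrical by construction, which is the point of the distinction drawn in the abstract: the similarity measure is applied to the asymmetrical occurrence data, and the symmetrical matrix is its output rather than its input.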

The "Academic Trace" of the Performance Matrix: A Mathematical Synthesis of the h-Index and the Integrated Impact Indicator (I3)

April 2014


165 Reads

The h-index provides us with nine natural classes which can be written as a matrix of three vectors. The three vectors are: X=(X1, X2, X3), which indicates the publication distribution in the h-core, the h-tail, and the uncited papers, respectively; Y=(Y1, Y2, Y3), which denotes the citation distribution of the h-core, the h-tail, and the so-called "excess" citations (above the h-threshold), respectively; and Z=(Z1, Z2, Z3)=(Y1-X1, Y2-X2, Y3-X3). The matrix V=(X,Y,Z)T constructs a measure of academic performance, in which the nine numbers can all be provided with meanings in different dimensions. The "academic trace" tr(V) of this matrix follows naturally, and contributes a unique indicator for total academic achievements by summarizing and weighting the accumulation of publications and citations. This measure can also be used to combine the advantages of the h-index and the Integrated Impact Indicator (I3) into a single number with a meaningful interpretation of the values. We illustrate the use of tr(V) for the cases of two journal sets, two universities, and ourselves as two individual authors.
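Given the vectors X and Y, the matrix V and its trace follow mechanically from the definitions in this abstract: with rows X, Y, and Z = Y - X, the trace is X1 + Y2 + Z3 = X1 + Y2 + Y3 - X3. The sketch below only illustrates that arithmetic. The abstract does not say how the components of X and Y are computed or weighted, so the input values and the function name are hypothetical placeholders.

```python
def academic_trace(x, y):
    """Build V = (X, Y, Z)^T with Z = Y - X and return (V, tr(V)).

    x and y are length-3 sequences; how their components are derived
    from publication and citation data is not specified here.
    """
    z = [yi - xi for xi, yi in zip(x, y)]
    v = [list(x), list(y), z]               # rows of V are X, Y, Z
    trace = sum(v[i][i] for i in range(3))  # X1 + Y2 + Z3
    return v, trace


# Hypothetical component values (illustration only).
x = [4.0, 6.0, 2.0]     # h-core, h-tail, uncited
y = [30.0, 12.0, 14.0]  # h-core, h-tail, "excess" citations

v, trace = academic_trace(x, y)
print(v)
print(trace)
```

Because Z3 = Y3 - X3, the third diagonal term rewards excess citations and penalizes the uncited component, which is one way to read the abstract's remark that the trace summarizes and weights both publications and citations.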

Mapping Academic Institutions According to Their Journal Publication Profile: Spanish Universities as a Case Study

November 2012


95 Reads

We introduce a novel methodology for mapping academic institutions based on their journal publication profiles. We believe that journals in which researchers from academic institutions publish their works can be considered as useful identifiers for representing the relationships between these institutions and establishing comparisons. However, when academic journals are used for research output representation, distinctions must be introduced between them, based on their value as institution descriptors. This leads us to the use of journal weights attached to the institution identifiers. Since a journal in which researchers from a large proportion of institutions published their papers may be a bad indicator of similarity between two academic institutions, it seems reasonable to weight it in accordance with how frequently researchers from different institutions published their papers in this journal. Cluster analysis can then be applied to group the academic institutions, and dendrograms can be provided to illustrate groups of institutions following agglomerative hierarchical clustering. In order to test this methodology, we use a sample of Spanish universities as a case study. We first map the study sample according to an institution's overall research output, then we use it for two scientific fields (Information and Communication Technologies, as well as Medicine and Pharmacology) as a means to demonstrate how our methodology can be applied, not only for analyzing institutions as a whole, but also in different disciplinary contexts.

Author-Choice Open-Access Publishing in the Biological and Medical Literature: A Citation Analysis

January 2009


87 Reads

In this article, we analyze the citations to articles published in 11 biological and medical journals from 2003 to 2007 that employ author-choice open access models. Controlling for known explanatory predictors of citations, only 2 of the 11 journals show positive and significant open access effects. Analyzing all journals together, we report a small but significant increase in article citations of 17%. In addition, there is strong evidence to suggest that the open access advantage is declining by about 7% per year, from 32% in 2004 to 11% in 2007.

The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section

November 2007


172 Reads

This article statistically analyses how the citation impact of articles deposited in the Condensed Matter section of the preprint server ArXiv (hosted by Cornell University), and subsequently published in a scientific journal, compares to that of articles in the same journal that were not deposited in that archive. Its principal aim is to further illustrate and roughly estimate the effect of two factors, 'early view' and 'quality bias', upon differences in citation impact between these two sets of papers, using citation data from Thomson Scientific's Web of Science. It presents estimates for a number of journals in the field of condensed matter physics. In order to discriminate between an 'open access' effect and an early view effect, longitudinal citation data was analysed covering a time period as long as 7 years. Quality bias was measured by calculating ArXiv citation impact differentials at the level of individual authors publishing in a journal, taking into account co-authorship. The analysis provided evidence of a strong quality bias and early view effect. Correcting for these effects, there is no sign of a general 'open access advantage' for papers deposited in ArXiv in a sample of 6 condensed matter physics journals studied in detail. The study does provide evidence that ArXiv accelerates citation because it makes papers available earlier, rather than because it makes papers freely available.

Accounting for the Uncertainty in the Evaluation of Percentile Ranks

November 2012


76 Reads

In a recent paper entitled "Inconsistencies of Recently Proposed Citation Impact Indicators and how to Avoid Them," Schreiber (2012, at arXiv:1202.3861) proposed (i) a method to assess tied ranks consistently and (ii) fractional attribution to percentile ranks in the case of relatively small samples (e.g., for n < 100). Schreiber's solution to the problem of how to handle tied ranks is convincing, in my opinion (cf. Pudovkin & Garfield, 2009). The fractional attribution, however, is computationally intensive and cannot be done manually for even moderately large batches of documents. Schreiber attributed scores fractionally to the six percentile rank classes used in the Science and Engineering Indicators of the U.S. National Science Board, and thus missed, in my opinion, the point that fractional attribution at the level of hundred percentiles-or equivalently quantiles as the continuous random variable-is only a linear, and therefore much less complex problem. Given the quantile-values, the non-linear attribution to the six classes or any other evaluation scheme is then a question of aggregation. A new routine based on these principles (including Schreiber's solution for tied ranks) is made available as software for the assessment of documents retrieved from the Web of Science (at

The Relationship Between Acquaintanceship and Coauthorship in Scientific Collaboration Networks

November 2011


39 Reads

This article examines the relationship between acquaintanceship and coauthorship patterns in a multi-disciplinary, multi-institutional, geographically distributed research center. Two social networks are constructed and compared: a network of coauthorship, representing how researchers write articles with one another, and a network of acquaintanceship, representing how those researchers know each other on a personal level, based on their responses to an online survey. Statistical analyses of the topology and community structure of these networks point to the importance of small-scale, local, personal networks predicated upon acquaintanceship for accomplishing collaborative work in scientific communities.

How Can Journal Impact Factors be Normalized across Fields of Science? An Assessment in terms of Percentile Ranks and Fractional Counts

January 2013


Using the CD-ROM version of the Science Citation Index 2010 (N = 3,705 journals), we study the (combined) effects of (i) fractional counting on the impact factor (IF) and (ii) transformation of the skewed citation distributions into a distribution of 100 percentiles and six percentile rank classes (top-1%, top-5%, etc.). Do these approaches lead to field-normalized impact measures for journals? In addition to the two-year IF (IF2), we consider the five-year IF (IF5), the respective numerators of these IFs, and the number of Total Cites, counted both as integers and fractionally. These various indicators are tested against the hypothesis that the classification of journals into 11 broad fields by PatentBoard/National Science Foundation provides statistically significant between-field effects. Using fractional counting, the between-field variance is reduced by 91.7% in the case of IF5, and by 79.2% in the case of IF2. However, the differences in citation counts are not significantly affected by fractional counting. These results accord with previous studies, but the longer citation window of a fractionally counted IF5 can lead to significant improvement in the normalization across fields.

FIG. 1. Changes in impact factor from 1994 to 2005. Panel (a) is a log-log plot of 1994 impact factor against 2005 impact factor for the 4,300 journals that were listed in every year from 1994 to 2005 in the JCR dataset. Shading indicates density of points, with darker tones representing higher density. Panel (b) plots the rank-order distribution of impact factors for these 4,300 journals from 1994 to 2005. The progression of darkening shade indicates years, with the lightest shade representing 1994 and the darkest 2005. Note that the highest and lowest-scoring journals do not fall within the scales of the plot. 
FIG. 2. Differences in citation patterns across fields. This figure illustrates the differences in c, p, v, and IF across fields. Panel (a) shows differences in c, the average number of items cited per paper. Panel (b) shows differences in p, the fraction of citations to papers published in the two previous calendar years. Panel (c) shows differences in v, the fraction of citations to papers published in JCR-listed journals, and panel (d) shows differences in IF, the weighted impact factor. Fields are categorized and mapped as in Rosvall and Bergstrom (2008).
FIG. 3. Calculating v_t. The top panel gives the schematic for calculating v_t for the entire dataset, and the bottom panel gives the schematic for specific fields. Details are provided in the text.
TABLE 4. Results of hierarchical partitioning.
Differences in Impact Factor Across Fields and Over Time

January 2008


The impact factor of an academic journal for any year is the number of times the average article published in that journal in the previous two years is cited in that year. From 1994 to 2005, the average impact factor of journals listed by the ISI increased by an average of 2.6 percent per year. This paper documents this growth and explores its causes.
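The definition above can be illustrated with a minimal sketch. The function names and the example figures are hypothetical, and treating the 2.6 percent average as a compounding annual rate is an assumption made here for illustration, not a claim taken from the paper.

```python
def impact_factor(citations_to_prev_two_years: int, items_prev_two_years: int) -> float:
    """Citations received this year by items from the two previous years,
    divided by the number of items published in those two years."""
    if items_prev_two_years <= 0:
        raise ValueError("the journal must have published at least one item")
    return citations_to_prev_two_years / items_prev_two_years


def compound_growth(value: float, annual_rate: float, years: int) -> float:
    """Project a value forward, ASSUMING the rate compounds once per year."""
    return value * (1.0 + annual_rate) ** years


# Hypothetical journal: 120 items in the two previous years, cited 300 times this year.
example_if = impact_factor(300, 120)

# Growth factor implied by 2.6% per year over the 11 steps from 1994 to 2005.
growth_factor = compound_growth(1.0, 0.026, 11)
```

Under the compounding assumption, `growth_factor` is the ratio of the 2005 average to the 1994 average.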

Differences in Impact Factor Across Fields and Over Time

January 2009


The bibliometric measure impact factor is a leading indicator of journal influence, and impact factors are routinely used in making decisions ranging from selecting journal subscriptions to allocating research funding to deciding tenure cases. Yet journal impact factors have increased gradually over time, and moreover impact factors vary widely across academic disciplines. Here we quantify inflation over time and differences across fields in impact factor scores and determine the sources of these differences. We find that the average number of citations in reference lists has increased gradually, and this is the predominant factor responsible for the inflation of impact factor scores over time. Field-specific variation in the fraction of citations to literature indexed by Thomson Scientific's Journal Citation Reports is the single greatest contributor to differences among the impact factors of journals in different fields. The growth rate of the scientific literature as a whole, and cross-field differences in net size and growth rate of individual fields, have had very little influence on impact factor inflation or on cross-field differences in impact factor.

How are new citation-based journal indicators adding to the bibliometric toolbox? J Am Soc Inf Sci Technol

July 2009


The launching of Scopus and Google Scholar, and methodological developments in Social Network Analysis, have made many more indicators for evaluating journals available than the traditional Impact Factor, Cited Half-life, and Immediacy Index of the ISI. In this study, these new indicators are compared with one another and with the older ones. Do the various indicators measure new dimensions of the citation networks, or are they highly correlated with one another? Are they robust and relatively stable over time? Two main dimensions, size and impact, are distinguished; together they shape influence. The H-index combines the two dimensions and can also be considered as an indicator of reach (like Indegree). PageRank is mainly an indicator of size, but has important interactions with centrality measures. The Scimago Journal Ranking (SJR) indicator provides an alternative to the Journal Impact Factor, but the computation is less easy.
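As a small illustration of how the H-index combines size and impact, the sketch below uses the standard definition (the largest h such that h items have at least h citations each). The function name and the sample counts are hypothetical and are not taken from the study.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that h items each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` items all have >= rank citations
        else:
            break         # counts only decrease from here on
    return h


# Hypothetical citation counts for the items of one journal.
sample = [10, 8, 5, 4, 3]
sample_h = h_index(sample)
```

A set with many lightly cited items and a set with few heavily cited items can both end up with a low value, which is one way to read the statement that the indicator depends on both dimensions.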

Performance-related differences of bibliometric statistical properties of research groups: Cumulative advantages and hierarchically layered networks

December 2006


In this paper we distinguish between top-performance and lower performance groups in the analysis of statistical properties of bibliometric characteristics of two large sets of research groups. We find intriguing differences between top-performance and lower performance groups, but also between the two sets of research groups. These latter differences are particularly interesting, as they may indicate the influence of research management strategies. Lower performance groups have a larger scale-dependent cumulative advantage than top-performance groups. We also find that, regardless of performance, larger groups have fewer uncited publications. We introduce a simple model in which processes at the micro level lead to the observed phenomena at the macro level. Top-performance groups are, on average, more successful in the entire range of journal impact. We fit our findings into a concept of hierarchically layered networks. In this concept, the network of research groups constitutes a layer one hierarchical step higher than the basic network of publications connected by citations. The cumulative size-advantage of citations received by a group looks like preferential attachment in the basic network, in which highly connected nodes (publications) increase their connectivity faster than less connected nodes. But in our study it is size that causes an advantage. In general, the larger a group (node in the research group network), the more incoming links this group acquires in a non-linear, cumulative way. Moreover, top-performance groups are about an order of magnitude more efficient in creating linkages (i.e., receiving citations) than the lower performance groups.

How Fractional Counting of Citations Affects the Impact Factor: Normalization in Terms of Differences in Citation Potentials Among Fields of Science

February 2011


The ISI-Impact Factors suffer from a number of drawbacks, among them the statistics (why should one use the mean and not the median?) and the incomparability among fields of science because of systematic differences in citation behavior among fields. Can these drawbacks be counteracted by counting citation weights fractionally instead of using whole numbers in the numerators? (i) Fractional citation counts are normalized in terms of the citing sources and thus would take into account differences in citation behavior among fields of science. (ii) Differences in the resulting distributions can be tested statistically for their significance at different levels of aggregation. (iii) Fractional counting can be generalized to any document set including journals or groups of journals, and thus the significance of differences among both small and large sets can be tested. A list of fractionally counted Impact Factors for 2008 is available online at The in-between group variance among the thirteen fields of science identified in the U.S. Science and Engineering Indicators is not statistically significant after this normalization. Although citation behavior differs largely between disciplines, the reflection of these differences in fractionally counted citation distributions could not be used as a reliable instrument for the classification.
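A minimal sketch of fractional counting follows. It assumes one common reading of "normalized in terms of the citing sources", namely that each citation is weighted by 1/n, where n is the number of references in the citing document. The weighting rule, the function names, and the sample data are assumptions for illustration, not the authors' routine.

```python
def integer_citation_count(citing_reference_counts: list[int]) -> int:
    """Whole-number counting: every citation counts as 1."""
    return len(citing_reference_counts)


def fractional_citation_count(citing_reference_counts: list[int]) -> float:
    """Fractional counting under the ASSUMED 1/n rule: a citation from a
    document with n references contributes 1/n to the total."""
    if any(n <= 0 for n in citing_reference_counts):
        raise ValueError("a citing document must have at least one reference")
    return sum(1.0 / n for n in citing_reference_counts)


# Hypothetical cited paper: three citing documents with 10, 20, and 5 references.
refs = [10, 20, 5]
whole = integer_citation_count(refs)
fractional = fractional_citation_count(refs)
```

Under this rule a citation from a field with long reference lists weighs less than one from a field with short reference lists, which is how the weighting is meant to offset field differences in citation behavior.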
