On the Benfordness of academic citations
Andre E. Vellwock and Anran Wei
Benford’s Law is a tool to assess the validity of datasets. Citation numbers are a method to quantify a
researcher's academic achievement, but the legitimacy can be questioned. Benford’s Law can point out data
manipulation in accounting and financial markets; however, its effectiveness in the academic field is unknown.
Our results showed that Benford’s Law is valid for analyzing a list of cited publications, both for individual
researchers and average university values. The Benfordness increases according to the quantity of cited
publications. Nevertheless, for 1000+ publications, a non-traditional distribution appears. This goes against the
Law, and we cannot explain the reason for its appearance here. This study also shows that the world rank of
a university is not directly correlated to the Benfordness of its publications.
Keywords: Benford’s Law, academic citation, publications, data analysis
Introduction
Researchers are evaluated based on their academic
citations, resulting in factors such as h-index and i10-
index. The validity of these numbers is thus essential to
guarantee a fair performance assessment. Benford’s Law
is a number distribution also applied as a statistical tool
to indicate manipulation in datasets. It is broadly used in
finance (1), accounting (2), and politics (3, 4). Recently,
its applicability has been validated in identifying possible
manipulation in COVID-19 numbers (5). Here we tested
citation numbers against the Benford’s Law distribution, for
individual researchers and universities. The aim is
not to locate possible citation manipulation, but to
evaluate whether citation numbers follow the Law.
Results and discussions
The Benfordness of the data is measured by the d*-factor:
the lower the value, the better the fit to Benford’s
distribution. In Figure 1a, the d*-factors of individual
researchers (blue squares) and university average values
(dark grey circles) are plotted against the quantity of cited
publications (QCT). A clear correlation between the
variables is evident, with a second-order polynomial fit
(R² of 0.92) where

$$d^{*}\text{-factor} = 0.39 - 4.04 \times 10^{-4}\,\mathrm{QCT} + 1.16 \times 10^{-7}\,\mathrm{QCT}^{2}$$

The graph shows that this holds while the publication numbers
are below 1000. Above that, a more random distribution is
present (dashed square). This result goes against the
foundation of Benford’s Law: a larger dataset should
reduce variation and strengthen Benfordness.
We cannot explain this behavior statistically.
The graph in Figure 1b evaluates whether the university
world ranking affects the institution's average d*-factor.
High-ranking universities showed the lowest values of d*,
thus presenting a high Benfordness. Universities around the
100th position show an increase of d*. This tendency does
not hold for the institutions with lower ranking
(501-510th), where the deviation from Benford’s Law
decreases again. For example, East China Normal
University has a d*-factor in the same range as the
highest-ranked institutions. On the other hand, Hitotsubashi
University has a higher d*. This large variation for
lower-ranking universities does not let us trace a trend
between ranking and d*-factor. The result shows that
QCT is more important than the position in a ranking
with regard to Benford’s Law.

Figure 1. (a) The influence of the quantity of cited
publications on the d*-factor of individual researchers
and universities. (b) The absence of correlation between
university world ranking and average d*-factor.

Methods
Data selection and assessment
Nine universities were chosen from three ranking levels
according to the QS World University Rankings 2021
(Appendix A). The top ten most cited researchers,
according to Google Scholar, were selected for each
university. The total number of individual researchers
studied is 90. With the aid of the software Publish or
Perish 7, a list of publications and their citation numbers
was acquired; the methodology is illustrated in Figure 2.
The software has a limit of 1000 publications; thus,
if a researcher had 1000+ cited publications, the ones not
returned by the software were manually obtained from Google
Scholar. The data acquisition was made from 30 October
2020 to 01 November 2020.
Figure 2. Schematic representing the data acquisition
method for a hypothetical author with hypothetical
publications.
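The acquisition step can be sketched in Python as follows. This is only an illustration under stated assumptions: the exported list is taken to be CSV text with a citation column named "Cites", the sample rows are invented, and the helper names `cited_counts` and `merge_counts` are hypothetical. Skipping publications with zero citations is also an assumption, made because the analysis concerns cited publications and a zero has no first digit from 1 to 9.

```python
import csv
import io

# Hypothetical export: the column names and rows are assumptions for illustration.
SAMPLE_EXPORT = """Cites,Title
120,Paper A
0,Paper B
35,Paper C
7,Paper D
"""


def cited_counts(csv_text, cites_column="Cites"):
    """Return the citation counts of publications cited at least once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = [int(row[cites_column]) for row in reader]
    return [c for c in counts if c > 0]


def merge_counts(software_counts, manual_counts):
    """Append manually collected counts for authors beyond the 1000-record limit."""
    return list(software_counts) + [c for c in manual_counts if c > 0]


counts = cited_counts(SAMPLE_EXPORT)
qct = len(counts)  # quantity of cited publications (QCT) for this author
print(counts, qct)
```

The resulting list of citation counts is the dataset whose first digits are then compared with Benford’s distribution.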
Benford’s Law
In a dataset, isolating the first digit of each number gives
a list of digits from 1 to 9. Dividing the number of times a digit
is present by the total quantity of numbers gives a
fraction. Doing this for 1, 2, 3, …, 9, we have a frequency
distribution. Benford’s Law states that the fraction of
each first digit follows the distribution P(d) with

$$P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad \text{for } d = 1, \ldots, 9$$

Listing the fractions for the first digits gives

d          1     2     3     4     5
P(d) [%]   30.1  17.6  12.5  9.7   7.9

The Benfordness, i.e. the deviation from Benford’s Law,
was evaluated by quantifying the d*-factor (5), expressed
as

$$d^{*}\text{-factor} = \frac{\sqrt{\sum_{d=1}^{9} \left(\tilde{P}(d) - P(d)\right)^{2}}}{1.03606}$$

where d is the first digit from 1 to 9, and P̃(d) stands for
the observed distribution of each first digit in the dataset.
Conclusions
Citation numbers do follow Benford’s Law, extending
the applicability of the distribution. The deviation from the
Law is correlated with the quantity of cited publications of
the specific researcher. The correlation obeys a second-order
polynomial fit with a coefficient of determination (R²) of 0.92. For
highly cited individuals, oscillations occur, but
the d*-factor values remain low. We also
showed that the university ranking does not clearly
influence the Benfordness. A study taking into
consideration a larger number of institutions is needed for
a more detailed assessment.
References
1. Cho W, Gaines B. Breaking the (Benford) Law: Statistical Fraud Detection in Campaign Finance. The American Statistician. 2007;61:218-23.
2. Durtschi C, Hillison WA, Pacini C, editors. The effective use of Benford's Law to assist in detecting fraud in accounting data. 2004.
3. Deckert J, Myagkov M, Ordeshook PC. Benford's Law and the Detection of Election Fraud. Political Analysis. 2011;19(3):245-68.
4. Beber B, Scacco A. What the numbers say: A digit-based test for election fraud. Political Analysis. 2012;20(2):211-34.
5. Wei A, Vellwock AE. Is COVID-19 data reliable? A statistical analysis with Benford's Law. 2020.
Appendix A
Table 1. Selected universities and their world ranking

Ranking    University
1          Massachusetts Institute of Technology
2          Stanford University
3          Harvard University
101        Pennsylvania State University
102        Trinity College Dublin
103        Technical University of Denmark
501-510    Christian-Albrechts-University zu Kiel
501-510    East China Normal University
501-510    Hitotsubashi University
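
The quantities defined in the Benford’s Law section can be computed with a short Python sketch. It is not the code used in this study: the function names and the sample citation counts are hypothetical, and the constant 1.03606 is simply taken from the d*-factor definition. The fit in `fitted_d_star` assumes a negative linear term and exponents of 10⁻⁴ and 10⁻⁷, which is what the reported decrease of d* with QCT implies.

```python
import math

NORMALIZATION = 1.03606  # constant taken from the d*-factor definition in the text


def benford_probability(d):
    """Expected fraction of numbers whose first digit is d (1 to 9)."""
    return math.log10(1 + 1 / d)


def first_digit(n):
    """Most significant digit of a positive integer."""
    return int(str(n)[0])


def first_digit_fractions(numbers):
    """Observed fraction of each first digit 1..9; non-positive values are skipped."""
    digits = [first_digit(n) for n in numbers if n > 0]
    if not digits:
        raise ValueError("need at least one positive number")
    return {d: digits.count(d) / len(digits) for d in range(1, 10)}


def d_star_factor(numbers):
    """Euclidean distance between observed and Benford fractions, normalized."""
    observed = first_digit_fractions(numbers)
    squared = sum((observed[d] - benford_probability(d)) ** 2 for d in range(1, 10))
    return math.sqrt(squared) / NORMALIZATION


def fitted_d_star(qct):
    """Second-order fit of the d*-factor against QCT (signs and exponents assumed)."""
    return 0.39 - 4.04e-4 * qct + 1.16e-7 * qct ** 2


# Hypothetical citation counts for one author (illustrative only).
citations = [152, 98, 17, 11, 240, 33, 1, 19, 5, 120, 14, 2]
print({d: round(100 * benford_probability(d), 1) for d in range(1, 10)})
print(round(d_star_factor(citations), 3))
print(round(fitted_d_star(500), 3))
```

A lower `d_star_factor` value means the first-digit fractions of the list sit closer to Benford’s distribution. A list as short as the sample one is expected to deviate strongly, in line with the dependence on QCT discussed in the Results.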