Preprints and early-stage research may not have been peer reviewed yet.

Is COVID-19 data reliable? A statistical analysis with Benford’s Law

Anran Wei, Andre E. Vellwock

Abstract

Benford’s Law is used to detect data manipulation in large datasets and is widely recognized as a valid tool against financial fraud and tax evasion. Here, we study its application to COVID-19 datasets, looking for data manipulation in four series: total confirmed cases, daily confirmed cases, total confirmed deaths, and daily confirmed deaths. We considered the countries with the most total confirmed cases on 1 September 2020, plus China. Overall, the COVID-19 numbers do follow Benford’s Law. Moreover, no evidence of data manipulation is seen for data from the USA, Brazil, India, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia, China, the Philippines, Belgium, Pakistan, and Italy. Results suggest a high possibility of data manipulation in Russia’s data. A small divergence is present in Iran’s numbers.

Keywords: COVID-19, Benford’s Law, statistics, coronavirus, data manipulation, data analysis

1. Introduction

Benford’s Law, also called the Newcomb-Benford Law, was first observed by Newcomb (1) and popularized by Benford (2). The Law is widely applied to test the authenticity of data in various fields of daily life, mostly in financial applications (3), accounting fraud (4-6), and politics (3, 7-10). Benford’s Law states that the first digit of a naturally occurring decimal number is most likely to be 1, and that the probabilities of the first digit being 2 to 9 decrease progressively. With strong evidence that numbers for common diseases indeed follow Benford’s distribution (11), studies have attempted to analyze COVID-19 Benfordness, namely, how well the numbers fit Benford’s Law. From here on, we mention the date when each study was made or published, since COVID-19 numbers change over time.

Sambridge and Jackson (12) studied data until 9 April 2020 and suggested that data from the United States, Japan, Indonesia, and most European countries follow the distribution well, but they also pointed out anomalies. The authors showed that the total deaths in the Czech Republic do not follow Benford’s Law at all; specifically, there is a high peak at “9”, contrary to the distribution. While showing that COVID-19 follows Benford’s Law, they also confirmed that the time window over which the datasets are evaluated is crucial. For example, once the growth of the virus slows and daily confirmed cases flatten out, the first digit of the total tends to become constant if the cases are already in the thousands. The same consideration was made by Lee et al. (13) in a paper submitted on 28 April 2020. Koch and Okamura (14), on 28 April 2020, demonstrated that the confirmed case numbers of the USA, Italy and China match the Law, showing high Benfordness and no data manipulation for these countries. Idrovo and Manrique-Hernández (15), with data up to 15 March 2020, likewise found no data manipulation in China’s numbers. In the same period, checking confirmed cases data until 30 April 2020 in a large study of 23 countries, Raul (16) indicated that Italy, Portugal, the Netherlands, the United Kingdom, Denmark, Belgium and Chile may have altered COVID data. All these studies showed that COVID-19 data can be examined from the perspective of Benford’s Law, but their conclusions are somewhat contradictory regarding which countries show possible data manipulation.

2. Methods

2.1 Benford’s Law and d* factor

Benford’s Law can be explained, though not rigorously, by the intuition that the mantissa of the logarithms of exponentially growing numbers tends to be uniformly distributed. For an arbitrary decimal number $x$, its logarithm can be written in the form

$$\log_{10} x = n + m$$

where $n$ is the integer part and $m \in [0, 1)$ is the mantissa of the logarithm of $x$. The first digit of $x$ is determined by the log-mantissa $m$. For numbers with first digit equal to $d$ ($d = 1, 2, \ldots, 9$), the corresponding log-mantissa lies in the interval $[\log_{10} d, \log_{10}(d+1))$. Under the hypothesis that the log-mantissa is uniformly distributed, the probability of the first digit being equal to $d$ is thus given by

$$P(d) = \log_{10}(d+1) - \log_{10} d = \log_{10}\left(1 + \frac{1}{d}\right)$$

Table 1 lists the probability of each number from 1 to 9 being the first digit according to Benford’s Law.

Table 1. First digit distribution according to Benford’s Law.

d      1      2      3      4      5      6      7      8      9
P(d)   30.1%  17.6%  12.5%  9.7%   7.9%   6.7%   5.8%   5.1%   4.6%

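The values in Table 1 can be reproduced directly from the logarithmic formula for the first-digit probability. The following is a minimal Python sketch; the function name `benford_probabilities` is a hypothetical helper chosen for illustration and is not taken from the paper.

```python
import math


def benford_probabilities():
    """Return {d: P(d)} for first digits d = 1..9, with P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


probs = benford_probabilities()
for digit, p in probs.items():
    # Print each probability as a percentage with one decimal, as in Table 1.
    print(f"{digit}: {100 * p:.1f}%")
```

Because the nine intervals of the log-mantissa tile [0, 1), the probabilities sum to one.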
In our study, deviations from Benford’s curve were quantified by the d*-factor (d*) (3, 17), which is the Euclidean distance between a country’s first-digit distribution and Benford’s distribution, normalized by the maximum possible distance, 1.03606, which occurs when every first digit is 9 and no other first digit appears (17). If a dataset matches Benford’s curve exactly, d* is equal to 0.0. A larger Euclidean distance means a lower Benfordness and a d* closer to 1.0. Goodman (17) proposed that a d* higher than 0.25 is strong evidence of data manipulation. The calculation of d* is expressed as

$$d^* = \frac{1}{1.03606}\sqrt{\sum_{d=1}^{9}\left[p_d - P(d)\right]^2}$$

where $d$ is the first digit from 1 to 9, $P(d) = \log_{10}(1 + 1/d)$ is the Benford probability listed in Table 1, and $p_d$ stands for the observed probability of each first digit in the real dataset.

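As an illustration of the d* calculation, here is a short Python sketch. It assumes the observed distribution is supplied as nine relative frequencies for the digits 1 to 9. The names `d_star` and `D_MAX` are hypothetical, and the normalization constant is simply the value quoted in the text.

```python
import math

D_MAX = 1.03606  # normalization constant as quoted in the text (17)


def d_star(observed):
    """Normalized Euclidean distance between observed first-digit
    frequencies (digits 1..9, in order) and Benford's distribution."""
    if len(observed) != 9:
        raise ValueError("expected nine frequencies, one per digit 1..9")
    squared = sum(
        (p - math.log10(1 + 1 / d)) ** 2
        for d, p in enumerate(observed, start=1)
    )
    return math.sqrt(squared) / D_MAX


benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
only_nines = [0.0] * 8 + [1.0]  # every number starts with 9

print(d_star(benford))     # a perfect Benford match
print(d_star(only_nines))  # the extreme case, close to 1
```

A value above the 0.25 threshold proposed by Goodman (17) would then be flagged for closer inspection rather than taken as proof of manipulation.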
2.2 Data collection and preprocessing

Data were obtained on 1 September 2020 from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (18). We considered the countries with the highest total cases, namely the USA, Brazil, India, Russia, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, Iran, the United Kingdom, France, Saudi Arabia, the Philippines, Belgium, Pakistan, and Italy. China was included as it was the first country with a surge in confirmed cases. Except for the USA, the UK, France and China, the imported data already present country-level results per day for total confirmed cases, daily confirmed cases, total deaths and daily deaths. For the USA, the UK and France, the data are divided into regions (provinces or cities), so we summed them to obtain the country data per day. As cases in China increased quickly and stabilized in a matter of days, the national datasets for China are relatively small and thus not adequate for applying Benford’s Law. The data for China were therefore preprocessed differently: they were not summed into country totals per day but left as per-province data per day, which enlarges the dataset and allows us to test its Benfordness. Then, the first digits of these country or province series were recorded to obtain the probability distributions that follow.

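The preprocessing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors’ code: the function names are hypothetical, the regional numbers are made up, and skipping zero or negative entries is an assumption, since the text does not say how such values were handled.

```python
def sum_regions(regional_totals):
    """Sum per-region cumulative series of equal length into one national series."""
    return [sum(day) for day in zip(*regional_totals)]


def daily_from_total(totals):
    """Day-to-day increments of a cumulative series."""
    return [b - a for a, b in zip(totals, totals[1:])]


def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def first_digit_distribution(values):
    """Relative frequency of first digits 1..9 among the positive values.
    Assumption: zeros and negative corrections are skipped."""
    digits = [first_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive values to analyze")
    return [digits.count(d) / len(digits) for d in range(1, 10)]


# Made-up cumulative counts for two regions, for illustration only.
region_a = [1, 3, 8, 20, 45, 90]
region_b = [0, 2, 5, 12, 30, 61]

national = sum_regions([region_a, region_b])
daily = daily_from_total(national)
print(first_digit_distribution(national))
```

For a country treated like China in this study, the regional series would instead be pooled without summing, for example by concatenating the lists before calling `first_digit_distribution`.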
3. Results and discussions

Results are presented in sections covering the whole world and each of the selected countries. Figures 1-4 illustrate the results and compare the data for the whole world and each country with the Benford’s Law distribution (black line). Figures 1 and 2 show the total and daily confirmed cases, respectively, while Figures 3 and 4 show the total and daily deaths.

3.1 World

The d* values of the whole-world data were below 0.10 (Figures 1a, 2a, 3a, 4a). Moreover, the values for daily cases and total deaths were below 0.03 (Figures 2a, 3a). These results validate the use of Benford’s Law for COVID-19 data in all four series studied here.

3.2 USA

The USA is an ideal country for this analysis, as it has reported daily cases consistently and thus provides a large dataset. A small deviation in the daily cases is seen in Figure 2b, with a slightly larger peak at “2”. However, this is not seen in the total cases (Figure 1b). Figure 3b shows an extremely large peak at “1” and a high d*. This occurs because total deaths passed 100 thousand, which keeps the first digit at “1” for a long time. The daily deaths support this explanation by showing good agreement with Benford’s Law. This highlights the need to check whether a deviation has a natural explanation before drawing final conclusions.

3.3 Brazil

Brazil is another case of high Benfordness. However, a small deviation can be seen in the daily deaths, with a d* of 0.21. This is due to daily deaths between 1,000 and 1,999, especially in the last weeks of the studied data.

3.4 India

There are no indications of data alteration in India’s numbers (Figures 1d, 2d, 3d, 4d), with a maximum d* of 0.15 and good Benfordness.

3.5 Russia

Results suggest a high possibility of data manipulation in Russia’s data. Figure 1e illustrates the lack of Benfordness for the total confirmed cases. The pattern resembles a random (uniform) distribution: if we calculate d* relative to a constant probability of 1/9 for all first digits, we obtain 0.13, lower than the value relative to the Benford distribution (0.20). Daily cases (Figure 2e) reconfirm the lack of correlation with Benford’s Law, with a d* of 0.30 and no apparent large peak. This leads to the conclusion that the high d* is not due to a constant first digit, as seen in the USA and Brazil, but most probably to data alteration. Death numbers are also off. The high fraction of “1” in the total deaths (Figure 3e) is explained by the total passing 10,000. However, the almost equal fractions of the other first digits, “2” to “9”, suggest a constant growth of the total deaths. This behavior is not exhibited in the other countries. Daily deaths (Figure 4e) also do not follow Benford’s Law.

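The comparison against a constant probability of 1/9 can be illustrated with the Python sketch below. It assumes that the same normalization constant, 1.03606, is used for the distance to the uniform reference; the text does not state this explicitly. The function names are hypothetical and no country data are reproduced here.

```python
import math

D_MAX = 1.03606  # normalization constant quoted in the Methods section

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
UNIFORM = [1 / 9] * 9  # constant probability for every first digit


def normalized_distance(observed, reference):
    """Euclidean distance between two first-digit distributions,
    divided by D_MAX (assumed to apply to both references)."""
    squared = sum((o - r) ** 2 for o, r in zip(observed, reference))
    return math.sqrt(squared) / D_MAX


def closer_reference(observed):
    """Report which reference distribution the observed one is nearer to."""
    to_benford = normalized_distance(observed, BENFORD)
    to_uniform = normalized_distance(observed, UNIFORM)
    return "uniform" if to_uniform < to_benford else "benford"


# Distance between the two reference distributions themselves.
print(normalized_distance(UNIFORM, BENFORD))
```

A dataset that lands nearer to the uniform reference than to Benford’s, as reported above for Russia’s total confirmed cases (0.13 versus 0.20), would return "uniform" from `closer_reference`.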
3.6 Peru

Total confirmed cases and total deaths correlate well with Benford’s curve (Figures 1f and 3f). A small deviation is shown in Figures 2f and 4f, but the latter can be explained by daily deaths more frequently falling between 100 and 199, in agreement with the country’s size and its current COVID-19 situation.

3.7 South Africa

There are no indications of data alteration in South Africa’s numbers (Figures 1g, 2g, 3g, 4g), with a maximum d* of 0.10 and good Benfordness.

3.8 Colombia

There are no indications of data alteration in Colombia’s numbers (Figures 1h, 2h, 3h, 4h), with a maximum d* of 0.12. Extremely low values of d* are seen for the confirmed cases, both total and daily.

3.9 Mexico

There are no indications of data alteration in Mexico’s numbers (Figures 1i, 2i, 3i, 4i), with a maximum d* of 0.17.

3.10 Spain

Spain’s numbers might give a misleading indication, with d* higher than 0.45 for total cases and total deaths. This is due to the reduction in transmission and deaths and the consequent constancy of the first digit. The total confirmed cases stabilized around 200,000 and the total deaths around 20,000, so peaks at “2” are naturally present in Figures 1j and 3j. The lack of data manipulation is confirmed by a low d* in daily confirmed cases (Figure 2j) and daily deaths (Figure 4j).

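The effect of a flattened curve on the first digit can be illustrated with synthetic data. The Python sketch below is not based on Spain’s actual numbers: the logistic curve, its plateau of 240,000, and the 10% daily growth rate are all hypothetical parameters chosen only to show how a stabilized cumulative total concentrates the first digit.

```python
import math

D_MAX = 1.03606  # normalization constant quoted in the Methods section


def first_digit_shares(values):
    """Relative frequency of first digits 1..9 among the positive values."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    return [digits.count(d) / len(digits) for d in range(1, 10)]


def d_star(values):
    """d* of a list of positive integers against Benford's distribution."""
    shares = first_digit_shares(values)
    squared = sum(
        (s - math.log10(1 + 1 / d)) ** 2
        for d, s in enumerate(shares, start=1)
    )
    return math.sqrt(squared) / D_MAX


def share_of_digit(values, digit):
    """Fraction of values whose first digit equals `digit`."""
    return first_digit_shares(values)[digit - 1]


days = range(200)
# Hypothetical cumulative total that levels off near 240,000 (logistic curve).
plateau = [round(240_000 / (1 + math.exp(-0.15 * (t - 60)))) for t in days]
# Hypothetical cumulative total that keeps growing by 10% per day.
growing = [round(10 * 1.1 ** t) for t in days]

print("plateau:", d_star(plateau), "share of 2:", share_of_digit(plateau, 2))
print("growing:", d_star(growing))
```

In this synthetic case the plateaued series spends most of its days with a first digit of “2” and exceeds the 0.25 threshold without any manipulation, which is why the daily series are used as the cross-check.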
3.11 Argentina

Argentina is one of the countries in closest agreement with Benford’s distribution, showing no evidence of data manipulation. A maximum d* of 0.09 is seen for the daily deaths.

3.12 Chile

Chile has a small deviation only in the total confirmed cases (Figure 1l). The same does not occur for the other data (Figures 2l, 3l, 4l). The deviation in Figure 1l can be explained by the decrease in new confirmed cases after Chile reached 100 thousand, with growth slowing further until the total was around 300,000. Thus, there is no confirmation of data manipulation.

3.13 Iran

Iran’s daily confirmed cases have a peak at “2” (Figure 2m), resulting in a d* of 0.42, which cannot be explained. Nevertheless, the total confirmed cases follow Benford’s distribution (Figure 1m). The death data are partially in agreement with Benford’s distribution (Figures 3m and 4m), with an odd peak at “1” for daily deaths (Figure 4m).

3.14 United Kingdom

The United Kingdom has high d* values for total confirmed cases and total deaths (Figures 1n and 3n), but low d* values for daily cases and daily deaths (Figures 2n and 4n). The high values are due to the flattening of the curve and the slowing growth of confirmed cases and deaths in the last weeks. The Benfordness in Figures 2n and 4n validates the data.

3.15 France

France’s numbers are like the United Kingdom’s, with low Benfordness for the totals (Figures 1o and 3o) and high Benfordness for the daily numbers (Figures 2o and 4o). For total confirmed cases, a peak at “1” is seen, due to the flattening between 100,000 and 199,999 (Figure 1o). In conclusion, the results for France are valid, without any apparent manipulation.

3.16 Saudi Arabia

Saudi Arabia shows good Benfordness for total confirmed cases, daily confirmed cases, and total deaths (Figures 1p, 2p, 3p). In contrast, daily deaths do not follow Benford’s curve, with a 102.9% curve.

3.17 China

China’s numbers show great Benfordness, especially for total and daily confirmed cases and total deaths (Figures 1q, 2q, 3q). A higher d* is seen for daily deaths, with a peak at “1”. This is because the data cover individual provinces and cities rather than the summed national totals used for the other countries in our study. Hubei province was the one mainly affected by the virus, with daily deaths consistently higher than 10, while other provinces frequently reported single daily deaths, thus creating the peak at “1”.

3.18 Philippines

There are no indications of data alteration in the Philippines’ numbers (Figures 1r, 2r, 3r, 4r), with a maximum d* of 0.18.

3.19 Belgium

Belgium has good Benfordness for confirmed cases, both total and daily (Figures 1s and 2s). A high peak at “9” for total deaths is due to the flattening of deaths between 9,000 and 9,999, and the d* of 0.06 for daily deaths reaffirms the lack of data manipulation.

3.20 Pakistan

Pakistan shows great Benfordness for all data (Figures 1t, 2t, 3t, 4t), with a maximum d* of 0.28.

3.21 Italy

Italy’s numbers show peaks at “2” and “3” for total confirmed cases and total deaths, respectively. These are due to the recent decrease in infections in the country. The daily data have high Benfordness, with a maximum d* of 0.12.

4. Conclusions

The applicability of Benford’s Law to COVID-19 data was confirmed. It proved a valid method to measure deviations in countries’ datasets and to flag possible data manipulation. Our analysis suggested a high possibility of manipulation in Russia’s COVID-19 numbers across all the data: total and daily confirmed cases and deaths. Small deviations were also seen in Iran’s daily confirmed cases and daily deaths. No evidence of data manipulation is seen in data from the USA, Brazil, India, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia, China, the Philippines, Belgium, Pakistan, and Italy.

References

1. Newcomb S. Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics. 1881;4(1):39-40.
2. Benford F. The Law of Anomalous Numbers. Proceedings of the American Philosophical Society. 1938;78(4):551-72.
3. Cho W, Gaines B. Breaking the (Benford) Law: Statistical Fraud Detection in Campaign Finance. The American Statistician. 2007;61:218-23.
4. Durtschi C, Hillison WA, Pacini C, editors. The effective use of Benford's Law to assist in detecting fraud in accounting data. 2004.
5. Grammatikos T, Papanikolaou NI. Applying Benford’s Law to Detect Accounting Data Manipulation in the Banking Industry. Journal of Financial Services Research. 2020.
6. Winter C, Schneider M, Yannikos Y, editors. Detecting Fraud Using Modified Benford Analysis. Advances in Digital Forensics VII; 2011; Berlin, Heidelberg: Springer Berlin Heidelberg.
7. Deckert J, Myagkov M, Ordeshook PC. Benford's Law and the Detection of Election Fraud. Political Analysis. 2011;19(3):245-68.
8. Beber B, Scacco A. What the numbers say: A digit-based test for election fraud. Political Analysis. 2012;20(2):211-34.
9. Brady HE. Comments on Benford's Law and the Venezuelan Election. MS dated January. 2005;19:2005.
10. Mebane Jr WR, editor. Election forensics: Vote counts and Benford’s law. Summer Meeting of the Political Methodology Society, UC-Davis, July; 2006.
11. Sambridge M, Tkalčić H, Jackson A. Benford's law in the natural sciences. Geophysical Research Letters. 2010;37(22).
12. Sambridge M, Jackson A. National COVID numbers - Benford's law looks for errors. Nature. 2020;581(7809):384.
13. Lee K-B, Han S, Jeong Y. COVID-19, flattening the curve, and Benford’s law. Physica A: Statistical Mechanics and its Applications. 2020;559:125090.
14. Koch C, Okamura K. Benford's Law and COVID-19 Reporting. SSRN. 2020.
15. Idrovo AJ, Manrique-Hernández EF. Data Quality of Chinese Surveillance of COVID-19: Objective Analysis Based on WHO's Situation Reports. Asia Pac J Public Health. 2020;32(4):165-7.
16. Raul I. How Valid are the Reported Cases of People Infected with Covid-19 in the World? International Journal of Coronaviruses. 2020;1(2):53-6.
17. Goodman W. The promises and pitfalls of Benford's law. Significance. 2016;13(3):38-41.
18. Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases. 2020;20(5):533-4.

Figure 1. Total confirmed cases for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 2. Daily confirmed cases for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 3. Total deaths for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 4. Daily deaths for (a) the whole world and (b-u) selected countries. The black line refers to Benford's Law probability.