All content in this area was uploaded by Andre E. Vellwock on Sep 18, 2020
Is COVID-19 data reliable? A statistical analysis with Benford’s Law
Anran Wei, Andre E. Vellwock
Abstract
Benford’s Law is applied as a method to analyze large datasets and detect data
manipulation. It is widely recognized as a valid method to combat financial fraud and tax
evasion. Here, we studied its application to COVID-19 datasets, targeting data
manipulation in the following: total confirmed cases, daily confirmed cases, total
confirmed deaths, and daily confirmed deaths. We considered the countries with the most
total confirmed cases on 1 September 2020, plus China. General results showed that
COVID-19 numbers do follow Benford’s Law. Moreover, no evidence of data
manipulation is seen for data from the USA, Brazil, India, Peru, South Africa, Colombia,
Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia, China,
Philippines, Belgium, Pakistan, and Italy. Results suggest a high possibility of data
manipulation for Russia’s data. A small divergence is present in Iran’s numbers.

Keywords: COVID-19, Benford’s Law, statistics, coronavirus, data manipulation, data
analysis

1. Introduction
Benford’s Law, also called the Newcomb–Benford Law, was first observed by
Newcomb (1) and popularized by Benford (2). The Law is widely applied to test the
authenticity of data in various fields of daily life, mostly in
financial applications (3), accounting fraud (4-6), and politics (3, 7-10). Benford’s Law
states that the first digit of a naturally occurring decimal number is more likely to be
equal to 1 than to any other digit, and that the probabilities of the first digit being
equal to the subsequent digits, i.e., 2 to 9, decrease progressively. With strong evidence
that the numbers of common diseases indeed follow Benford’s distribution (11), studies have
attempted to analyze the Benfordness of COVID-19 data, namely, how well the numbers fit
Benford’s Law. From here on, we mention the date when each study was made or published,
since COVID-19 numbers may change over time. Sambridge and Jackson (12) studied data up to
9 April 2020 and suggested that data from the United States, Japan, Indonesia, and most
European countries follow the distribution well, but they also pointed out anomalies. The
authors illustrated that the data for total deaths in the Czech Republic do not follow
Benford’s Law at all; specifically, they have a high peak at “9”, contrary to the
distribution. Besides showing that COVID-19 follows Benford’s Law, they also confirmed that
the time window over which the datasets are evaluated is crucial. For example, once the
growth of the virus is reduced and “daily confirmed cases” flattens out, if the cases are
already in the thousands, the first digit tends to become constant. The same consideration
was made by Lee et al. (13) in a paper submitted on 28 April 2020. Koch and Okamura
(14), on 28 April 2020, demonstrated that the confirmed case numbers of the USA, Italy,
and China match the Law, showing high Benfordness and no data
manipulation for these countries. Idrovo and Manrique-Hernández (15), with data up to
15 March 2020, likewise found no data manipulation in China’s numbers. Over a similar
period, checking confirmed cases data until 30 April 2020, Raul (16) indicated that Italy,
Portugal, the Netherlands, the United Kingdom, Denmark, Belgium, and Chile may have
altered COVID data, in a larger study of 23 countries. All these studies showed that
COVID data can be viewed from the perspective of Benford’s Law, but their conclusions are
somewhat contradictory regarding which countries show possible data manipulation.

2. Methods
2.1 Benford’s Law and d* factor
Benford’s Law can be explained, loosely, by the intuition that the mantissa of the
logarithms of exponentially growing numbers tends to be uniformly distributed. For an
arbitrary positive decimal number $x$, its logarithm can be written as

$$\log_{10} x = n + m,$$

where $n$ is the integer part and $m \in [0, 1)$ is the mantissa of the logarithm of $x$.
The first digit of $x$ is determined by the log-mantissa $m$ alone. For numbers with
first digit equal to $d$, the corresponding log-mantissa lies in the interval
$[\log_{10} d, \log_{10}(d+1))$. Under the hypothesis of a uniformly distributed
log-mantissa, the probability of the first digit being equal to $d$ is thus given by

$$P(d) = \log_{10}(d+1) - \log_{10} d = \log_{10}\left(1 + \frac{1}{d}\right).$$

Here, we list the probability of each digit from 1 to 9 being the first digit
according to Benford’s Law, as shown in Table 1.

Table 1. First-digit distribution according to Benford’s Law.

d      1      2      3      4      5      6      7      8      9
P(d)   30.1%  17.6%  12.5%  9.7%   7.9%   6.7%   5.8%   5.1%   4.6%

In our study, deviations from Benford’s curve were quantified by the d*-factor (d*) (3, 17),
which is essentially the Euclidean distance between a country’s first-digit distribution and
Benford’s distribution, normalized by the maximum possible distance, 1.03606, which occurs
when the only peak is at “9” and all other first digits have zero frequency (17). If a
dataset matches Benford’s curve exactly, d* is equal to 0.0. A larger Euclidean distance
means a lower Benfordness, with d* closer to 1.0. Goodman (17) proposed that a d* higher
than 0.25 is strong evidence of data manipulation. The calculation of d* is expressed as

$$d^* = \frac{1}{1.03606} \sqrt{\sum_{d=1}^{9} \left( P_{\mathrm{obs}}(d) - P(d) \right)^2},$$

where $d$ is the first digit from 1 to 9, $P(d)$ is Benford’s probability from Table 1,
and $P_{\mathrm{obs}}(d)$ stands for the probability distribution of
each first digit in real datasets.
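
As an illustration of the definitions above, the following Python sketch computes Benford’s
probabilities and d* for a list of counts. It is not the authors’ code: the names
`first_digit`, `first_digit_distribution` and `d_star` are assumptions made for this
example, and skipping zero counts is also an assumption, since the text does not say how
zeros were handled.

```python
import math
from collections import Counter

# Benford's first-digit probabilities P(d) = log10(1 + 1/d) for d = 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Maximum possible distance quoted in the text (all first digits equal to 9).
MAX_DISTANCE = 1.03606


def first_digit(value):
    """Leading decimal digit of a non-zero integer count."""
    return int(str(abs(int(value)))[0])


def first_digit_distribution(values):
    """Observed fraction of each first digit 1..9 (assumption: zero counts are skipped)."""
    digits = [first_digit(v) for v in values if int(v) != 0]
    if not digits:
        raise ValueError("no non-zero values to analyze")
    counts = Counter(digits)
    return [counts[d] / len(digits) for d in range(1, 10)]


def d_star(observed, reference=BENFORD):
    """Euclidean distance between two first-digit distributions, divided by MAX_DISTANCE."""
    distance = math.sqrt(sum((o - r) ** 2 for o, r in zip(observed, reference)))
    return distance / MAX_DISTANCE
```

For example, `d_star(first_digit_distribution([2 ** k for k in range(1, 200)]))` is close
to 0, since powers of 2 are known to follow Benford’s Law closely, while a distribution
concentrated entirely on “9” gives a value of about 1.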
2.2 Data collection and preprocessing
Data were obtained on 1 September 2020 from the COVID-19 Data Repository by the Center
for Systems Science and Engineering (CSSE) at Johns Hopkins University (18). We
considered the countries with the highest total cases, namely the USA, Brazil, India,
Russia, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, Iran, the United
Kingdom, France, Saudi Arabia, Philippines, Belgium, Pakistan, and Italy. China was
included as it was the first country with a surge in confirmed cases. Except for the USA,
the UK, France, and China, the imported data already present the country results per day
for total confirmed cases, daily confirmed cases, total deaths, and daily deaths. For the
USA, the UK, and France, the data are divided into regions (provinces or cities), so we
summed them to obtain the country data per day. As cases in China quickly increased and
stabilized in a matter of days, the country-level datasets of China are relatively small
and thus not adequate for applying Benford’s Law. Therefore, the data of China were
preprocessed in a different way: they were not summed to give the country’s data per day,
but left as provinces’ data per day, which enlarges the dataset and allows us to test its
Benfordness. Then, the first digits of these data for all countries or provinces are
recorded to obtain the
following probability distributions.
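
The preprocessing described above can be sketched in Python as follows. This is a minimal
illustration rather than the authors’ pipeline: the row layout `(country, region, series)`,
the function names, and the toy numbers are all hypothetical, and deriving daily counts
from cumulative totals by differencing is an assumption, since the text states that daily
values were already present in the imported data for most countries.

```python
from collections import defaultdict


def country_totals(rows):
    """Sum cumulative series over regions.

    rows: iterable of (country, region, [cumulative count per day]).
    """
    totals = defaultdict(list)
    for country, _region, series in rows:
        if not totals[country]:
            totals[country] = list(series)
        else:
            totals[country] = [a + b for a, b in zip(totals[country], series)]
    return dict(totals)


def daily_from_total(total):
    """Day-over-day differences; the first day keeps its own value."""
    if not total:
        return []
    return [total[0]] + [b - a for a, b in zip(total, total[1:])]


def first_digits(values):
    """Leading digits of strictly positive values (zeros and negative corrections skipped)."""
    return [int(str(v)[0]) for v in values if v > 0]


# Toy, invented numbers: one country split into two regions, one reported as a whole.
rows = [
    ("A", "north", [1, 3, 6]),
    ("A", "south", [0, 2, 4]),
    ("B", "", [5, 5, 9]),
]
totals = country_totals(rows)
print(totals["A"], daily_from_total(totals["A"]), first_digits(totals["A"]))
```

For a country treated like China in the text, the regional series would be kept separate
and their first digits pooled, instead of calling `country_totals`.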
3. Results and discussions
Results are presented in sections covering the data of the whole world and of each
selected country. Figures 1-4 illustrate the results, comparing the data for the
whole world and each country to the Benford’s Law distribution (black line). Figure 1
and Figure 2 show the total confirmed cases and daily confirmed cases, respectively.
Figure 3 and Figure 4 show the total deaths and the daily deaths, respectively.

3.1 World
The d* values of the whole world data were below 0.10 (Figures 1a, 2a, 3a, 4a). Moreover,
the d* values of daily cases and total deaths were below 0.03 (Figures 2a, 3a). These
results validate the use of Benford’s Law for COVID-19 data, for all four series studied
here.

3.2 USA
The USA is an ideal country for analyzing the numbers, as it has consistent daily cases and
thus a large dataset. A small variation in the daily cases is seen in Figure 2b, with a
slightly larger peak at “2”. However, this is not seen in the total cases (Figure 1b).
Figure 3b shows an extremely large peak at “1” and a high d*. Nevertheless, this occurs
because the total death numbers exceeded 100 thousand, keeping the first digit at “1” for a
long time. The daily deaths support this explanation by showing a good correlation with
Benford’s Law. This highlights the need to check whether deviations in the data have a
natural explanation before drawing final conclusions.

3.3 Brazil
Brazil is another case of high Benfordness. However, a small variation can be seen in the
daily death number, with a d* of 0.21. This is due to daily deaths between 1,000 and
1,999, especially in the last weeks of the studied data.

3.4 India
There are no indications of data alteration in India’s numbers (Figures 1d, 2d, 3d, 4d),
with a maximum d* of 0.15 and good Benfordness.

3.5 Russia
Results suggest a high possibility of data manipulation in Russia’s data. Figure 1e
illustrates the lack of Benfordness for the total confirmed cases. The pattern resembles a
random distribution: if we calculate d* relative to a constant probability of 1/9 for all
first digits, it is 0.13, a value lower than the one relative to the Benford
distribution (0.20). Daily cases (Figure 2e) reconfirm the lack of correlation with
Benford’s Law, with a d* of 0.30 and no apparent large peak, leading to the conclusion
that the high d* is not due to a constant first digit, as seen in the USA and Brazil, but
most probably to data alteration. Death numbers are also off: the high fraction of “1” in
the total deaths (Figure 3e) is explained by the total exceeding 10,000. However, the
almost equal fractions of the other first digits, “2” to “9”, suggest a constant growth of
the total deaths. This behavior is not exhibited in the other countries. Daily deaths
(Figure 4e) also
do not follow Benford’s Law.
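
The comparison against a constant probability of 1/9 mentioned above can be sketched as
follows. This is an illustrative reading, not the authors’ code: reusing the constant
1.03606 when the reference is the uniform distribution is an assumption made so that both
values share one scale, and the input `flat` is a hypothetical distribution, not Russia’s
data.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
UNIFORM = [1 / 9] * 9
MAX_DISTANCE = 1.03606  # normalizing constant quoted in Section 2.1


def d_star(observed, reference):
    """Normalized Euclidean distance between two first-digit distributions."""
    distance = math.sqrt(sum((o - r) ** 2 for o, r in zip(observed, reference)))
    return distance / MAX_DISTANCE


def compare_references(observed):
    """Return (d* against Benford, d* against the uniform 1/9 distribution)."""
    return d_star(observed, BENFORD), d_star(observed, UNIFORM)


# Hypothetical, perfectly flat first-digit fractions (not Russia's actual data).
flat = [1 / 9] * 9
to_benford, to_uniform = compare_references(flat)
print(f"d* vs Benford: {to_benford:.2f}, d* vs uniform: {to_uniform:.2f}")
```

A dataset whose second value is smaller than its first is closer to a flat first-digit
pattern than to Benford’s curve, which is the situation the text reports for Russia’s total
confirmed cases.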
3.6 Peru
Total confirmed cases and total deaths have a good correlation with Benford’s curve
(Figures 1f and 3f). A small deviation is shown in Figures 2f and 4f, but the latter can be
explained by a more frequent daily death count between 100 and 199, in agreement with the
country’s size and its current COVID-19 situation.

3.7 South Africa
There are no indications of data alteration in South Africa’s numbers (Figures 1g, 2g,
3g, 4g), with a maximum d* of 0.10 and good Benfordness.

3.8 Colombia
There are no indications of data alteration in Colombia’s numbers (Figures 1h, 2h,
3h, 4h), with a maximum d* of 0.12. Extremely low values of d* are seen for the confirmed
cases, both total and daily.

3.9 Mexico
There are no indications of data alteration in Mexico’s numbers (Figures 1i, 2i, 3i,
4i), with a maximum d* of 0.17.

3.10 Spain
Spain’s numbers might give a misleading indication, with d* higher than 0.45 for total
cases and total deaths. Nevertheless, this is due to the reduction of transmission and
deaths and the consequent constancy of the first digit. The total confirmed cases
stabilized around 200,000 and the total deaths around 20,000. Naturally, peaks at “2” are
present in Figures 1j and 3j. The lack of data manipulation is confirmed by a low d* in
daily confirmed cases (Figure 2j) and daily deaths (Figure 4j).
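
The effect of a plateau on the first digit can be reproduced with a small synthetic
example. The series below is invented for illustration and is not Spain’s data; the growth
rate, the plateau length, and the helper name `d_star` are assumptions of this sketch.

```python
import math
from collections import Counter

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
MAX_DISTANCE = 1.03606  # normalizing constant quoted in Section 2.1


def d_star(values):
    """d* of the first digits of a list of positive integers, relative to Benford."""
    counts = Counter(int(str(v)[0]) for v in values)
    observed = [counts[d] / len(values) for d in range(1, 10)]
    distance = math.sqrt(sum((o - b) ** 2 for o, b in zip(observed, BENFORD)))
    return distance / MAX_DISTANCE


# Synthetic cumulative totals: 10% daily growth from 100 up to a bit over 200,000 ...
growth = [int(100 * 1.1 ** k) for k in range(81)]
# ... followed by 200 days of slow increase, during which the first digit stays "2".
plateau = [growth[-1] + 10 * k for k in range(1, 201)]

print("growth phase only:", round(d_star(growth), 2))
print("growth + plateau: ", round(d_star(growth + plateau), 2))
```

The second value is much larger than the first and exceeds the 0.25 threshold, although
nothing in the synthetic series was manipulated. This is why a high d* for totals is read
together with the d* of the corresponding daily series.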
3.11 Argentina
Argentina is one of the countries in closest agreement with Benford’s distribution,
showing no evidence of data manipulation. A maximum d* of 0.09 is seen for the daily
deaths.

3.12 Chile
Chile has a small deviation only in the total confirmed cases (Figure 1l). However, the
same does not occur for the other data (Figures 2l, 3l, 4l). The deviation in Figure 1l can
be explained by the slowing growth of confirmed cases after Chile reached 100 thousand,
which slowed further until the total was around 300,000. Thus, there is no
confirmation of data manipulation.

3.13 Iran
Iran’s daily confirmed cases have a peak at “2” (Figure 2m), resulting in a d* of 0.42,
which cannot be explained. Nevertheless, the total confirmed cases follow
Benford’s distribution well (Figure 1m). The other data are partially in agreement with
Benford’s distribution (Figures 3m and 4m), with an odd peak at “1” for daily deaths
(Figure 4m).

3.14 United Kingdom
The United Kingdom curves have high d* values for total confirmed cases and deaths
(Figures 1n and 3n), but low d* values for daily cases and deaths (Figures 2n and 4n). The
high values are due to the flattening of the curve and the slowing growth of
confirmed cases and deaths in the last weeks. The Benfordness in Figures 2n and 4n
validates the data.

3.15 France
France’s numbers are like the United Kingdom’s, with low Benfordness for total numbers
(Figures 1o and 3o) and high Benfordness for daily numbers (Figures 2o and 4o).
Regarding total confirmed cases, a peak at “1” is seen, due to the flattening between
100,000 and 199,999 (Figure 1o). In conclusion, the results for France are valid, without
any apparent manipulation.

3.16 Saudi Arabia
Saudi Arabia shows good Benfordness for total confirmed cases, daily confirmed cases,
and total deaths (Figures 1p, 2p, 3p). In contrast, daily deaths do not follow Benford’s
curve, with a 102.9% curve.

3.17 China
China’s numbers show great Benfordness, especially for total and daily confirmed cases
and total deaths (Figures 1q, 2q, 3q). A higher d* is seen for daily deaths, with a peak at
“1”. This is explained by the fact that the data consider provinces and cities, rather than
the summed country total used for the other countries in our study. Hubei province was the
one mainly affected by the virus, with daily deaths consistently higher than 10, while
other provinces frequently showed single daily deaths, thus creating the peak at “1”.

3.18 Philippines
There are no indications of data alteration in the Philippines’ numbers (Figures 1r, 2r,
3r, 4r), with a maximum d* of 0.18.

3.19 Belgium
Belgium has good Benfordness for confirmed cases, total and daily (Figures 1s and 2s).
A high peak at “9” for total deaths is due to the flattening of deaths between 9,000 and
9,999, and the d* of 0.06 for daily deaths reaffirms the lack of data manipulation.

3.20 Pakistan
Pakistan shows great Benfordness for all data (Figures 1t, 2t, 3t, 4t), with a maximum d*
of 0.28.

3.21 Italy
Italy’s numbers show peaks at “2” and “3” for total confirmed cases and total deaths,
respectively. These are due to the recent decrease in infections in the country. Daily data
have high Benfordness, with a maximum d* of 0.12.

4. Conclusions
The applicability of Benford’s Law to assess COVID-19 data was confirmed. It
proved to be a valid method to measure variations in countries’ datasets and to flag
possible data manipulation. Our analysis suggested a high possibility of manipulation
in Russia’s COVID-19 numbers, for all the data: total and daily confirmed cases and
deaths. Small deviations were also seen in Iran’s daily confirmed cases and daily deaths.
No data manipulation is shown for data from the USA, Brazil, India, Peru, South Africa,
Colombia, Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia,
China, Philippines, Belgium, Pakistan, and Italy.

References
1. Newcomb S. Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics. 1881;4(1):39-40.
2. Benford F. The Law of Anomalous Numbers. Proceedings of the American Philosophical Society. 1938;78(4):551-72.
3. Cho W, Gaines B. Breaking the (Benford) Law: Statistical Fraud Detection in Campaign Finance. The American Statistician. 2007;61:218-23.
4. Durtschi C, Hillison WA, Pacini C, editors. The effective use of Benford's Law to assist in detecting fraud in accounting data. 2004.
5. Grammatikos T, Papanikolaou NI. Applying Benford’s Law to Detect Accounting Data Manipulation in the Banking Industry. Journal of Financial Services Research. 2020.
6. Winter C, Schneider M, Yannikos Y, editors. Detecting Fraud Using Modified Benford Analysis. Advances in Digital Forensics VII; 2011; Berlin, Heidelberg: Springer Berlin Heidelberg.
7. Deckert J, Myagkov M, Ordeshook PC. Benford's Law and the Detection of Election Fraud. Political Analysis. 2011;19(3):245-68.
8. Beber B, Scacco A. What the numbers say: A digit-based test for election fraud. Political Analysis. 2012;20(2):211-34.
9. Brady HE. Comments on Benford's Law and the Venezuelan Election. MS dated January. 2005;19:2005.
10. Mebane Jr WR, editor. Election forensics: Vote counts and Benford’s law. Summer Meeting of the Political Methodology Society, UC-Davis, July; 2006.
11. Sambridge M, Tkalčić H, Jackson A. Benford's law in the natural sciences. Geophysical Research Letters. 2010;37(22).
12. Sambridge M, Jackson A. National COVID numbers - Benford's law looks for errors. Nature. 2020;581(7809):384.
13. Lee K-B, Han S, Jeong Y. COVID-19, flattening the curve, and Benford’s law. Physica A: Statistical Mechanics and its Applications. 2020;559:125090.
14. Koch C, Okamura K. Benford's Law and COVID-19 Reporting. SSRN. 2020.
15. Idrovo AJ, Manrique-Hernández EF. Data Quality of Chinese Surveillance of COVID-19: Objective Analysis Based on WHO's Situation Reports. Asia Pac J Public Health. 2020;32(4):165-7.
16. Raul I. How Valid are the Reported Cases of People Infected with Covid-19 in the World? International Journal of Coronaviruses. 2020;1(2):53-6.
17. Goodman W. The promises and pitfalls of Benford's law. Significance. 2016;13(3):38-41.
18. Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases. 2020;20(5):533-4.

Figure 1. Total confirmed cases for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 2. Daily confirmed cases for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 3. Total deaths for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 4. Daily deaths for (a) the whole world and (b-u) selected countries. The black line refers to Benford's Law probability.