All content in this area was uploaded by Andre E. Vellwock on Sep 18, 2020

Is COVID-19 data reliable? A statistical analysis with Benford’s Law

Anran Wei, Andre E. Vellwock

Abstract

Benford’s Law is used to detect data manipulation in large datasets and is widely recognized as a valid tool against financial fraud and tax evasion. Here, we study its application to COVID-19 datasets, looking for data manipulation in four series: total confirmed cases, daily confirmed cases, total confirmed deaths, and daily confirmed deaths. We considered the countries with the most total confirmed cases on 1 September 2020, plus China. Overall, the results show that COVID-19 numbers do follow Benford’s Law. Moreover, no evidence of data manipulation is seen for data from the USA, Brazil, India, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia, China, the Philippines, Belgium, Pakistan, and Italy. The results suggest a high possibility of data manipulation in Russia’s data. A small divergence is present in Iran’s numbers.

Keywords: COVID-19, Benford’s Law, statistics, coronavirus, data manipulation, data analysis

1. Introduction

Benford’s Law, also called the Newcomb–Benford Law, was first observed by Newcomb (1) and popularized by Benford (2). The Law is widely applied to test the authenticity of data in various fields of daily life, most often in financial applications (3), accounting fraud detection (4-6), and politics (3, 7-10). Benford’s Law states that the first digit of a naturally occurring decimal number is most likely to be 1, and that the probabilities of the first digit being one of the subsequent digits, i.e., 2 to 9, decrease progressively. With strong evidence that the numbers of common diseases indeed follow Benford’s distribution (11), studies have attempted to analyze the Benfordness of COVID-19 data, namely, how well the numbers fit Benford’s Law. From here on, we mention the date when each study was made or published, since COVID-19 numbers may change over time. Sambridge and Jackson (12) studied data until 9 April 2020 and suggested that data from the United States, Japan, Indonesia, and most European countries follow the distribution well, but they also pointed out anomalies. The authors illustrated that the data for total deaths in the Czech Republic do not follow Benford’s Law at all; specifically, they have a high peak at “9”, contrary to the distribution. While showing that COVID-19 data follow Benford’s Law, they also confirmed that the time window in which the datasets are evaluated is crucial. For example, once the growth of the virus slows and “daily confirmed cases” flatten out with cases already in the thousands, the first digit tends to become constant. The same consideration was made by Lee et al. (13) in a paper submitted on 28 April 2020. Koch and Okamura (14), on 28 April 2020, demonstrated that the confirmed-case numbers of the USA, Italy and China match the Law, showing high Benfordness and no data manipulation for these countries. Idrovo and Manrique-Hernández (15), with data up to 15 March 2020, likewise found no data manipulation in China’s numbers. In the same period, checking confirmed-case data until 30 April 2020, Raul (16) indicated in a large study of 23 countries that Italy, Portugal, the Netherlands, the United Kingdom, Denmark, Belgium and Chile may have altered COVID-19 data. All these studies showed that COVID-19 data can be viewed from the perspective of Benford’s Law, but their conclusions are somewhat contradictory regarding which countries show possible data manipulation.

2. Methods

2.1 Benford’s Law and d* factor

Benford’s Law can be explained intuitively by noting that the mantissa of the logarithm of exponentially growing numbers tends to be uniformly distributed. For an arbitrary decimal number $N$, its logarithm can be written in the form

$$\log_{10} N = n + m$$

where $n$ is the integer part and $m$ (with $0 \le m < 1$) is the mantissa of the logarithm of $N$. The first digit of $N$ is therefore determined by the log-mantissa $m$. For numbers with first digit equal to $d$, the corresponding log-mantissa lies in the interval $[\log_{10} d, \log_{10}(d+1))$. Under the hypothesis of a uniformly distributed log-mantissa, the probability of the first digit being equal to $d$ is thus given by

$$P(d) = \log_{10}(d+1) - \log_{10} d = \log_{10}\left(1 + \frac{1}{d}\right)$$

Here, we list the probability of each digit from 1 to 9 being the first digit according to Benford’s Law, as shown in Table 1.

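Table 1. First-digit distribution according to Benford’s Law.

| d    | 1     | 2     | 3     | 4    | 5    | 6    | 7    | 8    | 9    |
|------|-------|-------|-------|------|------|------|------|------|------|
| P(d) | 30.1% | 17.6% | 12.5% | 9.7% | 7.9% | 6.7% | 5.8% | 5.1% | 4.6% |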
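As a quick check of Table 1, the first-digit probabilities can be computed from the standard closed form of Benford’s Law, P(d) = log10(1 + 1/d). The following is a minimal Python sketch; the function name `benford_probabilities` is our own choice for illustration and is not part of the original study.

```python
import math


def benford_probabilities():
    """Return {d: P(d)} for first digits d = 1..9, with P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    # Print the probabilities as percentages with one decimal place,
    # in the same form as Table 1.
    for digit, prob in benford_probabilities().items():
        print(f"{digit}: {prob:.1%}")
```

The nine probabilities sum to 1 because the intervals of the log-mantissa for d = 1 to 9 tile the range from 0 to 1 exactly.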

In our study, deviations from Benford’s curve were quantified by the d*-factor (d*) (3, 17), which is the Euclidean distance between a country’s first-digit distribution and Benford’s distribution, normalized by the maximum possible distance, 1.03606. This maximum corresponds to the situation in which all first digits are “9” and no other first digit occurs (17). If a dataset matches Benford’s curve exactly, d* is equal to 0.0. A larger Euclidean distance means lower Benfordness, with d* approaching 1.0. Goodman (17) proposed that a d* higher than 0.25 is strong evidence of data manipulation. The calculation of d* is expressed as

$$d^* = \frac{1}{1.03606}\sqrt{\sum_{d=1}^{9}\left[p(d) - P(d)\right]^2}$$

where $d$ is the first digit from 1 to 9, $P(d)$ is the Benford probability listed in Table 1, and $p(d)$ stands for the probability distribution of each first digit in real datasets.
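The d* calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function names are hypothetical, zeros are skipped because they have no leading digit (our assumption), and the normalization constant is simply the value 1.03606 quoted in the text.

```python
import math
from collections import Counter

MAX_DISTANCE = 1.03606  # normalization constant quoted in the text (17)
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(value):
    """Return the leading decimal digit of a non-zero integer."""
    return int(str(abs(int(value)))[0])


def first_digit_frequencies(values):
    """Observed proportion of each first digit 1..9 (zeros are skipped)."""
    digits = [first_digit(v) for v in values if int(v) != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return [counts.get(d, 0) / len(digits) for d in range(1, 10)]


def d_star(values):
    """Euclidean distance to Benford's distribution, divided by MAX_DISTANCE."""
    observed = first_digit_frequencies(values)
    distance = math.sqrt(sum((o - b) ** 2 for o, b in zip(observed, BENFORD)))
    return distance / MAX_DISTANCE


if __name__ == "__main__":
    # Powers of two are a classic example of a sequence close to Benford's Law.
    print(d_star([2 ** k for k in range(1, 200)]))
```

Calling `d_star` on a country’s daily or total series would give one number per series, which is how the values quoted in Section 3 are meant to be read.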

2.2 Data collection and preprocessing

Data were obtained on 1 September 2020 from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (18). We considered the countries with the highest total cases: the USA, Brazil, India, Russia, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, Iran, the United Kingdom, France, Saudi Arabia, the Philippines, Belgium, Pakistan, and Italy. China was included as it was the first country with a surge in confirmed cases. Except for the USA, the UK, France and China, the imported data already present the country results per day for total confirmed cases, daily confirmed cases, total deaths and daily deaths. For the USA, the UK and France, data are divided into regions (provinces or cities), so we summed them to obtain the country data per day. As cases in China quickly increased and stabilized in a matter of days, the country-level datasets of China are relatively small and thus not adequate for applying Benford’s Law. The data of China were therefore preprocessed differently: they were not summed into country data per day but left as provinces’ data per day, which enlarges the dataset and allows us to test its Benfordness. Finally, the first digits of these data, for whole countries or provinces, were recorded to obtain the probability distributions that follow.
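The preprocessing steps above can be sketched as follows. This is a hedged illustration with made-up numbers, not the study’s pipeline: the function names and the toy `regions` dictionary are hypothetical, deriving daily values as day-to-day differences of the cumulative series is our assumption, and zeros are skipped because they have no first digit.

```python
from collections import Counter


def sum_regions(regional_totals):
    """Sum per-region cumulative series of equal length into one country series."""
    return [sum(day) for day in zip(*regional_totals.values())]


def daily_from_total(totals):
    """Day-to-day increments of a cumulative series (first day kept as is)."""
    return [totals[0]] + [b - a for a, b in zip(totals, totals[1:])]


def first_digits(values):
    """Leading digits of the non-zero entries."""
    return [int(str(abs(v))[0]) for v in values if v != 0]


def digit_distribution(values):
    """Proportion of each first digit 1..9 among the non-zero entries."""
    digits = first_digits(values)
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


if __name__ == "__main__":
    # Toy cumulative counts for two hypothetical regions over four days.
    regions = {"A": [1, 3, 7, 15], "B": [0, 2, 5, 9]}

    # Country-level treatment (as for the USA, the UK and France): sum first.
    country_total = sum_regions(regions)
    print(digit_distribution(daily_from_total(country_total)))

    # Province-level treatment (as for China): pool the regional values instead.
    pooled = [v for series in regions.values() for v in series]
    print(digit_distribution(pooled))
```

Pooling the regional series gives one first digit per region per day instead of one per day, which is why it enlarges a short dataset.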

3. Results and discussions

Results are presented in sections for the whole world and for each selected country. Figures 1-4 compare the data for the whole world and for each country with the Benford’s Law distribution (black line). Figures 1 and 2 show the total confirmed cases and daily confirmed cases, respectively, while Figures 3 and 4 show the total deaths and daily deaths.

3.1 World

The d* values of the whole-world data were below 0.10 (Figures 1a, 2a, 3a, 4a). Moreover, those for daily cases and total deaths were below 0.03 (Figures 2a, 3a). These results support the use of Benford’s Law for COVID-19 data in all four series studied here.

3.2 USA

The USA is an ideal country for this analysis, as it has consistently reported daily cases and thus provides a large dataset. A small variation in the daily cases is seen in Figure 2b, with a slightly larger peak at “2”. However, this is not seen in the total cases (Figure 1b). Figure 3b shows an extremely large peak at “1” and a high d*. This occurs because the total death count exceeded 100 thousand and has kept the first digit “1” for a long time. The daily deaths support this explanation by showing good agreement with Benford’s Law. This highlights the need to check whether deviations in the data have a natural explanation before drawing final conclusions.

3.3 Brazil

Brazil is another case of high Benfordness. However, a small deviation can be seen in the daily deaths, with a d* of 0.21. This is due to daily deaths staying between 1,000 and 1,999, especially in the last weeks of the studied data.

3.4 India

There are no indications of data alteration in India’s numbers (Figures 1d, 2d, 3d, 4d), with a maximum d* of 0.15 and good Benfordness.

3.5 Russia

The results suggest a high possibility of data manipulation in Russia’s data. Figure 1e illustrates the lack of Benfordness for the total confirmed cases. The pattern resembles a random (uniform) distribution: if we calculate d* relative to a constant probability of 1/9 for all first digits, we obtain 0.13, a value lower than the one relative to the Benford distribution (0.20). Daily cases (Figure 2e) reconfirm the lack of correlation with Benford’s Law, with a d* of 0.30 and no apparent large peak. This leads to the conclusion that the high d* is not due to a constant first digit, as seen in the USA and Brazil, but most probably to data alteration. The death numbers are also off. The high fraction of “1” in the total deaths (Figure 3e) is explained by the count exceeding 10,000. However, the almost equal fractions of the other first digits, “2” to “9”, suggest a constant growth of the total deaths. This behavior is not exhibited by the other countries. Daily deaths (Figure 4e) also do not follow Benford’s Law.
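The comparison against a constant probability of 1/9 can be sketched by computing the same normalized distance against two reference distributions. The snippet below is only an illustration under stated assumptions: the names are hypothetical, the `observed` shares are invented for the example and are not Russia’s data, and we assume the same 1.03606 normalization is applied to both distances.

```python
import math

MAX_DISTANCE = 1.03606  # normalization constant quoted in Section 2.1
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
UNIFORM = [1 / 9] * 9


def normalized_distance(observed, reference):
    """Euclidean distance between two first-digit distributions, normalized."""
    squared = sum((o - r) ** 2 for o, r in zip(observed, reference))
    return math.sqrt(squared) / MAX_DISTANCE


if __name__ == "__main__":
    # Invented first-digit shares for digits 1..9 (illustration only).
    observed = [0.16, 0.12, 0.11, 0.10, 0.11, 0.10, 0.10, 0.10, 0.10]

    to_benford = normalized_distance(observed, BENFORD)
    to_uniform = normalized_distance(observed, UNIFORM)
    print(f"distance to Benford: {to_benford:.2f}")
    print(f"distance to uniform: {to_uniform:.2f}")
    if to_uniform < to_benford:
        print("closer to a uniform first-digit pattern than to Benford's Law")
```

For orientation, a perfectly uniform distribution lies at roughly 0.225 from Benford’s distribution on this scale, so a dataset can look close to uniform while still sitting below the 0.25 threshold.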

3.6 Peru

Total confirmed cases and total deaths agree well with Benford’s curve (Figures 1f and 3f). A small deviation is shown in Figures 2f and 4f, but the latter can be explained by a more frequent daily death count between 100 and 199, consistent with the country’s size and its current COVID-19 situation.

3.7 South Africa

There are no indications of data alteration in South Africa’s numbers (Figures 1g, 2g, 3g, 4g), with a maximum d* of 0.10 and good Benfordness.

3.8 Colombia

There are no indications of data alteration in Colombia’s numbers (Figures 1h, 2h, 3h, 4h), with a maximum d* of 0.12. Extremely low values of d* are seen for the confirmed cases, both total and daily.

3.9 Mexico

There are no indications of data alteration in Mexico’s numbers (Figures 1i, 2i, 3i, 4i), with a maximum d* of 0.17.

3.10 Spain

Spain’s numbers might give a misleading indication, with d* higher than 0.45 for total cases and total deaths. However, this is due to the reduction of transmission and deaths and the consequent constancy of the first digit. The total confirmed cases stabilized around 200,000 and the total deaths around 20,000. Naturally, peaks at “2” are present in Figures 1j and 3j. The lack of data manipulation is supported by a low d* in daily confirmed cases (Figure 2j) and daily deaths (Figure 4j).
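The effect of a plateau on the first digit can be reproduced with a synthetic series. The sketch below uses made-up numbers, not Spain’s data: a cumulative count doubles for a while and then creeps along just above 200,000, which is enough to concentrate the first digits at “2” and push d* far above 0.25. All names are hypothetical.

```python
import math
from collections import Counter

MAX_DISTANCE = 1.03606  # normalization constant quoted in Section 2.1
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit_shares(values):
    """Proportion of each leading digit 1..9 among the non-zero values."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    counts = Counter(digits)
    return [counts.get(d, 0) / len(digits) for d in range(1, 10)]


def d_star(values):
    """Normalized Euclidean distance between observed shares and Benford's Law."""
    shares = first_digit_shares(values)
    return math.sqrt(sum((s - b) ** 2 for s, b in zip(shares, BENFORD))) / MAX_DISTANCE


# Made-up cumulative series: doubling growth, then a long plateau near 200,000.
growth = [2 ** k for k in range(1, 18)]                 # 2, 4, ..., 131072
plateau = [200_000 + 100 * day for day in range(120)]   # 200,000 ... 211,900
totals = growth + plateau

if __name__ == "__main__":
    shares = first_digit_shares(totals)
    print(f"share of first digit 2: {shares[1]:.2f}")
    print(f"d* during growth only:  {d_star(growth):.2f}")
    print(f"d* including plateau:   {d_star(totals):.2f}")
```

This is why a high d* for a total series is read together with the daily series before drawing any conclusion about manipulation.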

3.11 Argentina

Argentina is one of the countries in closest agreement with Benford’s distribution, showing no evidence of data manipulation. The maximum d* of 0.09 is seen for the daily deaths.

3.12 Chile

Chile has a small deviation only in the total confirmed cases (Figure 1l). However, the same does not occur for the other data (Figures 2l, 3l, 4l). The deviation in Figure 1l can be explained by the slowdown in new confirmed cases after Chile reached 100 thousand, with growth decelerating further as the total approached 300,000. Thus, there is no evidence of data manipulation.

3.13 Iran

Iran’s daily confirmed cases have a peak at “2” (Figure 2m), resulting in a d* of 0.42, which cannot be explained. Nevertheless, the total confirmed cases follow Benford’s distribution well (Figure 1m). The other data are partially in agreement with Benford’s distribution (Figures 3m and 4m), with an odd peak at “1” for daily deaths (Figure 4m).

3.14 United Kingdom

The United Kingdom has high d* values for total confirmed cases and total deaths (Figures 1n and 3n) but low d* values for daily cases and daily deaths (Figures 2n and 4n). The high values are due to the flattening of the curve and the slowing growth of confirmed cases and deaths in the last weeks. The Benfordness in Figures 2n and 4n supports the validity of the data.

3.15 France

France’s numbers are similar to the United Kingdom’s, with low Benfordness for the totals (Figures 1o and 3o) and high Benfordness for the daily numbers (Figures 2o and 4o). For total confirmed cases, a peak at “1” is seen, due to the flattening between 100,000 and 199,999 (Figure 1o). In conclusion, the results for France appear valid, without any apparent manipulation.

3.16 Saudi Arabia

Saudi Arabia shows good Benfordness for total confirmed cases, daily confirmed cases, and total deaths (Figures 1p, 2p, 3p). In contrast, daily deaths do not follow Benford’s curve, with a 102.9% curve.

3.17 China

China’s numbers show great Benfordness, especially for total and daily confirmed cases and total deaths (Figures 1q, 2q, 3q). A higher d* is seen for daily deaths, with a peak at “1”. This is explained by the fact that the data cover provinces and cities rather than the summed country total used for the other countries in our study. Hubei province was the one mainly affected by the virus, with daily deaths consistently higher than 10, while other provinces frequently reported single daily deaths, thus creating the peak at “1”.

3.18 Philippines

There are no indications of data alteration in the Philippines’ numbers (Figures 1r, 2r, 3r, 4r), with a maximum d* of 0.18.

3.19 Belgium

Belgium has good Benfordness for confirmed cases, both total and daily (Figures 1s and 2s). A high peak at “9” for total deaths is due to the flattening of deaths between 9,000 and 9,999. The d* of 0.06 for daily deaths reaffirms the lack of data manipulation.

3.20 Pakistan

Pakistan shows great Benfordness for all data (Figures 1t, 2t, 3t, 4t), with a maximum d* of 0.28.

3.21 Italy

Italy’s numbers show peaks at “2” and “3” for total confirmed cases and total deaths, respectively. These are due to the recent decrease in infections in the country. The daily data have high Benfordness, with a maximum d* of 0.12.

4. Conclusions

The applicability of Benford’s Law to COVID-19 data was confirmed. It proved a valid method for measuring deviations in countries’ datasets and for flagging possible data manipulation. Our analysis suggested a high possibility of manipulation in Russia’s COVID-19 numbers across all series: total and daily confirmed cases and deaths. Small deviations were also seen in Iran’s daily confirmed cases and daily deaths. No data manipulation is indicated for data from the USA, Brazil, India, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia, China, the Philippines, Belgium, Pakistan, and Italy.

References

1. Newcomb S. Note on the Frequency of Use of the Different Digits in Natural Numbers. American Journal of Mathematics. 1881;4(1):39-40.

2. Benford F. The Law of Anomalous Numbers. Proceedings of the American Philosophical Society. 1938;78(4):551-72.

3. Cho W, Gaines B. Breaking the (Benford) Law: Statistical Fraud Detection in Campaign Finance. The American Statistician. 2007;61:218-23.

4. Durtschi C, Hillison WA, Pacini C, editors. The effective use of Benford's Law to assist in detecting fraud in accounting data. 2004.

5. Grammatikos T, Papanikolaou NI. Applying Benford's Law to Detect Accounting Data Manipulation in the Banking Industry. Journal of Financial Services Research. 2020.

6. Winter C, Schneider M, Yannikos Y, editors. Detecting Fraud Using Modified Benford Analysis. Advances in Digital Forensics VII; 2011; Berlin, Heidelberg: Springer Berlin Heidelberg.

7. Deckert J, Myagkov M, Ordeshook PC. Benford's Law and the Detection of Election Fraud. Political Analysis. 2011;19(3):245-68.

8. Beber B, Scacco A. What the numbers say: A digit-based test for election fraud. Political Analysis. 2012;20(2):211-34.

9. Brady HE. Comments on Benford's Law and the Venezuelan Election. MS dated January. 2005;19:2005.

10. Mebane Jr WR, editor. Election forensics: Vote counts and Benford's law. Summer Meeting of the Political Methodology Society, UC-Davis, July; 2006.

11. Sambridge M, Tkalčić H, Jackson A. Benford's law in the natural sciences. Geophysical Research Letters. 2010;37(22).

12. Sambridge M, Jackson A. National COVID numbers - Benford's law looks for errors. Nature. 2020;581(7809):384.

13. Lee K-B, Han S, Jeong Y. COVID-19, flattening the curve, and Benford's law. Physica A: Statistical Mechanics and its Applications. 2020;559:125090.

14. Koch C, Okamura K. Benford's Law and COVID-19 Reporting. SSRN. 2020.

15. Idrovo AJ, Manrique-Hernández EF. Data Quality of Chinese Surveillance of COVID-19: Objective Analysis Based on WHO's Situation Reports. Asia Pac J Public Health. 2020;32(4):165-7.

16. Raul I. How Valid are the Reported Cases of People Infected with Covid-19 in the World? International Journal of Coronaviruses. 2020;1(2):53-6.

17. Goodman W. The promises and pitfalls of Benford's law. Significance. 2016;13(3):38-41.

18. Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases. 2020;20(5):533-4.

Figure 1. Total confirmed cases for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 2. Daily confirmed cases for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 3. Total deaths for (a) the whole world and (b-u) selected countries. The black curve refers to Benford's Law probability.

Figure 4. Daily deaths for (a) the whole world and (b-u) selected countries. The black line refers to Benford's Law probability.