Is COVID-19 data reliable? A statistical analysis with Benford’s Law
Anran Wei, Andre E. Vellwock
Benford’s Law is applied as a method to analyze and find data manipulation in large
datasets. It is consistently recognized as a valid method to combat financial fraud and tax
evasion. Here, we studied its application to COVID-19 datasets, looking for data
manipulation in the following: total confirmed cases, daily confirmed cases, total
confirmed deaths, and daily confirmed deaths. We considered the countries with the most
total confirmed cases as of 1 September 2020, plus China. Overall, the results show that
COVID-19 numbers do follow Benford’s Law. Moreover, no evidence of data
manipulation is seen for data from the USA, Brazil, India, Peru, South Africa, Colombia,
Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia, China,
Philippines, Belgium, Pakistan, and Italy. Results suggest a high possibility of data
manipulation in Russia’s data. A small divergence is present in Iran’s numbers.
Keywords: COVID-19, Benford’s Law, statistics, coronavirus, data manipulation, data
1. Introduction
Benford’s Law, also called the Newcomb–Benford Law, was first observed by
Newcomb (1) and popularized by Benford (2). The Law is widely applied to test the
authenticity of data in various fields. Most applications concern finance (3),
accounting fraud (4-6), and politics (3, 7-10). Benford’s Law states that the first
digit of a naturally occurring decimal number is most likely to be equal to 1, and that
the probabilities of the first digit being equal to the subsequent digits, i.e., 2 to 9,
decrease progressively. With strong evidence that the numbers for common diseases
indeed follow Benford’s distribution (11), studies have attempted to analyze the
Benfordness of COVID-19 data, namely, how well the numbers fit Benford’s Law. From
here on, we mention the date when each study was made or published, since COVID-19
numbers change over time. Sambridge and Jackson (12) studied data until 9 April
2020 and suggested that data from the United States, Japan, Indonesia, and most European
countries follow the distribution well, but they also noted anomalies. The authors
illustrated that the data for total deaths in the Czech Republic do not follow
Benford’s Law at all; specifically, they have a high peak at “9”, contrary to the
distribution. While showing that COVID-19 data follow Benford’s Law, they also found
that the time window over which the datasets are evaluated is crucial. For example, once
the growth of the virus slows and the “daily confirmed cases” flatten out, the first digit
tends to become constant if the cases are already in the thousands. The same observation
was made by Lee et al. (13) in a paper submitted on 28 April 2020. Koch and Okamura
(14), on 28 April 2020, demonstrated that the confirmed case numbers of the USA, Italy,
and China match the Law, showing high Benfordness and no sign of data
manipulation for these countries. Idrovo and Manrique-Hernández (15), with data up to
15 March 2020, likewise found no evidence of data manipulation in China’s numbers. Over
a similar period, checking confirmed case data until 30 April 2020, Raul (16) indicated
that Italy, Portugal, the Netherlands, the United Kingdom, Denmark, Belgium, and Chile may
have altered COVID data, in a larger study of 23 countries. All these studies showed that
COVID-19 data can be examined from the perspective of Benford’s Law, but their
conclusions are somewhat contradictory regarding which countries show possible data
manipulation.
2.1 Benford’s Law and d* factor
Benford’s Law can be explained intuitively by the observation that the mantissas of the
logarithms of exponentially growing numbers tend to be uniformly distributed. For an
arbitrary positive decimal number $x$, its logarithm can be written in the form

$$\log_{10} x = n + m,$$

where $n$ is the integer part and $m \in [0, 1)$ is the mantissa of the logarithm of $x$.
The first digit of $x$ is determined by the log-mantissa $m$ alone. For numbers with
first digit equal to $d$, the corresponding log-mantissa lies in the interval
$[\log_{10} d, \log_{10}(d+1))$. Under the hypothesis of a uniformly distributed
log-mantissa, the probability of the first digit being equal to $d$ is thus given by

$$P(d) = \log_{10}(d+1) - \log_{10} d = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, \dots, 9.$$
Here, we list the probability distribution for each number from 1 to 9 to be the first digit
according to Benford’s Law, as shown in Table 1.
Table 1. First digit distribution according to Benford’s Law.
First digit       1     2     3     4     5     6     7     8     9
Probability (%)   30.1  17.6  12.5  9.7   7.9   6.7   5.8   5.1   4.6
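The intuition that exponentially growing numbers produce Benford-distributed first digits can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: it assumes a hypothetical series growing by 7% per step (the growth rate and the length of 400 steps are arbitrary illustrative choices) and compares the observed first-digit fractions with the logarithmic formula.

```python
import math
from collections import Counter


def benford_probability(d: int) -> float:
    """Probability that the first digit equals d under Benford's Law."""
    return math.log10(1 + 1 / d)


def first_digit(n: int) -> int:
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


# Hypothetical exponentially growing series (assumption): 7% growth per step.
series = [int(1.07 ** k) for k in range(400)]
counts = Counter(first_digit(n) for n in series)

for d in range(1, 10):
    observed = counts[d] / len(series)
    print(f"{d}: observed {observed:.3f}, Benford {benford_probability(d):.3f}")
```

Because the series spans many orders of magnitude, its log-mantissas spread almost evenly over [0, 1), which is the condition the derivation relies on. A series covering only a fraction of one decade would not be expected to show the same agreement.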
In our study, deviations from Benford’s curve were quantified by the d*-factor (d*) (3, 17),
which is essentially the Euclidean distance between a country’s first-digit distribution
and Benford’s distribution, normalized by the maximum possible distance, 1.03606, which
occurs when every first digit is “9” and no other first digit appears (17). If a dataset
matches Benford’s curve exactly, d* is equal to 0.0. A larger Euclidean distance means
lower Benfordness, and d* moves closer to 1.0. Goodman (17) proposed that a d* higher than
0.25 is strong evidence of data manipulation. The calculation of d* is expressed as

$$d^* = \frac{1}{1.03606}\sqrt{\sum_{d=1}^{9}\left(p_d - b_d\right)^2},$$

where $d$ is the first digit from 1 to 9, $p_d$ stands for the observed probability of
each first digit in the real dataset, and $b_d = \log_{10}(1 + 1/d)$ is the corresponding
Benford probability.
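The d* calculation described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `d_star` and the list-based input (nine fractions, one per first digit) are hypothetical, and the normalizing constant is taken as quoted in the text.

```python
import math

MAX_DISTANCE = 1.03606  # normalizing constant as quoted in the text (17)

# Benford probabilities for first digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def d_star(observed):
    """Normalized Euclidean distance between an observed first-digit
    distribution (nine fractions for digits 1..9) and Benford's Law."""
    if len(observed) != 9:
        raise ValueError("expected nine fractions, one per first digit")
    distance = math.sqrt(sum((p - b) ** 2 for p, b in zip(observed, BENFORD)))
    return distance / MAX_DISTANCE


# A dataset that matches Benford's curve exactly.
print(d_star(BENFORD))

# The extreme case described in the text: every number starts with "9".
only_nines = [0.0] * 8 + [1.0]
print(round(d_star(only_nines), 3))
```

The first call returns zero by construction, and the second lands very close to 1, as expected for the maximum-distance case; the last decimals depend on how the Benford probabilities are rounded.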
2.2 Data collection and preprocessing
Data were obtained on 1 September 2020 from the COVID-19 Data Repository by the Center
for Systems Science and Engineering (CSSE) at Johns Hopkins University (18). We
considered the countries with the highest total cases, namely the USA, Brazil, India,
Russia, Peru, South Africa, Colombia, Mexico, Spain, Argentina, Chile, Iran, the United
Kingdom, France, Saudi Arabia, Philippines, Belgium, Pakistan, and Italy. China was
included as it was the first country with a surge in confirmed cases. Except for the USA,
the UK, France and China, the imported data already present the country results per day
for total confirmed cases, daily confirmed cases, total deaths and daily deaths. For the
USA, the UK and France, the data are divided into regions (provinces or cities), so we
summed the regional values to obtain the country data per day. As cases in China quickly
increased and stabilized in a matter of days, the country-level datasets of China are
relatively small and thus not adequate for applying Benford’s Law. Therefore, the data
of China were preprocessed in a different way: they were not summed into country totals
per day but left as provincial data per day, enlarging the dataset and allowing us to test
its Benfordness. The first digits of these country-level or province-level data were then
recorded to obtain the probability distributions that follow.
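The preprocessing steps above (summing regions, deriving daily counts, and recording first digits) can be sketched as follows. This is a minimal illustration with made-up numbers, not the authors' pipeline: the helper names are hypothetical, daily counts are assumed to be day-to-day differences of the cumulative series, and zero or negative entries are assumed to be skipped because they have no first digit from 1 to 9.

```python
from collections import Counter


def sum_regions(regional_totals):
    """Sum per-region cumulative series into one country series per day."""
    return [sum(day_values) for day_values in zip(*regional_totals)]


def daily_from_total(total):
    """Daily counts as day-to-day differences of a cumulative series."""
    return [total[0]] + [b - a for a, b in zip(total, total[1:])]


def first_digit_distribution(values):
    """Fractions of first digits 1..9, ignoring non-positive entries (assumption)."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive values to analyze")
    counts = Counter(digits)
    return [counts[d] / len(digits) for d in range(1, 10)]


# Made-up cumulative counts for two regions over six days (not real data).
region_a = [1, 3, 9, 20, 45, 110]
region_b = [0, 2, 4, 11, 30, 95]

country_total = sum_regions([region_a, region_b])
country_daily = daily_from_total(country_total)

print(country_total)
print(country_daily)
print(first_digit_distribution(country_total))
```

For a province-level analysis such as the one used for China, the regional series would be concatenated rather than summed before calling `first_digit_distribution`.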
3. Results and discussions
Results are presented first for the whole world and then for each of the selected
countries. Figures 1-4 illustrate the results, comparing the data for the
whole world and each country with the Benford’s Law distribution (line in black). Figure 1
and Figure 2 show the total confirmed cases and daily confirmed cases, respectively.
Meanwhile, Figure 3 and Figure 4 show the total death number and the daily death
number, respectively.
The d* values of the whole world data were below 0.10 (Figures 1a, 2a, 3a, 4a). Moreover,
those of daily cases and total deaths were below 0.03 (Figures 2a, 3a). These results
support the use of Benford’s Law for COVID-19 data, for all four quantities studied here.
The USA is an ideal country for this analysis, as it has consistently reported daily
cases and thus provides a large dataset. A small variation in the daily cases is seen in
Figure 2b, with a slightly larger peak at “2”. However, this is not seen in the total
cases (Figure 1b). Figure 3b shows an extremely large peak at “1” and a high d*. This
occurs because the total death count exceeded 100 thousand and then kept the first digit
“1” for a long time. The daily deaths support this explanation by showing good agreement
with Benford’s Law. This highlights the need to check whether a deviation has a natural
explanation before drawing final conclusions.
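The effect of a flattened cumulative curve on d* can be illustrated with synthetic numbers. The sketch below is an assumption-laden toy example, not the USA data: it builds a hypothetical series with 100 days of exponential growth up to about 150,000, followed by 200 days of slow growth between 150,000 and 190,000, and compares d* with and without the plateau.

```python
import math
from collections import Counter

MAX_DISTANCE = 1.03606  # normalizing constant as quoted in the text
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def d_star(values):
    """d* of the first digits of a list of positive integers."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    n = sum(counts.values())
    observed = [counts[d] / n for d in range(1, 10)]
    distance = math.sqrt(sum((p - b) ** 2 for p, b in zip(observed, BENFORD)))
    return distance / MAX_DISTANCE


# Synthetic cumulative series (not real data): exponential growth phase,
# then a long plateau during which the first digit stays at "1".
growth = [int(150_000 ** (k / 99)) for k in range(100)]
plateau = [150_000 + 200 * k for k in range(1, 201)]

print(f"growth phase only:   d* = {d_star(growth):.2f}")
print(f"growth plus plateau: d* = {d_star(growth + plateau):.2f}")
```

In this toy setting the plateau alone pushes d* above the 0.25 threshold even though no value was altered, which is why a large peak in the total counts needs to be cross-checked against the daily counts.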
Brazil is another case of high Benfordness. However, a small variation can be seen in the
daily death number, with a d* of 0.21. This is due to daily deaths between 1,000 and
1,999, especially in the last weeks of the studied data.
There are no indications of data alteration in India’s numbers (Figures 1d, 2d, 3d, 4d),
with a maximum d* of 0.15 and good Benfordness.
Results suggest a high possibility of data manipulation in Russia’s data. Figure 1e
illustrates the lack of Benfordness for the total confirmed cases. The pattern resembles a
random distribution: if we calculate the d* relative to a constant probability of 1/9 for
all first digits, we obtain 0.13, a value lower than the one relative to the Benford
distribution (0.20). The daily cases (Figure 2e) reconfirm the lack of agreement with
Benford’s Law, with a d* of 0.30 and no apparent large peak, leading to the conclusion
that the high d* is not due to a constant first digit, as seen in the USA and Brazil, but
most probably to data alteration. The death numbers also deviate: the high fraction of “1”
in the total deaths (Figure 3e) is explained by the total exceeding 10,000. However, the
almost equal fractions of the other first digits, “2” to “9”, suggest a constant growth of
the total deaths. This behavior is not exhibited in the other countries. Daily deaths
(Figure 4e) also do not follow Benford’s Law.
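The comparison against a constant probability of 1/9 can be sketched as below. The observed fractions are hypothetical values chosen for illustration and are not Russia's data; the sketch also assumes that the distance to the uniform reference is divided by the same constant, 1.03606, as the distance to Benford's distribution, which the text does not state explicitly.

```python
import math

MAX_DISTANCE = 1.03606  # normalizing constant as quoted in the text
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
UNIFORM = [1 / 9] * 9


def normalized_distance(observed, reference):
    """Euclidean distance between two first-digit distributions,
    divided by the same constant used for d* (assumption)."""
    distance = math.sqrt(sum((p - r) ** 2 for p, r in zip(observed, reference)))
    return distance / MAX_DISTANCE


# Hypothetical observed fractions for digits 1..9 (illustrative only).
observed = [0.16, 0.13, 0.12, 0.11, 0.10, 0.10, 0.10, 0.09, 0.09]

to_benford = normalized_distance(observed, BENFORD)
to_uniform = normalized_distance(observed, UNIFORM)
print(f"distance to Benford: {to_benford:.2f}")
print(f"distance to uniform: {to_uniform:.2f}")
print("closer to uniform" if to_uniform < to_benford else "closer to Benford")
```

A dataset that sits closer to the uniform reference than to Benford's curve, without a single dominant peak, is the pattern the paragraph above describes for the total confirmed cases.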
Peru’s total confirmed cases and total deaths agree well with Benford’s curve (Figures 1f
and 3f). A small deviation is shown in Figures 2f and 4f, but the latter can be explained
by a frequent daily death count between 100 and 199, consistent with the country’s size
and its current COVID-19 situation.
3.7 South Africa
There are no indications of data alteration in South Africa’s numbers (Figures 1g, 2g,
3g, 4g), with a maximum d* of 0.10 and good Benfordness.
There are no indications of data alteration in Colombia’s numbers (Figures 1h, 2h,
3h, 4h), with a maximum d* of 0.12. Extremely low values of d* are seen for the confirmed
cases, both total and daily.
There are no indications of data alteration in Mexico’s numbers (Figures 1i, 2i, 3i,
4i), with a maximum d* of 0.17.
Spain’s numbers might give a misleading indication, with d* higher than 0.45 for total
cases and total deaths. However, this is due to the reduction in transmission and deaths,
which kept the first digit constant. The total confirmed cases stabilized
around 200,000 cases and the total deaths around 20,000. Naturally, peaks at “2” are
present in Figures 1j and 3j. The absence of data manipulation is supported by a low d* in
daily confirmed cases (Figure 2j) and daily deaths (Figure 4j).
Argentina is one of the countries with the most agreement with Benford’s distribution,
showing no evidence of data manipulation. A maximum d* of 0.09 seen for the daily
Chile has a small deviation only in the total confirmed cases (Figure 1l). However, the
same does not occur for the other data (Figures 2l, 3l, 4l). The deviation in Figure 1l can
be explained by the slowdown in new confirmed cases after Chile reached 100 thousand,
with the decline becoming steeper until the cases were around 300,000. Thus, there is no
evidence of data manipulation.
Iran’s daily confirmed cases have a peak at “2” (Figure 2m), resulting in a d* of 0.42,
for which we found no explanation. Nevertheless, the total confirmed cases follow
Benford’s distribution well (Figure 1m). The other data are partially in agreement with
Benford’s Law (Figures 3m and 4m), with an odd peak at “1” for daily deaths (Figure 4m).
3.14 United Kingdom
The United Kingdom’s curves have high d* values for total confirmed cases and deaths
(Figures 1n and 3n), but low d* values for daily cases and deaths (Figures 2n and 4n). The
high values are due to the flattening of the curve and the slowing growth of
confirmed cases and deaths in the last weeks. Benfordness in Figures 2n and 4n validates
France’s numbers resemble the United Kingdom’s, with low Benfordness for the total numbers
(Figures 1o and 3o) but high Benfordness for the daily numbers (Figures 2o and 4o).
For total confirmed cases, a peak at “1” is seen, due to the flattening between
100,000 and 199,999 (Figure 1o). In conclusion, the results for France are valid, without
any apparent manipulation.
3.16 Saudi Arabia
Saudi Arabia shows good Benfordness for total confirmed cases, daily confirmed cases,
and total deaths (Figures 1p, 2p, 3p). In contrast, daily deaths do not follow Benford’s curve,
with a 102.9% curve.
China’s numbers show great Benfordness, especially for total and daily confirmed cases,
and total deaths (Figures 1q, 2q, 3q). A higher d* is seen for daily deaths, with a peak at
“1”. This is because the data are taken per province and city rather than summed over the
whole country as for the other countries in our study. Hubei province was the one mainly
affected by the virus, presenting daily deaths consistently higher than 10, while other
provinces frequently reported a single daily death, thus creating the peak at “1”.
There are no indications of data alteration in the Philippines’ numbers (Figures 1r, 2r,
3r, 4r), with a maximum d* of 0.18.
Belgium has good Benfordness for confirmed cases, total and daily (Figures 1s and 2s).
A high peak at “9” for total deaths is due to the flattening of deaths between 9,000 and
9,999, and the d* of 0.06 for daily deaths reaffirms the lack of data manipulation.
Pakistan shows great Benfordness for all data (Figures 1t, 2t, 3t, 4t), with a maximum d*
Italy’s numbers show peaks at “2” and “3” for total confirmed cases and total deaths,
respectively. These are due to the recent decrease in infections in the country. Daily
data have high Benfordness, with a maximum d* of 0.12.
4. Conclusions
The applicability of Benford’s Law to the assessment of COVID-19 data was confirmed. It
proved a valid method to measure deviations in countries’ datasets and to flag
possible data manipulation. Our analysis suggested a high possibility of manipulation
in Russia’s COVID-19 numbers, for all the data: total and daily confirmed cases and
deaths. Small deviations were also seen for Iran’s daily confirmed cases and daily deaths.
No evidence of data manipulation is seen for data from the USA, Brazil, India, Peru, South Africa,
Colombia, Mexico, Spain, Argentina, Chile, the United Kingdom, France, Saudi Arabia,
China, Philippines, Belgium, Pakistan, and Italy.
References
1. Newcomb S. Note on the Frequency of Use of the Different Digits in Natural
Numbers. American Journal of Mathematics. 1881;4(1):39-40.
2. Benford F. The Law of Anomalous Numbers. Proceedings of the American
Philosophical Society. 1938;78(4):551-72.
3. Cho W, Gaines B. Breaking the (Benford) Law: Statistical Fraud Detection in
Campaign Finance. The American Statistician. 2007;61:218-23.
4. Durtschi C, Hillison WA, Pacini C, editors. The effective use of Benford's Law to
assist in detecting fraud in accounting data. 2004.
5. Grammatikos T, Papanikolaou NI. Applying Benford’s Law to Detect Accounting
Data Manipulation in the Banking Industry. Journal of Financial Services Research. 2020.
6. Winter C, Schneider M, Yannikos Y, editors. Detecting Fraud Using Modified
Benford Analysis. Advances in Digital Forensics VII; 2011; Berlin, Heidelberg:
Springer Berlin Heidelberg.
7. Deckert J, Myagkov M, Ordeshook PC. Benford's Law and the Detection of
Election Fraud. Political Analysis. 2011;19(3):245-68.
8. Beber B, Scacco A. What the numbers say: A digit-based test for election fraud.
Political Analysis. 2012;20(2):211-34.
9. Brady HE. Comments on Benford's Law and the Venezuelan Election. MS dated
10. Mebane Jr WR, editor. Election forensics: Vote counts and Benford’s law.
Summer Meeting of the Political Methodology Society, UC-Davis, July; 2006.
11. Sambridge M, Tkalčić H, Jackson A. Benford's law in the natural sciences.
Geophysical Research Letters. 2010;37(22).
12. Sambridge M, Jackson A. National COVID numbers - Benford's law looks for
errors. Nature. 2020;581(7809):384.
13. Lee K-B, Han S, Jeong Y. COVID-19, flattening the curve, and Benford’s law.
Physica A: Statistical Mechanics and its Applications. 2020;559:125090.
14. Koch C, Okamura K. Benford's Law and COVID-19 Reporting. SSRN. 2020.
15. Idrovo AJ, Manrique-Hernández EF. Data Quality of Chinese Surveillance of
COVID-19: Objective Analysis Based on WHO's Situation Reports. Asia Pac J Public
16. Raul I. How Valid are the Reported Cases of People Infected with Covid-19 in
the World? International Journal of Coronaviruses. 2020;1(2):53-6.
17. Goodman W. The promises and pitfalls of Benford's law. Significance.
18. Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-
19 in real time. The Lancet Infectious Diseases. 2020;20(5):533-4.
Figure 1. Total confirmed cases for (a) the whole world and (b-u) selected countries. The
black curve refers to Benford's Law probability.
Figure 2. Daily confirmed cases for (a) the whole world and (b-u) selected countries. The
black curve refers to Benford's Law probability.
Figure 3. Total deaths for (a) the whole world and (b-u) selected countries. The black
curve refers to Benford's Law probability.
Figure 4. Daily deaths for (a) the whole world and (b-u) selected countries. The black line
refers to Benford's Law probability.