Article

The promises and pitfalls of Benford's law


Abstract

Since the 1990s, a mathematical phenomenon known as Benford’s law has been held aloft as a guard against fraud – as a way to check whether data sets are free from interference. Benford’s law does tell us something interesting about the frequency of leading digits in many natural data sets. But if a data set deviates from Benford’s law, is that evidence that the figures within are fraudulent? Not necessarily. Without an error term (which many articles fail to mention) it is too imprecise to say simply that a data set “does not conform”. To rectify this, this paper presents a concrete, empirical estimate of the phenomenon’s sampling distribution, where it is applicable. Many published test results claiming to have found non-conformance to Benford’s law in post hoc examinations of records actually report levels of variation that are well within the range of ordinary variation. Available online: DOI: 10.1111/j.1740-9713.2016.00919.x
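As a quick illustration of the leading-digit frequencies the abstract describes, Benford's expected first-digit probabilities can be computed directly (an illustrative Python sketch, not code from the paper):

```python
import math

def benford_first_digit_probs():
    """Expected first-digit frequencies under Benford's law:
    P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
# Digit 1 leads roughly 30.1% of the time, digit 9 only about 4.6%,
# and the nine probabilities telescope to a sum of exactly 1.
```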


... Another prevalent technique is Pearson's chi-square (χ²) goodness-of-fit test with a confirmatory null hypothesis [7,9,19]. It is well known that the χ² test is sensitive to sample size and cannot make reliable inferences when the dataset consists of 5000 observations or more [2,19]. In principle, if the sample size is too large, the null hypothesis will likely be rejected even if there is no meaningful difference between the observed and expected frequencies. ...
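The sample-size sensitivity described in that snippet can be demonstrated numerically. The sketch below is illustrative (the perturbation size is arbitrary, and 15.51 is the standard χ² critical value for 8 degrees of freedom at the 5% level): the same small deviation passes at n = 1,000 but fails at n = 1,000,000.

```python
import math

def chi_square_stat(observed_counts, expected_probs):
    """Pearson chi-square statistic over the nine first digits."""
    n = sum(observed_counts.values())
    return sum(
        (observed_counts[d] - n * expected_probs[d]) ** 2 / (n * expected_probs[d])
        for d in expected_probs
    )

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# A small, fixed deviation: move 0.5% of probability mass from digit 1 to 9.
perturbed = dict(benford)
perturbed[1] -= 0.005
perturbed[9] += 0.005

# The statistic grows linearly with n (fractional "counts" keep the
# illustration exact), so the same deviation is accepted at n = 1,000
# but rejected at n = 1,000,000 against the df=8, 5% cutoff of ~15.51.
stat_small = chi_square_stat({d: 1_000 * perturbed[d] for d in perturbed}, benford)
stat_large = chi_square_stat({d: 1_000_000 * perturbed[d] for d in perturbed}, benford)
```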
... As a technique less sensitive to sample size, researchers apply the d-factor (d*), calculated by Equation (3), d* = √(Σᵢ (pᵢ − p̂ᵢ)²), where pᵢ and p̂ᵢ are the observed and expected frequencies [2,7,19]. The d* thus measures the Euclidean distance between the observed and expected frequencies of leading digits. ...
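A minimal sketch of the d* statistic described above, together with Goodman's 0.25 rule of thumb (illustrative code, not taken from any of the cited papers):

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def d_star(observed_props):
    """Euclidean distance between observed first-digit proportions
    and Benford's expected proportions."""
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(observed_props, BENFORD)))

# Goodman's rule of thumb: treat d* <= 0.25 as conformity
# (a heuristic, not a formal test statistic).
uniform = [1 / 9] * 9          # every first digit equally likely
d_uniform = d_star(uniform)    # sits just under the 0.25 threshold
```

Note that even a uniform digit distribution lands slightly below 0.25, which is one reason the cutoff has been criticized as too permissive.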
Article
Full-text available
When it comes to COVID-19, access to reliable data is vital. It is crucial for the scientific community to use data reported by independent territories worldwide. This study evaluates the reliability of the pandemic data disclosed by 182 countries worldwide. We collected and assessed conformity of COVID-19 daily infections, deaths, tests, and vaccinations with Benford’s law since the beginning of the coronavirus pandemic. It is commonly accepted that the frequency of leading digits of the pandemic data should conform to Benford’s law. Our analysis of Benfordness indicates that most countries distributed partially reliable data over the past eighteen months. Notably, the UK, Australia, Spain, Israel, and Germany, followed by 22 other nations, provided the most reliable COVID-19 data within the same period. In contrast, twenty-six nations, including Tajikistan, Belarus, Bangladesh, and Myanmar, published less reliable data on the coronavirus spread. In this context, over 31% of countries worldwide seem to have improved reliability. Our measurement of Benfordness moderately correlates with Johns Hopkins’ Global Health Security Index, suggesting that the quality of data may depend on national healthcare policies and systems. We conclude that economically or politically distressed societies have declined in conformity to the law over time. Our results are particularly relevant for policymakers worldwide.
... The Benford's Law distribution is such that the percent of occurrences beginning with a "1" is about 30.1% compared to the percent of instances beginning with a "9" is about 4.6%. Although not all data sets follow Benford's Law (i.e., social security numbers, zip codes, atomic weights, date/time, etc.), data which are calculated (i.e., price times quantity) typically conform to Benford's Law as they often involve logarithmic-type distributions (Goodman, 2016). ...
... The results of this complete analysis of the Example 1 sample may be found in Table 2. Although a table can be useful for examining the results, a visualization of the data (Figure 2) helps illustrate conformity, or lack of conformity to Benford's Law by analyzing observed and expected results (Goodman, 2016;Lesperance et al., 2016). In this case, the expected and observed data points are consistent with Benford's Law. ...
... Observed Count Example 2 Expected Count Example 2 particularly with a larger dataset. These results create a starting point for further investigation (Goodman, 2016). In this case, it is possible that expenses over $600 need additional approvals and paperwork. ...
Article
Full-text available
For some time, there has been a call for cross‐disciplinary teaching within the business disciplines. With the rise of data and analytics, there is an opportunity for cross‐disciplinary teaching by integrating technology throughout the business curriculum. However, many business professors have little experience in cross‐disciplinary teaching. We hope to rectify this by introducing an approach that uses prelecture material to prepare students for learning concepts and terms that are not core to the focal class. In a study that combines Structured Query Language (SQL) coding with an audit principle called Benford's Law, we analyze the impact of adding prelecture material about Benford's Law on student's cognitive load, knowledge gained, and instructional efficiency. Results indicate that students who received a prelecture on Benford's Law outperformed the control group in performance measures related to Benford's Law, and that the instructional efficiency on Benford's Law was also higher for the treatment group. We did not find any significant differences in cognitive load between the two groups, nor did the treatment group perform significantly better when tested on the SQL concepts. We conclude that cross‐discipline teaching is a fine balance between each discipline and communicating to students the importance of both disciplines is key.
... They then use the cutoff values from the chi-squared or similar distributions and give a "yes-or-no" type of answer to their binary research question. These test statistics and inference results greatly depend on the sample size and selected cutoff values 34 . With large enough sample sizes, the null hypothesis of compliance with the NBL will be rejected in almost every case. ...
... Using the conservative cutoff value of 0.25 for D proposed by Goodman 34 , we find that we cannot reject the NBL distribution for the entire world population for the aggregate cumulative number of confirmed cases and deaths. For individual countries, however, we find that 51 countries do not conform to the NBL when reporting the number of confirmed cases; and 86 countries do not conform to the NBL when reporting the number of deaths. ...
Article
Full-text available
The COVID-19 pandemic has spurred controversies related to whether countries manipulate reported data for political gains. We study the association between accuracy of reported COVID-19 data and developmental indicators. We use the Newcomb–Benford law (NBL) to gauge data accuracy. We run an OLS regression of an index constructed from developmental indicators (democracy level, gross domestic product per capita, healthcare expenditures, and universal healthcare coverage) on goodness-of-fit measures to the NBL. We find that countries with higher values of the developmental index are less likely to deviate from the Newcomb-Benford law. The relationship holds for the cumulative number of reported deaths and total cases but is more pronounced for the death toll. The findings are robust for second-digit tests and for a sub-sample of countries with regional data. The NBL provides a first screening for potential data manipulation during pandemics. Our study indicates that data from autocratic regimes and less developed countries should be treated with more caution. The paper further highlights the importance of independent surveillance data verification projects.
... Nevertheless, many voices were quick to challenge this overly consensual message... First of all, many empirical datasets are known to fully disobey Benford's law ([50, 32, 57, 54, 7, 19]). In addition, this law often appeared to be a good approximation of reality, but no more than an approximation ([54, 53, 19, 27, 29]). Goodman, for example, in [29], discussed the necessity of introducing an error term. ...
... In addition, this law often appeared to be a good approximation of reality, but no more than an approximation ([54, 53, 19, 27, 29]). Goodman, for example, in [29], discussed the necessity of introducing an error term. Even the 20 different domains tested by Benford (in [8]) displayed large fluctuations around theoretical values. ...
... Nevertheless, many discordant voices brought a significantly different message. Putting aside the distributions known to fully disobey Benford's law ([36, 22, 43, 40, 4, 12]), this law often appeared to be a good approximation of reality, but no more than an approximation ([40, 39, 12, 17, 20]). Goodman, for example, in [20], discussed the necessity of introducing an error term. ...
... Putting aside the distributions known to fully disobey Benford's law ([36, 22, 43, 40, 4, 12]), this law often appeared to be a good approximation of reality, but no more than an approximation ([40, 39, 12, 17, 20]). Goodman, for example, in [20], discussed the necessity of introducing an error term. Even the 20 different domains tested by Benford (in [5]) displayed large fluctuations around theoretical values. ...
Preprint
In this paper, we will see that the proportion of d as pth digit, where p > 1 and d ∈ ⟦0, 9⟧, in data (obtained via the model developed below) is more likely to follow a law whose probability distribution is determined by a specific upper bound, rather than the generalization of Benford's law to digits beyond the first one. These probability distributions fluctuate around theoretical values determined by Hill in 1995. Knowing the value of the upper bound beforehand can be a way to find a better-adjusted law than Hill's.
... There are number of necessary requirements for a dataset to fulfil in order to obey Benford's law distribution. According to Goodman (2016) ...
Article
Full-text available
This paper presents the application of Benford's law in psychological pricing detection. Benford's law is a naturally occurring law which states that digits have predictable frequencies of appearance, with digit one having the highest frequency. Psychological pricing is one of the marketing pricing strategies directed at price setting which has a psychological impact on certain consumers. In order to investigate the application of Benford's law in psychological pricing detection, Benford's law is observed in the case of first and last digits. In order to inspect whether the first and last digits of the observed prices are distributed according to the Benford's law distribution or the discrete uniform distribution respectively, the mean absolute deviation measure, chi-square tests and Kolmogorov-Smirnov Z tests are used. Results of the analysis conducted on three price datasets have shown that the most dominant first digits are 1 and 2. On the other side, the most dominant last digits are 0, 5 and 9 respectively. The chi-square tests and Kolmogorov-Smirnov Z tests have shown that, at a significance level of 5%, none of the three observed price datasets has a first-digit distribution that fits the Benford's law distribution. Likewise, mean absolute deviation values have shown that there are large differences between the last-digit distributions and the discrete uniform distribution, implying psychological pricing in all price datasets.
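Digit extraction of the kind used in that pricing study can be sketched as follows; prices are kept as strings so that trailing zeros (which matter for last-digit tests) survive. The price values here are made up for illustration:

```python
def first_digit(price_str):
    """Leading significant digit of a price given as a string."""
    digits = price_str.replace(".", "").lstrip("0")
    return int(digits[0])

def last_digit(price_str):
    """Final quoted digit of a price string, e.g. '1.99' -> 9."""
    return int(price_str.replace(".", "")[-1])

prices = ["1.99", "2.95", "19.90", "24.99", "10.00"]
firsts = [first_digit(p) for p in prices]  # [1, 2, 1, 2, 1]
lasts = [last_digit(p) for p in prices]    # [9, 5, 0, 9, 0]
```

Keeping prices as floats would silently drop the trailing zero of 19.90, which is exactly the digit a psychological-pricing test cares about.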
... A good mechanism for explaining the uneven distributions stipulated by Benford's law has been proposed in [41]. Benford's law has been used for evaluating possible fraud in accounting data [42], legal status [43], election data [44][45][46], macroeconomic data [47], price data [48], etc. From Equation (4), we observe that beyond the small digits, the probability P(d) = log₁₀(1 + 1/d) approximately approaches the Zipf distribution with α = 1. ...
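The Zipf approximation mentioned in that snippet can be checked numerically: the first-order expansion log₁₀(1 + 1/d) ≈ 1/(d ln 10) is poor for digit 1 but close by digit 9 (an illustrative sketch):

```python
import math

def benford(d):
    """Benford first-digit probability."""
    return math.log10(1 + 1 / d)

def zipf_alpha1(d):
    # First-order expansion: log10(1 + 1/d) ≈ 1 / (d ln 10),
    # i.e. a Zipf-like 1/d decay for the larger digits.
    return 1 / (d * math.log(10))

# Relative error of the approximation shrinks as d grows.
rel_err = {d: abs(benford(d) - zipf_alpha1(d)) / benford(d) for d in range(1, 10)}
```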
Article
Full-text available
Mankind has long been fascinated by emergence in complex systems. With the rapidly accumulating big data in almost every branch of science, engineering, and society, a golden age for the study of complex systems and emergence has arisen. Among the many values of big data are to detect changes in system dynamics and to help science to extend its reach, and most desirably, to possibly uncover new fundamental laws. Unfortunately, these goals are hard to achieve using black-box machine-learning based approaches for big data analysis. Especially, when systems are not functioning properly, their dynamics must be highly nonlinear, and as long as abnormal behaviors occur rarely, relevant data for abnormal behaviors cannot be expected to be abundant enough to be adequately tackled by machine-learning based approaches. To better cope with these situations, we advocate to synergistically use mainstream machine learning based approaches and multiscale approaches from complexity science. The latter are very useful for finding key parameters characterizing the evolution of a dynamical system, including malfunctioning of the system. One of the many uses of such parameters is to design simpler but more accurate unsupervised machine learning schemes. To illustrate the ideas, we will first provide a tutorial introduction to complex systems and emergence, then we present two multiscale approaches. One is based on adaptive filtering, which is excellent at trend analysis, noise reduction, and (multi)fractal analysis. The other originates from chaos theory and can unify the major complexity measures that have been developed in recent decades. To make the ideas and methods better accessed by a wider audience, the paper is designed as a tutorial survey, emphasizing the connections among the different concepts from complexity science. 
Many original discussions, arguments, and results pertinent to real-world applications are also presented so that readers can be best stimulated to apply and further develop the ideas and methods covered in the article to solve their own problems. This article is intended both as a tutorial and a survey. It can be used as course material, including for intensive summer training courses. When the material is used for teaching purposes, it will be beneficial to motivate students to have hands-on experiences with the many methods discussed in the paper. Instructors as well as readers interested in the computer analysis programs are welcome to contact the corresponding author.
... The Euclidean distance employed in this work, on the other hand, is independent of sample size and hence provides a metric that only becomes more precise with increasing sample size, but does not run away. Clearly, a disadvantage of using the Euclidean distance is that it is not a formal test statistic with associated statistical power (although Goodman (2016) suggested that data can be said to follow Benford's law when the Euclidean distance is shorter than ∼0.25). Many researchers have investigated and have proposed suitable metrics that can quantify statistical (dis)agreement between data and Benford's law (e.g. the Cramér-von Mises metric; Lesperance et al. 2016). ...
Article
Context. Benford’s law states that for scale- and base-invariant data sets covering a wide dynamic range, the distribution of the first significant digit is biased towards low values. This has been shown to be true for wildly different datasets, including financial, geographical, and atomic data. In astronomy, earlier work showed that Benford’s law also holds for distances estimated as the inverse of parallaxes from the ESA H IPPARCOS mission. Aims. We investigate whether Benford’s law still holds for the 1.3 billion parallaxes contained in the second data release of Gaia ( Gaia DR2). In contrast to previous work, we also include negative parallaxes. We examine whether distance estimates computed using a Bayesian approach instead of parallax inversion still follow Benford’s law. Lastly, we investigate the use of Benford’s law as a validation tool for the zero-point of the Gaia parallaxes. Methods. We computed histograms of the observed most significant digit of the parallaxes and distances, and compared them with the predicted values from Benford’s law, as well as with theoretically expected histograms. The latter were derived from a simulated Gaia catalogue based on the Besançon galaxy model. Results. The observed parallaxes in Gaia DR2 indeed follow Benford’s law. Distances computed with the Bayesian approach of Bailer-Jones et al. (2018, AJ, 156, 58) no longer follow Benford’s law, although low-value ciphers are still favoured for the most significant digit. The prior that is used has a significant effect on the digit distribution. Using the simulated Gaia universe model snapshot, we demonstrate that the true distances underlying the Gaia catalogue are not expected to follow Benford’s law, essentially because the interplay between the luminosity function of the Milky Way and the mission selection function results in a bi-modal distance distribution, corresponding to nearby dwarfs in the Galactic disc and distant giants in the Galactic bulge. 
In conclusion, Gaia DR2 parallaxes only follow Benford’s Law as a result of observational errors. Finally, we show that a zero-point offset of the parallaxes derived by optimising the fit between the observed most-significant digit frequencies and Benford’s law leads to a value that is inconsistent with the value that is derived from quasars. The underlying reason is that such a fit primarily corrects for the difference in the number of positive and negative parallaxes, and can thus not be used to obtain a reliable zero-point.
... A recent paper which used Benford's law to look at COVID-19 reporting data in Iran, the US and UK from February to April 2020 (Ghafari et al. 2020), cited (Goodman 2016) and noted problems with the Benford measurement and COVID, writing: ...
Preprint
Full-text available
In this paper, we use Monte Carlo simulations with a SIRD model parameterised from the literature and test with several metrics whether Benford's law is fulfilled in 4 different scenarios. The results confirm that the Newcomb-Benford law could theoretically be an adequate tool to assess COVID-19 infection data reporting. The challenges in using Benford's law in epidemic reporting are posed by the counting process in the real world, where non-malignant errors are introduced by a lack of tests. One should therefore see Benford's law not as a fraud-detection tool, but as an assistive tool for measuring reporting effectiveness in the real world.
... On the other hand, a small fraction of suspected shell companies exhibited conformity. This finding is supported by previous studies arguing that Benford's Law does not have the absolute power to segregate naturally occurring data from managed data (Diekmann & Jann, 2010;Goodman, 2016;Kovalerchuk et al., 2007). At best, Benford's Law can be used as a part of a set of tools for segregating natural data occurrences and cooked data as in itself Benford's Law may fail to identify fudged financial data in a precise manner. ...
... The fit to this distribution, or partially modified forms, can be used as a content-related indicator (Bredl, Winker, and Kotschau 2012;Porras and English 2004;Schäfer et al. 2004;Schräpler and Wagner 2005;Swanson et al. 2003). For further information on the assumptions of Benford's Law, see Goodman (2016). Another content-related challenge for falsifiers is correctly estimating the frequency of rare or sensitive attributes; falsifiers often lack information about the real distribution of these attributes in the population. ...
... Nevertheless, many discordant voices brought a significantly different message. Putting aside the distributions known to fully disobey Benford's law (Raimi (1976); Hill (1988); Tolle et al. (2000); Scott and Fasli (2001); Beer (2009); Deckert et al. (2011)), this law often appeared to be a good approximation of reality, but no more than an approximation (Scott and Fasli (2001); Saville (2006); Deckert et al. (2011); Gauvrit and Delahaye (2011); Goodman (2016)). Goodman, for example, in Goodman (2016), discussed the necessity of introducing an error term. ...
Article
Full-text available
In this paper, we will see that the proportion of d as pth digit, where p > 1 and d ∈ [[0, 9]], in data (obtained thanks to the hereunder developed model) is more likely to follow a law whose probability distribution is determined by a specific upper bound, rather than the generalization of Benford’s law to digits beyond the first one. These probability distributions fluctuate around theoretical values of the distribution of the pth digit of Benford’s law. Knowing beforehand the value of the upper bound can be a way to find a better adjusted law than Benford’s one.
... (3) have many entries; and (4) are not intentionally designed. Such datasets have been called "Benford suitable" by Goodman [7]. ...
Article
Full-text available
The frequency of the first digits of numbers drawn from an exponential probability density oscillate around the Benford frequencies. Analysis, simulations and empirical evidence show that datasets must have at least 10,000 entries for these oscillations to emerge from finite-sample noise. Anecdotal evidence from population data is provided.
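The oscillation around Benford frequencies for exponential samples can be reproduced with a short simulation (a sketch with an arbitrary seed, unit rate, and sample size; not the paper's code):

```python
import math
import random

def first_digit(x):
    """Leading significant digit of a positive number."""
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)

random.seed(42)  # arbitrary seed, for reproducibility only
n = 100_000
counts = [0] * 10
for _ in range(n):
    counts[first_digit(random.expovariate(1.0))] += 1

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
observed = [counts[d] / n for d in range(1, 10)]
# The observed frequencies track Benford's values but oscillate
# around them rather than matching exactly.
max_dev = max(abs(o - b) for o, b in zip(observed, benford))
```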
... where D = 1.03631 is a normalization factor that ensures the normalized Euclidean distance is bounded by 0 and 1. A measure of fit to check concordance with Benford's law has been proposed by Goodman [14]. His "rule of thumb", which has been used in the literature (see, e.g., [15]) but whose statistical validity has been criticized in [13], is that compliance with Benford's law occurs when d* ≤ 0.25. It is worth observing that the use of the Cho–Gaines normalized Euclidean distance d* together with Goodman's rule of thumb for compliance with Benford's law would give a highly questionable result. In Fig. 2, we show the observed first-digit frequency distributions of weekly case counts for 15 selected countries superimposed on Benford's law. ...
Preprint
Full-text available
Using the Euclidean distance statistical test of Benford's law, we analyze the COVID-19 weekly case counts by country. While 62% of the 100 countries and territories considered in the present study conform to Benford's law at a significance level of α = 0.05 and 17% at a significance level of 0.01 ≤ α < 0.05, the remaining 21% show a deviation from it (p-values smaller than 0.01). In particular, 5% of countries "break" Benford's law with a p-value smaller than 0.001.
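The normalization constant 1.03631 that appears in this literature is the maximal Euclidean distance to Benford's distribution, attained when all mass falls on digit 9; a brief sketch of the normalized statistic (illustrative code, under that assumption):

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
D_MAX = 1.03631  # maximal Euclidean distance: all mass on digit 9

def d_star_norm(props):
    """Cho-Gaines-style normalized Euclidean distance to Benford,
    bounded between 0 and 1."""
    d = math.sqrt(sum((p - b) ** 2 for p, b in zip(props, BENFORD)))
    return d / D_MAX

worst = [0.0] * 8 + [1.0]  # every first digit is a 9
```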
... For a data set that conforms perfectly to Benford's law, d* = 0.0; for a data set that is as non-conforming as possible, d* = 1.0. Goodman (2016) proposed that a d* above 0.25 is strong evidence of data manipulation. ...
Article
COVID-19 has become a pandemic that has spread at an unexpected pace around the world, with more than 450 million cases and 6 million deaths worldwide at the start of March 2022. Benford's law is a statistical technique that serves to determine whether fraud has been committed in a structure of repeated numerical data. In this study, 18 countries with more than 5 million cases worldwide were ranked using grey relational analysis combined with Benford's law, an effective method of detecting data fraud. The 18 countries are ranked separately for two years of data with the help of the grey relational analysis method and the results of Benford's analysis. According to the results of the study, some countries showed changes in data reliability between 2020 and 2021. The data of Germany, France, and the Netherlands were found to be the most reliable.
... The probability distribution does not show any conclusive evidence to suggest a manipulation of data in any of the three countries. From this, we conclude that the likely low or inaccurate number of reported cases in Iran are due to other issues mentioned in this study and not manipulation of data [62]. We note that while this method can be used to test if data manipulation has occurred, it does not give any information about deliberate absence of data by, for instance, not reporting deaths from specific hospitals. ...
Preprint
Full-text available
Iran was among the first group of countries with a major outbreak of COVID-19 in Asia. With nearly 100 exported cases to various other countries by Feb 25, 2020 it has since been the epicentre of the outbreak in the Middle East. By examining the age- and gender-stratified national data taken from PCR-confirmed cases and deaths related to COVID-19 on Mar 13 (reported by the Iranian ministry of health) and those taken from hospitalised patients in 14 university hospitals across Tehran on Apr 4 (reported by Tehran University of Medical Sciences), we find that the crude case fatality ratio of the two reports in those aged 60 and younger are identical and are almost 10 times higher than those reported from China, Italy, Spain and several other European countries (reported from government or ministry of health websites). Assuming a constant attack rate across all age-groups, we adjust for demography, delay from confirmation to death, and under-ascertainment of cases, to estimate the infection fatality ratio based on the reports from Mar 13. We find that our estimates are aligned with reports from China and the UK for those aged 60 and above [n=4609], but are 2-3 times higher in younger age-groups [n=6756] suggesting that only less than 10% of symptomatic cases were detected across the country at the time. Using inbound travel data (from China to Iran) and matching the dates of the flights with prevalence of cases in China from Jan to Mar 2020, we assess the risk of importation of active cases into the country. Further, using outbound travel data, based on detected cases exported from Iran to several other countries, we estimate the size of the outbreak in the country on Feb 25 and Mar 6 to be 13,700 (95% CI: 7,600 - 33,300) and 60,500 (43,200 - 209,200), respectively. We next estimate the start of the outbreak using 18 whole-genome sequences obtained from cases with a travel history to Iran and the first sequence obtained from inside the country. 
Finally, we use a mathematical model to predict the evolution of the epidemic and assess its burden on the healthcare system. Our modelling analysis suggests the first peak of the epidemic was on Apr 5 and the next one likely follows within the next 6-10 weeks with approximately 30,000 ICU beds required (IQR: 12K - 60K) and over 1 million active cases (IQR: 740K - 3.7M) during the peak weeks. We caution that relaxed, stringent intervention measures, during a period of highly under-reported spread, would result in misinformed public health decisions and a significant burden on the hospitals in the coming weeks.
... For instance, exponentially distributed random variables were shown to satisfy BL approximately [23,24]. In addition, there are related phenomena with BL-like distributions that were explained from power laws [25,26], and criteria have been developed for when BL-like distributions may be expected. ...
Article
Full-text available
Benford’s law (BL) specifies the expected digit distributions of data in social sciences, such as demographic or financial data. We focused on the first-digit distribution and hypothesized that it would apply to data on locations of animals freely moving in a natural habitat. We believe that animal movement in natural habitats may differ with respect to BL from movement in more restricted areas (e.g., a game preserve). To verify the BL hypothesis for natural habitats, during 2015–2018 we collected telemetry data from twenty wild red deer in an alpine region of Austria. For each animal, we recorded the distances between successive position records. Collecting these data for each animal in weekly logbooks resulted in 1132 samples of size 65 on average. The weekly logbook data displayed a BL-like distribution of the leading digits. However, the data did not follow BL perfectly; for 9% (99) of the 1132 weekly logbooks, the chi-square test refuted the BL hypothesis. A Monte Carlo simulation confirmed that this deviation from BL could not be explained by spurious test results, i.e., by deviations from BL occurring by chance.
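The Monte Carlo logic described in that abstract can be sketched as follows: draw Benford-distributed first digits in logbook-sized samples (n = 65, as in the paper) and count how often the χ² test rejects purely by chance. The seed, trial count, and the 15.507 critical value (df = 8, α = 0.05) are standard values chosen here for illustration, not the paper's exact setup:

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
CUM = [sum(BENFORD[:i + 1]) for i in range(9)]  # cumulative distribution
CRIT = 15.507  # chi-square critical value, df = 8, alpha = 0.05

def sample_digit(rng):
    """Draw one Benford-distributed first digit by inverse CDF."""
    u = rng.random()
    for d, c in enumerate(CUM, start=1):
        if u <= c:
            return d
    return 9  # guard against floating-point rounding at the top

def chi2(counts, n):
    return sum((counts[d] - n * BENFORD[d - 1]) ** 2 / (n * BENFORD[d - 1])
               for d in range(1, 10))

rng = random.Random(0)
n, trials = 65, 2000
rejections = 0
for _ in range(trials):
    counts = {d: 0 for d in range(1, 10)}
    for _ in range(n):
        counts[sample_digit(rng)] += 1
    if chi2(counts, n) > CRIT:
        rejections += 1

# Even with genuinely Benford-distributed data, roughly 5% of
# logbook-sized samples are rejected by construction of the test.
false_positive_rate = rejections / trials
```

This is the baseline against which the paper's observed 9% rejection rate can be judged as more than chance.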
... Nevertheless many discordant voices brought a significantly different message. By putting aside the distributions known to fully disobey Benford's law [Rai76, Hil88, TBL00, SF01, Bee09, DMO11], this law often appeared to be a good approximation of the reality, but no more than an approximation [SF01,Sav06,DMO11,GD11,Goo16]. ...
Preprint
The package BeyondBenford compares the goodness of fit of Benford's and Blondeau Da Silva's (BDS's) digit distributions in a dataset. The package is used to check whether the data distribution is consistent with the theoretical distributions highlighted by Blondeau Da Silva: this ideal theoretical distribution must be at least approximately followed by the data for the use of BDS's model to be well-founded. It also allows one to draw histograms of digit distributions, both those observed in the dataset and those given by the two theoretical approaches. Finally, it quantifies the goodness of fit via Pearson's chi-squared test.
... where D = 1.03631 is a normalization factor that ensures d* is bounded by 0 and 1. A measure of fit to Benford's law has recently been proposed by Goodman (2016). His "rule of thumb" for compliance with Benford's law is d* ≤ 0.25. ...
Preprint
Full-text available
A shorter version of this manuscript has been accepted for publication in Communications in Statistics - Theory and Methods
Article
Full-text available
The contrast of fraud in international trade is a crucial task of modern economic regulations. We develop statistical tools for the detection of frauds in customs declarations that rely on the Newcomb–Benford law for significant digits. Our first contribution is to show the features, in the context of a European Union market, of the traders for which the law should hold in the absence of fraudulent data manipulation. Our results shed light on a relevant and debated question, since no general known theory can exactly predict validity of the law for genuine empirical data. We also provide approximations to the distribution of test statistics when the Newcomb–Benford law does not hold. These approximations open the door to the development of modified goodness-of-fit procedures with wide applicability and good inferential properties.
Article
The Equity in Athletics Disclosure Act (EADA) database and the USA Today NCAA athletics department finance database are two of the most commonly used databases for scholars, policy makers, and other constituents interested in studying the economics of college athletics. Many in the higher education community, however, question the validity of these databases. This study used Benford’s Law of First Digits as a tool for spotting irregularities in EADA and USA Today college athletics financial data. After reviewing 5 years of data, the findings show that while there was some slight deviation from Benford’s Law, EADA and USA Today data largely conformed to the expectations of Benford’s Law.
Article
A way to model the distribution of first digits in some naturally occurring collections of data is highlighted here. The proportion of d as leading digit, d ∈ ⟦1,9⟧, in data is sometimes more likely to follow a specific law whose probability distribution is determined by a lower and an upper bound, rather than Benford's Law, as one might have expected. These peculiar probability distributions fluctuate around Benford's values; such fluctuations have often been observed in the literature in experimental data sets (where the physical, biological or economic quantities considered are bounded below and above). Knowing the values of these bounds beforehand makes it possible, through the developed model, to find a law better adjusted to the data than Benford's.
Article
Full-text available
To fight COVID-19, global access to reliable data is vital. Given the rapid acceleration of new cases and the common sense of global urgency, COVID-19 is subject to thorough measurement on a country-by-country basis. The world is witnessing an increasing demand for reliable data and impactful information on the novel disease. Can we trust the data on the COVID-19 spread worldwide? This study aims to assess the reliability of COVID-19 global data as disclosed by local authorities in 202 countries. It is commonly accepted that the frequency distribution of leading digits of COVID-19 data should comply with Benford's law. In this context, the author collected and statistically assessed 106,274 records of daily infections, deaths, and tests around the world. The analysis of worldwide data suggests good agreement between theory and reported incidents. Approximately 69% of countries worldwide show some deviations from Benford's law. The author found that records of daily infections, deaths, and tests from 28% of countries adhered well to the anticipated frequency of first digits. By contrast, six countries disclosed pandemic data that do not comply with the first-digit law. With over 82 million citizens, Germany publishes the most reliable records on the COVID-19 spread. In contrast, the Islamic Republic of Iran provides by far the most non-compliant data. The author concludes that inconsistencies with Benford's law might be a strong indicator of artificially fabricated data on the spread of SARS-CoV-2 by local authorities. Partially consistent with prior research, the United States, Germany, France, Australia, Japan, and China reveal data that satisfy Benford's law. Unification of reporting procedures and policies globally could improve the quality of data and thus the fight against the deadly virus.
Preprint
Full-text available
This study analyzes the case of Romanian births, jointly distributed by age groups of mother and father, covering 1958-2019 under the potential influence of significant disruptors. Significant events such as the application or abrogation of anti-abortion laws, the fall of communism, and migration, along with their impacts, are analyzed. While in practice we may find examples both for and against, a general controversy remains as to whether births should obey Benford's Law (BL). Moreover, the impacts of significant disruptors are rarely discussed in detail in such analyses. I find that the distribution of births conforms to the First Digit Benford Law (BL1) over the entire sample, but obtain mixed results regarding BL conformance in the dynamic analysis and by main sub-periods. Even though many disruptors are analyzed, only the 1967 Anti-abortion Decree has a significant impact. I capture an average lag of 15 years between the event, the Anti-abortion Decree, and the onset of distortion in the births distribution. The distortion persists for around 25 years, almost the entire fertile period (ages 15 to 39) for the majority of people in the cohorts born in 1967-1968.
Article
It has been known for more than a century that, counter to one's intuition, the frequency of occurrence of the first significant digit in a very large number of numerical data sets is nonuniformly distributed. This result is encapsulated in Benford's law, which states that the first (and higher) digits follow a logarithmic distribution. An interesting consequence of the counterintuitive nature of Benford's law is that manipulation of data sets can lead to a change in compliance with the expected distribution, an insight that is exploited in forensic accountancy and the detection of financial fraud. In this investigation we have applied a Benford analysis to the distribution of price paid data for house prices in England and Wales pre- and post-2014. A residual heat map analysis offers a visually attractive method for identifying interesting features, and two distinct patterns of human intervention are identified: (i) selling property at values just beneath a tax threshold, and (ii) psychological pricing, with a particular bias for the final digit to be 0 or 5. There was a change in legislation in 2014 to soften tax thresholds, and the influence of this change on house price paid data was clearly evident.
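The psychological-pricing pattern described above can be screened for with a simple last-digit tally. This is an illustrative sketch, not the paper's method; it assumes the final digits of large prices would be roughly uniform absent human intervention, and the function name is hypothetical:

```python
from collections import Counter

def last_digit_shares(prices):
    """Share of each final digit among integer prices; absent human
    intervention each digit would appear roughly 10% of the time."""
    counts = Counter(int(str(int(p))[-1]) for p in prices)
    n = sum(counts.values())
    return {d: counts.get(d, 0) / n for d in range(10)}

# A strong excess of final 0s and 5s (e.g. 250,000 or 249,995) suggests
# psychological pricing rather than organically generated amounts
shares = last_digit_shares([249995, 250000, 182500, 199999, 310000])
```

In practice one would compare these shares to the uniform benchmark over a large sample, alongside the first-digit Benford comparison.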
Article
We applied the Newcomb–Benford Law to validate the reliability of Covid-19 figures in Pakistan. Official data were taken from March 2020 to November 2020, and the first-digit test was applied to the national aggregate dataset of total cases. There is evidence that the Pakistani dataset does not conform to the Newcomb–Benford Law's theoretical expectations. The results are robust to a goodness-of-fit check using the chi-square test. While the concern for evidence-based policymaking is welcome, we find that the Pakistani epidemiological surveillance system fails to provide trustworthy data under the Newcomb–Benford Law assumption.
Article
The availability of accurate information has proved fundamental to managing health crises. This research examined pandemic data provided by 198 countries worldwide two years after the outbreak of the deadly Coronavirus in Wuhan, China. We compiled and reevaluated the consistency of daily COVID-19 infections with Benford's Law. It is commonly accepted that the distribution of the leading digits of pandemic data should conform to Newcomb-Benford's expected frequencies. Consistency with the law of leading digits might be an indicator of data reliability. Our analysis shows that most countries have disseminated partially reliable data over 24 months. The United States, Israel, and Spain published the COVID-19 data most consistent with the law. In line with previous findings, Belarus, Iraq, Iran, Russia, Pakistan, and Chile published questionable epidemic data. Against this trend, 45 percent of countries worldwide appeared to demonstrate significant BL conformity. Our measures of Benfordness were moderately correlated with the Johns Hopkins Global Health Security Index, suggesting that conformity to Benford's law may also depend on national health care policies and practices. Our findings might be of particular importance to policymakers and researchers around the world.
Article
Full-text available
The distribution of the first significant digit in numerals of connected texts is considered. Benford's law is found to hold approximately for them. Deviations from Benford's law are statistically significant author peculiarities that allow one, under certain conditions, to distinguish between parts of the text with different authorship.
Book
Full-text available
Benford's law states that the leading digits of many data sets are not uniformly distributed from one through nine, but rather exhibit a profound bias. This bias is evident in everything from electricity bills and street addresses to stock prices, population numbers, mortality rates, and the lengths of rivers. Here, Steven Miller brings together many of the world's leading experts on Benford's law to demonstrate the many useful techniques that arise from the law, show how truly multidisciplinary it is, and encourage collaboration. Beginning with the general theory, the contributors explain the prevalence of the bias, highlighting explanations for when systems should and should not follow Benford's law and how quickly such behavior sets in. They go on to discuss important applications in disciplines ranging from accounting and economics to psychology and the natural sciences. The contributors describe how Benford's law has been successfully used to expose fraud in elections, medical tests, tax filings, and financial reports. Additionally, numerous problems, background materials, and technical details are available online to help instructors create courses around the book. Emphasizing common challenges and techniques across the disciplines, this accessible book shows how Benford's law can serve as a productive meeting ground for researchers and practitioners in diverse fields.
Chapter
One Digit at a Time: The Z-statistic; The Chi-square and Kolmogorov-Smirnoff Tests; The Mean Absolute Deviation (MAD) Test; Tests Based on the Logarithmic Basis of Benford's Law; Creating a Perfect Synthetic Benford Set; The Mantissa Arc Test; Summary
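The MAD test named in the chapter outline admits a very short implementation. A minimal sketch, with the commonly quoted (approximate) first-digit cutoffs noted in a comment; the function name is illustrative:

```python
import math

# Benford's expected proportions for leading digits 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def mad_first_digit(counts):
    """Mean Absolute Deviation between observed and Benford first-digit
    proportions; counts[i] is the number of records with leading digit i+1."""
    n = sum(counts)
    return sum(abs(c / n - b) for c, b in zip(counts, BENFORD)) / 9

# Commonly quoted (approximate) first-digit MAD cutoffs:
#   < 0.006 close conformity, 0.006-0.012 acceptable,
#   0.012-0.015 marginally acceptable, > 0.015 nonconformity
mad_uniform = mad_first_digit([100] * 9)  # uniform digits: clear nonconformity
```

Unlike the chi-squared statistic, the MAD does not grow with sample size, which is one reason it is favored for very large datasets.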
Article
To detect manipulations or fraud in accounting data, auditors have successfully used Benford's law as part of their fraud detection processes. Benford's law proposes a distribution for first digits of numbers in naturally occurring data. Government accounting and statistics are similar in nature to financial accounting. In the European Union (EU), there is pressure to comply with the Stability and Growth Pact criteria. Therefore, like firms, governments might try to make their economic situation seem better. In this paper, we use a Benford test to investigate the quality of macroeconomic data relevant to the deficit criteria reported to Eurostat by the EU member states. We find that the data reported by Greece shows the greatest deviation from Benford's law among all euro states.
Article
The history, empirical evidence and classical explanations of the significant-digit (or Benford's) law are reviewed, followed by a summary of recent invariant-measure characterizations. Then a new statistical derivation of the law in the form of a CLT-like theorem for significant digits is presented. If distributions are selected at random (in any "unbiased" way) and random samples are then taken from each of these distributions, the significant digits of the combined sample will converge to the logarithmic (Benford) distribution. This helps explain and predict the appearance of the significant-digit phenomenon in many different empirical contexts and helps justify its recent application to computer design, mathematical modeling and detection of fraud in accounting data.
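The CLT-like theorem can be illustrated with a small simulation. This is a sketch under the assumption that a log-uniform random scale is one "unbiased" way of picking distributions: no single uniform component follows Benford's law, yet the pooled sample does.

```python
import math
import random

random.seed(42)

def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

# Pick 2000 distributions "at random": uniform(0, scale) with a random
# order of magnitude. Each component on its own is far from Benford.
pooled = []
for _ in range(2000):
    scale = 10 ** random.uniform(-3, 3)
    pooled.extend(random.random() * scale for _ in range(10))

counts = [0] * 9
for x in pooled:
    counts[leading_digit(x) - 1] += 1
props = [c / len(pooled) for c in counts]
# props[0] should be close to log10(2) ≈ 0.301, and props should decrease
```

Because the log of the random scale is uniform over several full decades, multiplying by it makes the pooled sample scale-invariant in distribution, which is exactly the property that forces the logarithmic digit law.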
Article
Benford's law is seeing increasing use as a diagnostic tool for isolating pockets of large datasets with irregularities that deserve closer inspection. Popular and academic accounts of campaign finance are rife with tales of corruption, but the complete dataset of transactions for federal campaigns is enormous. Performing a systematic sweep is extremely arduous; hence, these data are a natural candidate for initial screening by comparison to Benford's distributions.
Article
This article will concentrate on decimal (base 10) representations and significant digits; the corresponding analog of (3) for other bases b > 1 is simply Prob(mantissa (base b) ≤ t/b) = log_b t for all t ∈ [1, b).