Since the 1990s, a mathematical phenomenon known as Benford’s law has been held aloft as a guard against fraud – as a way to check whether data sets are free from interference. Benford’s law does tell us something interesting about the frequency of leading digits in many natural data sets. But if a data set deviates from Benford’s law, is that evidence that the figures within are fraudulent? Not necessarily. Without an error term (which many articles fail to mention), it is too imprecise to say simply that a data set “does not conform”. To rectify this, this paper presents a concrete, empirical estimate of the phenomenon’s sampling distribution where it is applicable. Many published test results claiming to have found non-conformance to Benford’s law in post-hoc examined records actually report levels of variation that are well within the range of ordinary variation. Available online: DOI 10.1111/j.1740-9713.2016.00919.x
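The paper's empirical sampling-distribution estimate can be sketched by Monte Carlo (a hypothetical illustration of the idea, not Goodman's actual procedure): even for data that truly follow Benford's law, the normalized Euclidean distance d* between observed and expected digit frequencies varies with sample size, so a single "does not conform" verdict without an error term is uninformative.

```python
# Monte Carlo sketch of the sampling distribution of the normalized
# Euclidean distance d* for data that genuinely follow Benford's law.
# All parameter choices here are illustrative assumptions.
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def d_star(digits):
    """Normalized Euclidean distance between observed first-digit
    frequencies and the Benford frequencies (0 = perfect fit)."""
    n = len(digits)
    freqs = [digits.count(d) / n for d in range(1, 10)]
    dist = math.sqrt(sum((f - b) ** 2 for f, b in zip(freqs, BENFORD)))
    return dist / 1.03631  # maximum possible distance (all digits equal to 9)

def sample_benford_digits(n, rng):
    """Draw n first digits exactly Benford-distributed: the first digit
    of 10**U with U ~ Uniform(0, 1) follows Benford's law."""
    return [math.floor(10 ** rng.random()) for _ in range(n)]

rng = random.Random(42)
for n in (100, 1000):
    dists = sorted(d_star(sample_benford_digits(n, rng)) for _ in range(500))
    print(f"N={n}: 95th percentile of d* ≈ {dists[475]:.3f}")
```

The 95th percentile shrinks as the sample grows, which is exactly why a fixed cutoff applied without regard to sample size misreads ordinary variation as non-conformance.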

... Goodman [8] mentioned the requirements for a data set to be compatible with Benford's law, namely: (a) sufficient sample size, (b) large span of number values, (c) positively skewed distribution of numbers, and (d) not human-assigned numbers. Though it seems difficult to decide on an adequate sample size for Benford's law [9], it has been shown that this holds true for data sets containing as few as 50 to 100 numbers [10]. ...

... Though it seems difficult to decide on an adequate sample size for Benford's law [9], it has been shown that this holds true for data sets containing as few as 50 to 100 numbers [10]. Some authors even illustrated the law using fewer samples (n < 50) [8] and with COVID-19 data [11]. Moreover, according to Koch & Okamura [12], the spread of COVID-19 demonstrates exponential growth and changes of magnitude, which correspond to the above requirements (c) and (b), respectively. ...

... Another prevalent technique is Pearson's chi-square (χ²) goodness-of-fit test with a confirmatory null hypothesis [7,9,19]. It is common knowledge that the χ² test is sensitive to the sample size and cannot make reliable inferences when the dataset consists of 5000 observations or more [2,19]. In principle, if the sample size is too big, the null hypothesis will likely be rejected (even if there is no significant difference between the actual and expected subsets). ...
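The sample-size sensitivity described above can be demonstrated deterministically. In the sketch below, the deviating distribution is an invented example (a fixed +0.01 excess on digit 1), and 15.507 is the standard χ² critical value for 8 degrees of freedom at α = 0.05; the statistic scales linearly with N, so the same tiny deviation passes at N = 100 and fails at N = 100,000.

```python
# Deterministic illustration of the χ² test's sample-size sensitivity.
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

# A hypothetical distribution deviating from Benford by a fixed, tiny
# amount: +0.01 on digit 1, spread evenly off the other eight digits.
observed = [b + (0.01 if d == 1 else -0.01 / 8)
            for d, b in enumerate(BENFORD, start=1)]

def chi_square_stat(proportions, n):
    """Pearson χ² statistic for first-digit counts n * proportions."""
    return n * sum((p - b) ** 2 / b for p, b in zip(proportions, BENFORD))

CRITICAL_5PCT_DF8 = 15.507  # χ² critical value, α = 0.05, 8 degrees of freedom

for n in (100, 100_000):
    stat = chi_square_stat(observed, n)
    verdict = "reject Benford" if stat > CRITICAL_5PCT_DF8 else "cannot reject"
    print(f"N={n:>6}: χ² = {stat:9.2f} -> {verdict}")
```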

... As a technique less sensitive to sample size, researchers apply the d-factor (d*), which is calculated by Equation (3): d* = √(∑ᵢ (pᵢ − p̂ᵢ)²), where pᵢ and p̂ᵢ are the observed and expected frequencies [2,7,19]. The d* ultimately measures the Euclidean distance between the measured and expected frequencies of leading digits. ...
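A sketch of the d-factor computation, assuming the normalized form of the Euclidean distance used elsewhere in this literature (divided by its maximum possible value, ≈ 1.03631, which occurs when every number starts with the digit 9):

```python
# The d-factor (d*): Euclidean distance between observed and expected
# leading-digit frequencies, normalized so 0 = perfect conformance
# and 1 = maximum possible distance.
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
# Maximum distance: all mass on digit 9.
D_MAX = math.sqrt(sum(b * b for b in BENFORD[:8]) + (BENFORD[8] - 1) ** 2)

def d_factor(observed_freqs):
    """Normalized Euclidean distance to the Benford frequencies."""
    dist = math.sqrt(sum((p - b) ** 2
                         for p, b in zip(observed_freqs, BENFORD)))
    return dist / D_MAX

print(round(D_MAX, 5))                    # ≈ 1.03631, as in the literature
print(d_factor(BENFORD))                  # 0.0 for perfect conformance
print(round(d_factor([0] * 8 + [1]), 3))  # 1.0 for the worst case
```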

When it comes to COVID-19, access to reliable data is vital. It is crucial for the scientific community to use data reported by independent territories worldwide. This study evaluates the reliability of the pandemic data disclosed by 182 countries worldwide. We collected COVID-19 daily infections, deaths, tests, and vaccinations since the beginning of the coronavirus pandemic and assessed their conformity with Benford’s law. It is commonly accepted that the frequency of leading digits of pandemic data should conform to Benford’s law. Our analysis of Benfordness suggests that most countries distributed partially reliable data over the past eighteen months. Notably, the UK, Australia, Spain, Israel, and Germany, followed by 22 other nations, provided the most reliable COVID-19 data within the same period. In contrast, twenty-six nations, including Tajikistan, Belarus, Bangladesh, and Myanmar, published less reliable data on the coronavirus spread. In this context, over 31% of countries worldwide seem to have improved reliability. Our measurement of Benfordness moderately correlates with Johns Hopkins’ Global Health Security Index, suggesting that the quality of data may depend on national healthcare policies and systems. We conclude that economically or politically distressed societies have declined in conformity to the law over time. Our results are particularly relevant for policymakers worldwide.

... The Benford's Law distribution is such that about 30.1% of occurrences begin with a "1", while only about 4.6% begin with a "9". Although not all data sets follow Benford's Law (e.g., social security numbers, zip codes, atomic weights, date/time, etc.), data which are calculated (e.g., price times quantity) typically conform to Benford's Law, as they often involve logarithmic-type distributions (Goodman, 2016). ...
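The percentages quoted above follow directly from Benford's first-digit formula, P(d) = log10(1 + 1/d):

```python
# Print the full Benford first-digit table: digit 1 gets ~30.1%,
# decreasing monotonically down to ~4.6% for digit 9.
import math

for d in range(1, 10):
    print(f"P({d}) = {100 * math.log10(1 + 1 / d):.1f}%")
```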

... The results of this complete analysis of the Example 1 sample may be found in Table 2. Although a table can be useful for examining the results, a visualization of the data (Figure 2) helps illustrate conformity, or lack of conformity to Benford's Law by analyzing observed and expected results (Goodman, 2016;Lesperance et al., 2016). In this case, the expected and observed data points are consistent with Benford's Law. ...

... (figure: observed vs. expected counts, Example 2) ... particularly with a larger dataset. These results create a starting point for further investigation (Goodman, 2016). In this case, it is possible that expenses over $600 need additional approvals and paperwork. ...

For some time, there has been a call for cross‐disciplinary teaching within the business disciplines. With the rise of data and analytics, there is an opportunity for cross‐disciplinary teaching by integrating technology throughout the business curriculum. However, many business professors have little experience in cross‐disciplinary teaching. We hope to rectify this by introducing an approach that uses prelecture material to prepare students for learning concepts and terms that are not core to the focal class. In a study that combines Structured Query Language (SQL) coding with an audit principle called Benford's Law, we analyze the impact of adding prelecture material about Benford's Law on students' cognitive load, knowledge gained, and instructional efficiency. Results indicate that students who received a prelecture on Benford's Law outperformed the control group in performance measures related to Benford's Law, and that the instructional efficiency on Benford's Law was also higher for the treatment group. We did not find any significant differences in cognitive load between the two groups, nor did the treatment group perform significantly better when tested on the SQL concepts. We conclude that cross‐discipline teaching requires a fine balance between the disciplines, and that communicating the importance of both disciplines to students is key.
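The SQL-plus-Benford's-law exercise described above might look like the following minimal sketch, using Python's built-in sqlite3 module; the table name and expense data are hypothetical. SQL extracts and counts the leading digits, and Python compares the result with the Benford frequencies.

```python
# Hypothetical SQL + Benford's-law audit exercise: extract leading
# digits of expense amounts in SQL, then compare with Benford.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO expenses (amount) VALUES (?)",
    [(round(10 ** (i * 3.0 / 199)),) for i in range(200)],  # spans 1..1000
)

# SQL does the digit extraction and counting; SQLite's substr/CAST suffice.
rows = conn.execute(
    """
    SELECT substr(CAST(amount AS TEXT), 1, 1) AS digit, COUNT(*) AS n
    FROM expenses GROUP BY digit ORDER BY digit
    """
).fetchall()

total = sum(n for _, n in rows)
for digit, n in rows:
    expected = math.log10(1 + 1 / int(digit))
    print(f"digit {digit}: observed {n / total:.3f}, Benford {expected:.3f}")
```

Because the synthetic amounts are spread log-uniformly across three orders of magnitude, the observed frequencies land close to Benford's values; real expense data would be queried the same way.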

... Citation: Kopczewska K, Kopczewski T (2022) Natural spatial pattern-When mutual socio-geo distances between cities follow Benford's law. There are four features that make datasets potentially compatible with Benford's law: a reasonable sample size, which enables the least frequent values to appear; sufficient data span, to avoid all numbers starting with the same digit; a right-skewed distribution, to replicate the multiplicative character of the data; and no human intervention, which assures a natural design [2]. Many studies confirm that natural distributions of numbers are consistent with Benford's law. ...

... However, neither of the mutual-distances matrices of pure patterns conforms with Benford (Fig 8D): the theoretical Benford distributions (red) are far from the empirical ones (black). The key point is that the clustered point-pattern, because of the survival-like shape of its density, has the biggest chance to be the driver of a natural Benford-like spatial distribution [2], while the other two patterns supplement it. ...

Benford’s law states that the first digits of numbers in any natural dataset appear with defined frequencies. We pioneer the use of the Benford distribution to analyse the geo-location of cities and their population in the majority of countries. We use distances in three dimensions: 1D between the population values, 2D between the cities, based on geo-coordinates of location, and 3D between cities’ location and population, which jointly reflects separation and mass of urban locations. We get four main findings. Firstly, we empirically show that mutual 3D socio-geo distances between cities and populations in most countries conform with Benford’s law, and thus the urban geo-locations have a natural spatial distribution. Secondly, we show empirically that the population of cities within countries follows a composition of gamma(1,1) distributions and that the 1D distance between populations also conforms to Benford’s law. Thirdly, we pioneer the replication of spatial natural distribution: we discover in simulation that a mixture of three pure point-patterns (clustered, ordered, and random) in proportions 15:3:2 makes the 2D spatial distribution Benford-like. Complex 3D Benford-like patterns can be built upon the 2D (spatial) Benford distribution and the gamma(1,1) distribution of cities’ sizes. This finding enables generating 2D and 3D Benford distributions, which may replicate urban settlement well. Fourthly, we use historical settlement analysis to claim that the geo-location of cities and inhabitants worldwide followed an evolutionary process, resulting in a natural Benford-like spatial distribution, and to justify our statistical findings. These results are very novel. This study develops a new spatial distribution to simulate natural locations. It shows that evolutionary settlement patterns resulted in the natural location of cities, and historical distortions in urbanisation, even if persistent till now, are being evolutionarily corrected.
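The idea that compositions of gamma(1,1) (i.e., exponential) values can produce Benford-like first digits is easy to check in simulation. The sketch below is an illustrative assumption, not the authors' procedure: city-size-like values are drawn as exponentials with scales spread over several orders of magnitude, and the resulting leading digits are compared with Benford via the normalized Euclidean distance.

```python
# Exploratory sketch: a scale mixture of gamma(1,1) (= exponential)
# draws yields leading digits close to Benford's law. The scale range
# and sample size are arbitrary illustrative choices.
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

rng = random.Random(1)
values = []
for _ in range(20_000):
    scale = 10 ** rng.uniform(0, 5)              # spans five magnitudes
    values.append(rng.expovariate(1.0) * scale)  # gamma(1,1) times scale

digits = [int(f"{v:.6e}"[0]) for v in values]    # first significant digit
freqs = [digits.count(d) / len(digits) for d in range(1, 10)]
distance = math.sqrt(sum((f - b) ** 2 for f, b in zip(freqs, BENFORD))) / 1.03631
print(f"d* = {distance:.3f}")  # small values indicate conformance
```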

... They then use the cutoff values from the chi-squared or similar distributions and give a "yes-or-no" type of answer to their binary research question. These test statistics and inference results greatly depend on the sample size and selected cutoff values 34 . With large enough sample sizes, the null hypothesis of compliance with the NBL will be rejected in almost every case. ...

... Using the conservative cutoff value of 0.25 for D proposed by Goodman 34 , we find that we cannot reject the NBL distribution for the entire world population for the aggregate cumulative number of confirmed cases and deaths. For individual countries, however, we find that 51 countries do not conform to the NBL when reporting the number of confirmed cases; and 86 countries do not conform to the NBL when reporting the number of deaths. ...

The COVID-19 pandemic has spurred controversies related to whether countries manipulate reported data for political gains. We study the association between accuracy of reported COVID-19 data and developmental indicators. We use the Newcomb–Benford law (NBL) to gauge data accuracy. We run an OLS regression of an index constructed from developmental indicators (democracy level, gross domestic product per capita, healthcare expenditures, and universal healthcare coverage) on goodness-of-fit measures to the NBL. We find that countries with higher values of the developmental index are less likely to deviate from the Newcomb-Benford law. The relationship holds for the cumulative number of reported deaths and total cases but is more pronounced for the death toll. The findings are robust for second-digit tests and for a sub-sample of countries with regional data. The NBL provides a first screening for potential data manipulation during pandemics. Our study indicates that data from autocratic regimes and less developed countries should be treated with more caution. The paper further highlights the importance of independent surveillance data verification projects.

... Nevertheless, many voices were quick to challenge this overly consensual message... First of all, many empirical datasets are known to fully disobey Benford's law ([50,32,57,54,7,19]). In addition, this law often appeared to be a good approximation of reality, but no more than an approximation ([54,53,19,27,29]). Goodman, for example, in [29], discussed the necessity of introducing an error term. ...

... In addition, this law often appeared to be a good approximation of reality, but no more than an approximation ([54,53,19,27,29]). Goodman, for example, in [29], discussed the necessity of introducing an error term. Even the 20 different domains tested by Benford (in [8]) displayed large fluctuations around the theoretical values. ...

... Nevertheless, many discordant voices brought a significantly different message. Putting aside the distributions known to fully disobey Benford's law ([36,22,43,40,4,12]), this law often appeared to be a good approximation of reality, but no more than an approximation ([40,39,12,17,20]). Goodman, for example, in [20], discussed the necessity of introducing an error term. ...

... Putting aside the distributions known to fully disobey Benford's law ([36,22,43,40,4,12]), this law often appeared to be a good approximation of reality, but no more than an approximation ([40,39,12,17,20]). Goodman, for example, in [20], discussed the necessity of introducing an error term. Even the 20 different domains tested by Benford (in [5]) displayed large fluctuations around the theoretical values. ...

In this paper, we will see that the proportion of d as the pth digit, where p > 1 and d ∈ ⟦0, 9⟧, in data (obtained thanks to the model developed below) is more likely to follow a law whose probability distribution is determined by a specific upper bound, rather than the generalization of Benford’s law to digits beyond the first one. These probability distributions fluctuate around the theoretical values determined by Hill in 1995. Knowing beforehand the value of the upper bound can be a way to find a better-adjusted law than Hill’s.

... The law gives the leading digit 1 a probability of about one-third, decreasing as the first digit increases, with 9 occurring less than 5% of the time (Fewster 2009). Benford's law was used to investigate tropical cyclone length homogeneity in Joannes-Boyar et al. (2015), and for fraud detection in financial matters such as Enron's accounting scandal and Greece's economic reports to European authorities (Goodman 2016). Here, we use it as a check to see if the simulations have the same property as the observations. ...

The space–time fields of rainfall during a hurricane and tropical storm (TC) landfall are critical for coastal flood risk preparedness, assessment, and mitigation. We present an approach for the stochastic simulation of rainfall fields that leverages observed, high-resolution spatial fields of historical landfalling TC rainfall, derived from multiple instrumental and remote sensing sources, together with key variables recorded for historical TCs. Spatial realizations of rainfall at each time step are simulated conditional on the variables representing the ambient conditions. We use 6-hourly precipitation fields of tropical cyclones from 1983 to 2019 that made landfall on the Gulf coast of the US, starting from 24 h before landfall until the end of the track. A conditional K-nearest neighbor method is used to generate the simulations. The TC attributes used for conditioning are the preseason large-scale climate indices, the storm maximum wind speed, minimum central pressure, the latitude and speed of movement of the storm center, and the proportion of storm area over land or ocean. Simulations of rainfall for three hurricanes kept out of the sample (Katrina [2005], Rita [2005], and Harvey [2017]) are used to evaluate the method. The utility of coupling the approach to a hurricane track simulator applied for a full season is demonstrated by an out-of-sample simulation of the 2020 season.

... where D = [∑_{d=1}^{8} P_B²(d) + (P_B(9) − 1)²]^{1/2} ≃ 1.03631 is a normalization factor that ensures the normalized Euclidean distance is bounded by 0 and 1. The measure of fit used to check concordance with Benford's law was the one proposed by Goodman (2016), according to which compliance with Benford's law occurs when d* ≤ 0.25. However, such a rule of thumb has been shown to be statistically unfounded in Campanelli (2021) and generally gives untrustworthy results when the number of data points is either much less or much greater than 40 (in particular, the rule has very low statistical power for N ≫ 40). ...

Using the Euclidean distance statistical test of Benford’s law, we analyse the COVID-19 weekly case counts by country. While 62% of the 100 countries and territories considered in the present study conform to Benford’s law at a significance level of α = 0.05, and 17% at a significance level of 0.01 ≤ α < 0.05, the remaining 21% show a deviation from it (p values smaller than 0.01). In particular, 5% of the countries ‘break’ Benford’s law with a p value smaller than 0.001.

... and by the author who, recently enough (Campanelli, 2023), has found an empirical expression of its cumulative distribution function. A simple measure of fit to Benford's law, instead, has been proposed by Goodman (2016). His "rule of thumb" for conformance to Benford's law is d * ≤ 0.25. ...

... In the analysis of first digits, data with zeroes were excluded. 15,16 The 'digits' macro developed by Ben Jann (ETH Zurich) was used for that purpose, and Pearson's χ² and the log-likelihood ratio were used to evaluate the goodness-of-fit of the distributions. Stata 14 (Stata Corporation, College Station, TX, USA) was used for these analyses. ...

Introduction:
Mining injuries have decreased in a number of developed countries in recent decades. Although mining has become a very important sector of Colombia's economy, no analyses of mining injuries and fatalities have been conducted.
Objectives:
This study describes the occurrence of mining emergencies in Colombia between 2005 and 2018 and their principal characteristics.
Methods:
This retrospective ecological study analyzed mining emergencies registered by the National Mining Agency between 2005 and 2018. The study described the place, event type, legal status, mine type, extracted mineral, and number of injuries and fatalities. Benford's law was used to explore data quality.
Results:
A total of 1,235 emergencies occurred, with 751 injured workers and 1,364 fatalities. The majority of emergencies were from collapses, polluted air, and explosions, most of which occurred in coal (77.41%), gold (18.06%), and emerald (1.38%) mines. Many emergencies occurred in illegal mines (27.21%), most of which were for gold, construction materials, emeralds, and coal. Illegal mines had a higher relative proportion of injuries and fatalities than legal mines (p < 0.05). Mining disasters are likely to be underreported given that Benford's Law was not satisfied.
Conclusions:
As mining increases in Colombia, so do mining emergencies, injuries, and fatalities. This is the first full description of mining emergencies in Colombia based on the few available data.

... and by the author who, recently enough (Campanelli, 2023), has found an empirical expression of its cumulative distribution function. A simple measure of fit to Benford's law, instead, has been proposed by Goodman (2016). His "rule of thumb" for conformance to Benford's law is d * ≤ 0.25. ...

We discuss some limitations of the use of generic tests, such as Pearson's χ², for testing Benford's law. Statistics with known distribution, constructed under the specific null hypothesis that Benford's law holds, such as the Euclidean distance, are more appropriate when assessing the goodness-of-fit to Benford's law, and should be preferred over generic tests in quantitative analyses. The rule of thumb proposed by Goodman for checking compliance with Benford's law, by contrast, is shown to be statistically unfounded. For very large sample sizes (N > 1000), all existing statistical tests are inappropriate for testing Benford's law due to its empirical nature. We propose a new statistic whose sample values are asymptotically independent of the sample size, making it a natural candidate for testing Benford's law in very large data sets.

... This law is also used to detect possible frauds ([34,31,3,12,27]). Even though this law appears to be a good approximation of reality, it is no more than an approximation ([16,20,22]); in what follows, we will prove that, under specific conditions, these fluctuations find their entire and obvious explanation. ...

A way to model the distribution of first digits in some naturally occurring collections of data is highlighted here. The proportion of d as leading digit, d ∈ ⟦1, 9⟧, in data is sometimes more likely to follow a specific law whose probability distribution is determined by a lower and an upper bound, rather than Benford’s law, as one might have expected. These peculiar probability distributions fluctuate around Benford’s values; such fluctuations have often been observed in the literature in experimental data sets (where the physical, biological or economical quantities considered are lower- and upper-bounded). Knowing beforehand the values of these bounds makes it possible to find, through the developed model, a better-adjusted law than Benford’s.

... Nevertheless, many discordant voices brought a significantly different message. Putting aside the distributions known to fully disobey Benford's law (Raimi (1976); Hill (1988); Tolle et al. (2000); Scott and Fasli (2001); Beer (2009); Deckert et al. (2011)), this law often appeared to be a good approximation of reality, but no more than an approximation (Scott and Fasli (2001); Saville (2006); Deckert et al. (2011); Gauvrit and Delahaye (2011); Goodman (2016)). Goodman, for example, in Goodman (2016), discussed the necessity of introducing an error term. ...

In this paper, we will see that the proportion of d as the pth digit, where p > 1 and d ∈ ⟦0, 9⟧, in data (obtained thanks to the model developed below) is more likely to follow a law whose probability distribution is determined by a specific upper bound, rather than the generalization of Benford’s law to digits beyond the first one. These probability distributions fluctuate around the theoretical values of the distribution of the pth digit of Benford’s law. Knowing beforehand the value of the upper bound can be a way to find a better-adjusted law than Benford’s.

... For a data set that fully conforms to Benford's law, d* = 0.0; for a data set that is as non-conforming as possible, d* = 1.0. Goodman (2016) proposed that a d* higher than 0.25 is strong evidence of data manipulation. ...

The COVID-19 disease has become a pandemic that spreads at an unexpected pace around the world. There were more than 450 million cases and 6 million deaths worldwide at the start of March 2022. Benford's law is a statistical technique that serves to determine whether fraud has been committed in a data structure that uses repetitive numbers. In this study, the 18 countries with more than 5 million cases worldwide were ranked using grey relational analysis with the help of Benford's law, an effective method of data-fraud detection. The 18 countries are listed separately for two years of data with the help of the grey relational analysis method and the Benford analysis results. According to the results of the study, some countries showed changes in data reliability between 2020 and 2021. The data of Germany, France, and the Netherlands were determined to be the most reliable.

... 1.03631 is a normalization factor that assures that the normalized Euclidean distance is bounded by 0 and 1. A measure of fit to check concordance with Benford's law has been proposed by Goodman [14]. His "rule of thumb", which has been used in the literature (see, e.g., [15]), but whose statistical validity has been criticized in [13], is that compliance with Benford's law occurs when d* ≤ 0.25. It is worth observing that the use of the Cho-Gaines normalized Euclidean distance d* together with Goodman's rule of thumb for compliance with Benford's law would give a highly questionable compliance to Benford's law for all countries except Honduras, for which d* = 0.260, and Tanzania, with d* = 0.251. ...

An extended version of this paper has been accepted for publication in Statistics in Transition. Main results unchanged.

... where D = 1.03631 is a normalization factor that assures that d * is bounded by 0 and 1. A measure of fit to Benford's law has been recently proposed by Goodman (2016). His "rule of thumb" for compliance to Benford's law is d * ≤ 0.25. ...

A shorter version of this manuscript has been accepted for publication in Communications in Statistics - Theory and Methods

... For instance, exponentially distributed random variables were shown to satisfy BL approximately [23,24]. In addition, there are related phenomena with BL-like distributions that have been explained from power laws [25,26], and these works developed criteria for when BL-like distributions may be expected. ...

Benford’s law (BL) specifies the expected digit distributions of data in social sciences, such as demographic or financial data. We focused on the first-digit distribution and hypothesized that it would apply to data on locations of animals freely moving in a natural habitat. We believe that animal movement in natural habitats may differ with respect to BL from movement in more restricted areas (e.g., game preserve). To verify the BL-hypothesis for natural habitats, during 2015–2018, we collected telemetry data of twenty individuals of wild red deer from an alpine region of Austria. For each animal, we recorded the distances between successive position records. Collecting these data for each animal in weekly logbooks resulted in 1132 samples of size 65 on average. The weekly logbook data displayed a BL-like distribution of the leading digits. However, the data did not follow BL perfectly; for 9% (99) of the 1132 weekly logbooks, the chi-square test refuted the BL-hypothesis. A Monte Carlo simulation confirmed that this deviation from BL could not be explained by spurious tests, where a deviation from BL occurred by chance.
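The "spurious tests" check mentioned above can be sketched by Monte Carlo (parameters here are illustrative, not those of the study): draw many samples of size 65 from an exact Benford distribution and count how often a χ² test at α = 0.05 rejects, which should land near the nominal 5% false-rejection rate.

```python
# Monte Carlo sketch: false-rejection rate of a χ² test (α = 0.05)
# applied to size-65 samples that truly follow Benford's law.
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
CRITICAL_5PCT_DF8 = 15.507  # χ² critical value, 8 degrees of freedom

def chi_square_rejects(digits):
    n = len(digits)
    stat = sum((digits.count(d + 1) - n * b) ** 2 / (n * b)
               for d, b in enumerate(BENFORD))
    return stat > CRITICAL_5PCT_DF8

rng = random.Random(3)
reps = 2000
rejections = sum(
    chi_square_rejects([math.floor(10 ** rng.random()) for _ in range(65)])
    for _ in range(reps)
)
print(f"false-rejection rate ≈ {rejections / reps:.3f}")
```

With expected counts as low as 65 × 0.046 ≈ 3 for digit 9, the χ² approximation can inflate the rate slightly above 5%, which is why a simulation-based check of this kind is a sensible companion to the analytic test.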

... (3) have many entries; and (4) are not intentionally designed. Such datasets have been called "Benford suitable" by Goodman [7]. ...

The frequencies of the first digits of numbers drawn from an exponential probability density oscillate around the Benford frequencies. Analysis, simulations, and empirical evidence show that datasets must have at least 10,000 entries for these oscillations to emerge from finite-sample noise. Anecdotal evidence from population data is provided.
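A quick simulation illustrates the deviation for a fixed-scale exponential (the scale and sample size are arbitrary choices): with enough draws, the frequency of leading digit 1 settles noticeably above the Benford value of 0.301 rather than at it.

```python
# First digits of a fixed-scale exponential only approximate Benford:
# the digit-1 frequency converges to ~0.33, not 0.301.
import math
import random

rng = random.Random(5)
sample = [rng.expovariate(1.0) for _ in range(200_000)]
digits = [int(f"{x:.6e}"[0]) for x in sample]  # first significant digit
freq1 = digits.count(1) / len(digits)
print(f"P(first digit = 1) ≈ {freq1:.3f} vs Benford's 0.301")
```

Changing the exponential's scale shifts which digits are over- or under-represented, which is the oscillation the abstract describes; only with very large samples does this systematic wobble separate from sampling noise.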

... A good mechanism for explaining the uneven distributions stipulated by Benford's law has been proposed in [41]. Benford's law has been used for evaluating possible fraud in accounting data [42], legal status [43], election data [44][45][46], macroeconomic data [47], price data [48], etc. From Equation (4), we observe that beyond the small digits, the probability P(d) = log10(1 + 1/d) ≈ 1/(d ln 10) approximately approaches the Zipf distribution with α = 1. ...

Mankind has long been fascinated by emergence in complex systems. With the rapidly accumulating big data in almost every branch of science, engineering, and society, a golden age for the study of complex systems and emergence has arisen. Among the many uses of big data are detecting changes in system dynamics, helping science extend its reach, and, most desirably, possibly uncovering new fundamental laws. Unfortunately, these goals are hard to achieve using black-box machine-learning based approaches for big data analysis. Especially, when systems are not functioning properly, their dynamics must be highly nonlinear, and as long as abnormal behaviors occur rarely, relevant data for abnormal behaviors cannot be expected to be abundant enough to be adequately tackled by machine-learning based approaches. To better cope with these situations, we advocate synergistically using mainstream machine-learning based approaches and multiscale approaches from complexity science. The latter are very useful for finding key parameters characterizing the evolution of a dynamical system, including malfunctioning of the system. One of the many uses of such parameters is to design simpler but more accurate unsupervised machine learning schemes. To illustrate the ideas, we first provide a tutorial introduction to complex systems and emergence, then we present two multiscale approaches. One is based on adaptive filtering, which is excellent at trend analysis, noise reduction, and (multi)fractal analysis. The other originates from chaos theory and can unify the major complexity measures that have been developed in recent decades. To make the ideas and methods better accessible to a wider audience, the paper is designed as a tutorial survey, emphasizing the connections among the different concepts from complexity science.
Many original discussions, arguments, and results pertinent to real-world applications are also presented so that readers can be best stimulated to apply and further develop the ideas and methods covered in the article to solve their own problems. This article is intended both as a tutorial and as a survey. It can be used as course material, including for intensive summer training courses. When the material is used for teaching purposes, it will be beneficial to motivate students to have hands-on experience with the many methods discussed in the paper. Instructors as well as readers interested in the computer analysis programs are welcome to contact the corresponding author.

... On the other hand, a small fraction of suspected shell companies exhibited conformity. This finding is supported by previous studies arguing that Benford's Law does not have the absolute power to segregate naturally occurring data from managed data (Diekmann & Jann, 2010; Goodman, 2016; Kovalerchuk et al., 2007). At best, Benford's Law can be used as part of a set of tools for segregating natural data from cooked data, as by itself it may fail to identify fudged financial data precisely. ...

... A recent paper which used Benford's law to look at COVID-19 reporting data in Iran, the US and UK from February to April 2020 (Ghafari et al. 2020), cited (Goodman 2016) and noted problems with the Benford measurement and COVID, writing: ...

In this paper, we use Monte Carlo simulations with an SIRD model parameterised from the literature and test with many metrics whether Benford's law is fulfilled in 4 different scenarios. The results confirm that the Newcomb-Benford law could theoretically be an adequate tool to assess COVID-19 infection data reporting. The challenges in using Benford's law in epidemic reporting are posed by the real-world counting process, where non-malicious errors are introduced by a lack of tests. One should therefore see Benford's law not as a fraud-detection tool, but as an assistive tool for measuring reporting effectiveness in the real world.
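A minimal SIRD-style sketch (not the authors' parameterisation; all parameter values below are assumptions) shows why simulated epidemic counts lean towards Benford-like first digits: daily new infections grow and decay roughly exponentially, spanning several orders of magnitude.

```python
# Hypothetical SIRD epidemic (Euler steps): collect daily new
# infections and tabulate their leading digits.
import math

def sird_daily_cases(beta=0.25, gamma=0.08, mu=0.01, n=1_000_000, days=400):
    s, i, r, d = n - 1.0, 1.0, 0.0, 0.0
    cases = []
    for _ in range(days):
        new_inf = beta * s * i / n          # new infections this day
        recovered, died = gamma * i, mu * i
        s -= new_inf
        i += new_inf - recovered - died
        r += recovered
        d += died
        if new_inf >= 1:                    # keep countable days only
            cases.append(new_inf)
    return cases

digits = [int(f"{c:.6e}"[0]) for c in sird_daily_cases()]
freqs = [digits.count(k) / len(digits) for k in range(1, 10)]
print("leading-digit frequencies:", [round(f, 3) for f in freqs])
```

The growth and decline phases are near-geometric, so their leading digits spread close to Benford's frequencies; the flat region around the epidemic peak is what introduces the deviations the paper studies.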

... The Euclidean distance employed in this work, on the other hand, is independent of sample size and hence provides a metric that only becomes more precise with increasing sample size but does not diverge. Clearly, a disadvantage of using the Euclidean distance is that it is not a formal test statistic with an associated statistical power (although Goodman (2016) suggested that data can be said to follow Benford's law when the Euclidean distance is shorter than ∼0.25). Many researchers have investigated and proposed suitable metrics that can quantify statistical (dis)agreement between data and Benford's law (e.g. the Cramér-von Mises metric; Lesperance et al. 2016). ...

Context. Benford’s law states that for scale- and base-invariant data sets covering a wide dynamic range, the distribution of the first significant digit is biased towards low values. This has been shown to be true for wildly different datasets, including financial, geographical, and atomic data. In astronomy, earlier work showed that Benford’s law also holds for distances estimated as the inverse of parallaxes from the ESA HIPPARCOS mission.
Aims. We investigate whether Benford’s law still holds for the 1.3 billion parallaxes contained in the second data release of Gaia ( Gaia DR2). In contrast to previous work, we also include negative parallaxes. We examine whether distance estimates computed using a Bayesian approach instead of parallax inversion still follow Benford’s law. Lastly, we investigate the use of Benford’s law as a validation tool for the zero-point of the Gaia parallaxes.
Methods. We computed histograms of the observed most significant digit of the parallaxes and distances, and compared them with the predicted values from Benford’s law, as well as with theoretically expected histograms. The latter were derived from a simulated Gaia catalogue based on the Besançon galaxy model.
Results. The observed parallaxes in Gaia DR2 indeed follow Benford’s law. Distances computed with the Bayesian approach of Bailer-Jones et al. (2018, AJ, 156, 58) no longer follow Benford’s law, although low-value ciphers are still favoured for the most significant digit. The prior that is used has a significant effect on the digit distribution. Using the simulated Gaia universe model snapshot, we demonstrate that the true distances underlying the Gaia catalogue are not expected to follow Benford’s law, essentially because the interplay between the luminosity function of the Milky Way and the mission selection function results in a bi-modal distance distribution, corresponding to nearby dwarfs in the Galactic disc and distant giants in the Galactic bulge. In conclusion, Gaia DR2 parallaxes only follow Benford’s Law as a result of observational errors. Finally, we show that a zero-point offset of the parallaxes derived by optimising the fit between the observed most-significant digit frequencies and Benford’s law leads to a value that is inconsistent with the value that is derived from quasars. The underlying reason is that such a fit primarily corrects for the difference in the number of positive and negative parallaxes, and can thus not be used to obtain a reliable zero-point.

... There are a number of requirements a dataset must fulfil in order to obey the Benford's law distribution. According to Goodman (2016) ...

This paper presents the application of Benford's law to psychological pricing detection. Benford's law is a naturally occurring law which states that digits have predictable frequencies of appearance, with digit one having the highest frequency. Psychological pricing is a marketing pricing strategy of setting prices so as to have a psychological impact on certain consumers. In order to investigate the application of Benford's law to psychological pricing detection, Benford's law is observed for both first and last digits. To inspect whether the first and last digits of the observed prices are distributed according to the Benford's law distribution or the discrete uniform distribution respectively, the mean absolute deviation measure, chi-square tests and Kolmogorov-Smirnov Z tests are used. The analysis conducted on three price datasets has shown that the most dominant first digits are 1 and 2, while the most dominant last digits are 0, 5 and 9. The chi-square and Kolmogorov-Smirnov Z tests have shown that, at a significance level of 5%, none of the three observed price datasets has a first-digit distribution that fits the Benford's law distribution. Likewise, the mean absolute deviation values have shown large differences between the last-digit distributions and the discrete uniform distribution, implying psychological pricing in all three price datasets.

... The probability distribution does not show any conclusive evidence of manipulation of data in any of the three countries. From this, we conclude that the likely low or inaccurate number of reported cases in Iran is due to other issues mentioned in this study and not to manipulation of data [62]. We note that while this method can be used to test whether data manipulation has occurred, it does not give any information about the deliberate absence of data caused by, for instance, not reporting deaths from specific hospitals. ...

Iran was among the first group of countries with a major outbreak of COVID-19 in Asia. With nearly 100 exported cases to various other countries by Feb 25, 2020 it has since been the epicentre of the outbreak in the Middle East. By examining the age- and gender-stratified national data taken from PCR-confirmed cases and deaths related to COVID-19 on Mar 13 (reported by the Iranian ministry of health) and those taken from hospitalised patients in 14 university hospitals across Tehran on Apr 4 (reported by Tehran University of Medical Sciences), we find that the crude case fatality ratio of the two reports in those aged 60 and younger are identical and are almost 10 times higher than those reported from China, Italy, Spain and several other European countries (reported from government or ministry of health websites). Assuming a constant attack rate across all age-groups, we adjust for demography, delay from confirmation to death, and under-ascertainment of cases, to estimate the infection fatality ratio based on the reports from Mar 13. We find that our estimates are aligned with reports from China and the UK for those aged 60 and above [n=4609], but are 2-3 times higher in younger age-groups [n=6756] suggesting that only less than 10% of symptomatic cases were detected across the country at the time. Using inbound travel data (from China to Iran) and matching the dates of the flights with prevalence of cases in China from Jan to Mar 2020, we assess the risk of importation of active cases into the country. Further, using outbound travel data, based on detected cases exported from Iran to several other countries, we estimate the size of the outbreak in the country on Feb 25 and Mar 6 to be 13,700 (95% CI: 7,600 - 33,300) and 60,500 (43,200 - 209,200), respectively. We next estimate the start of the outbreak using 18 whole-genome sequences obtained from cases with a travel history to Iran and the first sequence obtained from inside the country. 
Finally, we use a mathematical model to predict the evolution of the epidemic and assess its burden on the healthcare system. Our modelling analysis suggests the first peak of the epidemic was on Apr 5 and that the next one likely follows within the next 6-10 weeks, with approximately 30,000 ICU beds required (IQR: 12K - 60K) and over 1 million active cases (IQR: 740K - 3.7M) during the peak weeks. We caution that relaxing stringent intervention measures during a period of highly under-reported spread would result in misinformed public health decisions and a significant burden on the hospitals in the coming weeks.

... The fit to this distribution, or partially modified forms, can be used as a content-related indicator (Bredl, Winker, and Kotschau 2012;Porras and English 2004;Schäfer et al. 2004;Schräpler and Wagner 2005;Swanson et al. 2003). For further information on the assumptions of Benford's Law, see Goodman (2016). Another content-related challenge for falsifiers is correctly estimating the frequency of rare or sensitive attributes; falsifiers often lack information about the real distribution of these attributes in the population. ...

Table A7A.1 Number of identical response patterns

... Nevertheless, many discordant voices brought a significantly different message. Setting aside the distributions known to fully disobey Benford's law [Rai76, Hil88, TBL00, SF01, Bee09, DMO11], this law often appeared to be a good approximation of reality, but no more than an approximation [SF01, Sav06, DMO11, GD11, Goo16]. ...

The package BeyondBenford compares the goodness of fit of Benford's and Blondeau Da Silva's (BDS's) digit distributions in a dataset. The package is used to check whether the data distribution is consistent with the theoretical distributions highlighted by Blondeau Da Silva: this ideal theoretical distribution must be at least approximately followed by the data for the use of BDS's model to be well-founded. It also allows the user to draw histograms of the digit distributions, both as observed in the dataset and as given by the two theoretical approaches. Finally, it quantifies the goodness of fit via Pearson's chi-squared test.

Financial statement fraud is a costly problem for society. Detection models can help, but a framework to guide variable selection for such models is lacking. A novel Fraud Detection Triangle (FDT) framework is proposed specifically for this purpose. Extending the well‐known Fraud Triangle, the FDT framework can facilitate improved detection models. Using Benford's law, we demonstrate the posited framework's utility in aiding variable selection via the element of surprise evoked by suspicious information latent in the data. We call for more research into variables that measure rationalisations for fraud and suspicious phenomena arising as unintended consequences of financial statement fraud.

In this doctoral thesis, I work with one of the least studied sources of financing in the literature (that somehow relates to the political and the economic powers): electoral self-financing. I understand self-financing as the use of own resources (patrimony and/or income) by candidates in elections. My goal is to answer two main questions: what explains the variation in the use of own resources by candidates in their campaigns? In addition, what are the impacts of this form of electoral funding? To answer these questions, I organize this thesis into four chapters that can be read independently but are structured around these central problems. In the first chapter, I carry out a systematic review of the literature on the determinants and consequences of self-financing. The research’s designs and theoretical frameworks used in the works indicate two perspectives of “epistemic resources”: political sociology and the theory of strategic behavior. These papers focus on variables such as the wealth of candidates, political and socio-economic context, the structure of the electoral competition, and the strength of parties as determinants of the use of out-of-pocket resources in the campaigns. Regarding the consequences of self-financing, the most tested dependent variable was electoral performance. There is evidence that this source has more impact on proportional systems rather than on majority systems. In the second chapter, I investigate the regulation of self-financing in Brazil from 1950 to the political reforms of 2017 and 2019. The results show that the greatest restriction to this source of funding was a consequence of unexpected effects of the 2016 municipal election, such as the election of very wealthy candidates and outsiders. The issue, until then, was not perceived as a problem by Brazilian legislators. In the third chapter, I describe the process of the data collection, the building of the database, and the descriptive statistics of the selected variables. 
I work with an original database with information on candidates for the offices of State and Federal Deputy, Senator, and Governor in three elections, totaling more than 54 thousand cases. Finally, in the fourth chapter, I test a series of models to answer the thesis’s questions. The results confirm that businesspeople self-finance to a greater extent than other occupations (even with the inclusion of a series of control variables and the use of matching techniques). Moreover, this form of electoral financing has a greater impact on votes for candidates in the proportional system than in the majoritarian system. In the final considerations, I discuss the findings in light of the literature on Democratic Theory. This thesis contributes to the discussion of electoral funding in Brazil and to the literature on Political Elites.

The space-time fields of rainfall during a hurricane or tropical storm (TC) landfall are critical for coastal flood risk preparedness, assessment, and mitigation. We present an approach for the stochastic simulation of rainfall fields that leverages observed, high-resolution spatial fields of historical landfalling TC rainfall, derived from multiple instrumental and remote sensing sources, together with key variables recorded for historical TCs. Spatial realizations of rainfall at each time step are simulated conditional on the variables representing the ambient conditions. We use 6-hourly precipitation fields of tropical cyclones from 1983 to 2019 that made landfall on the Gulf coast of the United States, starting from 24 hours before landfall until the end of the track. A conditional K-nearest neighbor method is used to generate the simulations. The TC attributes used for conditioning are the pre-season large-scale climate indices, the storm maximum wind speed, minimum central pressure, the latitude and speed of movement of the storm center, and the proportion of storm area over land or ocean. Simulations of rainfall for three hurricanes that were kept out of the sample, Katrina (2005), Rita (2005), and Harvey (2017), are used to evaluate the method. The utility of coupling the approach to a hurricane track simulator applied for a full season is demonstrated by an out-of-sample simulation of the 2020 season.

We numerically compute test values of the Euclidean distance statistic of Benford’s law as a function of the sample size. We also find an approximate analytical expression of the cumulative distribution function of such a statistic that makes possible the computation of p values.

The availability of accurate information has proved fundamental to managing health crises. This research examined pandemic data provided by 198 countries worldwide two years after the outbreak of the deadly coronavirus in Wuhan, China. We compiled and reevaluated the consistency of daily COVID-19 infections with Benford’s Law. It is commonly accepted that the distribution of the leading digits of pandemic data should conform to Newcomb-Benford’s expected frequencies. Consistency with the law of leading digits might be an indicator of data reliability. Our analysis shows that most countries have disseminated partially reliable data over 24 months. The United States, Israel, and Spain published the COVID-19 data most consistent with the law. In line with previous findings, Belarus, Iraq, Iran, Russia, Pakistan, and Chile published questionable epidemic data. Against this trend, 45 percent of countries worldwide appeared to demonstrate significant BL conformity. Our measures of Benfordness were moderately correlated with the Johns Hopkins Global Health Security Index, suggesting that conformity to Benford’s law may also depend on national health care policies and practices. Our findings might be of particular importance to policymakers and researchers around the world.

We applied the Newcomb-Benford Law to validate the reliability of COVID-19 figures in Pakistan. Official data were taken from March 2020 to November 2020, and the first-digit test was applied to the national aggregate dataset of total cases. There is evidence that the Pakistani dataset does not conform to the Newcomb-Benford Law's theoretical expectations. The results are robust to a goodness-of-fit check using the chi-square test. While the concern for evidence-based policymaking is to be appreciated, we found that the Pakistani epidemiological surveillance system fails to provide trustworthy data under the Newcomb-Benford Law assumption.
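The first-digit chi-square goodness-of-fit check mentioned in this abstract can be sketched as follows. This is an illustrative assumption rather than the authors' implementation; the hardcoded 5% critical value (15.507 for 8 degrees of freedom) is the standard tabulated one:

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
CHI2_CRIT_8DF_5PCT = 15.507  # 5% critical value, chi-square with 8 d.o.f.

def first_digit(x):
    # Scientific notation puts the leading digit first, e.g. "3.1e+05".
    return int(f"{abs(x):e}"[0])

def chi_square_benford(data):
    """Pearson chi-square statistic of the observed first-digit
    counts against the Benford expectations."""
    counts = [0] * 9
    for x in data:
        counts[first_digit(x) - 1] += 1
    n = len(data)
    return sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, BENFORD))

# Uniformly spread values (first digits 1..9 equally often) should be
# rejected, while a Benford-conforming set should not.
uniform = list(range(100, 1000))
benford_like = [2.0 ** k for k in range(1, 201)]
print(chi_square_benford(uniform) > CHI2_CRIT_8DF_5PCT)       # True
print(chi_square_benford(benford_like) > CHI2_CRIT_8DF_5PCT)  # False
```

As Goodman (2016) cautions, a rejection here only flags deviation from the expected digit frequencies, not fraud as such.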

It has been known for more than a century that, counter to one’s intuition, the frequency of occurrence of the first significant digit in a very large number of numerical data sets is nonuniformly distributed. This result is encapsulated in Benford’s law, which states that the first (and higher) digits follow a logarithmic distribution. An interesting consequence of the counterintuitive nature of Benford’s law is that manipulation of data sets can lead to a change in compliance with the expected distribution, an insight that is exploited in forensic accountancy and financial fraud detection. In this investigation we have applied a Benford analysis to the distribution of price paid data for house prices in England and Wales pre- and post-2014. A residual heat map analysis offers a visually attractive method for identifying interesting features, and two distinct patterns of human intervention are identified: (i) selling property at values just beneath a tax threshold, and (ii) psychological pricing, with a particular bias for the final digit to be 0 or 5. There was a change in legislation in 2014 to soften tax thresholds, and the influence of this change on house price paid data was clearly evident.
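The final-digit bias described here, a preference for prices ending in 0 or 5, can be screened for in a few lines. The sketch below is a hypothetical illustration (the price list is made up, not data from the study):

```python
from collections import Counter

def last_digit_shares(prices):
    """Share of each final digit (0-9) among integer prices."""
    counts = Counter(abs(p) % 10 for p in prices)
    n = len(prices)
    return {d: counts.get(d, 0) / n for d in range(10)}

# Hypothetical listings that mimic psychological pricing:
prices = [250000, 199995, 325000, 149999, 455000,
          275000, 189995, 99999, 500000, 120000]
shares = last_digit_shares(prices)
# With no pricing bias each final digit would appear ~10% of the time.
print(max(shares, key=shares.get))  # 0 dominates this sample
```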

This study analyzes the case of Romanian births, jointly distributed by age groups of mother and father, covering 1958-2019 under the potential influence of significant disruptors. The impacts of significant events, such as the application or abrogation of anti-abortion laws, the fall of communism, and migration, are analyzed. While in practice we may find examples both pro and contra, a general controversy remains over whether births should obey Benford's Law (BL). Moreover, the impacts of significant disruptors are rarely discussed in detail in such analyses. I find that the distribution of births is First Digit Benford Law (BL1) conformant on the entire sample, but obtain mixed results regarding BL obedience in the dynamic analysis and by main sub-periods. Even though many disruptors are analyzed, only the 1967 Anti-abortion Decree has a significant impact. I capture an average lag of 15 years between the event, the Anti-abortion Decree, and the start of the distortion of the births distribution. The distortion persists for around 25 years, almost the entire fertile life span (15 to 39) for the majority of people from the cohorts born in 1967-1968.

To fight COVID-19, global access to reliable data is vital. Given the rapid acceleration of new cases and the common sense of global urgency, COVID-19 is subject to thorough measurement on a country-by-country basis. The world is witnessing an increasing demand for reliable data and impactful information on the novel disease. Can we trust the data on the COVID-19 spread worldwide? This study aims to assess the reliability of COVID-19 global data as disclosed by local authorities in 202 countries. It is commonly accepted that the frequency distribution of leading digits of COVID-19 data shall comply with Benford’s law. In this context, the author collected and statistically assessed 106,274 records of daily infections, deaths, and tests around the world. The analysis of worldwide data suggests good agreement between theory and reported incidents. Approximately 69% of countries worldwide show some deviations from Benford’s law. The author found that records of daily infections, deaths, and tests from 28% of countries adhered well to the anticipated frequency of first digits. By contrast, six countries disclosed pandemic data that do not comply with the first-digit law. With over 82 million citizens, Germany publishes the most reliable records on the COVID-19 spread. In contrast, the Islamic Republic of Iran provides by far the most non-compliant data. The author concludes that inconsistencies with Benford’s law might be a strong indicator of artificially fabricated data on the spread of SARS-CoV-2 by local authorities. Partially consistent with prior research, the United States, Germany, France, Australia, Japan, and China reveal data that satisfies Benford’s law. Unification of reporting procedures and policies globally could improve the quality of data and thus the fight against the deadly virus.

The Equity in Athletics Disclosure Act (EADA) database and the USA Today NCAA athletics department finance database are two of the most commonly used databases for scholars, policy makers, and other constituents interested in studying the economics of college athletics. Many in the higher education community, however, question the validity of these databases. This study used Benford’s Law of First Digits as a tool for spotting irregularities in EADA and USA Today college athletics financial data. After reviewing 5 years of data, the findings show that while there was some slight deviation from Benford’s Law, EADA and USA Today data largely conformed to the expectations of Benford’s Law.

The contrast of fraud in international trade is a crucial task of modern economic regulations. We develop statistical tools for the detection of frauds in customs declarations that rely on the Newcomb–Benford law for significant digits. Our first contribution is to show the features, in the context of a European Union market, of the traders for which the law should hold in the absence of fraudulent data manipulation. Our results shed light on a relevant and debated question, since no general known theory can exactly predict validity of the law for genuine empirical data. We also provide approximations to the distribution of test statistics when the Newcomb–Benford law does not hold. These approximations open the door to the development of modified goodness-of-fit procedures with wide applicability and good inferential properties.

The distribution of the first significant digit in the numerals of connected texts is considered. Benford's law is found to hold approximately for them. Deviations from Benford's law are statistically significant author peculiarities that allow one, under certain conditions, to distinguish between parts of a text with different authorship.

Benford's law states that the leading digits of many data sets are not uniformly distributed from one through nine, but rather exhibit a profound bias. This bias is evident in everything from electricity bills and street addresses to stock prices, population numbers, mortality rates, and the lengths of rivers. Here, Steven Miller brings together many of the world's leading experts on Benford's law to demonstrate the many useful techniques that arise from the law, show how truly multidisciplinary it is, and encourage collaboration. Beginning with the general theory, the contributors explain the prevalence of the bias, highlighting explanations for when systems should and should not follow Benford's law and how quickly such behavior sets in. They go on to discuss important applications in disciplines ranging from accounting and economics to psychology and the natural sciences. The contributors describe how Benford's law has been successfully used to expose fraud in elections, medical tests, tax filings, and financial reports. Additionally, numerous problems, background materials, and technical details are available online to help instructors create courses around the book. Emphasizing common challenges and techniques across the disciplines, this accessible book shows how Benford's law can serve as a productive meeting ground for researchers and practitioners in diverse fields.

Chapter contents: One Digit at a Time: The Z-statistic; The Chi-square and Kolmogorov-Smirnov Tests; The Mean Absolute Deviation (MAD) Test; Tests Based on the Logarithmic Basis of Benford's Law; Creating a Perfect Synthetic Benford Set; The Mantissa Arc Test; Summary.
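The Mean Absolute Deviation (MAD) test named here can be sketched as follows. The conformity cutoffs in the comment are the ones commonly attributed to Nigrini in secondary sources, an assumption on my part rather than a quotation from this chapter:

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def mad_first_digit(data):
    """Mean absolute deviation between observed and expected
    first-digit proportions."""
    counts = [0] * 9
    for x in data:
        # Scientific notation exposes the leading digit, e.g. "3.1e+05".
        counts[int(f"{abs(x):e}"[0]) - 1] += 1
    n = len(data)
    return sum(abs(c / n - p) for c, p in zip(counts, BENFORD)) / 9

# Commonly quoted first-digit MAD cutoffs (assumption): < 0.006 close
# conformity, 0.006-0.012 acceptable, 0.012-0.015 marginal, > 0.015
# nonconformity.
benford_like = [2.0 ** k for k in range(1, 301)]
print(mad_first_digit(benford_like) < 0.015)  # True
```

Unlike the chi-square statistic, the MAD does not grow with sample size, which is why it is often preferred for very large datasets.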

To detect manipulations or fraud in accounting data, auditors have successfully used Benford's law as part of their fraud detection processes. Benford's law proposes a distribution for first digits of numbers in naturally occurring data. Government accounting and statistics are similar in nature to financial accounting. In the European Union (EU), there is pressure to comply with the Stability and Growth Pact criteria. Therefore, like firms, governments might try to make their economic situation seem better. In this paper, we use a Benford test to investigate the quality of macroeconomic data relevant to the deficit criteria reported to Eurostat by the EU member states. We find that the data reported by Greece shows the greatest deviation from Benford's law among all euro states.

The history, empirical evidence and classical explanations of the significant-digit (or Benford's) law are reviewed, followed by a summary of recent invariant-measure characterizations. Then a new statistical derivation of the law in the form of a CLT-like theorem for significant digits is presented. If distributions are selected at random (in any "unbiased" way) and random samples are then taken from each of these distributions, the significant digits of the combined sample will converge to the logarithmic (Benford) distribution. This helps explain and predict the appearance of the significant-digit phenomenon in many different empirical contexts and helps justify its recent application to computer design, mathematical modeling and the detection of fraud in accounting data.

Benford's law is seeing increasing use as a diagnostic tool for isolating pockets of large datasets with irregularities that deserve closer inspection. Popular and academic accounts of campaign finance are rife with tales of corruption, but the complete dataset of transactions for federal campaigns is enormous. Performing a systematic sweep is extremely arduous; hence, these data are a natural candidate for initial screening by comparison to Benford's distributions.

This article will concentrate on decimal (base 10) representations and significant digits; the corresponding analog of (3) for other bases b > 1 is simply Prob(mantissa (base b) ≤ t/b) = log_b(t) for all t in [1, b).
