About
Publications: 64
Reads: 6,445
Citations: 495
Introduction
Pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, and InfoSpace. Vincent is also a former post-doc at Cambridge University. He has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence.
Additional affiliations
February 1994 - June 1995
September 1988 - December 1993
Education
December 1994 - June 1995
Cambridge University
Field of study
- Statistics
Publications (64)
Available at MLTblog.com/3uSVGa6. This scratch course on stochastic processes covers significantly more material than usually found in traditional books or classes. The approach is original: I introduce a new yet intuitive type of random structure called perturbed lattice or Poisson-binomial process, as the gateway to all the stochastic processes....
Access the article (for free) at https://mltblog.com/3UPYGPz.
Very few serious articles in the literature use the digits of irrational numbers to build a pseudo-random number generator (PRNG). It seems that this idea was abandoned long ago due to the computational complexity and the misconception that such PRNGs are deterministic...
Access the article (for free) at https://mltblog.com/3HKnBS2.
In the context of synthetic data generation, I’ve been asked a few times to provide a case study focusing on real-life tabular data used in the finance or health industry. Here we go: this article fills this gap. The purpose is to generate a synthetic copy of the real data set, preserv...
For full description, see https://mltblog.com/3XCsVw9. Synthetic data is used more and more to augment real-life datasets, enriching them and allowing black-box systems to correctly classify observations or predict values that are well outside of training and validation sets. In addition, it helps understand decisions made by obscure systems such a...
Available at https://mltblog.com/3UuTnpZ. This 156-page book covers the foundations of machine learning, with modern approaches to solving complex problems. Emphasis is on scalability, automation, testing, optimizing, and interpretability (explainable AI). For instance, regression techniques, including logistic and Lasso, are presented as a sing...
Read article at https://mltblog.com/3xF4a7X. A central problem in computer vision is to compare shapes and assess how similar they are. This is used for instance in text recognition. Modern techniques involve neural networks. In this article, I revisit a methodology designed in 1914, before computers even existed. It leads to an efficient, automated...
Full article available for free, at https://dsc.news/3pLWgnw. In simple English, we introduce a special, off-the-beaten-path type of point process called Poisson-binomial. We analyze its properties and perform simulations to see the distribution of points that it generates, in one and two dimensions, as well as to make inference about its parameter...
The article is available for free, at https://bit.ly/3EoNBNW. In this article, original stochastic processes are introduced. They may constitute one of the simplest examples and definitions of point processes. A limiting case is the standard Poisson process - the mother of all point processes - used in so many spatial statistics applications. We fi...
You can read this article at https://bit.ly/3uzZ7kL. In this article, I provide a short overview of the topic, with application to understanding why the Riemann hypothesis (arguably the most famous unsolved mathematical conjecture of all time) might be true, using probabilistic arguments. State-of-the-art, recent developments about this conjecture...
Read full article at https://bit.ly/2MuHYrx. This is the second part of my article Spectacular Visualization: The Eye of the Riemann Zeta Function, focusing on the most infamous unsolved mathematical conjecture, one that has a $1 million prize attached to it. I used the word deep not in the sense of deep neural networks, but because the impl...
See full article at https://dsc.news/2QBKz1V. Here I introduce an alternative to the power mean, called exponential mean, also based on a parameter p, that may have an appeal to data scientists and machine learning professionals. It is also a special case of the quasi-arithmetic mean. Though the concept is basic, there is very little if any literat...
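To make the idea of a quasi-arithmetic mean with parameter p concrete, here is a minimal Python sketch. It assumes the exponential mean takes the form log_p of the average of p^x, that is, the quasi-arithmetic mean generated by f(x) = p^x. That form is an assumption made for illustration and is not quoted from the truncated abstract; the function names are hypothetical.

```python
import math

def quasi_arithmetic_mean(xs, f, f_inv):
    """Generic quasi-arithmetic mean: f_inv of the average of f(x)."""
    return f_inv(sum(f(x) for x in xs) / len(xs))

def exponential_mean(xs, p):
    """Assumed form of the exponential mean: log_p(mean(p ** x)).

    Requires p > 0 and p != 1 so that the base-p logarithm is defined.
    """
    if p <= 0 or p == 1:
        raise ValueError("p must be positive and different from 1")
    return quasi_arithmetic_mean(
        xs,
        lambda x: p ** x,          # generator f(x) = p^x
        lambda y: math.log(y, p),  # inverse f^{-1}(y) = log_p(y)
    )

if __name__ == "__main__":
    data = [0.0, 2.0]
    for p in (0.5, 1.0001, 2.0, 10.0):
        print(p, exponential_mean(data, p))
```

Under this assumed definition, values of p close to 1 give a result close to the arithmetic mean, while large p pulls the result toward the maximum of the data.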
The article is available for free at https://dsc.news/3dDwQS2.
Bernoulli lattice processes may be one of the simplest examples of point processes, and can be used as an introduction to learn about more complex spatial processes that rely on advanced measure theory for their definition. In this article, we show the differences and analogies betwee...
This article is available for free at https://dsc.news/2X4W9GK.
Products of two large primes are at the core of many encryption algorithms, as factoring the product is very hard for numbers with a few hundred digits. The two prime factors are associated with the encryption keys (public and private keys). Here we describe a new approach to factoring...
This article is available for free at https://dsc.news/2YGvn8Y. We discuss a simple trick to significantly accelerate the convergence of an algorithm when the error term decreases in absolute value over successive iterations, with the error term oscillating (not necessarily periodically) between positive and negative values.
We first illustrate t...
This article is available at https://dsc.news/3cjMmlH.
The methodology described here has broad applications, leading to new statistical tests, a new type of ANOVA (analysis of variance), improved design of experiments, interesting fractional factorial designs, a better understanding of irrational numbers leading to cryptography, gaming and Fintech...
This article is available for free (no sign-up required) at https://dsc.news/2GCev9a. Fermat's last conjecture puzzled mathematicians for more than 300 years and was proved only recently. In this note, I propose a generalization that could lead to a much simpler proof and a more powerful result with broader applications, i...
This article is available for free, at https://dsc.news/2Y8VaEx. We study the properties of a typical chaotic system to derive general insights that apply to a large class of unusual statistical distributions. The purpose is to create a unified theory of these systems. These systems can be deterministic or random, yet due to their gentle chaotic na...
Read this article (free) at https://dsc.news/2ORztVc. In this article, we explore a new type of generalized univariate normal distributions that satisfies useful statistical properties, with interesting applications. This new class of distributions is defined by its characteristic function, and applications are discussed in the last section. These...
Available free of charge at https://dsc.news/2JrGxpt. Some original material is presented here, with possible applications in Fintech. No need for a PhD in math to understand this article: I tried to make the presentation as simple as possible, focusing on high-level results rather than technicalities. Yet, professional statis...
Article available (for free) at https://dsc.news/2nX2xRC. I have used synthetic data sets many times for simulation purposes, most recently in my articles Six degrees of Separations between any two Datasets and How to Lie with p-values. Many applications (including the data sets themselves) can be found in my books Applied Stochastic Processes and...
This article is available at https://dsc.news/2knXsA9. This is an interesting data science conjecture, inspired by the well known six degrees of separation problem, stating that there is a link involving no more than 6 connections between any two people on Earth, say between you and anyone living (say) in North Korea.
Here the link is between any...
This article is available for free, at https://dsc.news/2m0eUed. The two deep conjectures highlighted in this article (conjectures B and C) are related to the digit distribution of well known math constants such as Pi or log 2, with an emphasis on binary digits of SQRT(2).
Download this article at https://dsc.news/2zytIoc. Continued fractions are usually considered a beautiful, curious mathematical topic, but with applications mostly theoretical and limited to math and number theory. Here we show how they can be used in applied business and economics contexts, leveraging the mathematical theory developed for continu...
Available at https://dsc.news/2PelcoS. Potential applications are found in cryptography, Fintech (stock market modeling), Bitcoin, number theory, random number generation, benchmarking statistical tests and even gaming. However, the most interesting application is probably to gain insights about what non-normal numbers look like, especially their ch...
Available for free at https://dsc.news/2IByRkm. This book is intended for busy professionals working with data of any kind: engineers, BI analysts, statisticians, operations research, AI and machine learning professionals, economists, data scientists, biologists, and quants, ranging from beginners to executives. In about 300 pages and 28 chapters i...
Available at https://www.mdpi.com/2504-4990/1/2/42. The sensitivity of the elbow rule in determining an optimal number of clusters in high-dimensional spaces that are characterized by tightly distributed data points is demonstrated. The high-dimensional data samples are not artificially generated, but they are taken from a real world evolutionary m...
Read the article at https://dsc.news/2K89vvT. This simple introduction to matrix theory offers a refreshing perspective on the subject. Using a basic concept that leads to a simple formula for the power of a matrix, we see how it can solve time series, Markov chains, linear regression, data reduction, principal components analysis (PCA) and other m...
This is probably the first time that an explicit formula is obtained for the variance of the range, besides the uniform distribution. It has important consequences, and the result is also useful in applications. The expectation of the range grows indefinitely and is asymptotically equivalent to log(n). The variance of the range grows slowly and eve...
The full article is available at https://dsc.news/2ViasbN.
This crash course features a new fundamental statistics theorem -- even more important than the central limit theorem -- and a new set of statistical rules and recipes. We discuss concepts related to determining the optimum sample size, the optimum k in k-fold cross-validation, bootstrappi...
Full document available at https://dsc.news/2PUhCNh. We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition we propose a mechanism to sharpen the confidence intervals, to reduce their width by an order of magnitude. The methodology...
This article is available at https://dsc.news/2IJ136n.
I show here how I used the golden ratio for a new number guessing game (to generate chaos and randomness in ergodic time series) as well as new intriguing results, in particular:
1. Proof that the rabbit constant is not normal in any base; this might be the first instance of a non-artific...
Based on deep math (number theory) it allows participants to compute the winning numbers in advance, the average return is zero, and the gain is known in advance: it is a function of how close your guess is, to the winning number.
Based on years of state-of-the-art mathematical research, this article sets the theoretical foundations for a broad cl...
Available at https://dsc.news/2uGqBYC
We investigate a large class of auto-correlated, stationary time series, proposing a new statistical test to measure departure from the base model, known as Brownian motion. We also discuss a methodology to deconstruct these time series, in order to identify the root mechanism that generates the observations....
Article available at https://dsc.news/2GEPcFj.
In this data science article, emphasis is placed on science, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attem...
Available at https://dsc.news/2xnazE9. You won't learn this in textbooks, college classes, or data camps. Some of the material in this article is very advanced yet presented in simple English, with an Excel implementation for various statistical tests, and no arcane theory, jargon, or obscure theorems. It has a number of applications, in finance in...
The free book is available for download at https://dsc.news/2J80pjl.
This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with...
We illustrate pattern recognition techniques applied to an interesting mathematical problem: The representation of a number in non-conventional systems, generalizing the familiar base-2 or base-10 systems. The emphasis is on data science rather than mathematical theory, and the style is that of a tutorial, requiring minimum knowledge in mathematics...
The article can be accessed for free at https://mltblog.com/3B4jTxz. The method described here illustrates the concept of ensemble methods, applied to a real life NLP problem: ranking articles published on a website to predict performance of future blog posts yet to be written, and help decide on title and other features to maximize traffic volume...
Most articles on extreme events focus on the extreme values. Very little has been written about the arrival times of these events. This article fills the gap. The article is available online at http://www.datasciencecentral.com/profiles/blogs/distribution-of-arrival-times-of-extreme-events
In this article, we revisit the most fundamental statistics theorem, talking in layman's terms. We investigate a special but interesting and useful case, which is not discussed in textbooks, data camps, or data science classes. This article is part of a series about off-the-beaten-path data science and mathematics, offering a fresh, original and simp...
Learn what it takes to succeed in the most in-demand tech job. Harvard Business Review calls it the sexiest tech job of the 21st century. Data scientists are in demand, and this unique book shows you exactly what employers want and the skill set that separates the quality data scientist from other talented IT professionals. Data science involves...
A new unbiased consistent asymptotically normal estimator Uk of the intensity λ of a stationary multivariate Poisson point process is exhibited. This estimate is based on a combination of the j-th nearest neighbor (possibly non Euclidean) distances (j=1, ..., k) to a single fixed site x. A simple closed form containing logarithmic terms is obtained...
We propose an MCMC methodology to estimate all the components of the Rodriguez-Iturbe model. This parametric model is associated with a likelihood function, and we use the Gibbs sampler to draw posterior deviates of the parameters in a Bayesian framework, conditionally on the data. The Gibbs sampler incorporates a Metropolis-Hastings step to sample t...
A new theoretical point of view is discussed here in the framework of density estimation. The discrete multivariate true density is viewed as a finite dimensional continuous random vector governed by a Markov random field structure. Estimating the density is then a problem of maximizing a conditional likelihood under a Bayesian framework. This maxi...
We investigate the solutions for the clustering and the discriminant analysis problems when the points are supposed to be distributed according to Poisson processes on convex supports. This leads to very intuitive criteria for homogeneous Poisson processes based on the Lebesgue measures of convex hulls. For non homogeneous Poisson processes, the Le...
A new theoretical point of view is discussed in the framework of density estimation. The multivariate true density, viewed as a prior or penalizing factor in a Bayesian framework, is modelled by a Gibbs potential. Estimating the density consists in maximizing the posterior. For efficiency of time, we are interested in an approximate estimator f̂ =...
The authors recall the complete classification procedure they have developed. This is based on a nonstationary Poisson process for the points distribution in the radiometric space. The intensity functions are estimated nonparametrically by using the uniform kernel. For the a priori probability estimates the authors use an EM-like algorithm, which k...
In the framework of image remote sensing, Markov random fields are used to model the distribution of points both in the 2-dimensional geometrical layout of the image and in the spectral grid. The problems of image filtering and supervised classification are investigated. The mixture model of noise developed here and appropriate Gibbs densities yiel...
A new theoretical point of view is discussed in the framework of density estimation. The multivariate true density, viewed as a prior or penalizing factor in a Bayesian framework, is modelled by a Gibbs potential. Estimating the density consists in maximizing the posterior. For efficiency of time, we are interested in an approximate estimator f̂ =...
We prove the convergence of the simulated annealing procedure when the decision to change the current configuration is blind of the cost of the new configuration. In case of filtering binary images, the proof easily generalizes to other procedures, including that of Metropolis. We show that a function Q associated with the algorithm must be chosen...
Investigates a fully nonparametric supervised classification scheme based on a nonhomogeneous Poisson point process for the distribution of points in the radiometric space. Uniform kernel estimates of the intensity functions are investigated. In particular, the choice of the kernel bandwidth is discussed. The a priori probability for a point t...
We propose a simple and robust algorithm for exact inference in 2 × 2 contingency tables. It is based on recursive relations allowing efficient computation of odds-ratio estimates, confidence limits and p-values for Fisher's test. A factor of 3–10 is gained in terms of computer time compared with the classical algorithm of Thomas.
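As a rough illustration of how recursive relations can drive exact inference in a 2 × 2 table, the Python sketch below computes a two-sided p-value for Fisher's test. It relies on the standard ratio between consecutive hypergeometric probabilities, so no factorials are needed. This is a textbook recursion chosen for illustration, not the algorithm from the paper, and the function name is hypothetical.

```python
def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    With both margins fixed, the top-left cell x follows a hypergeometric
    law, and consecutive probabilities satisfy
        P(x + 1) / P(x) = (r1 - x)(c1 - x) / ((x + 1)(r2 - c1 + x + 1)).
    """
    r1, r2, c1 = a + b, c + d, a + c            # row sums and first column sum
    lo, hi = max(0, c1 - r2), min(r1, c1)       # feasible range of the top-left cell
    weights = [1.0]                             # unnormalized P(lo)
    for x in range(lo, hi):
        ratio = (r1 - x) * (c1 - x) / ((x + 1) * (r2 - c1 + x + 1))
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    probs = [w / total for w in weights]
    p_obs = probs[a - lo]
    # Sum the probabilities of all tables at most as likely as the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

if __name__ == "__main__":
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

Normalizing the running products at the end avoids computing any binomial coefficient directly. For very large tables the unnormalized weights could overflow, which a production implementation would need to handle.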
Instead of considering an additive Gaussian noise, we present a model where the observed image is a mixture of an arbitrary and discrete noise process with the true but unknown image. We also develop a filtering algorithm, used to remove the noise. This algorithm is the ICM (Besag, 1986). The novelty is that ICM has been adapted to our particular a...
In this short note we focus on clustering and classification problems related to image segmentation. Our aim is twofold. First we want to popularize some methods from computational geometry in grid clustering and classification, and second, we develop new techniques and present an efficient algorithm for the image segmentation on the bounded domain...
As an application of the behavioral answer proposed for the classification of remote sensing data using the non-parametric intensities of the training sets, this paper discusses the contribution of our research to supervised classification of remote sensing data. The improvements in the results for thematic classification are estimated in...
Related to the papers by Rasson et al. (1993) and Orban-Ferauge (1993), the present author summarizes a new behavioral answer to the problem of classification of pixels from their spectral coordinates. The graphical illustration of the mathematical foundations of the method is shown, attempting to make clear what is the basic notion of the intensit...
During the last few years, Markov Random Field (MRF) models have been successfully applied in image remote sensing in a context of conditional maximum likelihood estimation. Here, in the same context, we propose some original uses of MRF, especially in image segmentation, noise filtering and discriminant analysis. For i...
Instead of considering an additive Gaussian noise, we present a model where the observed image is a mixture of an arbitrary noise process with the true but unknown image. We have obtained consistent estimators for the proportions of the mixture. We have also estimated the distribution of the colours in the true image. The differences between the di...
The recursive relation g(n) = n − g(g(n − 1)), g(0) = 0, appears in the context of Fibonacci numbers, as you can see in Hofstadter [“Gödel, Escher, Bach,” pp. 151–154, InterEditions, 1985]. Here we state and prove that g(n) = ⌊(n + 1)μ⌋, where μ = (√5 − 1)/2.
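The recursion above is easy to explore numerically. The Python sketch below tabulates g(n) directly from g(n) = n − g(g(n − 1)) and compares it with ⌊(n + 1)μ⌋, μ = (√5 − 1)/2, the standard closed form for Hofstadter's G-sequence. The function names are hypothetical, and the check is a finite verification, not a proof.

```python
import math

def hofstadter_g(n_max):
    """Return [g(0), ..., g(n_max)] for g(n) = n - g(g(n - 1)), g(0) = 0."""
    g = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        # g(n - 1) <= n - 1, so g[g[n - 1]] is already known at this point.
        g[n] = n - g[g[n - 1]]
    return g

def closed_form(n):
    """floor((n + 1) * (sqrt(5) - 1) / 2), computed with exact integer arithmetic."""
    m = n + 1
    # math.isqrt(5 * m * m) equals floor(m * sqrt(5)); halving with // keeps the floor exact.
    return (math.isqrt(5 * m * m) - m) // 2

if __name__ == "__main__":
    values = hofstadter_g(20)
    print(values)
    print(all(values[n] == closed_form(n) for n in range(21)))
```

Using `math.isqrt` instead of floating-point square roots keeps the comparison exact even for large n.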