Content uploaded by Gianluca Bontempi

Author content

All content in this area was uploaded by Gianluca Bontempi on Feb 10, 2023

Content may be subject to copyright.


This handbook (whose extended version is available at https://leanpub.com/statisticalfoundationsofmachinelearning) is dedicated to all students interested in machine learning who are not content with only running lines of (deep-learning) code but who are eager to learn about this discipline’s assumptions, limitations, and perspectives. When I was a student, my dream was to become an AI researcher and save humankind with intelligent robots. For several reasons, I abandoned such ambitions (but you never know). In exchange, I discovered that machine learning is much more than a conventional research domain since it is intimately associated with the scientific process transforming observations into knowledge.
The first version of this book was made publicly available in 2004 with two objectives and one ambition. The first objective was to provide a handbook to ULB students since I was (and still am) strongly convinced that a decent course should come with a decent handbook. The second objective was to group together all the material that I consider fundamental (or at least essential) for a Ph.D. student to undertake a thesis in my lab. At that time, there were already plenty of excellent machine learning reference books. However, most of the existing work did not sufficiently acknowledge what machine learning owes to statistics and concealed (or did not make explicit enough, notably because of incomplete or implicit notation) important assumptions underlying the process of inferring models from data.
The ambition was to make a free academic reference on the foundations of machine learning available on the web. There are several reasons for providing free access to this work: I am a civil servant in an institution that already takes care of my salary; most of the material is not original (though its organisation, notation definition, exercises, code and structure represent the primary added value of the author); in many parts of the world access to expensive textbooks or reference material is still difficult for the majority of students; most of the knowledge underlying this book was obtained by the author thanks to free (or at least no-cost) references; and, last but not least, education seems to be the last societal domain where a communist approach may be as effective as rewarding. Personally, I would be delighted if this book could be used to facilitate the access of underfunded educational and research communities to state-of-the-art scientific notions.

... For a broader introduction to time series forecasting, we suggest the following references: (Hyndman and Athanasopoulos, 2018), (Petropoulos et al., 2021), (Brockwell et al., 2016), (Terasvirta, Tjostheim, and Granger, 2010) for a more general introduction, and (Lütkepohl, 2005), (Tsay, 2014) for a focus on multivariate approaches. For machine learning, key references include (Hastie, Tibshirani, and Friedman, 2009), (James et al., 2013) and (Bontempi, 2013) for statistical learning theory, and (Aggarwal et al., 2018) for a specific focus on neural-based techniques. ...

... The structure of a generic machine learning procedure is summarized in Figure 2.1. Figure 2.1: Structure of a traditional machine learning procedure according to (Bontempi, 2013). The feedback loop is represented by the dashed line. ...

... According to (Bontempi, 2013), the approaches to feature selection can be classified into three main categories: ...

Time series forecasting deals with the prediction of future values of time-dependent quantities (e.g. stock price, energy load, city traffic) on the basis of their historical observations. In
its simplest form, forecasting concerns a single variable (i.e., univariate problem) and deals
with the prediction of a single future value (i.e., one-step-ahead).
Existing studies in the literature focused on extending the univariate time series framework to address multiple future predictions (also called multiple-step-ahead) and multiple
variables (multivariate approaches) accounting for their interdependencies. However, most
of the approaches deal either with the multiple-step-ahead aspect or the multivariate one,
rarely with both. Moreover, state-of-the-art multivariate forecasting methods are restricted
to low dimensional problems, linear dependencies among the values and short forecasting
horizons.
The recent technological advances (notably the Big Data revolution) are instead shifting
the focus to problems characterized by a large number of variables, non-linear dependencies
and long forecasting horizons. Such forecasting tasks are increasingly addressed
with a representation learning approach, by feeding the data into large-scale deep neural
networks and letting the model learn the most suitable data representation for the task
at hand. This approach, despite its success and effectiveness, often requires considerable
computational power and intensive model calibration, and the learned model lacks
interpretability.
The motivation of this thesis is that the potential of more interpretable approaches
to multivariate and multi-step-ahead tasks has not been sufficiently explored and that
the use of complex neural methods is often neither necessary nor advantageous. From this
perspective, we explore two multivariate and multiple-step-ahead forecasting strategies
based on dimensionality reduction:
• The first strategy, called Dynamic Factor Machine Learning, is a machine learning
extension of a well-known technique in econometrics: it transforms the original high-dimensional multivariate forecasting problem by first extracting a (small) set of latent
variables (also called factors) and forecasting them independently in a multi-step-ahead
yet univariate manner. Once the multi-step-ahead forecast of factors is computed, the
predictions are transformed back to the original space.
• The second strategy, called Selective Multivariate to Univariate Reduction through
Feature Engineering and Selection, addresses the dimensionality issue in the original
space and deals with the combinatorial explosion of possible spatial and temporal
dependencies by feature selection. The resulting strategy combines expert-based feature engineering, effective feature selection strategies (based on filters), and ensembles
of simple models in order to develop a set of computationally inexpensive yet effective
models.
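The first strategy above (extract a few factors, forecast each one on its own, map the forecasts back) can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis's implementation: PCA (via SVD) stands in for the factor extraction, a least-squares autoregressive model applied recursively stands in for the univariate multi-step-ahead forecaster, and the names `fit_ar`, `forecast_ar` and `dfml_forecast` are hypothetical.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of an AR(order) model with intercept (assumed forecaster)."""
    rows = [series[t - order:t] for t in range(order, len(series))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = series[order:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, weight of oldest lag, ..., weight of newest lag]

def forecast_ar(series, coef, horizon):
    """Recursive multi-step-ahead forecast: each prediction is fed back as input."""
    order = len(coef) - 1
    history = list(series[-order:])
    out = []
    for _ in range(horizon):
        nxt = coef[0] + np.dot(coef[1:], history[-order:])
        out.append(nxt)
        history.append(nxt)
    return np.array(out)

def dfml_forecast(Y, n_factors, order, horizon):
    """Y has shape (T, n); returns forecasts of shape (horizon, n)."""
    mean = Y.mean(axis=0)
    Yc = Y - mean
    # Step 1: extract latent factors (PCA via SVD; rows of Vt are the directions).
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    W = Vt[:n_factors].T                  # (n, k) loading matrix
    Z = Yc @ W                            # (T, k) factor series
    # Step 2: forecast every factor independently, as a univariate series.
    Z_hat = np.column_stack([
        forecast_ar(Z[:, j], fit_ar(Z[:, j], order), horizon)
        for j in range(n_factors)
    ])                                    # (horizon, k)
    # Step 3: transform the factor forecasts back to the original space.
    return Z_hat @ W.T + mean
```

The cost of the forecasting step grows with the number of factors rather than with the number of original variables, which is the point of reducing the dimension first.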
The thesis is structured as follows. After the introduction, we present a description of the
fundamentals of time series analysis and a review of the state-of-the-art in the domain of
multivariate, multiple-step-ahead forecasting. Then, we provide a theoretical description of
the two original contributions, along with their positioning in the current scientific literature.
The final part of the thesis is devoted to their empirical assessment on several synthetic and real
data benchmarks (notably from the domain of finance, traffic and wind forecasting) and
discusses their strengths and weaknesses.
The experimental results show that the proposed strategies are a promising alternative
to state-of-the-art models, overcoming their limitations in terms of problem size (in the case of
statistical models) and interpretability (in the case of large-scale black-box machine learning
models, such as Deep Learning techniques).
Moreover, the findings show a potential for implementation of these strategies on large-scale (> 10² variables and > 10³ samples) real forecasting tasks, providing competitive
results both in terms of computational efficiency and forecasting accuracy with respect to
state-of-the-art and deep learning strategies.

... The process is extended to regression problems, where the target values are real numbers and the methods are called 'regression trees' [17]. Nonlinear regression is instead based on developing predictive models that combine basis functions, such as polynomials, sigmoids, and splines [18]. Polynomial regression is one of the simplest approaches: it fits a model using curves of order n ≥ 2 (quadratic, cubic, etc.), while the spline approach produces a piecewise model in which each piece is trained only on the values lying in a specified interval. ...
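As a rough illustration of the two ideas in this excerpt, the sketch below fits a quadratic polynomial to synthetic data and then a simple piecewise model in which each piece sees only the points in its own interval. The data, the single breakpoint at zero and the variable names are assumptions made for the example; the piecewise fit is deliberately simpler than a true spline, since it imposes no continuity at the breakpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
# Synthetic target: a quadratic curve plus a little Gaussian noise.
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Polynomial regression of order 2; coefficients come highest power first.
coeffs = np.polyfit(x, y, deg=2)
y_poly = np.polyval(coeffs, x)

# Piecewise model: one straight line per interval, each fitted only
# on the observations that fall inside that interval.
left = x < 0.0
fit_left = np.polyfit(x[left], y[left], deg=1)
fit_right = np.polyfit(x[~left], y[~left], deg=1)
y_piecewise = np.where(left, np.polyval(fit_left, x), np.polyval(fit_right, x))
```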

... It performs this through an unsupervised process that projects the data from the original space into a lower dimensional one where the axes of this new space, called Principal Components (PCs), are computed by combining the original variables. The first PC is oriented along the direction with the maximum variance of data [18]. This mathematically corresponds to finding the vector a = [a_1, ... ...
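The maximum-variance property of the first PC can be checked numerically. The sketch below uses the standard textbook formulation (the unit-norm eigenvector of the sample covariance matrix with the largest eigenvalue), which may differ in notation from the cited paper; the function name `first_principal_component` is hypothetical.

```python
import numpy as np

def first_principal_component(X):
    """Return the unit-norm direction of maximum variance and that variance."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1]
```

Projecting the centred data onto the returned vector gives a one-dimensional series whose sample variance equals the returned eigenvalue; no other unit-norm direction yields a larger variance.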

The large-scale deployment of pervasive sensors and decentralized computing in modern smart grids is expected to exponentially increase the volume of data exchanged by power system applications. In this context, the search for scalable and flexible methodologies aimed at supporting rapid decisions in a data rich, but information limited environment represents a relevant issue to address. To this aim, this paper investigates the role of Knowledge Discovery from massive Datasets in smart grid computing, exploring its various application fields by considering the power system stakeholder available data and knowledge extraction needs. In particular, the aim of this paper is twofold. In the first part, the authors summarize the most recent activities developed in this field by the Task Force on “Enabling Paradigms for High-Performance Computing in Wide Area Monitoring Protective and Control Systems” of the IEEE PSOPE Technologies and Innovation Subcommittee. In the second part, the authors propose the development of a data-driven forecasting methodology, which is modeled by considering the fundamental principles of Knowledge Discovery Process data workflow. Furthermore, the described methodology is applied to solve the load forecasting problem for a complex user case, in order to emphasize the potential role of knowledge discovery in supporting post processing analysis in data-rich environments, as feedback for the improvement of the forecasting performances.

... Nonlinear regression is based on developing predictive models that combine basis functions, such as polynomials, sigmoids, and splines [13]. Polynomial regression is one of the simplest approaches: it fits a model using curves of order n ≥ 2 (quadratic, cubic, etc.), while the spline approach produces a piecewise model in which each piece is trained only on the values lying in a specified interval. ...

... It performs this through an unsupervised process that projects the data from the original space to a lower dimensional one, where the axes of this new space, called Principal Components (PCs), are computed by combining the original variables. The first PC is oriented along the direction with the maximum variance of data [13]. This mathematically corresponds to finding the vector a = [a_1, ... ...

The large-scale deployment of pervasive sensors and decentralized computing in modern smart grids is expected to exponentially increase the volume of data exchanged by power system applications. In this context, the search for scalable and flexible methodologies aimed at supporting rapid decisions in a data rich, but information limited environment represents a relevant issue to address. To this aim, this paper outlines the potential role of Knowledge Discovery from massive Datasets in smart grid computing, presenting the most recent activities developed in this field by the Task Force on "Enabling Paradigms for High-Performance Computing in Wide Area Monitoring Protective and Control Systems" of the IEEE PSOPE Technologies and Innovation Subcommittee.

... The statistical phase allowed the authors to extract relevant and reliable information for ML operations. This phase was beneficial because it helped the authors understand the phenomena observed [11] in the collected data and determine whether the secondary data obtained were suitable for a linear regression. Hence the importance of the transformation phase in satisfying the normality assumption required for sound regression analysis and modeling. ...

In this research, the authors found that statistical analysis is a very important preliminary phase in Machine Learning, especially for regression problems. Indeed, when the authors developed the first single models using the same algorithms and the same dataset, they obtained poor performance. After verifying the assumptions of multiple linear regression, they adjusted the data used and produced efficient models. Moreover, as the objective was to apply the stacking model to predict Patient's Length of Stay in a semi-urban hospital, the results showed that the stacking regressor performed better than the seven different models implemented (Random Forest, Extra Trees, Decision Tree, XGBoost, Multilayer perceptron, Light GBM, Support Vector Regressor (SVR)) taken individually. The authors combined Random Forest Regressor, Extra Trees Regressor, Decision Tree Regressor, XGBoost, Light GBM, and SVR to build the stacking model. Using secondary data from four services (Pediatrics, Hospitalization, Gynecology, and Neonatology) of a semi-urban hospital, located in a region of ongoing war in eastern Democratic Republic of Congo (DRC), the study examined the minimum length of stay of a patient in hospital when admitted to one of the four above services. Performance was evaluated using MAE, RMSE, MSE, R-squared and Accuracy. The stacking regression model improved from 85% accuracy before the statistical analysis phase to 91% after it, and from 0.75 to 0.91 in R-squared.

... Websites and internet blogs provide open-source material about ML algorithms. Additionally, the literature contains an abundance of informative textbooks and peer-reviewed papers (Raschka, 2015; Hastie et al., 2017; Fernández et al., 2018; Elmousalami and Elaskary, 2020; Shalaby et al., 2020; Bontempi, 2021; Choubey and Karmakar, 2021; Ragab et al., ...

Integrity failure in gas lift wells has been proven to be more severe than in other artificial lift wells across the industry. Accurate risk assessment is an essential requirement for predicting well integrity failures. In this study, a machine learning model was established for automated and precise prediction of integrity failures in gas lift wells. The collected data contained 9,000 data arrays with 23 features. Data arrays were structured and fed into 11 different machine learning algorithms to build an automated systematic tool for calculating the imposed risk of any well. The study models included both single and ensemble supervised learning algorithms (e.g., random forest, support vector machine, decision tree, and scalable boosting techniques). Comparative analysis of the deployed models was performed to determine the best predictive model. Further, novel evaluation metrics for the confusion matrix of each model were introduced. The results showed that extreme gradient boosting and categorical boosting outperformed all the applied algorithms. They can predict well integrity failures with an accuracy of 100% using traditional or proposed metrics. Physical equations were also developed on the basis of feature importance extracted from the random forest algorithm. The developed model will help optimize company resources and dedicate personnel efforts to high-risk wells. As a result, progressive improvements in health, safety, and environment and business performance can be achieved.

Beer is currently a widely consumed good that, in Latin countries, is characterized by a high concentration of national production. These breweries have built high levels of consumer loyalty but lack differentiating features because of the low diversity of products of this type. In recent years there has been a boom in microbreweries, which in less than ten years have grown exponentially, gaining a foothold in venues such as bars, pubs, and cafés and showing signs of strengthening in a market dominated by a single company. In response, consumers have developed patterns in which, for different occasions or events, they lean toward a certain type of beer (craft or commercial). The study aims to show the current landscape of craft beer in Colombia and to establish a sufficient basis for linking current craft beer trends with how they influence consumer sentiment and motivations. The research seeks to shed light on a little-known field, microbreweries in craft beer production, and to obtain a satisfactory answer from the consumer's point of view.
The results obtained are ideal for understanding consumer behavior during craft beer consumption.

The aim of this paper is to investigate the effect of volatility surges during the COVID-19 pandemic crisis on long-term investment trading rules. These trading rules are derived from stock return forecasting based on a Multiple Step Ahead Direct Strategy, and built on the combination of machine learning models and the Autoregressive Fractionally Integrated Moving Average (ARFIMA) model. ARFIMA can account for long memory and structural change in the conditional variance process. The machine learning models considered are a particular Neural Network model (MLP), K-Nearest Neighbors (KNN) and Support Vector Regression (SVR). The trading performances of the produced models are evaluated in terms of economic metrics reflecting profitability and risk, such as Annualized Return, Sharpe Ratio and Profit Ratio. The hybrid model performances are compared to the simple machine learning models and to the classical ARMA-GARCH model using a Volatility Proxy as external regressor. When applying these long-term investment trading rules to the CAC40 index, from May 2016 to May 2020, the finding is that both MLP-based and hybrid ARFIMA-MLP-based trading models show higher performances with a Sharpe Ratio close to 2 and a Profit Ratio around 40% despite the COVID-19 crisis.
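In a direct multiple-step-ahead strategy, a separate model is trained for each horizon h to map the most recent observations straight to the value h steps ahead, so no prediction is ever fed back as an input. The sketch below illustrates that idea only; it assumes a plain least-squares linear model per horizon in place of the MLP, KNN, SVR and ARFIMA models used in the paper, and `direct_forecast` is a hypothetical name.

```python
import numpy as np

def direct_forecast(series, n_lags, horizon):
    """One linear model per horizon h, each predicting y[t+h] from the last n_lags values."""
    series = np.asarray(series, dtype=float)
    T = len(series)
    # Input used at prediction time: intercept term plus the latest n_lags values.
    last_window = np.concatenate([[1.0], series[-n_lags:]])
    forecasts = []
    for h in range(1, horizon + 1):
        ends = range(n_lags - 1, T - h)   # t = index of the newest value in each window
        X = np.array([np.concatenate([[1.0], series[t - n_lags + 1:t + 1]])
                      for t in ends])
        y = np.array([series[t + h] for t in ends])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # model dedicated to horizon h
        forecasts.append(last_window @ coef)
    return np.array(forecasts)
```

The price of avoiding error accumulation is one fitted model per horizon, each trained on slightly fewer samples as h grows.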

A fundamental task in various disciplines of science, including biology, is to find underlying causal relations and make use of them. Causal relations can be seen if interventions are properly applied; however, in many cases they are difficult or even impossible to conduct. It is then necessary to discover causal relations by analyzing statistical properties of purely observational data, which is known as causal discovery or causal structure search. This paper aims to give an introduction to and a brief review of the computational methods for causal discovery that were developed in the past three decades, including constraint-based and score-based methods and those based on functional causal models, supplemented by some illustrations and applications.

The p-value has long been the figurehead of statistical analysis in biology, but its position is under threat. p is now widely recognized as providing quite limited information about our data, and as being easily misinterpreted. Many biologists are aware of p's frailties, but less clear about how they might change the way they analyse their data in response. This article highlights and summarizes four broad statistical approaches that augment or replace the p-value, and that are relatively straightforward to apply. First, you can augment your p-value with information about how confident you are in it, how likely it is that you will get a similar p-value in a replicate study, or the probability that a statistically significant finding is in fact a false positive. Second, you can enhance the information provided by frequentist statistics with a focus on effect sizes and a quantified confidence that those effect sizes are accurate. Third, you can augment or substitute p-values with the Bayes factor to inform on the relative levels of evidence for the null and alternative hypotheses; this approach is particularly appropriate for studies where you wish to keep collecting data until clear evidence for or against your hypothesis has accrued. Finally, specifically where you are using multiple variables to predict an outcome through model building, Akaike information criteria can take the place of the p-value, providing quantified information on what model is best. Hopefully, this quick-and-easy guide to some simple yet powerful statistical options will support biologists in adopting new approaches where they feel that the p-value alone is not doing their data justice.
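The last of the four approaches, model selection by Akaike information criterion, is easy to illustrate. The sketch below is not taken from the article: it assumes Gaussian errors, for which AIC = 2k - 2 ln L can be computed from the residual sum of squares, and it compares polynomial models of increasing degree on synthetic data. The function name `gaussian_aic` and the data are assumptions.

```python
import numpy as np

def gaussian_aic(y, y_hat, n_coeffs):
    """AIC = 2k - 2 ln L for a least-squares fit with Gaussian errors."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    sigma2 = rss / n                                  # maximum-likelihood noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = n_coeffs + 1                                  # coefficients plus the variance
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 60)
# Synthetic data generated from a quadratic relationship.
y = 1.0 + x + 2.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_by_degree = {
    d: gaussian_aic(y, np.polyval(np.polyfit(x, y, d), x), d + 1)
    for d in (1, 2, 3)
}
```

Only differences in AIC between models fitted to the same data are meaningful; the absolute values carry no interpretation.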

Statistics: A Very Short Introduction describes a field very different from the dry and dusty discipline of the popular imagination. In its place is an exciting subject which uses deep theory and powerful software tools to shed light and enable understanding. And it sheds this light on all aspects of our lives, enabling astronomers to explore the origins of the universe, archaeologists to investigate ancient civilisations, governments to understand how to benefit and improve society, and businesses to learn how best to provide goods and services. Aimed at readers with no prior mathematical knowledge, this Very Short Introduction explores and explains how statistics work, and how we can decipher them.

This textbook introduces linear algebra and optimization in the context of machine learning. Examples and exercises are provided throughout this textbook together with access to a solutions manual. This textbook targets graduate level students and professors in computer science, mathematics and data science. Advanced undergraduate students can also use this textbook. The chapters for this textbook are organized as follows:
1. Linear algebra and its applications: The chapters focus on the basics of linear algebra together with their common applications to singular value decomposition, matrix factorization, similarity matrices (kernel methods), and graph analysis. Numerous machine learning applications have been used as examples, such as spectral clustering, kernel-based classification, and outlier detection. The tight integration of linear algebra methods with examples from machine learning differentiates this book from generic volumes on linear algebra. The focus is clearly on the most relevant aspects of linear algebra for machine learning and to teach readers how to apply these concepts.
2. Optimization and its applications: Much of machine learning is posed as an optimization problem in which we try to maximize the accuracy of regression and classification models. The “parent problem” of optimization-centric machine learning is least-squares regression. Interestingly, this problem arises in both linear algebra and optimization, and is one of the key connecting problems of the two fields. Least-squares regression is also the starting point for support vector machines, logistic regression, and recommender systems. Furthermore, the methods for dimensionality reduction and matrix factorization also require the development of optimization methods. A general view of optimization in computational graphs is discussed together with its applications to back propagation in neural networks.
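The remark that least-squares regression belongs to both fields can be made concrete with a small sketch, which is an illustration on assumed synthetic data rather than material from the textbook: the same weights are obtained once by solving the linear system (the linear algebra view) and once by gradient descent on the mean squared error (the optimization view).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# Linear algebra view: solve the least-squares problem directly.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Optimization view: gradient descent on (1 / 2n) * ||X w - y||^2.
n = len(y)
# Step size 1/L, where L is the largest eigenvalue of X^T X / n.
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
w_gd = np.zeros(3)
for _ in range(500):
    grad = X.T @ (X @ w_gd - y) / n
    w_gd -= step * grad
```

Both routes reach the same minimizer because the objective is convex; the iterative route is the one that carries over to models with no closed-form solution.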
A frequent challenge faced by beginners in machine learning is the extensive background required in linear algebra and optimization. One problem is that the existing linear algebra and optimization courses are not specific to machine learning; therefore, one would typically have to complete more course material than is necessary to pick up machine learning. Furthermore, certain types of ideas and tricks from optimization and linear algebra recur more frequently in machine learning than other application-centric settings. Therefore, there is significant value in developing a view of linear algebra and optimization that is better suited to the specific perspective of machine learning.

Updated to conform to Mathematica® 7.0, Introduction to Probability with Mathematica®, Second Edition continues to show students how to easily create simulations from templates and solve problems using Mathematica. It provides a real understanding of probabilistic modeling and the analysis of data and encourages the application of these ideas to practical problems. The accompanying CD-ROM offers instructors the option of creating class notes, demonstrations, and projects.
New to the Second Edition
• Expanded section on Markov chains that includes a study of absorbing chains
• New sections on order statistics, transformations of multivariate normal random variables, and Brownian motion
• More example data of the normal distribution
• More attention on conditional expectation, which has become significant in financial mathematics
• Additional problems from Actuarial Exam P
• New appendix that gives a basic introduction to Mathematica
• New examples, exercises, and data sets, particularly on the bivariate normal distribution
• New visualization and animation features from Mathematica 7.0
• Updated Mathematica notebooks on the CD-ROM (Go to Downloads/Updates tab for link to CD files.)
After covering topics in discrete probability, the text presents a fairly standard treatment of common discrete distributions. It then transitions to continuous probability and continuous distributions, including normal, bivariate normal, gamma, and chi-square distributions. The author goes on to examine the history of probability, the laws of large numbers, and the central limit theorem. The final chapter explores stochastic processes and applications, ideal for students in operations research and finance.