Article

Generalized Linear Models

Authors: J. A. Nelder and R. W. M. Wedderburn

Abstract

The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation. A generalization of the analysis of variance is given for these models using log-likelihoods. These generalized linear models are illustrated by examples relating to four distributions; the Normal, Binomial (probit analysis, etc.), Poisson (contingency tables) and gamma (variance components). The implications of the approach in designing statistics courses are discussed.
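To make the "iterative weighted linear regression" idea concrete, here is a minimal illustrative sketch of iteratively re-weighted least squares for a Poisson log-linear model in Python with NumPy. The simulated data and the variable names (X, y, beta) are invented for the example; this is a generic textbook-style implementation, not the authors' original code or notation.

```python
import numpy as np

# Illustrative IRLS for a Poisson GLM with the canonical log link.
# The design matrix, true coefficients and tolerances below are made up.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
beta_true = np.array([0.5, 0.8, -0.3])
y = rng.poisson(np.exp(X @ beta_true))                          # simulated counts

beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta                  # linear predictor
    mu = np.exp(eta)                # inverse link: Poisson mean
    W = mu                          # working weights; for the canonical log link, W = V(mu) = mu
    z = eta + (y - mu) / mu         # working (adjusted) response
    # Weighted least squares step: solve (X'WX) beta = X'Wz
    XtW = X.T * W
    beta_new = np.linalg.solve(XtW @ X, XtW @ z)
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

print("IRLS estimate:", beta)
```

Each pass is just a weighted least-squares fit with weights and working response recomputed from the current fit, which is what makes the procedure attractive for teaching and for implementation on top of ordinary regression software.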


... Another probability distribution commonly used in mortality modelling is the binomial distribution (Feehan, 2018; Nelder and Wedderburn, 1972). ...
... for the binomial distribution. There are, moreover, several ways of computing a model's deviance. If models are fitted by maximum likelihood, the deviance can be computed as twice the difference between the highest attainable log-likelihood and the log-likelihood achieved by the model under consideration (Nelder and Wedderburn, 1972). For a given data set and a given probability distribution, the likelihood of the saturated model is always the same constant, regardless of the candidate model. ...
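In symbols, the deviance described in this excerpt can be written as follows. This is a standard textbook formulation with ℓ denoting the log-likelihood; the binomial case is shown because the excerpt concerns that distribution, and the notation is mine rather than the cited authors'.

```latex
% Deviance as twice the log-likelihood gap to the saturated model
D = 2\left\{\ell(\hat{\theta}_{\text{sat}}; y) - \ell(\hat{\theta}_{\text{model}}; y)\right\}.

% Binomial case with counts y_i out of n_i and fitted probabilities \hat{\pi}_i:
D = 2\sum_{i}\left[\, y_i \log\frac{y_i}{n_i \hat{\pi}_i}
      + (n_i - y_i)\log\frac{n_i - y_i}{n_i - n_i\hat{\pi}_i} \right].
```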
Thesis
The shape of the mortality curve at the very oldest ages remains uncertain. The debate between a decelerating trajectory and exponential growth with age has not been settled. This lack of consensus is essentially due to the uneven quality of the data and the variety of assumptions used in modelling. This thesis draws on data of excellent quality on deaths in France, Belgium and Quebec to identify the most plausible trajectory at extreme ages using parametric models. We study the different probability distributions applicable to our data, namely the Poisson, negative binomial and binomial distributions, relying on a range of tools for assessing model performance (confidence intervals, deviance residuals and information criteria). Population heterogeneity, first assumed to be unobservable, is taken into account through frailty models; then, assumed to be observable, it is studied with survival-analysis models. Given the available data, the Poisson distribution remains appropriate for modelling mortality at the very oldest ages. A decelerating mortality trajectory appears most plausible in the majority of female populations, whereas exponential growth is more convincing for male populations. Excess male mortality is present in all populations, and it is not possible to identify a mortality plateau. These results do not settle the debate. To decide definitively on the shape of the mortality trajectory, efforts to collect data on deaths at the very oldest ages must continue.
... Mardin's center-periphery analysis has been the source of the majority of works that appeared on the academic terrain during the late 1980s, and thereby much of the literature on political Islam and the Kurdish problem utilized, or at least referred to, this unique analysis. In addition to its scientific value in offering a plausible perspective for understanding the operation of the state/society mechanism in Turkey, this theoretical attempt owed its popularity to the political and sociological developments within the country. As liberalism appeared to be the most favorable ideology of the late 1980s in Turkey, this specific development naturally found its reflection in Turkish academia. ...
... In statistics, the generalized linear model (GLM), introduced by Nelder and Wedderburn (1972), is a flexible generalization of the traditional linear regression model that allows the response variable to have distributions other than the normal distribution (Olsson, 2002; Hilbe, 2011). Generalized additive models for location, scale and shape (GAMLSS), which assume that the response variable may come from a wide range of distributions in the GAMLSS family, were introduced by Hastie and Tibshirani (1987); modelling location, scale and shape in terms of the explanatory variables makes the GAMLSS family a more flexible tool than the GLM family (Rigby et al., 2019; Machsus et al., 2015). ...
... The most popular statistical models able to cope with such heterogeneous data are the generalized linear models (GLMs). The notion of GLMs was introduced in the seminal 1972 work of Nelder and Wedderburn [283], which provided a unified procedure for modeling and fitting distributions within the EDF to data with systematic differences (effects) that can be described by explanatory variables. ...
... Typically, one uses Fisher's scoring method or the iteratively re-weighted least squares (IRLS) algorithm to solve this root-search problem. This is a main result derived in the seminal work of Nelder and Wedderburn [283], and it explains the popularity of GLMs, namely, that GLMs can be fitted efficiently by this algorithm. Fisher's scoring method/the IRLS algorithm explores the updates for t ≥ 0 until convergence ...
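The excerpt is cut off before the update itself; for orientation, the generic Fisher-scoring/IRLS iteration for a GLM with design matrix X, working weights W and working response z is usually written as follows. This is a standard textbook form and not a quotation from the work cited as [283].

```latex
% Generic IRLS / Fisher scoring update (textbook form)
\hat{\beta}^{(t+1)} = \left(X^\top W^{(t)} X\right)^{-1} X^\top W^{(t)} z^{(t)},
\qquad
z_i^{(t)} = \eta_i^{(t)} + \left(y_i - \mu_i^{(t)}\right)\left.\frac{d\eta}{d\mu}\right|_{\mu_i^{(t)}},
\qquad
W_{ii}^{(t)} = \left[\,V\!\left(\mu_i^{(t)}\right)\left(\left.\frac{d\eta}{d\mu}\right|_{\mu_i^{(t)}}\right)^{2}\right]^{-1}.
```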
... Generalized linear models (GLMs) [17] are effective models for a variety of discrete and continuous label spaces, allowing the prediction of binary or count-valued labels (logistic, Poisson regression) as well as real-valued labels (gamma, least-squares regression). Inference in a GLM involves two steps: given a feature vector x ∈ R^d and model parameters w*, a canonical parameter is generated as θ := ⟨w*, x⟩, then the label y is sampled from the exponential family distribution P[y | θ] = exp(y · θ − ψ(θ) − h(y)), ...
... where Z(η, β) = ∫ exp(β · (y · η − ψ(η) − h(y))) dy. This transformation is order- and mode-preserving since x ↦ x^β is an increasing function for any β > 0. This generalized likelihood distribution has variance [17] (1/β)·∇²ψ(η), which tends to 0 as β → ∞. Table 1 lists a few popular distributions, their variance-altered versions, and asymptotic versions as β → ∞. ...
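As an illustration of the two-step generative view described above (canonical parameter, then exponential-family sampling), here is a small sketch for the logistic case. The dimensions and variable names are invented for the example; this is not the SVAM authors' code.

```python
import numpy as np

# Two-step GLM data generation: theta = <w*, x>, then y ~ P[y | theta].
# Logistic case: psi(theta) = log(1 + exp(theta)), so E[y | theta] = sigmoid(theta).
rng = np.random.default_rng(1)
d, n = 5, 1000
w_star = rng.normal(size=d)                 # hypothetical true model parameters
X = rng.normal(size=(n, d))                 # feature vectors x in R^d

theta = X @ w_star                          # canonical parameters
p = 1.0 / (1.0 + np.exp(-theta))            # mean response via the inverse canonical (logistic) link
y = rng.binomial(1, p)                      # labels sampled from the exponential-family (Bernoulli) model

print("fraction of positive labels:", y.mean())
```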
Preprint
Full-text available
This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/
Chapter
Full-text available
This chapter considers recurrent neural (RN) networks. These are special network architectures that are useful for time-series modeling, e.g., applied to time-series forecasting. We study the most popular RN networks which are the long short-term memory (LSTM) networks and the gated recurrent unit (GRU) networks. We apply these networks to mortality forecasting.
Chapter
Full-text available
This chapter introduces and discusses the exponential family (EF) and the exponential dispersion family (EDF). The EF and the EDF are by far the most important classes of distribution functions for regression modeling. They include, among others, the Gaussian, the binomial, the Poisson, the gamma, the inverse Gaussian distributions, as well as Tweedie’s models. We introduce these families of distribution functions, discuss their properties and provide several examples. Moreover, we introduce the Kullback–Leibler (KL) divergence and the Bregman divergence, which are important tools in model evaluation and model selection.
Chapter
Full-text available
This chapter considers convolutional neural (CN) networks. These are special network architectures that are useful for time-series and spatial data modeling, e.g., applied to image recognition problems. Time-series and images have a natural topology, and CN networks try to benefit from this additional structure (over tabular data). We introduce these network architectures and provide insurance-relevant examples related to telematics data and mortality forecasting.
Chapter
Full-text available
This chapter is on classical statistical decision theory. It is an important chapter for historical reasons, but it also provides the right mathematical grounding and intuition for more modern statistical tools from data science and machine learning. In particular, we discuss maximum likelihood estimation (MLE), unbiasedness, consistency and asymptotic normality of MLEs in this chapter.
Chapter
Full-text available
This chapter presents a selection of different topics. We discuss forecasting under model uncertainty, deep quantile regression, deep composite regression and the LocalGLMnet which is an interpretable FN network architecture. Moreover, we provide a bootstrap example to assess prediction uncertainty, we discuss mixture density networks, and we give an outlook to studying variational inference.
Chapter
Full-text available
The core of this book are deep learning methods and neural networks. This chapter considers deep feed-forward neural (FN) networks. We introduce the generic architecture of deep FN networks, and we discuss universality theorems of FN networks. We present network fitting, back-propagation, embedding layers for categorical variables and insurance-specific issues such as the balance property in network fitting, as well as network ensembling to reduce model uncertainty. This chapter is complemented by many examples on non-life insurance pricing, but also on mortality modeling, as well as tools that help to explain deep FN network regression results.
Chapter
Full-text available
This chapter discusses natural language processing (NLP) which deals with regression modeling of non-tabular or unstructured text data. We explain how words can be embedded into low-dimension spaces that serve as numerical word encodings. These can then be used for text recognition, either using RN networks or attention layers. We give an example where we aim at predicting claim perils from claim descriptions.
Chapter
Full-text available
This chapter illustrates the data used in this book. These are a French motor third party liability (MTPL) claims data set, a Swedish motorcycle claims data set, a Wisconsin Local Government Property Insurance Fund data set, and a Swiss compulsory accident insurance data set.
Chapter
Full-text available
This chapter summarizes some techniques that use Bayes’ theorem. These are classical Bayesian statistical models using, e.g., the Markov chain Monte Carlo (MCMC) method for model fitting. We discuss regularization of regression models such as ridge and LASSO regularization, which has a Bayesian interpretation, and we consider the Expectation-Maximization (EM) algorithm. The EM algorithm is a general purpose tool that can handle incomplete data settings. We illustrate this for different examples coming from mixture distributions, censored and truncated claims data.
Chapter
Full-text available
This chapter discusses state-of-the-art statistical modeling in insurance and actuarial science, which is the generalized linear model (GLM). We discuss GLMs in the light of claim count and claim size modeling, we present feature engineering, model fitting, model selection, over-dispersion, zero-inflated claim counts problems, double GLMs, and insurance-specific issues such as the balance property for having unbiasedness.
... The extreme points of model space: by Breiman's definition, data models tend to be simple, easy to interpret, and theoretically tractable, but have limited expressivity. This framework describes a wide set of standard statistical tools, including linear regression, logistic regression, autoregressive moving average (ARMA) models (Box et al., 2015), generalized linear models (Nelder and Wedderburn, 1972), and linear mixed models (McCulloch and Neuhaus, 2005). Due to the previously mentioned properties, data models are the go-to methods in many fields, such as econometrics, political science, and other social sciences. ...
... Achieving in deep neural networks the interpretability of flexible statistical models such as Generalized Linear Models (GLMs) (Nelder & Wedderburn, 1972) or Generalized Additive Models (GAMs) (Hastie, 2017) is, however, inherently difficult. Recently, Agarwal et al. (2021) introduced Neural Additive Models (NAMs), a framework that models all features individually and thus creates visual interpretability of the individual features. ...
Preprint
Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models.
... A generalized linear model [14] is an extension of the classical general linear model, so linear models are a natural starting point for introducing generalized linear models. The linear regression model is characterized by four essential elements, such as the column vector of dimension (n) of dependent random variables (Y); a systematic component defined as a matrix of size (n × p) and rank (p), called the design matrix, (X) = {X1, X2, . . ...
Article
Full-text available
The success of organizations depends not only on customer loyalty but also on preventing customer attrition. However, little research has been done on the termination of the relationship between the firm and its clients. In other words, a better understanding of the nature, steps, and factors involved in the termination process will make it easier to anticipate and prevent termination while trying to recover lost clients and attract new prospects. In this study, we try to detail, through a literature review, the "determinants of churn" negatively impacting the relationship between companies and their customer portfolio, and complete the work with an empirical study in the automobile insurance sector describing the influence of these cessation factors on the insured's act of switching. The rate of termination, measured by the number of defecting customers relative to the total number of a company's customers in a given period, appears to be a major concern for primarily service-oriented companies. The service activities of these organizations develop in close interdependence with an environment that imposes constraints on them. In order to cope with uncertainties, their internal structures adapt to the types and conditions of the environment, which is neither static nor homogeneous. According to R. De Bruecker (1995), "the business environment is defined by everything outside it: technology, the nature of the products, customers and competitors, other organizations, the political and economic climate, etc." In other words, the company is subject to numerous constraints from its environment over which it has no control. J. R. Edighoffer (1998) notes that the objective of all companies "is to reduce this uncertainty; consequently, they must analyze and understand their environment". A gradual deterioration in the services offered, the continuous evolution of the market, and constantly fluctuating customer attitudes explain why organizations have to deal with customers who are increasingly difficult to apprehend and ready to change suppliers. Our work focuses on analyzing the impact of the factors that stimulate the act of terminating automobile insurance contracts at the end of their term.
... At present, SP methods can be roughly divided into four categories: (1) deterministic prediction: inverse distance weighting (IDW) (Willmott et al., 1985) and the generalized linear model (GLM) (Nelder and Wedderburn, 1972); (2) geostatistical methods: kriging (Matheron, 1963); (3) combined methods: regression kriging (RK) (Mohanasundaram et al., 2020); (4) machine learning (ML). As practical problems become more complex, these basic methods cannot meet the requirements. ...
Article
Full-text available
Spatial prediction (SP) based on machine learning (ML) has been applied to soil water quality, air quality, the marine environment, etc. However, there are still deficiencies in dealing with the problem of small samples. Normally, ML requires large amounts of training samples to prevent underfitting, and the data augmentation (DA) methods mixup and the synthetic minority over-sampling technique (SMOTE) ignore the similarity of geographic information. Therefore, this paper proposes a modified upsampling method and combines it with random forest spatial interpolation (RFSI) to deal with the small-sample problem in geographical space. The modified upsampling is mainly reflected in the following two aspects. Firstly, in the process of selecting the nearest points, points with similar geographic information in some aspects are selected within each category after classification. Secondly, the selected difference is the difference within each category. In order to verify the effectiveness of the proposed method, we use daily precipitation data for January 2018 in Chongqing. The experimental results show that the combination of the modified upsampling method and RFSI effectively improves the accuracy of SP.
... Eddin et al. [78] investigate how aggregated transaction statistics and different graph features can be used to flag suspicious bank client behavior. To this end, the authors consider a random forest model, a generalized linear model [79], and gradient-boosted trees with LightGBM [80]. The authors utilize a large data set from a non-disclosed bank. ...
Article
Full-text available
Money laundering is a profound global problem. Nonetheless, there is little scientific literature on statistical and machine learning methods for anti-money laundering. In this paper, we focus on anti-money laundering in banks and provide an introduction and review of the literature. We propose a unifying terminology with two central elements: (i) client risk profiling and (ii) suspicious behavior flagging. We find that client risk profiling is characterized by diagnostics, i.e., efforts to find and explain risk factors. On the other hand, suspicious behavior flagging is characterized by non-disclosed features and hand-crafted risk indices. Finally, we discuss directions for future research. One major challenge is the need for more public data sets. This may potentially be addressed by synthetic data generation. Other possible research directions include semi-supervised and deep learning, interpretability, and fairness of the results.
... Diagnostic tests or proxies of alternative equilibria. We modelled the response of chlorophyll-a to TP and TN using generalised linear models [58] with a Gamma distribution and an identity link on untransformed data, for single-year and multiple-year means up to 5-year means. We used the Gamma distribution, as chlorophyll fitted this significantly better than a normal or log-normal distribution. ...
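A minimal sketch of the kind of model described here, fitted with statsmodels: the chlorophyll, TP and TN arrays are placeholders rather than the study's data, and the exact capitalisation of the link class (Identity vs. identity) depends on the statsmodels version.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical lake-mean data: chlorophyll-a response, TP and TN predictors (untransformed).
rng = np.random.default_rng(2)
tp = rng.uniform(10, 200, size=120)
tn = rng.uniform(200, 3000, size=120)
chl = rng.gamma(shape=5.0, scale=(2.0 + 0.15 * tp + 0.005 * tn) / 5.0)  # mean grows with TP and TN

X = sm.add_constant(np.column_stack([tp, tn]))
# Gamma GLM with an identity link, mirroring the modelling choice in the excerpt.
model = sm.GLM(chl, X, family=sm.families.Gamma(link=sm.families.links.Identity()))
result = model.fit()
print(result.summary())
```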
Article
Full-text available
Since its inception, the theory of alternative equilibria in shallow lakes has evolved and been applied to an ever wider range of ecological and socio-ecological systems. The theory posits the existence of two alternative stable states or equilibria, which in shallow lakes are characterised by either clear water with abundant plants or turbid water where phytoplankton dominate. Here, we used data simulations and real-world data sets from Denmark and northeastern USA (902 lakes in total) to examine the relationship between shallow lake phytoplankton biomass (chlorophyll-a) and nutrient concentrations across a range of timescales. The data simulations demonstrated that three diagnostic tests could reliably identify the presence or absence of alternative equilibria. The real-world data accorded with data simulations where alternative equilibria were absent. Crucially, it was only as the temporal scale of observation increased (>3 years) that a predictable linear relationship between nutrient concentration and chlorophyll-a was evident. Thus, when a longer term perspective is taken, the notion of alternative equilibria is not required to explain the response of chlorophyll-a to nutrient enrichment which questions the utility of the theory for explaining shallow lake response to, and recovery from, eutrophication.
... Four types of models, namely linear models, linear mixed models, generalized linear models (GLMs) (Nelder and Wedderburn, 1972), and generalized linear mixed models (Hedeker, 2005), were tested for each variable. The model with the lowest Akaike Information Criterion (AIC) value (Chakrabarti and Ghosh, 2011) was preferred. ...
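To illustrate the kind of AIC-based comparison described in this excerpt, here is a brief sketch using statsmodels with simulated data and two simple candidate GLMs; these are not the study's variables or models, and mixed-model AIC handling differs by package, so only fixed-effects candidates are shown.

```python
import numpy as np
import statsmodels.api as sm

# Simulated count response and one predictor (placeholders, not the study's variables).
rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = rng.poisson(np.exp(0.3 + 0.7 * x))
X = sm.add_constant(x)

# Two candidate models: a Gaussian (ordinary linear) GLM and a Poisson GLM.
candidates = {
    "gaussian_identity": sm.GLM(y, X, family=sm.families.Gaussian()),
    "poisson_log": sm.GLM(y, X, family=sm.families.Poisson()),
}
fits = {name: model.fit() for name, model in candidates.items()}
for name, res in fits.items():
    print(f"{name}: AIC = {res.aic:.1f}")

best = min(fits, key=lambda name: fits[name].aic)
print("preferred model (lowest AIC):", best)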
... where the p-value is obtained by using a Generalised Linear Model [19] and the effectSize is calculated using Cohen's d [20]. While not directly part of the test function, they consider a power analysis to exclude mutations for which the statistical power of the test is too low (with the threshold β ≥ 0.8). ...
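For reference, Cohen's d as cited here is the difference of two group means divided by a pooled standard deviation. A small sketch follows; the reward arrays for an "original" and a "mutated" agent are hypothetical and not taken from the paper.

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation (unbiased variance estimates)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical reward samples from an original agent and a mutated agent.
rng = np.random.default_rng(4)
original = rng.normal(200.0, 20.0, size=30)
mutant = rng.normal(185.0, 20.0, size=30)
print("effect size (Cohen's d):", round(cohens_d(original, mutant), 3))
```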
Preprint
Testing Deep Learning (DL) systems is a complex task as they do not behave like traditional systems would, notably because of their stochastic nature. Nonetheless, being able to adapt existing testing techniques such as Mutation Testing (MT) to DL settings would greatly improve their potential verifiability. While some efforts have been made to extend MT to the Supervised Learning paradigm, little work has gone into extending it to Reinforcement Learning (RL) which is also an important component of the DL ecosystem but behaves very differently from SL. This paper builds on the existing approach of MT in order to propose a framework, RLMutation, for MT applied to RL. Notably, we use existing taxonomies of faults to build a set of mutation operators relevant to RL and use a simple heuristic to generate test cases for RL. This allows us to compare different mutation killing definitions based on existing approaches, as well as to analyze the behavior of the obtained mutation operators and their potential combinations called Higher Order Mutation(s) (HOM). We show that the design choice of the mutation killing definition can affect whether or not a mutation is killed as well as the generated test cases. Moreover, we found that even with a relatively small number of test cases and operators we manage to generate HOM with interesting properties which can enhance testing capability in RL systems.
... In the second step, generalized linear models (GLMs) are used as an alternative to the truncated Gutenberg-Richter model. These models were first developed by Nelder and Wedderburn (1972), and since then various authors have given them special attention, including McCullagh and Nelder (1989) and Dunteman and Ho (2006). Furthermore, GLMs have been used in a variety of fields, including actuarial science, insurance, and engineering (Jong & Heller, 2008; Myers et al., 2002). ...
Article
Full-text available
In this study, the earthquake frequency–magnitude relationship is modeled using two different designs: the truncated Gutenberg–Richter distribution and the generalized linear model. A goodness-of-fit statistical model is applied to the generalized linear model, and the generalized Poisson regression model appears to be superior to the generalized negative binomial regression model when considering the model selection criteria, namely the Akaike information criterion, Bayesian information criterion, likelihood ratio, and chi-square statistics. The primary goals of this study are to determine the annual rate above Mw 4.0 and the b-value of the truncated Gutenberg–Richter relationship, the probability of exceedance within a time period of 25, 50, and 100 years, and the return period of magnitude above Mw 4.0, and to compare these results to those obtained using the selected generalized Poisson regression model. According to the analyses, the generalized Poisson regression model can be effectively used to derive seismic hazard parameters instead of the Gutenberg-Richter model. Among the obtained results, the b-value at Algiers city is equal to 0.73±0.03 and the annual rate above Mw 4.0 is 4.48±0.19: the values of maximum possible magnitude obtained using the Kijko–Sellevoll and Tate–Pisarenko estimators are very close, 7.74±0.46 and 7.50±0.10, respectively, whereas they are equal to 7.40±0.15 and 7.43±0.16 for their Bayesian versions, respectively. The mean return periods derived using the truncated Gutenberg–Richter model with Kijko–Sellevoll estimator are similar to those derived using the generalized Poisson model for magnitudes less than Mw 5.0 and differ for magnitudes greater than Mw 5.0.
... All modelling was conducted with the h2o 3.36.0.3 R package (H2O.ai, 2021). For the GLM, we included both first- and second-order dependencies on the predictors and assumed a normal distribution of the target variable with an identity link function (Nelder & Wedderburn, 1972). In the GAM, we fitted smoothing terms for all predictor variables using cubic regression splines, the most common smoothing algorithm ... and foraminifers, respectively, and the minimum number of rows at each final node (min_rows) was set to three and two. ...
Preprint
Full-text available
Shelled pteropods and planktic foraminifers are calcifying zooplankton that contribute to the biological carbon pump, but their importance for regional and global plankton biomass and carbon fluxes is not well understood. Here, we modelled global annual patterns of pteropod and foraminifer total carbon (TC) biomass and total inorganic carbon (TIC) export fluxes over the top 200 m using an ensemble of five species distribution models (SDMs). An exhaustive, newly assembled dataset of zooplankton abundance observations was used to estimate the biomass of both plankton groups. With the SDM ensemble we modelled global TC biomass depending on multiple environmental parameters. We found hotspots of mean annual pteropod biomass in the high Northern latitudes and the global upwelling systems, and, for foraminifers, in the high latitudes of both hemispheres and the tropics. This largely agrees with previously observed distributions. For the biomass of both groups, surface temperature is the strongest environmental correlate, followed by chlorophyll-a. We found mean annual standing stocks of 52 (48-57) Tg TC and 0.9 (0.6-1.1) Tg TC for pteropods and foraminifers, respectively. This translates to mean annual TIC fluxes of 14 (9-22) Tg TIC yr⁻¹ for pteropod shells and 11 (3-27) Tg TIC yr⁻¹ for foraminifer tests. These results are similar to previous estimates for foraminifer standing stocks and fluxes but approximately a factor of ten lower for pteropods. The two zooplankton calcifiers contribute approximately 1.5% each to global surface carbonate fluxes, leaving 40%-60% of the global carbonate fluxes unaccounted for. We make suggestions on how to close this gap.
... It is well known that the Poisson distribution is the most popular model to deal with count data under the framework of the generalized linear models (GLM) (Nelder and Wedderburn 1972). However, it is limited to equidispersed count data, i.e., when the mean of the response variable is equal to the variance. ...
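A short sketch of the equidispersion point made above: fit a Poisson GLM and inspect the ratio of the Pearson chi-square to the residual degrees of freedom, which should be near 1 for equidispersed data. The data here are simulated (deliberately overdispersed) and not from the cited paper.

```python
import numpy as np
import statsmodels.api as sm

# Overdispersed counts: negative-binomial data fitted with a Poisson GLM.
rng = np.random.default_rng(5)
x = rng.normal(size=500)
mu = np.exp(0.2 + 0.6 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # mean mu, variance mu + mu**2 / 2

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print(f"Pearson dispersion estimate: {dispersion:.2f}  (about 1 under equidispersion)")
```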
Preprint
Full-text available
There is a rich literature on univariate regression models for count data. However, this is not the case for multivariate count data. Therefore, we present the Multivariate Generalized Linear Mixed Models framework, which deals with a multivariate set of responses, measuring the correlation between them through random effects that follow a multivariate normal distribution. This model is based on a GLMM with a random intercept, and the estimation process remains the same as for a standard GLMM, with the random effects integrated out via Laplace approximation. We implemented this model efficiently through the TMB package available in R. We used the Poisson, negative binomial (NB), and COM-Poisson distributions. To assess the estimators' properties, we conducted a simulation study considering four different sample sizes and three different correlation values for each distribution. We obtained unbiased and consistent estimators for the Poisson and NB distributions; for the COM-Poisson, the estimators were consistent but biased, especially the dispersion, variance, and correlation parameter estimators. These models were applied to two datasets. The first concerns a sample from 30 different sites in Australia where the number of times each of 41 different ant species was recorded, which results in an impressive 820 variance-covariance and 41 dispersion parameters estimated simultaneously, let alone the regression parameters. The second is from the Australian Health Survey, with 5 response variables and 5190 respondents. Both datasets can be considered overdispersed according to the generalized dispersion index. The COM-Poisson model outperformed the other two competitors on three goodness-of-fit indexes. Therefore, the proposed model is capable of dealing with multivariate count data and of measuring any kind of correlation between the responses while taking into account the effects of the covariates.
... Moreover, if there is no θ ≠ θ₀ such that p_θ = p_{θ₀} a.s., then θ₀ is the unique minimizer of L. We give in Tab. 1 a few examples from the class of generalized linear models (GLMs) proposed by Nelder and Wedderburn [30]. ...
Preprint
This paper revisits a fundamental problem in statistical inference from a non-asymptotic theoretical viewpoint: the construction of confidence sets. We establish a finite-sample bound for the estimator, characterizing its asymptotic behavior in a non-asymptotic fashion. An important feature of our bound is that its dimension dependency is captured by the effective dimension, the trace of the limiting sandwich covariance, which can be much smaller than the parameter dimension in some regimes. We then illustrate how the bound can be used to obtain a confidence set whose shape is adapted to the optimization landscape induced by the loss function. Unlike previous works that rely heavily on the strong convexity of the loss function, we only assume the Hessian is lower bounded at the optimum and allow it to gradually become degenerate. This property is formalized by the notion of generalized self-concordance, which originated in convex optimization. Moreover, we demonstrate how the effective dimension can be estimated from data and characterize its estimation accuracy. We apply our results to maximum likelihood estimation with generalized linear models, score matching with exponential families, and hypothesis testing with Rao's score test.
... Another interesting alternative may be the distributional regression models which, as stated by Heller et al. (2022), were first proposed by Rigby & Stasinopoulos (2005) as the generalised additive models for location, scale and shape (GAMLSS), a class of regression models that extends the well-known generalised linear models (Nelder & Wedderburn, 1972) and generalised additive models (Hastie & Tibshirani, 1990). In fact, different works have already considered this framework to model censored data, such as Castro et al. ...
Article
Full-text available
The study of the expected time until an event of interest is a recurring topic in different fields, such as medicine, economics and engineering. The Kaplan-Meier method and the Cox proportional hazards model are the most widely used methodologies to deal with such data. Nevertheless, in recent years, the generalised additive models for location, scale and shape (GAMLSS), which can be seen as distributional regression and/or beyond-the-mean regression models, have been standing out as a result of their high flexibility and ability to fit complex data. GAMLSS are a class of semi-parametric regression models, in the sense that they assume a distribution for the response variable, and any and all of its parameters can be modelled as linear and/or non-linear functions of a set of explanatory variables. In this paper, we present the Box-Cox family of distributions under the distributional regression framework as a solid alternative for modelling censored data.
... Poisson regression is included in the generalized linear model (GLM) family, a nonlinear framework for modelling the relationship between response and predictor variables (Nelder and Wedderburn, 1972). The response variable in Poisson regression is a count (Cameron and Trivedi, 2013). ...
Article
Full-text available
Tuberculosis is caused by Mycobacterium tuberculosis (MT). MT usually attacks the lungs and causes pulmonary tuberculosis. Tuberculosis cases in Indonesia keep increasing over the years. The presence of Multidrug-Resistant Tuberculosis (MDR-TB) has been one of the main obstacles in eradicating tuberculosis because it cannot be cured using standard drugs. In fact, the success rate of MDR-TB treatment in 2019 at the global level was only 57 percent. Research on MDR-TB can be related to the spatial aspect because this disease can be transmitted quickly. This study aims to obtain an overview of and model the number of Indonesia’s pulmonary MDR-TB cases in 2019 using the Geographically Weighted Negative Binomial Regression (GWNBR) method. The independent variables used in the model are population density, percentage of poor population, health center ratio per 100 thousand population, ratio of health workers per 10 thousand population, percentage of smokers, percentage of the region with PHBS policies, and percentage of BCG immunization coverage. The findings reveal that the model forms 12 regional groups based on significant variables, and that GWNBR gives better results than NBR. The significant spatial correlation implies that collaboration among regional governments plays an important role in reducing the number of pulmonary MDR-TB cases.
... Only the marginal models produced negative ICCs. The GLMM-MLE is known to truncate the ICC at zero rather than produce negative ICCs, effectively fitting a generalized linear model (GLM) (36). Sampling error, due to the limited number of subjects available to sample (the sample cluster size) compared with the population cluster size, which is unlimited, could be the cause of a negative ICC (35). ...
Preprint
Full-text available
Background: Using four case studies, we aim to provide practical guidance and recommendations for the analysis of cluster randomised controlled trials. Methods: Four modelling approaches (Generalized Linear Mixed Models with parameters/coefficients estimated by Maximum likelihood; Generalized Linear Models with parameters/coefficients estimated by Generalized Estimating Equations (1st order or second order) or Quadratic Inference Function) for the analysis of correlated individual participant level outcomes in cluster randomised controlled trials were identified after we reviewed the literature. These four methods are applied to four case studies of cluster randomised controlled trials with the number of clusters ranging from 10 to 100 and individual participants ranging from 748 to 9,207. Results are obtained for both continuous and binary outcomes using the statistical packages, R and SAS. Results: The intracluster correlation coefficient (ICC) for each of the case studies was small (<0.05) indicating little dependence of the outcomes related to cluster allocation. In most cases the four methods produced similar results. However, in a few analyses quadratic inference function produced different results compared to the other three methods. Conclusion: This paper demonstrates the analysis of cluster randomised controlled trials with four modelling approaches. The results obtained were similar in most cases, a plausible reason could be the negligible correlation (small ICCs) observed among responses in the four case studies. Due to the small ICC values obtained the generalisability of our results is limited. It is important to conduct simulation studies to comprehensively investigate the performance of the four modelling approaches.
... A key problem to consider for the SA stage is to achieve a balance between the accuracy and the economy of compute (Saltelli et al., 2008). People sometimes use surrogates (such as the generalized linear model (GLM) (Nelder and Wedderburn, 1972)) instead of the actual model to further reduce the compute cost. ...
Preprint
Full-text available
The Single Column Atmospheric Model (SCAM) is an essential tool for analyzing and improving the physics schemes of CAM. Although it already largely reduces the compute cost from a complete CAM, the exponentially-growing parameter space makes a combined analysis or tuning of multiple parameters difficult. In this paper, we propose a hybrid framework that combines parallel execution and a learning-based surrogate model, to support large-scale sensitivity analysis (SA) and tuning of combinations of multiple parameters. We start with a workflow (with modifications to the original SCAM) to support the execution and assembly of a large number of sampling, sensitivity analysis, and tuning tasks. By reusing the 3,840 instances with the variation of 11 parameters, we train a neural network (NN) based surrogate model that achieves both accuracy and efficiency (with the computational cost reduced by several orders of magnitude). The improved balance between cost and accuracy enables us to integrate NN-based grid search into the traditional optimization methods to achieve better optimization results with fewer compute cycles. Using such a hybrid framework, we explore the joint sensitivity of multi-parameter combinations to multiple cases using a set of three parameters, identify the most sensitive three-parameter combination out of eleven, and perform a tuning process that reduces the error of precipitation by 5 % to 15 % in different cases.
... The generalized linear model (GzLM) was first presented by Nelder and Wedderburn (1972) and developed by McCullagh and Nelder (1989). It is a systematic extension of familiar regression models, such as linear models for a continuous response variable given continuous and/or categorical predictors. ...
Chapter
Floods mostly vary from one region to another, and their severity is determined by a variety of factors, including unpredictable weather patterns and heavy rainfall occurrences (Pham Van and Nguyen-Van, 2020; Soulard et al., 2020). Although floods are common in many places of India during monsoon seasons, the Ganga basin is particularly vulnerable (Bhatt et al., 2021; Meena et al., 2021). There are many areas in the state of Bihar that get flooded due to the swelling of rivers in neighboring Nepal (Lal et al., 2020; Soulard et al., 2020; Wagle et al., 2020). This drew the attention of the present research. The Ganga basin spans China, Nepal, India, and Bangladesh (Agnihotri et al., 2019; Ahmad and Goparaju, 2020; Prakash et al., 2017; Sinha and Tandon, 2014). The global emergence of COVID-19 stopped all activities, and it debuted as the deadliest disease, accompanied by the longest nationwide lockdown. This caused enormous disruption in all aspects of people's livelihoods. Moreover, major obstacles accumulated due to the flooding event of July 2020. It added misery to the lives and livelihoods of people who were trying to control the spread of COVID-19, and it diverted disaster-risk mitigation efforts to other sectors. The only way to have an effective and prompt response is to have real-time information provided by space-based sensors. Using a cloud-based platform like Google Earth Engine (GEE), an automated technique is employed to analyze the flood inundation with Synthetic Aperture Radar (SAR) images. The study exhibits the potential of automated techniques along with algorithms applied to larger datasets on cloud-based platforms. The results present flood extent maps for the lower Ganga basin, comprising areas of the Indian subcontinent. Severe floods destroyed several parts of Bihar and West Bengal, affecting a large population. This study offers a prompt and precise estimation of inundated areas to facilitate a quick response for risk assessment, particularly in times of COVID-19. The three states (Bihar, Jharkhand, and West Bengal), collectively known as the Lower Ganga Basin, are home to more than 30% of the population (Prakash et al., 2017). Rapid population growth and settlements resulted in changes in land use, increased soil erosion, increased siltation, and other related variables that augmented flood severity (Li et al., 2020; Pham Van and Nguyen-Van, 2020). However, floods became the most frequent disaster in recent times, and what compounded the problem was the COVID-19 pandemic (Krämer et al., 2021; Lal et al., 2020). As a result, new measures were needed to manage the spread of COVID-19 as well as flood mitigation (Wang et al., 2020; Zoabi et al., 2021). Although ground data and field measurements are considered to be more accurate, they are time- and money-consuming. Furthermore, field surveys were impossible to conduct during this period, since social distancing had become the norm and surveys were linked with significant health concerns and travel expenditures (Jian et al., 2020; Lattari et al., 2019). Flood mitigation strategies that are ineffective may result in more human deaths, property damage, and further spread of COVID-19 (Cornara et al., 2019; Shen et al., 2019). The floods had disastrous impacts in 149 districts throughout Bihar, Assam, and West Bengal. Since movement was halted owing to the sudden shutdown, the only way out was to employ robust flood-control techniques based on real-time information (Das et al., 2018; Dong et al., 2020; Tang et al., 2016).
The dramatic increase in flood occurrence in these locations prompted specialists to implement more structured and effective flood management to address the issues, while also adhering to all COVID-19 norms and regulations (Min et al., 2020; Wang et al., 2019).
... Both can be usefully thought of as having affine structure in their own right, giving two 'dual' affine geometries. The strongly related exponential dispersion families inherit this 'dual' parameter system and, of course, form the workhorse of Applied Statistics through generalised linear models [28]. ...
Article
Full-text available
We take a very high-level overview of the relationship between Geometry and Applied Statistics 50 years on from the birth of Information Geometry. From that date we look both backwards and forwards. We show that Geometry has always been part of the statistician’s toolbox and how it played a vital role in the evolution of Statistics over the last 50 years.
... Count data appear well beyond electronic health record data, in many health information technology research contexts. This article links the framework of generalized linear models (GLMs) introduced by [1] to Bayesian uncertainty quantification and shrinkage priors, with an aim towards modeling high-dimensional count data in state-of-the-art applications. In most such applications, the number of explanatory variables (p) is greater than the number of observations (n) [2]. ...
Preprint
Full-text available
Background: Studies in the public health field often consist of outcome measures such as the number of hospital visits or the number of laboratory tests per person. They arise in genomics, electronic health records, and epidemic modeling, among many other areas. These measures have highly skewed distributions and require count data models for inference. Count data modeling is of prime importance in these fields of public health and the medical sciences. Sparse outcomes, as in next-generation sequencing data, additionally require accounting for zero inflation. Methods: We present a unified Bayesian hierarchical framework that implements and compares shrinkage priors in negative-binomial and zero-inflated negative-binomial regression models. We first represent the likelihood by a Polya-Gamma data augmentation that makes it amenable to a hierarchical model employing a wide class of shrinkage priors. Shrinkage priors are especially relevant for high-dimensional regression. We specifically focus on the Horseshoe, Dirichlet Laplace, and Double Pareto priors. Extensive simulation studies address the model’s efficiency, and mean square errors are reported. Further, the models are applied to data sets on COVID-19 vaccine adverse events, number of Ph.D. publications, and the US National Medical Expenditure Survey, among other datasets. Results: The models consistently showed good performance in variable selection, captured by model accuracies, sensitivities, and specificities, and in predictive performance, captured by mean square errors. We even obtained mean square error rates as low as 0.003 in p > n cases in simulation studies. In the real case studies, the variable selection results strongly confirmed current biological insights and opened the doors to potential new findings. For example, the number of days between COVID-19 vaccination and the onset of adverse events depended on age, sex, whether there was a life threat, whether there was an emergency room visit, the number of extended stays, other medications, laboratory data, disease during vaccination, prior vaccination status, and allergy status, among other factors. A remarkable reduction in the MSE of the fitted values testified to the predictive performance of the model. Conclusions: Bayesian generalized linear models using shrinkage priors are robust enough to extract relevant predictors in high-dimensional regressions. They can be applied to a broad range of biometric and public health high-dimensional problems. The R package "ShrinkageBayesGlm" is also available for hands-on experience at https://github.com/arinjita9/ShrinkageBayesGlm
... GPR is a non-parametric probabilistic regression approach which calculates the probability distribution of all possible functions that fit the data (Williams and Barber, 1998). The GLM was formulated by Nelder and Wedderburn (1972); it is a generalization of ordinary linear regression, and variable or feature selection was performed using Akaike's Information Criterion. SpikeSlab is a Bayesian method for simultaneously performing variable selection and regression, which is particularly useful in high-dimensional settings (Mitchell and Beauchamp, 1988). ...
Article
Accurate estimation of disease severity in the field is a key to minimize the field losses in agriculture. Existing disease severity assessment methods have poor accuracy under field conditions. To overcome this limitation, this study used thermal and visible imaging with machine learning (ML) and model combination (MC) techniques to estimate plant disease severity under field conditions. Field experiments were conducted during 2017–18, 2018–19 and 2021–22 to obtain RGB and thermal images of chickpea cultivars with different levels of wilt resistance grown in wilt sick plots. ML models were constructed using four different datasets created using the wilt severity and image derived indices. ML models were also combined using MC techniques to assess the best predictor of the disease severity. Results indicated that the Cubist was the best ML algorithm, while the KNN model was the poorest predictor of chickpea wilt severity under field conditions. MC techniques improved the prediction accuracy of wilt severity over individual ML models. Combining ML models using the least absolute deviation technique gave the best predictions of wilt severity. The results obtained in the present study showed the MC techniques coupled with ML models improved the prediction accuracies of plant disease severity under field conditions.
... The multivariate linear model (MLM) [59] and the generalized linear model (GLM) [60] were used to investigate the key factors influencing the behavioral patterns of shrimps in the culture environment, and the two approaches were compared. From the recorded data, the proportion of each behavior type was calculated by dividing the number of shrimps exhibiting that behavior pattern by the total number of shrimps in the pond. ...
Article
Full-text available
Recent years have witnessed a tremendous development in shrimp farming around the world, which, however, has raised a variety of issues, possibly due to a lack of knowledge of shrimp behavior in farms. This study focused on the relationship between shrimp behavior and various factors of the natural farming environment through in situ surveys, as distinguished from the majority of laboratory studies on shrimp behavior. In the survey, the behaviors of kuruma prawn (Penaeus japonicus) were recorded in the categories of swimming in the water, crawling on the sand, resting on the sand, and hiding in the sand, followed by quantification of the sex ratio, water quality, density, and light intensity. The results showed average proportions of resting, hiding, crawling, and swimming activities of 69.87%, 20.85%, 8.24%, and 1.04%, respectively, for P. japonicus. Hiding, resting, and crawling behaviors were significantly affected by the sex ratio of the shrimp (p < 0.05). The proportion of hiding behavior showed a negative association with density and a positive association with light intensity, while the proportion of resting behavior showed the opposite, according to both Pearson correlation analysis and multiple linear regression analysis. Light intensity was the only factor that significantly influenced swimming behavior: the probability of swimming was reduced from 48% to 5% when light intensity varied from 0 to 10 lx, as determined by the generalized linear model. It could be speculated that P. japonicus prefers a tranquil environment. Female shrimp might exhibit less aggression and more adventurousness than male shrimp. The findings suggested that light intensity, followed by density, is the most crucial element influencing the behavior of P. japonicus in the culture environment. These findings will contribute to the comprehension of the behavior of P. japonicus and provide a novel perspective for the formulation of its culture management strategy.
... 2.4.1 Multinomial logistic regression. Generalized linear models (GLMs) extend linear models to error distributions other than the Gaussian and to categorical response variables (Nelder and Wedderburn, 1972). The general form of a GLM is very close to that of the traditional linear model, linking the explanatory variables of the problem to the response through a linear combination. ...
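A multinomial logistic regression of the kind referred to here can be fitted with statsmodels; the simulated three-class data below are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 2))
# Three-class response generated from a simple multinomial-logit structure.
eta = np.column_stack([np.zeros(n), 1.0 + X[:, 0], -0.5 + X[:, 1]])
probs = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=pr) for pr in probs])

fit = sm.MNLogit(y, sm.add_constant(X)).fit(disp=0)
print(fit.summary())
```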
Article
Full-text available
Purpose This study aims to perform a benefit segmentation, and then a classification, of visitors who travel to the Rocha Department in Uruguay from the capital city of Montevideo during the summer months. Design/methodology/approach A convenience sample was obtained with an online survey. A total of 290 cases were usable for subsequent data analysis. The following statistical techniques were used: hierarchical cluster analysis, K-means cluster analysis, machine learning, support vector machines, random forest and logistic regression. Findings Visitors who travel to the Rocha Department from Montevideo can be classified into four distinct clusters, labelled as “entertainment seekers”, “Rocha followers”, “relax and activities seekers” and “active tourists”. The support vector machine model achieved the best classification results. Research limitations/implications Implications for destination marketers who cater to young visitors are discussed. Destination marketers should determine an optimal level of resource allocation and destination management activities by comparing both the present costs and the discounted potential future income of the different target markets. Surveying non-residents was not possible; future work should sample tourists from abroad. Originality/value Combining market segmentation of Rocha Department’s visitors from the city of Montevideo with classification of the sampled individuals by training various machine learning classifiers would allow Rocha’s destination marketers to assign an unsampled individual to one of the four clusters already obtained, enhancing marketing promotion for targeted offers.
... Because environmental data were not normally distributed, I used Generalized Linear Modeling (GLM; family = Gamma [link = "log"]; Nelder and Wedderburn 1972) to evaluate the relationship between each environmental variable and the six categories of macrohabitat suitability for both partitions of the original HSM. I used a Poisson error-structure (family = Poisson [link = "log"]) with each GLM regression when comparing habitat affected by categories of both habitat suitability and SBS to graphically establish the relationship between the response variables (i.e., suitability and SBS) and smoothed functions of the predictor variable (i.e., ha of habitat). ...
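The R calls quoted in this excerpt (glm with Gamma and Poisson families and log links) have a direct Python analogue in statsmodels; the suitability classes and hectare values in the sketch below are made-up assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
suitability = rng.integers(1, 7, size=120).astype(float)   # six suitability classes
mu = np.exp(2.0 + 0.4 * suitability)                        # assumed mean hectares
hectares = rng.gamma(shape=2.0, scale=mu / 2.0)             # skewed, positive response

X = sm.add_constant(suitability)
gamma_fit = sm.GLM(hectares, X,
                   family=sm.families.Gamma(link=sm.families.links.Log())).fit()
poisson_fit = sm.GLM(np.round(hectares), X,
                     family=sm.families.Poisson(link=sm.families.links.Log())).fit()
print(gamma_fit.params, poisson_fit.params)
```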
Article
Full-text available
I evaluated the impact and extent of the Monument Fire on the geographic range and suitable macrohabitat of the Trinity bristle snail (Monadenia setosa), a California endemic with limited distribution in northern California. Total area burned by the fire was ~87,984 ha, or 46.0% of the species range (n = 191,156 ha). Total area of suitable macrohabitat for the species is ~107,913 ha, of which 44.5% (n = 47,962 ha) was encompassed by the fire. Results show that the total area of forest cover-type vegetation and individual forest stand attributes impacted by the fire was not significantly different from areas within the species range not burned by the fire. There were no significant proportional differences among the six sequential categories of suitable macrohabitat burned by the fire (i.e., Low, Low-moderate, Moderate, Moderate-high, High, Critical suitability). The percentage of Moderate and Moderate-high suitable macrohabitat burned was only somewhat greater than predicted by the pre-fire species habitat suitability model (HSM). Many individual watersheds were encompassed by the fire, and the resulting mosaic of burned watersheds was highly variable. Application of the Soil Burn Severity (SBS) map identified 8,293 ha (17.3%) of Unburned or very low burned soil, 24,191 ha (50.5%) of Low burned soil, 13,998 ha (29.2%) of Moderately burned soil, and 1,460 ha (3.0%) of Highly burned soil within the boundaries of the Monument Fire. When applied to categories of suitable macrohabitat, I calculated that 31,096 ha (100%) of Low to Low-moderate and 13,998 ha (96.1%) of Moderate to Moderate-high suitable macrohabitat were burned. High and Critical areas of macrohabitat suitability were much less impacted by high SBS (n = 1,461 ha [58.0%]) because these regions were small in size, highly fragmented, widely dispersed across the landscape, and separated by major topographic and riverine discontinuities.
... The Generalized Additive Model (GAM) proposed by Hastie and Tibshirani (1986) is an extension of the Generalized Linear Model (GLM) of Nelder and Wedderburn (1972). It is a non-parametric regression that relaxes the assumption of linearity between the response variable and the covariates, which allows the discovery of non-linear relationships between them. ...
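A quick way to approximate a GAM-style smooth term in Python is to expand a covariate in a B-spline basis and fit it with an ordinary GLM; this unpenalized regression-spline sketch (a stand-in, not the penalized GAM of Hastie and Tibshirani) uses assumed simulated data.

```python
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(0, 10, size=n)
y = rng.poisson(np.exp(1.0 + np.sin(x)))   # response depends non-linearly on x

# B-spline basis for x: a regression-spline stand-in for a GAM smooth term.
basis = dmatrix("bs(x, df=8, degree=3)", {"x": x}, return_type="dataframe")
fit = sm.GLM(y, basis, family=sm.families.Poisson()).fit()
print(fit.aic)
```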
Article
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder with substantial clinical heterogeneity, especially in language and communication ability. There is a need for validated language outcome measures that show sensitivity to true change for this population. We used Natural Language Processing to analyze expressive language transcripts of 64 highly verbal children and young adults (age: 6-23 years, mean 12.8 years; 78.1% male) with ASD to examine the validity across language sampling contexts and the test-retest reliability of six previously validated Automated Language Measures (ALMs), including Mean Length of Utterance in Morphemes, Number of Distinct Word Roots, C-units per minute, unintelligible proportion, um rate, and repetition proportion. Three expressive language samples were collected at baseline and again 4 weeks later. These samples comprised interview tasks from the Autism Diagnostic Observation Schedule (ADOS-2) Modules 3 and 4, a conversation task, and a narration task. The influence of language sampling context on each ALM was estimated using either generalized linear mixed-effects models or generalized linear models, adjusted for age, sex, and IQ. The 4-week test-retest reliability was evaluated using Lin's Concordance Correlation Coefficient (CCC). The three different sampling contexts were associated with significantly (P < 0.001) different distributions for each ALM. With one exception (repetition proportion), ALMs also showed good test-retest reliability (median CCC: 0.73-0.88) when measured within the same context. Taken in conjunction with our previous work establishing their construct validity, this study demonstrates further critical psychometric properties of ALMs and their promising potential as language outcome measures for ASD research.
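Lin's concordance correlation coefficient used for the test-retest analysis has a closed form; a small self-contained computation, with invented baseline and retest values, is shown below.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two measurement vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

baseline = np.array([4.2, 5.1, 3.8, 6.0, 5.5])   # e.g. a language measure at visit 1 (made up)
retest = np.array([4.0, 5.3, 3.9, 5.8, 5.9])     # the same measure 4 weeks later
print(round(lins_ccc(baseline, retest), 3))
```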
Preprint
Full-text available
The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information. Therefore, determining how to aggregate the information obtained from the analysis of data at separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision, when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico to illustrate the proposed method.
Article
Full-text available
Human activities, including urbanization, industrialization, agricultural pollution, and land use, have contributed to the increased fragmentation of natural habitats and decreased biodiversity in Zhejiang Province as a result of socioeconomic development. Numerous studies have demonstrated that the protection of ecologically significant species can play a crucial role in restoring biodiversity. Emeia pseudosauteri is regarded as an excellent environmental indicator, umbrella and flagship species because of its unique ecological attributes and strong public appeal. Assessing and predicting the potential suitable distribution area of this species in Zhejiang Province can help in the widespread conservation of biodiversity. We used the MaxEnt ecological niche model to evaluate the habitat suitability of E. pseudosauteri in Zhejiang Province to understand the potential distribution pattern and environmental characteristics of suitable habitats for this species, and used the AUC (area under the receiver operating characteristic curve) and TSS (true skill statistics) to evaluate the model performance. The results showed that the mean AUC value was 0.985, the standard deviation was 0.011, the TSS average value was 0.81, and the model prediction results were excellent. Among the 11 environmental variables used for modeling, temperature seasonality (Bio_4), altitude (Alt) and distance to rivers (Riv_dis) were the key variables affecting the distribution area of E. pseudosauteri, with contributions of 33.5%, 30% and 15.9%, respectively. Its main suitable distribution area is in southern Zhejiang Province and near rivers, at an altitude of 50–300 m, with a seasonal variation in temperature of 7.7–8 °C. Examples include the Ou River, Nanxi River, Wuxi River, and their tributary watersheds. This study can provide a theoretical basis for determining the scope of E. pseudosauteri habitat protection, population restoration, resource management and industrial development in local areas.
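The TSS reported here is simply sensitivity plus specificity minus one; a minimal helper for binary presence/absence predictions, with toy inputs, is sketched below.

```python
import numpy as np

def true_skill_statistic(y_true, y_pred):
    """TSS = sensitivity + specificity - 1 for binary presence/absence predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) + tn / (tn + fp) - 1

print(true_skill_statistic([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```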
Article
Full-text available
This paper details the approach of the team Kohrrelation in the 2021 Extreme Value Analysis data challenge, dealing with the prediction of wildfire counts and sizes over the contiguous US. Our approach uses ideas from extreme-value theory in a machine learning context with theoretically justified loss functions for gradient boosting. We devise a spatial cross-validation scheme and show that in our setting it provides a better proxy for test set performance than naive cross-validation. The predictions are benchmarked against boosting approaches with different loss functions, and perform competitively in terms of the score criterion, finally placing second in the competition ranking.
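The spatial cross-validation idea described here, folds that never split a spatial block between training and test data, can be sketched with scikit-learn's GroupKFold; the data, block labels, and plain squared-error booster below are assumptions, not the competition's extreme-value losses.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(5)
n = 500
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + rng.normal(size=n)
blocks = rng.integers(0, 10, size=n)     # spatial block (e.g. grid cell) per sample

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=blocks):
    model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(round(float(np.mean(scores)), 3))
```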
Preprint
Full-text available
Machine learning and statistical modeling methods were used to analyze the impact of climate change on the financial wellbeing of fruit farmers in Tunisia and Chile. The analysis was based on face-to-face interviews with 801 farmers. Three research questions were investigated: first, whether climate change impacts had an effect on how well the farm was doing financially; second, if climate change was not influential, what factors were important for predicting the financial wellbeing of the farm; and third, whether observed effects on the financial wellbeing of the farm were a result of interactions between predictor variables. This is the first report directly comparing climate change with other factors potentially impacting the financial wellbeing of farms. Certain climate change factors, namely increases in temperature and reductions in precipitation, can regionally impact the self-perceived financial wellbeing of fruit farmers. Specifically, increases in temperature and reductions in precipitation can have a measurable negative impact on the financial wellbeing of farms in Chile. This effect is less pronounced in Tunisia. Climate impact differences were observed within Chile but not in Tunisia. However, climate change is only of minor importance for predicting farm financial wellbeing, especially for farms already doing well financially. Factors that are more important, mainly in Tunisia, included trust in information sources and prior farm ownership. Other important factors include farm size, the water management systems used, and the diversity of fruit crops grown. Moreover, some of the important factors identified differed between farms doing and not doing well financially. Interactions between factors may improve or worsen farm financial wellbeing.
Article
Full-text available
Background: The rapid spread of coronavirus disease 2019 (COVID-19) has caused a pandemic worldwide and affected the lives of millions. The potential fatality of the disease has led to global public health concerns. Apart from clinical practice, artificial intelligence (AI) has provided a new model for the early diagnosis and prediction of disease based on machine learning (ML) algorithms. In this study, we aimed to build a prediction model for the prognosis of COVID-19 patients using data mining techniques. Methods: A data set was obtained from the intelligent management system repository of 19 hospitals at Shahid Beheshti University of Medical Sciences in Iran. All patients included had positive polymerase chain reaction (PCR) test results and were hospitalized between February 19 and May 12, 2020. The extracted data set has 8621 data instances. The data include demographic information and the results of 16 laboratory tests. In the first stage, preprocessing was performed on the data; then, four of the 15 laboratory tests were selected. The models were created based on seven data mining algorithms, and finally, their performances were compared with each other. Results: Based on our results, the Random Forest (RF) and Gradient Boosted Trees models were the most efficient methods, with the highest accuracies of 86.45% and 84.80%, respectively. In contrast, the Decision Tree exhibited the least accuracy (75.43%) among the seven models. Conclusion: Data mining methods have the potential to be used for predicting outcomes of COVID-19 patients using lab tests and demographic features. After validation, these methods could be implemented in clinical decision support systems for better management and care of severe COVID-19 patients.
Article
At the height of the COVID‐19 pandemic in the United Kingdom, the Governor of the Bank of England, while granting an interview, described the pandemic as an unprecedented economic emergency and said that the Bank could go as far as radical money‐printing operations. In reaction, the UK financial market, particularly the FTSE 100 and pound sterling, witnessed record‐breaking losses. Considering this evidence, we hypothesized that the emotions and moods of investors towards the financial market might have been impacted by the information they obtained from frequent government policy announcements. Furthermore, we proposed that the United Kingdom's final exit from the European Union (Brexit), which coincided with the pandemic, could have worsened the outlook of the UK financial market, as investors began to diversify their portfolios. Consequently, we examined the impact of government's policy announcements on investors’ reactions to the concurrence of the COVID‐19 pandemic and Brexit. Our findings reveal that the psychology of investors during the pandemic was significantly shaped by frequent policy announcements, which in turn affected overall market behaviour.
Preprint
Full-text available
Lockdowns were widely used to reduce transmission of COVID-19 and prevent health care services from being overwhelmed. While these mitigation measures helped to reduce loss of life, they also disrupted the everyday lives of billions of people. We use data from a survey of Singaporean citizens and permanent residents during the peak of the lockdown period between April and July 2020 to evaluate the social and economic impacts of Singapore's COVID-19 mitigation measures. Over 60% of the population experienced negative impacts on their social lives and 40% on household economics. Regression models show that the negative economic impacts were influenced by socio-economic and demographic factors that align with underlying societal vulnerabilities. When dealing with large-scale crises such as COVID-19, slow-onset disasters, and climate change, some of the burdens of mitigation measures can constitute a crisis in their own right, and this burden could be experienced unevenly by vulnerable segments of the population.
Conference Paper
Full-text available
The world of higher education has changed considerably over recent decades owing to the omnipresence of computing, compounded by the health and preventive restrictions that followed COVID-19. Universities have moved their courses online, yet in technical education practical lab sessions are essential and unavoidable, and remote laboratories are more beneficial than virtual laboratories, since the latter are only computer models that approximate reality through simulations. In this context, the EST of Agadir has developed a low-cost platform called LABERSIME, installed in the cloud (LMS and IDE) and equipped with an embedded system based on ESP32-Micropython, to drive real laboratory equipment and carry out experiments as qualitatively effective as those conducted face-to-face.
Article
The definition of second order interaction in a (2 × 2 × 2) table given by Bartlett is accepted, but it is shown by an example that the vanishing of this second order interaction does not necessarily justify the mechanical procedure of forming the three component 2 × 2 tables and testing each of these for significance by standard methods.*
Article
Interactions in three‐way and many‐way contingency tables are defined as certain linear combinations of the logarithms of the expected frequencies. Maximum‐likelihood estimation is discussed for many‐way tables and the solutions given for three‐way tables in the cases of greatest interest.
Article
If x1, x2,..., xk represent the levels of k experimental factors and y is the mean response, then the inverse polynomial response function is defined by x1x 2 ⋯ xk/y = Polynomial in (x1, x2 ⋯, xk). Arguments are given for preferring these surfaces to ordinary polynomials in the description of certain kinds of biological data. The fitting of inverse polynomials under certain assumptions is described, and shown to involve no more labour than that of fitting ordinary polynomials. Complications caused by the necessity of fitting unknown origins to the xi are described and the estimation process illustrated by an example. The goodness of fit of ordinary and inverse polynomials to four sets of data is compared and the inverse kind shown to have some advantages. The general question of the value of fitted surfaces to experimental data is discussed.
Article
Three methods of fitting log-linear models to multivariate contingency-table data with one dichotomous variable are discussed. Logit analysis is commonly used when a full contingency table of s dimensions is regarded as a table of rates of dimension s - 1. The split-table method treats the same data as two separate tables each of dimension s - 1. We show that the full contingency-table method can be regarded as a generalized approach: models which can be fitted by it include both the mutually exclusive subsets that can be fitted by the other two methods. Even when the logit method permits the model of choice to be fitted, the full contingency-table method of iterative proportional fitting to the set of sufficient configurations has the advantage of requiring neither matrix inversion nor substitution of an arbitrary value in empty elementary cells.
Article
Miscellaneous comments are made on regression analysis under four broad headings: regression of a dependent variable on a single regressor variable; regression on many regressor variables; analysis of bivariate and multivariate populations; models with components of variation.
Article
Following the comments by Moore and Zeigler on the analogy between the analysis of quantal responses and non-linear regression, the analogy between the former and linear weighted regression is developed when the Newton-Raphson method is used as the iterative process. A condition satisfied by the first derivative of the likelihood is shown to apply to a class of models where one transformation leads to a linear model and another to normal errors. The particular case of inverse polynomials is discussed, as part of the family of power transformations. The extra programming facilities required to incorporate the estimation of parameters for these iterative situations are described.
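The iterative weighted linear regression alluded to here, Newton-Raphson written as repeated weighted least squares on a working variate, can be sketched for a binomial GLM with logit link as follows; the simulated data are assumptions.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit a binomial GLM (logit link) by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                                # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))               # mean via inverse link
        w = np.clip(mu * (1.0 - mu), 1e-10, None)     # iterative weights
        z = eta + (y - mu) / w                        # working dependent variate
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * X[:, 1]))))
print(irls_logistic(X, y))
```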
Article
Certain structural properties of the exponential-type families are studied under three headings: (1) the mean value function and the exponential-type families, (2) a characterization of the Gamma family, and (3) the equality of the first two Bhattacharyya bounds and the exponential-type families.
Article
A cross section of basic yet rapidly developing topics in multivariate data analysis is surveyed, emphasizing concepts required in facing problems of practical data analysis while de-emphasizing technical and mathematical detail. Aspects of data structure, logical structure, epistemic structure, and hypothesis structure are examined. Exponential families as models, problems of interpretation, parameters, causality, computation, and data cleaning and missing values are discussed.
Article
Includes bibliography and index.
Article
In its simplest formulation the problem considered is to estimate the cell probabilities $p_{ij}$ of an r × c contingency table for which the marginal probabilities $p_{i\cdot}$ and $p_{\cdot j}$ are known and fixed, so as to minimize $\sum_i \sum_j p_{ij}\ln(p_{ij}/\pi_{ij})$, where the $\pi_{ij}$ are the corresponding entries in a given contingency table. An iterative procedure is given for determining the estimates, and it is shown that the estimates are BAN and that the iterative procedure is convergent. A summary of results for a four-way contingency table is given. An illustrative example is given.
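The iterative procedure described in this abstract amounts to iterative proportional scaling of the reference table to the fixed margins; a compact sketch, with an assumed 2 × 2 reference table, is given below.

```python
import numpy as np

def fit_to_margins(pi, row_margins, col_margins, n_iter=200):
    """Rescale a reference table pi until its margins match the fixed ones;
    the limit minimizes sum_ij p_ij * ln(p_ij / pi_ij)."""
    p = pi.astype(float).copy()
    for _ in range(n_iter):
        p *= (row_margins / p.sum(axis=1))[:, None]   # match row totals
        p *= col_margins / p.sum(axis=0)              # match column totals
    return p

pi = np.array([[0.30, 0.20],
               [0.10, 0.40]])                         # given reference table
p = fit_to_margins(pi, np.array([0.6, 0.4]), np.array([0.5, 0.5]))
print(p.round(4))
```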
The Advanced Theory of Statistics
  • M. G. Kendall
  • A. Stuart
Analysis of log-likelihood ratios. “ANO∧”. (A contribution to the discussion of a paper on least squares by F
  • I. J. Good
On the analysis of multidimensional contingency tables
  • H. H. Ku
  • R. N. Varner
  • S. Kullback