## No full-text available

To read the full text of this research, you can request a copy directly from the authors.

... Hence, the independence assumption between claim size and inter-claim arrival time in a classical risk model, as heavily relied upon in Delbaen and Haezendonck (1987), Waters (1983) and Yang and Zhang (2001), may no longer be appropriate for insurance risk portfolio modeling. Recent studies have employed the dependence assumption through dependent frequency-severity modeling in automobile insurance (Shi et al. 2015), while Kularatne et al. (2020) examined suitable bivariate models through Archimedean copulas to capture the dependence structure in general and life insurance modeling. ...

... Two general approaches proposed in modeling the dependency between claims severity and frequency are the conditional probability decomposition approach and the copula approach (Garrido et al. 2016). The first approach decomposes the joint probability distribution between the claims severity and frequency into a product of conditional probabilities and then predicts average claims severity using a regression model with the claims frequency as a covariate (Shi et al. 2015). On the other hand, the copula approach links the joint distribution of the claims severity and claims frequency through a copula. ...

... However, it is only adequate if the data satisfy the restrictive assumption of equidispersion; that is, the variance of the data is equal to its mean. Shi et al. (2015) illustrated that the Poisson process is insufficient to capture the overdispersion present in automobile insurance data and proposed the Negative Binomial (NB) model instead. However, the NB model cannot accommodate underdispersed data, and a closed-form hazard function cannot be obtained. ...

We model the recursive moments of aggregate discounted claims, assuming the inter-claim arrival time follows a Weibull distribution to accommodate overdispersed and underdispersed data sets. We use a copula to represent the dependence structure between the inter-claim arrival time and its subsequent claim amount. We then use Laplace inversion via the Gaver-Stehfest algorithm to solve numerically for the first and second moments, which take the form of Volterra integral equations (VIEs). We compute the mean and variance of the aggregate discounted claims under the Farlie-Gumbel-Morgenstern (FGM) copula and conduct a sensitivity analysis over various Weibull inter-claim and claim-size parameters. The comparison between the equidispersed, overdispersed and underdispersed counting processes shows that when claims arrive at times that vary more than expected, insured lives can expect to pay higher premiums, and vice versa when claim arrival times vary less than expected. Upon comparing the Weibull risk process with an equivalent Poisson process, we also found that copulas with a wider range of dependence parameters, such as the Frank and Heavy Right Tail (HRT) copulas, have a greater impact on the value of the moments than the FGM copula with its weak dependence structure.
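The Gaver-Stehfest inversion step used in this abstract can be sketched with a stdlib-only implementation. This is a generic sketch of the algorithm, not the paper's code: the Volterra-equation setup and FGM-copula transform are omitted, and the test transform F(s) = 1/(s+1) is purely illustrative.

```python
from math import exp, factorial, log

def gaver_stehfest(F, t, N=12):
    """Numerically invert a Laplace transform F(s) at time t.

    Uses the classical Stehfest weights; N must be even (10-14 is
    typical in double precision for smooth target functions).
    """
    M = N // 2
    total = 0.0
    for k in range(1, N + 1):
        # Stehfest weight V_k
        v = 0.0
        for j in range((k + 1) // 2, min(k, M) + 1):
            v += (j ** M * factorial(2 * j)
                  / (factorial(M - j) * factorial(j) * factorial(j - 1)
                     * factorial(k - j) * factorial(2 * j - k)))
        total += (-1) ** (M + k) * v * F(k * log(2) / t)
    return total * log(2) / t

# Sanity check: F(s) = 1/(s+1) is the transform of f(t) = exp(-t)
approx = gaver_stehfest(lambda s: 1.0 / (s + 1.0), t=1.0)
```

The algorithm evaluates the transform only on the positive real axis, which is convenient when the transform of the moment function is itself obtained numerically.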

... Frees, Gao, and Rosenberg [16] also point out that the claim frequency has a significant effect on the claim severity for outpatient expenditures. Gschlößl and Czado [15], Frees, Gao, and Rosenberg [16], Erhardt and Czado [17], Shi, Feng, and Ivantsova [18] and Garrido, Genest, and Schulz [19] capture the dependence between the claim frequency and severity by treating the claim frequency as a predictor variable in the regression model for the average claim severity. Shi, Feng, and Ivantsova [18] show that the predictor method applied to the GLM frequency-severity model can only measure a linear relation between the claim frequency and severity. ...

... Gschlößl and Czado [15], Frees, Gao, and Rosenberg [16], Erhardt and Czado [17], Shi, Feng, and Ivantsova [18] and Garrido, Genest, and Schulz [19] capture the dependence between the claim frequency and severity by treating the claim frequency as a predictor variable in the regression model for the average claim severity. Shi, Feng, and Ivantsova [18] show that the predictor method applied to the GLM frequency-severity model can only measure a linear relation between the claim frequency and severity. Czado, Kastenmeier, Brechmann, and Min [20], Krämer, Brechmann, Silvestrini, and Czado [21] and Shi, Feng, and Ivantsova [18] model the joint distribution of the claim frequency and average claim severity through copulas. ...

... Shi, Feng, and Ivantsova [18] show that the predictor method applied to the GLM frequency-severity model can only measure a linear relation between the claim frequency and severity. Czado, Kastenmeier, Brechmann, and Min [20], Krämer, Brechmann, Silvestrini, and Czado [21] and Shi, Feng, and Ivantsova [18] model the joint distribution of the claim frequency and average claim severity through copulas. However, popular copulas, such as elliptical and Archimedean copulas, can only capture the symmetric or limited dependence structures. ...

The standard GLM and GAM frequency-severity models assume independence between the claim frequency and severity. To overcome the restrictions of linear or additive forms and to relax the independence assumption, we develop a data-driven dependent frequency-severity model. We combine a stochastic gradient boosting algorithm and a profile likelihood approach to estimate the parameters of both the claim frequency and average claim severity distributions, and we introduce dependence between the claim frequency and severity by treating the claim frequency as a predictor in the regression model for the average claim severity. The model can flexibly capture the nonlinear relation between the claim frequency (severity) and predictors, complex interactions among predictors, and the nonlinear dependence between the claim frequency and severity. A simulation study shows the excellent prediction performance of our model. We then demonstrate the application of our model with French auto insurance claims data. The results show that our model is superior to other state-of-the-art models.
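The conditional (two-part) decomposition described here, with frequency drawn first and severity depending on the realized count, can be illustrated with a stdlib-only Monte Carlo sketch of the data-generating mechanism. The log-linear link and all parameter values are illustrative assumptions, not the authors' fitted boosting model:

```python
import math
import random

random.seed(42)

def poisson(lam):
    """Poisson draw by sequential inversion (stdlib only)."""
    n, p, u = 0, math.exp(-lam), random.random()
    c = p
    while u > c:
        n += 1
        p *= lam / n
        c += p
    return n

def simulate_policy(lam=1.0, base_sev=100.0, dep=0.1):
    """Two-part decomposition: draw the claim count N first, then draw
    claim sizes whose mean is log-linear in the observed N (dep > 0
    induces positive frequency-severity dependence)."""
    n = poisson(lam)
    mean_sev = base_sev * math.exp(dep * n)
    sizes = [random.expovariate(1.0 / mean_sev) for _ in range(n)]
    return n, sizes

# Empirical check of the induced frequency-severity dependence
by_count = {}
for _ in range(100_000):
    n, sizes = simulate_policy()
    for s in sizes:
        by_count.setdefault(n, []).append(s)
avg_low = sum(by_count[1]) / len(by_count[1])
avg_high = sum(by_count[3]) / len(by_count[3])
```

Averaging claim sizes by count shows larger claims in high-frequency years, which is exactly the dependence the frequency-as-predictor regression is designed to pick up.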

... Thus, the dependence between claim frequency and severity needs to be modeled. Erhardt and Czado (2012), Frees, Gao, and Rosenberg (2011) and Gschlößl and Czado (2007) capture the dependence by treating claim frequency as a predictor variable in the regression model of average claim severity, while Czado, Kastenmeier, Brechmann, and Min (2012), Krämer, Brechmann, Silvestrini, and Czado (2013) and Shi, Feng, and Ivantsova (2015) employ parametric copulas to model the joint distribution of claim frequency and average claim severity. Thus, we develop a stochastic gradient boosting frequency-severity model to overcome the restricted forms of the GLM and GAM models, where we treat claim frequency as a predictor variable in the gradient boosting regression model of the average claim severity to flexibly capture the nonlinear dependence between claim frequency and severity. ...

... Frees, Gao, and Rosenberg (2011) also point out that the claim frequency has a significant effect on the claim severity for outpatient expenditures. Gschlößl and Czado (2007), Frees, Gao, and Rosenberg (2011), Erhardt and Czado (2012), Shi, Feng, and Ivantsova (2015) and Garrido, Genest, and Schulz (2016) capture the dependence between the claim frequency and severity by treating the claim frequency as a predictor variable in the regression model for the average claim severity. Shi, Feng, and Ivantsova (2015) show that the predictor method applied to the GLM frequency-severity model can only measure a linear relation between the claim frequency and severity. ...

... Gschlößl and Czado (2007), Frees, Gao, and Rosenberg (2011), Erhardt and Czado (2012), Shi, Feng, and Ivantsova (2015) and Garrido, Genest, and Schulz (2016) capture the dependence between the claim frequency and severity by treating the claim frequency as a predictor variable in the regression model for the average claim severity. Shi, Feng, and Ivantsova (2015) show that the predictor method applied to the GLM frequency-severity model can only measure a linear relation between the claim frequency and severity. Czado, Kastenmeier, Brechmann, and Min (2012), Krämer, Brechmann, Silvestrini, and Czado (2013) and Shi, Feng, and Ivantsova (2015) model the joint distribution of the claim frequency and average claim severity through copulas. ...

This thesis makes use of theoretical tools from finance, decision theory, and machine learning to improve the design, pricing and hedging of insurance contracts. Chapter 3 develops closed-form pricing formulas for participating life insurance contracts, based on matrix Wiener-Hopf factorization, where multiple risk sources, such as credit, market, and economic risks, are considered. The pricing method proves to be accurate and efficient. Dynamic and semi-static hedging strategies are introduced to assist insurance companies in reducing the risk exposure arising from issuing participating contracts. Chapter 4 discusses the optimal contract design when the insured is third-degree risk averse. The results show that dual limited stop-loss, change-loss, dual change-loss, and stop-loss can be optimal contracts favored by both risk averters and risk lovers in different settings. Chapter 5 develops a stochastic gradient boosting frequency-severity model, which improves on the important and popular GLM and GAM frequency-severity models. This model fully inherits the advantages of the gradient boosting algorithm, overcoming the restrictive linear or additive forms of the GLM and GAM frequency-severity models by learning the model structure from data. Further, our model can also capture flexible nonlinear dependence between claim frequency and severity.

... While dependence between claim count and cost has been extensively studied [see e.g. Gschlößl and Czado (2007); Frees et al. (2011); Czado et al. (2012); Shi et al. (2015); Garrido et al. (2016)], most of these papers are focused on one-period models. Only a few works [see Frangos and Vrontos (2001); Gómez-Déniz (2016)] consider experience ratemaking using both counts and costs but their frameworks are limited to the static random effect. ...

... First, we make the simplifying assumption that both the claim frequency and cost per claim are governed by the same random effect, X_t. While this assumption reconciles many existing claim frequency-cost models (see footnote 14), we note that a more general, non-degenerate joint distribution for multiple random ... [Footnote 14: For instance, Shi et al. (2015) use a copula to characterize the random effects governing the claim frequency and severity, respectively. They have considered, among others, Archimedean copulas such as Gumbel, Clayton and Frank copulas, which have an interpretation of a (single) factor representation; see e.g. ...]

... Oakes (1989)]. Therefore, the model discussed in this subsection can be viewed as a dynamic extension of Shi et al. (2015), in which the single factor is time-varying. Moreover, among models considering different types of claims, Shi and Valdez (2014) propose to model the dependence between different frequency variables through Archimedean copulas [see also Wen et al. (2009); Zhang et al. (2018) for an analysis of the credibility premium in such a model]. ...

We contribute to the non-life experience ratemaking literature by introducing a computationally efficient approximation algorithm for the Bayesian premium in models with dynamic random effects, where the risk of a policyholder is governed by an individual process of unobserved heterogeneity. Although intuitive and flexible, the biggest challenge of dynamic random effect models is that the resulting Bayesian premium typically lacks tractability. In this article, we propose to approximate the dynamics of the random effects process by a discrete (hidden) Markov chain and to replace the intractable Bayesian premium of the original model with that of the approximate Markov chain model, for which concise, closed-form formulas are derived. The methodology is general because it does not rely on any parametric distributional assumptions and, in particular, allows for the inclusion of both the cost and the frequency components in pricing. Numerical examples show that the proposed approximation method is highly accurate. Finally, a real data pricing example is used to illustrate the versatility of the approach.
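The approximation idea can be sketched in a few lines: discretize the random-effect process into a hidden Markov chain, forward-filter it against the claim history, and take the predictive mean risk level as the premium multiplier. The three risk levels, transition matrix, and Poisson emission model below are illustrative assumptions, not the paper's calibration:

```python
from math import exp, factorial

def bayesian_premium(claim_counts, states, trans, init, base_lam=0.1):
    """Forward-filter a discrete (hidden Markov) risk-level chain against
    observed annual claim counts, then return the predictive mean risk
    multiplier for the next period."""
    belief = list(init)
    for n in claim_counts:
        # Poisson(base_lam * state) likelihood of observing n claims
        like = [exp(-base_lam * s) * (base_lam * s) ** n / factorial(n)
                for s in states]
        belief = [b * l for b, l in zip(belief, like)]
        z = sum(belief)
        belief = [b / z for b in belief]
        # One-step prediction through the transition matrix
        belief = [sum(belief[i] * trans[i][j] for i in range(len(states)))
                  for j in range(len(states))]
    return sum(b * s for b, s in zip(belief, states))

# Illustrative three-state chain (multipliers and matrix are assumptions)
states = [0.5, 1.0, 2.0]
trans = [[0.8, 0.15, 0.05], [0.1, 0.8, 0.1], [0.05, 0.15, 0.8]]
init = [1 / 3, 1 / 3, 1 / 3]
m_bad = bayesian_premium([2, 1], states, trans, init)
m_good = bayesian_premium([0, 0], states, trans, init)
```

Because the filtering recursion is a handful of matrix-vector products per period, the approximate Bayesian premium stays tractable regardless of the dynamics being approximated.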

... Roughly speaking, it means that the dependence between frequency and severity components should be either negative (η < 0), or not excessively positive. This assumption is typically satisfied in auto insurance, since empirical studies often find negative, or weakly positive frequency-severity dependence (Shi et al. 2015, Park et al. 2018, Lu 2019). ...

... More precisely, on the one hand, the frequency and severity variables are both observed and have their own random effects and regression equations; on the other hand, when there is no claim, we allow for a potentially different updating rule for the severity component, by explicitly acknowledging that in this case the claim amounts are not observed and hence the associated random effect need not necessarily be updated. The terminology of the three-part model was first introduced by Shi et al. (2015), but their model is designed for cross-sectional data only. Our three-part model, on the other hand, is the first dynamic one we are aware of. ...

The collective risk model (CRM) for frequency and severity is an important tool for retail insurance ratemaking, natural disaster forecasting, as well as operational risk in banking regulation. This model, initially designed for cross-sectional data, has recently been adapted to a longitudinal context for both a priori and a posteriori ratemaking, through random effects specifications. However, the random effects are usually assumed to be static due to computational concerns, leading to predictive premiums that omit the seniority of the claims. In this paper, we propose a new CRM with bivariate dynamic random effects processes. The model is based on Bayesian state-space models. It admits a simple predictive mean and a closed-form expression for the likelihood function, while also allowing for dependence between the frequency and severity components. A real data application to auto insurance shows the performance of our method.

... Shi and Zhao [13] extended the mixed copula approach to collective risk models and proposed a copula-linked compound distribution. Shi et al. [12] proposed a three-part model, which splits the frequency model into a binary classification model and a zero-truncated claim counts model. In the mixed copula approach, the major concern is that such a mixed copula is not unique according to Sklar's Theorem (Sklar, [14]). ...

... This particular model restricts the dependence to log-linearity. Shi et al. [12] split the frequency model into a logistic model and a zero-truncated negative binomial model. A weakness of this approach is that no analytical results are available. ...

The mixed copula approach has been used to jointly model the frequency and severity of insurance claims. A major concern about this approach is the non-uniqueness of copulas, which challenges both model inference and prediction. We propose to circumvent this issue by using the latent variable of the waiting time for the second claim within a copula framework. The copula links the latent waiting time and the average claim size. The frequency-severity dependence can then be derived from the relationship between the waiting times and the counts of a Poisson process. Assuming a Gaussian copula and a log-normally distributed average claim size, we can investigate the effect of claim counts on the conditional claim severity analytically, which would be difficult in the mixed copula approach. We propose a Monte Carlo algorithm to simulate from the predictive distribution of the aggregate claims amount. In an empirical example, we illustrate the proposed method and compare it with other competing methods, showing that it provides quite competitive results.
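The Gaussian-copula link between the latent waiting time and the average claim size can be sketched as follows, with exponential and log-normal marginals as in the abstract. All parameter values are illustrative, and the simulation only demonstrates the induced dependence, not the paper's full predictive algorithm:

```python
import math
import random

random.seed(1)

def sample_pair(rho, lam=1.0, mu=4.0, sigma=0.5):
    """Draw (waiting time, average claim size) from a Gaussian copula
    with exponential and log-normal marginals (parameters illustrative)."""
    z1 = random.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0)
    u = 0.5 * (1.0 + math.erf(z1 / math.sqrt(2.0)))  # U = Phi(Z1)
    wait = -math.log(1.0 - u) / lam        # exponential waiting time
    avg_size = math.exp(mu + sigma * z2)   # log-normal average severity
    return wait, avg_size

# Negative copula correlation: short waits (many claims) pair with
# larger average claims, i.e. positive frequency-severity dependence.
pairs = [sample_pair(-0.5) for _ in range(20_000)]
n = len(pairs)
mw = sum(w for w, _ in pairs) / n
ms = sum(s for _, s in pairs) / n
cov = sum((w - mw) * (s - ms) for w, s in pairs) / n
sw = (sum((w - mw) ** 2 for w, _ in pairs) / n) ** 0.5
ss = (sum((s - ms) ** 2 for _, s in pairs) / n) ** 0.5
corr = cov / (sw * ss)
```

Since waiting times and Poisson counts are in one-to-one correspondence, a single well-defined copula on the continuous pair sidesteps the non-uniqueness issue of mixed discrete-continuous copulas.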

... Little research has been carried out on the implementation of de-tariffication in the motor insurance industry. Several studies have been carried out to identify new rating factors for premium setting [1][2][3][4][5][6][7][8][9][10][11][12], but few have studied the impact of de-tariffication on motor insurance companies. The implementation of de-tariffication may differ from a tariff system (i.e., the Malaysian model), especially when facing the Goods and Services Tax (GST). ...

... Multiple Linear Regression (MLR) is a statistical method that quantifies the relationship between a dependent variable and a set of independent variables [36][17][10][11][12]. This method takes into account the correlations between predictor variables and assesses the effect of each predictor variable when other variables are removed [37]. ...

De-tariffication has become a hot topic for Malaysian motor insurers after it came into effect on 1 July 2017. Generally, insurance companies need to set rating factors for their premiums before calculating the price of a selected premium. Typically, these rating factors are based on the risk profile of the policyholder; that is, the price of the premium is determined by the risk factors in the policyholder's profile. The aim of this research is to investigate the impact of de-tariffication on the motor insurance industry. This research also investigates the effect of de-tariffication on the Goods and Services Tax (GST) in the premium calculation. Multiple Linear Regression (MLR) was used to determine the most significant rating factors that influence the premium received by the motor insurance industry. Once these k rating factors and parameters are identified, premiums can be calculated by taking these rating factors and parameters into account in the de-tariff formula and comparing with the existing model. The results indicate that the de-tariff model yields lower premiums than the Malaysian tariff model. Furthermore, GST is also found to have a significant impact on motor insurance premiums, with policyholders required to pay higher premiums than under non-GST pricing. The results will help insurance companies to find new formulas that consider new rating factors and improve the accuracy of premium calculations.

... In other words, the dependence between frequency and severity components should either be negative (η < 0) or not excessively positive. This assumption is typically satisfied in auto insurance, since empirical studies often find negative, or weakly positive, frequency-severity dependence (Shi et al., 2015; Park et al., 2018; Lu, 2019). ...

... More precisely, on the one hand, the frequency and severity variables are both observed and have their own random effects and regression equations; on the other hand, when there is no claim, we allow for a potentially different updating rule for the severity component, by explicitly acknowledging that in this case the claim amounts are not observed and hence the associated random effects need not necessarily be updated. The terminology of the three-part model was first introduced by Shi et al. (2015), but their model is designed for cross-sectional data only. Here, our three-part model is dynamic and, just as Model 3, it also allows for closed-form predictive formulas. ...

The collective risk model (CRM) for frequency and severity is an important tool for retail insurance ratemaking, macro-level catastrophic risk forecasting, as well as operational risk in banking regulation. This model, which was initially designed for cross-sectional data, has recently been adapted to a longitudinal context to conduct both a priori and a posteriori ratemaking, through the introduction of random effects. However, so far, the random effects have usually been assumed static due to computational concerns, leading to predictive premiums that omit the seniority of the claims. In this paper, we propose a new CRM with a bivariate dynamic random effects process. The model is based on Bayesian state-space models. It admits a simple predictive mean and a closed-form expression for the likelihood function, while also allowing for dependence between the frequency and severity components. A real data application to auto insurance shows the performance of our method.

... On the other hand, the traditional GLM has been extended by combining marginal distributions for frequency and severity via a bivariate copula, for example, see Czado et al. [2012] and Kramer et al. [2013]. Two ideas are extensively reviewed and compared to traditional independent models in Shi et al. [2015]. ...

... Indeed, various papers have revealed that the average claim size is positively (or negatively) related to the number of claims in non-life insurance. For example, see Shi et al. [2015] and Garrido et al. [2016]. NeurFS incorporates the dependence structure by describing the average severity in terms of covariates as well as the frequency. ...

This paper proposes a flexible and analytically tractable class of frequency-severity models based on neural networks to parsimoniously capture important empirical observations. In the proposed two-part model, the mean functions of the frequency and severity distributions are characterized by neural networks to incorporate the non-linearity of input variables. Furthermore, the mean function of the severity distribution is assumed to be an affine function of the frequency variable to account for a potential linkage between frequency and severity. We provide explicit closed-form formulas for the mean and variance of the aggregate loss within our modelling framework. Components of the proposed model, including the neural network parameters and distribution parameters, can be estimated by minimizing the associated negative log-likelihood functionals with neural network architectures. Furthermore, we leverage the Shapley value and recent developments in machine learning to interpret the outputs of the model. Applications to a synthetic dataset and insurance claims data illustrate that our method outperforms existing methods in terms of interpretability and predictive accuracy.
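The affine severity-mean linkage E[Y|N] = a + bN makes the aggregate mean available in closed form, since E[S] = E[N·E[Y|N]] = a·E[N] + b·E[N²]. A sketch under an assumed Poisson frequency (where E[N²] = λ + λ²); the parameter values are illustrative:

```python
def aggregate_mean_poisson(lam, a, b):
    """E[S] when N ~ Poisson(lam) and E[severity | N] = a + b*N.

    E[S] = E[N * (a + b*N)] = a*E[N] + b*E[N^2]
         = a*lam + b*(lam + lam**2).
    """
    return a * lam + b * (lam + lam * lam)

# Illustrative numbers: lam = 2 claims/year, base severity 100, slope 10
premium = aggregate_mean_poisson(2.0, 100.0, 10.0)
```

Setting b = 0 recovers the independent two-part premium a·λ, so the b·E[N²] term is exactly the dependence correction.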

... This phenomenon invalidates the practice of using frequency-driven BMS and highlights the need to extend the classical collective risk model by allowing some dependence structure between frequency and severity. The existing studies on dependent frequency-severity models and the associated insurance premiums include copula-based models [4,8], two-step frequency-severity models [7,9,18,21], and bivariate random effect-based models [1,11,14,20]. The random effect model is especially popular in insurance ratemaking because of the mathematical tractability in its prediction. ...

... Though it is difficult to explain this in plain terms, we believe that the stationary distribution under the −1/+1/+2 system is more evenly spread than that under the −1/+2/+3 system, which, in turn, marginally stabilizes the BMS and produces smaller relativities; for the relationship between the stationary distribution and the optimal relativity, see Eqs. (19)-(21). In terms of the HMSE, we see that the optimal threshold must be quite large under both the −1/+1/+2 and −1/+2/+3 systems in this particular dataset. ...

In the auto insurance industry, a Bonus-Malus System (BMS) is commonly used as a posteriori risk classification mechanism to set the premium for the next contract period based on a policyholder's claim history. Even though the recent literature reports evidence of a significant dependence between frequency and severity, the current BMS practice is to use a frequency-based transition rule while ignoring severity information. Although Oh et al. [(2020). Bonus-Malus premiums under the dependent frequency-severity modeling. Scandinavian Actuarial Journal 2020(3): 172–195] claimed that the frequency-driven BMS transition rule can accommodate the dependence between frequency and severity, their proposal is only a partial solution, as the transition rule still completely ignores the claim severity and is unable to penalize large claims. In this study, we propose to use the BMS with a transition rule based on both frequency and size of claim, based on the bivariate random effect model, which conveniently allows dependence between frequency and severity. We analytically derive the optimal relativities under the proposed BMS framework and show that the proposed BMS outperforms the existing frequency-driven BMS. Later, numerical experiments are also provided using both hypothetical and actual datasets in order to assess the effect of various dependencies on the BMS risk classification and confirm our theoretical findings.
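The role of the stationary distribution in these relativity results can be sketched for a simplified frequency-driven BMS. The -1/+k transition rules and the claim-free probability below are illustrative assumptions, and the optimal-relativity step (Eqs. (19)-(21) of the excerpt) is omitted:

```python
def bms_matrix(levels, p0, step_up):
    """Simplified frequency-driven transition rules: down one level after
    a claim-free year (probability p0), up `step_up` levels otherwise."""
    P = [[0.0] * levels for _ in range(levels)]
    for i in range(levels):
        P[i][max(i - 1, 0)] += p0
        P[i][min(i + step_up, levels - 1)] += 1.0 - p0
    return P

def stationary(P, iters=1000):
    """Stationary distribution of a row-stochastic matrix by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Six-level ladders with 90% claim-free years, mild vs harsh penalties
pi_mild = stationary(bms_matrix(6, 0.9, 1))
pi_harsh = stationary(bms_matrix(6, 0.9, 2))
```

With most policyholders claim-free, the stationary mass piles up at the bottom of the ladder; comparing `pi_mild` and `pi_harsh` shows how the penalty rule reshapes that spread, which is what feeds into the optimal relativities.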

... We assume that for a given portfolio of risks, the timeline simulation can be obtained by simulating from a frequency distribution and a portfolio specific severity distribution (analogous to actuarial applications (Klugman et al. 2008)). We assume that both frequency and severity are independent (commonly assumed in actuarial applications (Klugman et al. 2008), although recent studies explore the relaxation of this assumption (Peng et al. 2015)). In some instances, the frequency and severity distributions may have exact analytical forms, but this need not be the case. ...

... • Our framework has been presented for the case where we have assumed independence between frequency and severity. Recent work in an insurance context has explored the implications of this dependency (Peng et al. 2015). In a catastrophe modeling context, there are a number of perils where frequency and severity are dependent (e.g., flood). ...

The aim of this paper is to merge order statistics with natural catastrophe reinsurance pricing to develop new theoretical and practical insights relevant to market practice and model development. We present a novel framework to quantify the role that occurrence losses (order statistics) play in pricing of catastrophe excess of loss (catXL) contracts. Our framework enables one to analytically quantify the contribution of a given occurrence loss to the mean and covariance structure, before and after the application of a catXL contract. We demonstrate the utility of our framework with an application to idealized catastrophe models for a multi-peril and a hurricane-only case. For the multi-peril case, we show precisely how contributions to so-called lower layers are dominated by high frequency perils, whereas higher layers are dominated by low-frequency high severity perils. Our framework enables market practitioners and model developers to assess and understand the impact of altered model assumptions on the role of occurrence losses in catXL pricing.
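The contribution of the top occurrence loss (an order statistic) to a per-occurrence catXL layer can be sketched by Monte Carlo. The Poisson/log-normal event model, attachment, and limit below are illustrative assumptions, not the paper's idealized catastrophe models:

```python
import math
import random

random.seed(7)

def poisson(lam):
    """Poisson draw by sequential inversion (stdlib only)."""
    n, p, u = 0, math.exp(-lam), random.random()
    c = p
    while u > c:
        n += 1
        p *= lam / n
        c += p
    return n

def catxl_payout(loss, attach, limit):
    """Per-occurrence excess-of-loss recovery for a single event."""
    return min(max(loss - attach, 0.0), limit)

def occurrence_contribution(n_years=20_000, lam=2.0, attach=150.0, limit=200.0):
    """Share of expected annual layer loss contributed by each year's
    largest occurrence (the top order statistic); parameters illustrative."""
    total, top = 0.0, 0.0
    for _ in range(n_years):
        losses = [random.lognormvariate(4.0, 1.0) for _ in range(poisson(lam))]
        payouts = [catxl_payout(x, attach, limit) for x in losses]
        total += sum(payouts)
        if payouts:
            top += max(payouts)
    return top / total if total else 0.0

share = occurrence_contribution()
```

With a high attachment relative to the loss distribution, the largest occurrence dominates the layer, mirroring the paper's point that higher layers are driven by low-frequency, high-severity events.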

... In particular, for the prediction of the fair premium, modeling the dependence structure in the CRM is important. There are several ways to do this; in this paper we focus on two CRMs presented in the recent actuarial literature: the two-part frequency-severity model (see, e.g., [8,9,29,32]) and the copula-based CRM (see, e.g., [1,27]). These two methods were developed independently in different mathematical settings, which makes their comparison difficult. ...

... We first review two CRMs from the insurance literature. The first model that we revisit is the so-called two-part CRM, where dependence between frequency and severity is induced by using the frequency as an explanatory variable for the severities; see, e.g., [8,9,29,32]. As a result, in this model the distribution of the aggregate severity can be easily determined. ...

Copulas allow flexible and simultaneous modeling of complicated dependence structures together with various marginal distributions. In particular, if the density function can be represented as the product of the marginal density functions and the copula density function, this leads to both an intuitive interpretation of the conditional distribution and convenient estimation procedures. However, this is no longer the case for copula models with mixed discrete and continuous marginal distributions, because the corresponding density function cannot be decomposed so nicely. In this paper, we introduce a copula transformation method that makes it possible to represent the density function of a distribution with mixed discrete and continuous marginals as the product of the marginal probability mass/density functions and the copula density function. With the proposed method, conditional distributions can be described analytically and the computational complexity of the estimation procedure can be reduced, depending on the type of copula used.

... However, we may not preclude the possible dependence between the frequency and severity components. Inspired by this idea, there has been an increasing interest in developing models to capture the possible dependence as in Shi et al. (2015) and Lee et al. (2018). Secondly, it is also important to consider the repeated measurement of claims because it can provide further insights as to the unobserved heterogeneity of the policyholders. ...

... A similar approach has been used by Frees et al. (2011a) in the modeling and prediction of frequency and severity of health care expenditure. Shi et al. (2015) suggested a three-part framework in order to capture the association between frequency and severity components. When generalized linear models (GLMs) are used with the number of claims treated as a covariate in claims severity, Garrido et al. (2016) showed that the pure premium includes a correction term for inducing dependence. ...
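The correction term mentioned in this excerpt can be made explicit. With a log-link severity GLM that includes the claim count as a covariate, the pure premium follows from the derivative of the frequency moment generating function; a sketch of the Garrido et al. (2016) form with simplified notation, where θ is the dependence parameter and θ = 0 recovers the independent case:

```latex
% Two-part model with the claim count N as a log-link covariate in
% the severity GLM: E[\bar Y \mid N] = e^{x^{\top}\beta} e^{\theta N}.
E[S] \;=\; E\!\left[\,N\,E[\bar Y \mid N]\,\right]
      \;=\; e^{x^{\top}\beta}\, E\!\left[N e^{\theta N}\right]
      \;=\; e^{x^{\top}\beta}\, M_N'(\theta),
\qquad M_N(\theta) := E\!\left[e^{\theta N}\right].
% For N ~ Poisson(\lambda) this gives an explicit correction factor:
E[S] \;=\; \lambda\, e^{\theta}\, e^{\lambda\left(e^{\theta}-1\right)}\, e^{x^{\top}\beta},
% versus the independent premium \lambda e^{x^{\top}\beta} at \theta = 0.
```

The multiplicative factor e^{θ} e^{λ(e^{θ}−1)} is the dependence correction to the naive frequency-times-severity premium.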

In the ratemaking for general insurance, calculation of the pure premium has traditionally been based on modeling frequency and severity separately. It has also been standard practice to assume, for simplicity, the independence of loss frequency and loss severity. However, in recent years there has been sporadic interest in the actuarial literature and practice in exploring models that depart from this independence assumption. Moreover, because of the short-term nature of many lines of general insurance, the availability of data enables us to explore the benefits of using random effects for predicting insurance claims observed longitudinally, or over a period of time. This thesis advances work related to the modeling of compound risks via random effects. First, we examine procedures for testing random effects using Bayesian sensitivity analysis via Bregman divergence. It enables insurance companies to judge, based on observed data, whether to use random effects in their ratemaking model. Second, we extend previous work on the credibility premium of a compound sum by incorporating possible dependence into a unified formula. In this work, an informative dependence measure between the frequency and severity components is introduced which can capture both the direction and strength of possible dependence. Third, credibility premiums with GB2 copulas are explored, so that one can obtain a succinct closed form of the credibility premium with GB2 marginals and an explicit approximation of the credibility premium with non-GB2 marginals. Finally, we extend the microlevel collective risk model to the multi-year case using a shared random effect. Such a framework includes many previous dependence models as special cases, and a specific example is provided with elliptical copulas. We develop the theoretical framework associated with each work, calibrate each model with empirical data, and evaluate model performance with out-of-sample validation measures and procedures.

... (2013) adopted parametric copulas to describe the correlation between the frequency and the severity of the claims and constructed marginal GLMs to fit them. Shi, Feng, and Ivantsova (2015) investigated and compared the performance of these two approaches. Garrido, Genest, and Schulz (2016) proceeded similarly as Frees, Gao, and Rosenberg (2011) and fitted GLMs to the marginal frequency and conditional severity, and then derived the formula for the pure premium. ...

... The precisely predicted distribution of claims may provide insurers with reliable and justifiable tools for pricing and risk management. The empirical studies in Shi, Feng, and Ivantsova (2015), Garrido, Genest, and Schulz (2016) and Park, Kim, and Ahn (2018) have shown evidence of correlation between the number and the size of claims in real-world vehicle insurance data. Additionally, intuitively speaking, the amounts of the claims that occur over a fixed time period for the same policyholder are dependent, but this is rarely considered in the regression analysis of claim amounts. ...

In this paper, we propose a new model to relax the impractical independence assumption between the counts and the amounts of insurance claims, which is commonly made in the existing literature for mathematical convenience. When considering the dependence between the claim counts and the claim amounts, we treat the number of claims as an explanatory variable in the model for claim sizes. Besides, generalized linear models (GLMs) are employed to fit the claim counts in a given time period. To describe the claim amounts which are repeatedly measured on a group of subjects over time, we adopt generalized linear mixed models (GLMMs) to incorporate the dependence among the related observations on the same subject. In addition, a Monte Carlo Expectation-Maximization (MCEM) algorithm is proposed by using a Metropolis-Hastings algorithm sampling scheme to obtain the maximum likelihood estimates of the parameters for the linear predictor and variance component. Finally, we conduct a simulation to illustrate the feasibility of our proposed model.
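The Metropolis-Hastings sampling scheme mentioned in the abstract can be illustrated in miniature. The sketch below is a toy example, not the paper's MCEM algorithm: it samples the posterior of a single policyholder's random intercept b under an assumed model (Poisson counts with mean exp(beta0 + b) and a normal prior on b); all parameter values are illustrative assumptions.

```python
import math
import random

def log_posterior(b, counts, beta0=0.0, sigma=1.0):
    """Unnormalised log-posterior of a random intercept b for Poisson
    counts with mean exp(beta0 + b) and prior b ~ N(0, sigma^2)."""
    rate = math.exp(beta0 + b)
    log_lik = sum(y * (beta0 + b) - rate for y in counts)  # Poisson log-likelihood kernel
    log_prior = -0.5 * (b / sigma) ** 2
    return log_lik + log_prior

def metropolis_hastings(counts, n_iter=5000, step=0.3, seed=42):
    """Random-walk Metropolis sampler for the random intercept b."""
    rng = random.Random(seed)
    b = 0.0
    lp = log_posterior(b, counts)
    draws = []
    for _ in range(n_iter):
        proposal = b + rng.gauss(0.0, step)
        lp_prop = log_posterior(proposal, counts)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject step
            b, lp = proposal, lp_prop
        draws.append(b)
    return draws

counts = [2, 3, 1, 4, 2]  # toy claim counts for one policyholder
draws = metropolis_hastings(counts)
posterior_mean = sum(draws[1000:]) / len(draws[1000:])  # discard burn-in
```

In a full MCEM run, draws like these would feed the Monte Carlo approximation of the E-step, after which the M-step updates the fixed effects and the variance component.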

... In reality, these components are largely dependent, especially in motor insurance (Su & Bai, 2020). However, some procedures can eliminate the problem of correlation between the two components, and these were dealt with by Shi et al. (2015). The above facts motivated us to consider separate modeling, while the article only focuses on the claim severity in MTPL insurance. ...

Research background: Using the marginal means and contrast analysis of the target variable, e.g., claim severity (CS), the actuary can perform an in-depth analysis of the portfolio and fully use the general linear model's potential. These analyses are mainly used in the natural sciences, medicine, and psychology, but so far they have not been given adequate attention in the actuarial field. Purpose of the article: The article's primary purpose is to point out the possibilities of contrast analysis for the segmentation of policyholders and the estimation of CS in motor third-party liability insurance. The article focuses on using contrast analysis to redefine individual relevant factors to ensure the segmentation of policyholders in terms of actuarial fairness and statistical correctness. The aim of the article is also to reveal the possibilities of using contrast analysis for adequate segmentation in the case of interacting factors and the subsequent estimation of CS. Methods: The article uses the general linear model and associated least squares means. Contrast analysis is implemented through testing and estimating linear combinations of model parameters. Equations of estimable functions reveal how to interpret the results correctly. Findings & value added: The article shows that contrast analysis is a valuable tool for segmenting policyholders in motor insurance. The segmentation's validity is statistically verifiable and is well applicable to the main effects. If the significance of cross effects is proved during segmentation, the actuary must take into account the risk that, even if the partial segmentation factors are set adequately and statistically proven, this may not apply to the interaction of these factors. The article also provides a procedure for segmentation in the case of interacting factors and a procedure for estimating the segment's CS.
Empirical research has shown that CS is significantly influenced by weight, engine power, age and brand of the car, policyholder's age, and district. The pattern of age's influence on CS differs in different categories of car brands. The significantly highest CS was revealed in the youngest age category and the category of luxury car brands.

... Czado et al. (2012) link marginal frequency to severity using a copula. Shi et al. (2015) modeled the regression of average severities by using the claim frequency as a covariate and compared this against a mixed copula approach to constructing the joint distribution of claim frequency and severity. ...

In most cases, loss in non-life insurance is calculated from claim severity and claim frequency under an assumption of independence. However, in some cases, claim severity depends upon claim frequency. This paper presents the derivation of the aggregate loss calculation by modeling claim severity and frequency with the independence assumption removed. The authors model average claim severity using claim frequency as a covariate to induce dependence between them, using the generalized linear model. After estimating the parameters, the calculated loss is obtained.
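The dependence mechanism described here, the claim frequency entering the severity GLM as a covariate, can be checked numerically under an assumed Poisson frequency and a log-link conditional severity mean; the parameter values below are illustrative assumptions, not taken from the paper.

```python
import math

def poisson_pmf(n, lam):
    """Poisson probability mass function."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def pure_premium_dependent(lam, beta0, beta1, n_max=60):
    """E[S] = E[N * mu(N)] with conditional mean severity
    mu(n) = exp(beta0 + beta1 * n): the claim count enters the
    log-link severity regression as a covariate."""
    return sum(n * poisson_pmf(n, lam) * math.exp(beta0 + beta1 * n)
               for n in range(n_max + 1))

def pure_premium_closed_form(lam, beta0, beta1):
    """For Poisson N, E[N t^N] = lam * t * exp(lam * (t - 1)), t = e^beta1."""
    t = math.exp(beta1)
    return math.exp(beta0) * lam * t * math.exp(lam * (t - 1.0))

lam, b0, b1 = 1.5, 5.0, 0.2  # illustrative parameter values
dep = pure_premium_dependent(lam, b0, b1)
# premium ignoring the dependence: E[N] * E[mu(N)]
indep = lam * sum(poisson_pmf(n, lam) * math.exp(b0 + b1 * n) for n in range(61))
```

With a positive coefficient on the claim count, the dependent pure premium exceeds the naive product E[N]·E[severity], which is the correction-term effect noted by Garrido et al. (2016) in the surrounding excerpts.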

... For more details on truncated negative binomial regression and likelihood functions, see e.g. Shi et al. (2015). In practical applications one usually tries to set the threshold as low as possible, subject to the GP distribution providing an acceptable fit. ...

A new class of copulas, termed the MGL copula class, is introduced. The new copula originates from extracting the dependence function of the multivariate generalized log-Moyal-gamma distribution whose marginals follow the univariate generalized log-Moyal-gamma (GLMGA) distribution as introduced in Li et al. (2021). The MGL copula can capture nonelliptical, exchangeable, and asymmetric dependencies among marginal coordinates and provides a simple formulation for regression applications. We discuss the probabilistic characteristics of MGL copula and obtain the corresponding extreme-value copula, named the MGL-EV copula. While the survival MGL copula can be also regarded as a special case of the MGB2 copula from Yang et al. (2011), we show that the proposed model is effective in regression modelling of dependence structures. Next to a simulation study, we propose two applications illustrating the usefulness of the proposed model. This method is also implemented in a user-friendly R package: rMGLReg.

... For all these reasons, there is recently an increasing interest in exploring models that account for the dependence between frequency and severity. In this sense, two different approaches can be distinguished: on one hand, a model is defined for the average claim size distribution using the number of claims as a covariate (see Frees and Wang, 2006; Gschlößl and Czado, 2007; Frees et al., 2011; Garrido et al., 2016; Valdez et al., 2018); as a second approach, the frequency and severity (or average severity) components are related through a copula (see Erhardt and Czado, 2012; Czado et al., 2012; Krämer et al., 2013; Hua, 2015; Lee and Shi, 2019; Oh et al., 2020; Shi et al., 2015). Alternatively, in this paper, we propose the bivariate Sarmanov distribution to model the bivariate distribution relating the frequency and the average severity of claims; our main motivation is that, similarly to copulas, this distribution allows us to separate the dependence structure from the marginal distributions and, in the same way as the copula-based models, we can easily fit the joint behavior of different marginal distributions, continuous or discrete. ...

Real data studies emphasized situations where the classical independence assumption between the frequency and the severity of claims does not hold in the collective model. Therefore, there is an increasing interest in defining models that capture this dependence. In this paper, we introduce such a model based on Sarmanov's bivariate distribution, which has the ability of joining different types of marginals in flexible dependence structures. More precisely, we join the claims frequency and the average severity by means of this distribution. We also suggest a maximum likelihood estimation procedure to estimate the parameters and illustrate it both on simulated and real data.
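As a minimal illustration of the kind of weak dependence such bivariate families encode, the Farlie-Gumbel-Morgenstern (FGM) copula, which is closely related to the Sarmanov construction, can be sampled by conditional inversion using only the standard library. This sketch is illustrative and is not the paper's estimation procedure.

```python
import math
import random

def fgm_sample(theta, n, seed=7):
    """Sample n pairs (u, v) from the FGM copula
    C(u, v) = u*v*[1 + theta*(1 - u)*(1 - v)], theta in [-1, 1],
    by inverting the conditional distribution of v given u."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u, p = rng.random(), rng.random()
        a = theta * (1.0 - 2.0 * u)
        if abs(a) < 1e-12:
            v = p  # the conditional distribution is uniform when a = 0
        else:
            # solve a*v^2 - (1 + a)*v + p = 0 for the root lying in [0, 1]
            v = ((1.0 + a) - math.sqrt((1.0 + a) ** 2 - 4.0 * a * p)) / (2.0 * a)
        pairs.append((u, v))
    return pairs

def pearson(pairs):
    """Sample Pearson correlation of a list of (u, v) pairs."""
    n = len(pairs)
    mu_u = sum(u for u, _ in pairs) / n
    mu_v = sum(v for _, v in pairs) / n
    cov = sum((u - mu_u) * (v - mu_v) for u, v in pairs) / n
    s_u = math.sqrt(sum((u - mu_u) ** 2 for u, _ in pairs) / n)
    s_v = math.sqrt(sum((v - mu_v) ** 2 for _, v in pairs) / n)
    return cov / (s_u * s_v)

pairs = fgm_sample(0.8, 20000)
rho = pearson(pairs)  # theoretical value is theta / 3 ≈ 0.267 for theta = 0.8
```

The bounded correlation (at most 1/3 in absolute value) is exactly why the FGM family is described in the header abstract as a weak-dependence structure.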

... Research on claim frequency and claim size as a basis for determining motor vehicle insurance premiums, and on the use of the Tweedie distribution with generalized linear models, has been carried out by several researchers. Among them, Shi [4] stated that claim frequency independently follows a Tweedie distribution and that generalized linear models give good results. Frees [5] stated that, for insurance companies, the claim score obtained from claim frequency and claim size is a useful tool for assessing liability risk when an accident occurs in motor vehicle insurance. ...

ABSTRACT
It is an important task for an actuary to determine the appropriate premium price for each customer with different risks and characteristics. Many variables can affect the premium price; therefore, actuaries must determine the variables that significantly affect the premium. The purpose of this study is to determine the variables that can affect the amount of the pure premium, using a mixed distribution within Generalized Linear Models (GLM), and to determine the appropriate premium price model based on the variables that influence it. One of the statistical analyses that can be used to model insurance premiums is the Generalized Linear Model. The GLM is an extension of the classical regression model that accommodates multiple data distributions but is limited to the exponential family. In the GLM model, the premium is obtained by multiplying the conditional expected values of the claim frequency and the claim cost. Based on the research that has been done, it is known that the claim frequency and the claim size follow the Tweedie distribution. From the two models, it is known that the variables affecting the pure premium are the number of children, monthly income, marital status, education, occupation, vehicle use, the bluebook value paid, and the customer's type of vehicle. This shows that the GLM is a representative and useful model for insurance companies.

... See e.g. Czado et al. (2012), Frees et al. (2014a, 2016), Shi et al. (2015), and Garrido et al. (2016), where both positive and negative dependencies have been found depending on the data sets used. In particular, Lee and Shi (2019) analyzed longitudinal insurance claim data and proposed to utilize three different copulas to model the temporal dependency of claim frequencies, the temporal dependency of average claim severities (given claim frequencies), and the cross-sectional relation between frequency and average claim amount. ...

In this paper, as opposed to the usual assumption of independence, we propose a credibility model in which the (unobservable) risk profiles of the claim frequency and the claim severity are dependent. Given the risk profiles, the (conditional) marginal distributions of frequency and severity are assumed to belong to the exponential family. A bivariate conjugate prior is proposed for the risk profiles, where the dependency is incorporated via a factorization structure of the joint density. The bivariate posterior is derived, and in turn the Bayesian premium for the aggregate claim is given along with some results on the predictive joint and marginal distributions involving the claim number and the aggregate claim in the next period. To demonstrate the generality of our proposed model, we provide four different examples of bivariate conjugate priors in relation to mixed Erlang, gamma mixture, Farlie-Gumbel-Morgenstern (FGM) copula, and bivariate beta, where each choice has different merits. In these examples, more explicit results can be obtained, and in particular the predictive variance and Value-at-Risk (VaR) of the aggregate claim certainly provide more information on the inherent risk than the Bayesian premium which is merely the predictive mean. Finally, numerical examples will be given to illustrate the effect of dependence on the results, including the use of a real data set that further takes observable risk factors into consideration under a regression setting.

... The GB2 distribution includes the Burr, generalized Pareto and Pareto distributions as special cases and the Generalized Gamma (GG) distribution as a limiting case. The distributions which belong to the GB2 family have been widely used in the actuarial literature to model heavy-tailed insurance loss data for cases with and without covariate information; see, for instance, Frees et al. (2014, 2016); Frees and Valdez (2008); Hürlimann (2014); Jeong (2020); Jeong and Valdez (2020); Laudagé et al. (2019); Ramirez-Cobo et al. (2010); Shi et al. (2015); Wang et al. (2020); Yang et al. (2011), among many others. ...

This article presents the Exponential–Generalized Inverse Gaussian regression model with varying dispersion and shape. The EGIG is a general distribution family which, under the adopted modelling framework, can provide the appropriate level of flexibility to fit moderate costs with high frequencies and heavy-tailed claim sizes, as they both represent significant proportions of the total loss in non-life insurance. The model’s implementation is illustrated by a real data application which involves fitting claim size data from a European motor insurer. The maximum likelihood estimation of the model parameters is achieved through a novel Expectation Maximization (EM)-type algorithm that is computationally tractable and is demonstrated to perform satisfactorily.

... Two alternative approaches have been proposed. On the one hand, the bivariate model is deduced from the frequency distribution and from the severity distribution conditional on frequency, defined for the average claim size using the number of claims as a covariate [2,4]. On the other hand, Czado et al. [25] proposed bivariate models based on copulae (see also Reference [26] for a more general approach). Furthermore, the bivariate Sarmanov model has also been analyzed in the collective model framework [7]. ...

The aim of this paper is to introduce dependence between the claim frequency and the average severity of a policyholder or of an insurance portfolio using a bivariate Sarmanov distribution, which allows one to join variables of different types and with different distributions, thus being a good candidate for modeling the dependence between the two previously mentioned random variables. To model the claim frequency, a generalized linear model based on a mixed Poisson distribution, such as the Negative Binomial (NB), usually works. However, finding a distribution for the claim severity is not that easy. In practice, the Lognormal distribution fits well in many cases. Since the natural logarithm of a Lognormal variable is Normally distributed, this relation is generalised using the Box-Cox transformation to model the average claim severity. Therefore, we propose a bivariate Sarmanov model having as marginals a Negative Binomial and a Normal Generalized Linear Model (GLM), also depending on the parameters of the Box-Cox transformation. We apply this model to the analysis of the frequency-severity bivariate distribution associated with a pay-as-you-drive motor insurance portfolio with explanatory telematic variables.
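The role of the Box-Cox transformation in this construction, recovering the logarithm (and hence the Lognormal-to-Normal link) as its parameter tends to zero, is easy to verify numerically. The claim amount used below is an arbitrary illustrative value.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive claim amount y; lam = 0 is taken
    as the limiting case, which recovers the natural logarithm."""
    if abs(lam) < 1e-8:
        return math.log(y)
    return (y ** lam - 1.0) / lam

# the transform approaches log(y) as lam -> 0, linking the Lognormal
# severity model (log scale) to the Normal GLM on the transformed scale
values = [box_cox(5.0, lam) for lam in (1.0, 0.5, 0.1, 0.01, 0.0)]
```

Estimating lam jointly with the GLM coefficients, as the abstract describes, lets the data choose how far the severity scale departs from the pure log link.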

... Generally, there are three specifications of the dependence structure in the CRM. The first is the so-called two-part approach, where the specifications of the frequency and the severity part are separated and their joint distribution is then described via random effects (Hernández-Bastida et al., 2009; Baumgartner et al., 2015), copula specifications (Czado et al., 2012; Cossette et al., 2019; Oh et al., 2020), or hierarchical structures (Frees et al., 2014; Shi et al., 2015; Garrido et al., 2016; Park et al., 2018; Valdez et al., 2020); ...

The typical risk classification procedure in the insurance field consists of a priori risk classification based on observable risk characteristics and a posteriori risk classification where premiums are adjusted to reflect claim histories. While using the full claim history data is optimal in a posteriori risk classification, some insurance sectors only use partial information to determine the appropriate premium to charge. Examples include auto insurance premiums being calculated based on past claim frequencies, and aggregate severities used to decide workers' compensation. The motivation is to have a simplified and efficient a posteriori risk classification procedure, customized to the context involved. This study compares the relative efficiency of the two simplified a posteriori risk classifications, that is, those based on frequency and severity. It provides a mathematical framework to assist practitioners in choosing the most appropriate practice.

... This database will be most valuable for researchers in actuarial science who do not have access to claims-level data sets with which to test actuarial models. Many actuarial modelers who examine the dependency of frequency and severity of claims (see, e.g., Garrido, Genest, & Schulz, 2016; Lee & Shi, 2019; Shi, Feng, & Ivantsova, 2015) have access to proprietary data sets, provided either by private insurers or by state regulators, with which to apply their models. This data set can be used by researchers who do not have access to such claims data sets. ...

This data insight highlights the Transportation Security Administration (TSA) claims data as an underused data set that would be particularly useful to researchers developing statistical models to analyze claim frequency and severity. Individuals who have been injured or had items damaged, lost or stolen may make a claim for losses to the TSA. The federal government reports information on every claim from 2002 to 2017 at https://www.dhs.gov/tsa-claims-data. Information collected includes the claim date, type, and site, as well as the closed claim amount and disposition (whether it was approved in full, denied, or settled). We provide summary statistics on the frequency and the severity of the data for the years 2003 to 2015. The data set has several unique features: severity is not truncated (there is no deductible); there are significant mass points in the severity data; the frequency data shows a high degree of autocorrelation when compiled on a weekly basis; and there are substantial frequency mass points at zero for daily data.

... One of the key assumptions frequently used in classical collective risk models is the independence between claim frequency and individual severities, together with the independence among the individual severities themselves. However, recent studies (Czado et al. 2012; Krämer et al. 2013; Frees et al. 2014; Baumgartner et al. 2015; Shi et al. 2015; Garrido et al. 2016; Lee et al. 2016; Park et al. 2018; Valdez et al. 2020) have reported evidence against the independence assumption. ...

Several collective risk models have recently been proposed by relaxing the widely used but controversial assumption of independence between claim frequency and severity. Approaches include the bivariate copula model, random effect model, and two-part frequency-severity model. This study focuses on the copula approach to develop collective risk models that allow a flexible dependence structure for frequency and severity. We first revisit the bivariate copula method for frequency and average severity. After examining the inherent difficulties of the bivariate copula model, we alternatively propose modeling the dependence of frequency and individual severities using multivariate Gaussian and t-copula functions. We also explain how to generalize those copulas in the format of a vine copula. The proposed copula models have computational advantages and provide intuitive interpretations for the dependence structure. Our analytical findings are illustrated by analyzing automobile insurance data.
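A bivariate Gaussian copula, the basic building block of the Gaussian and t-copula models revisited in this abstract, can be sampled in a few lines of standard-library Python; the correlation value and sample size below are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def gaussian_copula_sample(rho, n, seed=11):
    """Draw n pairs (u, v) from a bivariate Gaussian copula with
    correlation rho, by transforming correlated normals to uniforms."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Cholesky construction of the second correlated normal
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((phi(z1), phi(z2)))
    return pairs

def corr(pairs):
    """Sample Pearson correlation of a list of (u, v) pairs."""
    n = len(pairs)
    mu_u = sum(u for u, _ in pairs) / n
    mu_v = sum(v for _, v in pairs) / n
    cov = sum((u - mu_u) * (v - mu_v) for u, v in pairs) / n
    s_u = math.sqrt(sum((u - mu_u) ** 2 for u, _ in pairs) / n)
    s_v = math.sqrt(sum((v - mu_v) ** 2 for _, v in pairs) / n)
    return cov / (s_u * s_v)

pairs = gaussian_copula_sample(0.5, 10000)
r = corr(pairs)  # close to (6 / pi) * asin(rho / 2) ≈ 0.48 for rho = 0.5
```

The uniform pairs can then be pushed through any marginal quantile functions, e.g. a count distribution for frequency and a heavy-tailed one for severity, which is the sense in which the copula separates the dependence structure from the margins.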

... A similar approach has been used by Frees et al. (2011a) in the modeling and prediction of frequency and severity of health care expenditure. Shi et al. (2015) suggested a three-part framework in order to capture the association between frequency and severity components. When generalized linear models (GLMs) are used with the number of claims treated as a covariate in claims severity, Garrido et al. (2016) showed that the pure premium includes a correction term for inducing dependence. ...

For a typical insurance portfolio, the claims process for a short period, typically one year, is characterized by observing frequency of claims together with the associated claims severities. The collective risk model describes this portfolio as a random sum of the aggregation of the claim amounts. In the classical framework, for simplicity, the claim frequency and claim severities are assumed to be mutually independent. However, there is a growing interest in relaxing this independence assumption which is more realistic and useful for the practical insurance ratemaking. While the common thread has been capturing the dependence between frequency and aggregate severity within a single period, the work of Oh et al. (2020a) provides an interesting extension to the addition of capturing dependence among individual severities. In this paper, we extend these works within a framework where we have a portfolio of microlevel frequencies and severities for multiple years. This allows us to develop a factor copula model framework that captures various types of dependence between claim frequencies and claim severities over multiple years. It is therefore a clear extension of earlier works on one-year dependent frequency-severity models and on random effects model for capturing serial dependence of claims. We focus on the results using a family of elliptical copulas to model the dependence. The paper further describes how to calibrate the proposed model using illustrative claims data arising from a Singapore insurance company. The estimated results provide strong evidence of all forms of dependencies captured by our model.

... processes has been extensively studied for insurance data (Frees and Valdez 2008;Shi et al. 2015), it remains overlooked in the operational risk literature. This is surprising, as this feature is particularly important to establish the total operational risk capital with models based on compound processes (Brechmann et al. 2014), but also to provide risk managers with reliable indicators of future losses. ...

... Mixture models in a regression setting have been previously exploited in a number of actuarial applications. Several examples include the papers published by Brown and Buckley (2015), Shi et al. (2015), Garrido et al. (2016), Miljkovic and Fernández (2018), and Miljkovic and SenGupta (2018). For general concepts related to the mixture modeling, the reader is referred to the popular books by McLachlan and Basford (1988) and McLachlan and Peel (1994). ...

This paper explores the role of gender gap in the actuarial research community with advanced data science tools. The web scraping tools were employed to create a database of publications that encompasses six major actuarial journals. This database includes the article names, authors’ names, publication year, volume, and the number of citations for the time period 2005–2018. The advanced tools built as part of the R software were used to perform gender classification based on the author’s name. Further, we developed a social network analysis by gender in order to analyze the collaborative structure and other forms of interaction within the actuarial research community. A Poisson mixture model was used to identify major clusters with respect to the frequency of citations by gender across the six journals. The analysis showed that women’s publishing and citation networks are more isolated and have fewer ties than male networks. The paper contributes to the broader literature on the “Matthew effect” in academia. We hope that our study will improve understanding of the gender gap within the actuarial research community and initiate a discussion that will lead to developing strategies for a more diverse, inclusive, and equitable community.

... This phenomenon invalidates the practice of using frequency-driven BMS and highlights the need to extend the classical collective risk model by allowing some dependence structure between frequency and severity. Existing studies on dependent frequency-severity models and the associated insurance premiums include copula-based models (Czado et al., 2012; Frees et al., 2016), two-step frequency-severity models (Frees et al., 2014; Shi et al., 2015; Garrido et al., 2016; Park et al., 2018), and bivariate random effect-based models (Pinquet, 1997; Boudreault et al., 2006; Hernández-Bastida et al., 2009b; Baumgartner et al., 2015; Lu, 2016; Cheung et al., 2019). The random effect model is especially popular in insurance ratemaking because of the mathematical tractability of its prediction. ...

In auto insurance, a Bonus-Malus System (BMS) is commonly used as a posteriori risk classification mechanism to set the premium for the next contract period based on a policyholder's claim history. Even though recent literature reports evidence of a significant dependence between frequency and severity, the current BMS practice is to use a frequency-based transition rule while ignoring severity information. Although Oh et al. (2019) claim that the frequency-driven BMS transition rule can accommodate the dependence between frequency and severity, their proposal is only a partial solution, as the transition rule still completely ignores the claim severity and is unable to penalize large claims. In this study, we propose to use a BMS with a transition rule based on both the frequency and the size of claims, built on a bivariate random effect model, which conveniently allows dependence between frequency and severity. We analytically derive the optimal relativities under the proposed BMS framework and show that the proposed BMS outperforms the existing frequency-driven BMS. Numerical experiments are then provided, using both hypothetical and actual datasets, in order to assess the effect of various dependencies on the BMS risk classification and to confirm our theoretical findings.

... Generally, there are three specifications of the dependence structure in the CRM. The first is the so-called two-part approach, where the specifications of the frequency and the severity part are separated and their joint distribution is then described via random effects (Hernández-Bastida et al., 2009; Baumgartner et al., 2015), copula specifications (Czado et al., 2012; Cossette et al., 2019; Oh et al., 2020), or hierarchical structures (Frees et al., 2014; Shi et al., 2015; Garrido et al., 2016; Park et al., 2018; Valdez et al., 2020); ...

A typical risk classification procedure in insurance consists of a priori risk classification determined by observable risk characteristics, and a posteriori risk classification in which the premium is adjusted to reflect the policyholder's claim history. While using the full claim history is optimal in a posteriori risk classification, i.e. it gives premium estimators with minimal variance, some insurance sectors use only partial information from the claim history to determine the appropriate premium. Classical examples include auto insurance, where premiums are determined by claim frequency data, and workers' compensation insurance, where they are based on aggregate severity. The motivation for such practice is a simplified and efficient a posteriori risk classification procedure customized to the insurance policy involved. This paper compares the relative efficiency of the two simplified a posteriori risk classifications, i.e. based on frequency versus severity, and provides a mathematical framework to assist practitioners in choosing the most appropriate practice.

... Methods that are appropriate in the case of correlation between frequency and severity components are dealt with by, e.g. Shi et al. (2015). The above facts motivated us to consider separate modelling, so the paper focuses only on the claim severity in motor third party liability (MTPL) insurance. ...

The paper focuses on the analysis of claim severity in motor third party liability insurance under the general linear model. The general linear model combines the analyses of variance and regression and makes it possible to measure the influence of categorical factors as well as numerical explanatory variables on the target variable. In the paper, simple, main and interaction effects of relevant factors have been quantified using estimated regression coefficients and least-squares means. Statistical inferences about least-squares means are essential in creating tariff classes and uncovering the impact of categorical factors, so the authors used the LSMEANS, CONTRAST and ESTIMATE statements in the GLM procedure of the Statistical Analysis Software (SAS). The study was based on a set of anonymised data of an insurance company operating in Slovakia; however, because each insurance company has its own portfolio subject to changes over time, the results of this research will not apply to all insurance companies. In this context, the authors feel that what is most valuable in their work is the demonstration of practical applications that could be used by actuaries to estimate both the claim severity and the claim frequency, and, consequently, to determine net premiums for motor insurance (regardless of whether for motor third party liability insurance or casco insurance).

... Inspired by this idea, there has been an increasing interest in developing models to capture the possible dependence as in Shi et al. (2015) and Lee et al. (2019). Secondly, it is also important to consider the repeated measurement of claims because it can provide further insights as to the unobserved heterogeneity of the policyholders. ...

The two-part regression model, which subdivides aggregate claims into frequency and severity components, is a widespread tool in actuarial practice for predicting pure premium. The assumption of independence between frequency and severity is conventional but there is increased interest in advancing models to capture the possible dependence as is done in Garrido et al. (2016). This paper extends our work in Jeong et al. (2018a) and explores the benefits of using random effects for predicting insurance claims observed longitudinally, or over a period of time, within a two-part framework relaxing the assumption of independence. More specifically, we introduce a generalized formula for credibility premium of compound sum with dependence, which extends and integrates previous work in both credibility premium of compound sums and dependent two-part compound risk models. In this generalized formula of credibility premium of compound sum, we are able to derive a dependence function, D_N(γ), that offers an informative measure of the strength and direction of the association between frequency and severity. This function is easy to interpret and allows for practical implementation useful for actuarial ratemaking. Our model calibration, based on longitudinal claims from a Singapore automobile insurance company, shows that there is a strong negative dependence between frequency and severity. Keywords: Dependent frequency and severity, generalized linear model (GLM), hierarchical model, generalized Pareto (GP), generalized beta of the second kind (GB2), credibility premium of compound sum with dependence

Bonus-Malus Systems traditionally consider a customer's number of claims irrespective of their sizes, even though these components are dependent in practice. We propose a novel joint experience rating approach based on latent Markovian risk profiles to allow for a positive or negative individual frequency-severity dependence. The latent profiles evolve over time in a Hidden Markov Model to capture updates in a customer's claims experience, making claim counts and sizes conditionally independent. We show that the resulting risk premia lead to a dynamic, claims experience-weighted mixture of standard credibility premia. The proposed approach is applied to a Dutch automobile insurance portfolio and identifies customer risk profiles with distinctive claiming behavior. These profiles, in turn, enable us to better distinguish between customer risks.

This paper explores data-driven models based on a recently developed natural language processing (NLP) tool, Bidirectional Encoder Representations from Transformers (BERT), to incorporate textual information for group classification and loss amount prediction on trucks' basic warranty claims. In group classification modeling, multiple-class logistic regression is compared with BERT-based back-propagation neural networks (NN). In group loss severity modeling, direct NN regression is compared with BERT-based NN regression prediction. Furthermore, based on the results from a so-called optimal bin-width algorithm, the severity distribution is fitted with a Gamma distribution whose parameters are estimated using maximum likelihood estimation (MLE). The data experiments show that the BERT framework for NLP improves both the classification of the warranty claims and the loss severity prediction, in accuracy as well as stability.

To predict insurance reserves at the micro level without data aggregation, a two-stage machine learning model based on enhanced LightGBM decision trees is proposed. The first stage is a classification task: whether there are claims that have arisen but have not been reported under the contract (IBNR). In the second stage, the insurance reserve is predicted using a regression model in the IBNR case, or it is assumed to be equal to the sum insured otherwise. It is shown that the proposed method is more effective than traditional methods of insurance reserve forecasting. Keywords: Insurance, Reserve prediction, Individual claims, LightGBM.

The concordance probability, also called the C-index, is a popular measure of the discriminatory ability of a predictive model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity models typically used in the technical pricing of a non-life insurance product. For the frequency model, the need for two different groups is addressed by defining three new types of concordance probability. Secondly, these adapted definitions deal with the concept of exposure, the duration of a policy or insurance contract. Frequency data typically have a large sample size, so we present two fast and accurate estimation procedures for big data. Their good performance is illustrated on two real-life datasets. Using these examples, we also estimate the concordance probability developed for severity models.
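As a point of reference, the classical (unadapted) concordance probability can be estimated as the fraction of comparable pairs in which the larger prediction goes with the larger outcome. The following is a minimal sketch with ties counted as one half; the function name and data are illustrative, not the article's estimators:

```python
import numpy as np

def concordance_probability(y, y_pred):
    """Naive O(n^2) estimate of P(pred_i > pred_j | y_i > y_j),
    counting tied predictions as one half."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = den = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:          # only comparable pairs contribute
                den += 1
                if y_pred[i] > y_pred[j]:
                    num += 1
                elif y_pred[i] == y_pred[j]:
                    num += 0.5
    return num / den

# Perfectly concordant predictions give a C-index of 1.
print(concordance_probability([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

The article's frequency-model variants additionally condition on group membership and weight by exposure; the quadratic loop above is for illustration only and would be replaced by the fast estimation procedures the authors propose for big data.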

Hail risk is among the most challenging perils to insure and property damage due to hailstones has been at the top of the list of annual claims for most non‐life insurers. In this article, we present a simple yet flexible statistical model for insurers to assess and manage hail risks from two aspects: analysing the insurance claims arrival pattern upon occurrence of a hailstorm and quantifying the subsequent financial impact of the hailstorm. We formulate the problem using a marked point process where the reporting of insurance claims due to a hailstorm is treated as recurrent events and the claim amounts are viewed as associated marks. Three complications are addressed in model building: the unobserved heterogeneity in claim arrival, the dependence between the event time and the mark, and the complex distribution of the claim amount. Using a unique dataset that combines the exposure data from a major US insurer and radar data from a third‐party vendor, we show that the proposed method helps improve predictive analytics for post‐hailstorm claims volume, arrival rate and severity, and thus claim management decisions for the insurer.

We introduce a smooth-transition generalized Pareto (GP) regression model to study the time-varying dependence structure between extreme losses and a set of economic factors. In this model, the distribution of the loss size is approximated by a GP distribution, and its parameters are related to explanatory variables through regression functions, which themselves depend on a time-varying predictor of structural changes. We use this approach to study the dynamics in the monthly severity distribution of operational losses at a major European bank. Using the VIX as a transition variable, our analysis reveals that when the uncertainty is high, a high number of losses in a recent past are indicative of less extreme losses in the future, consistent with a self-inhibition hypothesis. On the contrary, in times of low uncertainty, only the growth rate of the economy seems to be a relevant predictor of the likelihood of extreme losses.

Approximate Bayesian Computation (ABC) is a statistical learning technique to calibrate and select models by comparing observed data to simulated data. This technique bypasses the use of the likelihood and requires only the ability to generate synthetic data from the models of interest. We apply ABC to fit and compare insurance loss models using aggregated data. A state-of-the-art ABC implementation in Python is proposed. It uses sequential Monte Carlo to sample from the posterior distribution and the Wasserstein distance to compare the observed and synthetic data.
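The idea can be illustrated with a plain rejection-ABC sketch: keep the prior draws whose synthetic data lie closest to the observations in Wasserstein distance. The paper's implementation uses sequential Monte Carlo instead; the function names and the exponential toy model below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def abc_rejection(observed, simulate, prior_sample, n_draws=2000, keep=0.05):
    """Rejection ABC: keep the prior draws whose synthetic data
    are closest to the observed data in Wasserstein distance."""
    draws = [prior_sample(rng) for _ in range(n_draws)]
    dists = [wasserstein_distance(observed, simulate(t, rng)) for t in draws]
    cutoff = np.quantile(dists, keep)
    return np.array([t for t, d in zip(draws, dists) if d <= cutoff])

# Toy example: recover the rate of exponential claim sizes (true rate 0.5).
observed = rng.exponential(scale=2.0, size=500)
posterior = abc_rejection(
    observed,
    simulate=lambda theta, r: r.exponential(scale=1.0 / theta, size=500),
    prior_sample=lambda r: r.uniform(0.1, 2.0),
)
print(posterior.mean())  # concentrates near the true rate 0.5
```

Sequential Monte Carlo ABC improves on this by iteratively tightening the acceptance threshold and reweighting particles, but the accept/reject comparison against a distance between datasets is the same.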

For a typical insurance portfolio, the claims process for a short period, typically one year, is characterized by observing the frequency of claims together with the associated claim severities. The collective risk model describes this portfolio as a random sum of the aggregation of the claim amounts. In the classical framework, for simplicity, the claim frequency and claim severities are assumed to be mutually independent. However, there is a growing interest in relaxing this independence assumption, which is more realistic and useful for practical insurance ratemaking. While the common thread has been capturing the dependence between frequency and aggregate severity within a single period, the work of Oh et al. (2021) provides an interesting extension that additionally captures dependence among individual severities. In this paper, we extend these works within a framework where we have a portfolio of micro-level frequencies and severities for multiple years. This allows us to develop a factor copula model framework that captures various types of dependence between claim frequencies and claim severities over multiple years. It is therefore a clear extension of earlier works on one-year dependent frequency-severity models and on random effects models for capturing serial dependence of claims. We focus on the results using a family of elliptical copulas to model the dependence. The paper further describes how to calibrate the proposed model using illustrative claims data arising from a Singapore insurance company. The estimated results provide strong evidence of all the forms of dependence captured by our model.

In this paper we propose a statistical modeling framework that contributes to advancing methods for modeling insurance policy premium in the actuarial literature. Specification of separate frequency and severity models, accounting for territorial risk and performing accurate inference, are some of the challenges an actuary faces while modeling policy premium. Policy premiums are characterized to follow a semi-continuous probability distribution, featuring a non-zero probability at zero along with a positive continuous support. Interpretability is a concern when quantifying the unobserved risk that premiums face from spatial variation. Commonly used strategies in the literature are known to successfully quantify this risk, but do not necessarily produce interpretable estimates. Resorting to frequency-severity models leaves the actuary indecisive about the specification of covariates and spatial effects. The novelty of our proposed approach lies in the development of a parsimonious and interpretable zero-adjusted modeling framework that allows for joint estimation of the effect of policy and individual characteristics on the mean premium and dispersion, while quantifying spatial variability in the mean model. The developed methods are applied to a database featuring premiums arising from the collision coverage in insurance policies for motor vehicles within the state of Connecticut, USA, for the year 2008.

This paper analyses the problems associated with studying the two variables that determine the technical loss of an insurance business. To this end, two methodologies for calculating the expected total claims of an auto insurance portfolio in Brazil are put to the test. First, a univariate statistical model, referred to as the traditional one, was considered, for which the Log-Normal distribution provided the best fit. As a second methodology, a copula model was calibrated, allowing different types and degrees of stochastic association between the frequency and severity variables. The results show that these variables exhibit extreme behaviour, with a low negative Kendall correlation (-0.24) that nevertheless rejects the independence hypothesis at the 5% level. In this framework, the Clayton copula rotated 270 degrees, with Exp-Poisson and Log-Normal marginals for frequency and severity respectively, produced the best loss estimates, coming within 12% of the empirical average loss of the database. Finally, it was found that assuming independence between severity and frequency in this case study would lead to significant overestimation of the expected loss.

The two-part regression model, which subdivides aggregate claims into frequency and severity components, is a widespread tool in actuarial practice for predicting pure premium. The assumption of independence between frequency and severity is conventional but there is increased interest in advancing models to capture the possible dependence as is done in Garrido et al. (2016). This paper extends the work of Garrido et al. (2016) and explores the benefits of using random effects for predicting insurance claims observed longitudinally, or over a period of time, within a two-part framework relaxing the assumption of independence. More specifically, we introduce a generalized formula for credibility premium of compound sum with dependence, which extends and integrates previous work in both credibility premium of compound sums and dependent two-part compound risk models. In this generalized formula of credibility premium of compound sum, we are able to derive a dependence function, D_N(γ), that offers an informative measure of the strength and direction of the association between frequency and severity. This function is easy to interpret and allows for practical implementation useful for actuarial ratemaking. Our model calibration, based on longitudinal claims from a Singapore automobile insurance company, shows that there is a strong negative dependence between frequency and severity.

Malicious hacking data breaches cause millions of dollars in financial losses each year, and more companies are seeking cyber insurance coverage. The lack of suitable statistical approaches for scoring breach risks is an obstacle in the insurance industry. We propose a novel frequency–severity model to analyze hacking breach risks at the individual company level, which would be valuable for underwriting purposes. We find that breach frequency can be modeled by a hurdle Poisson model, which is different from the negative binomial model used in the literature. The breach severity shows a heavy tail that can be captured by a nonparametric generalized Pareto distribution model. We further discover a positive nonlinear dependence between frequency and severity, which our model also accommodates. Both the in-sample and out-of-sample studies show that the proposed frequency–severity model that accommodates nonlinear dependence has satisfactory performance and is superior to the other models, including the independence frequency–severity and Tweedie models.
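The hurdle Poisson structure found to fit the breach counts can be written down in a few lines: a free point mass at zero plus a zero-truncated Poisson for the positive counts. This is a generic sketch with illustrative parameter values, not the fitted model:

```python
import numpy as np
from scipy.stats import poisson

def hurdle_poisson_pmf(k, p0, lam):
    """Hurdle Poisson: a point mass p0 at zero, and a zero-truncated
    Poisson(lam) reweighted by (1 - p0) for the positive counts."""
    k = np.asarray(k)
    truncated = poisson.pmf(k, lam) / (1.0 - poisson.pmf(0, lam))
    return np.where(k == 0, p0, (1.0 - p0) * truncated)

# Illustrative parameters: 80% of firms report no breach in a period.
pmf = hurdle_poisson_pmf(np.arange(50), p0=0.8, lam=1.5)
print(pmf.sum())  # the probabilities sum to 1
```

Unlike an ordinary Poisson, the hurdle model decouples the zero probability from the shape of the positive counts, which is what lets it fit data with many non-claiming (here, non-breached) units.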

In this paper, we explore the use of an extensive list of Archimedean copulas in general and life insurance modelling. We consider not only the usual choices like the Clayton, Gumbel–Hougaard, and Frank copulas but also several others which have not drawn much attention in previous applications. First, we apply different copula functions to two general insurance data sets, co-modelling losses and allocated loss adjustment expenses, and also losses to building and contents. Second, we adopt these copulas for modelling the mortality trends of two neighbouring countries and calculate the market price of a mortality bond. Our results clearly show that the diversity of Archimedean copula structures gives much flexibility for modelling different kinds of data sets and that the copula and tail dependence assumption can have a significant impact on pricing and valuation. Moreover, we conduct a large simulation exercise to investigate further the caveats in copula selection. Finally, we examine a number of other estimation methods which have not been tested in previous insurance applications.

We consider aggregate discounted claims of a risk portfolio with Weibull counting process and compute its recursive moments numerically via the Laplace transform. We define the dependence structure between the inter-claim arrival time and its subsequent claims size using the Farlie-Gumbel-Morgenstern (FGM) copula. In our numerical examples, we compare the moments and conduct sensitivity analysis assuming an exponential and a Pareto claims size distribution. We found that despite having similar marginal variances, the risk portfolio with Pareto claims size produces larger moments compared to the corresponding exponential claims size.
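For reference, the FGM copula used above is easy to sample from by conditional inversion, since the conditional CDF of V given U = u is quadratic in v. The following is a standalone sketch with an illustrative parameter value, not the study's numerical scheme:

```python
import numpy as np

def fgm_sample(theta, n, rng):
    """Sample (U, V) from the FGM copula
    C(u, v) = u*v*(1 + theta*(1 - u)*(1 - v)),  |theta| <= 1,
    by inverting the conditional CDF of V given U = u:
    C_{V|U}(v|u) = v + theta*(1 - 2u)*v*(1 - v)."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)                      # target conditional probability
    a = theta * (1.0 - 2.0 * u)
    disc = np.sqrt((1.0 + a) ** 2 - 4.0 * a * w)
    v = np.where(np.abs(a) < 1e-12, w, (1.0 + a - disc) / (2.0 * a))
    return u, v

rng = np.random.default_rng(1)
u, v = fgm_sample(theta=0.9, n=100_000, rng=rng)
# For the FGM copula, Spearman's rho equals theta / 3.
print(np.corrcoef(u, v)[0, 1])  # close to 0.3 for theta = 0.9
```

The FGM family supports only weak dependence (Spearman's rho of theta/3, so at most 1/3 in absolute value), which bounds how strongly the inter-claim arrival time and the claim size can be associated under this model.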

Longitudinal data (or panel data) consist of repeated observations of individual units that are observed over time. Each individual insured is assumed to be independent but correlation between contracts of the same individual is permitted. This paper presents an exhaustive overview of models for panel data that consist of generalizations of count distributions where the dependence between contracts of the same insureds can be modeled with Bayesian and frequentist models, based on generalizations of the Poisson and negative binomial distributions. This paper introduces some of those models to actuarial science and compares the fitting with specification tests for nested and non-nested models. It also shows why some intuitive models (past experience as regressors, multivariate distributions, or copula models) involving time dependence cannot be used to model the number of reported claims. We conclude that the random effects models have a better fit than the other models examined here because the fitting is improved and it allows for more flexibility in computing the next year's premium.
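The random effects mechanism the authors favour can be sketched as a Poisson-gamma mixture: a latent insured-specific frailty multiplies the mean claim rate, producing negative binomial margins, overdispersion, and dependence between contract years of the same insured. All parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# N_it | theta_i ~ Poisson(lam * theta_i), with a gamma frailty
# theta_i ~ Gamma(r, 1/r) so that E[theta_i] = 1.
lam, r = 0.2, 2.0
n_insureds, n_years = 100_000, 5

theta = rng.gamma(r, 1.0 / r, size=n_insureds)              # one frailty per insured
counts = rng.poisson(lam * theta[:, None], size=(n_insureds, n_years))

# Marginal variance is lam + lam**2 / r = 0.22 > mean = 0.2 (overdispersion).
print(counts.mean(), counts.var())
```

Conditioning on an insured's past counts updates the posterior of theta_i, which is what drives next year's credibility premium in such random effects models.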

This paper presents and compares different risk classification models for the annual number of claims reported to the insurer. Generalized heterogeneous, zero-inflated, hurdle, and compound frequency models are applied to a sample of an automobile portfolio of a major company operating in Spain. A statistical comparison between models is performed with the help of various specification tests (Score and Hausman tests for nested models, Vuong test or information criteria for nonnested ones). Interesting results about claiming behavior are obtained.

Using the Kullback-Leibler information criterion to measure the closeness of a model to the truth, the author proposes new likelihood-ratio-based statistics for testing the null hypothesis that the competing models are as close to the true data generating process against the alternative hypothesis that one model is closer. The tests are directional and are derived for the cases where the competing models are non-nested, overlapping, or nested and whether both, one, or neither is misspecified. As a prerequisite, the author fully characterizes the asymptotic distribution of the likelihood ratio statistic under the most general conditions. Copyright 1989 by The Econometric Society.
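For the non-nested i.i.d. case, the resulting statistic is simple to compute: standardize the mean of the pointwise log-likelihood ratios, which is asymptotically standard normal under the null that both models are equally close to the truth. A minimal sketch with a made-up pair of competing Gaussian models:

```python
import numpy as np
from scipy.stats import norm

def vuong_statistic(loglik_a, loglik_b):
    """Vuong test for non-nested models: z > 0 favours model A,
    z < 0 favours model B; z is asymptotically N(0, 1) under the null."""
    d = np.asarray(loglik_a) - np.asarray(loglik_b)
    n = len(d)
    z = np.sqrt(n) * d.mean() / d.std(ddof=1)
    p = 2.0 * (1.0 - norm.cdf(abs(z)))      # two-sided p-value
    return z, p

rng = np.random.default_rng(2)
x = rng.normal(loc=0.5, scale=1.0, size=400)
# Model A: N(0.5, 1) (well specified); model B: N(0, 1) (misspecified mean).
z, p = vuong_statistic(norm.logpdf(x, loc=0.5), norm.logpdf(x, loc=0.0))
print(z > 0)  # model A is preferred
```

The directional nature of the test is visible here: the sign of z says which model is closer to the data-generating process, and the p-value says whether the difference is significant.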

Chapter Preview. Many insurance datasets feature information about frequency, how often claims arise, in addition to severity, the claim size. This chapter introduces tools for handling the joint distribution of frequency and severity. Frequency-severity modeling is important in insurance applications because of features of contracts, policyholder behavior, databases that insurers maintain, and regulatory requirements. Model selection depends on the data form. For some data, we observe the claim amount and think about a zero claim as meaning no claim during that period. For other data, we observe individual claim amounts. Model selection also depends on the purpose of the inference; this chapter highlights the Tweedie generalized linear model as a desirable option. To emphasize practical applications, this chapter features a case study of Massachusetts automobile claims, using out-of-sample validation for model comparisons. How Frequency Augments Severity Information: At a fundamental level, insurance companies accept premiums in exchange for promises to indemnify a policyholder on the uncertain occurrence of an insured event. This indemnification is known as a claim. A positive amount of the claim, also known as the severity, is a key financial expenditure for an insurer. One can also think about a zero claim as equivalent to the insured event not occurring. So, knowing only the claim amount summarizes the reimbursement to the policyholder. Ignoring expenses, an insurer that examines only amounts paid would be indifferent to two claims of 100 when compared to one claim of 200, even though the number of claims differs.

Chapter Preview. In the actuarial context, fat-tailed phenomena are often observed where the probability of extreme events is higher than that implied by the normal distribution. The traditional regression, emphasizing the center of the distribution, might not be appropriate when dealing with data with fat-tailed properties. Overlooking the extreme values in the tail could lead to biased inference for rate-making and valuation. In response, this chapter discusses four fat-tailed regression techniques that fully use the information from the entire distribution: transformation, models based on the exponential family, models based on generalized distributions, and median regression. Introduction: Insurance ratemaking is a classic actuarial problem in property-casualty insurance where actuaries determine the rates or premiums for insurance products. The primary goal in the ratemaking process is to precisely predict the expected claims cost which serves as the basis for pure premiums calculation. Regression techniques are useful in this process because future events are usually forecasted from past occurrence based on the statistical relationships between outcomes and explanatory variables. This is particularly true for personal lines of business where insurers usually possess large amount of information on policyholders that could be valuable predictors in the determination of mean cost. The traditional mean regression, though focusing on the center of the distribution, relies on the normality of the response variable.

We study whether individuals' private information on health risk affects their medical care utilization. The presence of such information asymmetry is critical to the optimal payment design in healthcare systems. To do so, we examine the relationship between self-perceived health status and healthcare expenditures. Because of simultaneity, we employ a copula regression to model jointly the mixed outcomes, with the association parameter capturing the residual dependence conditional on covariates. The semicontinuous nature of healthcare expenditures leads to a two-part interpretation of private health information: the hurdle component assesses its effect on the likelihood of using medical care services, and the conditional component quantifies its effect on the expenditures given consumption of care. The methodology proposed is applied to a survey data set of a sample of the US civilian non-institutionalized population to test and quantify the effects of private health information. We find evidence of adverse selection in the utilization of various types of medical care services.

Insurers, investors and regulators are interested in understanding the behavior of insurance company expenses, due to the high operating cost of the industry. Expense models can be used for prediction, to identify unusual behavior, and to measure firm efficiency. Current literature focuses on the study of total expenses, which consist of three components: underwriting, investment and loss adjustment. A joint study of expenses by type would deliver more information and is critical to understanding their relationship. This paper introduces a copula regression model to examine the three types of expenses in a longitudinal context. In our method, elliptical copulas are employed to accommodate the between-subject contemporaneous and lag dependencies, as well as the within-subject serial correlations of the three types. Flexible distributions are allowed for the marginals of each type, with covariates incorporated in the distribution parameters. A model validation procedure based on a t-plot method is proposed for in-sample and out-of-sample validation purposes. The multivariate longitudinal model effectively addresses the typical features of expenses data: the heavy tails, the strong individual effects and the lack of balance. The analysis is performed using property–casualty insurance company expenses data from the National Association of Insurance Commissioners for the years 2001–2006. A unique set of covariates is determined for each type of expense. We found that underwriting expenses and loss adjustment expenses are complements rather than substitutes. The model is shown to be successful in efficiency classification. Also, a multivariate predictive density is derived to quantify the future values of an insurer's expenses.

We discuss the estimation and inference problems for the Tweedie compound Poisson process and its application to tarification. For data in the form of the total claim and number of claims for a given time interval and a given exposure, the Tweedie process corresponds to a Poisson process of claims and gamma distributed claim sizes. The model has three parameters, namely the mean claim rate, a dispersion parameter and a shape parameter, and the exposure enters as a weight via the dispersion parameter. The Tweedie process is an exponential dispersion model for a fixed value of the shape parameter, and hence regression models for the claim rate may be fitted as in generalized linear models. The shape parameter is estimated by maximum likelihood, and inference is based on the likelihood ratio test, rather than the usual analysis of deviance. A GLIM 4 program for estimation in the model is presented.
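The compound mechanism behind the Tweedie model is easy to simulate directly, which makes its key features visible: a point mass at zero total claims (no claims in the period) together with a continuous positive part. All parameter values here are illustrative:

```python
import numpy as np

def simulate_compound_poisson_gamma(lam, shape, scale, n, rng):
    """Total claims S = sum of N i.i.d. Gamma(shape, scale) severities,
    with N ~ Poisson(lam): the Tweedie compound Poisson model."""
    counts = rng.poisson(lam, size=n)
    return np.array([rng.gamma(shape, scale, size=k).sum() for k in counts])

rng = np.random.default_rng(3)
s = simulate_compound_poisson_gamma(lam=2.0, shape=3.0, scale=0.5, n=50_000, rng=rng)

# E[S] = lam * shape * scale = 3.0, and P(S = 0) = exp(-lam) ≈ 0.135.
print(s.mean(), (s == 0).mean())
```

The exact zero mass combined with a skewed continuous part is why the Tweedie family is attractive as a GLM response distribution for pure premium, as the chapter preview above also notes.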

A crucial assumption of the classical compound Poisson model of Lundberg for assessing the total loss incurred in an insurance portfolio is the independence between the occurrence of a claim and its claims size. In this paper we present a mixed copula approach suggested by Song et al. to allow for dependency between the number of claims and its corresponding average claim size using a Gaussian copula. Marginally we permit for regression effects both on the number of incurred claims as well as its average claim size using generalized linear models. Parameters are estimated using adaptive versions of maximization by parts (MBP). The performance of the estimation procedure is validated in an extensive simulation study. Finally the method is applied to a portfolio of car insurance policies, indicating its superiority over the classical compound Poisson model.

We present a joint copula-based model for insurance claims and sizes. It uses bivariate copulae to accommodate the dependence between these quantities. We derive the general distribution of the policy loss without the restrictive assumption of independence. We illustrate that this distribution tends to be skewed and multi-modal, and that an independence assumption can lead to substantial bias in the estimation of the policy loss. Further, we extend our framework to regression models by combining marginal generalized linear models with a copula. We show that this approach leads to a flexible class of models, and that the parameters can be estimated efficiently using maximum likelihood. We propose a test procedure for the selection of the optimal copula family. The usefulness of our approach is illustrated in a simulation study and in an analysis of car insurance policies.
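The copula construction described above can be sketched by sampling: draw correlated standard normals, map them to uniforms, and push the uniforms through the marginal quantile functions of the claim count and the claim size. This is a minimal Gaussian-copula illustration with made-up marginals, not the paper's full GLM-based model:

```python
import numpy as np
from scipy.stats import norm, poisson, gamma

def sample_dependent_freq_sev(rho, lam, shape, scale, n, rng):
    """Draw (claim count, average severity) pairs linked by a Gaussian
    copula: correlated normals pushed through marginal quantiles."""
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    u = norm.cdf(z)                                   # uniforms with copula dependence
    counts = poisson.ppf(u[:, 0], lam).astype(int)    # Poisson frequency margin
    avg_sev = gamma.ppf(u[:, 1], a=shape, scale=scale)  # Gamma severity margin
    return counts, avg_sev

rng = np.random.default_rng(4)
n_claims, sev = sample_dependent_freq_sev(rho=0.6, lam=1.0, shape=2.0,
                                          scale=500.0, n=20_000, rng=rng)
print(np.corrcoef(n_claims, sev)[0, 1])  # clearly positive dependence
```

Setting rho to zero recovers the classical independent frequency-severity model, which is what makes the bias from a wrongly imposed independence assumption easy to study in simulation.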

In this paper models for claim frequency and average claim size in non-life insurance are considered. Both covariates and spatial random effects are included allowing the modelling of a spatial dependency pattern. We assume a Poisson model for the number of claims, while claim size is modelled using a Gamma distribution. However, in contrast to the usual compound Poisson model, we allow for dependencies between claim size and claim frequency. A fully Bayesian approach is followed, parameters are estimated using Markov Chain Monte Carlo (MCMC). The issue of model comparison is thoroughly addressed. Besides the deviance information criterion and the predictive model choice criterion, we suggest the use of proper scoring rules based on the posterior predictive distribution for comparing models. We give an application to a comprehensive data set from a German car insurance company. The inclusion of spatial effects significantly improves the models for both claim frequency and claim size and also leads to more accurate predictions of the total claim sizes. Further we detect significant dependencies between the number of claims and claim size. Both spatial and number of claims effects are interpreted and quantified from an actuarial point of view.

Objective: To extend the standard two-part model for predicting health care expenditures where multiple encounters may occur within a one-year period. Data Sources: Data for this study were from the Medical Expenditure Panel Survey (MEPS). Study Design: The first part of the extended model represented the frequency of the number of inpatient stays or outpatient visits. The second part modeled expenditure per stay or visit. Both component models used independent variables that consisted of demographic and access characteristics, socioeconomic status, health status, health insurance coverage, employment status and industry classification. The second part also included a variable representing the number of events to predict the expenditure per event. Data Collection/Extraction Methods: MEPS panels 7 and 8 from 2003 were used for estimation; panel 9 from 2004 was used for prediction. Principal Findings: This aggregate expenditures model provided a better fit to the data than standard two-part models. The count variable was significant in predicting outpatient expenditures. This model was useful for predicting total health care expenditures for randomly selected individuals and groups. Conclusions: The aggregate expenditures model provided a distribution of predictive values for use in programs to manage expenditures or in studies to predict expenditures.

In insurance applications yearly claim totals of different coverage fields are often dependent. In many cases there are numerous claim totals which are zero. A marginal claim distribution will have an additional point mass at zero; hence this probability function will not be continuous at zero and the cdfs will not be uniform. Therefore using a copula approach to model dependency is not straightforward. We will illustrate how to express the joint probability function by copulas with discrete and continuous margins. A pair-copula construction will be used for the fit of the continuous copula, allowing us to choose appropriate copulas for each pair of margins.
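Joint probabilities for discrete margins linked by a copula come from the inclusion-exclusion (rectangle) formula on the copula cdf. A minimal pure-Python illustration using the FGM copula and Bernoulli margins (all names and parameter choices are ours, for illustration only):

```python
def fgm_cdf(u, v, theta):
    """FGM copula cdf C(u, v) = uv[1 + theta(1-u)(1-v)]."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))

def joint_pmf(x, y, Fx, Fy, theta):
    """P(X = x, Y = y) for discrete margins with cdfs Fx, Fy linked by the
    FGM copula, via the rectangle (inclusion-exclusion) formula."""
    return (fgm_cdf(Fx(x), Fy(y), theta) - fgm_cdf(Fx(x - 1), Fy(y), theta)
            - fgm_cdf(Fx(x), Fy(y - 1), theta) + fgm_cdf(Fx(x - 1), Fy(y - 1), theta))

# Bernoulli margins: P(X = 1) = 0.3 and P(Y = 1) = 0.6.
Fx = lambda x: 0.0 if x < 0 else (0.7 if x < 1 else 1.0)
Fy = lambda y: 0.0 if y < 0 else (0.4 if y < 1 else 1.0)
grid = {(x, y): joint_pmf(x, y, Fx, Fy, 0.5) for x in (0, 1) for y in (0, 1)}
```

Because C(u, 1) = u and C(1, v) = v for any copula, summing the rectangle probabilities recovers the Bernoulli margins exactly, whatever the dependence parameter.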

This article examines adverse selection in insurance markets within a two-dimensional information framework, where policyholders differ in both their riskiness and degree of risk aversion. Using this setup, we first build a theoretical model to make equilibrium predictions on competitive insurance screening. We study several variations on the pattern of information asymmetry. The outcomes range from full risk separation, to partial separation, to complete pooling of different risk types. Next, we examine results of this construction with an empirical investigation using a cross-sectional observation from a major automobile insurer in Singapore. To test for evidence of adverse selection, we propose a copula regression model to jointly examine the relationship between policyholders' coverage choice and accident occurrence. The association parameter in the copula provides evidence of asymmetric information. Furthermore, we invoke the theory to identify subgroups of policyholders for whom one may expect the risk-coverage correlation and adverse selection to arise. The empirical findings are largely consistent with theoretical predictions.

I examine the effects of insurance status and managed care on hospitalization spells, and develop a new approach for sample selection problems in parametric duration models. MLE of the Flexible Parametric Selection (FPS) model does not require numerical integration or simulation techniques. I discuss application to the exponential, Weibull, log-logistic and gamma duration models. Applying the model to the hospitalization data indicates that the FPS model may be preferred even in cases in which other parametric approaches are available. Copyright © 2002 John Wiley & Sons, Ltd.

This is the only book actuaries need to understand generalized linear models (GLMs) for insurance applications. GLMs are used in the insurance industry to support critical decisions. Until now, no text has introduced GLMs in this context or addressed the problems specific to insurance data. Using insurance data sets, this practical, rigorous book treats GLMs, covers all standard exponential family distributions, extends the methodology to correlated data structures, and discusses recent developments which go beyond the GLM. The issues in the book are specific to insurance data, such as model selection in the presence of large data sets and the handling of varying exposure times. Exercises and data-based practicals help readers to consolidate their skills, with solutions and data sets given on the companion website. Although the book is package-independent, SAS code and output examples feature in an appendix and on the website. In addition, R code and output for all the examples are provided on the website.

Individuals, corporations and government entities regularly exchange financial risks y at prices Π. Comparing distributions of risks and prices can be difficult, particularly when the financial risk distribution is complex. For example, with insurance, it is not uncommon for a risk distribution to be a mixture of 0’s (corresponding to no claims) and a right-skewed distribution with thick tails (the claims distribution). However, analysts do not work in a vacuum, and in the case of insurance they use insurance scores relative to prices, called “relativities,” that point to areas of potential discrepancies between risk and price distributions. Ordering both risks and prices based on relativities, in this article we introduce what we call an “ordered” Lorenz curve for comparing distributions. This curve extends the classical Lorenz curve in two ways, through the ordering of risks and prices and by allowing prices to vary by observation. We summarize the ordered Lorenz curve in the same way as the classic Lorenz curve using a Gini index, defined as twice the area between the curve and the 45-degree line. For a given ordering, a large Gini index signals a large difference between price and risk distributions. We show that the ordered Lorenz curve has desirable properties. It can be expressed in terms of weighted distribution functions. In special cases, curves can be ranked through a partial ordering. We show how to estimate the Gini index and give pointwise consistency and asymptotic normality results. A simulation study and an example using homeowners insurance underscore the potential applications of these methods.
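The construction above can be sketched numerically. A minimal pure-Python version (function name and data are ours), computing the Gini index as twice the area between the ordered Lorenz curve and the 45-degree line:

```python
def ordered_gini(prices, losses, relativities):
    """Order observations by relativity, accumulate the price share (x-axis)
    and the loss share (y-axis), and return twice the area between the
    resulting ordered Lorenz curve and the 45-degree line (trapezoidal rule)."""
    order = sorted(range(len(prices)), key=lambda i: relativities[i])
    total_p, total_l = float(sum(prices)), float(sum(losses))
    x = y = auc = 0.0
    for i in order:
        x_next = x + prices[i] / total_p
        y_next = y + losses[i] / total_l
        auc += (x_next - x) * (y + y_next) / 2.0  # trapezoid under the curve
        x, y = x_next, y_next
    return 1.0 - 2.0 * auc  # = 2 * (area between diagonal and curve)

# Losses concentrated in the highest-relativity policy: a large positive Gini.
g = ordered_gini([1, 1, 1, 1], [0, 0, 0, 4], [0.5, 0.8, 1.1, 1.9])
```

When losses are spread in proportion to prices the curve coincides with the diagonal and the index is zero; an ordering that pushes losses toward high relativities bends the curve below the diagonal and yields a positive index.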

In this paper, we propose a flexible “two-part” random effects model ([35] and [40]) for correlated medical cost data. Typically, medical cost data are right-skewed, involve a substantial proportion of zero values, and may exhibit heteroscedasticity. In many cases, such data are also obtained in hierarchical form, e.g., on patients served by the same physician. The proposed model specification therefore consists of two generalized linear mixed models (GLMM), linked together by correlated random effects. Respectively, and conditionally on the random effects and covariates, we model the odds of cost being positive (Part I) using a GLMM with a logistic link and the mean cost (Part II) given that costs were actually incurred using a generalized gamma regression model with random effects and a scale parameter that is allowed to depend on covariates (cf., Manning et al., 2005). The class of generalized gamma distributions is very flexible and includes the lognormal, gamma, inverse gamma and Weibull distributions as special cases. We demonstrate how to carry out estimation using the Gaussian quadrature techniques conveniently implemented in SAS Proc NLMIXED. The proposed model is used to analyze pharmacy cost data on 56,245 adult patients clustered within 239 physicians in a mid-western U.S. managed care organization.

This paper explores the specification and testing of some modified count data models. These alternatives permit more flexible specification of the data-generating process (dgp) than do familiar count data models (e.g., the Poisson), and provide a natural means for modeling data that are over- or underdispersed by the standards of the basic models. In the cases considered, the familiar forms of the distributions result as parameter-restricted versions of the proposed modified distributions. Accordingly, score tests of the restrictions that use only the easily-computed ML estimates of the standard models are proposed. The tests proposed by Hausman (1978) and White (1982) are also considered. The tests are then applied to count data models estimated using survey microdata on beverage consumption.
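Before fitting such modified models, a quick diagnostic for over- or underdispersion is the index of dispersion; a minimal sketch (an informal check of equidispersion, not the score tests proposed in the paper):

```python
import statistics

def dispersion_index(counts):
    """Sample variance over sample mean: roughly 1 under a Poisson dgp,
    above 1 for overdispersed counts, below 1 for underdispersed counts."""
    return statistics.variance(counts) / statistics.mean(counts)

over = dispersion_index([0, 0, 1, 0, 7, 0, 9, 0])   # clumped counts
under = dispersion_index([2, 3, 2, 3, 2, 3, 2, 3])  # very regular counts
```

A formal version compares (n-1) times this ratio with a chi-squared distribution on n-1 degrees of freedom, but the raw index already points to which side of equidispersion a data set falls on.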

Building on the work of Bedford, Cooke and Joe, we show how multivariate data, which exhibit complex patterns of dependence in the tails, can be modelled using a cascade of pair-copulae, acting on two variables at a time. We use the pair-copula decomposition of a general multivariate distribution and propose a method for performing inference. The model construction is hierarchical in nature, the various levels corresponding to the incorporation of more variables in the conditioning sets, using pair-copulae as simple building blocks. Pair-copula decomposed models also represent a very flexible way to construct higher-dimensional copulae. We apply the methodology to a financial data set. Our approach represents the first step towards the development of an unsupervised algorithm that explores the space of possible pair-copula models, that also can be applied to huge data sets automatically.

We reconsider the problem of producing fair and accurate tariffs based on aggregated insurance data giving numbers of claims and total costs for the claims. Jørgensen and de Souza (Scand. Actuarial J., 1994) assumed Poisson arrival of claims and gamma distributed costs for individual claims. Jørgensen and de Souza (1994) directly modelled the risk or expected cost of claims per insured unit, μ say. They observed that the dependence of the likelihood function on μ is as for a linear exponential family, so that modelling similar to that of generalized linear models is possible. In this paper we observe that, when modelling the cost of insurance claims, it is generally necessary to model the dispersion of the costs as well as their mean. In order to model the dispersion we use the framework of double generalized linear models. Modelling the dispersion increases the precision of the estimated tariffs. The use of double generalized linear models also allows us to handle the case where only the total cost of claims and not the number of claims has been recorded. Keywords: Car insurance, claims data, compound Poisson model, exposure, generalized linear models, dispersion modelling, double generalized linear models, REML, risk theory, tarification. Address for correspondence: Dr G. K. Smyth, Division of Genetics and Bioinformatics, Walter and Eliza Hall Institute of Medical Research, Post Office, Royal Melbourne Hospital, Parkville, VIC 3050, Australia

In many biomedical applications, researchers encounter semicontinuous data where data are either continuous or zero. When the data are collected over time the observations may be correlated. Analysis of this kind of longitudinal semicontinuous data is challenging due to the presence of strong skewness in the data. A flexible class of zero-inflated models in a longitudinal setting is developed. A Bayesian approach is used to analyze longitudinal data from an acupuncture clinical trial, in which the effects of active acupuncture, sham acupuncture and standard medical care are compared on chemotherapy-induced nausea in patients who were treated for advanced breast cancer. A spline model is introduced into the linear predictor of the model to explore the possibility of a nonlinear treatment effect. Possible serial correlation between successive observations is also accounted for using Brownian motion. Thus, the approach taken in this paper provides for a more flexible modeling framework and, with the use of WinBUGS, provides for a computationally simpler approach than direct maximum likelihood. The Bayesian methodology is illustrated with the acupuncture clinical trial data.

We discuss tools for the evaluation of probabilistic forecasts and the critique of statistical models for count data.
Our proposals include a nonrandomized version of the probability integral transform, marginal calibration diagrams, and
proper scoring rules, such as the predictive deviance. In case studies, we critique count regression models for patent data, and
assess the predictive performance of Bayesian age-period-cohort models for larynx cancer counts in Germany. The toolbox
applies in Bayesian or classical and parametric or nonparametric settings and to any type of ordered discrete outcomes.

There are two broad classes of models used to address the econometric problems caused by skewness in data commonly encountered in health care applications: (1) transformation to deal with skewness (e.g., ordinary least squares (OLS) on ln(y)); and (2) alternative weighting approaches based on exponential conditional models (ECM) and generalized linear model (GLM) approaches. In this paper, we encompass these two classes of models using the three-parameter generalized gamma (GGM) distribution, which includes several of the standard alternatives as special cases: OLS with a normal error, OLS for the log-normal, the standard gamma and exponential with a log link, and the Weibull. Using simulation methods, we find the tests of identifying distributions to be robust. The GGM also provides a potentially more robust alternative estimator to the standard alternatives. An example using inpatient expenditures is also analyzed.
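The nesting that the generalized gamma exploits is easy to verify numerically. A minimal sketch of the three-parameter density (our own parameterization: shape a, Weibull-type shape c, scale), recovering the Weibull at a = 1 and the gamma at c = 1:

```python
import math

def gengamma_pdf(x, a, c, scale=1.0):
    """Generalized gamma density:
    f(x) = c / (scale * Gamma(a)) * (x/scale)^(c*a - 1) * exp(-(x/scale)^c)."""
    z = x / scale
    return c / (scale * math.gamma(a)) * z ** (c * a - 1.0) * math.exp(-z ** c)

def weibull_pdf(x, c, scale=1.0):
    """Weibull density: the a = 1 special case of the generalized gamma."""
    z = x / scale
    return (c / scale) * z ** (c - 1.0) * math.exp(-z ** c)

def gamma_pdf(x, a, scale=1.0):
    """Gamma density: the c = 1 special case of the generalized gamma."""
    z = x / scale
    return z ** (a - 1.0) * math.exp(-z) / (scale * math.gamma(a))
```

The log-normal arises as a limiting case (a to infinity with a suitable rescaling) rather than a parameter restriction, which is why tests between these families compare the shape parameters rather than simply setting them to fixed values.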

This article concerns a new joint modeling approach for correlated data analysis. Utilizing Gaussian copulas, we present a unified and flexible machinery to integrate separate one-dimensional generalized linear models (GLMs) into a joint regression analysis of continuous, discrete, and mixed correlated outcomes. This essentially leads to a multivariate analogue of the univariate GLM theory and hence an efficiency gain in the estimation of regression coefficients. The availability of joint probability models enables us to develop a full maximum likelihood inference. Numerical illustrations are focused on regression models for discrete correlated data, including multidimensional logistic regression models and a joint model for mixed normal and binary outcomes. In the simulation studies, the proposed copula-based joint model is compared to the popular generalized estimating equations, which is a moment-based estimating equation method to join univariate GLMs. Two real-world data examples are used in the illustration.

By a theorem due to Sklar, a multivariate distribution can be represented in terms of its underlying margins by binding them together using a copula function. By exploiting this representation, the "copula approach" to modelling proceeds by specifying distributions for each margin and a copula function. In this paper, a number of families of copula functions are given, with attention focusing on those that fall within the Archimedean class. Members of this class of copulas are shown to be rich in various distributional attributes that are desired when modelling. The paper then proceeds by applying the copula approach to construct models for data that may suffer from selectivity bias. The models examined are the self-selection model, the switching regime model and the double-selection model. It is shown that when models are constructed using copulas from the Archimedean class, the resulting expressions for the log-likelihood and score facilitate maximum likelihood estimation. The literature on selectivity modelling is almost exclusively based on multivariate normal specifications. The copula approach permits selection modelling based on multivariate non-normality. Examples of self-selection models for labour supply and for duration of hospitalization illustrate the application of the copula approach to modelling. Copyright Royal Economic Society, 2003
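The tractability of the Archimedean class can be illustrated with the Clayton copula, whose conditional distribution inverts in closed form; a minimal sampler sketch (names and parameter values are ours, not from the paper):

```python
import random

def sample_clayton(theta, n, rng):
    """Sample n pairs from the Clayton copula
    C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta), theta > 0,
    by inverting the conditional cdf of V given U = u."""
    pairs = []
    for _ in range(n):
        u, w = 1.0 - rng.random(), 1.0 - rng.random()  # uniforms on (0, 1]
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs

rng = random.Random(7)
pairs = sample_clayton(2.0, 5000, rng)
# Kendall's tau for Clayton is theta / (theta + 2) = 0.5 here,
# with dependence concentrated in the lower tail.
```

The same two-step recipe (draw u, then invert the conditional cdf) works for any Archimedean copula whose generator has a tractable inverse, which is precisely what makes the class convenient for likelihood-based selection modelling.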

In this paper, we consider "heavy-tailed" data, that is, data where extreme values are likely to occur. Heavy-tailed data have been analyzed using flexible distributions such as the generalized beta of the second kind, the generalized gamma and the Burr. These distributions allow us to handle data with either positive or negative skewness, as well as heavy tails. Moreover, it has been shown that they can also accommodate cross-sectional regression models by allowing functions of explanatory variables to serve as distribution parameters. The objective of this paper is to extend this literature to accommodate longitudinal data, where one observes repeated observations of cross-sectional data. Specifically, we use copulas to model the dependencies over time, and heavy-tailed regression models to represent the marginal distributions. We also introduce model exploration techniques to help us with the initial choice of the copula and a goodness-of-fit test of elliptical copulas for model validation. In a longitudinal data context, we argue that elliptical copulas will be typically preferred to the Archimedean copulas. To illustrate our methods, Wisconsin nursing homes utilization data from 1995 to 2001 are analyzed. These data exhibit long tails and negative skewness and so help us to motivate the need for our new techniques. We find that time and the nursing home facility size as measured through the number of beds and square footage are important predictors of future utilization. Moreover, using our parametric model, we provide not only point predictions but also an entire predictive distribution.

In this paper a lifestyle perspective is taken to study the various influences on four health related behaviours, i.e. cigarette smoking, diet behaviour, alcohol use and exercise. Of interest is how these behaviours are distributed over four socio-demographic indicators, i.e. the respondents' gender, educational level, employment status and age. As a third factor the respondent's city of residence, Varna in Bulgaria and Glasgow and Edinburgh in Scotland, is taken into consideration. Data collected by telephone from 268 respondents from Varna, 827 respondents from Glasgow and 275 respondents from Edinburgh are considered. Large differences in the prevalence of health behaviours are found, with respondents in Varna behaving least healthily and respondents in Edinburgh behaving most healthily, and this is also true at sub-group level. Alcohol use is the exception, and here the opposite relationship between health behaviour and city of residence is found. Females generally behave more healthily than males; however, this pattern is not consistent for all health behaviours. Better educated and employed respondents behave in a more healthy way compared with less well educated and unemployed respondents, and this is true in all three cities, with the difference being particularly large in Scotland. An 'economic' and a 'self-care' explanation are put forward to explain the patterns observed but both explanations are found wanting. It is proposed that integrating various theoretical models is necessary to further develop our understanding of health lifestyle behaviour.

Joint regression analysis of correlated data using Gaussian copulas

- P. X.-K. Song
- M. Li
- Y. Yuan

Song, P.X.-K., Li, M., Yuan, Y., 2009. Joint regression analysis of correlated data using
Gaussian copulas. Biometrics 65 (1), 60-68.

Testing adverse selection with two-dimensional information: evidence from the Singapore auto insurance market

- Shi