
The Use of Annual Mileage as a Rating Variable


Abstract

Auto insurance companies must adapt to ever-evolving regulations and technological progress. Several variables commonly used to predict accident rates, such as gender and territory, are being questioned by regulators. Insurers are pressured to find new variables that predict accidents more accurately and are socially acceptable. Annual mileage seems an ideal candidate. Recent developments in new technologies should induce insurance carriers to explore ways to introduce mileage-based insurance premiums. We use the unique database of a major insurer in Taiwan to investigate whether annual mileage should be introduced as a rating variable in auto third-party liability insurance. We find that annual mileage is an extremely powerful predictor of the number of at-fault claims. The inclusion of mileage as a new variable should, however, not take place at the expense of bonus-malus systems; rather, the information contained in the bonus-malus premium level complements the value of annual mileage. An accurate rating system should therefore include annual mileage and bonus-malus as the two main building blocks, possibly supplemented by other variables such as age, territory and engine cubic capacity. While Taiwan has specific characteristics (high traffic density, a mild bonus-malus system and limited compulsory auto coverage), our results are so strong that we can confidently conjecture that they extend to all developed nations.
Keywords: Auto liability insurance, Rating variables, Annual mileage
Auto insurers, in order to remain competitive in risk selection and pricing, are constantly seeking better ways to measure risk. To this end, they adopt numerous rating variables (and, when these are unavailable, proxy variables) to better gauge how risky each particular customer is.
Astin Bulletin, page 1 of 31. doi: 10.1017/asb.2015.25 © 2015 by Astin Bulletin. All rights reserved.
Auto insurers typically use a large number of variables in their ratings, including age, sex, marital status of principal driver, make, model, use of car, territory, moving violations, etc. Other factors that may improve risk classification are not used due to regulatory restrictions or practical reasons; a factor may be too costly to observe credibly, or socially unacceptable. Consequently, in most developed countries, insurers have implemented bonus-malus systems (BMS), which modify the premium according to past claims history. One of the main goals of BMS is to reduce adverse selection by indirectly capturing information that cannot be taken into account explicitly, such as respect for the driving code, alcohol use and mileage driven.
One potential classification variable that has not been widely used so far is annual mileage. It is intuitively clear that those who drive more will have more auto accidents, since each extra mile spent on the road creates a small additional chance of an accident. However, insurers have been reluctant to use annual mileage because they cannot verify policyholders’ statements and because odometers are relatively easy to tamper with. This has led them to use proxy variables such as the use of the car (e.g. personal, commuting or business) or the distance between home and work. Butler (2006) argues that no fewer than 12 widely used rating variables can be considered proxies for odometer miles: gender, car age, previous accidents at-fault and not-at-fault, credit score, postal code, income, military rank, existence of a prior insurer, premium payment by installments, years with same employer, collision deductible and tort rights.
This reliance on proxy variables may change with the development of new technologies such as telematics, on-board computers, sophisticated GPS transmitters and tamper-resistant odometers, and with their fast decrease in cost. Thanks to these advances, many auto insurers throughout the world have started to adopt annual mileage among their rating variables. As data recorded from GPS become available to actuarial researchers, opportunities to study previously unavailable variables will arise. Pioneering research using new variables includes Ayuso et al. (2014) and Paefgen et al. (2013, 2014). Ayuso et al. (2014) analyze the driving patterns of 15,940 Spanish drivers under the age of 30 years; besides the daily distance travelled, they were able to record the percentage of total kilometers driven in urban areas, at night, or exceeding speed limits. They showed that the time until the first crash is shorter for drivers who drive at night, who speed, or who are inexperienced, among other results. For 1,567 vehicles, Paefgen et al. (2013, 2014) studied the risk of an accident as a function of new variables such as the time of day, the day of the week and speed intervals, and discovered a non-linear relationship between annual mileage and claim frequencies.
While there is ample evidence that annual mileage correlates positively with claim rates (Ferreira and Minikel (2010), Jovanis and Chang (1986), Lemaire (1985), Litman (2011), Lourens et al. (1999), Progressive Insurance (2005), among others), there is a dearth of research in the actuarial literature comparing the accuracy of mileage as a rating variable with that of traditional pricing factors. A notable exception seems to be Ferreira and Minikel (2013), who study over three million individual car-years observed in 2006 in Massachusetts. Poisson and linear regression models are used to explain the pure premium as a function of annual mileage and two traditional rating factors: territory (six zones) and class (adults, senior citizens, business use, years of driving experience). The main conclusions are that, while mileage is a significant predictor of accident risk, it is inferior to the other rating factors if used alone; mileage can, however, substantially improve rating accuracy if used in conjunction with other variables.
In this research, we investigate whether annual mileage is a potential rating variable using a unique database originating from Taiwan. We were able to merge the annual mileage recorded during routine maintenance and oil changes in a large network of specialized shops with auto insurance data collected from the largest insurer operating in Taiwan. Our research extends the existing literature in several significant ways: (a) we use a large database, comprising over a quarter million policy-years; (b) we study claim severity in addition to claim frequency; (c) we include a large set of traditional classification variables as controls: gender, age, marital status of policyholder, vehicle age, type, use, engine cubic capacity, territory, urban/rural driving. We also use the BMS level of each policyholder, a variable that several studies (Lemaire (1985), among others) consider to be the best predictor of future accidents. One important rationale for the good predictive power of BMS levels has been mileage: indeed, BMS coefficients may partially reflect unobserved mileage driven. In addition, we use negative binomial regression models to evaluate the relationship between claim frequency and mileage, and linear regression to examine the relationship between claim severity and mileage.
By providing empirical evidence of a strong relationship between annual mileage and claim counts and a positive relationship between mileage and claim severity, this paper provides ample justification for the use of mileage as a rating variable. The remainder of the article proceeds as follows. Section 2 discusses criteria for auto insurance rating variables and evaluates annual mileage in light of these requirements. The data are presented in section 3. Section 4 presents the main results of regressions performed on the claim count and severity distributions. The robustness of results is discussed in section 5. Section 6
2.1. Rating criteria for “fair” discrimination
Auto insurers openly practice discrimination in underwriting and pricing. Competition among insurers and adverse selection among policyholders trigger “fishing for good risks”, the use of a large number of classification variables shown to affect claim frequency and severity. As long as regulators allow them, insurers use variables such as age, gender, marital status, territory (postal code), years licensed, credit score and occupation of the main driver; good-student discounts; driver training; participation in a traffic safety program; restricted usage; type, model, engine cubic capacity, horse power, age and use of the car; annual mileage; garage ownership; premium payment frequency; as well as past claims and moving violations. This process of segmentation, the subdivision of a portfolio of drivers into a large number of homogeneous rating cells, only ends when the cost of including more risk factors exceeds the profit that the additional classification would create, or when regulators rule out new
Insurers prefer total freedom in selecting risk factors, so that they can charge appropriate premiums to all groups based on risk differentials. They claim that risk classification creates incentives for insureds to minimize risks. Accurate risk classification and incentives for risk reduction are the main reasons why society lets insurers discriminate. Indeed, research consistently suggests that restrictions on risk classification result in cross-subsidization, with low-risk individuals choosing to reduce their coverage and more high-risk drivers on the road. As price subsidies weaken the link between risk and premiums, consumers’ incentives for loss prevention are diminished. Insurance companies lose incentives to control costs and tend to send more applicants to assigned risk pools. As a result, whenever regulation prohibits or reduces the role of a rating variable, the resulting marginal premium decrease for high-risk drivers does not compensate for the increase for low risks, and overall premiums tend to increase (Blackmon and Zeckhauser, 1991; Schwarze and Wein, 2005; Brown et al., 2007; Regan et al., 2008; Weiss et al., 2010; Derrig and Tennyson, 2011; Sass and Siegfried, 2012).
Despite this evidence, society, represented by legislators and insurance regulators, has limited the types of discrimination insurers are allowed to practice. Indeed, in recent years, certain classifiers, including race, gender, age and territory, have been severely restricted or outright prohibited. For example, despite massive and undisputed proof that females cause fewer accidents, the Court of Justice of the European Union ruled that insurance contracts entered into on or after December 21, 2012, cannot price males and females differently. The use of gender is also prohibited in ten U.S. states, and limited in 22 others (Avraham et al., 2013). Age is not used in six Canadian provinces and nine U.S. states, with strict restrictions in eleven other states. Two U.S. states ban the use of postal code in all property/casualty contracts. Other states limit the number of territorial rating cells that can be used, restrict premium ratios across two contiguous territories or between the highest-rated and lowest-rated districts, or force territory to be a secondary rating factor (Avraham et al. (2013), Brown et al. (2007), Derrig and Tennyson (2011), Harrington (1991), Jaffe and Russell (2001)).
“Discrimination” can be viewed in a positive or negative way. It may mean nothing more than recognizing a difference between groups, the cornerstone of insurance pricing, or it can be construed as a prejudice, asserting that certain groups are morally inferior and undeserving of equal treatment. Everyone will agree that insurers should be permitted to deny coverage or charge a higher premium to drivers who have been convicted of drunk driving. Few would disagree that the use of race in rating should be disallowed, and even viewed as repugnant, despite significant differences in accident costs, as race is a non-causal factor, not under the control of the insured, and historically linked to unfair treatment. What distinguishes “fair” discrimination from “unfair”? When should discrimination be deemed illegal? Which tests should be used to determine if a rating variable is socially acceptable?
The American Academy of Actuaries (1980), Avraham et al. (2013), Gaulding (1995) and, most notably, Kelly and Nielson (2006) have presented a variety of tests that ideal risk predictors should pass. Requirements can be subdivided into actuarial, operational, social and legal criteria.
2.1.1. Actuarial criteria. A classification variable is considered to be actuarially fair if it is accurate (the most important criterion, requiring a strong relationship between variable and claims), credible (sufficient data exist for all rating cells), reliable over time and homogeneous within cells.
The variables that have been questioned (race, age, gender, territory) easily pass the accuracy and reliability tests. There are some credibility issues for age and territory, as few data are available for very old drivers and some territories are sparsely populated. Age is subject to much criticism on the homogeneity issue: young and elderly drivers show greater heterogeneity of skills, driving abilities and accident rates.
2.1.2. Operational criteria. For each insured, the value of the variable must be objective (different underwriters will always classify in the same way), assessed at little cost, and not easy to manipulate. There must be an intuitive relationship with claim rates. Discontinuities between groups should be
Race, age and gender are objective, easily measured at no cost, and cannot be manipulated. The relationship with claim rates is not easily demonstrated: it is not evident that driving ability is a clear-cut function of age and gender. Age fails the continuity test, as there is often a big drop in premiums at the age of 25 years for females, and 30 years for males. Manipulation of territory constitutes one of the main causes of premium fraud, as it is not uncommon for car owners to register their car in a rural area when they actually live in a city; such deception is costly to detect, requiring insurers to patrol downtown areas during consecutive nights to identify out-of-town registrations.
As an example of a variable failing the cost criterion, an in-depth psychological profile could reliably predict accident risk, but the underwriting cost would be prohibitive.
2.1.3. Social criteria. Social acceptability is an important test for implementing a rating variable, with the main requirements being privacy, controllability, affordability/availability and causality. Risk classification is easier for the public to accept if there is an intuitive and demonstrable cause-and-effect relationship between the variable and claim rates, and if individuals are encouraged to take action to reduce their
For age, gender and territory, privacy is generally not an issue, as individuals rarely mind revealing their age, sex or where they live. The other three requirements are probably at the origin of these variables’ exclusion in many jurisdictions. Age and gender are obviously not under the control of policyholders. Unlike with variables such as miles driven, model of car or traffic violations, drivers can take no action to reduce their premium, and thus have no incentive to drive more safely. Affordability is an issue, as the drivers being penalized, the young and the elderly, are precisely those who can generally ill afford to pay high premiums, and who, more often than others, have difficulties acquiring insurance. Causality is a major issue. The link between age and claims is indirect. Causality requires much more than correlation between the variable and claim rates. Younger drivers have high accident rates; this is, however, not due to their age per se, but rather to risk-taking behavior, such as driving at night, under the influence of alcohol or drugs, or at excessive speeds, often without seatbelts buckled. Claim frequencies increase for the elderly, as some older drivers begin to lose their sensory skills (vision, hearing), their cognitive skills (memory, mental agility, processing of sensory information) and motor functions (muscle strength, flexibility, endurance); moreover, some medications impair driving ability. However, the cause-and-effect relationship is
A variable commonly used in some European countries that clearly fails the causality test is garage ownership. While owners of a private garage are safer drivers, there is no clear explanation why this should be the case, except through correlation with a third variable, possibly income or a caring attitude towards the car. Similarly, the “good-student discount” used by many U.S. insurers is contentious due to the lack of causality.
Variables like Internet browsing, purchasing patterns, genetic information or sexual orientation would clearly violate the privacy requirement.
2.1.4. Legal criteria. Insurers should be prohibited from using classifications that are socially suspect. According to the U.S. Supreme Court, suspect classifications have four factors in common: a history of discrimination against the group; the characteristics that distinguish the group have no relationship to its ability to contribute to society; the characteristics are immutable; the subject class lacks political power. Any classification variable that perpetuates or reinforces social inequalities can be considered suspect, as can any characteristic associated with historical discrimination (Gaulding, 1995). The Supreme Court specifically characterized race, religion and national origin as definitely suspect factors, and gender and illegitimacy of birth as quasi-suspect (Avraham et al., 2013).
While not going as far as prohibiting the use of age, gender or marital status, the Canadian Supreme Court has asked insurers to at least explore whether better, non-discriminating variables exist (Kelly and Nielson, 2006).
Other variables currently used by insurers could be questioned given these legal criteria. Territory, credit scores and premium payment frequency, for instance, can be challenged as proxies for the more objectionable classifiers of income and race.
2.2. The evaluation of mileage as a rating variable
With the use of age, gender and territory prohibited or severely curtailed, and possibly other variables such as credit score and premium payment frequency next in line, insurers need to find new variables to maintain accuracy and possibly increase drivers’ incentives to reduce risk. Annual mileage is an obvious candidate that has been suggested in numerous papers, dating as far back as Bailey and Simon’s seminal paper (1960).
Annual mileage easily passes many of the criteria developed in section 2.1. It passes all actuarial tests. Many papers have established a strong relationship between annual mileage and claim frequencies, a relationship that has remained stable over time. Indeed, more time spent on the road translates into more traffic incidents and situations leading to claims. The relationship is, however, less than proportional: doubling annual mileage increases the claim frequency, but does not double it, possibly because high-mileage users are more experienced or drive more on low-risk highways rather than in high-risk urban areas. Despite this, the variable is highly accurate as a predictor of claims, as it depends on individuals’ own behavior and is directly based on exposure to risk, and not on the behavior of groups of people such as single males or inhabitants of a given township. In addition, within-cell heterogeneity is acceptable.
Mileage also passes several operational tests. It is a numerical, hence objective, variable. Rating discontinuities can be minimized as insurers are free to subdivide their portfolios into many mileage rating classes. Above all, there is an obvious, intuitive relationship between mileage and claims, since each mile a car travels creates a small chance of an accident.
Mileage is a socially acceptable variable, mostly because of controllability: drivers have a strong incentive to lower their accident rate by reducing their driving. It improves fairness by shifting weight in pricing towards an individually controllable factor and away from involuntary membership in a group.
Causality is obvious: most policyholders should accept the idea that increased driving raises the chances of an accident. There should be few, if any, legal challenges to annual mileage, as this variable is not socially suspect in any way. High road users do not constitute a group that had to face historical
Yet, in practice, mileage is a variable that is hardly used. Some insurers use one or two cut-off mileage points, with small surcharges and discounts. The main reason for its infrequent use is that the variable, while passing a majority of criteria, badly fails other requirements. Until the advent of GPS and on-board computers, significant moral hazard was present, as drivers had a strong and obvious incentive to under-report mileage. Incorrect mileage has been reported in numerous papers, especially when there is a financial incentive to under-report (Janke, 1991; Langford et al., 2008; Staplin et al., 2008). Odometers were easily tampered with, and the cost of controlling this manipulation was prohibitive, requiring for instance inspector visits to policyholders’ domiciles, agreements with repair shops to report odometer readings, or having policyholders forward a picture of their odometer, all of which led to complaints about privacy. As a result, understatement of annual mileage is one of the major sources of auto insurance fraud.
This situation is rapidly changing, due to the fast pace of introduction of telematics, on-board computers and GPS transmitters, and the decreasing price of these new technologies. For instance, in May 2012, a large company introduced a voluntary program in Pennsylvania to monitor mileage using a telematics device. Using the catchy slogan “Just have your car send us your driving habits”, the rating plan involves the use of a transmitter that comes factory-installed in all new vehicles sold by the largest U.S. car manufacturer, or can be professionally installed on existing cars at a cost of $100. A required subscription costing $200 per year provides automatic crash response, emergency services, roadside and stolen vehicle assistance, and diagnostic and maintenance information. Odometer readings are recorded and e-mailed monthly to the subscriber and the insurer. Premium discounts are offered at each renewal, for instance 32% for 3,500 annual miles, 13% for 11,000 miles and 5% for 15,000 miles.
Other companies use telematics to monitor additional driving habits, such as the use of the car between midnight and 4 a.m., speeds over 80 miles per hour, acceleration and braking behavior and the type of roads travelled (urban, country, motorway).
While customer tracking can be perceived as an invasion of privacy (policyholders may be leery of allowing their insurance company to track their location and driving hours), and affordability may remain an issue for some categories of drivers, the two main features of telematics devices, namely that drivers cannot manipulate them and that their cost is generally low, lay to rest the main criticisms against using mileage in rating.
3. THE DATA
3.1. Background
Taiwan has a land area of 32,260 sq. km, the size of Belgium, and had a population of 23,360,000 in July 2014 (CIA, 2015). Two thirds of the country consist mostly of rugged mountains, leading to a very high population concentration in the plains. Due to nightmarish driving conditions (high population density, 6.5 million motorcycles sharing the road with cars, unavailable parking in cities), only 4,826,000 non-commercial sedans were registered in 2012, a very low number for an affluent country, with a GDP per capita (corrected for purchasing power) of $43,600, higher than that of France (Taiwan Insurance Institute, 2014). It is rare for young individuals to own a car. Very few couples own two cars.
Automobile insurance is organized somewhat differently than in most western countries. Compulsory liability only covers bodily injury losses up to a limit, currently NT$ 2,200,000 per person (US$ 1 = NT$ 31.27 as of April 2015). The small increase of the limit during our observation period, from NT$ 1,600,000 to NT$ 1,700,000, is not expected to impact our study, as the vast majority of policyholders purchase coverage above the limit. Voluntary policies provide additional third-party bodily injury and property damage coverage. Our data pool all of these policies, which are subject to the same rating variables and BMS. First-party collision coverage is also available, but is not considered in this study, as another BMS is used.
Only three a priori classification variables are used by insurers for rating purposes: use of car (personal/business), gender and driver age (<20, 20–25, 25–30, 30–60, >60 years). As females receive a discount, a fact well known to Taiwanese households, it is common practice for couples to register their car to the female driver. As a result, while the vast majority of drivers on the road are males, insurers report 70% of female drivers in their
The BMS has no upper limit in the malus zone. However, no driver in our sample pays more than a 60% surcharge. Therefore, we can model the Taiwanese BMS as a 10-class Markov chain, with premium levels 70, 74, 82, 100, 110, 120, 130, 140, 150 and 160. New drivers start in class 4, at level 100. Each claim-free year is rewarded by a one-class discount. Each claim is penalized by three classes (Taiwan Insurance Institute, 2015).
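The transition rules just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the Poisson claim-count assumption used to build the transition matrix is made here purely for illustration (the claim frequency passed in is a placeholder, not a figure from the study).

```python
import math

# Premium levels of the 10 classes, as listed in the text (class 1 = level 70).
LEVELS = [70, 74, 82, 100, 110, 120, 130, 140, 150, 160]
START_CLASS = 4  # new drivers start at level 100


def premium_level(bms_class):
    """Premium level (in % of the base premium) for a class numbered 1..10."""
    return LEVELS[bms_class - 1]


def next_class(current, claims):
    """One claim-free year: down one class. Each claim: up three classes.
    Classes are capped at 1 (best) and 10 (worst)."""
    if claims == 0:
        return max(1, current - 1)
    return min(len(LEVELS), current + 3 * claims)


def transition_matrix(lam):
    """One-year transition matrix, ASSUMING Poisson(lam) claim counts.
    The Poisson assumption and the value of lam are illustrative only."""
    n = len(LEVELS)
    matrix = [[0.0] * n for _ in range(n)]
    for c in range(1, n + 1):
        remaining = 1.0  # probability mass not yet assigned
        k = 0
        while True:
            dest = next_class(c, k)
            if k > 0 and dest == n:
                # Every higher claim count also ends in the top class.
                matrix[c - 1][n - 1] += remaining
                break
            p = math.exp(-lam) * lam ** k / math.factorial(k)
            matrix[c - 1][dest - 1] += p
            remaining -= p
            k += 1
    return matrix
```

Under these rules, a new driver (class 4, level 100) who stays claim-free for a year moves to class 3 (level 82), while a single claim sends the same driver to class 7 (level 130).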
3.2. Data
Our large database (over a quarter million policy-years) was produced by pooling claim and policy information from the largest auto insurer operating in Taiwan (market share: 20%) with maintenance records from a chain of repair shops operated by the largest car manufacturer (market share: 38%). Repair records resulting from an accident were excluded, to avoid introducing a bias in the database. Besides the number and severity of claims, insurance variables include gender, age and marital status of the main driver, territory, use of car, BMS class, engine cubic capacity and date of first registration of the car. As odometer readings are systematically collected by repair shops during each visit, interpolation or extrapolation of odometer values between visits allows us to estimate annual mileage. Data are available for seven policy years, 2001 to 2007. All policyholders purchased the compulsory policy; 88.82% bought additional voluntary insurance.
All claims, whether reported under the compulsory contract or one of the voluntary policies, are recorded. A claim may trigger a payment under a compulsory and/or a voluntary policy. To avoid double counting, claims reported on the same date under two or three policies are counted as a single claim. Some claims may be missed, for instance a property-damage-only claim if the driver did not purchase the corresponding voluntary coverage, although this is unlikely since nearly 89% of drivers in our sample bought it. This may raise a problem if high-mileage users are more prone to purchase additional insurance. If this is the case, more claims will be missed among the low-mileage drivers, and the impact of mileage on claim frequencies may be somewhat overstated. Such behavior is well known in collision coverage, but fortunately does not take place in our third-party sample, as shown in Table 1. (Policies are ranked by increasing mileage, and subdivided into ten equal-sized

TABLE 1
Mileage class (1 = lowest)   1      2      3      4      5      6      7      8      9      10
% Voluntary                  87.86  87.94  88.16  89.08  89.38  88.93  89.02  89.14  89.62  89.12
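The double-counting rule stated above (claims reported on the same date under several policies count once) can be sketched as follows. The record layout and field names are hypothetical; the source does not describe the insurer's actual data format.

```python
from collections import namedtuple

# Hypothetical claim record; field names are illustrative only.
Claim = namedtuple("Claim", "policyholder_id date policy_type amount")


def count_claims(records):
    """Count claims per policyholder, treating all records that share a
    policyholder and a date as one claim, whatever policy they fall under."""
    seen = set()
    counts = {}
    for rec in records:
        key = (rec.policyholder_id, rec.date)
        if key in seen:
            continue  # same event already counted under another policy
        seen.add(key)
        counts[rec.policyholder_id] = counts.get(rec.policyholder_id, 0) + 1
    return counts


# Made-up example: driver "A" has one accident paid under two policies,
# plus a second, separate accident later in the year.
example = [
    Claim("A", "2003-05-01", "compulsory", 1000),
    Claim("A", "2003-05-01", "voluntary", 500),
    Claim("A", "2003-09-12", "voluntary", 300),
    Claim("B", "2003-05-01", "compulsory", 200),
]
example_counts = count_claims(example)
```

Keying on the pair (policyholder, date) is the simplest reading of the rule; two genuinely distinct accidents on the same day would be merged under it.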
For all policyholders in our sample, the values of the following variables are
Gender is a classification variable used in rating. Only 29.49% of drivers are registered as males, a clear indication that policyholders take advantage of their knowledge of differential rates to get a premium discount. So it is all but certain that policies registered in the “female driver” category include a large number of cars owned by couples, often driven by
Age is also used in rating. While the company uses five age categories for rating purposes (<20, 20–25, 25–30, 30–60, >60 years), fewer than 1% of drivers are between ages 20 and 25 years, and only a handful are between 18 (the minimum driving age) and 20 years. Consequently, we combine the first three age groups and end up with three classes: under 30 (7.38% of drivers), 30–60 (88.76%) and over 60 (3.86%) years.
Bonus-malus premium level, from 70 to 160, as described in section 3.1.
Vehicle type and use. Since 97.9% of cars are registered as non-commercial sedans, we discard the remaining categories (business use, trucks, passenger coaches, taxis).
FIGURE 1: Distribution of daily kilometers driven. (Color online)
Mileage is expressed in kilometers driven per day. Repair shop technicians
know the date the car was put in service, which allows for a first
estimate of mileage upon the first oil change. The date and odometer
reading are recorded on each visit to the shop. Extrapolation or interpolation
then yields an estimate of annual mileage. For instance, assume
a driver has three visits to the repair shop. His odometer readings are
13,200 on October 1, 2001, 24,400 on April 1, 2002 (182 days later, 91
days into 2002, 274 days before January 1, 2003), and 37,400 on January
15, 2003 (289 days later, 15 days into 2003). The estimate of the number
of kilometers driven in 2002 is
(24,400 − 13,200) × 91/182 + (37,400 − 24,400) × 274/289 ≈ 17,925.
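The odometer example can be checked with a short calculation. This is only a sketch: it assumes linear interpolation between consecutive shop visits (our reading of the day counts quoted in the example), and the function name and tuple layout are illustrative.

```python
def km_in_year(segments):
    """Sum the kilometers attributed to the target year.

    Each segment is (km_start, km_end, days_between_visits, days_in_target_year).
    Kilometers are assumed to accrue at a constant daily rate between two visits.
    """
    total = 0.0
    for km_start, km_end, span_days, days_in_year in segments:
        daily_rate = (km_end - km_start) / span_days
        total += daily_rate * days_in_year
    return total


# Day counts are the ones quoted in the example, not recomputed from the dates.
segments_2002 = [
    (13_200, 24_400, 182, 91),   # Oct 1, 2001 -> Apr 1, 2002; 91 days fall in 2002
    (24_400, 37_400, 289, 274),  # Apr 1, 2002 -> Jan 15, 2003; 274 days fall in 2002
]
annual_km = km_in_year(segments_2002)
daily_km = annual_km / 365
print(round(annual_km, 2), round(daily_km, 2))
```

Under these assumptions the driver is credited with roughly 17,925 km for 2002, or about 49 km per day.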
A visual inspection of the data shows numerous instances of obvious recording
mistakes, with mileages like 44,581 km or +24,833 km. Truncating the
upper and lower 1% of the data seems a conservative approach, eliminating all
unrealistic figures. The truncation points for daily mileage vary across policy
years, averaging 7.43 km at the lower end and 133.37 km at the upper end. After
eliminating business users, trucks, and the unreasonable mileage figures, the
total sample size is 259,029. Figure 1 shows the distribution of daily kilometers
driven in our sample. The average annual number of kilometers driven per car
is 16,167.
Several other variables are recorded for classification purposes.
Marital status. 92.03% of policy owners are married.
Car age. 26.45% of cars in the sample are under one year of age; 26.19%
are between ages 1 and 2 years; 18.4% between ages 2 and 3 years;
12.38% between ages 3 and 4 years; 8.05% between ages 4 and 5 years;
and 8.53% are older.
City. 49.99% of our sample drivers live in an urban area.
Territory. 47.45% of cars are registered in the north of Taiwan; 30.16%
in the south; 17.31% in central Taiwan and 5.08% in the eastern part of
the island.
Engine cubic capacity. The engine capacity is under 1,800 cc for 65.80%
of cars; between 1,800 cc and 2,000 cc for 28.92% of cars and above
2,000 cc for the remaining 5.28%.
TABLE 2: Distribution of the number of claims.
Number of Claims    Number of Policies    Percentage
0                   247,955               95.72%
1                   8,222                 3.17%
2                   2,689                 1.04%
3                   136                   0.05%
4                   25                    0.01%
5                   2                     0.00%
Total               259,029               100%
3.3. Summary statistics
Table 2 provides the distribution of the number of claims for the entire sample
of 259,029 policies. As is common in observed claim count distributions in auto
insurance, the sample variance (0.0768) is substantially larger than the sample
mean (0.0545). This over-dispersion calls for negative binomial regression, rather
than Poisson regression, in the statistical analysis. Figure 2 graphs the
distribution of the natural logarithm of claim amounts.
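The mean and variance quoted for the claim count distribution can be recomputed from the policy counts in Table 2. The sketch below uses the population variance; with more than 250,000 policies the sample variance is indistinguishable at four decimals. Variable names are illustrative.

```python
# Claim-count distribution from Table 2: number of claims -> number of policies.
claim_counts = {0: 247_955, 1: 8_222, 2: 2_689, 3: 136, 4: 25, 5: 2}

n_policies = sum(claim_counts.values())
mean = sum(k * n for k, n in claim_counts.items()) / n_policies
second_moment = sum(k * k * n for k, n in claim_counts.items()) / n_policies
variance = second_moment - mean ** 2   # population variance
dispersion_ratio = variance / mean     # would be close to 1 under a Poisson model

print(n_policies, round(mean, 4), round(variance, 4), round(dispersion_ratio, 2))
```

A variance-to-mean ratio of about 1.4 is the over-dispersion that motivates the negative binomial specification.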
The 259,029 policies are subdivided into ten mileage deciles. Table 3 presents the
mileage limits and mean number of kilometers driven in each mileage class
(averaged across years), as well as the means and variances of the ten claim count
distributions, and the means and variances of the logarithm of claim severities,
expressed in U.S. dollars. Figure 3a plots the claim frequencies for the ten
mileage classes and the 95% confidence intervals of the point estimates of the
average claim frequency. Figure 3b graphs log (claim severity).
As expected, claim frequencies increase with mileage, but in a less-than-proportional
way. Drivers in the top mileage decile have about 2.4 times as
many accidents as those in the bottom decile (0.0843 versus 0.0349), although
they drive more than six times as many kilometers per day. The variance of the
claim number increases with mileage. There is no overlap between the confidence
intervals for the upper and lower deciles, confirming a strong, significant,
positive relationship between annual mileage and accidents. The fact that claim
frequencies in deciles 1 and 2 are nearly identical provides support for the
“low mileage bias” (Langford et al., 2008; Staplin et al., 2008), the observation
that infrequent users of their cars, mostly elderly motorists, have a higher
per-mile accident rate, as they mostly drive in congested urban areas.
TABLE 3: Mileage class limits, claim frequency and claim severity by mileage decile.
Mileage   Average Class   Mean Daily   Claim       Variance of    Log (Claim        Variance of
Decile    Limit           km Driven    Frequency   Claim Number   Severity) (USD)   Log Severity
1         6.57–18.80      14.45        0.0349      0.0499         6.2139            1.3037
2         18.80–24.90     21.95        0.0354      0.0496         6.1718            1.3354
3         24.90–30.13     27.59        0.0416      0.0575         6.2300            1.3502
4         30.13–34.91     32.52        0.0491      0.0677         6.1985            1.2318
5         34.91–39.90     37.37        0.0501      0.0706         6.2443            1.3645
6         39.90–45.68     42.74        0.0562      0.0789         6.2975            1.3037
7         45.68–52.63     49.02        0.0578      0.0807         6.3057            1.4142
8         52.63–61.62     56.92        0.0631      0.0873         6.2637            1.3373
9         61.62–75.73     67.96        0.0726      0.1013         6.3577            1.3722
10        75.73+          92.41        0.0843      0.1219         6.3794            1.3076
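The comparison between the bottom and top deciles can be reproduced from the Table 3 figures. The sketch below assumes that each decile contains one tenth of the 259,029 policies and uses a normal approximation for the 95% interval; whether the plotted intervals were computed this way is not stated, so the numbers are indicative only.

```python
import math

N_PER_DECILE = 259_029 / 10  # assumption: ten equal-sized classes

# (claim frequency, variance of claim number) for deciles 1 and 10, from Table 3.
deciles = {1: (0.0349, 0.0499), 10: (0.0843, 0.1219)}


def confidence_interval(mean, variance, n, z=1.96):
    """Normal-approximation interval for a mean estimated from n policies."""
    half_width = z * math.sqrt(variance / n)
    return mean - half_width, mean + half_width


ci = {d: confidence_interval(m, v, N_PER_DECILE) for d, (m, v) in deciles.items()}
frequency_ratio = deciles[10][0] / deciles[1][0]  # claims, top vs. bottom decile
mileage_ratio = 92.41 / 14.45                     # mean daily km, top vs. bottom

print(ci)
print(round(frequency_ratio, 2), round(mileage_ratio, 2))
```

The two intervals are far apart, and the claim frequency ratio (about 2.4) is well below the mileage ratio (about 6.4), which is the less-than-proportional pattern described in the text.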
FIGURE 2: Distribution of claim severity (natural log of claim amount). (Color online)
For all available categorical variables, Table 4 provides the percentage of
policyholders in each category, as well as claim frequencies and mean
severities. Claim frequency differences across categories are smaller than the
differences found across mileage classes in Table 3: means range from 0.0424 to
0.0742 in Table 4, whereas the mileage means range from 0.0349 to 0.0843.
Females are at fault in more accidents (female claim frequency:
0.0567; male: 0.0493), but these accidents are on average less costly, which may
justify the discount awarded to females.
FIGURE 3: Claim frequencies (3a) and severities (3b) as a function of daily kilometers, with 95% confidence
intervals of point estimates of mileage decile average claim frequency and severity. (Color online)
TABLE 4: Percentage of policyholders, claim frequency and mean claim severity by category.
Category   Variable       Percentage (%)   Claim Frequency (%)   Claim Severity (USD)
Age        Age <30        7.38%            6.57%                 1,644.49
           Age 30–60      88.79%           5.37%                 1,364.26
           Age 60+        3.83%            5.02%                 1,384.58
Gender     Female         70.51%           5.67%                 1,331.71
           Male           29.49%           4.93%                 1,547.11
Married    Married        92.05%           5.41%                 1,388.32
           Not Married    7.95%            5.91%                 1,402.50
Car Age    Car age 0      26.45%           7.42%                 1,569.13
           Car age 1      26.19%           5.24%                 1,331.20
           Car age 2      18.40%           4.58%                 1,310.77
           Car age 3      12.38%           4.60%                 1,242.28
           Car age 4      8.05%            4.24%                 1,179.25
           Car age 5      8.53%            4.24%                 1,283.57
Capacity   Capacity 1     65.80%           5.76%                 1,359.22
           Capacity 2     28.91%           4.93%                 1,461.41
           Capacity 3     5.28%            4.52%                 1,432.32
Region     East           5.08%            6.20%                 1,837.63
           North          47.45%           4.89%                 1,247.59
           South          30.19%           5.83%                 1,449.32
           Middle         17.28%           6.00%                 1,479.09
Table 5 presents Pearson correlation coefficients for all continuous variables,
using the logarithm of claim severity due to the high skewness of this variable.
Spearman and Kendall correlations are very similar. Due to the large sample
size, correlation coefficients between mileage and all other variables are
statistically significant at the 1% level. Correlation coefficients show that young
drivers tend to drive more, older drivers less, on average. As expected, urbanites
drive less than rural policyholders. Mileage is positively related to the BMS
coefficient, suggesting that, if mileage is not used as a rating variable, the
information it contains is partially reflected through the BMS coefficient. The
positive relationship between female and married further confirms our conjecture
that married couples report the female as the main driver to get the insurance discount.
TABLE 5: Pearson correlation coefficients.
             Mileage   Claim Freq   Log (sev)    Driver age   Car age      Capacity     BMS
Mileage      1         0.05512      0.04567      0.0735       0.05292      0.06728      0.05305
                       (<0.0001)    (<0.0001)    (<0.0001)    (<0.0001)    (<0.0001)    (<0.0001)
Claim freq             1            0.25112      0.00501      0.03445      0.01283      0.04160
                                    (<0.0001)    (0.0108)     (<0.0001)    (<0.0001)    (<0.0001)
Log (sev)                           1            0.03529      0.02046      0.02058      0.02125
                                                 (0.0002)     (0.0313)     (0.0303)     (0.0254)
Driver age                                       1            0.0046       0.05913      0.11049
                                                              (0.0313)     (<0.0001)    (<0.0001)
Car age                                                       1            0.03206      0.61857
                                                                           (<0.0001)    (<0.0001)
Capacity                                                                   1            0.01205
Note: Numbers in parentheses show p-values.
4.1. Claim frequency: Negative binomial regression results
Poisson or negative binomial regressions are typically used for count dependent
variables. Negative binomial regression fits the data better when modeling
over-dispersed count outcome variables, as is the case with our claim number
distribution. As claim frequency shows marked over-dispersion, we use negative
binomial regression with a log link function. In order to allow for an
individual-specific dispersion parameter, we run a random effect negative binomial
regression model, as suggested by Hausman et al. (1984) and Boucher and Guillen
(2009). A more extensive discussion of model selection is found in section 5.
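A rough check of why the negative binomial is preferred can be made on the claim counts of Table 2 alone. The sketch below is deliberately simplified and is not the regression reported here: it has no covariates and no random effect, the dispersion parameter is obtained by the method of moments rather than by maximum likelihood, and the NB2 parameterization (variance = mean + alpha × mean²) is an assumption chosen for illustration.

```python
import math

claim_counts = {0: 247_955, 1: 8_222, 2: 2_689, 3: 136, 4: 25, 5: 2}  # Table 2
n = sum(claim_counts.values())
mean = sum(k * c for k, c in claim_counts.items()) / n
var = sum(k * k * c for k, c in claim_counts.items()) / n - mean ** 2


def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def negbin_logpmf(k, mu, alpha):
    """NB2 parameterization: mean mu, variance mu + alpha * mu**2."""
    r = 1.0 / alpha
    p = r / (r + mu)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))


alpha = (var - mean) / mean ** 2  # method-of-moments dispersion, not a fitted MLE
ll_poisson = sum(c * poisson_logpmf(k, mean) for k, c in claim_counts.items())
ll_negbin = sum(c * negbin_logpmf(k, mean, alpha) for k, c in claim_counts.items())

print(round(alpha, 2), round(ll_poisson), round(ll_negbin))
```

Even in this stripped-down form the negative binomial log likelihood is several thousand units higher than the Poisson one, mainly because the Poisson model cannot accommodate the number of policies with two claims.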
Table 6 reports the negative binomial regression results. Model (1) regresses
only the mileage variable, which turns out to be highly significant. We add a
squared mileage term in Model (2) in order to test for a possible non-linear
relationship between claim frequency and mileage, which is observed in Figure 3a.
The significantly positive mileage term and the significantly negative squared
mileage term together indicate that claims increase with mileage less than
proportionally; the curve plotting claim frequency as a function of mileage is
increasing and concave. Model (3) regresses all current rating variables. While all
variables are highly significant, both the overall Chi-square and the log
likelihood are lower than in model (1), suggesting that mileage alone explains claim
rates better than all current rating variables combined. Compared to drivers
under the age of 30 years, discounts given to older drivers, particularly those
between 30 and 60 years, are entirely justified.
TABLE 6: Negative binomial regression results for claim frequency.
Models: (1) Mileage only; (2) Mileage²; (3) Current rating; (4) Current rating with mileage;
(5) All observable; (6) All observable with mileage; (7) Significant variables only.
Variables           (1)         (2)         (3)         (4)         (5)         (6)         (7)
Mileage             0.0107∗∗∗   0.0137∗∗∗               0.0136∗∗∗               0.0140∗∗∗   0.0139∗∗∗
                    [0.0004]    [0.0005]                [0.0005]                [0.0006]    [0.0005]
Mileage²                        0.0001∗∗∗               0.0001∗∗∗               0.0001∗∗∗   0.0001∗∗∗
                                [0.0000]                [0.0000]                [0.0000]    [0.0000]
Age 30–60                                   0.1552∗∗∗   0.1113∗∗∗   0.1241∗∗∗   0.0738∗∗    0.0676∗∗
                                            [0.0336]    [0.0336]    [0.0351]    [0.0352]    [0.0299]
Age 60+                                     0.1797∗∗∗   0.0626      0.1399∗∗    0.0141
                                            [0.0607]    [0.0608]    [0.0619]    [0.0621]
Female                                      0.1242∗∗∗   0.1607∗∗∗   0.1013∗∗∗   0.1387∗∗∗   0.1377∗∗∗
                                            [0.0216]    [0.0216]    [0.0218]    [0.0219]    [0.0218]
Bonus-Malus                                 1.3768∗∗∗   1.3030∗∗∗   0.5529∗∗∗   0.4939∗∗∗   0.4897∗∗∗
                                            [0.0740]    [0.0740]    [0.1182]    [0.1174]    [0.1169]
Married                                                             0.0679∗     0.0719∗∗    0.0763∗∗
                                                                    [0.0354]    [0.0354]    [0.0348]
Car Age 0–1                                                         0.3527∗∗∗   0.3500∗∗∗   0.3280∗∗∗
                                                                    [0.0504]    [0.0503]    [0.0352]
Car Age 1–2                                                         0.1523∗∗∗   0.1366∗∗∗   0.1162∗∗∗
                                                                    [0.0439]    [0.0439]    [0.0260]
Car Age 2–3                                                         0.0298      0.0215
                                                                    [0.0453]    [0.0453]
Car Age 3–4                                                         0.0581      0.0483
                                                                    [0.0479]    [0.0480]
Car Age 4+                                                          0.0079      0.0108
                                                                    [0.0533]    [0.0533]
Eng. Capacity 2                                                     0.1222∗∗∗   0.1647∗∗∗   0.1608∗∗∗
                                                                    [0.0222]    [0.0222]    [0.0221]
Eng. Capacity 3                                                     0.1260∗∗∗   0.1559∗∗∗   0.1539∗∗∗
                                                                    [0.0458]    [0.0458]    [0.0457]
City                                                                0.0095      0.0536∗∗∗
                                                                    [0.0200]    [0.0201]
North                                                               0.1517∗∗∗   0.1463∗∗∗   0.1168∗∗∗
                                                                    [0.0436]    [0.0436]    [0.0241]
South                                                               0.0781∗     0.0794∗     0.0597∗∗
                                                                    [0.0437]    [0.0437]    [0.0256]
Middle                                                              0.0444      0.0321
                                                                    [0.0468]    [0.0468]
Wald Chi2           930.6       924.5       533.1       1,322       690.8       1,510       1,501
Log Likelihood      −52,813     −52,782     −52,987     −52,590     −52,908     −52,496     −52,501
Logarithmic Score   0.2114      0.2113      0.2121      0.2107      0.2118      0.2105      0.2105
(10-Fold Cross Validation)
Note: ∗, ∗∗ and ∗∗∗ indicate significance at the 10%, 5% and 1% levels, respectively. Numbers in brackets provide standard errors. Unbalanced panel negative binomial
regression is used. Chi2, log likelihood and logarithmic score 10-fold cross validation are presented as goodness of fit measures. Larger Chi2 and log likelihoods and
smaller logarithmic scores indicate a better fit. Year dummy variables are included but the coefficients are not reported here.
In model (4), mileage is added to the current variables and found, as expected,
to have a hugely significant positive effect: it has the largest z-score (27.2)
of all variables, followed by BMS (17.60). If only one variable is to be used, it
should be mileage, consistent with the results in models (1)–(2) and the summary
statistics section. The use of mileage in rating eliminates the need for discounts
for older drivers; presumably, policyholders over the age of 60 years, many of
them retired, spend less time on the road. So mileage reflects a large part of
the age effect on claims, a useful outcome since age is one of the variables that
regulators criticize. Although “Age 30–60 years” remains significant after the
inclusion of mileage, its magnitude decreases when controlling for mileage.
Even after the inclusion of mileage, BMS remains a significant predictor.
One of the main reasons for BMS is classification; BMS may pick up information
not revealed to insurers or not used in rating. Therefore, there was a
distinct possibility that the introduction of a powerful classification variable
such as mileage would lessen the need for BMS in rating,
maybe up to the point of making BMS insignificant. This did not occur: BMS
remains a dominant variable, with its coefficient hardly decreased. Therefore,
BMS contains important information not reflected by mileage, and should remain
an important component in pricing.
BMS and mileage include important, but different, information about the
risk a policyholder constitutes. Mileage carries present, or at least very recent,
information: the latest knowledge about the amount of driving of the insured.
BMS summarizes past information, as the current BMS level results from claim
history since the inception of the policy. So BMS captures material from the
past, including previous mileage, but also overall respect of the laws and the
driving code, alcohol consumption, road rage behavior, ability to react to crisis
situations, processing of dangerous circumstances, etc. Replacing current mileage
by lagged mileage in our models did not affect results in any way: mileage and
BMS remain highly significant. Therefore, the information contained in BMS
reflects much more than past mileage.
Model (5) includes all available variables but mileage. Despite the large number
of observations, few variables turn out to be highly significant. The claim
frequency is a decreasing function of car age, with newer cars involved in more
accidents. This could possibly be a spurious relationship, due to an omitted
variable: our data do not include driving experience, a rating variable used in
several countries. New drivers must start their driving career in the initial class
of the BMS, at a premium level of 100. Given the lenient rules of the Taiwanese
BMS, maluses never compensate bonuses, and the average driver has a premium
level of 81. So it may be that “new car” is somewhat synonymous with “recently
licensed driver”, a conjecture supported by the large positive correlation
between BMS and Car Age 0–1 years.
Compared to smaller cars, autos with a large engine have a reduced claim
frequency. This point needs cautious interpretation because our sample consists
of just one brand, which does not build cars with super-sized engines. Among
the geographical variables, only drivers from the northern part of the country
could claim a discount, probably because of the better roads in this part of the
island and the separate lanes for scooters. BMS remains significant at the 1%
level, but its importance in model (5) is much decreased compared to model (3),
as measured by the large reduction in the size of its coefficient. The use of car
age, engine cubic capacity and territory lessens the need for a sophisticated BMS.
The Taiwanese BMS is fairly mild: penalties are not severe when compared to
BMS in force in most other countries (Lemaire and Zi, 1994). Should Taiwanese
companies decide to make transition rules and premium level differentials more
severe, the significance of BMS would certainly increase. Introducing new variables
such as car age, territory and cubic capacity instead of a more severe BMS,
while actuarially justified, would result in a complicated rating system with a
large number of variables, which would be more difficult for brokers and
consumers to understand. So BMS should remain an important component of auto
insurance rating.
TABLE 7: Incidence rate ratios and z-scores of Model (6).
Variables            Coefficients   z-score   Standard Errors   IRR
Mileage              0.0140∗∗∗      25.40     0.0006            1.0141
Mileage²             −0.0001∗∗∗     8.27      0.0001            0.9999
Age 30–60            −0.0738∗∗      2.10      0.0352            0.9289
Age 60+              −0.0141        0.23      0.0621            0.9860
Female               0.1387∗∗∗      6.34      0.0219            1.1488
Bonus-Malus          0.4939∗∗∗      4.20      0.1174            1.6387
Married              −0.0719∗∗      2.03      0.0354            0.9306
Car Age 0–1          0.3500∗∗∗      6.96      0.0503            1.4191
Car Age 1–2          0.1366∗∗∗      3.11      0.0439            1.1464
Car Age 2–3          0.0215         0.47      0.0453            1.0217
Car Age 3–4          0.0483         1.01      0.0480            1.0495
Car Age 4+           −0.0108        0.20      0.0533            0.9893
Engine Capacity 2    −0.1647∗∗∗     7.41      0.0222            0.8481
Engine Capacity 3    −0.1559∗∗∗     3.40      0.0458            0.8556
City                 0.0536∗∗∗      2.67      0.0201            1.0551
North                −0.1463∗∗∗     3.36      0.0436            0.8639
South                −0.0794∗       1.82      0.0438            0.9237
Middle               −0.0321        0.69      0.0468            0.9684
Wald Chi2            1,510
Log Likelihood       −52,496
Adding mileage to all variables (Model 6), or regressing only on significant
variables (Model 7), hardly modifies the strong conclusions of this analysis.
Table 7 shows the incidence rate ratios (IRR) and z-scores of Model (6), the
full regression model from Table 6. The z-scores of mileage and squared mileage
are the largest among all variables, indicating that, by far, mileage is the most
accurate variable that insurers could introduce. The impact of mileage on claim
frequencies surpasses the influence of all other variables, including BMS, by a
wide margin. The mileage IRR shows that driving one additional kilometer per day
increases the expected claim frequency by 1.41%, everything else being equal. Note
that in all models, we add a year fixed effect in order to control for year-specific
events such as weather and road condition changes, which may affect everyone that
year.
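The incidence rate ratios in Table 7 are the exponentials of the regression coefficients, which follows from the log link. The sketch below recomputes a few of them; the negative sign attached to the Engine Capacity 2 coefficient is an inference from its IRR being below 1, and the dictionary is illustrative rather than a full copy of the table.

```python
import math

# Selected Model (6) coefficients from Table 7.
# Assumption: the Engine Capacity 2 coefficient is negative, as its IRR (< 1) implies.
coefficients = {
    "Mileage": 0.0140,
    "Female": 0.1387,
    "Bonus-Malus": 0.4939,
    "Engine Capacity 2": -0.1647,
}

irr = {name: math.exp(beta) for name, beta in coefficients.items()}
pct_change = {name: 100 * (ratio - 1) for name, ratio in irr.items()}

for name in coefficients:
    print(f"{name}: IRR = {irr[name]:.4f}, change = {pct_change[name]:+.2f}%")
```

For mileage, exp(0.0140) is about 1.0141, which is where the 1.41% increase per additional daily kilometer comes from.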
TABLE 8: Random effect linear regression results for log claim severity (significant variables only).
Variables                          Coefficients   z-score   Standard Errors
Mileage                            0.0019∗∗∗      4.18      0.0005
Female                             0.0552∗∗       2.20      0.0251
Engine Capacity 2                  0.0505∗∗       1.98      0.0255
City                               0.0795∗∗∗      3.56      0.0223
Wald Chi2                          60.49
R2 (overall)                       0.0056
Rho                                0.2000
Breusch and Pagan LM test Chi2     7.22
RMSE 10-fold Cross Validation      1.153
Note: ∗, ∗∗ and ∗∗∗ indicate significance at the 10%, 5% and 1% levels, respectively. Unbalanced
panel random effect linear regression is used. Chi2, R2 and RMSE (Root mean square error)
are presented as goodness of fit measures. Larger Chi2 and R2 and smaller RMSE indicate a
better fit. Year dummy variables are included but the coefficients are not reported here.
4.2. Claim severity: Linear regression model results
Mileage clearly impacts claim frequency, but does it influence claim severity? Is
the cost of an accident mostly random, or are high road users involved in more
severe crashes, maybe because they drive more on freeways, and thus faster? We
run the same set of random effect linear regressions as in Table 6, with log claim
severity as the dependent variable. As most of the variables prove to be statistically
insignificant (providing some support for the view that the cost of an accident is
for a large part random), only results with significant variables are presented in
Table 8. The Breusch and Pagan (1979) LM test shows that the random effect model
fits the data better than OLS.
Mileage turns out to be the most significant variable, in a set of only four.
The claim severity of a driver in the top mileage decile, driving 92 km a day, is
about 15% higher, or about US$200 more, than that of a driver in the bottom mileage
decile, driving 14 km a day, everything else being equal. The effect of mileage on
severity is positive, but much smaller than the effect on frequency. This further
justifies the use of mileage as a rating variable. The squared mileage term is
insignificant in the severity regression.
As intuitively expected, female drivers have more accidents on average but
their severity is lower, by about 5%. In cities, where traffic density is higher, more
accidents take place, but they are less severe.
5.1. Alternate model: Regressions with dummy mileage variables
The use of mileage as a continuous variable implies a linear dependence.
Non-linear or non-monotonic relationships are certainly possible, at least in certain
mileage ranges. The positive association found in the previous section could be
driven mostly by certain mileage levels. To rule out such a possibility, we run an
alternative model using mileage decile dummies instead of a continuous mileage
variable. Table 9 reports negative binomial regression results with all available
variables, but with the continuous mileage variable replaced by nine dummy
variables characterizing the ten mileage deciles. Regression coefficients for all
other variables are barely affected. The Chi-square and log likelihood of the
model with dummy variables are quite similar to the results in Model (6) of
Table 6, indicating that the use of mileage deciles provides as much information as
the continuous variable. In addition, categorical dummy variables reveal a slight
non-linear relationship for low-mileage users, as shown in Figure 4. Controlling
for all other factors, mileage exhibits a monotonically increasing relationship,
both for frequency and severity. Therefore, every single mileage decile carries
significant information, and the practice of some insurers to introduce in rating
just one mileage cut-off point (“the low-mileage discount”) is inefficient from an
actuarial perspective, as valuable information is lost. Comparing IRR in Table 9,
drivers in the top mileage decile have about 2.43 times as many accidents per year
as policyholders in the lowest mileage decile. None of the other categorical
variables shows such a strong effect on claim frequency.
Our findings are comparable with the results in Paefgen et al. (2014), who
analyzed a sample of 27,600 vehicle months over two years. Detailed In-Vehicle
Recorders Data from a major European Pay-As-You-Drive insurance company
enabled them to use variables such as time of day, day of week, and velocity.
Therefore, our study cannot be expected to provide the same results, due to omitted
variables issues and major sample differences: we control for potential rating
variables, whereas Paefgen et al. (2014) control for the driving situation. However,
the two studies overall provide very similar results, mostly a strong positive
relationship between claim frequency and mileage. Paefgen et al. (2014) find a stronger
non-linear relationship, with lower accident rates in the low-mileage area and
a less-than-proportional increase for high mileage. Our results are similar for
low mileage, but differ in the high-mileage zone. This seems to be largely due to
sample differences. Our data are from Taiwan, a relatively small country with high
traffic density. The average daily mileage of 92 km in the top decile corresponds
to the eighth decile of Paefgen et al. (2014)’s sample. Truncating Paefgen et al.
(2014)’s results at their eighth decile leads to very similar results.
FIGURE 4: Mileage dummy coefficients of frequency (4a) and severity (4b) regressions. (Color online)
TABLE 9: Regression results with mileage decile dummy variables.
                     Claim Frequency                               Claim Severity
Variables            Coefficient    Standard Error   IRR           Coefficient    Standard Error
Mileage 1            0.0483         [0.0528]         1.0495        0.0320         [0.0611]
Mileage 2            0.2081∗∗∗      [0.0509]         1.2313        0.0221         [0.0588]
Mileage 3            0.3703∗∗∗      [0.0491]         1.4482        0.0152         [0.0567]
Mileage 4            0.3829∗∗∗      [0.0490]         1.4665        0.0354         [0.0567]
Mileage 5            0.5022∗∗∗      [0.0480]         1.6524        0.0840         [0.0555]
Mileage 6            0.5487∗∗∗      [0.0477]         1.7311        0.0924∗        [0.0552]
Mileage 7            0.6306∗∗∗      [0.0470]         1.8788        0.0482         [0.0544]
Mileage 8            0.7600∗∗∗      [0.0460]         2.1382        0.1367∗∗       [0.0532]
Mileage 9            0.8875∗∗∗      [0.0452]         2.4292        0.1176∗∗       [0.0522]
Age 30–60            −0.0745∗∗      [0.0352]         0.9282        0.0333         [0.0412]
Age 60+              −0.0166        [0.0621]         0.9835        0.1024         [0.0723]
Female               0.1369∗∗∗      [0.0219]         1.1467        0.0553∗∗       [0.0256]
Married              −0.0709∗∗      [0.0354]         0.9316        0.0150         [0.0415]
Bonus-Malus          0.4943∗∗∗      [0.1175]         1.6394        0.0569         [0.1301]
Car age 0–1          0.3501∗∗∗      [0.0503]         1.4193        0.0882         [0.0575]
Car age 1–2          0.1370∗∗∗      [0.0439]         1.1469        0.0180         [0.0507]
Car age 2–3          0.0222         [0.0453]         1.0225        0.0264         [0.0523]
Car age 3–4          0.0484         [0.0480]         1.0496        0.0245         [0.0552]
Car age 4+           −0.0107        [0.0533]         0.9894        0.0219         [0.0612]
Engine capacity 2    −0.1632∗∗∗     [0.0222]         0.8494        0.0536∗∗       [0.0258]
Engine capacity 3    −0.1541∗∗∗     [0.0458]         0.8572        0.0329         [0.0537]
City                 0.0520∗∗∗      [0.0201]         1.0534        0.0704∗∗∗      [0.0234]
North                −0.1470∗∗∗     [0.0436]         0.8633        0.0673         [0.0508]
South                −0.0801∗       [0.0437]         0.9230        0.0072         [0.0512]
Middle               −0.0322        [0.0468]         0.9683        0.0157         [0.0548]
Wald Chi2            1,475                                         93.83
Likelihood Ratio     52,508
R2 (overall)                                                       0.0086
Note: ∗, ∗∗ and ∗∗∗ indicate significance at the 10%, 5% and 1% levels, respectively. Numbers in brackets
provide standard errors. Unbalanced panel negative binomial regression is used for the claim frequency regression
and a random effect linear regression model is used for the severity regression. Larger Chi2 and R2 indicate a
better fit. Year dummy variables are included but the coefficients are not reported here.
5.2. Cross-validation
The purpose of our research is to evaluate mileage as a potential rating variable
by comparing its predictive power to that of other classification variables.
Cross-validation is an important component of predictive modeling: adding more
variables, for instance, reduces the training error but may result in over-fitting
the sample, hence larger prediction errors (Geisser, 1993).
Among the various methods available, we run the widely used 10-fold
cross-validation, as it is known to work well in model selection (Kohavi, 1995). We
randomly subdivide the data into a training sample and a testing (or hold-out)
sample; 10% of the dataset becomes the hold-out sample. We fit the model using
the training sample and evaluate accuracy on the hold-out sample using the
coefficients estimated from the training sample. We repeat this 10 times. To measure
accuracy, we calculate the Root Mean Square Error (RMSE) in the severity regression
and a scoring rule in the negative binomial regression. We calculate logarithmic
scores, as Bickel (2007) has shown that, overall, they outperform quadratic or
spherical scoring. The logarithmic score is
Score(y, P) = −log f(y),
where f(y) = Pr(Y = y) is the predictive probability mass function evaluated at the
observed count y. We compute the average of this score over all ten testing samples.
A forecast that is closer to the true probability receives a lower penalty. Therefore,
the lower the score, the better the model. This metric is reported in Table 6. The
logarithmic score is the lowest in model (6) (0.2105), where all variables are used as
explanatory variables, and highest in model (3) (0.2121), implying that the model that
includes the mileage variable is not over-fitted and improves predictability. Comparing
models (2) and (3), mileage alone outperforms all current rating variables combined in
terms of predictive power. The logarithmic score also indicates that using all other
observable variables [model (5)] increases predictability but still underperforms
mileage [models (1) and (2)].
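The cross-validation procedure can be illustrated on a toy example. Everything in the sketch below is an assumption made for illustration: the data are synthetic (two mileage groups with claim frequencies close to those of deciles 1 and 10 in Table 3), the forecasts are plain Poisson means rather than the random effect negative binomial regression, and the function names are hypothetical. It only shows how an averaged logarithmic score over ten hold-out folds compares a model with and without a mileage variable.

```python
import math
import random


def poisson_log_score(y, lam):
    """Logarithmic score -log f(y) for a Poisson forecast with mean lam."""
    return -(y * math.log(lam) - lam - math.lgamma(y + 1))


def fit_group_means(train, use_group):
    """Mean claim count per mileage group, or one overall mean if use_group is False."""
    sums, sizes = {}, {}
    for group, claims in train:
        key = group if use_group else "all"
        sums[key] = sums.get(key, 0) + claims
        sizes[key] = sizes.get(key, 0) + 1
    # Small floor keeps log() defined if a training fold happened to contain no claims.
    return {key: max(sums[key] / sizes[key], 1e-9) for key in sums}


def cv_log_score(records, use_group, k=10, seed=0):
    """Average hold-out logarithmic score over k folds."""
    order = list(range(len(records)))
    random.Random(seed).shuffle(order)
    fold_scores = []
    for fold in range(k):
        test_idx = set(order[fold::k])
        train = [records[i] for i in order if i not in test_idx]
        means = fit_group_means(train, use_group)
        scores = [
            poisson_log_score(records[i][1],
                              means[records[i][0] if use_group else "all"])
            for i in test_idx
        ]
        fold_scores.append(sum(scores) / len(scores))
    return sum(fold_scores) / k


# Toy data (not the paper's sample): 5,000 low-mileage and 5,000 high-mileage policies.
records = ([("low", 1)] * 175 + [("low", 0)] * 4_825
           + [("high", 1)] * 420 + [("high", 0)] * 4_580)

score_without_mileage = cv_log_score(records, use_group=False)
score_with_mileage = cv_log_score(records, use_group=True)
print(score_with_mileage < score_without_mileage)
```

A lower averaged score for the model that uses the mileage group mirrors, in miniature, the comparison of logarithmic scores across the models of Table 6.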
5.3. Robustness of results and model selection
The Poisson model is widely used to model claim counts, but it fails to adjust for
overdispersion. Overdispersion is taken into account through negative binomial
regression. Neither technique, however, accounts for the longitudinal structure of
the data. Our sample covers a somewhat limited number of years: the maximum duration
of a policy is seven years, but the average is close to two. Therefore, the effectiveness
of models utilizing the longitudinal feature of our data is somewhat doubtful
(Gourieroux and Jasiak, 2004). Still, it is worthwhile to check the robustness
of the model selection, as the relatively short observation period may possibly
bias our results. For example, if high- or low-mileage drivers systematically move
out, an attrition problem may result. To address this concern, we calculate the
average mileage of the drivers who stay with the company, and of those who
move out, and find that the difference is negligible: policyholders staying in the
sample drive about 0.2 km more per day.
Boucher and Inoussa (2014) describe three types of models for longitudi-669
nal data. As Boucher et al. (2008) suggest that the conditional model performs670
poorly in tting, we run two alternatives, the random effect model and the671
marginal model. First, we run a random effect Poisson regression where the un-672
observed heterogeneity among individuals is controlled. Second, we run a ran-673
dom effect negative binomial regression where the unobserved heterogeneity in674
dispersion is allowed. Third, we run a negative binomial regression controlling675
for the unobserved heterogeneity among individuals (Allison, 2005). Last, we676
run a population average model (marginal model), GEE (Generalized Estimat-677
ing Equation) with negative binomial distribution and log link function, where678
error clusters within individuals are allowed. All regression results are provided679
in Table 10. Parameter estimates, especially concerning the mileage variable, are680
almost identical in all models. Test statistics show that the negative binomial681
model is superior to the Poisson model, and that the random effect Poisson682
model ts the data better than Poisson regression. However, all of these results683
Poisson Negative NB Random GEE NB Random
Variables Poisson Random Effect Binomial Effect (Log Link, NB) Effect NLMIXED
Mileage 0.0140∗∗∗ 0.0143∗∗∗ 0.0141∗∗∗ 0.0140∗∗∗ 0.0140∗∗∗ 0.01437∗∗∗
[0.0005] [0.0006] [0.0006] [0.0006] [0.0005] [0.0006]
Mileage20.0001∗∗∗ 0.0001∗∗∗ 0.0001∗∗∗ 0.0001∗∗∗ 0.0001∗∗∗ 0.0001∗∗∗
[0.0001] [0.0001] [0.0001] [0.0000] [0.0001] [0.0001]
Age 30–60 0.0713∗∗ 0.0985∗∗ 0.06780.0738∗∗ 0.0754∗∗ 0.1634∗∗∗
[0.0312] [0.0378] [0.0395] [0.0352] [0.0326] [0.0399]
Age 60+ 0.0247 0.0042 0.0296 0.0141 0.0217 0.1930∗∗
[0.0543] [0.0653] [0.0668] [0.0621] [0.0565] [0.0681]
Female 0.1360∗∗∗ 0.1422∗∗∗ 0.1354∗∗∗ 0.1387∗∗∗ 0.1367∗∗∗ 0.04105
[0.0194] [0.0243] [0.0236] [0.0219] [0.0203] [0.0237]
Bonus-Malus 0.5339∗∗∗ 0.9246∗∗∗ 0.5095∗∗∗ 0.4939∗∗∗ 0.3179∗∗ 0.07308
[0.1031] [0.1036] [0.1258] [0.1174] [0.1082] [0.0402]
Married 0.05950.06440.0591 0.0719∗∗ 0.05971.1768∗∗∗
[0.0315] [0.0384] [0.0392] [0.0354] [0.0328] [0.1437]
Car Age 0–1 0.3820∗∗∗ 0.6806∗∗∗ 0.3869∗∗∗ 0.3500∗∗∗ 0.4302∗∗∗ 0.4544∗∗∗
[0.0443] [0.0484] [0.0533] [0.0503] [0.0459] [0.0548]
Car Age 1–2 0.1383∗∗∗ 0.2061∗∗∗ 0.1371∗∗∗ 0.1366∗∗∗ 0.1505∗∗∗ 0.1296∗∗∗
[0.0389] [0.0436] [0.0463] [0.0439] [0.0401] [0.04518]
Car Age 2–3 0.0432∗∗∗ 0.0540∗∗∗ 0.0421 0.0215 0.0459 0.2301∗∗∗
[0.0401] [0.0442] [0.0475] [0.0453] [0.0412] [0.0462]
Car Age 3–4 0.0617 0.0305 0.0644 0.0483 0.0584 0.2771∗∗∗
[0.0425] [0.0459] [0.0503] [0.0480] [0.0435] [0.0494]
Car Age 4+ 0.0057∗∗∗ 0.0309 0.0063 0.0108 0.009 0.4187∗∗∗
[0.0473] [0.0497] [0.0556] [0.0533] [0.0482] [0.0556]
Eng. Capacity 2 0.1818∗∗∗ 0.1957∗∗∗ 0.1774∗∗∗ 0.1647∗∗∗ 0.1833∗∗∗ 0.1932∗∗∗
Poisson Negative NB Random GEE NB Random
Variables Poisson Random Effect Binomial Effect (Log Link, NB) Effect NLMIXED
[0.0197] [0.0244] [0.0240] [0.0222] [0.0206] [0.0243]
Eng. Capacity 3 0.2025∗∗∗ 0.2306∗∗∗ 0.1968∗∗∗ 0.1559∗∗∗ 0.2060∗∗∗ 0.2406∗∗∗
[0.0417] [0.0517] [0.0500] [0.0458] [0.0436] [0.0502]
City 0.0359∗∗ 0.0295 0.0353 0.0536∗∗∗ 0.03530.04859∗∗
[0.0178] [0.0221] [0.0218] [0.0201] [0.0186] [0.0221]
North 0.1589∗∗∗ 0.1476∗∗∗ 0.1713∗∗∗ 0.1463∗∗∗ 0.1586∗∗∗ 0.4326∗∗∗
[0.0386] [0.0483] [0.0481] [0.0436] [0.0404] [0.0475]
South 0.0547 0.0368 0.0609 0.07940.0534 0.3941∗∗∗
[0.0385] [0.0485] [0.0483] [0.0437] [0.0404] [0.0479]
Middle 0.0123 0.0002 0.0298 0.0321 0.0117 0.3013∗∗∗
[0.0413] [0.0520] [0.0519] [0.0468] [0.0433] [0.0513]
Wald Chi2 2,002 1,482 1,320 1,510 1,841
Log Likelihood −56,389 −53,617 −52,587 −52,496 −52,776
Alpha 4.44
LR Test of Alpha, Chi2 5,543.11
Alpha 8.49
LR Test of Alpha, Chi2 7,604.82
Likelihood-ratio 0.00
Test vs. Pooled, Chi
Note: ∗, ∗∗ and ∗∗∗ indicate significance at the 10%, 5% and 1% levels, respectively. Numbers in brackets provide standard errors. Unbalanced panel random effect negative binomial regression is used. Year dummy variables are included but the coefficients are not reported here.
show that our mileage result is robust and unlikely to be biased or exaggerated by the error structure.
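As a rough check on how the summary rows of the table can be read, the sketch below recomputes two kinds of quantities from printed values: a z-score with its two-sided p-value from a coefficient and its standard error, and a likelihood-ratio statistic from two log-likelihoods. The pairing of the two "LR Test of Alpha" values with the Poisson Random Effect and Negative Binomial columns is an assumption, because the column alignment of those rows does not survive in the table; the function names are illustrative only.

```python
import math

def z_and_p(coef, se):
    """Return the z-score and the two-sided normal p-value for coef / se."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

def lr_statistic(ll_restricted, ll_unrestricted):
    """Likelihood-ratio statistic: 2 * (LL_unrestricted - LL_restricted)."""
    return 2.0 * (ll_unrestricted - ll_restricted)

# Female, Poisson column: coefficient magnitude 0.1360, standard error 0.0194.
z_female, p_female = z_and_p(0.1360, 0.0194)
# Age 60+, Poisson column: 0.0247 with standard error 0.0543 (no stars in the table).
z_age60, p_age60 = z_and_p(0.0247, 0.0543)

# Log-likelihoods from the table, entered as negative numbers because the
# log-likelihood of a count model cannot be positive.
ll_poisson = -56389.0
ll_poisson_re = -53617.0   # assumed: Poisson Random Effect column
ll_negbin = -52587.0       # assumed: Negative Binomial column

lr_re = lr_statistic(ll_poisson, ll_poisson_re)  # 5544.0
lr_nb = lr_statistic(ll_poisson, ll_negbin)      # 7604.0

print(f"Female: z = {z_female:.2f}, p = {p_female:.2g}")
print(f"Age 60+: z = {z_age60:.2f}, p = {p_age60:.2f}")
print(f"LR vs. Poisson: {lr_re:.0f} (random effect), {lr_nb:.0f} (negative binomial)")
```

With the rounded log-likelihoods, the two statistics come out at 5,544 and 7,604, close to the 5,543.11 and 7,604.82 reported in the "LR Test of Alpha" rows, which is what motivates the assumed pairing.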
In this research, we have used the unique database of a major insurance carrier in Taiwan to investigate whether annual mileage should be introduced as a rating variable in auto third-party liability insurance. Admittedly, several characteristics of Taiwan and its insurance market are quite different from those of other countries: the extreme traffic density, the low number of cars given the high average wealth level and compulsory insurance that only requires bodily injury coverage with fairly low policy limits. However, our results are so strong that we can confidently extend them to all developed countries. Annual mileage is an extremely powerful predictor of the number of at-fault claims. Its significance, as measured by z-score and its associated p-value, by far exceeds that of all other variables, including BMS. This conclusion applies independently of all other variables possibly included in rating. Cross-validation results show that a prediction model with the mileage variable alone performs better than models with all current rating variables and all other observable variables.
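The mechanics of such a cross-validation comparison can be illustrated with a minimal sketch on synthetic data. It is not the authors' procedure or data: the simulated portfolio, the coefficients, the use of five folds, the logarithmic score, and the single comparison variable (gender) are all assumptions made for illustration, and the helper names are hypothetical. Because mileage is simulated with the stronger effect, the mileage-only model scores better here by construction; the sketch only shows how out-of-sample predictive scores of two Poisson models are compared.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson variate with Knuth's method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def fit_poisson(xs, ys, iters=25):
    """Fit log E[y] = a + b*x by Newton-Raphson and return (a, b)."""
    a, b = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(a + b * x)
            g0 += y - mu            # score with respect to a
            g1 += (y - mu) * x      # score with respect to b
            h00 += mu               # entries of the information matrix
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def mean_log_score(a, b, xs, ys):
    """Average Poisson log-likelihood of held-out counts (higher is better)."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = a + b * x
        total += y * eta - math.exp(eta) - math.lgamma(y + 1)
    return total / len(ys)

def cv_log_score(xs, ys, k=5):
    """k-fold cross-validated mean log score for a one-covariate model."""
    n, scores = len(ys), []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        a, b = fit_poisson([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean_log_score(a, b, [xs[i] for i in test],
                                     [ys[i] for i in test]))
    return sum(scores) / k

# Synthetic portfolio: every coefficient below is invented for illustration.
rng = random.Random(0)
n = 4000
log_km = [rng.gauss(0.0, 1.0) for _ in range(n)]      # standardized log annual mileage
female = [float(rng.randint(0, 1)) for _ in range(n)]
claims = [poisson_draw(rng, math.exp(-2.0 + 0.6 * m - 0.1 * f))
          for m, f in zip(log_km, female)]

mileage_score = cv_log_score(log_km, claims)
gender_score = cv_log_score(female, claims)
print(f"mileage only: {mileage_score:.4f}   gender only: {gender_score:.4f}")
```

The logarithmic score is used here only because it is simple to compute for count data; other proper scoring rules could be substituted in `mean_log_score` without changing the fold logic.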
Insurance companies are facing difficult pricing decisions, as several variables commonly used are challenged by regulators. The EU now forbids the use of gender rating. Territory is being challenged as a substitute for race. Insurers are being pressured to find new variables that predict accidents more accurately and are socially acceptable. Annual mileage seems an ideal candidate, to be introduced whenever feasible. The recent development of telematics devices and their rapid decrease in price should induce carriers to explore ways to minimize the practical problems associated with mileage-based insurance premiums.
The inclusion of annual mileage as a new rating variable should, however, not take place at the expense of BMS. BMS are not a substitute for annual mileage; on the contrary, the information contained in the BMS premium level complements the value of annual mileage. An accurate rating system should therefore include annual mileage and BMS as the two main building blocks, possibly supplemented by the use of other variables like age and territory, where allowed.
Wharton School, University of Pennsylvania,
459 JMHH, 3730 Walnut Street, Philadelphia,
PA 19104-6302, USA
E-Mail: lemaire@wharton.upenn.edu
Phone: +1-215-898-7765. Fax: +1-215-898-1280

SOJUNG CAROL PARK (Corresponding author)
College of Business Administration,
Seoul National University, Republic of Korea
E-Mail: park.sojung@gmail.com

Research Fellow, Risk and Insurance Research Center,
College of Commerce, National Chengchi University,
Tamkang University, Taiwan