Research on the Features of Car Insurance Data Based on Machine Learning

ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 166 (2020) 582–587
1877-0509 © 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 3rd International Conference on Mechatronics and Intelligent Robotics,
ICMIR-2019.
10.1016/j.procs.2020.02.016
Hui Dong Wang*
School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
* Corresponding author. Tel.: +(86) 18669078086. E-mail: 826727335@qq.com
Abstract.
With the continuous development of machine learning, the use of machine learning methods to mine the information latent in data
has become a hot research topic for major insurance companies. In this paper, the features of auto insurance data are analyzed,
and the most important features affecting auto insurance renewal are identified. Random forest (RF), gradient boosting decision
tree (GBDT) and light gradient boosting machine (LightGBM) models are compared. The test results show that the LightGBM model
performs best and is the most robust. The car insurance business channel, NCD, car age and new car purchase price have the
greatest impact on whether a policy is renewed.
Keywords: Car insurance, Feature engineering, LightGBM, Data analysis
1. Introduction
With the steady increase in the number of cars, insurance companies are paying more and more attention to precision
marketing. Mining the useful knowledge and information about users, products and services hidden in massive customer
data, and thereby acquiring more customer resources, has become the focus of competition among major insurance
companies. Improving products and services through machine learning and data mining is a way to gain a new
competitive advantage [1].
Feature selection is one of the commonly used techniques in data preprocessing. As a dimension reduction
method, it focuses on deleting irrelevant or redundant features and selecting a small number of important features
from the original data [2]. Suyeon Kang et al. [3] proposed a new feature selection algorithm for aggregated data
analysis, which is highly flexible when applied to real auto insurance data and solves the standardization problem
in modeling complex data sets. Alshamsi et al. [4] used the random forest algorithm to help insurers predict customer
choices and provide more competitive services. Compared with the gradient boosting decision tree algorithm, the
LightGBM algorithm has clear advantages in data processing [5]. Yanmei Jiang et al. [6] compared various algorithms
for commodity sales prediction and showed that the LightGBM model performed best. This paper analyzes more than
60,000 auto insurance records and uses the LightGBM model to find the most important features affecting customer
renewal, so that enterprises can develop marketing strategies more effectively.
2. Data Interpretation and Feature Engineering
2.1 Data Cleaning and Feature Renaming
Based on business knowledge and the existing data, we first interpret the data to understand the meaning of each
feature, and then preprocess it. The main operations are invalid data deletion, missing value filling, feature
dimension reduction, and so on. The raw data set used in this article contains 65,535 records with a total of 28
feature variables. The attribute variables of the customer auto insurance data are as follows: policy number; start date;
end date; car insurance business channel; car brand; car series; insurance property; renewal year; insurance category;
whether the license plate is from the province; use property; car type; car purpose; new car purchase price; car age;
insurance type; NCD; risk category (A lowest, E highest); customer category; the insured person's gender; the insured
person's age; whether car damage is insured; whether theft is insured; whether persons in the car are insured;
insurance amount; signing premium; cases number; settled compensation amount.
Analysis of the data shows that policy number, start date, end date, car brand and car series have little influence
on whether the insurance is renewed, so these redundant features are removed directly. Insurance property and renewal
year are too closely related to the renewal outcome itself, so we also remove these two features. To facilitate later
work, we rename the remaining features as shown in Table 1.
Table 1. Features and field names
Feature                                           Field name
car insurance business channel                    channel
insurance category                                insurance category
whether the license plate is from the province    local car
use property                                      use property
car type                                          car type
car purpose                                       car purpose
new car purchase price                            car price
car age                                           car age
insurance type                                    insurance type
NCD                                               NCD
risk category                                     risk category
customer category                                 customer category
the insured person's gender                       insured gender
the insured person's age                          insured age
whether car damage is insured                     car damage
whether theft is insured                          car theft
whether persons in the car are insured            persons
insurance amount                                  insurance amount
signing premium                                   premium
cases number                                      cases number
settled compensation amount                       compensation
whether to renew the insurance                    label
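The feature removal and renaming described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's code: the raw column names are assumed to match the feature list in Section 2.1, each record is assumed to be a plain dictionary, and `select_and_rename` is a hypothetical helper.

```python
# Raw column names are assumptions based on the feature list in Section 2.1.
DROPPED = {
    "policy number", "start date", "end date", "car brand", "car series",
    "insurance property", "renewal year",
}

# Only features whose field name differs from the raw name are listed (Table 1).
RENAME = {
    "car insurance business channel": "channel",
    "whether the license plate is from the province": "local car",
    "new car purchase price": "car price",
    "the insured person's gender": "insured gender",
    "the insured person's age": "insured age",
    "whether car damage is insured": "car damage",
    "whether theft is insured": "car theft",
    "whether persons in the car are insured": "persons",
    "signing premium": "premium",
    "settled compensation amount": "compensation",
    "whether to renew the insurance": "label",
}


def select_and_rename(record):
    """Drop the removed features and rename the remaining ones (hypothetical helper)."""
    return {RENAME.get(name, name): value
            for name, value in record.items()
            if name not in DROPPED}
```

Features that keep their raw name, such as car age, pass through unchanged because `RENAME.get(name, name)` falls back to the original key.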
2.2 Missing Value Filling and Feature Value Processing
Some features in the data have missing values, which are handled as follows. Car type and car purpose each have one
missing value, so we delete these two records directly. NCD has 11 missing values, and we also delete the records
in which it is missing. The risk category has nearly 50,000 missing values. Because this feature may have a strong
effect on the model results, we keep it and fill the missing values with 0. The insured person's gender has more
than 5,000 missing values, which we fill with male or female at random with equal probability.
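The missing-value rules above can be sketched as follows. This is an illustrative sketch only: it assumes records are dictionaries keyed by the field names of Table 1, that a missing value is stored as `None`, and that gender is stored as the strings "male" and "female"; `fill_missing` and the fixed seed are hypothetical.

```python
import random


def fill_missing(records, seed=0):
    """Apply the missing-value rules of Section 2.2 to a list of dict records."""
    rng = random.Random(seed)  # fixed seed so the random gender fill is repeatable
    cleaned = []
    for record in records:
        # Records with a missing car type, car purpose or NCD are deleted.
        if any(record.get(key) is None for key in ("car type", "car purpose", "NCD")):
            continue
        record = dict(record)  # copy, so the caller's data is left untouched
        # Missing risk category is filled with 0.
        if record.get("risk category") is None:
            record["risk category"] = 0
        # Missing gender is filled with male or female, each with probability 50%.
        if record.get("insured gender") is None:
            record["insured gender"] = rng.choice(["male", "female"])
        cleaned.append(record)
    return cleaned
```

Filling the risk category with 0 turns "missing" into a category of its own, which would be consistent with Table 2 listing six values for a feature graded A to E.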
The values of many features are text and cannot be fed into the model directly. Some feature values need to be
converted to numeric codes, and others need to be divided into several intervals. The specific operations are shown
in Table 2.
Table 2. Feature value quantification and segmentation
Field name           Process description
channel              8 different values, corresponding to the numbers 0 to 7
insurance category   3 different values, corresponding to the numbers 0 to 2
local car            3 different values, corresponding to the numbers 0 to 2
use property         9 different values, corresponding to the numbers 0 to 8
car type             17 different values, corresponding to the numbers 0 to 16
car purpose          16 different values, corresponding to the numbers 0 to 15
car price            less than 100000, 100000~150000, 150000~200000, 200000~300000, 300000~500000, 500000~1000000, greater than 1000000, 7 segments
car age              0~1, 2~5, 6~10, 11~20, greater than 20, 5 segments
insurance type       2 different values, corresponding to the numbers 0 to 1
NCD                  16 different values, corresponding to the numbers 0 to 15
risk category        6 different values, corresponding to the numbers 0 to 5
customer category    2 different values, corresponding to the numbers 0 to 1
insured gender       2 different values, corresponding to the numbers 0 to 1
insured age          less than 20, 21~30, 31~40, 41~50, 51~60, greater than 60, 6 segments
car damage           2 different values, corresponding to the numbers 0 to 1
car theft            2 different values, corresponding to the numbers 0 to 1
persons              2 different values, corresponding to the numbers 0 to 1
insurance amount     0, 50000, 100000, 150000, 200000, 300000, 500000, 1000000, others, 9 classes
premium              less than 100, 100~500, 500~1000, 1000~2000, 2000~5000, 5000~10000, 10000~20000, others, 8 segments
cases number         undetermined
compensation         less than 1000, 1000~3000, 3000~8000, 8000~20000, 20000~100000, greater than 100000, 6 segments
label                2 different values, corresponding to the numbers 0 to 1
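The two operations in Table 2, mapping text values to integer codes and cutting numeric values into segments, might look like the sketch below. The interval edges are taken from Table 2, but two details are assumptions because the paper does not state them: a value that falls exactly on a boundary is assigned to the upper segment, and text values are numbered in order of first appearance. `bin_index` and `label_encode` are hypothetical helpers.

```python
from bisect import bisect_right

# Segment edges from Table 2 (car age in whole years, car price in currency units).
CAR_AGE_EDGES = [2, 6, 11, 21]  # 0~1, 2~5, 6~10, 11~20, greater than 20
CAR_PRICE_EDGES = [100000, 150000, 200000, 300000, 500000, 1000000]


def bin_index(value, edges):
    """Return the 0-based segment index; a boundary value goes to the upper segment."""
    return bisect_right(edges, value)


def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping
```

With these edges, a car aged 1 year falls in segment 0 and a car aged 25 years in segment 4, which gives the five car-age segments listed in Table 2; the six price edges give its seven price segments.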
3. Model Performance Evaluation Index
Classification accuracy is commonly used to evaluate classifier performance, but it assumes that each class
contributes similarly to the accuracy. In this paper, the ratio of class 0 (no renewal) to class 1 (renewal) is 5:1,
so the data are moderately imbalanced. Therefore, the positive-class recall, F1 value and
AUC value are used as the indicators for evaluating the classification performance of the model. The
confusion matrix for the binary classification problem is shown in Table 3.
Table 3. Confusion matrix
classification        prediction positive    prediction negative
condition positive    TP                     FN
condition negative    FP                     TN
The positive recall (equation 1) represents the proportion of the real positive cases that are judged to be positive.
$$RC_p = \frac{TP}{TP + FN} \tag{1}$$
The F-score is a comprehensive evaluation index based on recall and precision (equation 2). In equation 3, β
represents the relative importance of precision and recall. Setting β = 1 gives the F1-score (equation 4). The larger
the F1 value, the better the classification performance.
$$PC_p = \frac{TP}{TP + FP} \tag{2}$$
$$F_\beta\text{-score} = \frac{(1 + \beta^2)\, RC_p \cdot PC_p}{\beta^2\, PC_p + RC_p} \tag{3}$$
$$F_1\text{-score} = \frac{2\, RC_p \cdot PC_p}{PC_p + RC_p} \tag{4}$$
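Equations 1 to 4 can be checked numerically with a short sketch. The function names and the confusion-matrix counts in the usage note are hypothetical and are not taken from the paper's data.

```python
def recall(tp, fn):
    """Positive-class recall RC_p (equation 1)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Positive-class precision PC_p (equation 2)."""
    return tp / (tp + fp)


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score (equation 3); beta=1 gives the F1-score (equation 4)."""
    pc = precision(tp, fp)
    rc = recall(tp, fn)
    b2 = beta * beta
    return (1 + b2) * rc * pc / (b2 * pc + rc)
```

For example, with TP = 80, FN = 20 and FP = 120 (made-up counts), recall is 0.8, precision is 0.4 and F1 is about 0.533, so a high recall can coexist with a modest F1 when precision is low. Reading equation 4 in reverse for the LightGBM row of Table 4 (RC_p = 0.8282, F1 = 0.3123) implies a precision of roughly 0.19, although the paper does not report precision directly.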
The AUC value is the area under the ROC (receiver operating characteristic) curve, that is, the area between the curve
and the FPR axis. The X-axis of the ROC curve represents FPR (equation 5), and the Y-axis represents TPR (equation 6).
The further the ROC curve lies above the 45° diagonal, the better the classifier. The AUC value is a quantitative
summary of the ROC curve, and the larger the value, the better.
$$FPR = \frac{FP}{FP + TN} \tag{5}$$
$$TPR = \frac{TP}{TP + FN} \tag{6}$$
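As a sketch of how equations 5 and 6 lead to an AUC value, the code below sweeps a decision threshold over predicted scores, records one (FPR, TPR) point per distinct score, and integrates with the trapezoidal rule. It assumes labels are 0 or 1 and that both classes are present; `roc_points` and `auc` are hypothetical helpers, and this is not how the paper's results were computed.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))  # equations 5 and 6
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

For the toy input `y_true = [1, 0, 1, 0]` and `scores = [0.9, 0.8, 0.7, 0.1]`, three of the four positive/negative pairs are ranked correctly, and the computed AUC is 0.75.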
4. Model Building
Many machine learning algorithms are trained in mini-batches, so the size of the training data is not limited by
memory. The GBDT algorithm, in contrast, needs to traverse the entire training data multiple times in each iteration.
LightGBM was proposed mainly to solve the problems that GBDT encounters when dealing with massive data. LightGBM is
a gradient boosting framework that uses decision-tree-based learning algorithms and supports efficient parallel
training, with faster training speed, lower memory consumption, higher accuracy, and the ability to process massive
data quickly.
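To make the cost of plain gradient boosting concrete, the sketch below boosts depth-one trees (stumps) on the logistic loss. It is a deliberately naive illustration of the GBDT idea, not LightGBM and not the model used in this paper: every boosting round recomputes the residuals and rescans all rows for every candidate split, which is the full-data traversal mentioned above. All function names, the toy data and the hyperparameters are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Exhaustively search (feature, threshold) for the least squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, left_mean, right_mean)
    if best is None:
        raise ValueError("no valid split: all feature values are identical")
    return best[1:]


def fit_gbdt(X, y, n_rounds=20, learning_rate=0.5):
    """Fit stumps to the residuals y - p, one stump per boosting round."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, left_mean, right_mean = fit_stump(X, residuals)
        stump = (j, t, learning_rate * left_mean, learning_rate * right_mean)
        stumps.append(stump)
        scores = [s + (stump[2] if row[j] <= t else stump[3])
                  for row, s in zip(X, scores)]
    return stumps


def predict_proba(stumps, row):
    """Probability of class 1: sigmoid of the summed stump outputs."""
    return sigmoid(sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps))
```

On a toy set such as `X = [[0, 1], [1, 3], [2, 2], [3, 5], [4, 4], [5, 6]]` with `y = [0, 0, 0, 1, 1, 1]`, the fitted stumps separate the two classes. The nested loops in `fit_stump` make the per-round cost grow with both the number of rows and the number of candidate thresholds, which is the bottleneck on large data that the text attributes to GBDT.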
After preprocessing, the data are in a form that the model can accept, and a training set is constructed based on
our understanding of the business. The overall process of the model is shown in Fig. 1.
Figure 1. The overall process of the model: raw data; preprocessing (data cleaning, feature renaming, missing value filling, feature value processing); LightGBM; model result.
5. Result and Summary
We classify the processed data set with LightGBM and compare the results with those of the RF and GBDT models, as shown in Table 4.
Table 4. Performance under different models
Data set               Model       RC_p      F1-value    AUC
Feature engineering    RF          0.8016    0.3467      0.7357
Feature engineering    GBDT        0.8229    0.2266      0.7888
Feature engineering    LightGBM    0.8282    0.3123      0.8045
Table 4 shows that LightGBM achieves the highest positive-class recall and AUC of the three models; only its F1
value (0.3123) is slightly lower than that of the RF algorithm (0.3467). The ROC curves are shown in Fig. 2. Overall,
the LightGBM algorithm has the better classification effect. The features affecting car insurance renewal, sorted
by the importance obtained from the LightGBM experiments, are shown in Fig. 3. The figure shows that the features
that most affect renewal are the car insurance business channel, NCD, new car purchase price and car age. Based on
this result, insurance companies can design more targeted marketing strategies and increase their profits.
Figure 2. ROC curves for different models
Figure 3. Feature importance order affecting renewal
References
1. Vladimir Kašćelan, Ljiljana Kašćelan, Milijana Novović Burić. A nonparametric data mining approach for risk prediction in car insurance: a case study from the Montenegrin market [J]. Economic Research-Ekonomska Istraživanja, 2016, 29(1): 545-558.
2. Chen M S, Hwang C P, Ho T Y, et al. Driving behaviors analysis based on feature selection and statistical approach: a preliminary study [J]. The Journal of Supercomputing, 2018.
3. Suyeon Kang, Jongwoo Song. Feature selection for continuous aggregate response and its application to auto insurance data [J]. Expert Systems With Applications, 2018(93): 104-117.
4. Alshamsi, Asma S. Predicting car insurance policies using random forest [C]. 2014 10th International Conference on Innovations in Information Technology (INNOVATIONS). IEEE, 2014.
5. Xiaojun Ma, Jinglan Sha, Dehua Wang, Yuanbo Yu, Qian Yang, Xueqi Niu. Study on A Prediction of P2P Network Loan Default Based on the Machine Learning LightGBM and XGboost Algorithms according to Different High Dimensional Data Cleaning [J]. Electronic Commerce Research and Applications, 2018(31): 24-39.
6. Yanmei Jiang, Qingkai Bu. Supermarket Commodity Sales Forecast Based on Data Mining [J]. Hans Journal of Data Mining, 2018, 08(02): 74-78.
... The purpose of this study was to review and survey the state-of-art existing cloud-and blockchain-based insurance systems. Several researchers and practitioners have published state-of-art analyses related to the different factors that have a direct impact on insurance stakeholders (policyholders and service providers), such as those found in [12][13][14][15], and blockchain-and cloud-based insurance frameworks and techniques have been discussed in [16][17][18][19][20][21][22][23][24][25][26][27][28]. ...
... These stakeholders' behavior and customer knowledge help the customers to make decisions regarding the continuity or discontinuity of the insurance policy. Wang, in [14], examined 60,000 vehicle insurance policies using machine learning techniques and determined the factors that had a direct impact on the clients' decision-making processes related to the continuity of the insurance. Regardless of insurance literacy, Arumugam and Bhargavi, in [15], drew attention to the driver's behavior. ...
... The surveys presented in [12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28] highlighted various possible research directions that could help to enhance the insurance sector. ...
Article
Full-text available
Despite the rapid expansion in the insurance industry, many issues remain unresolved and may require immediate action. As the insurance sector continues to evolve with the development of new technologies, it faces more challenges, especially related to data security and fraud. The fraud-prevention data and tactics presently used by insurance firms are outdated and ineffective. Additionally, insurance firms have traditionally handled the settlement of all consumer claims through lengthy manual processes. These manual processes need to be changed to provide opportunities for insurance businesses to grow. In the case of vehicles, the information obtained from an automobile data recorder can be used as evidence. Data from automated vehicles are critical because they can help the police, law enforcement agencies, and insurance companies to reconstruct the events leading up to a collision. Insurance companies require the forensic analysis of accident videos, which is a time-consuming process and involves a large amount of storage. Due to hardware limitations and associated costs, the current standalone (and often dedicated) computing infrastructures used for this purpose are quite limited. Previous research focused on simple video analysis tasks within cloud computing and blockchain technology. The requirements for a large-scale auto-insurance system are quite high and need more thorough investigation. In this paper, a review of the contribution of recent approaches to storing accidental data in cloud computing using blockchain is provided. We focused on the latest cloud and blockchain studies related to auto-insurance along with the related issues and challenges. Some useful solutions and recommendations are provided to address the identified issues and challenges in the cloud-based and blockchain-based auto-insurance sector.
... vehicle evaluation is essential for designers and manufacturers to enhance the appeal of their new models. Extant research has shown promise in exploiting machine learning (ML) and artificial intelligence for vehicle price prediction [2][3][4], vehicle sales prediction [5], vehicle purchase criteria [6], vehicle evaluation [7], and insurance services [8]. When evaluating a vehicle, consumers typically analyze multiple data types, such as images, 3D models, parametric specifications, and text reviews. ...
... We add a dropout layer after the first hidden layer with dropout rates ranging from 0.25 or 0.3 for predicting different rating scores. 8 https://cars.usnews.com/cars-trucks/acura/mdx/2007/photos-exterior 9 https://cars.usnews.com/cars-trucks/acura/mdx/2007/photos-interior (2)Text model: Secondly, the text model adopts a pre-trained transformer-based BERT [30] text embedding module. ...
Preprint
Full-text available
Accurate vehicle rating prediction can facilitate designing and configuring good vehicles. This prediction allows vehicle designers and manufacturers to optimize and improve their designs in a timely manner, enhance their product performance, and effectively attract consumers. However, most of the existing data-driven methods rely on data from a single mode, e.g., text, image, or parametric data, which results in a limited and incomplete exploration of the available information. These methods lack comprehensive analyses and exploration of data from multiple modes, which probably leads to inaccurate conclusions and hinders progress in this field. To overcome this limitation, we propose a multi-modal learning model for more comprehensive and accurate vehicle rating predictions. Specifically, the model simultaneously learns features from the parametric specifications, text descriptions, and images of vehicles to predict five vehicle rating scores, including the total score, critics score, performance score, safety score, and interior score. We compare the multi-modal learning model to the corresponding unimodal models and find that the multi-modal model's explanatory power is 4% - 12% higher than that of the unimodal models. On this basis, we conduct sensitivity analyses using SHAP to interpret our model and provide design and optimization directions to designers and manufacturers. Our study underscores the importance of the data-driven multi-modal learning approach for vehicle design, evaluation, and optimization. We have made the code publicly available at http://decode.mit.edu/projects/vehicleratings/.
... [2] used the AHP (Analytic Hierarchy Process) method to analyze the risk management process in motor vehicle insurance. [3] used machine learning methods to analyze the factors that affect the renewal status of motor vehicle insurance policies. [3] compared the random forest, gradient lifting tree, and lifting machine algorithm methods. ...
... [3] used machine learning methods to analyze the factors that affect the renewal status of motor vehicle insurance policies. [3] compared the random forest, gradient lifting tree, and lifting machine algorithm methods. [4] analyzed the relationship between motor vehicle insurance status and motor vehicle accidents using logistic regression analysis. ...
Article
One type of general insurance is motor vehicle insurance. Premium pricing of general insurance can be calculated by some methods. In this study, Bayes method will be used. The distribution of claim frequency is Poisson distribution and the distribution of claim severity is Exponential distribution. The premium is calculated by multiplying the expectation of claim frequency and the expectation of claim severity. Based on the historical data analysis using the Bayes method, the highest pure premium of motor vehicle insurance in Indonesia is Hino brand and the lowest pure premium is Honda brand. The result of this premium pricing can be used as a reference for the insurance companies to manage their motor vehicle insurance reserves.
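The premium rule described in this abstract (expected claim frequency times expected claim severity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which priors were used, so the conjugate Gamma priors and their hyperparameters (`a`, `b`, `c`, `d`) are assumptions, as are the function and variable names.

```python
def bayes_pure_premium(claim_counts, claim_sizes, a=1.0, b=1.0, c=2.0, d=1.0):
    """Pure premium = E[claim frequency] * E[claim severity].

    Assumptions (not taken from the abstract):
    - frequency ~ Poisson(lam), with a Gamma(shape=a, rate=b) prior on lam
    - severity ~ Exponential(rate=theta), with a Gamma(shape=c, rate=d) prior on theta
    """
    periods = len(claim_counts)      # number of observed policy periods
    total_claims = sum(claim_counts)
    n_sizes = len(claim_sizes)       # number of observed claim amounts
    total_size = sum(claim_sizes)

    # Posterior of lam is Gamma(a + total_claims, b + periods); use its mean.
    expected_frequency = (a + total_claims) / (b + periods)

    # Posterior of theta is Gamma(c + n_sizes, d + total_size).
    # The expected severity is E[1/theta], which is finite only if the shape exceeds 1.
    shape = c + n_sizes
    if shape <= 1:
        raise ValueError("posterior shape must exceed 1 for a finite mean severity")
    expected_severity = (d + total_size) / (shape - 1)

    return expected_frequency * expected_severity


# Hypothetical data: claims per policy period and individual claim amounts.
premium = bayes_pure_premium([0, 1, 0, 2, 1], [100.0, 200.0, 300.0, 400.0])
print(round(premium, 2))
```

Under these assumptions, the same function could be applied per vehicle brand to compare pure premiums across brands.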
... For comparison, the results for the efficiency of our algorithms can be seen in Table 1. Effects of important features for insurance auto-renewal with classification ML algorithms are represented in [7]. In this research, the most successful models are random forest, gradient-lifting tree (GBDT) and lifting machine algorithm (LightGBM), with LightGBM producing the best result at 0.8045 AUC. ...
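The AUC figure quoted in this snippet is the area under the ROC curve. As a minimal sketch of what that metric measures, the function below computes AUC as the probability that a randomly chosen positive example (here taken to be a renewal, labelled 1) is scored above a randomly chosen negative one, counting ties as half. The function name and the toy labels and scores are illustrative assumptions, not data from the cited study.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = renewed, 0 = not renewed; scores are model outputs.
print(auc_score([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a rank-based formulation would be the usual choice on large datasets.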
Chapter
Car accidents and the resulting loss of assets or life are issues that every car owner must contend with at some point in their driving life. Driving is an inherently dangerous act, even if it does not seem so at first, resulting in more than 33,000 fatal vehicle crashes in the USA in 2019 alone. However, the loss of life and possible damages can be reduced with the help of insurance. Insurance is an arrangement under which a person or agency receives financial security or reimbursement from an insurance provider in the form of a policy. Insurance helps limit customers' losses when an undesirable event occurs, such as a car crash or a heart attack. Vehicle insurance provides customers with monetary compensation after unfortunate accidents, provided they first pay annual premiums to the company. Our goal is to develop a machine learning algorithm that predicts which customers are interested in getting or renewing their vehicle insurance, using personal, vehicle, contact, and previous insurance data. The insurance sales forecast is helpful to companies, since they can plan their communication strategy accordingly to reach those customers and optimize their business model and revenue. It also benefits customers, who can get through the process and aftermath of car accidents more easily thanks to their monetary compensation. In this paper, the Health Insurance Cross-Sell Prediction dataset is used. The proposed model is trained and evaluated on training and test datasets and produces a categorical response based on the aforementioned data, using well-known machine learning algorithms: k-nearest neighbors, random forest, support vector machines, Naive Bayes, and logistic regression.
... Alshamsi (2014) [54] applied random forest algorithms that assist insurers in anticipating client decisions to achieve more enticing insurance packages. Wang (2020) [55] examined the data from over 60,000 auto insurance packages, employing a Light-GBM algorithm to identify the vital features that impact upon decisions to remain with a particular insurer. Doing so allows businesses to produce more robust advertising tactics. ...
Article
This article investigates the impact of big data on the actuarial sector. The growing fields of applications of data analytics and data mining raise the ability for insurance companies to conduct more accurate policy pricing by incorporating a broader variety of data due to increased data availability. The analyzed areas of this paper span from automobile insurance policy pricing, mortality and healthcare modeling to estimation of harvest-, climate- and cyber risk as well as assessment of catastrophe risk such as storms, hurricanes, tornadoes, geomagnetic events, earthquakes, floods, and fires. We evaluate the current use of big data in these contexts and how the utilization of data analytics and data mining contribute to the prediction capabilities and accuracy of policy premium pricing of insurance companies. We find a high penetration of insurance policy pricing in almost all actuarial fields except in the modeling and pricing of cyber security risk due to lack of data in this area and prevailing data asymmetries, for which we identify the application of artificial intelligence, in particular machine learning techniques, as a possible solution to improve policy pricing accuracy and results.
Article
Explainable Artificial Intelligence (XAI) models allow for a more transparent and understandable relationship between humans and machines. The insurance industry represents a fundamental opportunity to demonstrate the potential of XAI, with the industry’s vast stores of sensitive data on policyholders and centrality in societal progress and innovation. This paper analyses current Artificial Intelligence (AI) applications in insurance industry practices and insurance research to assess their degree of explainability. Using search terms representative of (X)AI applications in insurance, 419 original research articles were screened from IEEE Xplore, ACM Digital Library, Scopus, Web of Science and Business Source Complete and EconLit. The resulting 103 articles (between the years 2000–2021) representing the current state-of-the-art of XAI in insurance literature are analysed and classified, highlighting the prevalence of XAI methods at the various stages of the insurance value chain. The study finds that XAI methods are particularly prevalent in claims management, underwriting and actuarial pricing practices. Simplification methods, called knowledge distillation and rule extraction, are identified as the primary XAI technique used within the insurance value chain. This is important as the combination of large models to create a smaller, more manageable model with distinct association rules aids in building XAI models which are regularly understandable. XAI is an important evolution of AI to ensure trust, transparency and moral values are embedded within the system’s ecosystem. The assessment of these XAI foci in the context of the insurance industry proves a worthwhile exploration into the unique advantages of XAI, highlighting to industry professionals, regulators and XAI developers where particular focus should be directed in the further development of XAI. 
This is the first study to analyse XAI’s current applications within the insurance industry, while simultaneously contributing to the interdisciplinary understanding of applied XAI. Advancing the literature on adequate XAI definitions, the authors propose an adapted definition of XAI informed by the systematic review of XAI literature in insurance.
Article
Technological innovations affect many sectors of the economy, including the insurance business. Among these innovations, IoT-based (Internet of Things) solutions can be highlighted, the main feature of which is that real-time and continuous data collection is performed using the Internet, thus optimizing the risk management of the insurer. Given that a significant part of the data thus collected constitutes personal data, the rules of the General Data Protection Regulation (GDPR) should apply. The data protection examination of the technologies affecting the insurance institution raises several issues, which, in my view, significantly impede the application of these technological achievements. The study aims to explore these problems and attempts to make proposals to solve them.
Article
Like the other financial markets, insurance markets need to increase their profit margins if they want to continue their activities. In this article, we examine the factors that can affect the costs of insurance companies and examine the relationships between these factors. We use two approaches to determine the effect of some variables on the charge of insurance companies. Our first approach is to use data analysis. Then we analyzed the relationships between these variables using diagrams. We used statistical graphs to examine the effect of each variable on the cost of insurance companies. Then we determine the relationships between these variables and the trade of these variables on each other. Another approach we use is the Best–Worst method that is one of multi-criteria decision-making techniques. Our expert finds each variable's weight on the charge of insurance companies by using the Best–Worst method. We implemented these two methods on an insurance company's data, and we showed which variable can have the most impact on costs. These results can help insurance companies to determine macro-fiscal policies and pricing. Using these results, insurance companies can divide their customers into several sections and offer the price of their services to each insured separately.
Article
Due to the prevalence of IoV technology, big data has increasingly been promoted as a revolutionary development in a variety of applications. Indeed, the received big data from IoV is valuable particularly for those involved in analyzing driver’s behaviors. For instance, in the fleet management domain, fleet administrators are interested in fine-grained information about fleet usage, which is influenced by different driver usage patterns. In the vehicle insurance market, usage-based insurance or pay-as-you-drive schemes aim to adapt the insurance premium to individual driver behavior or even to provide various value-added services to policy holders. These applications can be expected to improve and to make safer the driving style of various individuals. Nowadays, big data analysis is becoming indispensable for automatic discovering of intelligence that is involved in the frequently occurring patterns and hidden rules. It is essential and necessary to study how to utilize these large-scale data. Regarding driving behaviors analysis, this paper presents a preliminary study based on feature selection and statistical approach. Feature selection is one of the important and frequently used techniques in data preprocessing for big data mining. Feature selection, as a dimensionality reduction technique, focuses on choosing a small subset of the significant features from the original data by removing irrelevant or redundant features. According to selection process, the most significant feature is vehicle speed for the collected vehicular data. Afterward, the statistical approach calculates skewness and dispersion in speed distribution as the statistical features for driving behaviors analysis. Finally, the established classification rules not only provide data-driven services and big data analytics but also offer training data samples for supervised machine learning algorithms. 
To validate the feasibility of the proposed method, over 150 drivers and more than 200,000 trips are verified in the simulation. As expected, experimental results are well matched with our observations.
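The statistical step described in this abstract (skewness and dispersion of the speed distribution) can be sketched as below. The abstract does not say which dispersion measure or skewness estimator was used, so the population standard deviation and the moment-based skewness coefficient here are assumptions, and the function name and sample speeds are hypothetical.

```python
from statistics import mean, pstdev


def speed_features(speeds):
    """Return mean, dispersion (population std dev), and moment skewness of speeds."""
    if len(speeds) < 2:
        raise ValueError("need at least two speed samples")
    mu = mean(speeds)
    sigma = pstdev(speeds)
    if sigma == 0:
        # Constant speed: no spread, so skewness is taken as zero by convention.
        return {"mean": mu, "dispersion": 0.0, "skewness": 0.0}
    # Third central moment divided by sigma cubed.
    third_moment = sum((v - mu) ** 3 for v in speeds) / len(speeds)
    return {"mean": mu, "dispersion": sigma, "skewness": third_moment / sigma ** 3}


# Hypothetical trip: mostly low speeds with one fast segment gives positive skew.
print(speed_features([1.0, 1.0, 1.0, 5.0]))
```

Features like these could then feed per-trip classification rules of the kind the abstract describes, although the thresholds used there are not given in the text.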
Article
To predict risk in car insurance, we used nonparametric data mining techniques such as clustering, support vector regression (SVR) and kernel logistic regression (KLR). The goal of these techniques is to classify risk and predict claim size based on data, thus helping the insurer to assess the risk and calculate actual premiums. We show that, on the case study data, the data mining techniques used can predict claim sizes and their occurrence with better accuracy than the standard methods. This provides the basis for calculating the net risk premium. The article also discusses the advantages of data mining methods over standard methods for risk assessment in car insurance, as well as the specificities of the results that arise from a small insurance market such as Montenegro's.
Article
Big data and the Internet financial sector have developed tremendously in the 21st century. The national emphasis on this field has also gradually improved. Peer-to-peer (P2P) is an innovative mode of borrowing that is a powerful complement to the traditional financial industry. The projected default rate on credit is an absolute prerequisite for guaranteeing the proper operation of related financial projects or platforms. In this paper, we use a ‘multi-observation’ and ‘multi-dimensional’ data cleaning method and apply the modern machine learning algorithms LightGBM in Asia at the end of 2016 and XGBoost, which are based on real P2P transaction data from Lending Club. The default risk of loans on the platform is predicted, and the results of the different methods are compared. We observe that the LightGBM algorithm based on the multiple-observation data set gives the best classification prediction results. The average performance rate of the historical transaction data of the Lending Club platform rose by 1.28 percentage points, which reduced loan defaults by approximately $117 million. Finally, with respect to the factors influencing the default rate, we suggest developments for Lending Club and other P2P platforms, as well as a direction for other countries’ development in this field.
Article
This paper presents new feature selection algorithms for aggregate data analysis. Data aggregation is commonly used when it is not appropriate to model the relationship between a response and explanatory variables at an individual-level. We investigate substantial challenges in analysis for aggregate data. Then, we propose a groupwise feature selection method that addresses (i) the change in dataset depending on the selection of predictor variables, (ii) the presence of potential missing responses, and (iii) the suitability of model selection criteria when comparing models using different datasets. In application to real auto insurance data, we find a set of important predictors to classify the policyholders into some homogeneous risk groups. Our results clearly demonstrate the potential of the proposed feature selection method for aggregate data analysis in terms of flexibility and computational complexity. We expect that the proposed algorithms would be further applied into a wide range of decision-making tasks using aggregate data as they are applicable to any type of data.
Supermarket Commodity Sales Forecast Based on Data Mining
  • Yanmei Jiang
  • Qingkai Bu
Yanmei Jiang, Qingkai Bu. Supermarket Commodity Sales Forecast Based on Data Mining [J].