ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 166 (2020) 582–587
1877-0509 © 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 3rd International Conference on Mechatronics and Intelligent Robotics,
ICMIR-2019.
10.1016/j.procs.2020.02.016
3rd International Conference on Mechatronics and Intelligent Robotics (ICMIR-2019)
Research on the Features of Car Insurance Data Based on Machine
Learning
Hui Dong Wang*1
School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Abstract.
With the continuous development of machine learning, using machine learning methods to mine potential information
from data has become a hot research topic for major insurance companies. In this paper, the features of auto insurance
data are analyzed, and the most important features affecting auto insurance renewal are mined. The random forest (RF),
gradient boosting decision tree (GBDT) and light gradient boosting machine (LightGBM) algorithms are compared. The test
results show that the LightGBM model has the best overall performance and robustness. The car insurance business
channel, NCD, car age and new car purchase price have the greatest impact on whether a customer renews the insurance.
© 2019 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of organizing committee of the 3rd International Conference on Mechatronics and Intelligent
Robotics (ICMIR-2019)
Keywords: Car insurance, Feature engineering, LightGBM, Data analysis
1. Introduction
As the number of cars gradually increases, insurance companies pay more and more attention to precision
marketing. Mining the useful knowledge hidden in massive customer data about users, products and services, and
thereby acquiring more customer resources, has become the focus of competition among major insurance
companies. Improving products and services through machine learning and data mining is a way to gain a new
competitive advantage [1].
1 Corresponding Author. Tel.+(86) 18669078086
*E-mail: 826727335@qq.com
Feature selection is one of the commonly used techniques in data preprocessing. As a dimension reduction
method, it focuses on deleting irrelevant or redundant features and selecting a small number of important features
from the original data [2]. Suyeon Kang et al. [3] proposed a new feature selection algorithm for aggregated data
analysis, which is highly flexible when applied to real auto insurance data and solves the standardization problem
of modeling complex data sets. Alshamsi [4] used the random forest algorithm to help insurers predict customer
choices and thus provide more competitive services. The LightGBM algorithm has shown clear advantages in data
processing over the gradient boosting decision tree algorithm [5]. Yanmei Jiang et al. [6] compared various algorithms
for commodity sales prediction and found that the LightGBM model performed best. This paper analyzes more than
60,000 auto insurance records and uses the LightGBM model to find the most important features that
affect customer renewal, so that enterprises can develop marketing strategies more effectively.
2. Data Interpretation and Feature Engineering
2.1 Data Cleaning and Feature Renaming
Based on business knowledge and the existing data, we first interpret the data to understand the meaning of
each feature and then preprocess it. The main operations are deleting invalid data, filling missing values,
reducing the feature dimension, and so on. The raw data used in this paper contain 65,535 records with a total of 28
feature variables. The attribute variables of the customer auto insurance data are as follows: policy number; start date;
end date; car insurance business channel; car brand; car series; insurance property; renewal year; insurance category;
whether the province license plate; use property; car type; car purpose; new car purchase price; car age; insurance
type; NCD (no-claim discount); risk category (A lowest, E highest); customer category; the insured person's gender;
the insured person's age; whether insurance the car damage; whether insurance theft; whether insured persons in the car;
insurance amount; signing premium; cases number; settled compensation amount.
Analysis of the data shows that the policy number, start date, end date, car brand and car series have little
influence on whether the insurance is renewed, so these redundant features are removed directly. Insurance property
and renewal year are too closely related to the renewal outcome itself, so we remove these two features as well.
To simplify the later work, we rename the remaining features as shown in Table 1.
Table 1. Features and field names
Feature                               Field name           Feature                               Field name
car insurance business channel        channel              whether insured persons in the car    persons
insurance category                    insurance category   the insured person's gender           insured gender
whether the province license plate    local car            whether insurance the car damage      car damage
use property                          use property         the insured person's age              insured age
car type                              car type             whether insurance theft               car theft
car purpose                           car purpose          customer category                     customer category
new car purchase price                car price            settled compensation amount           compensation
car age                               car age              signing premium                       premium
insurance type                        insurance type       cases number                          cases number
NCD                                   NCD                  insurance amount                      insurance amount
risk category                         risk category        whether to renew the insurance        label
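The feature removal and renaming described in this section can be sketched as follows. This is a minimal illustration in plain Python, assuming each policy is held as a dictionary keyed by the original feature names; the function name `select_and_rename` and the sample record are hypothetical and are not taken from the paper's data.

```python
# Features removed in Section 2.1 (redundant, or too closely tied to the label).
DROPPED = {
    "policy number", "start date", "end date", "car brand", "car series",
    "insurance property", "renewal year",
}

# Renames from Table 1; features not listed here keep their original name.
RENAMED = {
    "car insurance business channel": "channel",
    "whether the province license plate": "local car",
    "new car purchase price": "car price",
    "whether insured persons in the car": "persons",
    "the insured person's gender": "insured gender",
    "whether insurance the car damage": "car damage",
    "the insured person's age": "insured age",
    "whether insurance theft": "car theft",
    "settled compensation amount": "compensation",
    "signing premium": "premium",
    "whether to renew the insurance": "label",
}


def select_and_rename(record):
    """Drop the excluded features and apply the Table 1 field names."""
    return {
        RENAMED.get(name, name): value
        for name, value in record.items()
        if name not in DROPPED
    }


# Hypothetical sample record, for illustration only.
raw = {
    "policy number": "P001",
    "car brand": "X",
    "new car purchase price": 120000,
    "NCD": 3,
    "whether to renew the insurance": 1,
}
print(select_and_rename(raw))
```

Keeping the two mappings as data rather than code makes it easy to compare them against Table 1 when the field list changes.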
2.2 Missing Value Filling and Feature Value Processing
Some features in the data have missing values, which are handled as follows: ① Car type and car purpose each
have one missing value, so we delete these two records directly. ② NCD has 11 missing values, and we also
delete the corresponding samples directly. ③ The risk category has nearly 50,000 missing values. Because this
feature may have a strong impact on the model results, we keep it and fill the missing values with 0. ④ The insured
person's gender has more than 5,000 missing values, which are filled with male or female with a probability of 50% each.
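The four rules above could be applied as in the following sketch. It assumes records are dictionaries using the field names of Table 1 and that a missing value is represented by `None`; the function name `handle_missing`, the fixed random seed and the sample records are illustrative assumptions, not details given in the paper.

```python
import random


def handle_missing(records, seed=0):
    """Apply the four missing-value rules to a list of record dictionaries."""
    rng = random.Random(seed)  # fixed seed so the random gender fill is repeatable
    cleaned = []
    for rec in records:
        # Rules 1 and 2: drop samples missing car type, car purpose or NCD.
        if any(rec.get(field) is None for field in ("car type", "car purpose", "NCD")):
            continue
        rec = dict(rec)  # copy, so the caller's record is not modified
        # Rule 3: keep the risk category and fill missing values with 0.
        if rec.get("risk category") is None:
            rec["risk category"] = 0
        # Rule 4: fill a missing gender with male or female, 50% each.
        if rec.get("insured gender") is None:
            rec["insured gender"] = rng.choice(["male", "female"])
        cleaned.append(rec)
    return cleaned


# Hypothetical records, for illustration only.
records = [
    {"car type": "sedan", "car purpose": "family", "NCD": 0.7,
     "risk category": None, "insured gender": None},
    {"car type": None, "car purpose": "family", "NCD": 0.7,
     "risk category": "A", "insured gender": "male"},
    {"car type": "truck", "car purpose": "freight", "NCD": None,
     "risk category": "C", "insured gender": "male"},
    {"car type": "sedan", "car purpose": "family", "NCD": 1.0,
     "risk category": "B", "insured gender": "female"},
]
cleaned = handle_missing(records)
print(len(cleaned))
```

Filling the risk category with a dedicated value keeps the roughly 50,000 affected samples in the training set instead of discarding them.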
The values of many features are text and cannot be fed into the model directly. Some feature values need to be
converted to numbers, and others need to be divided into several intervals. The specific operations are shown in Table 2.
Table 2. Feature value quantification and segmentation
Field name           Process description
channel              8 different values, corresponding to the numbers 0 to 7
insurance category   3 different values, corresponding to the numbers 0 to 2
local car            3 different values, corresponding to the numbers 0 to 2
use property         9 different values, corresponding to the numbers 0 to 8
car type             17 different values, corresponding to the numbers 0 to 16
car purpose          16 different values, corresponding to the numbers 0 to 15
car price            less than 100000, 100000~150000, 150000~200000, 200000~300000, 300000~500000, 500000~1000000, greater than 1000000; 7 segments
car age              0~1, 2~5, 6~10, 11~20, greater than 20; 5 segments
insurance type       2 different values, corresponding to the numbers 0 to 1
NCD                  16 different values, corresponding to the numbers 0 to 15
risk category        6 different values, corresponding to the numbers 0 to 5
customer category    2 different values, corresponding to the numbers 0 to 1
insured gender       2 different values, corresponding to the numbers 0 to 1
insured age          less than 20, 21~30, 31~40, 41~50, 51~60, greater than 60; 6 segments
car damage           2 different values, corresponding to the numbers 0 to 1
car theft            2 different values, corresponding to the numbers 0 to 1
persons              2 different values, corresponding to the numbers 0 to 1
insurance amount     0, 50000, 100000, 150000, 200000, 300000, 500000, 1000000, others; 9 classes
premium              less than 100, 100~500, 500~1000, 1000~2000, 2000~5000, 5000~10000, 10000~20000, others; 8 segments
cases number         undetermined
compensation         less than 1000, 1000~3000, 3000~8000, 8000~20000, 20000~100000, greater than 100000; 6 segments
label                2 different values, corresponding to the numbers 0 to 1
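The two kinds of processing in Table 2 can be illustrated with a short sketch: integer coding of text categories, and segmentation of a numeric feature. The paper does not say how codes are assigned to category values, so the order of first appearance used here is an assumption, as are the function names and the treatment of car age as a whole number of years.

```python
from bisect import bisect_left


def encode_categories(values):
    """Map each distinct text value to an integer code 0..k-1.

    Codes follow the order of first appearance (an assumption; the paper
    does not specify which value receives which number).
    """
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return [codes[value] for value in values], codes


# Upper bounds of the car age segments 0~1, 2~5, 6~10, 11~20 from Table 2;
# any age above 20 falls into the fifth segment (index 4).
CAR_AGE_UPPER = [1, 5, 10, 20]


def car_age_segment(age):
    """Return the index (0 to 4) of the Table 2 segment containing an integer age."""
    return bisect_left(CAR_AGE_UPPER, age)


encoded, mapping = encode_categories(["agent", "online", "agent", "phone"])
print(encoded, mapping)
print([car_age_segment(age) for age in (0, 1, 2, 5, 6, 10, 11, 20, 21)])
```

The same `bisect_left` pattern applies to the other segmented fields (car price, premium, compensation) once their boundaries are listed.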
3. Model Performance Evaluation Index
Classification accuracy is commonly used as a performance evaluation index for classification models, but it is
meaningful only when each class contributes similarly to the accuracy. In this paper, the ratio of class 0 (no) to
class 1 (yes) is 5:1, which is a certain degree of imbalance. Therefore, the positive class recall rate, F1 value and
AUC value are used as evaluation indicators for evaluating the classification performance of the model. The
confusion matrix in the binary classification problem is shown in Table 3.
Table 3. Confusion matrix
classification prediction positive prediction negative
condition positive TP FN
condition negative FP TN
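A small sketch may help show why accuracy alone is misleading at the 5:1 class ratio mentioned above. It counts the four cells of the confusion matrix for a trivial classifier that predicts class 0 for every sample; the function name `confusion_counts` and the six-sample data set are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


# Five negatives for every positive, as in the 5:1 ratio of the data set.
y_true = [0, 0, 0, 0, 0, 1]
# A useless classifier that always predicts "no renewal".
y_pred = [0, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, fn, fp, tn, round(accuracy, 3))
```

The accuracy is 5/6 even though not a single renewing customer is identified (TP = 0), which is why the evaluation relies on indicators that focus on the positive class.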
The positive recall (equation 1) represents the proportion of the real positive cases that are judged to be positive.
$$RC_p = \frac{TP}{TP + FN} \tag{1}$$
F-score is a comprehensive evaluation index based on recall and precision (equation 2). In equation 3, β
represents the relative importance of precision and recall. Setting β to 1 gives the F1-score (equation 4). The larger
the F1 value, the better the classification performance.
$$PC_p = \frac{TP}{TP + FP} \tag{2}$$
$$F_\beta\text{-score} = \frac{(1 + \beta^2) \cdot RC_p \cdot PC_p}{\beta^2 \cdot PC_p + RC_p} \tag{3}$$
$$F_1\text{-score} = \frac{2 \cdot RC_p \cdot PC_p}{PC_p + RC_p} \tag{4}$$
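Equations 1 to 4 translate directly into a few lines of code. The sketch below uses hypothetical confusion matrix counts (they are not the paper's results) chosen to show how a high recall combined with a low precision produces a modest F1 value; the function names are illustrative.

```python
def recall(tp, fn):
    """Equation 1: share of real positives that are predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Equation 2: share of predicted positives that are really positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_beta(rc, pc, beta=1.0):
    """Equation 3; with beta = 1 it reduces to the F1-score of equation 4."""
    denominator = beta ** 2 * pc + rc
    return (1 + beta ** 2) * rc * pc / denominator if denominator else 0.0


# Hypothetical counts: 100 real positives, 400 predicted positives.
tp, fn, fp = 80, 20, 320
rc = recall(tp, fn)      # 80 / 100
pc = precision(tp, fp)   # 80 / 400
f1 = f_beta(rc, pc)      # 2 * 0.8 * 0.2 / (0.2 + 0.8)
print(rc, pc, round(f1, 4))
```

With a recall of 0.8 and a precision of 0.2, the F1 value is 0.32: on an imbalanced data set a model can recover most positives and still score low on F1 because of its false positives.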
The AUC value is the area enclosed between the ROC (receiver operating characteristic) curve and the FPR axis.
The X-axis of the ROC curve represents FPR (equation 5), and the Y-axis represents TPR (equation 6). The farther the
ROC curve lies above the 45° diagonal, the better the classifier. The AUC value is a quantitative representation of
the ROC curve, and the larger the value, the better.
$$FPR = \frac{FP}{FP + TN} \tag{5}$$
$$TPR = \frac{TP}{TP + FN} \tag{6}$$
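To make the relationship between the ROC curve and the AUC value concrete, the following sketch builds the (FPR, TPR) points by lowering the decision threshold one score at a time and then integrates the curve with the trapezoidal rule. The function names and the six scored samples are hypothetical, and the code assumes that both classes are present.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Visit samples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples tied at the same score enter the positive predictions together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and predicted renewal probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
print(points[-1], round(area, 4))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9; a random ranking would give a value near 0.5, which corresponds to the 45° diagonal.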
4. Model Building
Machine learning algorithms are usually trained in a mini-batch manner, so the size of the training data is not
limited by memory. The GBDT algorithm, however, needs to traverse the entire training data multiple times in each
iteration. LightGBM was proposed mainly to solve the problems that GBDT encounters when dealing with
massive data. LightGBM is a gradient boosting framework that uses tree-based learning algorithms and
supports efficient parallel training, with faster training speed, lower memory consumption, better accuracy, and the
ability to process massive data quickly.
The data have been processed into training samples that the model can accept, and a training set is constructed
based on our understanding of the business. The overall process of the model is shown in Fig.1.
Figure 1. The overall process of the model: raw data → data cleaning, feature renaming, missing value filling, feature value processing → LightGBM → model result
5. Result and Summary
The processed data set is classified with the LightGBM model, and the results are compared with those of the RF and
GBDT models, as shown in Table 4.
Table 4. Performance under different models
Data set              Model      RC_p     F1-value   AUC
Feature engineering   RF         0.8016   0.3467     0.7357
                      GBDT       0.8229   0.2266     0.7888
                      LightGBM   0.8282   0.3123     0.8045
Table 4 shows that, across the three evaluation indicators, the LightGBM model achieves the highest positive
recall (0.8282) and AUC (0.8045); only its F1 value (0.3123) is slightly lower than that of the RF model (0.3467).
The ROC curves are shown in Fig.2. Overall, the LightGBM algorithm has the better classification effect. The
feature importance obtained from the LightGBM experiments is ranked in Fig.3. The figure shows that the features
that most affect car insurance renewal are the car insurance business channel, NCD, new car purchase price and
car age. Based on this result, insurance companies can design more targeted marketing strategies and increase
their profits.
Figure 2. ROC curves for different models
Figure 3. Feature importance order affecting renewal
References
1. Vladimir Kašćelan, Ljiljana Kašćelan, Milijana Novović Burić. A nonparametric data mining approach for risk prediction in car insurance: a
case study from the Montenegrin market[J]. Economic Research-Ekonomska Istraživanja, 2016, 29(1): 545-558.
2. Chen M S, Hwang C P, Ho T Y, et al. Driving behaviors analysis based on feature selection and statistical approach: a preliminary
study[J]. The Journal of Supercomputing, 2018.
3. Suyeon Kang, Jongwoo Song. Feature selection for continuous aggregate response and its application to auto insurance data[J]. Expert
Systems With Applications, 2018(93): 104-117.
4. Alshamsi, Asma S. Predicting car insurance policies using random forest[C]. 2014 10th International Conference on Innovations in
Information Technology (INNOVATIONS). IEEE, 2014.
5. Xiaojun Ma, Jinglan Sha, Dehua Wang, Yuanbo Yu, Qian Yang, Xueqi Niu. Study on A Prediction of P2P Network Loan Default Based on
the Machine Learning LightGBM and XGboost Algorithms according to Different High Dimensional Data Cleaning[J]. Electronic
Commerce Research and Applications, 2018(31): 24-39.
6. Yanmei Jiang, Qingkai Bu. Supermarket Commodity Sales Forecast Based on Data Mining[J]. Hans Journal of Data
Mining, 2018, 08(02): 74-78.