Question
Asked 31st Jan, 2017
  • St.Francis Institute of Technology

How to compute impurity using Gini Index?

For decision trees, we can use either information gain (based on entropy) or the Gini index to decide which attribute should be the splitting attribute. Can anyone share a worked example of the Gini index?

Most recent answer

Afra Nuwasiima
University of Nairobi
I have just learnt that Gini is the default, though R also supports information gain.

Popular answers (1)

Robert Lösch
Technische Universität Bergakademie Freiberg
Let's assume we have 3 classes and 80 objects: 19 objects in class 1, 21 in class 2, and 40 in class 3, denoted (19,21,40).
The Gini index is 1 minus the sum of the squared class fractions: 1 - [(19/80)^2 + (21/80)^2 + (40/80)^2] = 0.6247, i.e. cost_before = Gini(19,21,40) = 0.6247.
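The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the function name `gini` and the choice to treat an empty node as pure are assumptions made for illustration, not something stated in the answer.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node is treated as pure
    # 1 minus the sum of squared class fractions
    return 1.0 - sum((c / total) ** 2 for c in counts)


cost_before = gini([19, 21, 40])
print(round(cost_before, 4))  # 0.6247
```

A node that contains only one class, such as (5,0,0), gets an impurity of 0.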
To decide where to split, we test all possible splits. For example, splitting feature x1 at 2.0623 gives the two branches (16,9,0) and (3,12,40).
For the test x1 < 2.0623:
cost_L = Gini(16,9,0) = 0.4608
cost_R = Gini(3,12,40) = 0.4205
Then we weight each branch's impurity by its empirical branch probability:
cost(x1 < 2.0623) = 25/80 * cost_L + 55/80 * cost_R = 0.4331
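The weighting step can be sketched in Python as follows. The helper names `gini` and `split_cost` are hypothetical, and the empty-node convention in `gini` is an assumption; the class counts are the ones from the example.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node is treated as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_cost(left, right):
    """Branch impurities weighted by the fraction of objects in each branch."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


cost_l = gini([16, 9, 0])
cost_r = gini([3, 12, 40])
print(round(cost_l, 4), round(cost_r, 4))                # 0.4608 0.4205
print(round(split_cost([16, 9, 0], [3, 12, 40]), 4))     # 0.4331
```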
We do that for every possible split, for example x1 < 1:
cost(x1 < 1) = Fraction_L * Gini(8,4,0) + Fraction_R * Gini(11,17,40) = 12/80 * 0.4444 + 68/80 * 0.5653 = 0.5472
After that, we choose the split with the lowest cost. Here that is x1 < 2.0623, with a cost of 0.4331.
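Choosing between candidate splits can be sketched like this. Only the two splits mentioned in the answer are compared, because the underlying feature values are not given in the thread; the names `gini`, `split_cost` and `candidates` are made up for illustration.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node is treated as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_cost(left, right):
    """Branch impurities weighted by the fraction of objects in each branch."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


# Class counts (left branch, right branch) for each candidate test.
candidates = {
    "x1 < 2.0623": ([16, 9, 0], [3, 12, 40]),
    "x1 < 1": ([8, 4, 0], [11, 17, 40]),
}

costs = {name: split_cost(left, right) for name, (left, right) in candidates.items()}
for name, cost in costs.items():
    print(f"{name}: {cost:.4f}")

best = min(costs, key=costs.get)  # the split with the lowest weighted impurity
print("best split:", best)        # best split: x1 < 2.0623
```

In a real tree learner the candidate thresholds would come from the sorted feature values, with the class counts recomputed for each one.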

All Answers (6)

Vishnu Priya Madabhushi
VNR Vignana Jyothi Institute of Engineering & Technology
I understand that with the Gini index we pick the split that minimizes the value, whereas with Gini gain we pick the split that gives the maximum value. Am I right? And when do we use the Gini index, and when Gini gain?
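On the question above, a small sketch may help. It assumes the common definition of Gini gain as the parent impurity minus the weighted impurity of the branches; the thread itself does not define the term. Under that assumption the two criteria pick the same split, because the parent impurity is the same for every candidate. The helper names are hypothetical and the counts are taken from Robert Lösch's example.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node is treated as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_cost(left, right):
    """Branch impurities weighted by the fraction of objects in each branch."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


parent = [19, 21, 40]
candidates = {
    "x1 < 2.0623": ([16, 9, 0], [3, 12, 40]),
    "x1 < 1": ([8, 4, 0], [11, 17, 40]),
}

costs = {name: split_cost(l, r) for name, (l, r) in candidates.items()}
# Assumed definition: gain = impurity before the split - weighted impurity after it.
gains = {name: gini(parent) - cost for name, cost in costs.items()}

lowest_cost = min(costs, key=costs.get)
highest_gain = max(gains, key=gains.get)
print(lowest_cost == highest_gain)  # True
```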
