Question
• St.Francis Institute of Technology

# How to compute impurity using Gini Index?

For decision trees, we can use either information gain (based on entropy) or the Gini index to decide which attribute should be the splitting attribute. Can anyone send a worked-out example of the Gini index?

Afra Nuwasiima
University of Nairobi
I have just learnt that Gini is the default, though R also supports information gain.

Robert Lösch
Let's assume we have 3 classes and 80 objects: 19 objects in class 1, 21 objects in class 2, and 40 objects in class 3, denoted as (19,21,40).
The Gini index would be: 1 - [(19/80)^2 + (21/80)^2 + (40/80)^2] = 0.6247, i.e. cost_before = Gini(19,21,40) = 0.6247
In order to decide where to split, we test all possible splits. For example, splitting at x1 < 2.0623 divides the objects into (16,9,0) and (3,12,40):
After testing x1 < 2.0623:
cost_L = Gini(16,9,0) = 0.4608
cost_R = Gini(3,12,40) = 0.4205
Then we weight each branch's impurity by its empirical branch probability:
cost(x1 < 2.0623) = 25/80 * cost_L + 55/80 * cost_R = 0.4331
We do that for every possible split, for example x1 < 1:
cost(x1 < 1) = Fraction_L * Gini(8,4,0) + Fraction_R * Gini(11,17,40) = 12/80 * 0.4444 + 68/80 * 0.5653 = 0.5472
After that, we choose the split with the lowest cost. Here that is the split x1 < 2.0623, with a cost of 0.4331.
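
The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the function names `gini` and `split_cost` are made up for illustration and do not come from any library, and returning 0 for an empty node is an assumed convention.

```python
from typing import Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini impurity of a node, given the number of objects per class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention for an empty node
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_cost(left: Sequence[int], right: Sequence[int]) -> float:
    """Impurity of a two-way split: branch impurities weighted by branch size."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return n_left / n * gini(left) + n_right / n * gini(right)


# Class counts taken from the example above.
cost_before = gini([19, 21, 40])
cost_a = split_cost([16, 9, 0], [3, 12, 40])   # candidate split x1 < 2.0623
cost_b = split_cost([8, 4, 0], [11, 17, 40])   # candidate split x1 < 1

print(f"cost_before       = {cost_before:.4f}")  # 0.6247
print(f"cost(x1 < 2.0623) = {cost_a:.4f}")       # 0.4331
print(f"cost(x1 < 1)      = {cost_b:.4f}")
```

Of the two candidates, the split at x1 < 2.0623 has the lower weighted cost, so it is the one that would be chosen.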
13 Recommendations

Sanjay Chakraborty
Techno International New Town
2 Recommendations
VNR Vignana Jyothi Institute of Engineering & Technology
As I understand it, with the Gini index we pick the split that minimizes the value, whereas with Gini gain we pick the split that gives the maximum value. Am I right? And when do we use the Gini index versus Gini gain?
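
One way to see how the two relate, assuming "Gini gain" means the drop in impurity from the parent node to the size-weighted child nodes (the thread itself does not define the term):

```latex
\Delta_{\mathrm{Gini}}(s)
  = \mathrm{Gini}(\text{parent})
  - \sum_{k \in \{L, R\}} \frac{n_k}{n}\,\mathrm{Gini}(\text{child}_k)
```

Here $s$ is a candidate split, $n_k$ is the number of objects in branch $k$, and $n$ is the number of objects in the parent. Because $\mathrm{Gini}(\text{parent})$ is the same for every candidate split, the split that maximizes the gain is also the split that minimizes the weighted Gini cost, so under this definition both criteria select the same split. With the numbers from the worked example in this thread, the split x1 < 2.0623 has a gain of 0.6247 - 0.4331 = 0.1916.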
