University of Nairobi
Question
Asked 31st Jan, 2017
How to compute impurity using Gini Index?
For decision trees, we can use either information gain (based on entropy) or the Gini index to decide which attribute is the best splitting attribute. Can anyone send a worked-out example of the Gini index?
Most recent answer
I have just learnt that Gini is the default, though R also supports information gain.
Popular answers (1)
Technische Universität Bergakademie Freiberg
Let's assume we have 3 classes and 80 objects: 19 objects in class 1, 21 objects in class 2, and 40 objects in class 3 (denoted as (19,21,40)).
The Gini index would be: 1 - [ (19/80)^2 + (21/80)^2 + (40/80)^2 ] = 0.6247, i.e. cost_before = Gini(19,21,40) = 0.6247
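The formula above can be checked with a few lines of Python. This is only a minimal sketch: the function name `gini` and the convention of returning 0 for an empty node are assumptions made for illustration, not part of the original answer.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    # 1 minus the sum of squared class proportions
    return 1.0 - sum((c / total) ** 2 for c in counts)


cost_before = gini([19, 21, 40])
print(round(cost_before, 4))  # 0.6247
```

A node containing only one class gives 0, and the value grows as the classes become more evenly mixed.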
In order to decide where to split, we test all possible splits. For example, splitting at 2.0623 results in the two branches (16,9,0) and (3,12,40):
After testing x1 < 2.0623:
cost_L = Gini(16,9,0) = 0.4608
cost_R = Gini(3,12,40) = 0.4205
Then we weight branch impurity by empirical branch probabilities:
cost_(x1<2.0623) = 25/80 * cost_L + 55/80 * cost_R = 0.4331
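The weighting step can be sketched as follows. The helper names `gini` and `split_cost` are hypothetical, chosen only for this illustration; the class counts are the ones from the example above.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_cost(left, right):
    """Branch impurities weighted by the fraction of objects in each branch."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


# Split x1 < 2.0623: left branch (16,9,0), right branch (3,12,40)
print(round(gini([16, 9, 0]), 4))                    # 0.4608
print(round(gini([3, 12, 40]), 4))                   # 0.4205
print(round(split_cost([16, 9, 0], [3, 12, 40]), 4))  # 0.4331
```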
We do that for every possible split, for example x1 < 1:
cost_(x1<1) = Fraction_L * Gini(8,4,0) + Fraction_R * Gini(11,17,40) = 12/80 * 0.4444 + 68/80 * 0.5653 = 0.5472
After that, we choose the split with the lowest cost. Here that is the split x1 < 2.0623, with a cost of 0.4331.
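Choosing among candidate splits can be sketched in the same way. This assumes the per-class counts for each candidate threshold have already been tallied; the dictionary `candidates` and the helper names are hypothetical, and only the two thresholds discussed above are included.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_cost(left, right):
    """Branch impurities weighted by the fraction of objects in each branch."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


# Candidate split -> (class counts in left branch, class counts in right branch)
candidates = {
    "x1 < 2.0623": ([16, 9, 0], [3, 12, 40]),
    "x1 < 1": ([8, 4, 0], [11, 17, 40]),
}

costs = {name: split_cost(left, right) for name, (left, right) in candidates.items()}
best = min(costs, key=costs.get)  # lowest weighted impurity wins
print(best, round(costs[best], 4))  # x1 < 2.0623 0.4331
```

In a real tree learner the candidate list would cover every threshold of every attribute; the selection rule stays the same.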
All Answers (6)
VNR Vignana Jyothi Institute of Engineering & Technology
As I understand it, with the Gini index we pick the split that minimizes the value, and with Gini gain we pick the split that gives the maximum value. Am I right? And when do we use the Gini index and when Gini gain?
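One way to look at this question, assuming "Gini gain" means the usual impurity decrease (the Gini index of the parent node minus the weighted Gini index of its branches): the parent impurity is the same for every candidate split of a node, so the split with the lowest weighted Gini index is also the split with the highest Gini gain. The sketch below reuses the counts from the worked example in this thread; the function names are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node, given the number of objects in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumed convention: an empty node counts as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_cost(left, right):
    """Branch impurities weighted by the fraction of objects in each branch."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


def gini_gain(parent, left, right):
    """Assumed definition: parent impurity minus weighted branch impurity."""
    return gini(parent) - split_cost(left, right)


parent = [19, 21, 40]
candidates = {
    "x1 < 2.0623": ([16, 9, 0], [3, 12, 40]),
    "x1 < 1": ([8, 4, 0], [11, 17, 40]),
}

costs = {k: split_cost(l, r) for k, (l, r) in candidates.items()}
gains = {k: gini_gain(parent, l, r) for k, (l, r) in candidates.items()}

# Both criteria select the same split.
print(min(costs, key=costs.get))  # x1 < 2.0623
print(max(gains, key=gains.get))  # x1 < 2.0623
```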