International Journal of Mathematics and
Computer Science, 15(2020), no. 4, 1005–1014
A Clustering Algorithm Application in
Parkinson Disease based on k-means Method
Israa Ali Alshabeeb, Nidaa Ghalib Ali,
Saba Abdulameer Naser, Wafaa M. R. Shakir
Technical Computer System Department
Babylon Technical Institute
Al-Furat Al-Awsat Technical University
Babil, Iraq
Inb.esr@atu.edu.iq, Inb.nedaa10@atu.edu.iq,
sabaabdulameernaser@gmail.com, inb.wfa@atu.edu.iq
(Received July 21, 2020, Accepted August 17, 2020)
Abstract
Data mining methods are used to predict and compare the ages of people
with Parkinson’s disease, and they have become a critical tool for the
medical community. The amount of data stored in medical databases is
increasing rapidly. A system is used to arrange the ages of Parkinson
disease patients by year. In this paper, a data mining technique, the
k-means clustering algorithm, is implemented to analyze such data. The
technique is applied to a hospital database and analyzes it successfully,
so it can help the Ministry of Health in Iraq and related parties obtain
accurate results, make effective decisions and find appropriate solutions.
Key words and phrases: Data Mining, Clustering, Parkinson disease,
k-means clustering.
AMS (MOS) Subject Classifications: 68Q25, 68W40, 62H30
ISSN 1814-0432, 2020, http://ijmcs.future-in-tech.net
1 Introduction
The ability to monitor the progression of a disease across a large number
of patients is becoming increasingly important. It is crucial to know how
many people are affected by a disease at different ages. Parkinson disease
(PD) is the second most common degenerative disorder after Alzheimer’s
disease. The most common symptoms of PD are shaking and slowness of
movement. The main cause of the disease is unknown [1] [2]. In data mining,
many algorithms have been applied to diseases. Decision tree, factor
analysis, neural net and logistic regression were used to examine
biomedical voice measurements and find out which measurement is most
suitable for detecting the early symptoms of PD, so that treatment can
begin early [3]. Using data mining techniques such as the Decision Tree
Algorithm, Naive Bayes and Neural Network, doctors can detect the risk
level of heart disease [4]. Deepika and Kalaiselvi [5] reviewed data mining
methods and compared their results for the diagnosis and prognosis of
breast cancer, heart disease and thyroid disease. Logistic regression,
artificial neural network and decision tree models for predicting diabetes
from specific risk factors were compared in [6]. Bayesian analysis was used
to estimate how PD depends on heredity as a key risk factor and how it
affects PD patients [7]. In this work, a new approach is used to compare PD
patients year by year to see whether the disease is growing or not. The
algorithm proposed in this paper gives good accuracy.
Physicians and the Ministry of Health in Iraq want to know how a disease is
progressing so that they can find ways to treat people at an early stage.
Many clustering methods in data mining are available for predicting how
many patients are affected every year. The k-means clustering method is
used in this paper to find out how PD affects people.
2 Methodology
2.1 Clustering
Clustering is a data mining technique for gathering objects with similar
features or properties into one group, which is called a cluster. Objects
whose mutual distance is smaller than their distance to other objects are
placed in the same subset, and each subset should contain at least one
object [8]. The technique simply puts similar data into one group and
dissimilar objects into separate groups [9]. Clustering is widely used in
many fields and applications such as pattern recognition, image processing,
security, machine learning, data analysis, business, web search and
biology [10]. Clustering methods can be classified into several types:
partitioning methods, density-based methods, hierarchical methods,
grid-based methods and model-based methods [11]. The type of data and the
purpose of the application are the main criteria for choosing the best
clustering method. Many researchers have used clustering algorithms such as
hierarchical clustering [12] and k-means clustering [13] [14] in
applications. Partitioning methods work by iteratively reallocating data
objects among subsets, and the number of subsets is determined at the
start. Partitioning algorithms are represented by k-medoids and k-means.
The k-means algorithm is used in this paper. Partitioning clustering is
shown in figure 1.
Figure 1: Partitioning clustering
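The two partitioning algorithms named above differ mainly in how the representative of a cluster is chosen: k-means uses the arithmetic mean of the members, while k-medoids uses an actual member that minimizes the total distance to the others. The following is a minimal Python sketch of that difference. The ages are made up for illustration and are not taken from the hospital data, and the helper name medoid is hypothetical.

```python
from statistics import mean


def medoid(values):
    """Return the member with the smallest total absolute distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


# Hypothetical ages for a single cluster (illustrative only, not study data).
ages = [61, 63, 64, 66, 90]

centroid = mean(ages)           # k-means representative: need not be a data point
representative = medoid(ages)   # k-medoids representative: always a data point

print("mean:", centroid, "medoid:", representative)
```

In this toy cluster the single large value pulls the mean upward, while the medoid stays at one of the central ages, which is why k-medoids is often described as less sensitive to outliers.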
2.2 Data mining process
In this work, data were gathered from the Neuroscience hospital in Baghdad,
Iraq for four years (2016-2019), and a clustering method was used to
analyze them.
2.2.1 k-means Method
k-means is one of the clustering algorithms. It clusters the data into
groups based on a centroid point. The algorithm is widely used in pattern
recognition applications [15]. It works with both large and small datasets
and gives good results. k-means clustering is considered an unsupervised
linear method [13]. It works in several steps. The first step is to specify
the number of clusters k and the initial centroid points. Then the data
that are similar to each other, i.e. that have the smallest distance to the
same centroid, are put in one group. Finally, these steps are repeated
until all data are assigned to groups and no object moves from one cluster
to another. The k-means flow chart is shown in figure 2 and the process of
the algorithm is demonstrated in figures 3, 4 and 5.
The objective function for the k-means method can be written as:
\[
f = \sum_{i=1}^{k} \sum_{j=1}^{n} (p_j - y_i)^2, \tag{2.1}
\]
where $k$ is the number of clusters, $n$ is the number of cases, $p_j$
represents a point (object) in the cluster and $y_i$ is the centroid point
of cluster $i$. The centroid point represents the mean value for each
cluster.
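As a small worked illustration of the objective function (2.1), the Python sketch below sums the squared distance between each point and the centroid of the cluster it belongs to. The points, labels and centroids are hypothetical values chosen so the arithmetic is easy to follow, and the function name kmeans_objective is an assumption of this sketch.

```python
def kmeans_objective(points, centroids, labels):
    """Sum of squared distances between each point and its own cluster centroid."""
    return sum((p - centroids[c]) ** 2 for p, c in zip(points, labels))


# Hypothetical one-dimensional data: two clusters of two points each.
points = [10.0, 12.0, 30.0, 34.0]
labels = [0, 0, 1, 1]        # cluster index of each point
centroids = [11.0, 32.0]     # mean of each cluster

f = kmeans_objective(points, centroids, labels)
print(f)  # (1 + 1) + (4 + 4) = 10.0
```

A lower value of f means the points lie closer to their centroids, which is what the algorithm tries to achieve.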
In figure 3, the number of clusters is selected and the centroid point for
each group is found.
In figure 4, the distance between each object and the centroid points is
computed, and the object is assigned to the nearest cluster according to
the distance value. In figure 5, the centroid of each cluster is
recalculated and objects are reassigned to groups until there is no further
change.
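The three stages illustrated in figures 3 to 5 (initialization, assignment to the nearest centroid, and recomputation until nothing changes) can be sketched in a few lines of Python. This is a minimal sketch for one-dimensional data such as ages, not the tool used in the study. The sample ages and the starting centroids are hypothetical, and four clusters are used only as an example.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Basic k-means on scalar values; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to the nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            for v in values
        ]
        if new_labels == labels:  # no object changed cluster, so stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for i in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = sum(members) / len(members)
    return centroids, labels


# Hypothetical ages and starting centroids (illustrative only).
ages = [8, 10, 12, 30, 33, 36, 60, 62, 70, 75]
centroids, labels = kmeans_1d(ages, centroids=[10, 30, 60, 75])
print(centroids)
print(labels)
```

The result depends on the starting centroids, so in practice the procedure is often run from several different initializations.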
Figure 2: Flowchart of k-means algorithm
Figure 3: k-means at initialization
Figure 4: Centroid and grouping
Figure 5: Recomputed centroid and regrouping
3 Result and Discussion
In this model, the method was applied to data from 35 patients at the
hospital in each of 4 successive years. The XLMiner tool was used to
analyze the dataset and implement the method. The generated results are
shown in tables 1, 2, 3 and 4, respectively. In table 1, for the year 2016,
the mean age of the cluster of size 14 is 75.3, while the mean age of the
cluster of size 13 is 63. That means 27 of the 35 patients fall in clusters
with a mean age above 62. The remaining patients fall in clusters with mean
ages from 44 to 50.6. In table 2, for the year 2017, the mean age of the
cluster of size 10 is 62.7, while that of the cluster of size 15 is 52.9.
That means 25 of the 35 patients fall in clusters with a mean age above 50,
and the rest fall in clusters with mean ages from 14 to 41.75. In table 3,
for the year 2018, the clusters of size 11 and size 10 have mean ages of
47.7 and 39.6. The remaining patients fall in clusters with mean ages from
15.2 to 28.2. In table 4, for the year 2019, 24 of the 35 patients fall in
clusters with mean ages between 32.6 and 48.3, and 11 patients fall in
clusters with mean ages from 8.3 to 23.2. That means PD seems to affect
younger people than in 2016 and 2017.
Table 1: Data summary for 2016
Cluster    Size   Mean age (years)
cluster1   14     75.3
cluster2   13     63
cluster3   3      44
cluster4   5      50.6
Total      35

Table 2: Data summary for 2017
Cluster    Size   Mean age (years)
cluster1   10     62.7
cluster2   15     52.9
cluster3   2      14
cluster4   8      41.75
Total      35

Table 3: Data summary for 2018
Cluster    Size   Mean age (years)
cluster1   11     47.7
cluster2   10     39.6
cluster3   7      15.2
cluster4   7      28.2
Total      35

Table 4: Data summary for 2019
Cluster    Size   Mean age (years)
cluster1   13     48.3
cluster2   11     32.6
cluster3   3      8.3
cluster4   8      23.2
Total      35
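The yearly shift described in the discussion can be checked directly from the values in tables 1 to 4. The Python sketch below computes a size-weighted mean of the cluster ages for each year. It assumes that the age column holds the mean age of each cluster, which the text suggests but does not state outright, and the names tables and weighted_mean_age are illustrative.

```python
# (cluster size, cluster age) pairs copied from tables 1 to 4.
tables = {
    2016: [(14, 75.3), (13, 63.0), (3, 44.0), (5, 50.6)],
    2017: [(10, 62.7), (15, 52.9), (2, 14.0), (8, 41.75)],
    2018: [(11, 47.7), (10, 39.6), (7, 15.2), (7, 28.2)],
    2019: [(13, 48.3), (11, 32.6), (3, 8.3), (8, 23.2)],
}


def weighted_mean_age(clusters):
    """Mean of the cluster ages, weighted by the number of patients per cluster."""
    total = sum(size for size, _ in clusters)
    return sum(size * age for size, age in clusters) / total


summary = {year: round(weighted_mean_age(c), 1) for year, c in tables.items()}
for year, value in summary.items():
    print(year, value)
```

With the table values as given, the weighted mean comes to roughly 64.5, 50.9, 35.0 and 34.2 years for 2016 to 2019, which is consistent with the downward trend noted in the discussion.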
References
[1] Kuan Fan, Pengzhi Hu, Chengyuan Song, Xiong Deng, Jie Wen, Yiming
Liu, Hao Deng, "Novel Compound Heterozygous PRKN Variants in a
Han-Chinese Family with Early-Onset Parkinson’s Disease," Parkinson’s
Disease, (2019).
[2] Priyansha Raj Sinha, Amit Alexander Charan, "Parkinson’s disease: A
review article," The Pharma Innovation, 6, no. 9, Part H, (2017), 511.
[3] Shianghau Wu, Jiannjong Guo, "A Data Mining Analysis of the
Parkinson’s Disease," iBusiness, 3, no. 1, (2011), 71–75.
[4] J. Thomas, R. Theresa Princy, "Human heart disease prediction system
using data mining techniques," In 2016 International Conference on
Circuit, Power and Computing Technologies (ICCPCT), IEEE, (2016), 1–5.
[5] M. Deepika, K. Kalaiselvi, "An Empirical study on Disease Diagnosis
using Data Mining Techniques," In 2018 Second International Conference
on Inventive Communication and Computational Technologies (ICICCT),
IEEE, (2018), 615–620.
[6] Xue-Hui Meng, Yi-Xiang Huang, Dong-Ping Rao, Qiu Zhang, Qing Liu,
"Comparison of three data mining models for predicting diabetes or
prediabetes by risk factors," The Kaohsiung journal of medical sciences,
29, no. 2, (2013), 93–99.
[7] Abolfazl Saghafi, Chris P. Tsokos, Rebecca D. Wooten, "On Heredity
Factors of Parkinson’s Disease: A Parametric and Bayesian Analysis,"
Advances in Parkinson’s Disease, 7, no. 3, (2018), 31–42.
[8] Jiawei Han, Micheline Kamber, Jian Pei, Data mining concepts and
techniques, 3rd edition, Morgan Kaufmann, 2011.
[9] Hina Gulati, P. K. Singh, "Clustering techniques in data mining: A
comparison," In 2015 2nd international conference on computing for
sustainable global development (INDIACom), IEEE, (2015), 410–415.
[10] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd
Edition, Elsevier, 2006.
[11] T. Madhulatha, T. Soni, "An overview on clustering methods," arXiv
preprint arXiv:1205.1117, (2012).
[12] Lin Liao, Zhen Jia, Yang Deng, "Coarse-Graining Method Based on
Hierarchical Clustering on Complex Networks," Communications and
Network, 11, no. 1, (2019), 21–34.
[13] Ling-Li Jiang, Yu-Xiang Cao, Hua-Kui Yin, Kong-Shu Deng, "An
improved kernel k-means cluster method and its application in fault
diagnosis of roller bearing," (2013).
[14] Manyun Lin, Xiangang Zhao, Cunqun Fan, Lizi Xie, Lan Wei, Peng
Guo, "Polarimetric Meteorological Satellite Data Processing Software
Classification Based on Principal Component Analysis and Improved
k-means Algorithm," Journal of Geoscience and Environment Protection,
5, no. 7, (2017), 39.
[15] Siwei Wang, Miaomiao Li, Ning Hu, En Zhu, Jingtao Hu, Xinwang
Liu, Jianping Yin, "k-means clustering with incomplete data," IEEE
Access, 7, (2019), 69162–69171.