56 Telfor Journal, Vol. 8, No. 1, 2016.
Abstract — A problem of soil clustering based on the
chemical characteristics of soil, and proper visual
representation of the obtained results, is analysed in the paper.
To that aim, K-means and fuzzy K-means algorithms are
adapted for soil data clustering. A database of soil
characteristics sampled in Montenegro is used for a
comparative analysis of implemented algorithms. The
procedure of setting proper values for control parameters of
fuzzy K-means is illustrated on the used database. In addition,
validation of clustering is made through visualisation.
Classified soil data are presented on the static Google map and
dynamic Open Street Map.
Keywords: clustering, data mining, K-means, fuzzy K-means, pedologic map.
I. INTRODUCTION
In Montenegro, in the period 1958-1988, a detailed soil
map at a scale of 1:50 000 was made. Unfortunately, as
in other former Yugoslav republics, this enormous
effort was not properly presented to the wider
professional community and land users, since the data and
the map were available only in hard copy. The
goal of this paper is to investigate the possibility of using
data mining tools for soil classification based on the
available data and to illustrate adequate visualisation, in
order to make these data understandable to a wider audience.
Data mining (DM) is a set of techniques that aims to
discover implicit, useful information in large data sets.
Information discovery is usually performed by identifying
patterns and establishing relationships, which allows
focusing on the most important information in the data.
DM includes clustering, anomaly detection, association
rule learning, classification, regression, summarization,
sequence or path analysis, and forecasting. DM is the
computer-assisted process of "digging" through enormous
databases in order to analyse and extract the meaning of the
data.
This paper focuses on clustering. Clustering has found
applications in many research areas such as mathematics,
engineering, economics, marketing, machine learning,
pattern recognition, genetics, bioinformatics, psychology,
biology, data compression and information retrieval.
Clustering is a process of grouping similar data samples. This
grouping is unsupervised; it is done without using known
structures in the data. Clustering aims to make clusters with
data samples which are more similar to each other than to
data samples that belong to the other clusters. Each cluster
is defined by a central point, a centroid. Similarity of data
in one cluster can be measured by using different criteria. Thus,
there are many different methods which can solve the
general task of clustering [1].
Two types of K-means algorithm are analysed in this
paper and the obtained results are discussed. The standard
K-means algorithm divides the data samples into exclusive
clusters. The initial values of the clusters' centroids are
randomly selected from the available data. Updating the
centroids and clustering the data is then repeated until
convergence is reached or for a defined number of
iterations. A new centroid for a cluster is calculated based
on each data sample that belongs to that cluster. The first
issue in the application of K-means-type algorithms is that the
number of clusters should be known in advance. Thus,
before discovering knowledge from the data, we have to
know how many clusters we expect in a database. The
second issue is that algorithms of this kind are very sensitive
to the initial clusters' centroids, which are usually
chosen randomly.
In fuzzy K-means clustering, data samples belong to
every estimated cluster with a certain belonging degree.
Hence, the result of this algorithm is not a set of exclusive clusters,
but clusters with fuzzy borders. The fuzzifier is a parameter
which defines the fuzziness of fuzzy K-means clusters. This
method is used for clustering data in cases where clusters
are not clearly defined and one cannot estimate clear
borders among data samples.
In this paper, standard and fuzzy K-means clustering of
soil data is implemented for a database and the results are
graphically presented. Fuzzy K-means
clustering is tested for different values of the fuzzifier, and a
procedure for selecting the fuzzifier for a database is
proposed. In addition to simple graphical representation,
maps are a proper way of presenting the results of soil
clustering. Thus, the results of K-means clustering of soil are
presented on the static Google map and dynamic Open Street
Map.
Soil data clustering by using K-means and fuzzy
K-means algorithm
Elma Hot and Vesna Popović-Bugarin, member, IEEE
Paper received May 24, 2016; accepted June 12, 2016. Date of
publication July 20, 2016. The associate editor coordinating the review of
this manuscript and approving it for publication was Prof. Miroslav
Lutovac.
This paper is a revised and expanded version of the paper presented at
the 23rd Telecommunications Forum TELFOR 2015 [21].
This research is supported in part by FP7 Project Fore Mont and Project
for establishment of pilot Montenegrin Centre of Excellence in
Bioinformatics - BIO-ICT (Contract No. 01-1001).
Elma Hot, Faculty of Electrical Engineering Podgorica, University of
Montenegro, Džordža Vašingtona bb, 81000 Podgorica (Phone: +382
20 245 839, Fax: +382 20 245 873, e-mail: elma_hot@live.com).
Vesna Popović-Bugarin, Faculty of Electrical Engineering Podgorica,
University of Montenegro, Džordža Vašingtona bb, 81000 Podgorica
(Phone: +382 20 245 839, Fax: +382 20 245 873, e-mail: pvesna@ac.me).
Hot and Popović-Bugarin: Soil data clustering by using K-means and fuzzy K-means algorithm 57
The paper is organized as follows. In Section II K-means
clustering is presented; in Section III the fuzzy K-means
algorithm is reviewed. The performance of the analysed
algorithms in clustering data samples described by the
chemical characteristics of soil is analysed in Section IV
through simulation results. A conclusion is drawn in
Section V.
II. K-MEANS CLUSTERING
K-means (KM) clustering is a widely used partitioning
method. This method aims to make K mutually exclusive
clusters of n data samples characterised by d parameters.
Each of the K clusters is defined by one central point (centroid)
determined by a certain combination of the parameters
contained in each data sample.
KM is known as a method of vector quantization, since it is
based on the location of points and their mutual distances
[2]. Namely, data samples described by d parameters can be
presented as points in a d-dimensional space, where their
coordinates are determined by the values of the d parameters.
A data sample belongs to the cluster whose centroid
is the closest one to the considered sample (point).
The closest centroid is chosen after calculating the distance
of each data sample from each centroid. Each data sample can
belong to exactly one cluster. Hence, KM clustering is also
called hard clustering.
A. K-means clustering algorithm
The KM algorithm aims to distribute a set X of n data
samples into K clusters. Each data sample is defined by d
parameters. We consider data samples as points in a
d-dimensional space for better visualization. The input to the
algorithm is the number of clusters K. The initial values of the
centroids $c_1^1, c_2^1, \ldots, c_K^1$, $c_i \in \mathbb{R}^d$, are chosen randomly from
the available data samples. After the calculation of the
distance of each data sample from set X to each cluster's
centroid, each data sample is declared to be a member of its
closest cluster. The set of data samples that belong to the cluster
defined by centroid $c_i$ is denoted as $C_i$, $1 \le i \le K$. In each
iteration, a centroid is estimated as the mean value of the d
corresponding parameters of all data samples which are
members of the corresponding cluster. Calculating K new
centroids in each iteration is equivalent to moving the
clusters in the d-dimensional space until their optimal
positions are reached. The processes of clustering
and updating centroids are repeated until convergence has
been reached or for a specified number of iterations.
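The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Java implementation used in the paper: the function name `kmeans`, the `seed` argument and the toy points are hypothetical, and the loop stops when assignments no longer change, which is a simpler stand-in for a centroid-based threshold.

```python
import math
import random


def kmeans(samples, k, max_iter=100, seed=0):
    """Hard K-means on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    # Initial centroids are drawn at random from the data samples.
    centroids = [list(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(s, centroids[j]))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments stopped changing (assumed stopping rule)
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On this toy set the two groups are far enough apart that the first three and the last three points end up in separate clusters regardless of the random start.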
One drawback of KM clustering appears when a point is
equally close to more than one centroid. In this case, the
algorithm may not converge, since the membership of this point can
oscillate among a few different clusters, resulting in
different clusterings. However, this rarely happens in
practice [3], [4].
III. FUZZY K-MEANS CLUSTERING
The fuzzy K-means (FKM) clustering method is a modification
of the standard KM clustering. As in KM clustering, initial
centroids are chosen randomly. Each iteration in FKM
clustering also starts with calculating the distance of each
data sample to each centroid. However, in FKM there is a
belonging degree, which decreases as that
distance increases. A data sample belongs to every cluster with a
certain degree [5]. Hence, borders among clusters are fuzzy.
FKM clustering is also referred to as soft clustering.
In FKM clustering, all data samples affect the calculation
of new centroids. The impact of a data sample on the
calculation of a cluster's centroid is proportional to the
degree of its belonging to that cluster. The rest of the
FKM algorithm is the same as in the KM algorithm.
Fig. 1. K-means clustering algorithm.
A. Fuzzy K-means clustering algorithm
The input to the FKM algorithm is the number of clusters K.
For n data samples the algorithm gives as a result an $n \times K$
matrix W, with elements:
$$w_{i,j} = \frac{1}{\sum_{k=1}^{K} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}. \tag{1}$$
$w_{i,j}$ is the degree to which element $x_i$ belongs to cluster
$c_j$, $0 \le w_{i,j} \le 1$. $m \in \mathbb{R}$ is the fuzzifier; it defines the level
of cluster fuzziness, and $m > 1$. In the absence of a priori
knowledge of the data fuzziness, it is recommended to set
the fuzzifier according to the database and the expected results
of clustering. By adjusting the fuzzifier m, the borders between
clusters can be made fuzzier or clearer. The procedure
for its selection is illustrated in the next section through the
simulation results of the FKM algorithm for different values of
m.
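To make the role of the fuzzifier concrete, the following sketch computes the belonging degrees of one sample for several values of m. It assumes that (1) has the standard fuzzy c-means form with exponent 2/(m-1); the function name `memberships` and the toy centroids are hypothetical and not taken from the paper.

```python
import math


def memberships(sample, centroids, m=2.0):
    """Belonging degrees of one sample to every centroid (fuzzifier m > 1)."""
    dists = [math.dist(sample, c) for c in centroids]
    # A sample sitting exactly on a centroid belongs fully to that cluster.
    if 0.0 in dists:
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # Assumed form of (1): w_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1))
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


# Hypothetical example: distance 1 to the first centroid, 3 to the second.
centroids = [(0.0, 0.0), (4.0, 0.0)]
sample = (1.0, 0.0)
for m in (1.1, 2.0, 5.0):
    print(m, [round(w, 3) for w in memberships(sample, centroids, m)])
```

For m = 2 the sample belongs to the nearer cluster with degree 0.9 and to the farther one with degree 0.1; smaller values of m push the degrees towards 1 and 0, while larger values pull them towards each other.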
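New centroids for the K clusters are calculated on the basis of
all the data samples:
$$c_k = \frac{\sum_{x} w_k(x)^m \, x}{\sum_{x} w_k(x)^m}, \tag{2}$$
where $w_k(x)$ denotes the belonging degree of sample $x$ to cluster $k$, i.e. $w_k(x_i) = w_{i,k}$ from (1).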
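The weighted mean in (2) can be sketched as follows. The function name `update_centroids`, the row-per-sample layout of the weights and the toy values are assumptions made for illustration only.

```python
def update_centroids(samples, weights, m=2.0):
    """Weighted-mean centroid update in the spirit of (2).

    samples: list of n points (tuples of length d)
    weights: n rows of K belonging degrees; weights[i][k] is the degree
             of sample i to cluster k (hypothetical layout)
    """
    n_clusters = len(weights[0])
    dim = len(samples[0])
    centroids = []
    for k in range(n_clusters):
        # Every sample contributes, scaled by its belonging degree raised to m.
        coeffs = [row[k] ** m for row in weights]
        total = sum(coeffs)  # assumed non-zero for every cluster
        centroids.append(
            [sum(c * s[a] for c, s in zip(coeffs, samples)) / total
             for a in range(dim)]
        )
    return centroids


# Hypothetical toy data: the middle sample is shared equally by both clusters.
samples = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(update_centroids(samples, weights, m=2.0))
```

With m = 2 the shared sample enters each sum with coefficient 0.25, so it pulls the first centroid from 0 to 0.4 and the second from 10 to 8.4 along the first axis.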
The procedure repeats until convergence is reached or for a
specified maximum number of iterations. Both algorithms
converge when
$$\lVert c_k^{t+1} - c_k^{t} \rVert < \varepsilon \, \lVert c_k^{t} \rVert,$$
where t is the iteration number and $\varepsilon$ is a sensitivity threshold. In all our
experiments, this threshold is 0.01.
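One way to read this stopping rule is as a relative-change test on every centroid. The sketch below is an assumption about its exact form: the Euclidean norm, the strict inequality and the helper name `has_converged` are not confirmed by the text, only the threshold value 0.01 is.

```python
import math


def has_converged(old_centroids, new_centroids, eps=0.01):
    """True when every centroid moved by less than eps relative to its norm."""
    for old, new in zip(old_centroids, new_centroids):
        shift = math.dist(old, new)   # how far the centroid moved
        scale = math.hypot(*old)      # size of the previous centroid
        # Fall back to an absolute test for a centroid at the origin.
        limit = eps * scale if scale > 0 else eps
        if shift >= limit:
            return False
    return True


# Hypothetical values: a 0.5 % move passes, a 5 % move does not.
print(has_converged([(10.0, 0.0)], [(10.05, 0.0)]))
print(has_converged([(10.0, 0.0)], [(10.5, 0.0)]))
```

A relative test of this kind keeps the threshold meaningful when the soil parameters have very different magnitudes, which is one reason to prefer it over a fixed absolute distance.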
FKM clustering is computationally more complex than
KM clustering. KM calculates the distances and chooses the
smallest one in order to find the cluster to which a data sample belongs,
while FKM performs additional $K \times d$ multiplications for
each data sample, see (1) and (2). However, the FKM algorithm gives
better results than the KM algorithm for a data set with overlapping
clusters. Moreover, when data samples
are equally close to more than one centroid, the FKM
algorithm will not oscillate, unlike the KM algorithm. The FKM
algorithm will assign these data samples an equal belonging degree
to more than one cluster.
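The tie case can be worked out directly. Assuming (1) has the standard fuzzy c-means form, take K = 2 and a sample $x_i$ at the same non-zero distance $\delta$ from both centroids; every distance ratio is then 1, so for any $m > 1$:

```latex
w_{i,1} = \frac{1}{\left(\frac{\delta}{\delta}\right)^{\frac{2}{m-1}} + \left(\frac{\delta}{\delta}\right)^{\frac{2}{m-1}}}
        = \frac{1}{1 + 1} = \frac{1}{2} = w_{i,2}.
```

The sample is shared equally between the two clusters and its contribution to both centroids is the same in every iteration, so there is nothing to oscillate.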
IV. SIMULATION RESULTS
Algorithms for KM and FKM are implemented in Java
and adapted for a soil database. A database of 2526 soil samples
collected in Montenegro is used for clustering.
The goal is to estimate to which soil type
a sample belongs, using the KM and FKM methods of
clustering. The number of clusters is chosen based on the
knowledge of the types of soil in Montenegro. The number
of clusters is essential to proper clustering, so it is
important to have a correct value of this parameter. Since
this parameter is not always known in advance, clustering
with a smaller and a larger number of clusters was also
investigated. The conclusion is that the final centroids are
quite similar to the initial centroids when the number
of clusters is smaller than the correct one. On the other hand,
when the number of clusters is larger than the correct
one, clusters with zero or nearly zero
elements appear. This was an additional confirmation of the
defined number of soil types in Montenegro. Moreover,
these conclusions can be used as a guide for determining
the correct number of clusters when this
information is not known in advance.
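The empty-cluster symptom can be checked mechanically after a run. The helper below is a hypothetical sketch: the name `small_clusters` and the `min_share` cut-off are assumptions, since the paper does not state what counts as "nearly zero".

```python
from collections import Counter


def small_clusters(labels, k, min_share=0.01):
    """Return indices of clusters holding fewer than min_share of all samples."""
    counts = Counter(labels)                 # cluster index -> number of members
    threshold = min_share * len(labels)      # assumed cut-off, not from the paper
    return [j for j in range(k) if counts[j] < threshold]


# Hypothetical labels from a run with K = 4 on 100 samples.
labels = [0] * 50 + [1] * 49 + [2] * 1
print(small_clusters(labels, k=4, min_share=0.02))
```

Here clusters 2 and 3 are reported, with one and zero members respectively, which under the reasoning above would suggest that K = 4 is too large for this toy labelling.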
In Montenegro there are four to six types of soil. In our
simulations, data samples with only six parameters are used
for clustering. These parameters are the numerical
values of chemical characteristics of soil samples.
Consequently, this clustering of soil can be considered a
basic one. Hence, the optimal number of clusters for this
basic clustering is taken to be four.
Fig. 2. Visualisation of soil database, each point is a data
sample with three parameters.
For the verification and visualization of the performance
of the algorithms, clustering of soil samples based on two or
three parameters into three or four clusters was done and is
presented in this section. MATLAB is used for the graphical
representation of results.
Data samples with three parameters from the soil database
are presented in a 3-D space before clustering (Fig. 2).
Clustering of soil samples is done by the KM and FKM
algorithms. The maximum number of iterations is 100 in
both algorithms.
The mean number of iterations needed for
convergence of the KM algorithm over 500 runs, when
clustering into four clusters based on six parameters, is 7.
Fig. 3 shows how the centroid parameters
converge for one such run. In the run shown, convergence is
reached after 8 iterations.
Fig. 3. Convergence of centroids of K-means clusters in
the case of clustering in four clusters based on six
parameters.
The mean number of iterations for the FKM
algorithm over 500 runs is 21 for the same case of clustering
into four clusters based on six parameters. In the run
presented in Fig. 4,
convergence is reached after 14 iterations.
Fig. 4. Convergence of centroids of fuzzy K-means
clusters in the case of clustering in four clusters based on
six parameters.
A. Results of K-means Clustering
Results of K-means clustering based on two parameters
for three and four clusters are presented graphically in
Fig. 5 a) and b), respectively. Each data sample belongs to only
one cluster. Hence, each colour represents one
cluster.
Clustering of the same data into four clusters based on three
parameters is presented in Fig. 6.
Fig. 5. K-means clustering based on two parameters. Each
colour presents one cluster a) K=3; b) K=4.
Fig. 6. Clustering in four clusters (types) based on three
parameters. Each colour presents one cluster. a), b), c)
show different projections in three-dimensional space.
B. Results of Fuzzy K-means Clustering
Validation of FKM clustering results is also done by
visualisation. In FKM clustering each data sample belongs
to each cluster, with a different belonging degree. Hence,
each FKM cluster was drawn on a separate graph, while the
colours of samples depend on the belonging degree of each
sample to that cluster. This way of visualisation shows
clearly how far each cluster extends. Dark
red represents a belonging degree close to 100%. Yellow and
green represent data samples which partly belong to the plotted
cluster. Light blue to dark blue colours represent belonging
degrees of less than 30%.
Parameter m is the fuzzifier and it defines the level of
cluster fuzziness [6]. FKM clustering of the soil database was
done with different values of m. Results of FKM clustering
with different fuzzifiers are presented in Figs. 7 and 8.
Fig. 7 a) presents four FKM clusters of the soil database,
where the fuzzifier m is 1.1. Dark red is the dominant colour
around centroids and the borders between clusters are clear.
Hence, fuzziness of clusters is missing, so the results are similar
to KM clusters.
The second example is FKM clustering where the fuzzifier is
2 (Fig. 7 b). The appearance of all colours, from dark red to
dark blue, means that all belonging degrees from 0 to 100%
appear. Different clusters are visible but the borders are still
fuzzy.
Fig. 7. Fuzzy K-means clustering based on three
parameters, each graph presents one cluster, colour
depends on belonging degree, K=4 a) m=1.1; b) m=2.
FKM clusters with a fuzzifier of 3 are shown in Fig. 8 a).
Clusters are visible, but dark red colours are missing. This
means that belonging degrees close to 100% do not exist. All
belonging degrees in this case are smaller, and blue
is the dominant colour. Hence, fuzziness of clusters is
significant. The borders between clusters are less visible than in
the cases where the value of m is smaller.
The results of FKM clustering of the same database with
a fuzzifier of 5 are shown in Fig. 8 b). Clusters are not visible;
all belonging degrees are nearly equal. Hence, this clustering is
unsuccessful and pointless. The belonging degree of each
sample tends to be equal for every cluster when the fuzzifier
takes values higher than 3.
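This flattening of belonging degrees for large fuzzifiers, and the near-crisp result for m = 1.1, can be checked against the membership formula. Assuming (1) has the standard fuzzy c-means form with exponent 2/(m-1), and a sample $x_i$ that does not coincide with any centroid, the two limits are:

```latex
\lim_{m \to \infty} w_{i,j}
  = \lim_{m \to \infty}
    \frac{1}{\sum_{k=1}^{K} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}
  = \frac{1}{\sum_{k=1}^{K} 1} = \frac{1}{K},
\qquad
\lim_{m \to 1^{+}} w_{i,j} =
  \begin{cases}
    1, & c_j \text{ is the unique closest centroid to } x_i, \\
    0, & \text{otherwise.}
  \end{cases}
```

For K = 4 the first limit gives 0.25 for every cluster, so no sample can reach the dark red range; the second limit reproduces the hard assignment of KM.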
The conclusion from these examples is that a fuzzifier
value of 2 is optimal for this database: clusters are visible and
their borders are fuzzy enough. These examples illustrate
the procedure for choosing the fuzzifier for a specific database.
Having in mind this conclusion, FKM clustering
of the soil database is performed with a fuzzifier of 2.
Fig. 8. Fuzzy K-means clustering based on three
parameters; each graph presents one cluster; colour
depends on belonging degree, K=4 a) m=3; b) m=5.
The result of the FKM clustering of the soil database based
on two parameters and three clusters is presented in Fig. 9.
Clustering is done based on two parameters, so the coordinates
represent the position of soil samples in 2D space. Belonging
degrees decrease with the distance between
samples and centroids in 2D space. Increasing the distance
between a sample and a centroid reduces the belonging
degree, and the colours change accordingly from red to blue.
Based on these graphs, it is concluded that the FKM
algorithm clustered the soil data successfully.
Fig. 9. Fuzzy K-means clustering in three clusters, based
on two parameters. Each graph presents one cluster, while
colour depends on belonging degree.
Fig. 10 presents the result of FKM clustering of the same
database into four clusters based on two parameters. As in
the previous case, it can be concluded that the clustering of
soil data is successful.
Fig. 10. Fuzzy K-means clustering in four clusters, based
on two parameters. Each graph presents one cluster, while
colour depends on belonging degree, K=4.
Besides physical and chemical characteristics, soil samples in
the used database also have location coordinates.
The coordinates are given in
meters in the coordinate system MGI 1901 / Balkans zone
6. First, the coordinates are converted to longitude and latitude.
A graphical environment for presenting the results of
clustering on the static Google map of Montenegro is
implemented in Java (Fig. 11). Soil samples are labelled
with markers, and the colour of a marker depends on the
sample's soil type. The environment allows searching the map by
municipalities and soil types, and adjusting the zoom.
Different types of maps are available: roadmap, satellite,
hybrid and terrain.
Fig. 11. K-means clustering of soil samples presented on
Google static map.
The second method of presenting the results of soil
clustering is on dynamic maps. Dynamic maps are
implemented using the R programming language together with
Leaflet, an open-source JavaScript library
for interactive maps. An Open Street dynamic map is shown
in Fig. 12. Markers represent all soil samples from the
database that have coordinates. The colour of a marker represents
the result of KM clustering. Each colour is related to a
different cluster. Four clusters are depicted.
Fig. 12. Dynamic Open Street map with results of KM
clustering; markers of different colours present different
KM clusters.
This map can be considered a basic pedologic map,
because clustering is based on only six chemical
parameters and only four clusters were made. More precise
mapping of KM results is possible by using more soil
parameters and a larger number of clusters. On the map in
Fig. 12 it is visible that two types of soil are dominant, shown
in blue and green. Comparing this to pedologic maps of
Montenegro made by experts, it can be seen that the two
dominant soil types are the same (in the same parts of
Montenegro) on both maps.
Using maps for presenting soil data and clustering results
allows marking data with markers, polygons, raster images,
images of soil profiles etc.
V. CONCLUSION
The problem of soil data clustering and visualization is
analysed in the paper. Data mining techniques, KM and
FKM, are adapted for this purpose. The visualisation of KM
and FKM results is used for the validation of results.
Results obtained by using KM are presented on the static
Google map and dynamic Open Street Map of Montenegro.
Presenting soil data and data mining results on maps is a
proper way of delivering them to scientists, land users and
people who want to get information about soil in
Montenegro. Our future work will be dedicated to
improving data mining techniques and publishing all results
through a web application.
REFERENCES
[1] J. C. Bezdek, R. Ehrlich, W. Full, “FCM: The Fuzzy C-means
Clustering Algorithm”, Computers & Geosciences, vol. 10, no. 2-3,
pp. 191-203, 1984.
[2] “Vector Quantization and Clustering”, Courses of Electrical
Engineering and Computer Science, Massachusetts Institute of
Technology.
[3] J. MacQueen, “Some Methods for Classification and Analysis of
Multivariate Observations”, Proc. of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, vol. 1, pp. 281-297, University
of California Press, 1967.
[4] Andrew Ng, “CS229 Lecture Notes”, Machine Learning Course
Materials.
[5] S. Ghosh, S. K. Dubey, “Comparative Analysis of K-Means and
Fuzzy C-Means Algorithms”, (IJACSA) International Journal of
Advanced Computer Science and Applications, vol. 4, no. 4, 2013.
[6] E. Hot, V. Popović-Bugarin, A. Topalović, M. Knežević,
“Generating thematic pedologic maps by using data mining and
interpolations,” submitted for 3rd International Conference on
Electrical, Electronic and Computing Engineering IcETRAN 2016,
Zlatibor, Serbia, June 2016.
[7] Md. K. I. Rahmani, N. Pal, K. Arora, “Clustering of Image Data
Using K-Means and Fuzzy K-Means”, (IJACSA) International
Journal of Advanced Computer Science and Applications, Vol. 5, No.
7, 2014
[8] R. Suganya, R. Shanthi, “Fuzzy C-Means Algorithm - A Review”,
International Journal of Scientific and Research Publications, vol. 2,
issue 11, November 2012, ISSN 2250-3153.
[9] E. Hot, V. Popović-Bugarin, “Analysis Of Fuzzy K-Means
Clustering Method Using Database Of Soil Samples Sampled In
Montenegro,” Information Technology IT 2016, Žabljak,
Montenegro, February 2016
[10] J. Balkovič, Z. Rampasekova, V. Hutar, J. Sobocka and R. Skalsky,
“Digital Soil Mapping from Conventional Field Soil Observations,”
Soil & Water Res., 8, 2013 (1): 13–25
[11] S. Har-Peled, B. Sadri, “How Fast is the k-means Method?”,
January 2, 2010.
[12] Singaravelu.S, A.Sherin and S.Savitha “Agglomerative Fuzzy K-
Means Clustering Algorithm”, Nehru E-Journal A Journal of Nehru
Arts and Science College (NASC) Research Article
[13] S. Chattopadhyay, D. Kumar Pratihar, S. C. De Sarkar, “A
Comparative Study of Fuzzy C-Means Algorithm and Entropy-
Based Fuzzy Clustering Algorithms”, Computing and Informatics,
Vol. 30, 2011, 701–720
[14] L. G. Vendrusculo, A. L. Kaleita, “Terrain Analysis and Data Mining
Techniques Applied to Location of Classic Gully in a Watershed”,
2013 ASABE Annual International Meeting
[15] L. Rokach, O. Maimon, “Clustering Methods”, In The Data Mining
and Knowledge Discovery Handbook, pages 321–352. 2005.
[16] J. C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use
in Detecting Compact Well-Separated Clusters", Journal of
Cybernetics 3: pp. 32-57, 1973
[17] D. Rajesh, “Application of Spatial Data Mining for Agriculture”,
International Journal of Computer Applications (0975 – 8887),
vol. 15, no. 2, February 2011.
[18] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “From Data Mining to
Knowledge Discovery in Databases,” AI Magazine, Vol 17, No 3,
1996
[19] B. Fustic, G. Djuretic, “The Soils of Montenegro,” University of
Montenegro and Biotechnical Institute, Podgorica, Montenegro,
2000.
[20] S. Ghosh, S. K. Dubey, “Comparative Analysis of K-Means and
Fuzzy C-Means Algorithms,” (IJACSA) International Journal of
Advanced Computer Science and Applications, vol. 4, no.4, 2013.
[21] E. Hot and V. Popović-Bugarin, "Soil data clustering by using K-
means and fuzzy K-means algorithm," Telecommunications Forum
Telfor (TELFOR), 2015 23rd, Belgrade, 2015, pp. 890-893.
... Data Mining or extracting knowledge from repositories is a conventionally new trend this not only helps to predict but it also discovers hidden patterns of data. Agricultural data mining is a technology-driven solution that can bring knowledge and information to agricultural development [12]. The various techniques of data mining are available according to CRISP standards. ...
Article
Full-text available
This paper explores the crucial relationship between soil quality and agricultural productivity, emphasizing the pivotal role of soil in sustaining human life. Focusing on the agricultural heritage of Kolar District, Karnataka, India, and its contemporary challenges such as soil degradation, the study employs the clustering technique of data mining to analyze physicochemical soil properties. By applying K-means clustering, the research identifies distinct soil characteristics in various regions of Kolar, providing valuable insights for farmers to optimize agricultural practices. Through statistical analysis and evaluation using the Rattle tool of R-Analytics, the study elucidates practical strategies for improving soil fertility and enhancing crop yields. This work serves as a foundation for informed decision-making in agriculture, addressing the urgent need to understand and mitigate soil quality issues to ensure sustainable food production and livelihoods.
... Based on the historical data [15] used linear regression algorithm to predict the temperature and rainfall of a particular region in three different seasons, i.e., monsoon, winter and summer. Reference [16] used K-means clustering model to classify the soils based on the chemical properties of the soil. Senors were used to capture and store the data of the soil. ...
Article
Full-text available
The agriculture sector is a vital part of India’s economy. About 54.6% of the workers are employed in agricultural and allied activities, and 18.8% of the India’s gross value added (GVA) is generated by these activities. One of the common problems faced by Indian young farmers is choosing the right crop according to the soil conditions. This has led to a significant setback in productivity in agriculture. This study will help the farmers to determine which crop will be suitable to grow on their soil; thus, the prime motive of this study is to create economic welfare to farmers. The dataset used in this study was from different Indian government website and is publicly available. Based on seven different attributes, i.e., nitrogen, phosphorous, potassium, temperature, relative humidity, pH value, and rainfall, a crop is recommended to grow. Four different machine learning algorithms, i.e., Naive Bayes, decision tree, logistic regression, and random forest were used on the data. Random forest’s testing accuracy, i.e., R2R^2 was about 99%, and hence, it was used to develop and deploy a cloud-based app which recommends a particular crop to be grown in a particular soil.
... Consequently, the function of computers in consumer grouping becomes crucial. Using unsupervised learning, computers group data based on a variety of variables [15], [16]. Unsupervised learning aids in modeling consumer behavior and personalities that may have been previously overlooked [17]. ...
Article
Full-text available
Buyers are the most crucial entities for companies selling products, including PT Paragon Technology and Innovation. PT Paragon Technology and Innovation is a cosmetics company that oversees well-known brands such as Wardah, Emina, MakeOver, and Kahf. It is essential for this company to understand the characteristics of its buyers who purchase their products, and one way to achieve this is by conducting consumer segmentation. This consumer segmentation is carried out on customers who have purchased Emina products from March 2021 to March 2023, using three types of RFM analysis approaches: vanilla RFM analysis, RFM+Lifetime, and RFM/Lifetime, which are then grouped using the K-Means algorithm. Through the implementation of this consumer segmentation, the company can gain a deeper understanding of its buyers' behavior towards the products they offer, thereby enhancing business processes and marketing efforts. The consumer segmentation has been completed with the finding that out of the three types of RFM analysis approaches employed for consumer segmentation, the RFM+Lifetime approach is the most effective and relevant one, resulting in four categories: Make Up, Face Care, Others, and General. The Make Up category further consists of five segments, while each of the other categories contains four segments.
... However, we retained a climate variables (cessation date) due to its importance in determining the timing and availability of green water resource in the rain-fed-based agricultural system. The final data processing step was clustering and in this study the K-means spatial clustering [38][39][40][41] was employed to find homogenous climatic regions of the study area. Kmeans is generally simple to implement and it can be used with large datasets having medium to coarse spatial resolution. ...
Preprint
Full-text available
Spatiotemporal climate variability is a leading environmental constraint to the rain-fed agricultural productivity and food security of communities in the Abbay basin and elsewhere in Ethiopia. The previous one-size-fits-all approach to soil and water management technology targeting did not effectively address climate-induced risks to rain-fed agriculture. This study, therefore, delineates homogenous climatic regions and identifies climate-induced risks to rain-fed agriculture that are important to guide decisions and selection of site-specific technologies for green water management in the Abbay basin. The k-means spatial clustering method was employed to identify homogenous climatic regions in the study area, while the Elbow method was used to determine an optimum number of the climate clusters. The k-mean clustering used the Enhancing National Climate Services (ENACTS) daily rainfall, minimum and maximum temperatures, and other derived climate variables that include daily rainfall amount, length of growing period (LGP), rainfall onset and cessation dates, and rainfall intensity, temperature, potential evapotranspiration (PET), soil moisture and AsterDEM to define climate regions. Accordingly, 12 climate clusters or regions were identified and mapped for the basin. Clustering a given geographic region into homogenous climate classes is useful to accurately identify and target locally relevant green water management technologies to effectively address local-scale climate-induced risks. The study has also provides a methodological framework that can be used in the other river basins of Ethiopia and indeed elsewhere.
... To group existing soil samples based on their physicochemical properties and to have a suitable visual representation of the achieved results, the k-means algorithm was adapted for soil data clustering. The k-means algorithm forms clusters of similar data samples based on the squared Euclidean distance (Hartigan & Wong, 1979; Hot & Popović-Bugarin, 2016). These analyses were performed using R 3.2.5 (R Development Core Team, 2015) and RStudio (version 0.99.903) ...
Article
Full-text available
Mining activities accumulate large quantities of waste in tailing ponds, which results in several environmental impacts. In the Cartagena–La Unión mining district (SE Spain), a field experiment was carried out in a tailing pond to evaluate the effect of aided phytostabilization on reducing the bioavailability of zinc (Zn), lead (Pb), copper (Cu) and cadmium (Cd) and enhancing soil quality. Nine native plant species were planted, and pig manure and slurry along with marble waste were used as amendments. After 3 years, the vegetation developed heterogeneously on the pond surface. In order to evaluate the factors affecting this inequality, four areas with different vegetation cover (VC) and an area without treatment (control area) were sampled. Soil physicochemical properties, total, bioavailable and soluble metals, and metal sequential extraction were determined. Results revealed that pH, organic carbon, calcium carbonate equivalent and total nitrogen increased after the aided phytostabilization, while electrical conductivity (EC), total sulfur and bioavailable metals significantly decreased. In addition, results indicated that differences in VC among sampled areas were mainly owing to differences in pH, EC and concentration of soluble metals, which in turn were modified by the effect of non-restored areas on nearby restored areas after heavy rains, due to the lower elevation of the restored areas compared to the unrestored ones. Therefore, to achieve the most favorable and sustainable long-term results of aided phytostabilization, micro-topography should also be taken into consideration along with plant species and amendments, since it causes different soil characteristics and thus different plant growth and survival.
Article
Full-text available
Climate change and environmental degradation pose a significant threat to the global community. Soil management is one of the critical factors for achieving climate neutrality, as plants and soils together currently absorb approximately 30% of the CO2 emitted by human activities each year. This study focused on delineating soil management zones in olive groves to maintain soil health in complex environmental conditions and minimize adverse effects on the biological systems supported. The results of this study are crucial because they showed the potential of unsupervised machine learning techniques in this setting and important soil characteristics for defining management zones. They might significantly affect applying precision farming techniques and methods in olive groves, providing a possible remedy for the problems caused by climate change. A total of 222 soil samples at a depth of 0–30 cm were collected from three areas in the Region of Western Greece, at a density of 21 × 21 m in each area, and analyzed for physicochemical properties. Principal Component Analysis (PCA) was utilized to identify the critical soil properties for delineating the management zones. The soil samples were clustered using unsupervised machine learning methods, K-means, Hierarchical clustering, and DBSCAN. PCA is a method that can help in the selection of critical parameters for the delineation of management zones. Sand (S), Clay (C), Cation Exchange Capacity (CEC), Potassium (K), Calcium (Ca), Soil Organic Carbon (SOC), and total carbonates (CaCO3) can delineate management zones in olive cultivation. The management zones and how each field is separated vary depending on the clustering method.
Chapter
The degree of expansivity of clay depends on the mineral(s) present; therefore, identifying the mineral(s) in clay is essential to assess its swelling and shrinkage characteristics. Fine-grained soils can be classified as Kaolinitic or Montmorillonitic, where Montmorillonitic soils are relatively more expansive in nature. Expansive soils are very problematic as they affect the stability of structures founded on them. Techniques such as X-ray diffraction (XRD), differential thermal analysis and scanning electron microscopy (SEM) can identify the mineral(s) in clay with high accuracy; however, employing such techniques in soil investigation is not possible due to their sophistication and the handling of bulk heterogeneous soil mass. Many researchers and codes have suggested assessing the expansivity of soils based on index properties such as liquid limit, plastic limit, shrinkage limit, etc. This study aims to identify Kaolinitic soils, Montmorillonitic soils and mixtures of both by applying an unsupervised learning clustering technique, namely K-means clustering.
Chapter
Based on the production practice of low production pumping wells in the X block, the energy consumption analysis method based on cluster analysis is carried out by using the big data mining analysis method, which is aimed at the swabbing parameters, production and energy consumption data of pumping wells. The K-means clustering algorithm is used to classify low production wells, and the energy consumption characteristics are analyzed in combination with the clustering results, so as to provide direction guidance for the treatment and adjustment of low production and high consumption wells. After field application verification, 1167 pumping wells with liquid production less than 20 m³ are clustered into four categories, and reasonable optimization measures are formulated according to the energy consumption characteristics of various wells. The energy consumption characteristics of low production wells are obtained, and the distribution law of oil well energy consumption can be understood quickly by the established, which also provide a new idea for energy consumption analysis of pumping wells, and has important guiding significance for energy saving and tapping the potential of low production wells.
Keywords: Clustering, Big data, Energy consumption, Low production, Pumping unit
Article
Full-text available
Fuzzy clustering is useful to mine complex and multi-dimensional data sets, where the members have partial or fuzzy relations. Among the various developed techniques, the fuzzy C-means (FCM) algorithm is the most popular one, where a piece of data has partial membership with each of the pre-defined cluster centers. Moreover, in FCM, the cluster centers are virtual, that is, they are chosen at random and thus might be out of the data set. The cluster centers and membership values of the data points with them are updated through some iterations. On the other hand, the entropy-based fuzzy clustering (EFC) algorithm works based on a similarity-threshold value. Contrary to FCM, in EFC, the cluster centers are real, that is, they are chosen from the data points. In the present paper, the performances of these algorithms have been compared on four data sets, namely IRIS, WINES, OLITOS and psychosis (collected with the help of forty doctors), in terms of the quality of the clusters (that is, discrepancy factor, compactness, distinctness) obtained and their computational time. Moreover, the best set of clusters has been mapped into 2-D for visualization using a self-organizing map (SOM).
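The iterative update of cluster centers and membership values mentioned in this abstract can be sketched as follows. This is a minimal illustration of the textbook fuzzy C-means updates, not the code used in the paper: the function name `fuzzy_c_means`, the fuzzifier default `m = 2`, the stopping tolerance, the toy data, and the explicitly supplied starting centers (used here instead of random ones so that the run is reproducible) are all assumptions.

```python
def fuzzy_c_means(points, init_centers, m=2.0, max_iter=100, tol=1e-6):
    """Textbook FCM sketch; returns (centers, memberships).

    memberships[k][i] is the degree to which points[k] belongs to cluster i;
    each row sums to 1. Requires m > 1.
    """
    centers = [list(c) for c in init_centers]
    n, dim = len(points), len(points[0])
    memberships = None
    for _ in range(max_iter):
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)).
        new_memberships = []
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, v)) for v in centers]
            if min(d2) == 0.0:
                # Point coincides with a centre: give it full membership there.
                hits = [1.0 if x == 0.0 else 0.0 for x in d2]
                total = sum(hits)
                row = [h / total for h in hits]
            else:
                inv = [x ** (-1.0 / (m - 1.0)) for x in d2]
                total = sum(inv)
                row = [x / total for x in inv]
            new_memberships.append(row)

        # Centre update: mean of the points weighted by u_ik^m.
        for i in range(len(centers)):
            weights = [new_memberships[k][i] ** m for k in range(n)]
            total = sum(weights)
            centers[i] = [
                sum(weights[k] * points[k][d] for k in range(n)) / total
                for d in range(dim)
            ]

        # Stop once memberships barely change between iterations.
        if memberships is not None:
            shift = max(
                abs(a - b)
                for old, new in zip(memberships, new_memberships)
                for a, b in zip(old, new)
            )
            memberships = new_memberships
            if shift < tol:
                break
        else:
            memberships = new_memberships
    return centers, memberships


# Toy data (hypothetical): two compact groups plus one stray point between them.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
centers, memberships = fuzzy_c_means(data, init_centers=[(1.0, 1.0), (9.0, 9.0)])

for point, row in zip(data, memberships):
    print(point, [round(u, 3) for u in row])
```

Points inside a compact group end up with a membership close to 1 for one cluster, while the stray point keeps a split membership. This partial membership is what distinguishes the fuzzy result from a hard k-means assignment.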
Article
Full-text available
Clustering is a major technique used for grouping numerical and image data in data mining and image processing applications. Clustering makes image retrieval easier by finding images similar to the query image. The images are grouped together in a given number of clusters. Image data are grouped on the basis of features such as color, texture, shape, etc., contained in the images in the form of pixels. For efficiency and better results, image data are segmented before clustering is applied. The techniques used here are K-Means and Fuzzy K-Means, which are time-saving and efficient.
Article
Full-text available
In the arena of software, data mining technology has been considered a useful means for identifying patterns and trends in large volumes of data. This approach is basically used to extract unknown patterns from large sets of data for business as well as real-time applications. It is a computational intelligence discipline which has emerged as a valuable tool for data analysis, new knowledge discovery and autonomous decision making. The raw, unlabeled data from a large dataset can be classified initially in an unsupervised fashion by using cluster analysis, i.e. clustering: the assignment of a set of observations into clusters so that observations in the same cluster may in some sense be treated as similar. The outcome of the clustering process and the efficiency of its domain application are generally determined by the algorithms. There are various algorithms which are used to solve this problem. In this research work two important clustering algorithms, namely centroid-based K-Means and representative-object-based FCM (Fuzzy C-Means), are compared. These algorithms are applied and performance is evaluated on the basis of the efficiency of clustering output. The number of data points and the number of clusters are the factors upon which the behaviour patterns of both algorithms are analyzed. FCM produces results close to those of K-Means clustering, but it still requires more computation time than K-Means clustering.
Conference Paper
A problem of soil clustering and spatial representation of the obtained results, based on in-situ measurements of physical and chemical characteristics of soil, is analysed in the paper. K-means and fuzzy K-means algorithms are adapted for soil data clustering. A database of soil samples collected in Montenegro is used for a comparative analysis of the algorithms. Classified soil data are presented on a static Google map.
Conference Paper
Gullies are an extreme form of soil erosion that degrade diverse environments through the siltation of streams and water bodies. Indirectly, gully erosion compromises crop productivity by working as a link to watercourses, allowing movement of detached topsoil particles from agricultural fields during heavy storm events. Furthermore, studies have found a reduction of the catchment area when active gullies are present. This complex process involves multiple factors and needs to be studied consistently in order to locate the areas prone to gully erosion. The determination of gully areas depends upon topographical, geological, and hydrological characteristics; however, their location is mainly controlled by the high capacity of overland flow to cut the channel. We hypothesize that identification of gullies in agricultural landscapes can be performed from high-resolution elevation data products and unsupervised clustering approaches. In order to examine this hypothesis, we used variables resulting from LiDAR-based terrain analysis as input to three clustering techniques. The k-means, fuzzy k-means, and CLARA clustering algorithms were used to carry out the cluster analysis. The results of the cluster analysis suggested that 8 classes were optimal for grouping areas in the watershed. Elevation data from one field-scale watershed near Treynor in Pottawattamie County, IA, was used for calibration purposes, and terrain attributes (slope, flow accumulation, plan convexity, topographic wetness index, and stream power index) were calculated. In terms of the percentage of correctly classified pixels, the medoid-based approach (CLARA) obtained the best agreement of points within the gullied area (30.1%). The results of this research might speed up gully field surveys and can also serve as input to a conservation planning framework.
Article
Two fuzzy versions of the k-means optimal, least squared error partitioning problem are formulated for finite subsets X of a general inner product space. In both cases, the extremizing solutions are shown to be fixed points of a certain operator T on the class of fuzzy, k-partitions of X, and simple iteration of T provides an algorithm which has the descent property relative to the least squared error criterion function. In the first case, the range of T consists largely of ordinary (i.e. non-fuzzy) partitions of X and the associated iteration scheme is essentially the well known ISODATA process of Ball and Hall. However, in the second case, the range of T consists mainly of fuzzy partitions and the associated algorithm is new; when X consists of k compact well separated (CWS) clusters, Xi, this algorithm generates a limiting partition with membership functions which closely approximate the characteristic functions of the clusters Xi. However, when X is not the union of k CWS clusters, the limiting partition is truly fuzzy in the sense that the values of its component membership functions differ substantially from 0 or 1 over certain regions of X. Thus, unlike ISODATA, the “fuzzy” algorithm signals the presence or absence of CWS clusters in X. Furthermore, the fuzzy algorithm seems significantly less prone to the “cluster-splitting” tendency of ISODATA and may also be less easily diverted to uninteresting locally optimal partitions. Finally, for data sets X consisting of dense CWS clusters embedded in a diffuse background of strays, the structure of X is accurately reflected in the limiting partition generated by the fuzzy algorithm. Mathematical arguments and numerical results are offered in support of the foregoing assertions.
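For orientation, the least squared error criterion and the iteration discussed in this abstract are commonly written in the following form. The notation is the one usually found in later fuzzy C-means literature and is an assumption here, not necessarily the paper's own: $u_i(x)$ is the membership of point $x$ in cluster $i$, $v_i$ is the center of cluster $i$, and $m > 1$ is the fuzzifier, with $m = 2$ giving squared memberships.

```latex
% Fuzzy least squared error criterion over a finite set X and k clusters
J_m(U, V) = \sum_{i=1}^{k} \sum_{x \in X} u_i(x)^{m} \, \lVert x - v_i \rVert^{2},
\qquad \text{subject to } \sum_{i=1}^{k} u_i(x) = 1 \ \text{for every } x \in X.

% Alternating updates (for points with x \neq v_j for all j)
v_i = \frac{\sum_{x \in X} u_i(x)^{m} \, x}{\sum_{x \in X} u_i(x)^{m}},
\qquad
u_i(x) = \left[ \sum_{j=1}^{k}
  \left( \frac{\lVert x - v_i \rVert}{\lVert x - v_j \rVert} \right)^{\frac{2}{m-1}}
\right]^{-1}.
```

Restricting the memberships to the values 0 and 1 turns the criterion into the ordinary k-means sum of squared errors, which is the non-fuzzy case the abstract contrasts with the fuzzy one.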