Chapter

Discovering Knowledge in Data: An Introduction to Data Mining


Abstract

Chapter Five begins with a discussion of the differences between supervised and unsupervised methods. In unsupervised methods, no target variable is identified as such. Most data mining methods, however, are supervised, meaning that (a) there is a particular pre-specified target variable, and (b) the algorithm is given many examples where the value of the target variable is provided, so that it may learn which values of the target variable are associated with which values of the predictor variables. A general methodology for building and evaluating a supervised data mining model is provided. The training, test, and validation data sets are discussed. The tension between model overfitting and underfitting is illustrated graphically, as is the bias-variance tradeoff. High-complexity models are associated with high accuracy and high variability. The mean-squared error is introduced as a combination of bias and variance. The general classification task is recapitulated. The k-nearest neighbor algorithm is introduced in the context of a patient-drug classification problem. Voting with different values of k is shown to sometimes lead to different results. The distance function, or distance metric, is defined, with Euclidean distance typically chosen for this algorithm. The combination function is defined for both simple unweighted voting and weighted voting. Stretching the axes is shown as a method for quantifying the relevance of the various attributes. Database considerations, such as balancing, are discussed. Finally, k-nearest neighbor methods for estimation and prediction are examined, along with methods for choosing the best value of k.
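To make the chapter's k-nearest neighbor material concrete, the following is a minimal sketch of distance-weighted k-NN voting with Euclidean distance. The toy data, the function name knn_predict, and the inverse-distance weighting scheme are illustrative assumptions, not code from the book.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, weighted=True):
    """Classify x_new by (optionally distance-weighted) voting among
    its k nearest training points, using Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of k closest points
    votes = {}
    for i in nearest:
        # Weighted voting: closer neighbors count more (inverse distance);
        # unweighted voting gives each neighbor exactly one vote.
        w = 1.0 / (dists[i] + 1e-9) if weighted else 1.0
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

# Toy patient-drug style example (features assumed already normalized).
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array(["drug_A", "drug_A", "drug_B", "drug_B"])
print(knn_predict(X, y, np.array([0.15, 0.15]), k=3))  # -> "drug_A"
```

With k=3, two "drug_A" neighbors outvote one "drug_B" neighbor, and the inverse-distance weights only strengthen that majority; changing k can change the outcome, which is the chapter's point about voting with different values of k.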


... Data mining uses statistical techniques, mathematics, artificial intelligence, and machine learning to extract and identify useful information and related knowledge from various large databases (Mirza, 2018). One of the six tasks in data mining is estimation (Larose & Larose, 2014). Estimation, broadly speaking, deals with continuously valued outcomes, where input data are used to assign a value to an unknown continuous attribute (B & G, 2013). ...
... Missing values are a persistent issue that wreaks havoc on data analysis methods (Larose & Larose, 2014). Several methods can be used to deal with missing values. ...
... Normalization is the process of scaling the attribute values of the data so that they fall within a certain range (Han et al., 2012). The min-max method was used in this study, performing a linear transformation of the raw data as shown in equation (1) (Larose & Larose, 2014). This normalization preserves the relationships between the actual data values. ...
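The min-max transformation referred to as equation (1) above is, in its standard form (a reconstruction, since the equation itself is not reproduced in the snippet):

$$X^{*} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

which maps each attribute value into the range [0, 1] while preserving the relative spacing, and hence the relationships, between the actual data values.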
Article
Full-text available
Covid-19 has resulted in an increase in people's need for vehicle ownership in order to avoid public transportation. People's purchasing power, on the other hand, has also weakened. Therefore, they prefer to purchase affordable cars, such as used cars. Moreover, Luxury Goods Sales Tax (PPnBM) discounts were officially applied to the purchase of new cars in March 2021. This study aims at estimating the price of used cars using several data mining algorithms, namely Random Forest, K-Nearest Neighbour (KNN), and Naïve Bayes. By employing the RapidMiner tool, this study was able to evaluate the attributes affecting car prices. From the experimental results, random forest produces the highest accuracy, 95.46%. The study also found that brand, engine capacity, kilometres, colour, year, number of passengers, and transmission are the most influential attributes for estimating used car prices.
... It consists of a tree structure scheme (representing a set of rules) composed of a collection of decision nodes (input variables) connected by branches extending from root nodes to leaf nodes (decision outcomes or targets). Such a tree structure scheme expresses a general pattern of recursive partitioning/splitting in which, starting at the root node, attributes are tested at decision nodes, resulting in multiple branches (which can be visualized as a set of IF-THEN statements); similarly, each branch can lead to another decision node or to a leaf node [26]. ...
... The hidden layer aims to link the input layer to the output layer, extracting useful features and sub-features from the input patterns with respect to the prediction output. Thus, the number of hidden layers and the number of neurons in each hidden layer are both user-defined, depending on the problem under consideration [9], [26], [28]. Moreover, the hidden neurons use the sigmoid transfer function: ...
... The predicted class labels can be used to compute the confusion matrix for a fixed D. Fig. 11 depicts an example of the confusion matrix, which matches predicted outcomes with the actual values. It includes four main statistics for the binary classification task [28]. Moreover, we also considered other classification metrics obtained from the confusion matrix, as described in [21], [26], including: 1) true positive rate (TPR), recall, hit rate, or sensitivity: ...
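The formulas truncated at the end of the two snippets above are standard; reconstructed (not quoted from the cited papers), the sigmoid transfer function and the true positive rate are:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathrm{TPR} = \frac{TP}{TP + FN}$$

where TP and FN are the counts of true positives and false negatives from the confusion matrix (the other two of the four main statistics being TN and FP).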
Article
Full-text available
The emergence of the Industry 4.0 concept and the profound digital transformation of industry play a crucial role in improving organisations' supply chain (SC) performance and, consequently, in achieving a competitive advantage. The order fulfilment process (OFP) is one of the key business processes in the organisation's SC and represents a core process of the operational logistics flow. The dispatch workflow process is an integral part of the OFP and is also a crucial process in the SC of cement industry organisations. In this work, we focus on enhancing the order fulfilment process by improving the dispatch workflow process, specifically with respect to the cement loading process. Thus, we propose a machine learning (ML) approach to predict weighing deviations in the cement loading process. We adopted a realistic and robust rolling window scheme to evaluate six classification models in a real-world case study, in which the random forest (RF) model provided the best predictive performance. We also extracted explainable knowledge from the RF classifier by using the Shapley additive explanations (SHAP) method, demonstrating the influence of each input data attribute used in the prediction process.
... Many businesses incur serious costs to collect data, yet cannot extract valuable, actionable information from this mass of data. Data mining applications eliminate this problem (Larose and Larose, 2014). ...
Article
Full-text available
Understanding the past and the present helps us see the future more clearly. Especially in the information age, the vast data created with the contribution of digitalization make this interpretation even more critical. One of the most effective tools we have for achieving this is data mining. Data mining is a tool for increasing productivity based on discovering meaningful relationships, patterns, and trends within data. Frequently used in the social sciences and in marketing, data mining creates many advantages for businesses: the meaningful patterns and relationships it discovers yield insights for predicting customers' future behavior, and it supports sales and service functions such as how product offerings should be structured. Data mining is also used to minimize losses that may occur due to fraud. In this context, this study aims to give general information about data mining and its applications in the social sciences and then to evaluate the use of data mining in marketing. In this way, it is hoped that the concept of data mining will be understood and adopted more clearly by social scientists, and that data mining applications in marketing will increase, thereby enhancing the contribution to both theory and the sector. Keywords: Data Mining, Data Mining in Social Sciences, Data Mining in Marketing
... The k-nearest neighbor (k-NN) algorithm aims to classify new objects based on their attributes and on training samples (Larose, 2005) [14]. A new test sample is classified according to the majority category among its k nearest neighbors. ...
Article
Full-text available
A penalty is a fine imposed by a telecommunications operator, as the employer, on a tower provider (a company providing telecommunications towers). The penalty is imposed because the work is completed after the specified deadline. To reduce the losses caused by penalties from telecommunications operators, the tower provider itself must be able to take steps to prevent or avoid such penalties by predicting the penalties it will incur. This study uses classification algorithms such as decision tree (C4.5), naive Bayes, k-nearest neighbor, logistic regression, and neural network. These five classification algorithms are compared to obtain the accuracy of each, using 10-fold cross-validation and a parametric t-test for the comparison of differences. The parametric t-test shows that the decision tree (C4.5) algorithm dominates the others; next, the naive Bayes, logistic regression, and neural network algorithms can be said to have the same accuracy, although the logistic regression and neural network algorithms are not better than the k-nearest neighbor algorithm.
... Two daily neural network models were developed. The first is a feed-forward model with back propagation (Larose 2014). The second is a recurrent network that is based on the approach presented in (Elman 1990). ...
... The second model is a feed-forward neural network with back propagation (dFFNN). As discussed in (Larose 2014), choosing a large number of hidden nodes increases the complexity of the model, whereas too few hidden nodes may limit the network's ability to learn. After examining different configurations, a model with 12 hidden nodes was selected. ...
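To illustrate the hidden-layer sizing trade-off described in the snippet above, here is a minimal sketch of a one-hidden-layer feed-forward pass with sigmoid activations, where the hidden-layer width is the complexity knob; the layer sizes, random weights, and function names are illustrative assumptions, not the cited paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, n_hidden=12):
    """One forward pass through a single-hidden-layer network.
    n_hidden controls model complexity: more hidden nodes can fit
    more structure but risk overfitting; fewer may underfit."""
    n_in, n_out = x.shape[0], 1
    W1 = rng.normal(size=(n_hidden, n_in))   # input -> hidden weights (untrained)
    W2 = rng.normal(size=(n_out, n_hidden))  # hidden -> output weights (untrained)
    h = sigmoid(W1 @ x)                      # hidden activations
    return sigmoid(W2 @ h)                   # network output

print(forward(np.array([0.3, 0.7, 0.1])))
```

Backpropagation would then adjust W1 and W2 against training data; the sketch only shows how n_hidden sets the number of free parameters being tuned.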
Article
Due to the limited natural water resources and the increase in population, managing water consumption is becoming an increasingly important subject worldwide. In this paper, we present and compare different machine learning models that are able to predict water demand for Central Indiana. The models are developed for two different time scales: daily and monthly. The input features for the proposed model include weather conditions (temperature, rainfall, snow), social features (holiday, median income), date (day of the year, month), and operational features (number of customers, previous water demand levels). The importance of these input features as accurate predictors is investigated. The results show that daily and monthly models based on recurrent neural networks produced the best results with an average error in prediction of 1.69% and 2.29%, respectively for 2016. These models achieve a high accuracy with a limited set of input features.
... The data mining process allows data to be extracted into new knowledge or information that was not previously known [10]. By recognizing existing data patterns, data mining methods can analyze the data and find new patterns [11]. From an existing dataset, it is possible to create new rules, patterns, or models that differ from those in the previous database [12]. ...
Article
One of the most feared infectious diseases today is COVID-19. The transmission of this disease is quite fast, and patients do not always have the same symptoms. Efforts to overcome the spread of the pandemic have been widely carried out throughout the world; apart from medical methods, there are also many others, including computerization. Data mining is a discipline that can project data into new knowledge, and one of its main functions is classification. The decision tree is one of the best models for solving classification problems. The number of data attributes can affect the performance of an algorithm. This study uses information gain to select the attribute features of the Covid-19 surveillance dataset, and it proves that the accuracy of the decision tree algorithm increases when information gain feature selection is added. Previously, the decision tree had an accuracy rate of only 65% for classifying the Covid-19 surveillance dataset; after pre-processing using information gain, the accuracy rate increased to 75%.
... Random Forest, a supervised machine learning technique, is employed in the R platform to carry out the aforesaid tasks. This is a comprehensive case study in the Human Resources area of specialization [8]. ...
Chapter
Attrition in human resources refers to the continuing loss of employees over time. The reduction may be due to voluntary or involuntary reasons. This paper presents a Random Forest model for variable importance measurement and attrition prediction. Human resource (HR) analytics refers to applying analytic procedures to the human resource department of an organization in the hope of improving employee performance and thereby improving profitability. In this context, IBM has gathered information on employee satisfaction, income, seniority, and some demographics for 1470 employees. The present study used this dataset for variable importance measurement and prediction of employee attrition. The Random Forest machine learning technique is applied to find the relative importance of variables. The results conclude that Over Time, Job Role, Monthly Income, Job Level, Total Working Years, and Age are the strongest predictors.
... Discovering knowledge from data requires the use of data mining methods and advanced software. Data mining is the analysis of an observed data set in order to find non-obvious relationships and interdependencies that exist within it (Foreman, 2014; Larose, 2005; Hand et al., 2001). ...
Article
Full-text available
Purpose: The aim of the article is to identify customers' purchasing behaviour profiles on the basis of indicators (observable variables) characterizing the process of deciding to purchase a product from food industry companies, in the context of corporate social responsibility (CSR). Design/methodology/approach: The data for the research were collected from a survey of 801 customers from the Świętokrzyskie Voivodeship. The resources were pre-explored and pre-processed to enable further studies. In order to obtain customer profiles, the latent class analysis (LCA) method was used. It enables identification of homogeneous groups (latent classes) of customers based on selected indicators. Findings: The impact on customers' purchasing behaviour of 15 CSR activities undertaken by enterprises, drawn from several different groups (in relation to: environment, society, employees, contractors, and customers), was examined. Six profiles of customer purchasing behaviour were identified, labelled, and given descriptive characteristics. Research limitations/implications: The results point out the need to continue the research on a broader, countrywide data set. Practical implications: The research findings can contribute to improving the effectiveness of food industry companies in the range of CSR activities. Owing to this, these companies will be able to take more effective steps to retain existing customers and acquire new ones. Social implications: Taking corporate social responsibility actions contributes to solving social and environmental problems and can also affect the quality of life in a society. Nowadays, it is an important and developing research area. Originality/value: The conducted study showed that latent class analysis is a proper tool for analysing the qualitative data obtained in questionnaire surveys. The work provides vital information on the impact of food industry companies' corporate social responsibility activities on customers' purchasing behaviour. Keywords: corporate social responsibility, customer profiles, purchasing behaviour, food industry, latent class analysis. Category of the paper: Research paper.
... Intisar et al. [9] used a slightly different approach. First, they applied topic modeling algorithms (LDA [18] and NMF [19]) to vectorize the text and then used these vectors to train and evaluate several classification algorithms, such as kNN [20], Random Forest, Multinomial Naive Bayes (MNB) [21], and Multilayer Perceptron [22]. Even though using implicit features affected the performance of individual classification algorithms (positively in the case of kNN and MNB, negatively in the case of RF), the final accuracy (the result of the best approach for each type of feature) did not improve much compared to the TF-IDF [23] baseline (0.86 vs 0.88 accuracy). ...
Preprint
Full-text available
Competitive programming remains a very popular activity that combines both software engineering and education. In order to prepare and to practice, contestants use extensive archives of problems from past contests available on various competitive programming platforms. One way to make this process more effective is to provide an automatic tag system for the tasks. Prior works do that by either using the tasks' problem statements or the code of their solutions. In this study, we investigate which information source is more valuable for tag prediction. To answer that question, we compare existing approaches of both types on the same dataset and with the same set of tags. Then, we propose a novel approach, which is an ensemble of the Gated Graph Neural Network model for analyzing solutions and the Bidirectional Encoder Representations from Transformers model for processing statements. Our experiments show that our approach outperforms previously proposed models by 0.175 of the PR-AUC metric.
... Data mining is seen as a step in the knowledge discovery process, which is regarded as a larger process that reveals hidden patterns for evaluation. In this context, in industry, in the media, and in research settings, the term data mining denotes the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques [3,4,5,6]. ...
... The Nearest Neighbor algorithm, which is designed to find the point nearest to the observed object, gave rise to the KNN (K-Nearest Neighbor) algorithm. The KNN algorithm's main idea is to find the K nearest points [17]. There are many different improvements of the traditional KNN algorithm, such as the Wavelet Based K-Nearest Neighbor Partial Distance Search (WKPDS) algorithm, the Equal-Average Nearest Neighbor Search (ENNS) algorithm, the Equal-Average Equal-Norm Nearest Neighbors code word Search (EENNS) algorithm, and the Equal-Average Equal-Variance Equal-Norm Nearest Neighbor Search (EEENNS) algorithm [37]. ...
... The data mining development method used to analyze the data in this application follows the stages of the knowledge discovery in databases (KDD) process, consisting of data, data cleaning, data transformation, data mining, pattern evaluation, and knowledge [6]-[8] (Figure 1: Research Stages). The following are the steps that need to be carried out in the research based on the KDD stages; ...
Article
Full-text available
The Indonesian government needs real-time data on the MSMEs entitled to assistance, accuracy in distributing MSME assistance, and acceleration of Indonesia's economic growth through MSMEs, especially in the Cirebon Regency area. There are several ways to ensure that government cash transfer assistance for micro-scale SMEs is right on target; in this study the authors use data mining techniques with the k-nearest neighbors method to classify recipients of SME assistance. The data used in this study are secondary data with the attributes Regency, District, Business Name, Product Name, Business License, Assets, and Turnover. The application of the KNN algorithm uses the retrieve operator and cross-validation, and the model is developed using the KNN algorithm operator together with the apply model and performance operators. The resulting accuracy is 98.46%, with the following details: 339 records were predicted Eligible and were in fact Eligible; 2 records were predicted Eligible but were in fact Not Eligible; 4 records were predicted Not Eligible but were in fact Eligible; and 42 records were predicted Not Eligible and were in fact Not Eligible. Based on the knowledge patterns obtained with the K-NN algorithm, the researchers recommend that assistance be given to the 339 MSME participants spread across Cirebon Regency and included in the affected category. There are also 42 participants who cannot receive MSME assistance according to the KNN results, and 2 participants who were proposed to receive assistance. The hope of this research is that participants who receive government assistance can survive under conditions such as the Covid-19 pandemic.
... Although SOM is based on artificial neural networks, this analysis does not use target class values and does not assign a class to each record, so SOM can be used for clustering purposes. SOM exhibits three characteristics: competition (the weight vectors compete with one another to become the winning node), cooperation (each winning node cooperates with its neighborhood), and adaptation (the winning node and its neighborhood change) (Larose, 2004). The algorithm run in the SOM analysis is as follows: 1. Initialize the input neurons x_1, x_2, ..., x_n. 2. Initialize the j x l output neurons w_11, w_12, ..., w_jl. 3. Fill the weights between the input and output neurons with random numbers from 0 to 1. 4. Repeat until the weights no longer change or the maximum number of iterations has been reached. 5. Select one of the available input vectors. ...
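A minimal sketch of the SOM loop just described, exhibiting the three characteristics (competition, cooperation, adaptation); the 3x3 grid, the Gaussian neighborhood, and the exponential decay schedules are illustrative assumptions.

```python
import numpy as np

def train_som(X, grid=(3, 3), n_iter=100, lr0=0.5, sigma0=1.0, seed=0):
    """Minimal Self-Organizing Map: competition (find the winning node),
    cooperation (Gaussian neighborhood), adaptation (weight update)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows * cols, X.shape[1]))      # step 3: random weights in [0, 1)
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(n_iter):                        # step 4: iterate
        lr = lr0 * np.exp(-t / n_iter)             # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)       # shrinking neighborhood
        x = X[rng.integers(len(X))]                # step 5: pick an input vector
        winner = np.argmin(np.linalg.norm(W - x, axis=1))    # competition
        d2 = np.sum((coords - coords[winner]) ** 2, axis=1)  # grid distances
        h = np.exp(-d2 / (2 * sigma ** 2))                   # cooperation
        W += lr * h[:, None] * (x - W)                       # adaptation
    return W

W = train_som(np.random.default_rng(1).random((50, 4)))
print(W.shape)  # (9, 4): one weight vector per output node
```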
... Moreover, it has to be noted that most working accidents in different fields are related to the use of work equipment and machinery (Fargnoli 2021; Fargnoli et al. 2018). According to Larose (2005), "data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques". Thanks to data mining and the revision of accident classification systems, in many countries accident analysis has turned from single-event analysis to the contemporary assessment of several data, in order to identify general risk profiles for each productive sector and suitable mitigation measures. ...
Conference Paper
Full-text available
Despite the international efforts to promote the circular economy and waste recycling, landfilling is still a common practice for waste disposal around the world, with considerable impacts on the environment and public health. However, while such a sector was addressed in many countries through a consistent environmental legislative framework, safety management for workers in landfills has not been sufficiently considered by policymakers. Moreover, in the scientific literature OHS issues related to landfill management were not deeply addressed through the elaboration of specific procedures. Hence, with the goal to evaluate the existence of specific risk profiles in such working contexts, the paper provides a contribution through a data-driven approach. In fact, the study contains a detailed investigation of the biggest Italian occupational accident database (powered by INAIL) with reference to the landfill management sector, according to a methodology specifically designed to employ the European Statistics on Accidents at Work (ESAW) codes as data filters. After selecting a sample of n.78 accidents, likely to have occurred in Italian landfills in the period 2008/2019, accidents' dynamics were assessed for each event with reference to the use of work equipment. The results achieved allowed us to bring to light the potential risks related to this determinant. Thus, potential risk management measures were defined in order to improve the safety management of work equipment in this specific context.
... A cluster is a collection of similar data that is distinguished from the data in other clusters. In clustering, the clustering algorithm seeks to segment the entire data set into relatively homogeneous groups, in which the similarity of the data within a cluster is maximized and the similarity of data in different clusters is minimized [16]. ...
Article
Full-text available
Covid-19 is a pandemic disease caused by the coronavirus 2 nCoV-2019, which attacks the human respiratory system and has spread throughout the world. Padang, the provincial capital of West Sumatra, has also experienced transmission of this virus; as of 18 July 2021 the number of Covid-19 cases had reached 25,830. Clustering in data mining is the science of grouping particular data into categories, and it is expected to be able to categorize the kelurahan (urban village) areas of Padang into several clusters as a recommendation for handling the spread of the Covid-19 virus in the city. The K-Medoids algorithm is a good algorithm for clustering such data because it uses the powerful partitional clustering method and has been used in similar clustering research. This research applies the K-Medoids algorithm using Rapidminer Studio 9.9.002, with source data obtained from the corona.padang.go.id website, and produces 3 clusters containing 51, 20, and 33 items, respectively, out of the 104 kelurahan in Padang.
... (1) Experiment Details. To examine the performance of the proposed MRwMR-BUR criterion, we conduct experiments using six public datasets [58], [61], [62], [63], [64] (see descriptions in Table 3) and compare the performance of MRwMR-BUR-KSG (which estimates UR via the KSG estimator) to MRwMR via three popular classifiers: Support Vector Machine (SVM) [65], K-Nearest Neighbors (KNN) [66], and Random Forest (RF) [67]. Five representative MRwMR-based algorithms, MIM [36], JMI [30], JMIM [37], mRMR [31], and GSA [28], are shortlisted for performance evaluation. ...
Preprint
Mutual Information (MI) based feature selection makes use of MI to evaluate each feature and eventually shortlists a relevant feature subset, in order to address issues associated with high-dimensional datasets. Despite the effectiveness of MI in feature selection, we notice that many state-of-the-art algorithms disregard the so-called unique relevance (UR) of features, and arrive at a suboptimal selected feature subset which contains a non-negligible number of redundant features. We point out that the heart of the problem is that all these MIBFS algorithms follow the criterion of Maximize Relevance with Minimum Redundancy (MRwMR), which does not explicitly target UR. This motivates us to augment the existing criterion with the objective of boosting unique relevance (BUR), leading to a new criterion called MRwMR-BUR. Depending on the task being addressed, MRwMR-BUR has two variants, termed MRwMR-BUR-KSG and MRwMR-BUR-CLF, which estimate UR differently. MRwMR-BUR-KSG estimates UR via a nearest-neighbor based approach called the KSG estimator and is designed for three major tasks: (i) Classification Performance. (ii) Feature Interpretability. (iii) Classifier Generalization. MRwMR-BUR-CLF estimates UR via a classifier based approach. It adapts UR to different classifiers, further improving the competitiveness of MRwMR-BUR for classification performance oriented tasks. The performance of both MRwMR-BUR-KSG and MRwMR-BUR-CLF is validated via experiments using six public datasets and three popular classifiers. Specifically, as compared to MRwMR, the proposed MRwMR-BUR-KSG improves the test accuracy by 2% - 3% with 25% - 30% fewer features being selected, without increasing the algorithm complexity. MRwMR-BUR-CLF further improves the classification performance by 3.8%- 5.5% (relative to MRwMR), and it also outperforms three popular classifier dependent feature selection methods.
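For reference, the mutual information underlying all of these criteria is the standard quantity (a reconstruction, not taken from the preprint):

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}$$

which is zero when a feature X and the label Y are independent and grows with their statistical dependence, which is what makes it a natural relevance score for feature selection.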
... In this step, we used the classification method of data mining to predict the key factors impacting dropout of Ethiopian university students. Before conducting the experiments, we balanced the data using Weka's resample filter, following the recommendations of Larose & Larose (2014). Before balancing, the active class had 3460 instances, while the dropout class had 802 instances. ...
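The balancing step described above can be illustrated with a simple random-oversampling sketch; this is not Weka's resample filter, and the column name and class labels are hypothetical:

```python
import pandas as pd

def oversample_minority(df, label_col="status", seed=0):
    """Balance a binary-class frame by resampling the minority class
    with replacement until both classes have equal counts."""
    counts = df[label_col].value_counts()
    minority = counts.idxmin()
    need = counts.max() - counts.min()
    extra = df[df[label_col] == minority].sample(
        n=need, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)

# e.g. 3460 'active' vs 802 'dropout' rows, as in the study above
df = pd.DataFrame({"status": ["active"] * 3460 + ["dropout"] * 802})
print(oversample_minority(df)["status"].value_counts())  # 3460 / 3460
```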
Article
Full-text available
University students' dropout is a complex issue with life and career ramifications, especially in the least developed countries. Ethiopia, a country with one of the least developed economies, has made considerable efforts to strengthen its higher education, yet university student attrition remains a major concern. In this study, we utilized the data mining methodology to reveal the important factors that impact dropout among Ethiopian university students. The current research results indicate that personal, institutional, and academic factors affect university student dropout. In Ethiopia, low-performing rural female students are more likely to drop out than male students, according to the findings of this study. In general, rural low-achieving students have a greater likelihood of dropping out of university. This is likely to occur during the students' first semester of study, especially if they have a poor attendance rate. This research contributes to the body of knowledge by indicating that university remedial programs may be successful in reducing the incidence of student dropout. The current research has implications for policymakers in the least developed nations, such as Ethiopia, seeking to construct dropout intervention programs based on the factors identified in this research.
... To optimise the data for presentation to the K-means algorithm, we applied data standardisation [39,40] to address issues of different data ranges, units of measure, and variance apparent in the variables. ...
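Standardisation in this context is presumably the usual z-score transform (an assumption; the snippet does not give the formula):

$$z = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ and $s$ are the sample mean and standard deviation of each variable, so that all variables enter the K-means distance computations on a common, unitless scale.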
Article
Full-text available
The risk posed by wildlife to air transportation is of great concern worldwide. In Australia alone, 17,336 bird-strike incidents and 401 animal-strike incidents were reported to the Air Transport Safety Board (ATSB) in the period 2010-2019. Moreover, when collisions do occur, the impact can be catastrophic (loss of life, loss of aircraft) and involve significant cost to the affected airline and airport operator (estimated globally at US$1.2 billion per year). On the other side of the coin, civil aviation and airport operations have significantly affected bird populations. There has been an increasing number of bird strikes, generally fatal to the individual birds involved, reported worldwide, the annual average of 12,219 reported strikes between 2008 and 2015 being nearly double the annual average of 6,702 strikes reported between 2001 and 2007 (ICAO, 2018). Airport operations, including construction of airport infrastructure, frequent take-offs and landings, airport noise and lights, and wildlife hazard management practices aimed at reducing the risk of bird strike, e.g., spraying to remove weeds and invertebrates, drainage, and even direct killing of individual hazard species, may result in habitat fragmentation, population decline, and rare-bird extinction adjacent to airports (Kelly T, 2006; Zhao B, 2019; Steele WK, 2021). Nevertheless, there remains an imperative to continually improve wildlife hazard management methods and strategies so as to reduce the risk to aircraft and to bird populations. Currently approved wildlife risk assessment techniques in Australia are limited to ranking of identified hazard species, i.e., they are 'static' and, as such, do not provide a day-to-day risk/collision likelihood. The purpose of this study is to move towards a dynamic, evidence-based risk assessment model of wildlife hazards at airports. Ideally, such a model should be sufficiently sensitive and responsive to changing environmental conditions to be able to inform both short- and longer-term risk mitigation decisions. Challenges include the identification and quantification of contributory risk factors, and the selection and configuration of modelling technique(s) that meet the aforementioned requirements. In this article we focus on the likelihood of bird strike and introduce three distinct, but complementary, assessment techniques, i.e., Algebraic, Bayesian, and Clustering (ABC), for measuring the likelihood of bird strike in the face of constantly changing environmental conditions. The ABC techniques are evaluated using environment and wildlife observations routinely collected by the Brisbane Airport Corporation (BAC) wildlife hazard management team. Results indicate that each of the techniques meets the requirement of providing dynamic, realistic collision risks in the face of changing environmental conditions.
... The decision trees created by CART have two branches for each decision node. Unlike decision trees for classification, which use Gini impurity or entropy as the criterion for splitting root/decision nodes, the CART algorithm applies the "goodness" criterion to split root/decision nodes, computed as follows [46]: ...
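The truncated "goodness" criterion is, in the form given by Larose for CART (a reconstruction, since the snippet cuts off before the formula):

$$\Phi(s \mid t) = 2 P_L P_R \sum_{j=1}^{k} \left| P(j \mid t_L) - P(j \mid t_R) \right|$$

where candidate split $s$ at node $t$ sends proportions $P_L$ and $P_R$ of the records to the left and right child nodes $t_L$ and $t_R$, and $P(j \mid t_L)$, $P(j \mid t_R)$ are the class-$j$ proportions in each child; the split with the largest $\Phi$ is chosen.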
Article
Full-text available
Predicting the condition of sewer pipes plays a vital role in the formulation of predictive maintenance strategies to ensure the efficient renewal of sewer pipes. This study explores the potential application of ten machine learning (ML) algorithms to predict sewer pipe conditions in Ålesund, Norway. Ten physical factors (age, diameter, depth, slope, length, pipe type, material, network type, pipe form, and connection type) and ten environmental factors (rainfall, geology, landslide area, population, land use, building area, groundwater, traffic volume, distance to road, and soil type) were used to develop the ML models. The filter, wrapper, and embedded methods were used to assess the significance of the input factors. A dataset consisting of 1159 inspected sewer pipes was used to construct the sewer condition models, and the 290 remaining inspections were used to verify the models. The results showed that sewer material and age are the most significant factors, while network type contributes least to sewer condition in the study area. Among the considered ML models, Extra Trees Regression (R2 = 0.90, MAE = 11.37, and RMSE = 40.75) outperformed the others and is recommended for predicting sewer conditions in the study area. The results of this study can support utilities and relevant agencies in planning predictive maintenance strategies for their sewer networks.
... The proposed model used the Information Gain algorithm [17] to reduce the feature space from 527 API Calls and permissions to only 50 features, achieving accuracy very close to that obtained using all 527 features. ...
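Information gain here is the standard entropy-based measure (a reconstruction, not quoted from the paper):

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \qquad H(S) = -\sum_{c} p_c \log_2 p_c$$

where $S_v$ is the subset of samples taking value $v$ on feature $A$; ranking the 527 features by $IG$ and keeping the top 50 is how such a reduction is typically carried out.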
Article
Full-text available
The Android platform has become the most popular smartphone operating system, which makes it a target for malicious mobile apps. This paper proposes a machine learning-based approach for Android malware detection based on application features. Unlike much prior research that focused exclusively on API Call and permission features to improve detection efficiency and accuracy, this paper incorporates applications' contextual features together with API Call and permission features. Moreover, the proposed approach extracted a new dataset of static API Call and permission features from a large dataset of malicious and benign Android APK samples. Furthermore, the proposed approach used the Information Gain algorithm to reduce the API and permission feature space from 527 to the 50 most relevant features. Several combinations of API Calls, permissions, and contextual features were used. These combinations were fed into different machine-learning algorithms to show the significance of using the selected contextual features in detecting Android malware. The experiments show that the proposed model achieved a very high accuracy of about 99.4% when using contextual features, compared with 97.2% without them. Moreover, the paper shows that the proposed approach outperformed the state-of-the-art models considered in this work.
... Data mining itself is the extraction of interesting, previously unknown, and useful patterns from large amounts of data [1] [2]. The pattern discovery process has been shown to help group the various existing thesis topics so that meaningful information is obtained for determining the university's research trends from year to year [3] [4]. ...
Article
Full-text available
The large amount of final-project document data from study programs at the Sekolah Tinggi Manajemen Informatika dan Komputer (STMIK) Abulyatama contributes greatly to the difficulty of grouping students' final-project themes. The clustering process, which so far has been carried out manually, is very ineffective and inefficient, so a data mining application is needed to manage the data, especially for clustering. The goal of this thesis is to implement the Support Vector Machine with K-Means and K-Medoids to optimize final-assignment clustering. From the analysis of the Support Vector Machine (SVM) optimized with K-Means and K-Medoids for grouping student final-project themes, several conclusions can be drawn; with the K-Means clustering method, there are 23 data mining, 10 network, 26 artificial intelligence, and 21 website items.
... As a new trend in many fields of science, from astronomy and biology to economics [35][36][37], machine learning strategies can also be used for electrochemical systems such as fuel cells and electrolyzers. Machine learning is an instrument for data mining that uncovers hidden information in large and complex databases [38][39][40]. For many years, various data mining tools have been successfully used in a variety of fields ranging from chemical engineering [41][42][43][44][45][46][47][48] to biology [49] and astronomy [37] to customer relations and economics [35]. ...
Article
Full-text available
In this work, a database of 789 experimental points extracted from 30 academic publications was used. The primary objective was to use novel machine-learning techniques to investigate how descriptor variables affect current density, power density, and polarization, and to identify rules or pathways that result in high current density, low power density, and low polarization. First, Shapley analysis was done to find and compare the magnitude of the contribution of each variable on current density as well as the positive and negative effects of all the variables. Then, correlation coefficient heat maps were provided to display the existence of any linear relationship between the input and output variables. Additionally, k-nearest neighbor classification (as an optimal model) was able to demonstrate the entire impact of all features on the outputs. Finally, the Bayesian optimization algorithm showed that the optimum performance of polymer electrolyte membrane electrolyzer could be reached with less experimental effort and time than the usual research plan. It was then concluded that machine learning methods can aid in determining the best conditions for designing a polymer electrolyte membrane electrolyzer to produce hydrogen, which can be used to guide the planning of future experiments.
... The survey consisted of 5 parts: A, knowledge of the freeshop concept; B, a visit to a freeshop (only for persons familiar with the concept); C, attitude to consumption (an assessment of 6 statements on a 5-point Likert scale, where 5 meant surely yes and 1 surely no); D, attitude toward actions related to a circular economy/zero waste and money saving (an assessment of 15 statements on a 5-point Likert scale, where 5 meant definitely important and 1 totally irrelevant); and M, a matrix (sex, age, attitude to using freeshops). The list of statements was based on the literature review [44][45][46][47][48][49] and on circular economy goals defined in governmental and non-governmental documents [9][10][11]. We proposed our own set of factors that we thought best reflected the freeshop idea. ...
Article
Full-text available
Current socioeconomic and environmental problems require radical solutions, including applying the circular economy and the zero-waste concept to customer behavior. One such solution is the concept of freeshops. A freeshop is a place where one can leave things one does not need and take useful items. The main purpose of this concept is to reuse things and thus prevent overproduction. The article is based on a survey carried out among students of the University of Gdańsk (n = 381). An affinity analysis was used to evaluate the data. The main aim of the paper is to discover the major set of factors that influence consumers in choosing a freeshop's offerings. In general, there were two groups of factors: economic (i.e., saving money) and connected with environmental protection (e.g., recycling). The primary result is that economic factors are more important for the surveyed students than those related to environmental protection.
... kNN is a non-parametric, supervised classifier [37,38]. In the classification procedure, different values of k, where k represents the number of neighbors in the classification model, were tested to compute the classification outcomes. ...
Article
Purpose: Schizophrenia refers to a lifelong drastic and debilitating mental illness. Clinically, it is described by several symptoms. Monitoring the schizophrenic electroencephalogram (EEG) has been the subject of many recent studies. However, the precise diagnosis of schizophrenia using EEG remains a challenging issue and is still in its infancy. The current study aimed to propose an intelligent system for schizophrenia detection by weighting the entropy indices of the selected EEG electrode. Methods: Using some entropy measures of the publicly available EEG data at RepOD, the performances of different classifiers, including support vector machine (SVM), k-nearest neighbor (kNN), and Naïve Bayes (NB), were evaluated separately. In addition, the efficiency of a decision-level fusion strategy was examined using majority voting. Applying the one-way ANOVA test, we proposed a methodology for scoring and selecting the top EEG channel for each attribute. Subsequently, the classification performances were also inspected using EEG channel selection and weighting the designated channels. Specifically, the main innovation of the proposed method lies in the selection and weighting of brain channels before entering the classification module. Results: The results showed that weighting the best brain channel improved classification accuracy. By applying NB, the accuracy of the diagnosis is increased up to 100%. Conclusion: The proposed heuristic scheme is superior to state-of-the-art EEG schizophrenia diagnosis tools.
... Data mining is one of the best and most powerful tools that can be used when the data are very large and random, as is the case in social media. Data mining is linked to several areas including machine learning, statistics, information retrieval, databases, and even data visualization [1]. There is one formal definition of data mining registered at Princeton University, as follows: "Data manipulation using advanced data search capabilities and statistical algorithms to discover patterns and correlations in pre-existing large databases; a method for discovering new meaning in data". The main idea of using data mining through specific algorithms is to obtain new information that gives a better understanding of a large set of hidden or latent data [2]. ...
Article
Social networking sites are a significant source of information for understanding the behavior of users and what is occupying society at all ages, and accordingly helpful information can be provided to specialists and decision-makers. According to official sources, 98.43% of Saudi youth use social networking sites. Social media data are studied and analyzed to provide the necessary information to increase investment opportunities within the Kingdom of Saudi Arabia, by studying and analyzing what occupies people on the communication sites through their tweets about the labor market and investment. Given the huge volume of data and its randomness, the data are surveyed, collected through keywords, prioritized and arranged, and recorded as positive, negative, or mixed. The study's analysis and conclusions are based on data mining and its techniques of analysis and deduction.
... Information gain describes how much an individual decision improves the model's predictive power, and gain feature importance is the average gain of all decisions which use a given variable [15]. Similarly, coverage is defined as the total number of samples affected by a decision in the model. Coverage feature importance is the average coverage of all decisions involving a given variable. ...
Article
Purpose: Patients with neovascular age-related macular degeneration (nAMD) have varying responses to anti-vascular endothelial growth factor injections. Limited early response (LER) after three monthly loading doses is associated with poor long-term vision outcomes. This study predicts LER in nAMD and uses feature importance analysis to explain how baseline variables influence predicted LER risk. Methods: Baseline age, best visual acuity (BVA), central subfield thickness (CST), and baseline and 3-month intraretinal fluid (IRF) and subretinal fluid (SRF) for 286 eyes were collected in a retrospective clinical chart review. At month 3, LER was defined as the presence of fluid, while early response (ER) was the absence thereof. Decision tree classification and feature importance methods determined the influence of baseline age, BVA, CST, IRF, and SRF on predicted LER risk. Results: One hundred and sixty-seven eyes were LERs and 119 were ERs. The algorithm achieved an area under the curve of 0.66 in predicting LER. Baseline SRF was most important for predicting LER, while age, BVA, CST, and IRF were somewhat less important. Nonlinear trends were observed between baseline variables and predicted LER risk. Zones of increased predicted LER risk were identified, including age <74 years, CST <290 or >350 μm, IRF >750 nL, and SRF >150 nL. Conclusion: These findings explain baseline variable importance for predicting LER and show SRF to be the most important. The nonlinear impact of baseline variables on predicted risk is shown, increasing understanding of LER and aiding clinicians in assessing personalized LER risk.
... Noisy data still contain errors or unreasonable values; inconsistent data still contain conflicting values (Larose, 2005). The stages carried out in data preprocessing are as follows: ...
Article
The development of information technology has now penetrated various sectors, including the health sector. In the health sector, medical science has advanced very rapidly, marked by the discovery of new, previously unidentified diseases. One group of diseases of current concern is disease of the liver, one example being hepatitis. The initial diagnosis of this disease, after observing the symptoms, is a liver function test, commonly called an LFT (Liver Function Test). Several attributes from the LFT results can readily be used to analyze the disease. One artificial intelligence technology that can be used for this analysis is machine learning, which has been widely used in the medical field to analyze medical datasets. One machine learning method is the Support Vector Machine (SVM), whose characteristic feature is finding an optimal separating function (classifier) that can separate two data sets from two different classes. The data used in this research were obtained from the UCI (University of California, Irvine) Machine Learning Repository and comprise 579 patient records. The dataset contains 11 attributes that are used to diagnose the disease using the polynomial support vector machine method. Using cross-validation on 10 attributes of the Indian Liver Patient data, the average accuracy is 87.65%.
... K-Means is a clustering algorithm for finding groups of non-overlapping objects [14]. K-Means is also considered an effective algorithm for grouping data [15]. K-Means is a clustering algorithm in the field of data mining that can only capture linear class features. ...
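The objective K-Means minimizes is the standard within-cluster sum of squares (a reconstruction, not part of the snippet):

$$\underset{C_1,\dots,C_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the centroid (mean) of cluster $C_i$. Because the resulting boundaries between clusters are linear (hyperplanes between centroids), plain K-Means captures only linear class structure, which motivates the kernel variant mentioned in the abstract below.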
Article
Full-text available
Cardiovascular disease is a disease caused by impaired function of the heart and blood vessels. It is caused by many factors, one of which is genetics, along with age, gender, and family history. In this study, 62 individuals with normal responses and with cardiovascular disease were classified. Discriminant Analysis (AD) is a method that classifies data into two or more groups based on several variables, where data assigned to one group will not be included in another group. The Support Vector Machine (SVM) performs classification by building an N-dimensional hyperplane that optimally separates the data into two categories in the input space. AD and SVM are then compared to determine which method has the best accuracy, after which clustering using k-means and the kernel k-means is added to improve the accuracy of each method. The results of this study are that AD and SVM have accuracy values of 83.33% and 91.66%; AD and SVM combined with k-means both have accuracy values of 91.66%; and AD and SVM combined with the kernel k-means have accuracy values of 100% and 100%.
Article
Full-text available
The increasing number of climbers creates a need for a system that can recommend mountains for climbing according to a climber's ability. This study aims to create a system that can help climbers choose a mountain according to their abilities. The researchers use one of the methods in data mining, namely classification, with the K-Nearest Neighbor (K-NN) algorithm. This research has produced a web-based system that can classify mountains and provide recommendations according to the ability of climbers. The system is equipped with a hiking-trail map, which is expected to make it easier for climbers to choose the mountain they will climb.
Article
Problem. Video surveillance is a process of monitoring various objects, which is implemented with the use of video cameras - optical-electronic and microprocessor devices, designed for visual control of the environment, with the aim of the safety of life, activity and property of a modern person. Such processes and objects can be, for example, cars moving at an intersection, on a street or on a country road, a road surface during the control of its condition and quality, a security system of any infrastructure object. Goal. The purpose of the study is the analysis of the technical composition of systems for detecting anomalies in the video of video surveillance cameras and a comparative review of computational methods for processing the results of this observation. To achieve the goal, it is necessary to research literary sources, that is, articles in scientific journals, reports at conferences, articles on non-thematic web portals, monographs and textbooks, the names of which indicate the possibility of finding information useful for this research. Methodology. As part of the research task, we are interested in the technologies, systems and methods that have been proposed and developed for obtaining, processing and analyzing video sequences and images, including machine vision tasks, image classification, object and anomaly detection, image segmentation, etc. Results. As a result of this research, the following was done: 1) An overview of the main modern systems for detecting anomalies in the video series of video surveillance cameras was conducted. It was concluded that the differences between the anomaly detection systems in the video series of video surveillance cameras are due to the choice of methods for processing video information. 2) An analysis of methods of detecting anomalies in the video series of video surveillance cameras was carried out. For this purpose, a classification of modern methods of detecting anomalies in the video series was developed and the basics of the theory of deep neural networks were considered in terms of the possibility of their application for classification, localization, segmentation, detection, identification and tracking of objects in the video series of surveillance cameras. Originality. An overview of the main modern systems for detecting anomalies in the video series of video surveillance cameras was conducted. It was concluded that the differences between the systems for searching for anomalies in the video series of video surveillance cameras are determined by the choice of methods for processing video information. An analysis of the methods of detecting anomalies in the video series of video surveillance cameras was carried out. Practical value. The developed information system is already used to provide students of all educational institutions of Ukraine of the III level of accreditation with the information about our university; regarding the specialties offered by the university and the corresponding professions; regarding open days, preparatory courses and much more.
Article
Purpose: This study presents a content analysis of a platform that publishes online content on blockchain technologies. The aim of the research is to identify the factors (words and word groups) that affect the read-through rate, on a per-title basis, of the content the platform shares on Facebook. Method: Of the 2206 pieces of content published in the date range defined by the study's scope, 500 were selected at random. The content titles were parsed using the Python programming language, both with a custom approach tailored to this problem and with standard text mining techniques, yielding two different structured datasets. Analyses were then performed on both datasets using multiple linear regression. Findings: The analyses showed that certain words and word groups used in content titles affect the read-through rate. It was also found that the custom approach outperformed the standard text mining techniques. Conclusion: Valuable information was obtained by processing the raw data. The theoretically derived findings were compared with practical experience and found to be consistent. The custom approach was shown to be usable for similar text mining problems. Originality: The title-level analysis based on text mining in this research takes a distinctive approach, which gives the study its originality.
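A minimal sketch of the standard text-mining pipeline named above: structure raw titles into term counts, then fit a multiple linear regression whose coefficients estimate each term's effect on the read-through rate. The titles and rates below are invented, and the study's custom approach is not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical titles and read-through rates; the paper's real data are
# Turkish Facebook post titles, which are not public here.
titles = [
    "what is blockchain and why it matters",
    "bitcoin price analysis for beginners",
    "smart contracts explained simply",
    "why ethereum gas fees keep rising",
    "blockchain use cases in supply chains",
]
read_rate = [0.12, 0.31, 0.18, 0.25, 0.15]

# Structure the raw titles into term counts (unigrams and bigrams).
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(titles)

# Multiple linear regression: each coefficient estimates a term's effect
# on the read-through rate.
model = LinearRegression().fit(X, read_rate)
terms = vec.get_feature_names_out()
top = sorted(zip(terms, model.coef_), key=lambda t: -abs(t[1]))[:5]
for term, coef in top:
    print(f"{term}: {coef:+.3f}")
```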
Article
Full-text available
Whenever people think about something or engage in activities, internal mental processes are engaged. These processes rely on sensory representations, such as visual, auditory, and kinesthetic, which are in constant use and can have an impact on a person's performance. Each person has a preferred representational system that they use most when speaking, learning, or communicating, and identifying it can explain a large part of their exhibited behaviours and characteristics. This paper proposes a machine learning-based automated approach to identify the preferred representational system that a person uses unconsciously. A novel methodology has been used to create a specific labelled conversational dataset, four different machine learning models (support vector machine, logistic regression, random forest, and k-nearest neighbour) have been implemented, and the performance of these models has been evaluated and compared. The results show that the support vector machine model performs best at identifying a person's preferred representational system, achieving the highest mean accuracy score under 10-fold cross-validation. The automated model proposed here can assist Neuro Linguistic Programming practitioners and psychologists in better understanding their clients' behavioural patterns and the relevant cognitive processes. It can also be used by people and organisations to achieve their goals in personal development and management. The two main knowledge contributions of this paper are the creation of the first labelled dataset for representational systems, which is now publicly available, and the first use of machine learning techniques to identify a person's preferred representational system in an automated way.
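The evaluation protocol named in the abstract, comparing the four models by mean accuracy under 10-fold cross-validation, can be sketched as follows. The synthetic data stand in for the paper's text-derived conversational features, which are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the labelled conversational dataset.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

models = {
    "SVM": SVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
}

# Mean accuracy over 10-fold cross-validation, matching the paper's protocol.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```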
Article
Full-text available
Keeping products in stock is very important: a shortage of one product can disappoint buyers and lead them to cancel the purchase of other products they had planned to buy at the same time, so the seller suffers a decrease in sales and revenue. In this case, the seller needs to learn the patterns in customers' purchasing habits from past sales transaction data. Association techniques can be used to analyze the pattern of interrelationships between items across transactions. Using the apriori algorithm, a popular association algorithm, 1063 sales transactions were analyzed with 10% min-support and 75% min-confidence, resulting in 4 association rules: 1) buyers of "kacer" and "love bird" also buy "pentet" (17% support); 2) buyers of "magpie" and "love bird" also buy "pentet" (16%); 3) buyers of "kacer" and "magpie" also buy "pentet" (14%); 4) buyers of "anis" also buy "pentet" (11%), with confidence levels of 76%, 81%, 84%, and 77%, respectively. So, there are 5 main items that play a strong role in the rules and must be considered. Sellers can use the resulting item-relationship patterns when managing inventory and arranging the items they sell.
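A minimal sketch of apriori rule mining with the same min-support and min-confidence thresholds, using the mlxtend library. The five toy transactions below reuse the item names from the abstract but are invented; the real study mined 1063 transactions.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions reusing the bird-shop items named in the abstract.
transactions = [
    ["kacer", "love bird", "pentet"],
    ["magpie", "love bird", "pentet"],
    ["kacer", "magpie", "pentet"],
    ["anis", "pentet"],
    ["kacer", "love bird"],
]

# One-hot encode the transactions into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets at 10% min-support, rules at 75% min-confidence.
itemsets = apriori(onehot, min_support=0.10, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.75)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```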
Article
Full-text available
One product of the MODIS sensor (Moderate Resolution Imaging Spectroradiometer) is EVI2 (Two-Band Enhanced Vegetation Index). It generates images of around 23 observations each year, which combined can be interpreted as time series. This work presents the results of using two types of features obtained from EVI2 time series, basic and polar features, employed in automatic classification for land cover mapping, and we compared the influence of using single-pixel versus object-based observations. The features were used to generate classification models using the Random Forest algorithm. Classes of interest included Agricultural Area, Pasture, and Forest. Results achieved accuracies of up to 91.70% for the northern region of Mato Grosso state, Brazil.
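The classification step described above can be sketched as follows. The features and labels here are random placeholders (the paper's real inputs are basic and polar features computed from EVI2 time series), so the printed accuracy only demonstrates the pipeline, not the paper's 91.70% result.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-sample features derived from an EVI2 time
# series (e.g. mean, amplitude, timing of the seasonal peak).
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 6))
y = rng.choice(["Agricultural Area", "Pasture", "Forest"], size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Random Forest model, as used in the paper.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
print(f"accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2%}")
```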
Chapter
Knowledge is an invaluable resource for almost all entities, be they firms, organizations, communities, or individuals. Knowledge needs to be captured, processed, and analyzed. Well-defined knowledge can be represented in an accurate manner, such as a mathematical formula or a certain set of rules [1]. Knowledge can also be modeled, where a model permits us to explain reality, classify objects, and predict a value (or if an event will occur) knowing its relationship to other known values. If our knowledge is not complete, then we can approximate reality by learning from previous experiences and predicting an outcome with a certain likelihood of accuracy. Alongside the representation of knowledge, we need to store on a computer a reasoning method, i.e., an algorithm (a series of steps to be followed) to process this knowledge to arrive at an outcome/output (e.g., a decision, classification, or diagnosis).
Article
Full-text available
The spatiotemporal model family covers stationary and non-stationary data, known respectively as the Generalized Space–Time Autoregressive (GSTAR) model and the Generalized Space–Time Autoregressive Integrated (GSTARI) model. When such models are applied to climate forecasting with rainfall variables, the response is also influenced by exogenous variables such as humidity, and the assumption of constant error variance is often violated. Therefore, this study aims to design a spatiotemporal model that incorporates exogenous variables and handles non-constant error variance. The proposed model is named GSTARI-X-ARCH. The model is used to predict climate phenomena in West Java, using data obtained from the National Aeronautics and Space Administration Prediction of Worldwide Energy Resources (NASA POWER). Climate data are big data, so knowledge discovery in databases (KDD) was used in this study. The pre-processing step collects and cleans the data. The data mining step then fits the GSTARI-X-ARCH model following the Box–Jenkins procedure: model identification, parameter estimation, and diagnostic checking. Finally, the post-processing step visualizes and interprets the forecast results. This research is expected to contribute to the development of spatiotemporal models, with the forecast results serving as recommendations to the relevant agencies.
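GSTARI-X-ARCH itself has no mainstream Python implementation that can be assumed here, so the sketch below shows only a simplified single-location analogue of the modeling idea: an integrated autoregressive model with an exogenous regressor (humidity), with an ARCH model fitted to its residuals to capture non-constant error variance. The data are synthetic, and the real model additionally couples multiple locations through spatial weights.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from arch import arch_model

# Synthetic monthly rainfall driven by humidity (invented numbers).
rng = np.random.default_rng(0)
humidity = pd.Series(70 + 10 * np.sin(np.arange(120) / 6.0), name="humidity")
rainfall = pd.Series(100 + 0.8 * humidity + rng.normal(0, 5, 120),
                     name="rainfall")

# Integrated AR model with an exogenous regressor, per the Box-Jenkins steps:
# identification (order choice), estimation (fit), diagnostic checking (resid).
arimax = ARIMA(rainfall, exog=humidity, order=(1, 1, 0)).fit()

# ARCH(1) on the residuals to model the non-constant error variance.
arch_fit = arch_model(arimax.resid, mean="Zero", vol="ARCH", p=1).fit(disp="off")
print(arch_fit.summary())
```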
Research
Full-text available
Many students are confused about their future career field and are unable to decide on a career path, whether because of a lack of information or because of misconceptions. At the age of 18, students do not yet have adequate knowledge to judge which professional path is right for them. As we grow, we recognize that every student has doubts about what to pursue after 12th grade.
Article
Full-text available
Currency exchange rates are important for international trade. Differences in the value of a country's currency affect the value of goods and services relative to other countries. Exchange rates are dynamic and influenced by many factors, such as demand, supply, the balance of payments, the inflation rate, interest rates, and government regulations and policies. Data mining is expected to help foreign exchange traders or banks predict future currency values. This study aims to predict currency exchange rates and to compare which algorithm predicts them most accurately. Using a forecasting approach, several algorithms can be applied, such as Linear Regression and Neural Network; both can be used because they support time series measurement as the basis of the forecasting method. The authors use the RapidMiner application to implement the algorithms on the dataset. The prediction process uses the windowing operator, whose output is then processed by both algorithms to produce accuracy measures for the predicted values. The results of this study are predicted selling and buying rates of the rupiah against the Singapore Dollar (SGD) of 10754.600 and 10641.450, respectively. The performance tests conclude that the Linear Regression algorithm is slightly superior in predicting exchange rates, with an RMSE of 28.012 +/- 5.626 and 27.556 +/- 5.893, an Absolute Error of 21.444 +/- 4.095 and 21.198 +/- 4.247, and a Relative Error of 0.20% +/- 0.04%.
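The windowing-plus-regression pipeline described above can be sketched in Python as follows; the `windowing` helper mimics what RapidMiner's windowing operator produces (lagged values as predictors, the next value as the target). The rate series and window width are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def windowing(series, width):
    """Turn a univariate series into (lag-window, next-value) pairs,
    mirroring RapidMiner's windowing operator."""
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]
    return X, y

# Synthetic daily selling-rate series (invented numbers, not the paper's data).
rng = np.random.default_rng(1)
rate = 10700 + 50 * np.sin(np.arange(200) / 10.0) + rng.normal(0, 10, 200)

X, y = windowing(rate, width=5)
split = int(0.8 * len(X))
model = LinearRegression().fit(X[:split], y[:split])

# RMSE on the held-out tail, plus a one-step-ahead forecast.
rmse = np.sqrt(mean_squared_error(y[split:], model.predict(X[split:])))
next_day = model.predict(rate[-5:].reshape(1, -1))[0]
print(f"RMSE: {rmse:.3f}, next-day forecast: {next_day:.3f}")
```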
Article
Academic planning at Universitas Buana Perjuangan Karawang is important because it is needed to prepare everything required for lectures throughout the new academic year. However, this is very difficult to do without a prediction of the number of new students to serve as a reference for academic planning. To facilitate academic planning, a forecast of the number of new students is therefore needed. The forecasting method used is single exponential smoothing. The result of this study is that single exponential smoothing with a smoothing constant alpha of 0.9 produced the smallest error, with an MSE of 43791.11936 and a MAPE error of 8.51%.
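Single exponential smoothing is simple enough to show in full: the smoothed value is S_t = alpha * x_t + (1 - alpha) * S_{t-1}, and S_{t-1} serves as the one-step-ahead forecast of x_t. The enrollment numbers below are invented; only alpha = 0.9 comes from the study.

```python
def single_exponential_smoothing(series, alpha):
    """S_t = alpha * x_t + (1 - alpha) * S_{t-1}."""
    s = series[0]
    fitted = [s]
    for x in series[1:]:
        s = alpha * x + (1 - alpha) * s
        fitted.append(s)
    return fitted

# Hypothetical yearly new-student counts (invented numbers).
students = [1850, 2100, 1990, 2300, 2450, 2200]
alpha = 0.9

fitted = single_exponential_smoothing(students, alpha)
# The forecast for year t is the smoothed value from year t-1.
errors = [x - f for x, f in zip(students[1:], fitted[:-1])]
mse = sum(e ** 2 for e in errors) / len(errors)
mape = sum(abs(e / x) for e, x in zip(errors, students[1:])) / len(errors) * 100
print(f"MSE = {mse:.2f}, MAPE = {mape:.2f}%, next forecast = {fitted[-1]:.1f}")
```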
Chapter
In the current trend towards contactless recognition, palmprint biometrics has proven to be very effective due to its uniqueness, reliability, acceptability, non-intrusiveness, and the low cost of acquisition devices. Palmprint biometrics can be used in many situations by simply capturing the hand with the camera of a mobile device. However, use in real-life situations adds high variability to the capturing conditions, increasing the complexity of the recognition process and causing many processing methods to date to fail. In this study, a deep-learning algorithm is proposed to detect the palmprint region of interest and rotate the image as needed for subsequent feature extraction. For this purpose, a convolutional neural network (CNN) has been trained and evaluated with 2445 hand images from 6 different databases that cover diverse environmental conditions. Results show that this algorithm provides an averaged F1-score of 89% on images with complex backgrounds, dim light, or varied hand arrangements, and it correctly processes images in which users are wearing rings, something that traditional segmentation cannot handle. Some conditions, such as hard shadowing, remain very complex for this algorithm, but it could be highly improved by increasing the volume of training datasets. Keywords: Biometrics; Palmprint; Region of interest (ROI); Deep learning; Object detection; Convolutional Neural Network; Mask R-CNN.
Chapter
Traditionally, encryption keys are stored in a cryptosystem, risking theft or loss. Biometric encryption is a possible solution that avoids the need to store keys securely. The basic idea is that a cryptographic key can be generated from, or bound to, the biometric data of a user whenever a key is needed. This chapter reviews and evaluates state-of-the-art biometric encryption techniques. Different techniques to acquire, process, and extract data from iris biometric samples are evaluated and compared. Three template-free techniques are designed and implemented to test the performance of the system in terms of false acceptance rate (FAR) and false rejection rate (FRR). The results show that it is possible to generate an ECC (error correcting code) key pair and identify a person with a 3.7% FAR and a 21% FRR, values that can be further improved by optimizing the initial processing of the iris.
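For readers unfamiliar with the two error rates quoted above, the sketch below shows how FAR and FRR are typically computed from genuine and impostor match scores at a decision threshold. The scores and thresholds are synthetic; this is not the chapter's implementation.

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """FAR: fraction of impostor attempts accepted; FRR: fraction of
    genuine attempts rejected. Assumes higher score = better match."""
    far = np.mean(np.asarray(impostor_scores) >= threshold)
    frr = np.mean(np.asarray(genuine_scores) < threshold)
    return far, frr

# Synthetic match scores for illustration only.
rng = np.random.default_rng(7)
genuine = rng.normal(0.8, 0.10, 1000)
impostor = rng.normal(0.4, 0.12, 1000)

# Raising the threshold lowers FAR but raises FRR, and vice versa.
for t in (0.5, 0.6, 0.7):
    far, frr = far_frr(genuine, impostor, t)
    print(f"threshold {t}: FAR = {far:.1%}, FRR = {frr:.1%}")
```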