Article

Data Mining: Concepts and Techniques

Authors: Jiawei Han, Micheline Kamber, Jian Pei

... Grouping the data set can be based on the relationships between keywords in a text document. The keywords are analyzed by gathering keywords that occur together and finding the association relationships among them using association rule mining techniques [8]. This research groups undergraduate Computer Science students' final projects based on frequent itemsets. ...
... This rule involves two measures, namely support and confidence. Support is the probability that X and Y appear together in a transaction, while confidence is the probability of Y given that X also appears [8]. ...
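For reference, the standard way to write these two measures for a rule X ⇒ Y (a hedged restatement of the usual definitions, not text quoted from the citing paper) is:

$\mathrm{support}(X \Rightarrow Y) = P(X \cup Y)$
$\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \mathrm{support}(X \cup Y) / \mathrm{support}(X)$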
... The clustering with the smallest SSE is the best clustering result. SSE is defined as in equation (1) [8]: ...
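The snippet ends before equation (1) itself; a commonly used definition of SSE for a partition of the data into k clusters $C_1, \dots, C_k$ with centroids $\mu_i$ (given here as an assumption about what the cited equation contains) is:

$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$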
... When the literature is reviewed, it is seen that data mining emerged conceptually in the 1960s, when computers began to be used to solve data analysis problems (Han, Kamber, & Pei, 2012). Data mining, which was initially referred to as data scanning, arrived at its present name through the work of computer engineers. ...
... The J48 algorithm is an implementation of C4.5, a successor of ID3. In this algorithm, the entropy and information gain values for the target class are calculated using equations 1 to 3. The expected information needed to classify a tuple in D is given by (Han et al., 2012): ...
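The quoted sentence stops where the equation would begin; the standard expression for the expected information (entropy) needed to classify a tuple in D, with $p_i$ the probability that a tuple in D belongs to class $C_i$ and m the number of classes (offered here as a hedged reconstruction), is:

$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 (p_i)$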
... Therefore, accuracy values were taken into consideration in the study, so that the predictor will generalize well to new observations (Han et al., 2012). ...
... In other words, the hyperplane should be chosen so that for each instance $\vec{x}_i$ the distance between the sample and the hyperplane is maximal. Each hyperplane can be described as follows (Han et al., 2011): ...
... So, to maximize the distance between the hyperplanes, $\lVert \vec{w} \rVert$ should be minimized. Therefore, the following condition is imposed for each sample (Han et al., 2011): ...
... Thus, the classification of samples is turned into an optimization problem of the following form (Han et al., 2011): ...
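The three snippets above all cut off before the equations they refer to; a conventional formulation of the separating hyperplane and of the resulting optimization problem (a hedged sketch of what the cited equations most likely state) is:

$\vec{w} \cdot \vec{x} + b = 0$ (the hyperplane)
$y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1$ for every training sample $(\vec{x}_i, y_i)$
$\min_{\vec{w}, b} \ \tfrac{1}{2}\lVert \vec{w} \rVert^2$ subject to the constraints above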
Article
Full-text available
Epilepsy is a disorder of the central nervous system that is often accompanied by recurrent seizures. The World Health Organization (WHO) estimates that more than 50 million people worldwide suffer from epilepsy. Although electroencephalogram (EEG) signals contain vital physiological and pathological information about the brain and are a prominent medical tool for detecting epileptic seizures, visual interpretation of such recordings is time-consuming. Since early diagnosis of epilepsy is essential to control seizures, we present a new method using data mining and machine learning techniques to diagnose epileptic seizures automatically. The proposed detection system consists of three main steps: In the first step, the input signals are pre-processed by discrete wavelet transform (DWT) and sub-bands containing useful information are extracted. In the second step, the features of each sub-band are extracted by approximate entropy (ApEn) and sample entropy (SampEn), and these features are then ranked by an ANOVA test. Finally, feature selection is done by the FSFS technique. In the third step, three algorithms are used to classify seizures: least squares support vector machine (LS-SVM), K nearest neighbors (KNN) and Naive Bayes (NB). The average accuracy for both LS-SVM and NB was 98% and it was 94.5% for KNN, while the results show that the proposed method can detect epileptic seizures with an average accuracy of 99.5%, sensitivity of 99.01% and specificity of 100%, which is an improvement over most similar methods, so it can be used as an effective tool in diagnosing this complication.
... Analyzing and evaluating large amounts of data requires data mining and machine learning to provide more accurate information. Machine learning algorithms could analyze and examine natural patterns in the data that generate insight and help make better decisions and predictions [6]. The data usually are analyzed based on trial and error, but this approach becomes impossible to adopt when the datasets are large and heterogeneous. ...
... One of the characteristics of Education 4.0 is that education is built and developed based on lecturer performance and student perceptions, using data collected from their daily learning activities. Data are stored, processed, and analyzed to become information or knowledge, a process known as data mining [6]. Data analysis relating to the world of education, with its results providing input to the development of the learning process, is known as educational data mining [9,10]. ...
... Borkar et al. [13] also indicated that data mining can analyze datasets from different perspectives and summarize them into valuable information to identify patterns in big datasets. Data mining's primary function is to apply techniques and algorithms, together with artificial intelligence and visualization techniques, to detect and extract patterns [6]. Data mining algorithms and methods are also developed and used in the education sector. ...
Article
The Covid-19 pandemic currently occurring affects almost all aspects of life, including education. School From Home (SFH) is one of the ways to prevent the spread of Covid-19. The face-to-face learning method in class has turned into online learning using information technology facilities. Even though there are many barriers to implementing classes online, online learning provides a new perspective on students' learning process. One of the factors in the online learning process's success is the interaction between the two main actors in the learning process, i.e., lecturers and students. The study's purpose was to analyze students' perceptions of the online learning process. The research data were obtained from a student questionnaire covering five main criteria in the learning process: 1) self-management aspects, 2) personal efforts, 3) technology utilization, 4) perceptions of self-roles, and 5) perceptions of the role of the lecturer. Students provided an assessment through a questionnaire about the online learning methods they experienced during the Covid-19 pandemic. The random forest algorithm was applied to examine the data. The study results focused on the three main criteria (variable importance) that affect students' perceptions of the online learning process. The results show that students' satisfaction with online learning is influenced by 1) the relationship between students and lecturers; 2) the learning materials, which need to be changed and adapted to the online learning method; and 3) the use of technology to access online learning. The study contributes to improving online learning methods for students.
... DATA ANALYSIS: THE KNOWLEDGE DISCOVERY IN DATABASES PROCESS. In this part of the study, the variables selected for the data mining process were analyzed. In practice, this is where knowledge discovery in databases began, an interactive methodology involving many steps (FAYYAD et al., 1996; HAN; KAMBER; PEI, 2012), as shown in Figure 1. ...
Article
Full-text available
The Infant Mortality Rate represents one of the great humanitarian challenges still to be overcome. In the national context, the state of Pernambuco has succeeded in reducing it; however, the number of infant deaths related to preventable causes remains high. Therefore, this study aimed to discover associations among variables related to infant mortality in Pernambuco for children under one year of age and to link the findings to public policy recommendations. The database of the 2018 Mortality Information System, openly provided by the Ministry of Health, was used. The methodology followed the knowledge discovery in databases process, supported by the WEKA software and the Apriori algorithm. From the analysis of the results, improvements to the state public health system can be suggested, especially regarding the mitigation of preventable causes through care for pregnant women. Keywords: Open Data, Public Data Mining, Infant Mortality.
... The greedy stepwise algorithm works by finding the best features; the best or most relevant features are the dimensions that contribute most to accuracy. Feature selection is used to determine the best and worst features (Han, Kamber, & Pei, 2012). ...
... The greedy algorithm with attribute subset selection (Han et al., 2012) ...
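To make the idea concrete, here is a minimal sketch of greedy stepwise (forward) feature selection; the classifier, scoring scheme, and stopping rule are illustrative assumptions, not the implementation used in the cited study.

# Minimal sketch of greedy stepwise (forward) feature selection.
# Assumes scikit-learn and NumPy; classifier and scoring are illustrative.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def greedy_stepwise_selection(X, y, max_features=None):
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    remaining = set(range(n_features))
    while remaining and (max_features is None or len(selected) < max_features):
        candidates = []
        for f in remaining:
            subset = selected + [f]
            # Score each candidate subset with cross-validated accuracy.
            score = cross_val_score(MLPClassifier(max_iter=500),
                                    X[:, subset], y, cv=5).mean()
            candidates.append((score, f))
        score, f = max(candidates)
        if score <= best_score:   # stop when no remaining feature improves accuracy
            break
        best_score, selected = score, selected + [f]
        remaining.remove(f)
    return selected, best_score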
Article
Full-text available
Disability is an impairment, a limitation of activity, and a restriction of participation. Disability is also described as the interaction between individuals with a health condition (such as cerebral palsy, Down syndrome, or depression) and personal and environmental factors such as negative attitudes. Disability can interfere with the body's natural development depending on sex, age, and environment. People with disabilities are the world's largest minority group, and 80% of them live in developing countries. In addition, children account for one third of all people with disabilities worldwide. In practice, the process of diagnosing and classifying the dimensions of disability requires an occupational therapy expert. Data mining techniques can be used to support the diagnosis process, with the aim of avoiding diagnostic errors. The purpose of this study is to classify the self-care problems of children with disabilities into 7 classes. The study uses a dataset representing the self-care problems of children with disabilities. The dataset poses a multidimensional-dataset problem, in which the dataset has more features than records: it has 205 features and 1 label, with only 70 instances. The method proposed in this study is greedy stepwise as a feature selection method to address the multidimensional-dataset problem by selecting the most relevant features. In addition to greedy stepwise, a neural network is applied as the classification algorithm. The results show that greedy stepwise feature selection combined with a neural network achieves an accuracy of 84.2857%, which can be considered a good result.
... These commonly found features may contribute little to the author group identification. Here, we used the well-known information gain (IG) and entropy (E) to evaluate the contribution of such features for classification [24]. These were calculated as follows: ...
... In other words, features with a high information gain are informative features that should be kept. Both information gain and entropy can be used as metrics to determine the usefulness of a given feature A [24]. Between the two, we used information gain for this research because it provided better results in our evaluation. ...
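For completeness, the usual definitions behind these metrics, where $\mathrm{Info}(D)$ is the entropy of the class distribution in D and $\mathrm{Info}_A(D)$ is the expected entropy after partitioning D on feature A into subsets $D_1, \dots, D_v$ (a hedged restatement, not quoted from reference [24]), are:

$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$
$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$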
Article
Full-text available
Malware are developed for various types of malicious attacks, e.g., to gain access to a user’s private information or control of the computer system. The identification and classification of malware has been extensively studied in academic societies and many companies. Beyond the traditional research areas in this field, including malware detection, malware propagation analysis, and malware family clustering, this paper focuses on identifying the “author group” of a given malware as a means of effective detection and prevention of further malware threats, along with providing evidence for proper legal action. Our framework consists of a malware-feature bipartite graph construction, malware embedding based on DeepWalk, and classification of the target malware based on the k-nearest neighbors (KNN) classification. However, our KNN classifier often faced ambiguous cases, where it should say “I don’t know” rather than attempting to predict something with a high risk of misclassification. Therefore, our framework allows human experts to intervene in the process of classification for the final decision. We also developed a graphical user interface that provides the points of ambiguity for helping human experts to effectively determine the author group of the target malware. We demonstrated the effectiveness of our human-in-the-loop classification framework via extensive experiments using real-world malware data.
... The major components of a smart city are given in Fig. 1 [2]. Data mining is a synonym for another term, Knowledge Discovery from Data (KDD) [1]. It is an iterative process consisting of the following steps: (1) data cleaning, (2) data integration, (3) data selection, (4) data transformation, (5) data mining, (6) pattern evaluation, and (7) knowledge representation. ...
... It is an iterative process consisting of the following steps: (1) data cleaning, (2) data integration, (3) data selection, (4) data transformation, (5) data mining, (6) pattern evaluation, and (7) knowledge representation. Data mining is actually only one step in this process, but it is the major step that helps uncover hidden patterns for evaluation [1]. It scans huge volumes of information for models and patterns using computational methods from statistics, machine learning, and information theory. ...
Article
Full-text available
People invest a larger part of their energy in their home or work environment and for some, these spots are our asylums. As society and innovation progress there is a developing interest for enhancing the knowledge of the situations in which we live and work. Data mining techniques plays a pivotal role in establishing the smarter home and city to greener environment in the past decade. For a smarter environment, sensors can be embedded anywhere while the occupants play out their day by day schedule. The data are collected from the sensor, stored in a database or a network and data mining algorithms can be used to extract interesting patterns. Some of the application like activity recognition such as cooking, watching TV, sleeping etc., For a greener environment, data mining plays a vital role in predicting the energy consumption like mobile phone energy consumption, waste water treatment, building energy use, predicting traffic density, anomaly detection in roads etc. Data mining techniques works efficiently in imbalanced and microarray datasets which bring some disadvantages like over-fitting, poor performance and low efficiency. This paper reviews various data mining approaches to provide better understanding about the different methods that may help interested researchers to work future in establishing the smart and greener environment.
... Data mining is used for discovering useful and hidden trends or patterns that support diagnosis and recurrence prediction. The classification data mining technique categorizes the breast cancer datasets into a number of classes [15]. The main aim of classification is to accurately predict the class of new data and to give effective results for the analysis of huge datasets [16]. ...
... With a wide range of applications, such as fraud detection, product recommendations, email spam filtering, and medical diagnosis, machine learning has attracted a lot of attention in recent decades. (1) Learning is the automatic discovery of previously unknown patterns and structures in data. (2) Based on the sample data, the machine learning algorithm creates a mathematical model that improves the output (prediction) of the algorithm according to previous experience. ...
Article
Full-text available
Data is being generated at an increasing rate in a variety of fields as science and technology advance. The generated data are being saved for future decision-making. Data mining is the process of extracting patterns and useful information from massive amounts of data. The distance measure, which is used to calculate how different two objects are from one another, is one such instrument. We have conducted a comprehensive survey of how the distance measures behave when employed with different algorithms. Furthermore, the effectiveness and performance of some novel similarity measures proposed by other authors are investigated.
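As an illustration of the kind of measures such a survey compares, the sketch below computes three common distances; the function names and example vectors are assumptions for demonstration, not material from the cited article.

# Three common distance measures used to compare two objects (vectors).
import numpy as np

def euclidean(a, b):
    # Straight-line (L2) distance.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block (L1) distance: sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

def cosine_distance(a, b):
    # One minus the cosine of the angle between the vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))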
... There are many traditional feature selection algorithms developed for selecting relevant features for emotion classification from a speech signal. Among them are filter and wrapper approaches based on criteria such as information gain [3], mutual information [4], and principal component analysis [5]. In the wrapper approach, a classifier such as the K-nearest neighbour (KNN) [6] or support vector machine (SVM) [7] is used to assess the quality of the resulting subsets. ...
... Based on the statistical measurement of the original (un-normalized) data, many methods have been proposed to normalize the data to within a specified range. In this study, we chose the min-max normalization method, which linearly rescales the un-normalized data to predefined upper and lower limits [93]. The calculation formula is as follows: ...
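The formula the snippet points to is not reproduced in the excerpt; the standard min-max normalization mapping a value v of attribute A onto a new range $[\mathrm{new\_min}_A, \mathrm{new\_max}_A]$ (a hedged reconstruction) is:

$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$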
Article
Full-text available
Urban style is the comprehensive expression of the material environment, the associated cultural connotation and social life. Under the influence of globalization and rapid urban expansion, many cities around the world show a global convergence in style, which poses a challenge in terms of satisfying both function and local identity. However, the current insufficiency of research on the quantitative evaluation of urban style makes it hard to have a full grasp on how urban style can instruct land use and landscape planning strategies. In this paper, we propose Suitability, Aesthetics and Vitality as three core dimensions of urban style, and construct a quantitative evaluation framework for urban style evaluation at the street level. Taking a street in Hengyang County, China as an example, the method’s operability is demonstrated, and the results show that urban style performance is closely related to building construction periods, trends of urban expansion, and the natural environment. Improvement strategies include harmonizing urban spatial form, increasing the diversity of land use, and moderately improving the quality of building facades. This method can be applied at a greater scale to effectively reflect local characteristics and relevant problems. It can also provide an objective basis for future planning and construction.
... In this method, classification is performed based on the category of the k nearest neighbours between the test data and all training data [28]. The value of k is determined by trial and error, i.e., by running experiments with different values of k until the minimum error is obtained [29]. Besides error, accuracy can also be taken into account [30]. ...
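A minimal sketch of the trial-and-error choice of k described above, assuming scikit-learn and a simple held-out split; the range of k values and the splitting scheme are illustrative assumptions.

# Pick k for KNN by trying several values and keeping the most accurate one.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def pick_k(X, y, k_values=range(1, 21)):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    best_k, best_acc = None, -1.0
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        acc = accuracy_score(y_te, knn.predict(X_te))
        if acc > best_acc:           # minimum error = maximum accuracy
            best_k, best_acc = k, acc
    return best_k, best_acc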
... Machine learning is used to classify, cluster, and identify/predict patterns in data (Jiawei Han, Kamber, & Pei, 2012). Supervised learning (i.e., for labelled data) and unsupervised learning (i.e., for unlabeled data) are the main two classifications of machine learning techniques. ...
Article
Full-text available
Financial domains are suffering from organized fraudulent activities that are inflicting the world on a larger scale. Basel Anti-Money Laundering (AML) index enlists 146 countries, which are impacted by criminal acts like money laundering, and represents the country's risk level with a notable deteriorating trend over the last five years. Despite AML being a substantially focused area, only a fraction of such activities has been prevented. Because financial data related to this field is concealed, access is limited and protected by regulatory authorities. This paper aims to study a graph-based machine-learning model to identify fraudulent transactions using the financial domain's synthetic dataset (100K nodes, 5.3M edges). Graph-based machine learning with financial datasets resulted in promising 77-79% accuracy with a limited feature set. Even better results can be achieved by enriching the feature vector. This exploration further leads to pattern detection in the graph, which is a step toward AML detection.
... In data mining, we extract information from a huge data set and apply suitable techniques to find knowledge and patterns for further use. Data mining is an integral part of many related fields, including statistics, machine learning, pattern recognition, database systems, visualization, data warehousing, and information retrieval [1]. ...
... In short, clustering is similar to classification; the only difference is that the classes are not defined and determined in advance, and the grouping of the data is done without supervision (Han, Kamber, & Pei, 2012). Because of its simplicity, convergence speed, and high efficiency, the k-means algorithm is the most popular clustering method (Berkhin, 2006; Li, Yu, Lei, & Tang, 2017; Jain & Dubes, 1988). ...
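To ground the description, here is a minimal k-means sketch (Lloyd's algorithm); the initialization, stopping rule, and data handling are illustrative assumptions rather than the cited authors' code.

# Minimal k-means: assign points to the nearest centroid, then update centroids.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # sample k starting points
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids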
Article
Full-text available
In recent years, it has become critical to analyze and track crime to identify trends and associations with crime patterns and activities. Generally, the analysis is conducted to discover the areas or locations where crime is high or low by using different clustering methods, including k-means clustering. Even though the k-means algorithm is commonly used in clustering because of its simplicity, convergence speed, and high efficiency, finding the optimal number of clusters is difficult. Determining the correct clusters for crime analysis is critical to enhancing current crime resolution rates, avoiding future incidents, reducing the time required of new officers, and increasing activity quality. To address the problem of estimating the number of clusters in the crime domain without human interference, the research applied the Elbow, Silhouette, Gap Statistic, and NbClust methods to datasets of Major Crime Indicators (MCI) for 2014-2019. Several stages were performed to process the crime datasets: data understanding, data preparation, cluster modelling, and cluster validation. The first two phases were performed in the R Studio environment and the last two stages in Azure Studio. From the experimental results, the Elbow, Silhouette, and NbClust methods suggest a similar optimum number of clusters, namely two. After validating the result using the average Silhouette method, the research considers two clusters the best for the dataset. The visualization of the Silhouette method displays a value of 0.73, indicating that the observations are well grouped and placed in the correct clusters.
... For Tingilidou and Kirkos (2010, p. 2), data mining is the extraction of implicit, previously unknown and potentially useful information from data. In the view of Han and Kamber (2012), data mining is the process of discovering interesting patterns and knowledge from data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically, in order to detect possible regularities, trends and associations that are not known a priori (Acciani, Fucilli, & Sardaro, 2011, p. 27). ...
... Clustering is the most common unsupervised learning approach, used to find possible groupings or inherent patterns in the given data. The Gaussian mixture model and k-means are types of clustering approaches [76]. A typical general unsupervised learning model is shown in Figure 6 and Table 2. Reinforcement learning is a machine learning approach that allows the machine or software agent to intelligently determine the best behavior within a specific context and maximize its performance. ...
Article
Full-text available
One of the biggest problems the maritime industry is currently experiencing is corrosion, which results in short- and long-term damage. Early prediction and proper corrosion monitoring can reduce economic losses. Traditional approaches used in corrosion prediction and detection are time-consuming and challenging to execute in inaccessible areas. For these reasons, artificial intelligence-based algorithms have become the most popular tools for researchers. This study discusses state-of-the-art artificial intelligence (AI) methods for marine-related corrosion prediction and detection: (1) predictive maintenance approaches and (2) computer vision and image processing approaches. Furthermore, a brief description of AI is provided. The outcomes of this review will bring forward new knowledge about AI and the development of prediction models which can avoid unexpected failures during corrosion detection and maintenance. Moreover, it will expand the understanding of computer vision and image processing approaches for accurately detecting corrosion in images and videos.
... However, hierarchical clustering is an unsupervised learning method, and once a group of objects is merged, the next step operates on the newly generated clusters. It will neither undo what was done previously nor perform object swapping between clusters [46]. This means that errors may accumulate and thus lead to low-quality or even wrong clusters. ...
Article
Full-text available
Designers search for memories and retrieve appropriate mental information during design brainstorming. The specific contents of retrieved memories can serve as stimuli for new ideas, or act as barriers to innovation. These contents can be divided into different categories, which are reflected in designers' creativities, and derived from individual lives and design experiences. Appropriate categorization of retrieved memory exemplars remains a fundamental research issue. This study tentatively divided retrieved memory exemplars into eight categories from brainstorming on the topic of library desk and chair design. A verification questionnaire was performed and validated the accuracy of categorization. The categorization result could be applied to design education in terms of understanding students' design performances and capabilities.
... It should be emphasized that service time and cost values used in the above equations are required to be normalized in the range of 1 to 10 via min-max normalization [34]. ...
Article
Full-text available
In this paper, the novel heuristic search algorithm called Smart Root Search (SRS) was examined for solving a set of different-sized service time–cost optimization in cloud computing service composition (STCOCCSC) problems, and its performance was compared with those of the ICACRO-C, ICACRO-I, ICA, and Niching PSO algorithms. STCOCCSC is an NP-hard problem due to the large number of unique services available as well as the many service providers who provide services with different quality levels. Finding closer-to-optimal solutions supports cloud clients by providing them with higher quality, lower price services. The obtained results proved that the SRS provided 6.74, 11.2, 47.95, and 87.29 percent performance improvements on average over the comparative algorithms, respectively, across all five considered problems. Furthermore, employing symmetry concepts in dividing the problem search space helps the algorithm avoid premature convergence and any efficiency reduction when facing higher-dimensional search spaces. Owing to these achievements, the SRS is a multi-purpose, flexible, and scalable heuristic search algorithm capable of being utilized in various optimization applications.
... The binary classified indices are pairwise compared with the binary classified reference data sets for an instance of flood data sets and another instance of cost of fatalities. The confusion matrices for each index are estimated to evaluate the accuracy of the indices using the values of precision (Equation 11), recall (Equation 12), F1-score (Equation 13), and accuracy (Equation 14) (Han et al., 2011). The confusion matrix depends on four performance measures: true positive (TP), true negative (TN), false positive (FP), and false negative (FN), which were obtained previously from a pairwise comparison between the binary classified reference data sets and the indices. ...
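The four equations are cited but not reproduced in the excerpt; their standard forms in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) (a hedged reconstruction of Equations 11-14) are:

$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$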
Article
Full-text available
Estimating the exposure of the coastal systems to natural hazards using coastal vulnerability models, which benefits from index-based approaches and utilize information about the characteristics of the system, has become extensively adopted in the past few decades in coastal management and planning. However, the explanatory power of index-based approaches and subjective selection of vulnerability factors are still in dispute. This study aims to introduce a stochastic coastal vulnerability model and assess its skill in characterizing and preserving simultaneous information about various comprising factors. Two common coastal vulnerability indices, additive coastal vulnerability index (ACVI) and multiplicative coastal vulnerability index (MCVI) are formed, and then their performances are compared to the proposed probabilistic coastal vulnerability index (PCVI) for the coastal counties of South Carolina. PCVI is developed based on the joint-probability analysis of vulnerability factors using copula functions, which makes it capable of preserving the importance of multivariate information, and in turn, forms a more informative index. The performance of indices is benchmarked against post-hazard flood maps and the cost of fatalities from Hurricane Florence (2018) and Hurricane Matthew (2016). The PCVI revealed more accurate results in terms of explaining the importance of vulnerability associated with biophysical and socio-economic factors. The capability of PCVI to preserve multivariate vulnerability information offers a more pragmatic approach to reflect the exposure and adaptive capacity of coastal communities facing coastal hazards.
... For instance, using values α = 3 and β = 1 would result in less of a jump between similar and dissimilar assemblages. Although the choice of the appropriate distance metric or dissimilarity coefficient is crucial, there are no definite rules guiding the choice of a clustering procedure. Clustering techniques are frequently grouped based on the nature of the algorithm used to generate the groups; e.g., partitional, hierarchical, and density-based are just some of the main clustering types (Han et al. 2011). An important distinction is whether the procedure requires specifying the number of clusters or groups a priori or not. ...
Article
Full-text available
This study describes a methodology to uncover activity patterns obtained through surface survey. More specifically, we present a way to elicit and interpret pottery surface assemblages from Mallorca's Late Iron Age, the Balearic Period (550-1 BCE), recovered during the 2014-18 seasons of the Landscape, Encounters and Identity Archaeological project (LEIAp). To achieve this goal, we derive a new binary (pseudo-)metric or dissimilarity coefficient that, in combination with a spectral biclustering algorithm, allows us to group areas on the landscape with similar pottery assemblages. This new metric better aligns with our intuition about the similarities between pottery assemblages than other well-known binary metrics. Careful examination of the composition of the groups obtained from clustering enables us to forward interpretations regarding activities that occurred at various non-monumental enclaves throughout the landscape. These results provide a more nuanced view of the landscape during this period of time, in particular, by exposing the existence of non-monumental domestic spaces and areas destined for transport and farming activities.
... It searches for "core objects", points that contain a minimum of observations (MinPoints) within its neighbourhood (defined by an epsilon radius), including the core point itself. If a point is found outside of any of the core object's neighborhood, it is considered noise [33]. Border points are points within reach of a core object without the minimum points in their neighbourhood to be considered core objects themselves (Figure 4). ...
Article
Full-text available
Dam surveillance activities are based on observing the structural behaviour and interpreting the past behaviour supported by the knowledge of the main loads. For day-to-day activities, data-driven models are usually adopted. Most applications consider regression models for the analysis of horizontal displacements recorded in pendulums. Traditional regression models are not commonly applied to the analysis of relative movements between blocks due to the non-linearities related to the simultaneity of hydrostatic and thermal effects. A new application of a multilayer perceptron neural network model is proposed to interpret the relative movements between blocks measured hourly in a concrete dam under exploitation. A new methodology is proposed for threshold definition related to novelty identification, taking into account the evolution of the records over time and the simultaneity of the structural responses measured in the dam under study. The results obtained through the case study showed the ability of the methodology presented in this work to characterize the relative movement between blocks and for the identification of novelties in the dam behaviour.
... A general procedure of the knowledge discovery (security insights) process from cyber data. According to Han et al.,52 the term "knowledge mining from data" should have been used instead. Data mining, which is similar to another popular phrase, "Data Science",40 is defined as the process of extracting meaningful patterns and knowledge from large volumes of data. ...
Article
Due to the rising dependency on digital technology, cybersecurity has emerged as a more prominent field of research and application that typically focuses on securing devices, networks, systems, data and other resources from various cyber‐attacks, threats, risks, damages, or unauthorized access. Artificial intelligence (AI), also referred to as a crucial technology of the current Fourth Industrial Revolution (Industry 4.0 or 4IR), could be the key to intelligently dealing with these cyber issues. Various forms of AI methodologies, such as analytical, functional, interactive, textual as well as visual AI can be employed to get the desired cyber solutions according to their computational capabilities. However, the dynamic nature and complexity of real‐world situations and data gathered from various cyber sources make it challenging nowadays to build an effective AI‐based security model. Moreover, defending robustly against adversarial attacks is still an open question in the area. In this article, we provide a comprehensive view on “Cybersecurity Intelligence and Robustness,” emphasizing multi‐aspects AI‐based modeling and adversarial learning that could lead to addressing diverse issues in various cyber applications areas such as detecting malware or intrusions, zero‐day attacks, phishing, data breach, cyberbullying and other cybercrimes. Thus the eventual security modeling process could be automated, intelligent, and robust compared to traditional security systems. We also emphasize and draw attention to the future aspects of cybersecurity intelligence and robustness along with the research direction within the context of our study. Overall, our goal is not only to explore AI‐based modeling and pertinent methodologies but also to focus on the resulting model's applicability for securing our digital systems and society.
... To understand structures and patterns in complex geospatial data, it is common to incorporate data mining (for further reading see (Han et al., 2011)), knowledge discovery databases (KDD), and visualization methods. Koua and Kraak (2004) explained that the KDD framework can be implemented into geovisualizations through computational and visual analysis methods. ...
Article
Full-text available
This review article collects knowledge on the use of eye-tracking and machine learning methods for application in automated and interactive geovisualization systems. Our focus is on exploratory reading of geovisualizations (abbr. geoexploration) and on machine learning tools for exploring vector geospatial data. We particularly consider geospatial data that is unlabeled, confusing or unknown to the user. The contribution of the article is in (i) defining principles and requirements for enabling user interaction with the geovisualizations that learn from and adapt to user behavior, and (ii) reviewing the use of eye tracking and machine learning to design gaze-aware interactive map systems (GAIMS). In this context, we review literature on (i) human-computer interaction (HCI) design for exploring geospatial data, (ii) eye tracking for cartographic user experience, and (iii) machine learning applied to vector geospatial data. The review indicates that combining eye tracking and machine learning is promising in terms of assisting geoexploration. However, more research is needed on eye tracking for interaction and personalization of cartographic/map interfaces as well as on machine learning for detection of geometries in vector format.
Thesis
Atlas-based segmentation is a high-level segmentation technique that has become a standard paradigm for exploiting prior knowledge in image segmentation. Various regions of the human body seen in medical imaging, such as the brain or the female pelvic region, are known to be anatomically complex and highly variable from one patient to another, which makes segmentation with low-level techniques difficult. In this work, we propose an automatic atlas-based segmentation approach using online learning. The proposed approach was first applied to the segmentation of the human cerebellum from 2D brain MRI images, and then to the segmentation of local regions likely to be affected by cervical cancer from 3D female pelvic MRI images. The proposed segmentation approach relies on a new registration technique that uses a hybrid optimization procedure based on a particular design of genetic algorithm combined with gradient descent in a multi-resolution strategy. The atlases used in this work were made available to us progressively, in sequential order. The proposed approach is therefore based on an online machine learning method both for building the atlas base and for the segmentation process. The recorded results are promising compared with those of conventional segmentation methods.
Book
The Eleventh International Conference on Data Analytics (DATA ANALYTICS 2022), held between November 13 and November 17, 2022, continued the series on fundamentals in supporting data analytics, special mechanisms and features of applying principles of data analytics, application-oriented analytics, and target-area analytics. Processing terabytes to petabytes of data, or incorporating non-structural and multi-structured data sources and types, requires advanced analytics and data science mechanisms for both raw and partially processed information. Despite considerable advancements in high performance, large storage, and high computation power, there are challenges in identifying, clustering, classifying, and interpreting a large spectrum of information. The conference had the following tracks: Application-oriented analytics; Big Data; Sentiment/opinion analysis; Data Analytics in Profiling and Service Design; Fundamentals; Mechanisms and features; Predictive Data Analytics; and Transport and Traffic Analytics in Smart Cities.
Chapter
The use of IT in controlling is becoming increasingly important. Since business success depends on the timeliness and accuracy of information, so-called OLTP systems store only current data states so that the data to be processed can be provided efficiently and effectively. The central core of retail IT is powerful merchandise management systems, since they enable the provision of the corresponding operational and strategic information. At the decentralized level, store-level merchandise management systems are used, as they offer inventory management and analysis functionality in addition to pure sales recording. Because of its proximity to consumers as well as to suppliers, retail needs external information material in addition to internal data. Therefore, data on consumer behavior and consumer trends are frequently purchased externally. Due to very high transaction volumes and the characteristically large number of articles, the data volumes to be managed by retail companies have increased sharply in recent years. In contrast to OLTP systems, OLAP is a flexible analysis concept for accessing data dynamically and navigating through the data set. In addition, the large number of transactions with external partners via electronic data interchange offers considerable rationalization potential.
Chapter
In a behavioural science context, organizational intelligence refers to an organization’s ability to acquire, process and use information. The organizational learning and organizational decision-making literatures comprise what would be the organizational intelligence literature if organizational intelligence were an acknowledged field of study. I describe these two literatures, and I show how organizational learning and decision-making facilitate organizational intelligence and are themselves enhanced by organizational intelligence. I also describe the practice of organizational intelligence, a practice that seeks to aid decision-makers by determining the nature, capabilities, circumstances and likely behaviours of entities of interest to these decision-makers.
Article
The wide use of VLEs (Virtual Learning Environments) in education can contribute to the use of digital information and communication technologies (TDIC), of active learning methodologies, and of the CCS (Constructionist, Contextualized and Meaningful) approach, in which the learner uses technology as an instrument to produce something rooted in their own experience and reality, while the teacher acts as a mediator to help the learner formalize concepts. In this context, the Internet and mobile devices have contributed to the proliferation of large amounts of data in digital format, but these data are still little used for knowledge discovery. This is where the field of EDM (Educational Data Mining) stands out, consisting of the development of methods and techniques for exploring such data to better understand learners' behavior and the conditions under which they learn. The guiding question of the research was to discover how EDM techniques could be used to identify evidence of the CCS approach in blended-mode courses. The methodological design is based on an ex post facto approach, since the investigation was carried out after the facts had occurred. To answer the guiding questions, the graduate course in Special Education from an Inclusive Perspective of the Redefor-Unesp program was analyzed using the CCS categories, systematized in an experimental software tool called EDMXP (Educational Data Mining eXPeriment). The results were compiled in a language that allows education professionals to better understand the data, so that, in the end, it was possible to conclude that EDM can be a transforming factor in education, as it allows decisions to be made based on data and facts rather than only intuitively or through lived experience. It therefore represents a new way of doing and thinking about education. Keywords: Educational Data Mining. CCS approach. Blended education.
Article
Full-text available
Education is one of the fields that has felt a major impact from the Covid-19 pandemic. One consequence is that the teaching and learning process has to be carried out from home using online learning methods. This teaching method has generated diverse responses and views from students, which led the researchers to analyze these views, both positive and negative. The analysis was carried out by applying sentiment analysis, or opinion mining, to comment data from the social media platform Facebook; the text data were processed with preprocessing methods and labeled positive or negative. Based on the available text data, a classification process was performed with the K-Nearest Neighbors algorithm. RapidMiner was used to experiment on the text data with the KNN algorithm in order to obtain accuracy, precision, and recall values. The study obtained an accuracy of 87.00% and an AUC value of 0.916. These are fairly high values for classifying student opinions about the pandemic, so the study is categorized as an Excellent Classification.
Article
Full-text available
Someone's opinion of a product or service expressed through a review is quite important for the owner or for potential customers. However, the large number of reviews makes it difficult for them to analyze the information contained in the reviews. Aspect-based sentiment analysis is the process of determining the sentiment polarity of a sentence based on predetermined aspects. This study aims to analyze Indonesian restaurant reviews using a combination of a Convolutional Neural Network and contextualized word embedding models, which is then compared with a combination of a Convolutional Neural Network and traditional word embedding models. The aspect classification on three models, BERT-CNN, ELMo-CNN, and Word2vec-CNN, gives the best results on the ELMo-CNN model, with a micro-average precision of 0.88, micro-average recall of 0.84, and micro-average f1-score of 0.86. Meanwhile, the sentiment classification gives the best results on the BERT-CNN model, with a precision of 0.89, a recall of 0.89, and an f1-score of 0.91. Classification using data without stemming gives almost similar results, even better than using data with stemming.
Article
Full-text available
Nowadays, in the domain of production logistics, one of the most complex planning processes is the accurate forecasting of production and assembly efficiency. In industrial companies, Overall Equipment Effectiveness (OEE) is one of the most common used efficiency measures at semi-automatic assembly lines. Proper estimation supports the right use of resources and more accurate and cost-effective delivery to the customers. This paper presents the prediction of OEE by comparing human prediction with one of the techniques of supervised machine learning through a real-life example. In addition to descriptive statistics, takt time-based decision trees are applied and the target-oriented OEE prediction model is presented. This concept takes into account recent data and assembly line targets with different weights. Using the model, the value of OEE can be predicted with an accuracy of within 1% on a weekly basis, four weeks in advance.
Article
Data is a term encountered very frequently today. The correct use of data enables correct evaluation, which in turn ensures the efficient use of resources and improves the quality of the services provided. The health sector is one of the areas where the most data are collected. The financial and human burden of healthcare delivery is heavy, and providing this service in the best possible way is closely related to the correct use of resources. Given the volume of health data, extracting meaningful results from them and providing guidance to healthcare professionals such as physicians, nurses, and health managers is only possible with data mining methods. Because the health sector by its nature directly affects human life, the quality of the data used in healthcare is expected to be at the highest level. This study addresses data quality and data mining holistically. Through application examples, it provides a general overview of the kinds of studies that can be carried out in the health sector with data mining.
Article
The need for power is increasing every day across the world. In order to meet the power demand, numerous power plants are being built. Adding new power plants to available transmission lines leads to an overloading problem. As a solution, series capacitors are used in transmission lines. Series capacitors increase the active power transfer capability and have a positive effect on system stability and the voltage profile. However, in contrast to these positive effects, series capacitors have some negative effects on the distance relays which protect the transmission lines. In case of a short-circuit fault, a distance relay, depending on the location where the fault occurred and the capacitor effect, might not see a fault that occurs on the line it protects or might see the fault on another line. In this study, a management relay was developed in order to prevent maloperations of distance relays due to the capacitor effect. The developed relay identifies the faulted line in case of a short-circuit fault and sends a tripping/blocking signal to the distance relays that might malfunction. At the end of the study, the verification of the management relay is carried out with various scenarios, and it is seen that the relay functions effectively during short-circuit simulations.
Conference Paper
This paper presents a multimodal biometric approach applied to all fingernails and knuckle creases of the five human fingers for identifying persons. The proposed biometric technique consists of several phases. The method starts with the detection and localisation of the main components of the hand, defining the region of interest (ROI), segmentation, feature extraction by retraining the DenseNet201 model, measuring the similarity using different metrics, and lastly, improving the person identification performance by implementing score-level fusion. This approach presents different methods for person identification, which combine fingernails, knuckles based on the modality type, and whole hands based on different similarity metrics. Various similarity metrics are used to distinguish between individuals, including the Bray-Curtis, Cosine, and Euclidean metrics. Two main score-level fusion techniques are employed: majority voting (MV) and weighted average (WA). The experimental results, evaluated on well-known databases, the '11k Hands' and the Hong Kong Polytechnic University Contactless Hand Dorsal Images 'PolyU', show the proposed algorithm's efficiency. Using MV on the Bray-Curtis similarity measure, the fingernail-based and base-knuckle-based fusion obtained 100% identification accuracy. In addition, the identification rate reached 100% for hand regions and whole hands from the two popular datasets, exceeding the performance of state-of-the-art approaches.
Article
Full-text available
This study addresses categorization issues related to adjective candidates in Estonian, focusing on the category of participles. The aim of the analysis was to assess the ranges of the prototypical adjective and to determine its degree of deviation on the prototypicality scale. The investigation was based on a group of validated adjectives – selected adjectives included in the Basic Estonian Dic­tionary – and two control groups of more and less lexicalized participles. We tested seven morphosyntactic corpus patterns characteristic of adjectives. The test patterns were based on the prototypical features of the adjective, as well as on observations made in the actual lexicographic analysis. To assess the sam­ple words and determine the significance of the test patterns from the point of view of defining adjectivity, we used deviation analysis. The results of this study can be applied to establish a measure of adjectivity for lexicographic judgments when distinguishing, for instance, lexicalized participles from regular ones.
Chapter
Automatic classification and monitoring of human activity using sensors is a decisive technology for Ambient Assisted Living (AAL). Early recognition approaches involved manually outlining expert guidelines for the sensor values. Recently, machine learning-based human activity recognition methodologies have attracted a lot of attention. Some human activities produce smoke in the environment, which is dangerous in most situations. The objective of the proposed work is to detect smoke-making activities carried out inside a smart environment based on the data gathered by a set of environmental sensors inspecting the constituents of the air. We have used a lower- and upper-bound capping technique to handle the outliers, and we have then standardized the features using standard scaling. This preprocessing makes the sensor data more suitable for the selected logistic regression algorithm. The proposed method shows better results than many state-of-the-art methods. Keywords: Smoke activity, Smart environment, IoT air quality sensors, Outlier, Data standardization, Logistic regression
Article
Mapping of landslide susceptibility is an important tool to prevent and control landslide disasters for a variety of applications, such as land use management plans. The main objective of this study was to propose an application of artificial intelligence systems, then evaluate and compare their efficiency for developing accurate landslide susceptibility mapping (LSM). The present study aims to explore and compare the frequency ratio (FR) method with three machine learning (ML) techniques, namely, random forests (RF), support vector machines (SVM), and multilayer perceptron neural networks (MLP), for landslide susceptibility assessment in East Azerbaijan, Iran. To achieve this goal, 20 landslide-occurrence-related influencing factors were considered. A total of 766 locations with landslide inventory were recognized in the study area, and the relief-F method was utilized to measure the conditioning factors' prediction capacity in landslide models. In the next phase, three ML models (SVM, RF, and MLP) were trained on the training dataset. Lastly, the receiver operating characteristic (ROC) and statistical procedures were employed to validate and contrast the predictive capability of the FR model with the three obtained models. The findings of the relief-F importance ranking of conditioning factors in the study area revealed that eleven factors, such as slope, aspect, normalized difference vegetation index (NDVI), and elevation, have the highest impact on the occurrence of landslides. The results show that the MLP model had the highest landslide spatial prediction capability (87.06%), followed by the SVM model (80.0%), the RF model (76.67%), and the FR model (61.25%). In addition, the study showed that choosing the optimal machine learning technique appropriately can facilitate landslide susceptibility modeling.
Article
Full-text available
Suppose that the C∞ functions f1, …, fk all have the zero property. We give a necessary and sufficient condition for their product to have the same property. This is a generalization of Bochnak's result ([1]).