Data Mining: Practical Machine Learning Tools and Techniques
Abstract
Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, offers a thorough grounding in machine learning concepts, along with practical advice on applying these tools and techniques in real-world data mining situations. This highly anticipated fourth edition of the most acclaimed work on data mining and machine learning teaches readers everything they need to know to get going, from preparing inputs, interpreting outputs, and evaluating results to the algorithmic methods at the heart of successful data mining approaches. Extensive updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including substantial new chapters on probabilistic methods and on deep learning. Accompanying the book is a new version of the popular WEKA machine learning software from the University of Waikato. Authors Witten, Frank, Hall, and Pal include today's techniques coupled with the methods at the leading edge of contemporary research.
The book companion website at http://www.cs.waikato.ac.nz/ml/weka/book.html contains:
- PowerPoint slides for Chapters 1-12, a comprehensive teaching resource covering each chapter of the book
- an online appendix on the WEKA workbench, a comprehensive learning aid for the open-source software that goes with the book
- the table of contents, highlighting the many new sections in the 4th edition, along with reviews of the 1st edition, errata, etc.
The book:
- provides a thorough grounding in machine learning concepts, as well as practical advice on applying the tools and techniques to data mining projects
- presents concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
- includes a downloadable WEKA software toolkit, a comprehensive collection of machine learning algorithms for data mining tasks in an easy-to-use interactive interface
- includes open-access online courses that introduce practical applications of the material in the book.
... Therefore, big differences between expected and predicted results are noticed [7][8][9]. However, various machine learning methods have recently been applied to predict future stock market prices more accurately and precisely, such as Support Vector Machines, Decision Trees, Fuzzy systems, and Neural Networks [10][11][12][13]. Furthermore, ensemble methods have also been widely applied, among which AdaBoost is considered the most popular [9,10]. ...
... As stated earlier, the main goal of this study was to improve AdaBoostM1 as implemented in WEKA [10]. To achieve this, intensive experiments were conducted on the parameters of the AdaBoostM1 algorithm. ...
Stock market investment has gained significant popularity due to its potential for economic returns, prompting extensive research in financial time series forecasting. Among the predictive models, various adaptations of the AdaBoostM1 algorithm have been applied to stock market prediction, either by tuning parameters or by experimenting with different base learners. However, the achieved accuracy often remains suboptimal. This study addresses these limitations by introducing an enhanced version of AdaBoostM1 (ADA), implemented on the Waikato Environment for Knowledge Analysis (WEKA) platform, to forecast stock prices using historical data. The proposed model, termed AdaBoost with Multilayer Perceptron (ADA-MLP), replaces the commonly used decision stumps with a set of Multilayer Perceptron (MLP) models as weak learners. The experimental results demonstrate that ADA-MLP consistently outperformed the standard AdaBoostM1 algorithm, achieving an average classification accuracy of 100%, compared to 98.48% by AdaBoostM1, a relative improvement of 1.52%. Additionally, ADA-MLP demonstrated superior performance against other enhanced versions of AdaBoost presented in prior studies, achieving an average of 5.3% higher accuracy. Statistical significance testing using the paired t-test confirmed the reliability of these results, with p-values < 0.05. The experiments were conducted on a Yahoo Finance dataset of 25 years of historical data spanning January 1995 to January 2020 and comprising 6295 samples, ensuring a robust and comprehensive evaluation. These findings highlight the potential of ADA-MLP to enhance financial forecasting and offer a reliable tool for stock market prediction. Future research could explore extending this approach to other financial instruments and larger datasets to further validate its effectiveness.
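The abstract above describes replacing decision stumps with MLP weak learners inside AdaBoostM1. The paper's implementation is in WEKA; the code below is only a minimal Python sketch of the same idea, using AdaBoost.M1-style weighting with small scikit-learn MLPs trained on weighted resamples (the hidden-layer size and number of rounds are placeholder choices, not the paper's settings).

```python
# Minimal sketch (not the authors' WEKA implementation) of AdaBoost.M1 with MLP
# weak learners. Because MLPClassifier does not accept sample weights, each
# round trains on a weighted bootstrap resample of the data. X, y: NumPy arrays.
import numpy as np
from sklearn.neural_network import MLPClassifier

def ada_mlp_fit(X, y, n_rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                            # instance weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, replace=True, p=w)  # weighted resample
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)
        clf.fit(X[idx], y[idx])
        miss = clf.predict(X) != y
        err = w[miss].sum()
        if err == 0:                                    # perfect learner: keep it, stop
            learners.append(clf); alphas.append(10.0)   # arbitrary large vote weight
            break
        if err >= 0.5:                                  # too weak: AdaBoost.M1 stops here
            break
        beta = err / (1.0 - err)
        w[~miss] *= beta                                # down-weight correct instances
        w /= w.sum()
        learners.append(clf); alphas.append(np.log(1.0 / beta))
    return learners, alphas

def ada_mlp_predict(X, learners, alphas):
    classes = learners[0].classes_
    votes = np.zeros((len(X), len(classes)))
    for clf, a in zip(learners, alphas):                # weighted majority vote
        votes[np.arange(len(X)), np.searchsorted(classes, clf.predict(X))] += a
    return classes[votes.argmax(axis=1)]
```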
... We try to generate a rule (line 12). Whenever none of the children of a node generates a rule, the node itself tries to generate a new rule (lines 14-19). This can occur, for example, when the children nodes do not see enough samples to satisfy the minimum support threshold, or when the current node is a leaf. ...
... Although available in several popular random forest implementations [16,17], instance-based weighting is not implemented in MLlib. Oversampling replicates some of the records belonging to the minority class or classes, so that the dataset becomes balanced [18]. In our scenario, the application of this technique to the training set did not converge successfully due to memory constraints. ...
... In our scenario, the application of this technique to the training set did not converge successfully due to memory constraints. Conversely, subsampling extracts a fraction of the majority class or classes to reduce their volume to a size comparable to the minority class [18]. We applied this technique to the negative class so that its cardinality in the training set was roughly equal to that of the positive class. ...
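The two snippets above describe balancing the training set by subsampling the majority (negative) class. The paper works on Spark; the following is only a small pandas sketch of the same idea, with a hypothetical label column name.

```python
# Minimal sketch of majority-class subsampling: keep every minority-class record
# and a random fraction of each larger class so that all classes end up with
# roughly the same cardinality. Column name "label" is a placeholder.
import pandas as pd

def balance_by_subsampling(train: pd.DataFrame, label_col: str = "label",
                           seed: int = 42) -> pd.DataFrame:
    counts = train[label_col].value_counts()
    minority_size = counts.min()
    parts = []
    for cls, cnt in counts.items():
        part = train[train[label_col] == cls]
        if cnt > minority_size:
            # keep only a random subset of the majority class
            part = part.sample(n=minority_size, random_state=seed)
        parts.append(part)
    # concatenate and shuffle the balanced training set
    return pd.concat(parts).sample(frac=1.0, random_state=seed)
```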
Supervised learning algorithms are nowadays successfully scaling up to datasets that are very large in volume, leveraging the potential of in-memory cluster-computing Big Data frameworks. Still, massive datasets with a number of large-domain categorical features are a difficult challenge for any classifier. Most off-the-shelf solutions cannot cope with this problem. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and improve the final quality of the model. Furthermore, it adopts several novel techniques to reach high scalability without sacrificing quality, among which is a preventive pruning of classification rules in the extraction phase based on Gini impurity. We ran experiments on Apache Spark, on a real large-scale dataset with more than 4 billion records and 800 million distinct categories. The results showed that DAC improves on a state-of-the-art solution in both prediction quality and execution time. Since the generated model is human-readable, it can not only classify new records but also help users understand both the logic behind the prediction and the properties of the model, becoming a useful aid for decision makers.
... Comparison with existing regression-tree algorithms. We applied the state-of-the-art algorithms for learning regression trees (M5Prime [25] and GUIDE [17]) to our problem. Our goal is different from the goal of these algorithms: we aim to classify data, whereas both GUIDE and M5Prime aim to predict the output for previously unseen input. ...
... The most straightforward way to generalize the decision tree algorithm to learn regression trees is computationally expensive [18], as it requires solving two linear regression problems for each split candidate. Popular regression tree algorithms such as CART [8], M5Prime [25], and GUIDE [17] propose various ways to avoid this problem. CART is a piecewise constant regression tree model that uses the standard regression-tree algorithm (with piecewise constant clusters) and then applies cross-validation to prune the tree. ...
... CART is a piecewise constant regression tree model that uses the standard regression-tree algorithm (with piecewise constant clusters) and then applies cross-validation to prune the tree. The M5Prime algorithm [25] first constructs a piecewise constant model and then fits linear regression models to the leaves during the pruning step. The GUIDE regression tree algorithm [17], at each node, fits the best regression model that predicts the response variable and computes the residual. ...
Differential performance debugging is a technique to find performance problems. It applies in situations where the performance of a program is (unexpectedly) different for different classes of inputs. The task is to explain the differences in asymptotic performance among various input classes in terms of program internals. We propose a data-driven technique based on discriminant regression tree (DRT) learning, where the goal is to discriminate among different classes of inputs. We propose a new algorithm for DRT learning that first clusters the data into functional clusters, capturing different asymptotic performance classes, and then invokes off-the-shelf decision tree learning algorithms to explain these clusters. We focus on linear functional clusters and adapt classical clustering algorithms (K-means and spectral) to produce them. For the K-means algorithm, we generalize the notion of the cluster centroid from a point to a linear function. We adapt spectral clustering by defining a novel kernel function to capture the notion of linear similarity between two data points. We evaluate our approach on benchmarks consisting of Java programs where we are interested in debugging performance. We show that our algorithm significantly outperforms other well-known regression tree learning algorithms in terms of running time and accuracy of classification.
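The abstract mentions generalizing the K-means centroid from a point to a linear function. Below is a minimal sketch of that idea (not the authors' implementation): each cluster's "centroid" is a fitted line, and points are reassigned to the line with the smallest residual.

```python
# Minimal sketch of K-means with linear-function "centroids" (functional
# clustering), assuming 1-D inputs x (e.g. input size) and outputs y (e.g. cost).
import numpy as np

def linear_kmeans(x, y, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(x))        # random initial assignment
    coefs = np.zeros((k, 2))                        # slope, intercept per cluster
    for _ in range(n_iter):
        for j in range(k):                          # update step: fit a line per cluster
            mask = labels == j
            if mask.sum() >= 2:
                coefs[j] = np.polyfit(x[mask], y[mask], deg=1)
        preds = coefs[:, 0][:, None] * x[None, :] + coefs[:, 1][:, None]
        new_labels = np.abs(preds - y[None, :]).argmin(axis=0)  # assign to nearest line
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs
```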
... SVR uses an ε-insensitive loss function [40,41], which means that optimizing an SVR model involves minimizing the weights w_i. This procedure leads to a flat evolution of y(x) and inherently reduces the risk of over-fitting [42] by simultaneously allowing for some larger deviations (up to a value of ε) between y and y_m for individual samples ('outliers'). In this paper, we use ε = 0.1 T, 1.0 MJ/m³, and 0.01 eV/atom for the ML models for µ_0M, K_1, and E_f, respectively. ...
... To validate the ML models built with a specific set of hyperparameters, we determine the Pearson correlation coefficient ρ and the mean absolute error (MAE) [42,56], ...
... To estimate whether the model is over-fitted, i.e., whether it exactly matches the training samples but nothing else, we performed tenfold cross-validation (CV) [42]: The data set used for building and optimizing the ML model consists of 234 (232) samples of type ReFe_{12-4z}A_{4z}X for Re = Nd (Ce). It is randomly divided into ten subsets, of which nine are used to train a new ML model and the tenth is used for validation. ...
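The three snippets above describe SVR with an ε-insensitive loss, validated by the Pearson correlation coefficient, the MAE, and tenfold cross-validation. A compact scikit-learn sketch of that evaluation loop is given below; the kernel and ε value are placeholders rather than the paper's settings.

```python
# Minimal sketch of tenfold cross-validation for an SVR model, reporting the
# Pearson correlation coefficient (rho) and the mean absolute error (MAE).
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def tenfold_cv(X, y, epsilon=0.1):
    rhos, maes = [], []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = SVR(kernel="rbf", epsilon=epsilon).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        rhos.append(pearsonr(y[val_idx], pred)[0])       # Pearson correlation rho
        maes.append(np.mean(np.abs(y[val_idx] - pred)))  # mean absolute error
    return np.mean(rhos), np.mean(maes)
```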
Machine Learning (ML) plays an increasingly important role in the discovery and design of new materials. In this paper, we demonstrate the potential of ML for materials research using hard-magnetic phases as an illustrative case. We build kernel-based ML models to predict optimal chemical compositions for new permanent magnets, which are key components in many green-energy technologies. The magnetic-property data used for training and testing the ML models are obtained from a combinatorial high-throughput screening based on density-functional theory calculations. Our straightforward choice of how to describe the different configurations enables the subsequent use of the ML models for compositional optimization and thereby the prediction of promising substitutes of state-of-the-art magnetic materials like NdFeB with similar intrinsic hard-magnetic properties but a lower amount of critical rare-earth elements.
... The RF algorithm minimizes prediction variance in supervised machine learning by combining multiple decision trees through a process called bagging [35]. Initially, each tree is trained on a random subset of the data; during validation, the algorithm averages the predictions to arrive at a balanced prediction outcome [35,36]. Its effectiveness lies in handling intricate decision boundaries for both categorical and continuous variables, managing noisy and high-dimensional datasets, and minimizing overfitting [37]. ...
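The snippet above summarizes bagging: each tree sees a bootstrap sample and the predictions are averaged. A minimal sketch of plain bagging with decision trees follows (a full Random Forest additionally randomizes the features considered at each split).

```python
# Minimal sketch of bagging for regression: train each tree on a bootstrap
# sample of the data and average the trees' predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_fit(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(y), size=len(y), replace=True)   # bootstrap sample
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_trees_predict(X, trees):
    return np.mean([t.predict(X) for t in trees], axis=0)     # average over trees
```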
... RF displayed resilience to overfitting, achieving consistent accuracy across different complexity levels. This characteristic makes it particularly valuable for large-scale forest assessments, where overfitting poses a significant challenge [35,36]. In contrast, CART exhibited greater sensitivity to model complexity, reinforcing its role as a simpler yet less flexible algorithm for specific applications. ...
Forest attributes such as the standing stock, diameter at breast height, tree height, and basal area are essential in forest management. Conventional estimation methods, which are still largely used in many parts of the world, are typically resource intensive. Machine learning algorithms working with remotely sensed data trained by ground measurements may provide a promising, more efficient alternative. This study evaluates the performance of three machine learning algorithms, namely Random Forest, Classification and Regression Trees, and the Gradient Boosting Tree Algorithm, in estimating these forest attributes. Ground truth data was sourced from measurements carried out in relevant forests from Romania and from an independent dataset from Brasov County. The predictive ability of the tested algorithms was examined by considering several spatial resolutions. The results showed varying degrees of performance. Random Forest was the best performer, with RMSE and R² values over 0.8 for all attributes. GBTA excelled in predicting the standing stock, achieving R² values over 0.9. The validation based on the independent dataset confirmed higher performance for both RF and GBTA. In contrast, CART excelled in predicting the basal area, but struggled with breast height diameter, standing stock, and tree height. A sensitivity analysis concerning the spatial resolution revealed high degrees of discrepancy. Random Forest and the Gradient Boosting Tree Algorithm were more consistent when estimating the standing stock, but showed inconsistency for breast height diameter and tree height, while CART showed important variations. These results provide useful insights into the strengths and weaknesses of these algorithms, and provide the information required to select the best option when aiming to use similar solutions for estimation.
... Whenever algorithms are used to make predictions, they must be carefully evaluated to ensure that their predictions meaningfully represent medically relevant information. Evaluation must be specified for each problem [13]. For example, if an algorithm is being used to predict one of two things, such as whether a patient is depressed, then it could be evaluated by the percentage of predictions that are correct [9]. ...
... There are many ways to evaluate how good an algorithm's predictions are [10,13,16,18]. The general approach is a two-step process of measuring an algorithm's error, or how inaccurate its predictions are, and then comparing the algorithm's error with the error of simply guessing answers. ...
A new trend in medicine is the use of algorithms to analyze big datasets, e.g. using everything your phone measures about you for diagnostics or monitoring. However, these algorithms are commonly compared against weak baselines, which may contribute to excessive optimism. To assess how well an algorithm works, scientists typically ask how well its output correlates with medically assigned scores. Here we perform a meta-analysis to quantify how the literature evaluates algorithms for monitoring mental wellbeing. We find that the bulk of the literature (77%) uses meaningless comparisons that ignore patient baseline state. For example, having an algorithm that uses phone data to diagnose mood disorders would be useful. However, it is possible to explain over 80% of the variance of some mood measures in the population by simply guessing that each patient has their own average mood - the patient-specific baseline. Thus, an algorithm that just predicts that our mood is like it usually is can explain the majority of variance, but is, obviously, entirely useless. Comparing to the wrong (population) baseline has a massive effect on the perceived quality of algorithms and produces baseless optimism in the field. To solve this problem we propose "user lift" that reduces these systematic errors in the evaluation of personalized medical monitoring.
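The abstract argues that a patient-specific baseline (each patient's own average mood) explains far more variance than the population baseline. The sketch below illustrates that comparison on a generic long-format table; the column names are hypothetical.

```python
# Minimal sketch: variance "explained" by the population-mean baseline versus
# the patient-specific baseline (each patient's own average mood).
import numpy as np
import pandas as pd

def variance_explained(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot            # R^2 relative to the grand mean

def baseline_r2(df: pd.DataFrame, patient_col="patient", mood_col="mood"):
    y = df[mood_col].to_numpy()
    population_pred = np.full(len(df), y.mean())                       # population baseline
    per_patient_pred = df.groupby(patient_col)[mood_col].transform("mean").to_numpy()
    # the first value is 0 by construction; the second captures between-patient variance
    return variance_explained(y, population_pred), variance_explained(y, per_patient_pred)
```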
... In the age of big data, machine learning has become an indispensable tool whose utility transcends scientific and academic boundaries [1]. From social networking [2,3] to object and image recognition [4,5], from advertising to finance [6], from engineering to medicine [7], from biological physics [8,9] to astrophysics [10], wherever there is a preponderance of information and real data, machine learning is helping to find and quantify patterns and even discover basic laws [11]. ...
... Rather than being restricted to linear transformations of the input data into principal components, the autoencoder can incorporate nonlinear representations. Given a set of spin configurations {S_i}, in an autoencoder the weights in the neural network can be trained through, for example, backpropagation [1,62], to return target values equal to the inputs. In this work, we choose convolutional neural networks (CNNs) [23,63,64] as the encoder and the decoder networks. ...
We apply unsupervised machine learning techniques, mainly principal component analysis (PCA), to compare and contrast the phase behavior and phase transitions in several classical spin models - the square and triangular-lattice Ising models, the Blume-Capel model, a highly degenerate biquadratic-exchange spin-one Ising (BSI) model, and the 2D XY model, and examine critically what machine learning is teaching us. We find that quantified principal components from PCA not only allow exploration of different phases and symmetry-breaking, but can distinguish phase transition types and locate critical points. We show that the corresponding weight vectors have a clear physical interpretation, which is particularly interesting in the frustrated models such as the triangular antiferromagnet, where they can point to incipient orders. Unlike the other well-studied models, the properties of the BSI model are less well known. Using both PCA and conventional Monte Carlo analysis, we demonstrate that the BSI model shows an absence of phase transition and macroscopic ground-state degeneracy. The failure to capture the `charge' correlations (vorticity) in the BSI model (XY model) from raw spin configurations points to some of the limitations of PCA. Finally, we employ a nonlinear unsupervised machine learning procedure, the `autoencoder method', and demonstrate that it too can be trained to capture phase transitions and critical points.
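The abstract describes obtaining "quantified principal components" directly from raw spin configurations. A minimal PCA sketch of that kind of analysis is shown below, with random ±1 configurations standing in for Monte Carlo samples.

```python
# Minimal sketch: PCA applied to raw spin configurations. Random +/-1 arrays
# stand in here for Monte Carlo samples of a 32x32 lattice.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
configs = rng.choice([-1, 1], size=(1000, 32 * 32))   # 1000 flattened configurations
pca = PCA(n_components=2)
components = pca.fit_transform(configs)               # quantified principal components
print(pca.explained_variance_ratio_)                  # weight of each component
# pca.components_[0] can be reshaped to the lattice to inspect the weight vector
# for a physical interpretation (e.g. uniform magnetization or sublattice order).
```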
... We evaluate our procedure through the de-facto standard approach of k-fold cross-validation [34], [35]. In general, k-fold cross-validation requires: 1) splitting the dataset into k non-overlapping parts ("folds"); 2) training k different models, each using one of the folds as the testing set and the other ones as the training set; 3) computing the error metrics (e.g., RMSE) for each of the models; 4) averaging the error metrics. ...
... By doing so, we ensure that all folds reflect the time and space features of the original trace. Based on industry best practices [35] and tests run on reference datasets [36], we set k = 10. ...
Deployment and demand traces are a crucial tool to study today's LTE systems, as well as their evolution toward 5G. In this paper, we use a set of real-world, crowdsourced traces, coming from the WeFi and OpenSignal apps, to investigate how present-day networks are deployed, and the load they serve. Given this information, we present a way to generate synthetic deployment and demand profiles, retaining the same features of their real-world counterparts. We further discuss a methodology using traces (both real-world and synthetic) to assess (i) to which extent the current deployment is adequate to the current and future demand, and (ii) the effectiveness of the existing strategies to improve network capacity. Applying our methodology to real-world traces, we find that present-day LTE deployments consist of multiple, entangled, medium- to large-sized cells. Furthermore, although today's LTE networks are overprovisioned when compared to the present traffic demand, they will need substantial capacity improvements in order to face the load increase forecasted between now and 2020.
... With the learned features of violations, we cluster violations with the X-means clustering algorithm. In this study, we use Weka's implementation [50] of X-means to cluster violations. Finally, we manually label each cluster with the code patterns identified from the clustered similar code fragments of violations, to show the patterns clearly. ...
... In this study, all parameters of the two models are tuned through a network-training visualization UI provided by DeepLearning4J. Finally, Weka's [50] implementation of the X-means clustering algorithm uses the extracted features to find similar code for each violation type. Parameter settings for the clustering are enumerated in Table 4. ...
In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal that there are discrepancies in the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread and categories, which can provide insights into prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that utilizes convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers will accept and merge a majority (69/116) of fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in Defects4J, a major benchmark for software testing and automated repair.
... We used version 3.9.6 of the Waikato Environment for Knowledge Analysis (WEKA) software [20], [21]. It split each dataset into training/validation (80%) and testing (20%) sets. ...
... The next step involved the development of a confusion matrix to understand the types of errors made by each instructor. A confusion matrix allows for comprehensive evaluation of how well a model performs as well as where it might go wrong (Witten et al., 2005). In our study, in the context of evaluating instructors' predictions of item difficulty, using a confusion matrix aided in identifying specific patterns regarding their predictions. ...
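The snippet above uses a confusion matrix to see where instructors' difficulty predictions go wrong. A small scikit-learn sketch follows; the difficulty labels are hypothetical examples, not the study's data.

```python
# Minimal sketch of a confusion matrix comparing actual item difficulty with an
# instructor's predicted difficulty (labels here are hypothetical).
from sklearn.metrics import confusion_matrix

actual    = ["easy", "hard", "hard", "medium", "easy", "hard"]
predicted = ["easy", "medium", "hard", "medium", "hard", "hard"]
labels = ["easy", "medium", "hard"]
cm = confusion_matrix(actual, predicted, labels=labels)
print(cm)  # rows: actual difficulty, columns: predicted difficulty
```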
In numerous studies focusing on assessment and evaluation of teaching Turkish as a foreign language, researchers have frequently identified issues related to the standardization and low validity and reliability of exams. Addressing these issues and investigating the underlying causes is paramount. Given that the development of assessment tools by Turkish language teaching centers is typically the responsibility of instructors, it is essential to understand their perspectives regarding these tools. This study aimed to evaluate the perceptions of instructors concerning item difficulty in the context of teaching Turkish as a foreign language. Initially, item analyses were conducted on reading tests included in assessment tools designed by a Turkish language teaching center for B1, B2, and C1 proficiency levels. Instructors from various Turkish language teaching centers were asked to evaluate item difficulty through a prepared questionnaire. Data regarding instructors' educational backgrounds, experiences, and involvement in exam creation were collected. Various analytical methods were employed to examine and interpret the obtained data. Item analysis results of the examined tests were compared with instructors' perceptions of difficulty using fit analysis. The accuracy of instructors' item difficulty estimates was calculated for each instructor using an error matrix, and success rates were determined. To identify the effects of instructors' characteristics on item difficulty estimation, t-test and ANOVA analyses were performed. The results of these analyses were interpreted alongside the item analyses, and recommendations were provided to enhance the assessment and evaluation literacy of instructors teaching Turkish as a foreign language.
... This is possible because the state indicators of any object can be collected via the JMX (Java Management Extensions) management and monitoring infrastructure [24] through MBean or MXBean managers. The experiments used a third-party dataset that was made publicly available in 2018 [25] and meets the requirements imposed on the features used for failure prediction in the present study. The dataset contains 831 indicators and 7.5 million time measurements taken at one-minute intervals. ...
Software failures are a tangible and unavoidable problem in the operation of enterprise software systems. Failures are usually detected by monitoring critical system indicators for threshold violations. However, measures to prevent failures or their consequences often cannot be taken because there is not enough time to act. Failures therefore need to be predicted in a timely manner based on application state logs. To this end, various approaches to failure prediction were studied, one of which is based on detecting the anomalies in application state data that precede failures. This work proposes several machine-learning-based approaches for predicting failures through anomaly detection. The best failure prediction results were achieved using gradient boosting over decision trees combined with a sliding-window method and an excluded pre-anomaly region of the time series in the log data. This makes it possible to detect failures in the dataset used with an acceptable lead time before the system fails. For the case where no expert-labeled training data are available, an unsupervised learning approach using isolation forests and an approach for automatic data labeling are proposed.
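The abstract mentions an unsupervised variant based on isolation forests over application-state time series. The following is a minimal sketch of that idea, scoring sliding windows of a single metric with scikit-learn's IsolationForest; the window length and contamination rate are placeholder values, not the paper's.

```python
# Minimal sketch: score sliding windows of a monitoring metric with an
# Isolation Forest; low scores flag windows that look like pre-failure anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

def window_features(series: np.ndarray, width: int = 60):
    # one row per window: simple summary statistics of the last `width` minutes
    rows = [series[i:i + width] for i in range(len(series) - width + 1)]
    return np.array([[w.mean(), w.std(), w.min(), w.max()] for w in rows])

def score_anomalies(series: np.ndarray) -> np.ndarray:
    X = window_features(series)
    model = IsolationForest(contamination=0.01, random_state=0).fit(X)
    return model.decision_function(X)   # lower scores = more anomalous windows
```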
... An explosion in the number of networked IoT devices is already underway, and with such large numbers of networked devices generating vast quantities of data, extracting hidden information from big data is useful. A few pioneering studies [3,4] lay the foundations for data mining approaches. Over the past decade, researchers have attempted to find efficient algorithms for mining the huge volume of information generated by social networks, IoT networks, and wireless sensor networks [5][6][7][8][9][10]. ...
The highly efficient HEP algorithm is a useful tool for mining High Occupancy (HO) item sets. Occupancy is an important measure that describes the interestingness of frequent item sets. The current study examines the efficiency problems in mining HO item sets and proposes an improved HEP algorithm, named advanced HEP (A–HEP), based on set theory rules which eliminate a large number of redundant iterations. The study also proposes a novel adaptive-and-modified HEP (NAM–HEP) algorithm that uses HO Set-Enumeration (SE) trees to store HO item sets. The study proposes definitions for adaptive thresholds such as support threshold and occupancy threshold based on the attributes of the transaction database for efficient pruning of the HO-SE tree. Two pseudo-code blocks are presented in addition to a detailed description of the A–HEP and NAM–HEP algorithms and their advantages. Using the A–HEP and NAM–HEP algorithms, HO item sets are investigated from the practical transaction databases named mushroom and retail. The results indicate that the proposed A–HEP and NAM–HEP algorithms enhance mining performance and runtime benchmarks.
... Since the release of SPSS (IBM, 1968) in 1968, followed by SAS (SAS Institute, 1976), Matlab (MathWorks, 1984), Excel (Microsoft, 1985), Python (Python Software Foundation, 1991), R (R Foundation for Statistical Computing, 1995), PowerBI (Microsoft, 2013), and other specialized data analysis tools and programming languages, these advancements have significantly aided professionals in conducting statistical experiments and data analysis. Moreover, they have made data analysis more accessible to a broader range of practitioners (Witten et al., 2016). ...
In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.
... Deep learning differs significantly from traditional learning methods in feature extraction, model architecture, and data processing. Traditional methods rely on manually designed features and relatively simple models, suitable for small datasets [4]. Additionally, deep learning possesses powerful generalization capabilities, adapting well to new datasets and tasks [3]. ...
The era of Big Data has ushered in an unprecedented deluge of information, characterized by its massive volume, diverse variety, high velocity, and inherent veracity challenges. Traditional data processing and analysis techniques often falter in the face of such complexities. Deep Learning, with its capacity to discern intricate patterns and representations from raw data, has emerged as a promising tool to navigate this landscape. This survey article provides a comprehensive overview of the methods employed in the realm of Big Data, systematically categorized based on the stage of data processing they address (preprocessing, storage, or processing & management), the type of learning involved (supervised, unsupervised, or reinforcement), the specific data characteristics they tackle (volume, variety, velocity, or veracity), and their application areas. We delve into the strengths and limitations of each method, highlighting their suitability for different Big Data scenarios. Furthermore, we explore the challenges and future trends in applying Deep Learning to Big Data, emphasizing the need for innovative solutions to harness its full potential. By offering a structured classification and insightful analysis, this survey serves as a valuable resource for researchers and practitioners seeking to understand and leverage the synergy between Deep Learning and Big Data.
... It leverages research in various fields, including statistics, databases, pattern recognition, machine learning, and data visualization, to provide advanced business intelligence and web discovery solutions. The term data mining refers to the "step in the overall process of knowledge discovery that consists of pre-processing, data mining, and post-processing" (Witten et al., 2016). It is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Frawley et al., 1992; Fayyad et al., 1996). ...
Delivering high-quality content is crucial for effective reading comprehension and successful learning. Ensuring educational materials are interpreted as intended by their authors is a persistent challenge, especially with the added complexity of multimedia and interactivity in the digital age. Authors must continuously revise their materials to meet learners' evolving needs. Detecting comprehension barriers and identifying actionable improvements within documents is complex, particularly in education where reading is fundamental. This study presents an analytical framework to help course designers enhance educational content to better support learning outcomes. Grounded in a robust theoretical foundation integrating learning analytics, reading comprehension, and content revision, our approach introduces usage-based document reengineering. This methodology adapts document content and structure based on insights from analyzing digital reading traces, that is, interactions between readers and content. We define reading sessions to capture these interactions and develop indicators to detect comprehension challenges. Our framework enables authors to receive tailored content revision recommendations through an interactive dashboard, presenting actionable insights from reading activity. The proposed approach was implemented and evaluated using data from a European e-learning platform. Evaluations validate the framework's effectiveness, demonstrating its capacity to empower authors with data-driven insights for targeted revisions. The findings highlight the framework's ability to enhance educational content quality, making it more responsive to learners' needs. This research significantly contributes to learning analytics and content optimization, offering practical tools to improve educational outcomes and inform future developments in e-learning.
... To extend the DC method to portfolio management, we proposed a novel approach that aggregates the prices of multiple assets, allowing DC to be applied to a portfolio. Assets are selected each month by clustering the trading universe into distinct categories using the symmetric uncertainty (SU) metric (Witten et al., 2005), based on data from the previous month. The center of each resulting cluster is chosen as a portfolio asset, and the prices of these centers are aggregated to apply the DC method to the entire portfolio. ...
The cryptocurrency market is characterized by high volatility and frequent regime changes, posing significant challenges for trading strategies that rely on static logic. To address these dynamic conditions, we propose a strategy based on the Directional Change (DC) methodology, integrating it with a portfolio approach utilizing clustering by symmetric uncertainty (SU). The DC method samples prices based on significant market movements rather than fixed time intervals, enhancing its flexibility in responding to market volatility. Additionally, we developed a meta-learning model to adaptively select hyperparameters of the strategy based on four feature sets: the strategy's meta-information, on-chain data from Bitcoin and Ethereum blockchain, DC-extracted indicators, and statistics that measure market regimes. Our findings indicate that varying hyperparameter ranges outperform each other under different market conditions, underscoring the need for adaptive selection. The meta-learning framework notably enhances strategy performance, achieving up to a tenfold increase in return rate and a threefold increase in the Sharpe ratio. Analysis of the meta-learning models shows a low correlation between outputs from models trained on distinct feature categories, suggesting that each captures a unique aspect of parameter selection. Additionally, feature behavior analysis reveals that different categories were most informative at various points in the backtest, with strategy meta-information and DC indicators standing out as the most impactful features.
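The snippet above selects assets by clustering with the symmetric uncertainty (SU) metric, SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)). A small sketch of computing SU for two discretized series follows; the binning is a placeholder choice, not the paper's.

```python
# Minimal sketch of the symmetric uncertainty (SU) metric between two series,
# computed on discretized values (entropies and mutual information in nats).
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def symmetric_uncertainty(x, y, bins=10):
    xd = np.digitize(x, np.histogram_bin_edges(x, bins))   # discretize series x
    yd = np.digitize(y, np.histogram_bin_edges(y, bins))   # discretize series y
    mi = mutual_info_score(xd, yd)                         # I(X; Y)
    return 2.0 * mi / (entropy(xd) + entropy(yd))          # SU in [0, 1]
```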
... The Multilayer Perceptron (MLP) [39] is one of the most commonly used neural network methods for classification. The MLP is a feedforward neural network that uses the standard back-propagation algorithm for training. ...
Online audio advertising is a particular form of advertising used abundantly in online music streaming services. In these platforms, which tend to host tens of thousands of unique audio advertisements (ads), providing high quality ads ensures a better user experience and results in longer user engagement. Therefore, the automatic assessment of these ads is an important step toward audio ads ranking and better audio ads creation. In this paper we propose one way to measure the quality of the audio ads using a proxy metric called Long Click Rate (LCR), which is defined by the amount of time a user engages with the follow-up display ad (that is shown while the audio ad is playing) divided by the impressions. We later focus on predicting the audio ad quality using only acoustic features such as harmony, rhythm, and timbre of the audio, extracted from the raw waveform. We discuss how the characteristics of the sound can be connected to concepts such as the clarity of the audio ad message, its trustworthiness, etc. Finally, we propose a new deep learning model for audio ad quality prediction, which outperforms the other discussed models trained on hand-crafted features. To the best of our knowledge, this is the first large-scale audio ad quality prediction study.
... Using the data described above, we compare the performance of the proposed method, LVQ + PSO, vis-à-vis the C4.5 method defined by Quinlan (1993) and PART defined by Witten et al. (2011). Both alternative methods produce classification rules. ...
One of the key elements in the banking industry relies on the appropriate selection of customers. In order to manage credit risk, banks dedicate special efforts to classifying customers according to their risk. The usual decision-making process consists of gathering personal and financial information about the borrower. Processing this information can be time consuming, and presents some difficulties due to the heterogeneous structure of the data. We offer in this paper an alternative method that is able to classify customers' profiles from numerical and nominal attributes. The key feature of our method, called LVQ+PSO, is the finding of a reduced set of classifying rules. This is possible due to the combination of a competitive neural network with an optimization technique. These rules constitute a predictive model for credit risk approval. The reduced quantity of rules makes this method useful not only for credit officers aiming to make quick decisions about granting a credit, but it could also act as a borrower self-selection mechanism. Our method was applied to an actual database of a consumer credit financial institution in Ecuador. We obtained very satisfactory results. Future research lines are outlined.
... The table below gives the performance of different classifiers on 1,467 preprocessed tweets, used to determine the relevance of tweets to Zika (Table 4). Unigram features were extracted from the texts using the Weka toolbox [21]. For this dataset, the classifiers perform fairly well, with AUC values ranging from 0.78 to 0.94. ...
The purpose of this study was to perform a dataset distribution analysis, a classification performance analysis, and a topical analysis concerning what people are tweeting about four disease characteristics: symptoms, transmission, prevention, and treatment. A combination of natural language processing and machine learning techniques was used to determine what people are tweeting about Zika. Specifically, a two-stage classifier system was built to find relevant tweets on Zika, and then categorize these into the four disease categories. Tweets in each disease category were then examined using latent Dirichlet allocation (LDA) to determine the five main tweet topics for each disease characteristic. Results: 1,234,605 tweets were collected. Tweets by males and females were similar (28% and 23% respectively). The classifier performed well on the training and test data for relevancy (F=0.87 and 0.99 respectively) and disease characteristics (F=0.79 and 0.90 respectively). Five topics for each category were found and discussed, with a focus on the symptoms category. Through this process, we demonstrate how misinformation can be discovered so that public health officials can respond to the tweets with misinformation.
... Our approach reached a precision of 0.740 and a recall of 0.982 (F1-score = 0.844) on the "upvote" class in a 10-fold cross-validation experiment. The feature analysis using the Weka toolkit [47] showed that the top 3 features concerned the historical performance of the worker who proposed the message, and that 13 of the top 20 features were individual dimensions of one of the word vectors. On the other hand, performance on the "downvote" class was less effective. ...
Crowd-powered conversational assistants have been shown to be more robust than automated systems, but do so at the cost of higher response latency and monetary costs. A promising direction is to combine the two approaches for high quality, low latency, and low cost solutions. In this paper, we introduce Evorus, a crowd-powered conversational assistant built to automate itself over time by (i) allowing new chatbots to be easily integrated to automate more scenarios, (ii) reusing prior crowd answers, and (iii) learning to automatically approve response candidates. Our 5-month-long deployment with 80 participants and 281 conversations shows that Evorus can automate itself without compromising conversation quality. Crowd-AI architectures have long been proposed as a way to reduce cost and latency for crowd-powered systems; Evorus demonstrates how automation can be introduced successfully in a deployed system. Its architecture allows future researchers to make further innovation on the underlying automated components in the context of a deployed open domain dialog system.
... Any externally imported additional variables that were not generated using the GraphVar pipeline should also be quality-controlled for continuity, entry errors such as duplication, noise, and extreme outliers if they are added as features to the design matrix (Witten et al. 2016). ...
Background: We previously presented GraphVar as a user-friendly MATLAB toolbox for comprehensive graph analyses of functional brain connectivity. Here we introduce a comprehensive extension of the toolbox allowing users to seamlessly explore easily customizable decoding models across functional connectivity measures as well as additional features. New Method: GraphVar 2.0 provides machine learning (ML) model construction, validation and exploration. Machine learning can be performed across any combination of network measures and additional variables, allowing for a flexibility in neuroimaging applications. Results: In addition to previously integrated functionalities, such as network construction and graph-theoretical analyses of brain connectivity with a high-speed general linear model (GLM), users can now perform customizable ML across connectivity matrices, network metrics and additionally imported variables. The new extension also provides parametric and nonparametric testing of classifier and regressor performance, data export, figure generation and high quality export. Comparison with existing methods: Compared to other existing toolboxes, GraphVar 2.0 offers (1) comprehensive customization, (2) an all-in-one user friendly interface, (3) customizable model design and manual hyperparameter entry, (4) interactive results exploration and data export, (5) automated cueing for modelling multiple outcome variables within the same session, (6) an easy to follow introductory review. Conclusions: GraphVar 2.0 allows comprehensive, user-friendly exploration of encoding (GLM) and decoding (ML) modelling approaches on functional connectivity measures making big data neuroscience readily accessible to a broader audience of neuroimaging investigators.
... 2) Alternative Textual Features: We explore different types of sub-networks for textual feature extraction from noisy tags. Word2vec [47] is one of the most popular methods for textual feature extraction. Since the 1,000 most frequent tags are out of order and cannot form meaningful sentences, it is hard to directly generate word2vec features from these tags. ...
Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from object, scene to abstract concept; 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Specifically, a novel two-branch deep neural network architecture is proposed which comprises a very deep main network branch and a companion feature fusion network branch designed for fusing the multi-scale features computed from the main branch. The deep model is also made multi-modal by taking noisy user-provided tags as model input to complement the image input. For tackling the second issue, we introduce a label quantity prediction auxiliary task to the main label prediction task to explicitly estimate the optimal label number for a given image. Extensive experiments are carried out on two large-scale image annotation benchmark datasets and the results show that our method significantly outperforms the state-of-the-art.
... Based on the above, one expects that leave-one-out CV (where each fold's size is 1 sample) should be the least biased. However, leave-one-out CV can collapse in the sense that it can provide extremely misleading estimates in degenerate situations (see [33], p. 151, and [19] for an extreme failure of leave-one-out CV and of the 0.632 bootstrap rule). We believe that the problem of leave-one-out CV stems from the fact that the folds may follow a totally different distribution than the distribution of the class in the original dataset: when only one example is left out, the distribution of one class in the fold is 100% and 0% for all the others. ...
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation and a method by Tibshirani and Tibshirani, BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based hypothesis test we stop training of models on new folds of statistically-significantly inferior configurations. We name the method Bootstrap Corrected with Early Dropping CV (BCED-CV) that is both efficient and provides accurate performance estimates.
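As a rough illustration of the BBC-CV idea described in the abstract, the sketch below bootstraps the pooled out-of-sample predictions, picks the best configuration on each bootstrap sample, and scores it on the left-out samples; this is only our reading of the abstract, not the authors' code, and accuracy is used as a placeholder metric.

```python
# Minimal sketch of bootstrap bias correction over pooled out-of-sample
# predictions: select the "winner" on each bootstrap sample, score it on the
# out-of-bag samples, and average those scores.
import numpy as np
from sklearn.metrics import accuracy_score

def bbc_cv_estimate(oos_preds, y, n_boot=500, seed=0):
    """oos_preds: (n_samples, n_configs) out-of-sample predictions per configuration."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_boot):
        boot = rng.integers(0, n, size=n)               # in-bag (bootstrap) indices
        out = np.setdiff1d(np.arange(n), boot)          # out-of-bag indices
        if len(out) == 0:
            continue
        # pick the configuration that looks best on the bootstrap sample ...
        best = np.argmax([accuracy_score(y[boot], oos_preds[boot, c])
                          for c in range(oos_preds.shape[1])])
        # ... and score it on the held-out samples to correct for selection bias
        scores.append(accuracy_score(y[out], oos_preds[out, best]))
    return float(np.mean(scores))
```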
... We chose them because they are popular and have been shown to be effective at predicting defects in a recent study [27]. Naive Bayes (NB) [93] is a statistical technique which uses the combined probabilities of the different attributes to predict faultiness. Logistic Regression (LR) [19] is a regression technique which identifies the best set of weights for each attribute to predict the faulty or non-faulty class. ...
Concurrent programs are difficult to test due to their inherent non-determinism. To address this problem, testing often requires the exploration of thread schedules of a program; this can be time-consuming when applied to real-world programs. Software defect prediction has been used to help developers find faults and prioritize their testing efforts. Prior studies have used machine learning to build such predicting models based on designed features that encode the characteristics of programs. However, research has focused on sequential programs; to date, no work has considered defect prediction for concurrent programs, with program characteristics distinguished from sequential programs. In this paper, we present ConPredictor, an approach to predict defects specific to concurrent programs by combining both static and dynamic program metrics. Specifically, we propose a set of novel static code metrics based on the unique properties of concurrent programs. We also leverage additional guidance from dynamic metrics constructed based on mutation analysis. Our evaluation on four large open source projects shows that ConPredictor improved both within-project defect prediction and cross-project defect prediction compared to traditional features.
... The highly non-linear nature of dark matter evolution makes it a problem well-suited to machine learning. Machine learning is a highly efficient and powerful tool to learn relationships which are too complex for standard statistical techniques (Witten et al. 2016). In the context of structure formation, machine learning techniques have also been shown to be effective, for example, in learning the relationship between dark and baryonic matter from semi-analytic models (Kamdar et al. 2016;Agarwal et al. 2017;Nadler et al. 2017). ...
We train a machine learning algorithm to learn cosmological structure formation from N-body simulations. The algorithm infers the relationship between the initial conditions and the final dark matter haloes, without the need to introduce approximate halo collapse models. We gain insights into the physics driving halo formation by evaluating the predictive performance of the algorithm when provided with different types of information about the local environment around dark matter particles. The algorithm learns to predict whether or not dark matter particles will end up in haloes of a given mass range, based on spherical overdensities. We show that the resulting predictions match those of spherical collapse approximations such as extended Press-Schechter theory. Additional information on the shape of the local gravitational potential is not able to improve halo collapse predictions; the linear density field contains sufficient information for the algorithm to also reproduce ellipsoidal collapse predictions based on the Sheth-Tormen model. We investigate the algorithm's performance in terms of halo mass and radial position and perform blind analyses on independent initial conditions realisations to demonstrate the generality of our results.
... The software is entirely written in MATLAB and uses a modified version of the WEKA library [49], written in Java, known as WekaUT (for more information refer to http://www.cs.utexas.edu/users/ml/risc/code/) for the clustering procedure. ...
The Morris Water Maze is commonly used in behavioural neuroscience for the study of spatial learning with rodents. Over the years, various methods of analysing rodent data collected in this task have been proposed. These methods span from classical performance measurements (e.g. escape latency, rodent speed, quadrant preference) to more sophisticated methods of categorisation which classify the animal swimming path into behavioural classes known as strategies. Classification techniques provide additional insight in relation to the actual animal behaviours, but still only a limited number of studies utilise them, mainly because they highly depend on machine learning knowledge. We have previously demonstrated that the animals implement various strategies and that classifying whole trajectories can lead to the loss of important information. In this work, we developed a generalised and robust classification methodology which implements majority voting to boost the classification performance and successfully nullify the need for manual tuning. Based on this framework, we built a complete software package, capable of performing the full analysis described in this paper. The software provides an easy to use graphical user interface (GUI) through which users can enter their trajectory data, segment and label them, and finally generate reports and figures of the results.
... An important issue with unsupervised methods is the difficulty of evaluating the output, since the quality of the output is rarely quantifiable. A vast literature on non-probabilistic methods exploits data mining methods, such as frequent pattern mining and anomaly detection [190]. We do not discuss these models here, since they are not probabilistic models of code. ...
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
... These procedures are routinely applied in machine learning and statistics to avoid overfitting and overly optimistic error estimates. We employ a decision tree builder using the variance of explanatory variables and tree pruning using reduced-error pruning with back fitting (REPTree), implemented in the Weka package [38]. ...
We propose a novel representation of crystalline materials named orbital-field matrix (OFM) based on the distribution of valence shell electrons. We demonstrate that this new representation can be highly useful in mining material data. Our experiment shows that the formation energies of crystalline materials, the atomization energies of molecular materials, and the local magnetic moments of the constituent atoms in transition metal–rare-earth metal bimetal alloys can be predicted with high accuracy using the OFM. Knowledge regarding the role of coordination numbers of transition-metal and rare-earth metal elements in determining the local magnetic moment of transition metal sites can be acquired directly from decision tree regression analyses using the OFM.
... Testing and verification of DNNs. Traditional practices in evaluating machine learning systems primarily measure their accuracy on randomly drawn test inputs from manually labeled datasets [81]. Some machine learning systems like autonomous vehicles leverage ad hoc unguided simulations [2,4]. ...
Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including self-driving cars and malware detection, where the correctness and predictability of a system's behavior for corner case inputs are of great importance. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs. We design, implement, and evaluate DeepXplore, the first whitebox framework for systematically testing real-world DL systems. First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles to avoid manual checking. Finally, we demonstrate how finding inputs for DL systems that both trigger many differential behaviors and achieve high neuron coverage can be represented as a joint optimization problem and solved efficiently using gradient-based search techniques. DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve the model's accuracy by up to 3%.
... To classify each requirement of the respective data set, we implemented a Java-based feature extraction prototype that parses all requirements from the data set and extracts the values for all ten features mentioned above. Subsequently, we used Weka [16] to train a C4.5 decision tree [17], which ships with Weka as the J48 implementation. Following Hussain et al. [6], we set the parameter for the minimum number of instances per leaf to 6 to counter possible over-fitting. ...
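A minimal sketch of such a J48 configuration via WEKA's Java API; only the minimum-instances-per-leaf setting is taken from the passage above, and the dataset name is a placeholder:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48RequirementsSketch {
    public static void main(String[] args) throws Exception {
        // Feature vectors extracted from the requirements (placeholder file).
        Instances data = DataSource.read("requirements-features.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's implementation of the C4.5 decision tree learner.
        J48 j48 = new J48();
        j48.setMinNumObj(6);   // minimum instances per leaf, as in the cited setup
        j48.buildClassifier(data);

        System.out.println(j48);   // prints the induced tree
    }
}
```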
Classifying requirements into functional requirements (FR) and non-functional ones (NFR) is an important task in requirements engineering. However, automated classification of requirements written in natural language is not straightforward, due to the variability of natural language and the absence of a controlled vocabulary. This paper investigates how automated classification of requirements into FR and NFR can be improved and how well several machine learning approaches work in this context. We contribute an approach for preprocessing requirements that standardizes and normalizes requirements before applying classification algorithms. Further, we report on how well several existing machine learning methods perform for automated classification of NFRs into sub-categories such as usability, availability, or performance. Our study is performed on 625 requirements provided by the OpenScience tera-PROMISE repository. We found that our preprocessing improved the performance of an existing classification method. We further found significant differences in the performance of approaches such as Latent Dirichlet Allocation, Biterm Topic Modeling, or Naive Bayes for the sub-classification of NFRs.
... By using carefully created benchmark data (say, from quantum mechanics based materials simulations) as the starting point, non-linear associations between atomic configurations and potential energies (or forces, more pertinent to the present contribution) may be learned by induction. [9][10][11] This data-driven paradigm, popularly referred to as machine learning, has been shown by many groups to lead to viable pathways for the creation of interatomic potentials that: (1) surpass conventional interatomic potentials in both accuracy and versatility, (2) surpass quantum mechanical methods in cost (by orders of magnitude), and (3) rival quantum mechanics in accuracy, [12][13][14] at least within the configurational and chemical domains encompassed by the benchmark dataset used in the training of the potential. ...
Force fields developed with machine learning methods in tandem with quantum mechanics are beginning to find merit, given their (i) low cost, (ii) accuracy, and (iii) versatility. Recently, we proposed one such approach, wherein the vectorial force on an atom is computed directly from its environment. Here, we discuss the multi-step workflow required for their construction, which begins with generating diverse reference atomic environments and force data, choosing a numerical representation for the atomic environments, down-selecting a representative training set, and, lastly, the learning method itself, for the case of Al. The constructed force field is then validated by simulating complex materials phenomena such as surface melting and stress-strain behavior, phenomena that go beyond the reach of the underlying quantum mechanical methods in both length and time scales. To make such force fields truly versatile, an attempt to estimate the uncertainty in force predictions is put forth, allowing one to identify areas of poor performance and paving the way for their continual improvement.
... If that is the case, the number of partitions is set equal to the number of instances. In particular, we used the version of the EM algorithm available in the Weka software, version 3.8.5 (Frank et al., 2016). Weka is a collection of machine learning algorithms for data mining tasks. ...
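As a hedged illustration (not the study's code), the EM clustering step can be reproduced in outline with WEKA's Java API; the input file name is a placeholder:

```java
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EMClusteringSketch {
    public static void main(String[] args) throws Exception {
        // Census-derived attributes per municipality (placeholder file).
        Instances data = DataSource.read("municipalities.arff");

        // Weka's EM clusterer; by default the number of clusters is chosen
        // by cross-validation rather than fixed in advance.
        EM em = new EM();
        em.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}
```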
This study aims to classify and describe soybean production systems in the Atlantic Forest biome, using data from the 2017 Agricultural Census obtained through special tabulations. The application of machine learning methods proved to be an effective approach for this purpose. Through cluster analysis, three groups were identified among 1,367 municipalities, based on aspects of production and with an emphasis on the use of technologies. Understanding the diversity of the systems used by farmers is crucial for planning agricultural research, technology transfer, and rural development actions aimed at increasing the efficiency of these systems in the biome. Understanding the differences in producers' soybean production systems helps define more effective strategies for different local contexts and reveals how variations in cultivation practices can be more productive. Furthermore, knowing the particularities of each cluster opens paths for the diffusion of innovations. The article sheds light on the debate about indicators in soybean farming, emphasizing the need to consider perspectives beyond the technological one, such as socio-environmental sustainability.
... For , we apply the threshold shift approach (see Section 3.2) with accuracy as the quality measure. For , we use the shifting Random Forest classifier based on the RF implementation from the WEKA library [FHW16] using the options "-P 70 -I 10 -J 10 -N 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1 -depth 6", where -I, -J, and -N determine the initial, added, and maximum number of trees. The model is bootstrapped using the initial batch of prelabeled pairs from . ...
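The cited shifting Random Forest is a custom extension (the -J flag for added trees is not part of stock WEKA). Purely as a sketch of how a standard WEKA RandomForest is configured through such an option string, with assumed placeholder values:

```java
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestOptionsSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pairs.arff");   // placeholder training data
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        // -P: bag size as a percentage, -I: number of trees, -num-slots: threads,
        // -K: attributes considered per split (0 = default heuristic),
        // -S: random seed, -depth: maximum tree depth.
        rf.setOptions(Utils.splitOptions(
                "-P 70 -I 100 -num-slots 1 -K 0 -S 1 -depth 6"));
        rf.buildClassifier(data);

        System.out.println(rf);
    }
}
```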
Privacy-Preserving Record linkage (PPRL) is an essential component in data integration tasks of sensitive information. The linkage quality determines the usability of combined datasets and (machine learning) applications based on them. We present a novel privacy-preserving protocol that integrates clerical review in PPRL using a multi-layer active learning process. Uncertain match candidates are reviewed on several layers by human and non-human oracles to reduce the amount of disclosed information per record and in total. Predictions are propagated back to update previous layers, resulting in an improved linkage performance for non-reviewed candidates as well. The data owners remain in control of the amount of information they share for each record. Therefore, our approach follows need-to-know and data sovereignty principles. The experimental evaluation on real-world datasets shows considerable linkage quality improvements with limited labeling effort and privacy risks.
The identification and prediction of financial bankruptcy has gained relevance due to its impact on economic and financial stability. This study performs a systematic review of artificial intelligence (AI) models used in bankruptcy prediction, evaluating their performance and relevance using the PRISMA and PICOC frameworks. Traditional models such as random forest, logistic regression, KNN, and neural networks are analyzed, along with advanced techniques such as Extreme Gradient Boosting (XGBoost), convolutional neural networks (CNN), long short-term memory (LSTM), hybrid models, and ensemble methods such as bagging and boosting. The findings highlight that, although traditional models are useful for their simplicity and low computational cost, advanced techniques such as LSTM and XGBoost stand out for their high accuracy, sometimes exceeding 99%. However, these techniques present significant challenges, such as the need for large volumes of data and high computational resources. This paper identifies strengths and limitations of these approaches and analyzes their practical implications, highlighting the superiority of AI in terms of accuracy, timeliness, and early detection compared to traditional financial ratios, which remain essential tools. In conclusion, the review proposes approaches that integrate scalability and practicality, offering predictive solutions tailored to real financial contexts with limited resources.
Mental health is vital to human well-being, and prevention strategies to address mental illness have a significant impact on the burden of disease and quality of life. With the recent developments in body-worn sensors, it is now possible to continuously collect data that can be used to gain insights into mental health states. This has the potential to optimize psychiatric assessment, thereby improving patient experiences and quality of life. However, access to high-quality medical data for research purposes is limited, especially regarding diagnosed psychiatric patients. To this end, we present the OBF-Psychiatric dataset, which comprises motor activity recordings of patients with bipolar and unipolar major depression, schizophrenia, and ADHD (attention deficit hyperactivity disorder). The dataset also contains data from a clinical sample diagnosed with various mood and anxiety disorders, as well as a healthy control group, making it suitable for building machine learning models and other analytical tools. It contains recordings from 162 individuals totalling 1,565 days' worth of motor activity data, with a mean of 9.6 days per individual.
In today's rapidly changing business landscape, firms are increasingly investing in big data to gain a competitive edge, driven by the widespread belief that it is essential for innovation and performance. However, the relationship between big data characteristics and firm performance remains insufficiently understood. This study aims to bridge this gap by investigating the impact of the big data characteristics of velocity, volume, and variety on firm innovation performance, particularly in product innovation, while also analyzing whether innovation performance mediates the relationship between these characteristics and overall firm performance. Grounded in organizational learning theory, the study hypothesizes that big data can improve innovation effectiveness and efficiency, potentially improving firm performance. It challenges, however, the assumption that more data is always better. Using structural equation modeling on data from 239 managers, the study reveals that data velocity and variety significantly improve firm innovation performance, whereas data volume does not. Data velocity emerges as the most critical factor. These findings contribute to the literature by highlighting the importance of distinguishing between different big data characteristics and their distinct effects on innovation and performance. The results offer actionable insights for firms aiming to leverage big data effectively. By emphasizing the significance of data velocity and variety over volume, the study suggests that organizations should focus on developing real-time or near-real-time data processing capabilities and on broadening their data sources to drive innovation. This nuanced understanding challenges the one-size-fits-all approach to big data adoption and provides valuable guidance for firms in allocating resources and setting strategies for big data initiatives. The study's implications extend beyond academic circles, offering practical recommendations for managers and decision-makers in an era in which data-driven innovation is increasingly crucial for maintaining competitiveness. As organizations continue to grapple with the complexities of big data, this research provides a roadmap for more targeted and effective use of data assets, potentially leading to improved innovation outcomes and overall firm performance.
Today, numismatics as a branch of historical research actively uses digital models of content analysis, mathematical-statistical and probabilistic models, artificial intelligence theory, and models from mathematical analysis, analytic geometry, and topology. In addition, the task of digitally recognizing numismatic material is gaining popularity in numismatics, which makes the methods of pattern recognition theory relevant as a contemporary subject of practical application for the development of numismatic research. The aim of the article is to study and systematize the features of the algorithmic implementation of digital methods for recognizing numismatic material. The article analyzes the most popular existing applications for coin recognition, classifies the methods of pattern recognition theory, and examines their algorithmic implementation. This made it possible to draw the following conclusions. First, an analysis of the most widespread implementations of digital recognition of numismatic objects, designed for use on both desktop and mobile devices, showed that their core functionality is aimed at solving the coin recognition problem itself, while additional functionality gives users the opportunity to make a rational choice among different products. Second, the practical implementation of pattern recognition theory, which forms the theoretical basis of all intelligent digital systems equipped with functionality for identifying numismatic objects, relies on the following groups of generalized methods: classical methods, machine learning methods, deep learning methods, and unsupervised learning methods. Third, we formulated the mathematical statement of the pattern recognition problem in its general form and described the algorithmic and mathematical structure of the most popular methods for recognizing numismatic objects: the template method, the nearest-neighbor method, clustering, and convolutional neural networks. Further research may address the algorithmic implementation of other pattern recognition methods for identifying numismatic material, explore code implementations of the described methods, and analyze their performance on various hardware.
This chapter delves into the realm of anomaly detection in Wireless Sensor Networks (WSNs) and the Internet of Things (IoT), emphasizing their pivotal role in bolstering security. Focusing on diverse domains such as healthcare, environmental monitoring, and process industries, the chapter consolidates findings from various studies employing innovative anomaly detection techniques. One notable approach integrates supervised and unsupervised methods for continuous patient monitoring, showcasing successful anomaly detection in physiological variables using an autoencoder and XGBoost algorithm. The survey extends its scope to large-scale environmental sensing systems, where the proposed Anomaly Detection Framework demonstrates effectiveness in detecting emission events. Moreover, the paper explores sustainability initiatives, utilizing contextual anomaly detection in collaboration with Power smiths. The proposed algorithm, validated in simulation environments using historical data, exhibits promising real-time performance. An array of anomaly detection algorithms is presented, addressing challenges in diverse domains. These include a variance-based algorithm for sensor data, BRBAR for handling uncertain sensor data, anomaly detection in medical data, outlier detection in big sensor data, integration of SVM and YASA for activity recognition, density estimation for anomaly detection, and biomedical signal analysis. The survey concludes by highlighting future research directions, emphasizing the importance of addressing challenges in WSNs and IoT, such as resource constraints and collaboration with prevention-based techniques. Ongoing research aims to incorporate data stream mining techniques, apply anomaly detection methods to specific industries, and explore benchmark data selection for comprehensive evaluations. The taxonomy presented in the survey categorizes techniques, models, and architectures, providing a valuable guide for researchers and practitioners navigating the intricate landscape of anomaly detection in sensor systems. Open research inquiries pave the way for future investigations, contributing to the continuous evolution and improvement of anomaly detection methodologies.
Stroke is a leading cause of death and permanent disability, making it a serious global health concern. Cell death is the result of impaired blood flow to the brain. Patient outcomes are ultimately impacted by prompt and precise stroke type identification, which is essential for efficient management and treatment. This project examines the potential of machine learning algorithms to categorize stroke subtypes and forecast the probability of stroke occurrence. A comprehensive dataset comprising clinical features, medical imaging results, and patient demographics was assembled. To ensure compatibility with machine learning algorithms, the dataset was preprocessed by handling missing values and transforming categorical variables into numerical format. Feature selection was performed to identify the most pertinent variables for prediction. Four machine learning algorithms were used: Random Forest (an ensemble learning technique), k-NN (a nearest-neighbor method), J48 (a decision tree algorithm), and Naive Bayes. A 10-fold cross-validation technique was used to thoroughly assess each model's performance, ensuring robust and reliable results. The Random Forest algorithm proved most effective in predicting stroke, achieving the highest accuracy. This finding highlights the potential of machine learning to help medical professionals prevent, diagnose, and treat strokes. The knowledge gathered from this research could improve patient care and guide the creation of individualized treatment programs. This project also highlights the wider application of machine learning in healthcare beyond stroke. By utilizing data analysis and predictive modeling, machine learning has the potential to revolutionize healthcare delivery, resulting in better diagnostics, more individualized treatments, and better patient outcomes.
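As a rough, hedged sketch of this evaluation protocol (not the project's actual code), the four learners and the 10-fold cross-validation can be set up with WEKA's Java API along these lines; the dataset file name is a placeholder and preprocessing and feature selection are omitted:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StrokeModelComparison {
    public static void main(String[] args) throws Exception {
        // Placeholder dataset; the last attribute is assumed to be the class.
        Instances data = DataSource.read("stroke.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] learners = {
            new RandomForest(), new IBk(), new J48(), new NaiveBayes()
        };
        for (Classifier learner : learners) {
            // 10-fold cross-validation, matching the evaluation protocol above.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(learner, data, 10, new Random(1));
            System.out.printf("%-15s accuracy = %.2f%%%n",
                    learner.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```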
The DeMI Interface tool represents an innovative graphical user interface tool rooted in the Decision-Making Integration (DeMI) framework, offering decision makers a holistic approach within Process System Engineering (PSE). This state-of-the-art tool seamlessly combines process network synthesis and machine learning, using the Process Graph (P-graph) and the Waikato Environment for Knowledge Analysis (WEKA) software tools. By incorporating the P-graph methodology into the DeMI framework, the tool constructs a versatile superstructure that is designed to transform municipal solid waste (MSW) into valuable products. Leveraging the accelerated branch-and-bound algorithm, the P-graph efficiently generates optimal structures capable of accommodating diverse complexities, thus ensuring adaptability in synthesising process configurations. The true strength of the DeMI tool lies in its integration with WEKA, where it uses the feasible structures derived from the P-graph as a comprehensive database. By scrutinising correlations between variables within the process structure, DeMI provides invaluable insights into crucial aspects such as total waste weight, profit estimation, and the selection of appropriate waste conversion technologies. The M5P algorithm, adept at profit estimation, establishes correlations between MSW weight and profitability, while the J48 algorithm offers recommendations for suitable waste conversion technologies based on profit potential. With its intuitive and flexible interface, DeMI empowers decision makers to make informed decisions, thus enhancing decision-making in waste management and offering a promising model to address challenges in other industrial processes. In summary, DeMI makes a contribution to the advancement of PSE by offering a systematic approach to informed decision-making, underscoring its significance in optimising various industrial processes.
This entry introduces the topic of machine learning and provides an overview of its relevance for applied linguistics and language learning. The discussion will focus on giving an introduction to the methods and applications of machine learning in applied linguistics, and will provide references for further study.
Reverberation and background noise are common and unavoidable real-world phenomena that hinder automatic speaker recognition systems, particularly because these systems are typically trained on noise-free data. Most models rely on fixed audio feature sets. To evaluate the dependency of features on reverberation and noise, this study proposes augmenting the commonly used mel-frequency cepstral coefficients (MFCCs) with relative spectral (RASTA) features. The performance of these features was assessed using noisy data generated by applying reverberation and pink noise to the DEMoS dataset, which includes 56 speakers. Verification models were trained on clean data using MFCCs, RASTA features, or their combination as inputs. They were then validated on augmented data with progressively increasing noise and reverberation levels. The results indicate that MFCCs struggle to identify the main speaker, while the RASTA method has difficulty with the opposite class. The hybrid feature set, derived from their combination, demonstrates the best overall performance as a compromise between the two. Although the MFCC method is the standard and performs well on clean training data, it shows a significant tendency to misclassify the main speaker in real-world scenarios, which is a critical limitation for modern user-centric verification applications. The hybrid feature set therefore proves effective as a balanced solution, optimizing both sensitivity and specificity.
This paper argues that two commonly used discretization approaches, fixed k-interval discretization and entropy-based discretization, have sub-optimal characteristics for naive-Bayes classification. This analysis leads to a new discretization method, Proportional k-Interval Discretization (PKID), which adjusts the number and size of discretized intervals to the number of training instances, thus seeking an appropriate trade-off between the bias and variance of the probability estimation for naive-Bayes classifiers. We justify PKID in theory, as well as test it on a wide cross-section of datasets. Our experimental results suggest that, in comparison to its alternatives, PKID provides naive-Bayes classifiers with competitive classification performance on smaller datasets and better classification performance on larger datasets.
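A minimal sketch of the proportional idea, under the assumption that with N training instances PKID uses roughly sqrt(N) intervals of roughly sqrt(N) instances each (ties and degenerate cases are ignored here; this is not the authors' implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PKIDSketch {
    // Cut points for one numeric attribute such that both the number of
    // intervals and the instances per interval are roughly sqrt(N).
    static List<Double> cutPoints(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int perInterval = (int) Math.max(1, Math.round(Math.sqrt(n)));
        List<Double> cuts = new ArrayList<>();
        for (int i = perInterval; i < n; i += perInterval) {
            // Place each cut point halfway between neighbouring sorted values.
            cuts.add((sorted[i - 1] + sorted[i]) / 2.0);
        }
        return cuts;
    }

    public static void main(String[] args) {
        double[] attribute = {4.9, 5.0, 5.1, 5.8, 6.3, 6.7, 7.0, 7.2, 7.7};
        System.out.println(cutPoints(attribute));   // prints [5.45, 6.85]
    }
}
```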
Connectionist learning procedures are presented for “sigmoid” and “noisy-OR” varieties of probabilistic belief networks. These networks have previously been seen primarily as a means of representing knowledge derived from experts. Here it is shown that the “Gibbs sampling” simulation procedure for such networks can support maximum-likelihood learning from empirical data through local gradient ascent. This learning procedure resembles that used for “Boltzmann machines”, and like it, allows the use of “hidden” variables to model correlations between visible variables. Due to the directed nature of the connections in a belief network, however, the “negative phase” of Boltzmann machine learning is unnecessary. Experimental results show that, as a result, learning in a sigmoid belief network can be faster than in a Boltzmann machine. These networks have other advantages over Boltzmann machines in pattern classification and decision making applications, are naturally applicable to unsupervised learning problems, and provide a link between work on connectionist learning and work on the representation of expert knowledge.
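For orientation only, and using our own notation rather than the paper's: under the standard sigmoid belief network parameterization, each binary unit $s_i$ is on with probability $\sigma\!\left(\sum_j w_{ij} s_j\right)$ given its parents, and the maximum-likelihood gradient for a visible configuration $\mathbf{v}$ takes a delta-rule-like form,

$$
\frac{\partial}{\partial w_{ij}} \log p(\mathbf{v})
  = \mathbb{E}_{p(\mathbf{h}\mid\mathbf{v})}\!\left[ s_j \left( s_i - \sigma\!\left(\sum\nolimits_k w_{ik} s_k\right) \right) \right],
$$

where the expectation over the hidden units $\mathbf{h}$ is approximated by Gibbs sampling from the posterior; because the network is directed, no separate negative phase (as in Boltzmann machines) is required.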
It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.
If it is to qualify as knowledge, a learner's output should be accurate, stable and comprehensible. Learning multiple models can improve significantly on the accuracy and stability of single models, but at the cost of losing their comprehensibility (when they possess it, as do, for example, simple decision trees and rule sets). This paper proposes and evaluates CMM, a meta-learner that seeks to retain most of the accuracy gains of multiple model approaches, while still producing a single comprehensible model. CMM is based on reapplying the base learner to recover the frontiers implicit in the multiple model ensemble. This is done by giving the base learner a new training set, composed of a large number of examples generated and classified according to the ensemble, plus the original examples. CMM is evaluated using C4.5RULES as the base learner, and bagging as the multiple-model methodology. On 26 benchmark datasets, CMM retains on average 60% of the accuracy gains obtained by bagging ...
This paper is concerned with extending neural networks to multi-instance learning. In multi-instance learning, each example corresponds to a set of tuples in a single relation. Furthermore, examples are classified as positive if at least one tuple (i.e. at least one attribute-value pair) satisfies certain conditions. If none of the tuples satisfy the requirements, the example is classified as negative. We will study how to extend standard neural networks (and backpropagation) to multi-instance learning. It is clear that the multi-instance setting is more expressive than the attribute-value setting, but less expressive than e.g. relational learning or inductive logic programming.