... Since the data is of numeric type, with only the classification attribute being nominal, the data set falls into the category of labeled data [17]. It is therefore necessary to perform supervised data mining on the target data set [17][18]. This narrowed the choice of classifiers down to the few that can handle numeric data as well as assign a classification (from a predefined set of classifications). ...
... The mining started with the C4.5 algorithm, which in WEKA is implemented as the J4.8 classifier [18]. Reduced-error pruning [23][24] was selected, and the cross-validation type was kept at the default value of 10 folds, since this generates a fairly accurate classification. ...
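For readers who want to reproduce this step outside the GUI, the same configuration can be expressed through WEKA's Java API. A minimal sketch, assuming a hypothetical movies.arff file whose last attribute is the class:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file; the class label is assumed to be the last attribute.
        Instances data = DataSource.read("movies.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();               // WEKA's C4.5 implementation
        tree.setReducedErrorPruning(true);  // reduced-error pruning, as in the text

        // Default 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```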
... To use the PART rule generator, which applies the C4.5 algorithm to build partial decision trees and generate explicit rules [18], the rules subclass was selected from the WEKA Classify tab, followed by the PART classifier. As with the decision tree, reduced-error pruning and the default 10-fold cross-validation were selected, generalizing the rules rather than specializing them. ...
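The PART step admits the same programmatic treatment; again a hedged sketch with the same hypothetical data file, also printing the explicit rule list that PART generates:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PartDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("movies.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        PART part = new PART();
        part.setReducedErrorPruning(true);  // same pruning choice as the tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(part, data, 10, new Random(1)); // default 10 folds

        part.buildClassifier(data);  // fit on all data to inspect the rules
        System.out.println(part);    // prints the explicit decision list
        System.out.println(eval.toSummaryString());
    }
}
```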
The abundance of movie data across the internet makes it an obvious candidate for machine learning and knowledge discovery. However, most research is directed toward bi-polar classification of movies or the generation of movie recommendation systems based on reviews given by viewers on various internet sites. Classification of movie popularity based solely on the attributes of a movie (e.g., actor, actress, director rating, language, country, and budget) has been less highlighted due to the large number of attributes associated with each movie and their differences in dimensions. In this paper, we propose a classification scheme for pre-release movie popularity based on inherent attributes using the C4.5 and PART classifier algorithms, and define the relation between attributes of post-release movies using the correlation coefficient.
... Although samples from unlabeled test datasets can often be effectively diagnosed by these methods, most of them are based on the ideal assumption that the training and test datasets are collected from the same part of the same machine and are quite similar in terms of data distribution [21]. Real-world conditions are often very different from ideal ones because obtaining labeled vibration data from critical parts of some machines is hard or even impossible [22]. Furthermore, collecting data samples from each machine and labeling them manually costs a large amount of time and money [23]. ...
Certain progress has been made in fault diagnosis under cross-domain scenarios recently. Most researchers have paid almost all their attention to promoting domain adaptation in a common space. However, several challenges that cause negative transfer have been ignored. In this paper, a reweighting method is proposed to overcome this difficulty from two aspects. First, extracted features differ greatly from one another in promoting positive transfer, and measuring this difference is important. Measured by conditional entropy, the weights of the adversarial losses for well-aligned features are reduced. Second, the balance between domain adaptation and class discrimination greatly influences the transfer task. Here, a dynamic weight strategy is adopted to compute the balance factor, considered from the perspectives of maximum mean discrepancy and multiclass linear discriminant analysis. The first term measures the degree of domain adaptation between the source and target domains, and the second reflects the classification performance of the classifier on the learned features in the current training epoch. Finally, extensive experiments on several bearing fault diagnosis datasets are conducted. The performance shows that our model has an obvious advantage compared with other common transfer algorithms.
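As a rough illustration of the first balance term (maximum mean discrepancy): with a linear kernel, MMD reduces to the squared distance between the domain feature means. The sketch below is a simplification for illustration, not the authors' implementation:

```java
public class LinearMmd {
    // Linear-kernel MMD: squared distance between the mean feature vectors of
    // the source and target batches; a deliberate simplification for illustration.
    static double mmd(double[][] source, double[][] target) {
        int d = source[0].length;
        double[] diff = new double[d];
        for (double[] x : source)
            for (int j = 0; j < d; j++) diff[j] += x[j] / source.length;
        for (double[] x : target)
            for (int j = 0; j < d; j++) diff[j] -= x[j] / target.length;
        double sum = 0;
        for (double v : diff) sum += v * v;
        return sum;
    }
}
```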
... Thus, by assuming the number of floating-point operations to be (n × d), Table 6 reports the operations count for 1-nearest-neighbor search. Second, Witten et al. (2011) estimate that a decision tree's depth is O(log(n)) based on the assumption that the induced tree is "bushy" and binary, where n is the number of training examples. Accordingly, we assume that a prediction using CART's induced decision tree takes exactly log(n) floating-point operations, and have obtained and reported the prediction cost of CART for the data sets of Table 6. ...
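A worked instance of the two estimates, under the stated assumptions (a 1-NN query scans all n training examples of d features; a bushy binary tree has depth about log2(n)); the values of n and d are illustrative, not taken from Table 6:

```java
public class CostSketch {
    public static void main(String[] args) {
        long n = 100_000;  // training examples (illustrative)
        long d = 50;       // features (illustrative)
        long nnCost = n * d;                         // 1-NN: n * d = 5,000,000 flops
        double cartCost = Math.log(n) / Math.log(2); // CART: log2(n) ~ 16.6 operations
        System.out.printf("1-NN: %d, CART: %.1f%n", nnCost, cartCost);
    }
}
```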
This article presents a novel method, called Modular Grammatical Evolution (MGE), toward validating the hypothesis that restricting the solution space of NeuroEvolution to modular and simple neural networks enables the efficient generation of smaller and more structured neural networks while providing acceptable (and in some cases superior) accuracy on large data sets. MGE also enhances the state-of-the-art Grammatical Evolution (GE) methods in two directions. First, MGE's representation is modular in that each individual has a set of genes, and each gene is mapped to a neuron by grammatical rules. Second, the proposed representation mitigates two important drawbacks of GE, namely the low scalability and weak locality of representation, toward generating modular and multilayer networks with a high number of neurons. We define and evaluate five different forms of structures with and without modularity using MGE and find single-layer modules with no coupling more productive. Our experiments demonstrate that modularity helps in finding better neural networks faster. We have validated the proposed method using ten well-known classification benchmarks with different sizes, feature counts, and output class counts. Our experimental results indicate that MGE provides superior accuracy with respect to existing NeuroEvolution methods and returns classifiers that are significantly simpler than other machine learning generated classifiers. Finally, we empirically demonstrate that MGE outperforms other GE methods in terms of locality and scalability properties.
... Researchers tend to focus on the factor combinations prone to landslides (Pourghasemi, Kariminejad, et al. 2020; Yao et al. 2019), which requires methods that extract useful information from large amounts of data quickly and accurately. Data mining is an effective method for extracting knowledge from complicated data (Ouyang et al. 2011; Witten et al. 2011; Sameen et al. 2020). ...
Landslide analysis prevents landslides from threatening residents' safety and property; the predominant method, susceptibility assessment, is cumbersome and time-consuming. The association rule algorithm (ARA) is proposed to mine the correlation between the factors and landslides simply and rapidly. The original ARA cannot reflect the scope of landslides, which is non-negligible for landslide analysis, and is thus improved to mine the frequent secondary-factor combinations (SFCs). First, eight factors are selected using the out-of-bag error and the chi-squared (χ²) test. The accuracy of the factor selection is further verified through landslide susceptibility assessment, predicted using 30% of the study grid data, selected randomly, as the training data. The improved ARA employs the area of historical landslides to mine the frequent SFCs, and the results are then verified by the frequency ratio and the χ² test. It is concluded that the frequent SFCs are (21, 41), (21, 74), (34, 41), (34, 74), (41, 74), (21, 41, 74), and (34, 41, 74), and that the areas with these SFCs need special protection. The present study provides a valuable reference for the primary prevention of landslides.
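For orientation, a plain (unimproved) association rule run can be sketched with WEKA's Apriori implementation; this is a generic baseline, not the paper's area-weighted variant, and landslide_factors.arff is a hypothetical file of nominal factor codes per grid cell:

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file of nominal factor codes per grid cell.
        Instances data = DataSource.read("landslide_factors.arff");
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.1); // minimum support (illustrative)
        apriori.setMinMetric(0.8);            // minimum confidence (illustrative)
        apriori.buildAssociations(data);
        System.out.println(apriori);          // prints the mined rules
    }
}
```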
... The machine learning model can predict a new sample within the normal scope, but the amount of abnormal data is usually not large enough to precisely predict outliers [60]. For instance, in 2019, super-high thermoelectric figures of merit (ZT > 400) were reported [61]. However, this high ZT could only exist at the structural phase transition temperature; because there are insufficient similar data around that temperature, this abnormal value of ZT is difficult to predict using machine learning methods. ...
There has been increasing demand for materials with functional thermal properties, but traditional experiments and simulations are high-cost and time-consuming. The emerging discipline of materials informatics is an effective approach that can accelerate materials development by combining materials science and big-data techniques. Recently, materials informatics has been successfully applied to designing thermal materials, such as thermal interface materials for heat dissipation and thermoelectric materials for power generation. This mini-review summarizes research progress on the prediction and discovery of materials with desirable thermal transport properties using materials informatics. Based on this review of past research, perspectives are discussed and future directions for studying functional thermal materials via materials informatics are given.
A particular application for conformal wearables equipped with inertial sensors and wireless access is the ability to objectively quantify the response to an assortment of deep brain stimulation parameter configurations for the treatment of a movement disorder, such as Parkinson's disease. Four parameters are available for tuning deep brain stimulation: amplitude, frequency, pulse width, and polarity. A summary of the research achieved by applying a series of deep brain stimulation amplitude settings, amplitude being the predominant of the available parameters, for a subject with Parkinson's disease is discussed. The associated acceleration magnitude derived from the conformal wearable's inertial sensor system provides a visual context for contrasting the deep brain stimulation therapeutic response. Intuitively, this inertial sensor signal data can be post-processed using Python for software automation to establish a feature set for machine learning classification. Using the Waikato Environment for Knowledge Analysis (WEKA), a multilayer perceptron neural network is applied to attain considerable classification accuracy in distinguishing between the assortment of deep brain stimulation amplitude settings. As an extension, the J48 decision tree, K-nearest neighbors, support vector machine, logistic regression, and random forest machine learning algorithms are considered in the context of both their classification accuracy and the time to develop the machine learning model. This research is extended to a deep learning algorithm utilizing a convolutional neural network. Additional optimization techniques, such as hyperparameter tuning and modifying the feature set, are discussed. The amalgamation of conformal wearables comprising internalized inertial sensors and wireless characteristics with machine learning is envisioned to substantially impact and augment clinical situational awareness and diagnostic acuity for the therapeutic intervention of deep brain stimulation for treating movement disorders.
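The WEKA multilayer perceptron step described above can be sketched as follows; dbs_features.arff is a hypothetical feature file with the amplitude setting as the class attribute, and the parameter values are illustrative:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DbsMlpDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dbs_features.arff"); // hypothetical
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("a");  // "a" = (attributes + classes) / 2 hidden units
        mlp.setTrainingTime(500);  // training epochs (illustrative)

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```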
Unsupervised domain adaptation has achieved certain success in recent cross-domain fault diagnosis research. As a widely used transfer strategy, distribution alignment often suffers from too few valid alignment samples, low confidence of predicted labels, and inadequate alignment of marginal or conditional distributions. Therefore, this paper proposes a statistical distribution recalibration method of soft labels (SDRS). First, SDRS defines the valid samples and a confusion interval in the statistical distribution of per-class predicted probabilities. Then, from the perspective of binary classification, a recalibration space in the confusion interval is further optimized by a center distance metric to improve predicted confidence and valid distribution alignment. Built on SDRS, a novel cross-domain fault diagnosis approach named SDRS-DAN is constructed, in which dynamic distribution adaptation is used to match and adjust the marginal and conditional distribution discrepancies adaptively. Extensive experiments prove the effectiveness of SDRS-DAN in cross-location and cross-machine scenarios.
In recent decades, eye-movement detection technology has improved significantly, and eye-trackers are available not only as standalone research tools but also as computer peripherals. This rapid spread gives further opportunities to measure the eye movements of participants. The current paper provides classification models for the prediction of food choice and selects the best one. Four choice sets were presented to 112 volunteer participants, each choice set consisting of four different choice tasks, resulting in sixteen choice tasks altogether. The choice sets followed the 2-, 4-, 6- and 8-alternative forced-choice paradigm. A Tobii X2-60 eye-tracker and Tobii Studio software were used to capture and export gazing data, respectively. After variable filtering, thirteen classification models were elaborated and tested, and eight performance parameters were computed. The models were compared based on the performance parameters using the sum of ranking differences algorithm, which ranks and groups the models by comparing the ranks of their performance metrics to a predefined gold standard. Techniques based on decision trees were superior in all cases, regardless of the choice tasks and food product categories. Among the classifiers, Quinlan's C4.5 and cost-sensitive decision trees proved to be the best-performing ones. Future studies should focus on the fine-tuning of these models as well as their applications with mobile eye-trackers.
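The core of the sum of ranking differences comparison is straightforward; a minimal sketch, assuming the per-metric ranks for each model and for the gold standard have already been computed:

```java
public class SrdSketch {
    // Sum of ranking differences for one model: the absolute differences between
    // its per-metric ranks and the gold standard's ranks, summed. Smaller means
    // closer to the reference; tie handling and validation (CRRN) are omitted.
    static int srd(int[] modelRanks, int[] goldRanks) {
        int sum = 0;
        for (int i = 0; i < modelRanks.length; i++)
            sum += Math.abs(modelRanks[i] - goldRanks[i]);
        return sum;
    }
}
```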
Background
Overtraining occurs when an optimization process is applied for too many steps, leading to a model that describes noise in addition to the signal present in the data. This effect may impact typical approaches for species tree reconstruction that use maximum likelihood optimization procedures on a small sample of concatenated genes. In this context, overtraining may result in trees that better describe the specific evolutionary history of the sampled genes rather than the sought evolutionary relationships among the species.
Results
Using a cross-validation-like approach on real and simulated datasets, we showed that overtraining occurs in a significant fraction of cases, leading to species trees that are more distant from a gold-standard reference tree than a previously considered (and rejected) solution in the optimization process. However, we show that the shape of the likelihood curve is informative of the optimal stopping point. As expected, overtraining is aggravated in smaller gene samples and in datasets with increased levels of topological variation among gene trees, but it also occurs in controlled, simulated scenarios where a common underlying topology is enforced.
Conclusions
Overtraining is frequent in species tree reconstruction and leads to a final tree that is worse at describing the evolutionary relationships of the species under study than an earlier (and rejected) solution encountered during the likelihood optimization process. This result should help in developing specific methods for species tree reconstruction in the future, and may improve our understanding of the complexity of tree likelihood landscapes.
The driving frequency, amplitude, and phase difference of the two-phase sinusoidal voltages are the input parameters that influence the speed stability of travelling wave ultrasonic motors (TWUSMs). These parameters are also time-varying due to variations in operating temperature. In addition, a complete mathematical model of the TWUSM has not yet been derived. For these reasons, a machine learning approach is required to determine the compatibility of operating parameters related to the speed stability of TWUSMs. For this purpose, an intelligent decision support tool has been designed for TWUSMs in this study. The input parameters, namely driving frequency, amplitude, phase difference of the two-phase sinusoidal voltages, and operating temperature, were evaluated by the k-nearest neighbor algorithm in the decision support tool. The results show that the proposed tool provides effective results in determining the compatibility of operating parameters related to the speed stability of TWUSMs.
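A minimal sketch of the k-nearest neighbor evaluation at the heart of such a tool, assuming hypothetical four-dimensional parameter vectors (driving frequency, amplitude, phase difference, temperature) and a binary compatibility label:

```java
import java.util.Arrays;
import java.util.Comparator;

public class KnnSketch {
    // Euclidean distance between two operating-parameter vectors.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Majority vote among the k nearest labeled samples (labels: 0 = incompatible,
    // 1 = compatible with stable speed). Feature scaling is omitted for brevity.
    static int classify(double[][] samples, int[] labels, double[] query, int k) {
        Integer[] idx = new Integer[samples.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(samples[i], query)));
        int votes = 0;
        for (int i = 0; i < k; i++) votes += labels[idx[i]];
        return votes * 2 > k ? 1 : 0;
    }
}
```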
The utilization ratio of wind energy, one of the renewable energy sources, has increased by around 25% over the last 15 years. However, parameters such as wind turbine performance and climate features have not been analyzed adequately. Analyzing these parameters requires data mining techniques. In this study, the agglomerative hierarchical clustering method, one of the data mining techniques, is used to analyze the provinces located in the Central Anatolia Region of Turkey in terms of average wind speed. The nearest neighbor algorithm is used as the clustering algorithm, and the Euclidean, Manhattan, and Minkowski distance metrics are used to determine the optimum hierarchical clustering results. The clustering results based on the Euclidean distance metric provide the optimum inferences for the expert compared with the other distance metrics. The region, which is the largest of Turkey's seven regions and contains the country's three largest cities, is clustered according to average wind speed. For the clustering process, the agglomerative hierarchical clustering method is coded in an object-oriented way using the C# software development kit of the Microsoft Visual Studio .NET platform. The distances between the observations in the dataset are determined using the Euclidean, Manhattan, and Minkowski distance metric formulas. The resulting distance values are evaluated in the nearest neighbor algorithm to determine the similarity of the provinces to each other in terms of average wind speed. The analysis results show that the error rate of the Minkowski distance metric is proportionally higher, while the Euclidean and Manhattan distance metrics provide more reliable and appropriate inferences for this dataset.
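The three distance metrics compared in the study are closely related: Minkowski with p = 1 reduces to Manhattan and with p = 2 to Euclidean. A minimal sketch in Java (the study itself used C#):

```java
public class DistanceMetrics {
    // Minkowski distance of order p; p = 1 gives Manhattan, p = 2 gives Euclidean.
    static double minkowski(double[] a, double[] b, double p) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += Math.pow(Math.abs(a[i] - b[i]), p);
        return Math.pow(sum, 1.0 / p);
    }

    static double euclidean(double[] a, double[] b) { return minkowski(a, b, 2); }
    static double manhattan(double[] a, double[] b) { return minkowski(a, b, 1); }
}
```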
Based on our previous efforts, we introduce in this letter a highly efficient MPEG-2 to H.264 transcoder for the baseline profile in the spatial domain. Machine learning tools are used to exploit the correlation between the macroblock (MB) decision of the H.264 video format and the distribution of the motion-compensated residual in MPEG-2. Moreover, a dynamic motion estimation technique is also proposed to further speed up the decision process. We then go a step further on our previous research efforts by combining the two aforementioned speed-up approaches. Our simulation results over more than 40 sequences at common intermediate format and quarter common intermediate format resolutions show that our proposal outperforms the MB mode selection of the rate-distortion optimization option of the H.264 encoder process by reducing the computational requirements by up to 90% while maintaining the same coding efficiency. Finally, we conduct a comparative study of our approach against the most relevant fast inter-prediction methods for MPEG-2 to H.264 transcoding recently reported in the literature.
We investigate whether XCS, a genetic-algorithm-based learning classifier system, can harness information from a user's environment to help desktop applications better personalize themselves to individual users. Specifically, we evaluate XCS's ability to predict user-preferred actions for a calendar and a media player. Results from three real-world user studies indicate that XCS significantly outperforms a decision-tree learner in predicting user preferences for these two desktop interfaces. Our results also show that removing external user-related contextual information degrades XCS's performance. This performance degradation emphasizes the need for desktop applications to access external contextual information to better learn user preferences. Our results highlight the potential of a learning-classifier-system-based approach for personalizing desktop applications to improve the quality of human-computer interaction.