Thesis

Employing Machine Learning and Deep Learning Models for Electricity Theft Detection in Smart Grids (MS Thesis with Source Codes)


Abstract

Electricity theft (ET) is a major problem in developing countries. It affects the economy by causing revenue loss, and it also decreases the reliability and stability of electricity utilities. Due to these losses, the quality of supply suffers and higher tariffs are imposed on legitimate consumers. ET is a major component of non-technical loss (NTL), and it is challenging for electricity utilities to identify the consumers responsible. Several methodologies have been developed to identify ET behavior automatically. However, these approaches mainly assess records of consumers' electricity usage and may prove inadequate in detecting ET due to the variety of theft attacks and the irregularity of consumers' behavior. Moreover, some important challenges still need to be addressed. (i) Many normal consumers are wrongly identified as fraudulent, which leads to a high false-positive rate (FPR); after detection, a costly on-site inspection is needed to confirm whether the flagged consumer is actually fraudulent. (ii) The imbalanced nature of the datasets negatively affects model performance. (iii) Deep learning models often suffer from overfitting and generalization error, so they predict unseen data inaccurately. The motivation of this work is therefore to detect illegal consumers accurately. We propose four artificial intelligence (AI) models in this thesis. In system model 1, we propose Enhanced artificial neural network blocks with skip connections (EANNBS), which makes training easier and reduces overfitting, FPR, generalization error and execution time. A Temporal convolutional network with enhanced multi-layer perceptron (TCN-EMLP) is proposed in system model 2. The TCN analyzes sequential data based on daily electricity-usage records obtained from smart meters, while the EMLP integrates non-sequential auxiliary data, such as electrical connection type, property area and electrical appliance usage. System model 3 is based on a Residual network (RN) that automates feature extraction, while three tree-based classifiers, Decision tree (DT), Random forest (RF) and Adaptive boosting (AdaBoost), are trained on the obtained features for classification. A hyperparameter tuning toolkit, the Hyperactive optimization toolkit, is used in this system model with Bayesian optimization to simplify the tuning of DT, RF and AdaBoost. In system model 4, the input is forwarded to three different and well-known Machine learning (ML) techniques, including Support vector machine (SVM). At this stage, a meta-heuristic algorithm named Simulated annealing (SA) is employed to acquire optimal values for the ML models' hyperparameters. Finally, the ML models' outputs are used as features for meta-classifiers, Light gradient boosting machine (LGBM) and Multi-layer perceptron (MLP), to achieve the final classification. Furthermore, the Pakistan residential electricity consumption (PRECON), State grid corporation of China (SGCC) and Commission for energy regulation (CER) datasets are used in this thesis. The SGCC dataset contains only about 9% fraudulent consumers, far fewer than non-fraudulent consumers, due to the imbalanced nature of the data. Furthermore, many classification techniques have poor predictive accuracy for the positive class because they mainly focus on minimizing the overall error rate while ignoring the minority class.
Many re-sampling techniques are used in the literature to adjust the class ratio; however, these techniques sometimes remove important information that the model needs to learn and can cause overfitting. Using six theft attacks from previous work, we generate theft cases in the original data to mimic real-world theft attacks. We propose combinations of oversampling and under-sampling techniques, namely Near miss borderline synthetic minority oversampling technique (NMB-SMOTE), Tomek link borderline synthetic minority oversampling technique with support vector machine (TBSSVM) and Synthetic minority oversampling technique with near miss (SMOTE-NM), to handle the imbalanced classification problem. We conduct comprehensive experiments using the SGCC, CER and PRECON datasets. The performance of the suggested models is validated using different performance metrics derived from the confusion matrix (CM).
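The exact NMB-SMOTE, TBSSVM and SMOTE-NM pipelines are defined in the thesis itself; purely as an illustration of the underlying idea of chaining oversampling and under-sampling before training a classifier, the following is a minimal sketch using the imbalanced-learn library. The sampler choices, class ratio and placeholder data are illustrative assumptions, not the thesis configuration.

```python
# Minimal sketch: combine borderline SMOTE oversampling with NearMiss or
# Tomek-link under-sampling, in the spirit of the SMOTE-NM / TBSSVM
# combinations described above (the exact thesis pipelines may differ).
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import NearMiss, TomekLinks
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X: consumption features, y: labels (1 = fraudulent, 0 = benign) -- placeholders
X, y = np.random.rand(1000, 48), np.random.binomial(1, 0.09, 1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class near the decision border, then clean the
# majority class with NearMiss (SMOTE-NM-style); TomekLinks() would give the
# Tomek-style variant of the same idea.
pipe = Pipeline(steps=[
    ("smote", BorderlineSMOTE(random_state=0)),
    ("undersample", NearMiss(version=1)),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
print(classification_report(y_te, pipe.predict(X_te)))
```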

Article
Full-text available
In the near future, it is highly expected that smart grid (SG) utilities will replace existing fixed pricing with dynamic pricing, such as the time-of-use real-time tariff (ToU). In ToU, the price of electricity varies throughout the whole day based on the respective utilities’ decisions. We classify the whole day into two periods with very high and low probabilities of theft activities, termed the “theft window” and “non-theft window”, respectively. A “smart” malicious consumer can adjust his/her theft to mostly target the theft window, manipulate actual usage reporting to outsmart existing theft detectors, and achieve the goal of “paying a reduced tariff”. Simulation results show that existing schemes perform poorly at detecting such window-based theft activities that exploit ToU strategies. In this paper, we begin by introducing the core concept of window-based theft cases, which is defined on the basis of ToU pricing as well as consumption usage. A modified extreme gradient boosting (XGBoost) based machine learning (ML) technique called dynamic electricity theft detector (DETD) has been presented to detect this new type of theft case.
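DETD itself modifies XGBoost in ways not reproduced here; as a hedged sketch of an XGBoost-based theft classifier trained on per-window consumption features, the snippet below uses the standard xgboost API with placeholder data and an imbalance-aware scale_pos_weight.

```python
# Rough sketch of an XGBoost-based electricity theft classifier
# (plain XGBoost; DETD's window-specific modifications are not reproduced).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Placeholder data: 24 hourly readings per consumer-day, label 1 = theft.
X = np.random.rand(5000, 24)
y = np.random.binomial(1, 0.1, 5000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.05,
    scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum(),  # handle imbalance
    eval_metric="auc",
)
model.fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```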
Article
Full-text available
Human activity recognition (HAR) is a classification task that involves predicting the movement of a person based on sensor data. There has been huge growth and development of smartphones over the last 10–15 years, and they can be used as a medium of mobile sensing to recognize human activity. Nowadays, deep learning methods are in great demand and we can use those methods to recognize human activity. A common approach is to build a convolutional neural network (CNN). The HAR using Smartphone dataset has been widely used by researchers to develop machine learning models to recognize human activity. The dataset has two parts: training and testing. In this paper, we propose a hybrid approach to analyze and recognize human activity on the same dataset using a deep learning method on a cloud-based platform. We have applied principal component analysis on the dataset to get the most important features. Next, we have executed the experiment for all the features as well as the top 48, 92, 138, and 164 features. We have run all the experiments on Google Colab. In the experiment, for the evaluation of our proposed methodology, datasets are split into two different ratios, 70–10–20% and 80–10–10%, for training, validation, and testing, respectively. We have set the performance of CNN (70% training–10% validation–20% testing) with 48 features as a benchmark for our work. In this work, we have achieved a maximum accuracy of 98.70% with CNN. On the other hand, we have obtained 96.36% accuracy with the top 92 features of the dataset. We can see from the experimental results that if the features are selected properly, then not only can the accuracy be improved, but the training and testing time of the model can also be reduced.
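As a hedged sketch of the pipeline described above (PCA to keep the top components, followed by a small CNN), the snippet below uses scikit-learn and Keras; the feature count of 561, the 92 retained components, the layer sizes and the random data are illustrative stand-ins for the HAR dataset, not the paper's exact setup.

```python
# Sketch: PCA feature reduction followed by a small 1D CNN for HAR
# (feature counts, layer sizes and splits are illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import tensorflow as tf

X = np.random.rand(3000, 561)          # placeholder: 561 smartphone features
y = np.random.randint(0, 6, 3000)      # placeholder: 6 activity classes

X_pca = PCA(n_components=92).fit_transform(X)      # keep top 92 components
X_tr, X_te, y_tr, y_te = train_test_split(X_pca, y, test_size=0.2, random_state=0)

model = tf.keras.Sequential([
    tf.keras.layers.Reshape((92, 1), input_shape=(92,)),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# 0.125 of the remaining 80% gives roughly the 70-10-20 split quoted above.
model.fit(X_tr, y_tr, validation_split=0.125, epochs=10, batch_size=64)
print("test accuracy:", model.evaluate(X_te, y_te, verbose=0)[1])
```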
Article
Full-text available
Electricity theft is one of the main causes of non-technical losses and its detection is important for power distribution companies to avoid revenue loss. The advancement of traditional grids to smart grids allows a two-way flow of information and energy that enables real-time energy management, billing and load surveillance. This infrastructure enables power distribution companies to automate electricity theft detection (ETD) by constructing new innovative data-driven solutions. Whereas, the traditional ETD approaches do not provide acceptable theft detection performance due to high-dimensional imbalanced data, loss of data relationships during feature extraction and the requirement of experts' involvement. Hence, this paper presents a new semi-supervised solution for ETD, which consists of relational denoising autoencoder (RDAE) and attention guided (AG) TripleGAN, named as RDAE-AG-TripleGAN. In this system, RDAE is implemented to derive features and their associations while AG performs feature weighting and dynamically supervises the AG-TripleGAN. As a result, this procedure significantly boosts the ETD. Furthermore, to demonstrate the acceptability of the proposed methodology over conventional approaches, we conducted extensive simulations using the real power consumption data of smart meters. The proposed solution is validated over the most useful and suitable performance indicators: area under the curve, precision, recall, Matthews correlation coefficient, F1-score and precision-recall area under the curve. The simulation results prove that the proposed method efficiently improves the detection of electricity frauds against conventional ETD schemes such as extreme gradient boosting machine and transductive support vector machine. The proposed solution achieves the detection rate of 0.956, which makes it more acceptable for electric utilities than the existing approaches.
Chapter
Full-text available
Computers have brought about significant technological improvements leading to the creation of enormous volumes of data, particularly in health care systems. The availability of vast amounts of data has created a greater need for data mining techniques to produce useful knowledge. With the growth of data in the biomedical and health care communities, accurate analysis of medical data supports early detection of illness and better patient care. Data mining is one of the major approaches for developing sophisticated classification algorithms, although some have castigated data mining for not meeting all of the humanistic statistics specifications [5]. Classification of diseases is one of the main applications of data mining, and many important attempts have been made in recent years to improve the accuracy of disease diagnosis through data mining. We used four prominent data mining algorithms, the Naive Bayes classifier, K-Nearest Neighbors (KNN) classifier, Artificial Neural Networks (ANN) and Support Vector Machine (SVM), to develop predictive models using the ILPD (Indian Liver Patient Data Set) from the UCI Machine Learning repository. For performance comparison, we used 10-fold cross validation to estimate the performance of the predictive models. We find that the support vector machine delivers the best results with a classification accuracy of 74.82 percent, while Naive Bayes performed the worst with 56.55 percent accuracy. The performance metrics of the classifiers on the medical dataset are analyzed in the sections below.
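A minimal sketch of the 10-fold comparison described above, using scikit-learn stand-ins for the four classifiers; the ILPD loading step is replaced by placeholder data, so the scores will not match the paper.

```python
# Sketch: 10-fold cross-validation comparison of four classifiers
# (ILPD loading is a placeholder; scores here will not match the paper).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(583, 10), np.random.binomial(1, 0.3, 583)  # placeholder ILPD

models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "ANN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
    "SVM": SVC(kernel="rbf"),
}
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)       # scale, then classify
    scores = cross_val_score(pipe, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```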
Article
Full-text available
Machine Learning combines different mathematical models, Artificial Intelligence approaches and past recorded data sets. It uses different learning algorithms for different types of data and has been classified into three types. The advantage of this learning is that it uses Artificial Neural Networks and, based on the error rates, adjusts the weights to improve itself in further epochs. However, Machine Learning works well only when the features are defined accurately. Deciding which features to select requires good domain knowledge, which makes Machine Learning dependent on the developer, and a lack of domain knowledge affects performance. This dependency inspired the invention of Deep Learning. Deep Learning can detect features through self-training models and is able to give better results compared to using Artificial Intelligence or Machine Learning alone. It uses different functions such as ReLU, gradient descent and optimizers, which makes it the best approach available so far. To apply such optimizers efficiently, one should understand the mathematical computations and convolutions running behind the layers. It also uses different pooling layers to extract features. However, these modern approaches need a high level of computation, which requires CPUs and GPUs. If such high-computational-power hardware is not available, one can use the Google Colaboratory framework. The Deep Learning approach is shown to improve skin cancer detection, as demonstrated in this paper. The paper also aims to provide the reader with circumstantial knowledge of the various practices mentioned above.
Article
Full-text available
Electricity is widely used by around 80% of the world. Electricity theft has dangerous effects on utilities in terms of power efficiency and costs billions of dollars per annum. The enhancement of traditional grids gave rise to smart grids that enable one to resolve the dilemma of electricity theft detection (ETD) using the extensive amount of data produced by smart meters. These data are used by power utilities to examine the consumption behaviors of consumers and to decide whether a consumer is an electricity thief or benign. However, the traditional data-driven methods for ETD have poor detection performance due to high-dimensional imbalanced data and their limited ETD capability. In this paper, we present a new class balancing mechanism based on the interquartile minority oversampling technique and a combined ETD model to overcome the shortcomings of conventional approaches. The combined ETD model is composed of long short-term memory (LSTM), UNet and adaptive boosting (Adaboost), and is termed LSTM-UNet-Adaboost. In this regard, LSTM-UNet-Adaboost combines the advantages of deep learning (LSTM-UNet) with ensemble learning (Adaboost) for ETD. Moreover, the performance of the proposed LSTM-UNet-Adaboost scheme was simulated and evaluated on the real-time smart meter dataset given by the State Grid Corporation of China. The simulations were conducted using the most appropriate performance indicators, such as area under the curve, precision, recall and F1 measure. The proposed solution obtained the highest results compared to the existing benchmark schemes in terms of the selected performance measures. More specifically, it achieved a detection rate of 0.92, which was the highest among existing benchmark schemes, such as logistic regression, support vector machine and the random under-sampling boosting technique. Therefore, the simulation outcomes validate that the proposed LSTM-UNet-Adaboost model surpasses other traditional methods in terms of ETD and is more acceptable for real-time practice.
Article
Full-text available
Due to the increase in the number of electricity thieves, the electric utilities are facing problems in providing electricity to their consumers in an efficient way. Accurate Electricity Theft Detection (ETD) is quite challenging due to the inaccurate classification on the imbalanced electricity consumption data, the overfitting issues and the High False Positive Rate (FPR) of the existing techniques. Therefore, intensified research is needed to accurately detect the electricity thieves and to recover a huge revenue loss for utility companies. To address the above limitations, this paper presents a new model, which is based on supervised machine learning techniques and real electricity consumption data. Initially, the electricity data are pre-processed using interpolation, the three sigma rule and normalization methods. Since the distribution of labels in the electricity consumption data is imbalanced, an Adasyn algorithm is utilized to address this class imbalance problem. It is used to achieve two objectives. Firstly, it intelligently increases the minority class samples in the data. Secondly, it prevents the model from being biased towards the majority class samples. Afterwards, the balanced data are fed into a Visual Geometry Group (VGG-16) module to detect abnormal patterns in electricity consumption. Finally, a Firefly Algorithm based Extreme Gradient Boosting (FA-XGBoost) technique is exploited for classification. The simulations are conducted to show the performance of our proposed model. Moreover, the state-of-the-art methods are also implemented for comparative analysis, i.e., Support Vector Machine (SVM), Convolutional Neural Network (CNN), and Logistic Regression (LR). For validation, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), Receiver Operating Characteristic Area Under Curve (ROC-AUC), and Precision Recall Area Under Curve (PR-AUC) metrics are used. Firstly, the simulation results show that the proposed Adasyn method has improved the performance of the FA-XGBoost classifier, which has achieved F1-score, precision, and recall of 93.7%, 92.6%, and 97%, respectively. Secondly, the VGG-16 module achieved a higher generalized performance by securing accuracy of 87.2% and 83.5% on training and testing data, respectively. Thirdly, the proposed FA-XGBoost has correctly identified actual electricity thieves, i.e., recall of 97%. Moreover, our model is superior to the other state-of-the-art models in terms of handling large time series data and accurate classification. These models can be efficiently applied by the utility companies using the real electricity consumption data to identify the electricity thieves and overcome the major revenue losses in the power sector.
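The VGG-16 and firefly-tuned XGBoost stages are specific to the paper and are not reproduced; as a hedged sketch of the class-balancing and evaluation steps it describes, the snippet below applies imbalanced-learn's ADASYN and reports MCC, PR-AUC and F1 for a placeholder classifier.

```python
# Sketch: ADASYN oversampling plus MCC / PR-AUC / F1 evaluation
# (placeholder classifier; the paper's VGG-16 and FA-XGBoost stages are omitted).
import numpy as np
from imblearn.over_sampling import ADASYN
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef, average_precision_score, f1_score

X = np.random.rand(4000, 30)                    # placeholder consumption features
y = np.random.binomial(1, 0.08, 4000)           # 1 = theft (minority class)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance only the training split so the test set keeps its natural skew.
X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_tr, y_tr)

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
proba = clf.predict_proba(X_te)[:, 1]
print("MCC   :", matthews_corrcoef(y_te, clf.predict(X_te)))
print("PR-AUC:", average_precision_score(y_te, proba))
print("F1    :", f1_score(y_te, clf.predict(X_te)))
```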
Article
Full-text available
The multi-microgrid (MMG) system is a new approach that concurrently incorporates different types of distributed energy resources, energy storage systems and demand responses to provide reliable and independent electricity for the community. However, the MMG system faces the problems of management, real-time economic operation and control. Therefore, this study proposes an energy management system (EMS) that turns an infinite number of MMGs into a coherent and efficient system, where each MMG can achieve its goals and perspectives. The proposed EMS employs a cooperative game to achieve efficient coordination and operation of the MMG system and also ensures a fair energy cost allocation among members of the coalition. This study considers the energy cost allocation problem when the number of members in the coalition grows exponentially; the problem is solved using a column generation algorithm. The proposed model includes energy storage systems, demand loads, real-time electricity prices and renewable energy. The estimation of the daily operating cost of the MMG using a proposed deep convolutional neural network (CNN) is analyzed in this study, and an optimal scheduling policy to optimize the total daily operating cost of the MMG is also proposed. Besides, other existing optimal scheduling policies, such as approximate dynamic programming (ADP), model predictive control (MPC) and the greedy policy, are considered for comparison. To evaluate the effectiveness of the proposed model, the real-time electricity prices of the Electric Reliability Council of Texas are used. Simulation results show that each MMG can achieve energy cost savings through an MMG coalition. Moreover, the proposed optimal policy method achieves a daily operating cost reduction of up to 87.86%, as compared to 79.52% for the MPC method, 73.94% for the greedy policy method and 79.42% for the ADP method.
Article
Full-text available
Today’s electricity grid is rapidly evolving, with increased penetration of renewable energy sources (RES). Conventional Optimal Power Flow (OPF) has non-linear constraints that make it a highly non-linear, non-convex optimisation problem. This complex problem escalates further with the integration of RES, which are generally intermittent in nature. In this article, an optimal power flow model combines three types of energy resources, including conventional thermal power generators, solar photovoltaic generators (SPGs) and wind power generators (WPGs). Uncertain power outputs from SPGs and WPGs are forecasted with the help of lognormal and Weibull probability distribution functions, respectively. The over- and underestimation of RES output power are considered in the objective function as a reserve cost and a penalty cost, respectively. Furthermore, to reduce carbon emissions, a carbon tax is imposed while formulating the objective function. A grey wolf optimisation technique (GWO) is employed to achieve optimisation in the modified IEEE-30 and IEEE-57 bus test systems to demonstrate its feasibility. Hence, the novel contributions of this work include the new objective functions and associated framework for optimising generation cost while considering RES; and, secondly, computational efficiency is improved by the use of GWO to address the non-convex OPF problem. To investigate the effectiveness of the proposed GWO-based approach, it is compared in simulation to five other nature-inspired global optimisation algorithms and two well-established hybrid algorithms. For the simulation scenarios considered in this article, the GWO outperforms the other algorithms in terms of total cost minimisation and convergence time reduction.
Article
Full-text available
Network attacks are increasing day by day. In order to detect them, a system has been created which actively detects intrusions and attacks in a network or an intranet; such a system is called an intrusion detection system (IDS). Attacks are of two kinds, known and unknown. IDSs are able to protect against known attacks as they are designed specifically for them. As the usage of the Internet grows every day, attacks are increasing as well, and without proper upgrading not all of them are known to an IDS, which is harmful because unknown attacks will not be detected and leave the system open to threats. Therefore, an IDS should not just detect known attacks but also provide security against unknown attacks. Motivated by this, an ensemble-based IDS using XGBoost is presented in this article. There has been previous research on the topic, and with the help of improved technologies it becomes possible to improve the efficiency and accuracy of the ensemble-based IDS. This article presents a scheme showing that using XGBoost within an ensemble-based IDS can provide better results, as XGBoost is based on tree boosting machine learning algorithms, which helps achieve a smoother "bias-variance" trade-off. The experiment is performed on the KDDCup99 dataset and the recorded accuracy of the proposed method through this experiment is 99.95%.
Article
Full-text available
The electrical losses in power systems are divided into non-technical losses (NTLs) and technical losses (TLs). NTL is more harmful than TL because it includes electricity theft, faulty meters and billing errors. It is one of the major concerns in the power system worldwide and incurs a huge revenue loss for utility companies. Electricity theft detection (ETD) is the mechanism used by industry and academia to detect electricity theft. However, due to imbalanced data, overfitting issues and the handling of high-dimensional data, the ETD cannot be applied efficiently. Therefore, this paper proposes a solution to address the above limitations. A long short-term memory (LSTM) technique is applied to detect abnormal patterns in electricity consumption data along with the bat-based random under-sampling boosting (RUSBoost) technique for parameter optimization. Our proposed system model uses the normalization and interpolation methods to pre-process the electricity data. Afterwards, the pre-processed data are fed into the LSTM module for feature extraction. Finally, the selected features are passed to the RUSBoost module for classification. The simulation results show that the proposed solution resolves the issues of data imbalancing, overfitting and the handling of massive time series data. Additionally, the proposed method outperforms the state-of-the-art techniques; i.e., support vector machine (SVM), convolutional neural network (CNN) and logistic regression (LR). Moreover, the F1-score, precision, recall and receiver operating characteristics (ROC) curve metrics are used for the comparative analysis.
Article
Full-text available
Renewable energy sources (RESs) are considered to be reliable and green electric power generation sources. Photovoltaics (PVs) and wind turbines (WTs) are used to provide electricity in remote areas. Optimal sizing of hybrid RESs is a vital challenge in a stand-alone environment. The meta-heuristic algorithms proposed in the past are dependent on algorithm-specific parameters for achieving an optimal solution. This paper proposes a hybrid algorithm of Jaya and a teaching-learning-based optimization (TLBO) named the JLBO algorithm for the optimal unit sizing of a PV-WT-battery hybrid system to satisfy the consumer's load at minimal total annual cost (TAC). The reliability of the system is considered by a maximum allowable loss of power supply probability (LPSPmax) concept. The results obtained from the JLBO algorithm are compared with the original Jaya, TLBO, and genetic algorithms. The JLBO results show superior performance in terms of TAC, and the PV-WT-battery hybrid system is found to be the most economical scenario. This system provides a cost-effective solution for all proposed LPSPmax values as compared with PV-battery and WT-battery systems.
Article
Full-text available
Forecasting in the smart grid (SG) plays a vital role in maintaining the balance between demand and supply of electricity, efficient energy management, better planning of energy generation units and renewable energy sources and their dispatching and scheduling. Existing forecasting models are being used and new models are developed for a wide range of SG applications. These algorithms have hyperparameters which need to be optimized carefully before forecasting. The optimized values of these algorithms increase the forecasting accuracy up to a significant level. In this paper, we present a brief literature review of forecasting models and the optimization methods used to tune their hyperparameters. In addition, we have also discussed the data preprocessing methods. A comparative analysis of these forecasting models, according to their hyperparameter optimization, error methods and preprocessing methods, is also presented. Besides, we have critically analyzed the existing optimization and data preprocessing models and highlighted the important findings. A survey of existing survey papers is also presented and their recency score is computed based on the number of recent papers reviewed in them. By recent, we mean the year in which a survey paper is published and its previous three years. Finally, future research directions are discussed in detail.
Article
Full-text available
In this work, recurrent and linear sequences are studied, exploring the teaching of these numbers with the aid of a computational resource, known as Google Colab. Initially, a brief historical exploration inherent to these sequences is carried out, as well as the construction of the characteristic equation of each one. Thus, their respective roots will be investigated and analyzed, through fractal theory based on Newton's method. For that, Google Colab is used as a technological tool, collaborating to teach Fibonacci, Lucas, Mersenne, Oresme, Jacobsthal, Pell, Leonardo, Padovan, Perrin and Narayana sequences in Brazil and Portugal. It is also possible to notice the similarity of some of these sequences, in addition to relating them with some figures present and their corresponding visualization.
Article
Full-text available
Energy consumption is increasing exponentially with the increase in electronic gadgets. Losses occur during generation, transmission and distribution. The energy demand leads to an increase in electricity theft (ET) on the distribution side. Data analysis is the process of assessing data using different analytical and statistical tools to extract useful information, and fluctuations in energy consumption patterns can indicate electricity theft. Utilities bear losses of millions of dollars every year. Hardware-based solutions are considered to be the best; however, their deployment cost is high. Software-based solutions are data-driven and cost-effective, but they need big data for analysis together with artificial intelligence and machine learning techniques. Several solutions have been proposed in existing studies; however, low detection performance and a high false positive rate are the major issues. In this paper, we employ, for the first time, a bidirectional Gated Recurrent Unit for ET detection and classification using real time-series data. We also propose a new scheme that combines the oversampling technique Synthetic Minority Oversampling TEchnique (SMOTE) and the undersampling technique Tomek Link: the “Smote Over Sampling Tomik Link (SOSTLink) sampling technique”. Kernel Principal Component Analysis is used for feature extraction. In order to evaluate the proposed model’s performance, five performance metrics are used, including precision, recall, F1-score, Root Mean Square Error (RMSE), and the receiver operating characteristic curve. Experiments show that our proposed model outperforms the state-of-the-art techniques: logistic regression, decision tree, random forest, support vector machine, convolutional neural network, long short-term memory, and a hybrid of multilayer perceptron and convolutional neural network.
Conference Paper
Full-text available
Electricity theft is the criminal practice of stealing electricity. In a country like Pakistan, where consumption is greater than production, electricity theft can be hazardous for the economy. During the year 2017–18, there was a loss of 53 billion Rs. to the economy due to electricity theft. A novel system for the detection of electricity theft is designed. The dataset provided by the State Grid Corporation of China (SGCC) was used, which contained two classes, i.e., normal and theft, and comprised data collected over 1,035 days, including various missing and erroneous values. Interpolation was used as a preprocessing technique to fill the missing values, and empirical mode decomposition was employed to break down the signal. After that, features were extracted from the signals of both classes, and after a number of experiments, combinations of features were found that gave maximum accuracy. A K-nearest neighbors (KNN) classifier was used because it is very fast and simple. The system was able to detect electricity theft with an accuracy of 91.0%. The system is very reliable and can be helpful in reducing the losses due to electricity theft. It is also easy to use and cost-efficient.
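The following is a minimal sketch of the preprocessing-plus-KNN idea (interpolating missing readings, then classifying with K-nearest neighbors); the empirical mode decomposition step and the SGCC loading are not reproduced, and the data below is a random placeholder.

```python
# Sketch: fill missing daily readings by interpolation, then classify with KNN
# (EMD-based signal decomposition from the paper is not reproduced here).
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder: 1,035 daily readings per consumer with some missing values.
rng = np.random.default_rng(0)
data = rng.random((500, 1035))
data[rng.random(data.shape) < 0.05] = np.nan
labels = rng.binomial(1, 0.1, 500)              # 1 = theft

# Linear interpolation along the time axis to recover missing readings.
X = pd.DataFrame(data).interpolate(axis=1, limit_direction="both").to_numpy()

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, knn.predict(X_te)))
```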
Article
Full-text available
Effective detection of electricity theft is essential to maintain power system reliability. With the development of smart grids, traditional electricity theft detection technologies have become ineffective at dealing with the increasingly complex data on the users’ side. To improve the auditing efficiency of grid enterprises, a new electricity theft detection method based on an improved synthetic minority oversampling technique (SMOTE) and an improved random forest (RF) is proposed in this paper. The data of normal and electricity theft users were classified as positive data (PD) and negative data (ND), respectively. In practice, the number of ND was far less than PD, which made the dataset composed of these two types of data unbalanced. An improved SMOTE based on the K-means clustering algorithm (K-SMOTE) was first presented to balance the dataset. The cluster center of the ND was determined by the K-means method. Then, the ND were interpolated by SMOTE on the basis of the cluster center to balance the entire dataset. Finally, the RF classifier was trained with the balanced dataset, and the optimal number of decision trees in the RF was decided according to the convergence of the out-of-bag data error (OOB error). Electricity theft behaviors on the user side were detected by the trained RF classifier.
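The paper's K-SMOTE interpolates new minority samples around K-means cluster centers; imbalanced-learn ships a related KMeansSMOTE sampler, so a hedged approximation of the pipeline (balance the data, then choose the number of trees by out-of-bag error) might look like the sketch below. The data, threshold and tree counts are placeholders, not the paper's settings.

```python
# Hedged approximation: K-means-guided SMOTE balancing, then a random forest
# whose tree count is chosen by out-of-bag (OOB) error.
import numpy as np
from imblearn.over_sampling import KMeansSMOTE
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(3000, 40)                    # placeholder features
y = np.random.binomial(1, 0.1, 3000)            # 1 = electricity theft

X_bal, y_bal = KMeansSMOTE(random_state=0,
                           cluster_balance_threshold=0.05).fit_resample(X, y)

best_n, best_oob = None, 1.0
for n_trees in (50, 100, 200, 400):
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=0).fit(X_bal, y_bal)
    oob_error = 1.0 - rf.oob_score_             # OOB error for this tree count
    if oob_error < best_oob:
        best_n, best_oob = n_trees, oob_error
print(f"chosen n_estimators={best_n}, OOB error={best_oob:.4f}")
```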
Conference Paper
Full-text available
Due to the increase in electronic appliances, electricity is becoming a basic necessity of life. Consumption of electricity depends on various factors like temperature, wind, humidity, weekends, working days and season. In electricity load forecasting, many researchers perform data analysis on electricity data provided by utilities to extract meaningful information. The Smart Grid (SG) is a power supply network which allows consumers to monitor their energy usage. It integrates different components of electricity, such as a variety of operations, smart appliances, data collected from smart meters and efficient energy sources. To reduce the consumption of electricity, accurate prediction is compulsory. A good forecasting model makes acceptable use of all characteristics of electric load data and also reduces the dimensionality of that data. Various machine learning techniques have been proposed for load and price forecasting in the literature. In this research, we present a survey of different short-term electricity forecasting techniques for price and load. We broadly categorize the different techniques into traditional machine learning and deep learning techniques.
Conference Paper
Full-text available
Smart grid (SG) is bringing revolutionary changes in the electric power system. SG is supposed to provide economic, social, and environmental benefits for many stakeholders. A smart meter is an essential part of the SG. Data acquisition, transmission, processing, and interpretation are factors to determine the success of smart meters due to the excess amount of data in the grid. Electricity price and load are considered the most influential factors in the energy management system. Moreover, electricity price and load forecasting performed through data analytics give future trends and patterns of consumption. The energy market trade is based on price forecasting. Accurate forecasting of electricity price and load improves the reliability and management of electricity market operations. The aim of this paper is to explore the state of the art proposed for price and load forecasting in terms of their performance for reliable and efficient smart energy management systems.
Conference Paper
Full-text available
High price fluctuations have a direct impact on the electricity market. Thus, accurate and plausible price forecasts have been implemented to mitigate the consequences of price dynamics. This paper proposes two techniques to deal with the Electricity Price Forecasting (EPF) problem. Firstly, a Convolutional Neural Network (CNN) model is used for price prediction. Secondly, a principal component analysis model is used for feature extraction. We have conducted simulations to prove the effectiveness of the proposed approach, which show that the CNN-based approach outperforms the multilayer perceptron model.
Conference Paper
Full-text available
The traditional grid is moving toward the Smart Grid (SG). In traditional grids, electricity was wasted in generation, transmission and distribution; the SG is introduced to solve these issues. In smart grids, how to utilize massive smart meter data in order to improve and promote the efficiency and viability of both the generation and demand side is a compelling issue. A good forecasting model makes acceptable use of all characteristics of the electric load data and also reduces the dimensionality of that data. Many data-driven methods have been proposed in the literature for load forecasting. In this paper, an EEMD-based ECNN model is proposed to forecast electricity load using AEMO data. The results show that ECNN outperforms benchmark methods, especially by applying EEMD for decomposition and DAE for feature extraction.
Conference Paper
Full-text available
The conventional grid is moving towards the Smart Grid (SG). In conventional grids, electricity is wasted in generation, transmission and distribution, and communication is in one direction only; the SG is introduced to solve these issues. In the SG there are no such restrictions, and communication is bi-directional. Electricity forecasting plays a significant role in the SG to reduce operational cost and support efficient management. Load and price forecasting gives future trends. In the literature, many data-driven methods have been discussed for price and load forecasting. The objective of this paper is to focus on the literature related to price and load forecasting over the last four years. The author classifies each paper in terms of its problems and solutions. Additionally, a comparison of each proposed technique regarding performance is presented in this paper. Lastly, the papers' limitations and future challenges are discussed.
Article
Full-text available
Unlike the existing research that focuses on detecting electricity theft cyber-attacks in the consumption domain, this paper investigates electricity thefts at the distributed generation (DG) domain. In this attack, malicious customers hack into the smart meters monitoring their renewable-based DG units and manipulate their readings to claim higher supplied energy to the grid and hence falsely overcharge the utility company. Deep machine learning is investigated to detect such a malicious behavior. We aim to answer three main questions in this paper: a) What are the cyber-attack functions that can be applied by malicious customers to the generation data in order to falsely overcharge the utility company? b) What sources of data can be used in order to detect these cyber-attacks by the utility company? c) Which deep machine learning-model should be used in order to detect these cyber-attacks? Our investigation revealed that integrating various data from the DG smart meters, meteorological reports, and SCADA metering points in the training of a deep convolutional-recurrent neural network offers the highest detection rate (99.3%) and lowest false alarm (0.22%).
Article
Full-text available
Class imbalance is a common issue in many applications such as medical diagnosis, fraud detection, web advertising, etc. Although standard deep learning method has achieved remarkably high-performance on datasets with balanced classes, its ability to classify imbalanced dataset is still limited. This paper proposes a novel end-to-end deep neural network architecture and adopts Gumbel distribution as an activation function in neural networks for class imbalance problem in the application of binary classification. Our proposed architecture, named GEV-NN, consists of three components: the first component serves to score input variables to determine a set of suitable input, the second component is an auto-encoder that learns efficient explanatory features for the minority class, and in the last component, the combination of the scored input and extracted features are then used to make the final prediction. We jointly optimize these components in an end-to-end training. Extensive experiments using real-world imbalanced datasets showed that GEV-NN significantly outperforms the state-of-the-art baselines by around 2% at most. In addition, the GEV-NN gives a beneficial advantage to interpret variable importance. We find key risk factors for hypertension, which are consistent with other scientific researches, using the first component of GEV-NN.
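GEV-NN's full three-component architecture is beyond a short snippet, but the Gumbel-distribution activation it adopts is easy to illustrate: the sketch below uses the standard Gumbel CDF, exp(-exp(-x)), as the output activation of a small binary classifier in Keras. The network shape and feature size are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch: using the Gumbel (GEV) CDF, exp(-exp(-x)), as the output
# activation of a binary classifier instead of the usual sigmoid.
import tensorflow as tf

def gumbel_activation(x):
    # CDF of the standard Gumbel distribution; maps logits into (0, 1)
    # with an asymmetric shape that can suit skewed (imbalanced) labels.
    return tf.exp(-tf.exp(-x))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30,)),          # placeholder feature size
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation=gumbel_activation),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.summary()
```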
Article
Full-text available
Day-ahead electricity pricing is an important strategy for electricity providers to improve grid stability through load scheduling. In this paper, we investigate a general framework for modelling electricity retail pricing based on load demand and market price information. Without any a priori knowledge, we have considered a finite time approach with dynamic system inputs. Our objective is to minimize the average system cost and rebound peaks through energy procurement price, load scheduling and renewable energy source (RES) integration. Initially, the energy consumption cost is calculated based on market clearing price and scheduled load. Then, through reformulation and subsequent modification of optimization problem, we utilize a day-ahead price information to construct individualized price profiles for each user, respectively. To analyse the applicability of proposed pricing policy, analytical solution is obtained which is further validated through comparison with solution obtained from genetic algorithm (GA). From results, it is observed that proposed price policy is non-discriminatory in nature and each user obtained a fair electricity tariff rather than a day-ahead price, which is based on load demand and consumption variation of other users. We also show that optimization problem is sequentially solved with bounded performance guarantee and asymptotic optimality. Finally, simulations are carried in different scenarios; aggregated load and market price, and aggregated load, individualized load, market price and proposed price. Results reveal that our proposed mechanism can charge the price to each user with 23.77% decrease or 5.12% increase based on system requirements.
Article
Full-text available
Over the last decades, load forecasting is used by power companies to balance energy demand and supply. Among the several load forecasting methods, medium-term load forecasting is necessary for grid’s maintenance planning, settings of electricity prices, and harmonizing energy sharing arrangement. The forecasting of the month ahead electrical loads provides the information required for the interchange of energy among power companies. For accurate load forecasting, this paper proposes a model for medium-term load forecasting that uses hourly electrical load and temperature data to predict month ahead hourly electrical loads. For data preprocessing, modified entropy mutual information-based feature selection is used. It eliminates the redundancy and irrelevancy of features from the data. We employ the conditional restricted Boltzmann machine (CRBM) for the load forecasting. A meta-heuristic optimization algorithm Jaya is used to improve the CRBM’s accuracy rate and convergence. In addition, the consumers’ dynamic consumption behaviors are also investigated using a discrete-time Markov chain and an adaptive k-means is used to group their behaviors into clusters. We evaluated the proposed model using GEFCom2012 US utility dataset. Simulation results confirm that the proposed model achieves better accuracy, fast convergence, and low execution time as compared to other existing models in the literature.
Article
Full-text available
With the ever-growing demand of electric power, it is quite challenging to detect and prevent Non-Technical Loss (NTL) in power industries. NTL is committed by meter bypassing, hooking from the main lines, reversing and tampering the meters. Manual on-site checking and reporting of NTL remains an unattractive strategy due to the required manpower and associated cost. The use of machine learning classifiers has been an attractive option for NTL detection. It enhances data-oriented analysis and high hit ratio along with less cost and manpower requirements. However, there is still a need to explore the results across multiple types of classifiers on a real-world dataset. This paper considers a real dataset from a power supply company in Pakistan to identify NTL. We have evaluated 15 existing machine learning classifiers across 9 types which also include the recently developed CatBoost, LGBoost and XGBoost classifiers. Our work is validated using extensive simulations. Results elucidate that ensemble methods and Artificial Neural Network (ANN) outperform the other types of classifiers for NTL detection in our real dataset. Moreover, we have also derived a procedure to identify the top-14 features out of a total of 71 features, which are contributing 77% in predicting NTL. We conclude that including more features beyond this threshold does not improve performance and thus limiting to the selected feature set reduces the computation time required by the classifiers. Last but not least, the paper also analyzes the results of the classifiers with respect to their types, which has opened a new area of research in NTL detection.
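The paper derives its own procedure for identifying the top-14 of 71 features; as a hedged sketch of the general idea (rank features by importance and keep the smallest subset reaching a cumulative-contribution threshold), the snippet below uses a random forest's impurity-based importances with placeholder data and the 77% figure quoted in the abstract.

```python
# Sketch: rank features by importance and keep the smallest subset whose
# cumulative importance reaches a chosen threshold (77% as in the abstract).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(2000, 71)                     # placeholder: 71 NTL features
y = np.random.binomial(1, 0.2, 2000)             # 1 = NTL case

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]      # most important first
cumulative = np.cumsum(rf.feature_importances_[order])

k = int(np.searchsorted(cumulative, 0.77) + 1)         # smallest subset >= 77%
selected = order[:k]
print(f"{k} features cover {cumulative[k - 1]:.2%} of total importance")
X_reduced = X[:, selected]                             # use for retraining
```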
Article
Full-text available
An increase in the world's population results in high energy demand, which is mostly fulfilled by consuming fossil fuels (FFs). By nature, FFs are scarce, depleted, and non-eco-friendly. Renewable energy sources (RESs) photovoltaics (PVs) and wind turbines (WTs) are emerging alternatives to the FFs. The integration of an energy storage system with these sources provides promising and economical results to satisfy the user's load in a stand-alone environment. Due to the intermittent nature of RESs, their optimal sizing is a vital challenge when considering cost and reliability parameters. In this paper, three meta-heuristic algorithms: teaching-learning based optimization (TLBO), enhanced differential evolution (EDE), and the salp swarm algorithm (SSA), along with two hybrid schemes (TLBO + EDE and TLBO + SSA) called enhanced evolutionary sizing algorithms (EESAs) are proposed for solving the unit sizing problem of hybrid RESs in a stand-alone environment. The objective of this work is to minimize the user's total annual cost (TAC). The reliability is considered via the maximum allowable loss of power supply probability (LPSP max) concept. The simulation results reveal that EESAs provide better results in terms of TAC minimization as compared to other algorithms at four LPSP max values of 0%, 0.5%, 1%, and 3%, respectively, for a PV-WT-battery hybrid system. Further, the PV-WT-battery hybrid system is found as the most economical scenario when it is compared to PV-battery and WT-battery systems.
Article
Full-text available
Distance-based algorithms are widely used for data classification problems. The k-nearest neighbour classification (k-NN) is one of the most popular distance-based algorithms. This classification is based on measuring the distances between the test sample and the training samples to determine the final classification output. The traditional k-NN classifier works naturally with numerical data. The main objective of this paper is to investigate the performance of k-NN on heterogeneous datasets, where data can be described as a mixture of numerical and categorical features. For the sake of simplicity, this work considers only one type of categorical data, which is binary data. In this paper, several similarity measures have been defined based on a combination between well-known distances for both numerical and binary data, and to investigate k-NN performances for classifying such heterogeneous data sets. The experiments used six heterogeneous datasets from different domains and two categories of measures. Experimental results showed that the proposed measures performed better for heterogeneous data than Euclidean distance, and that the challenges raised by the nature of heterogeneous data need personalised similarity measures adapted to the data characteristics.
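A hedged sketch of the idea of a combined similarity measure for heterogeneous data: Euclidean distance on the numerical columns plus a Hamming-style distance on the binary columns, passed to scikit-learn's KNeighborsClassifier as a custom metric. The column split, the unweighted sum of the two parts and the data are illustrative assumptions, not the measures defined in the paper.

```python
# Sketch: k-NN with a combined similarity measure for heterogeneous data --
# Euclidean on numerical columns plus Hamming on binary columns.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

NUM_COLS = slice(0, 5)      # illustrative: first 5 columns numerical
BIN_COLS = slice(5, 10)     # illustrative: last 5 columns binary

def mixed_distance(a, b):
    d_num = np.linalg.norm(a[NUM_COLS] - b[NUM_COLS])        # Euclidean part
    d_bin = np.mean(a[BIN_COLS] != b[BIN_COLS])              # Hamming part
    return d_num + d_bin                                     # simple unweighted sum

X_num = np.random.rand(300, 5)
X_bin = np.random.randint(0, 2, (300, 5)).astype(float)
X = np.hstack([X_num, X_bin])
y = np.random.randint(0, 2, 300)

knn = KNeighborsClassifier(n_neighbors=5, metric=mixed_distance)
print("CV accuracy:", cross_val_score(knn, X, y, cv=5).mean())
```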
Article
Full-text available
The Smart Grid (SG) plays a vital role in the modern electricity grid. The data is increasing with the drastic increase in the number of users, and an efficient technology is required to handle this dramatic growth of data. Cloud computing is therefore used to store the data and to provide numerous services to consumers. There are various cloud Data Centers (DCs), which deal with the requests coming from consumers. However, there is a chance of delay due to the large geographical distance between cloud and consumer, so the concept of fog computing is presented to minimize the delay and to maximize efficiency. However, the issue of load balancing arises as the number of consumers and the services provided by the fog grow, so an enhanced mechanism is required to balance the load of the fog. In this paper, a three-layered architecture comprising cloud, fog and consumer layers is proposed. A meta-heuristic algorithm, Improved Particle Swarm Optimization with Levy Walk (IPSOLW), is proposed to balance the load of the fog. Consumers send requests to the fog servers, which then provide services. Further, the cloud is deployed to save the records of all consumers and to provide services to the consumers if the fog layer fails. The proposed algorithm is then compared with existing algorithms: genetic algorithm, particle swarm optimization, binary PSO, cuckoo with levy walk and BAT. Further, service broker policies are used for efficient selection of DCs. The service broker policies used in this paper are: closest data center, optimize response time, reconfigure dynamically with load, and a new advanced service broker policy. Moreover, response time and processing time are minimized. IPSOLW outperformed its counterpart algorithms with almost 4.89% better results.
Article
Full-text available
Smart city adoption and deployment has taken the centre stage worldwide with its realisation clearly hinged on energy efficiency, but its planning is threatened by the vulnerability of smart grids (SGs). Adversaries launch attacks with various motives, but the rampaging electricity theft menace is causing major concerns to SGs deployments and consequently, energy efficiency. Smart electricity meters (SEMs) deployments via the advanced metering infrastructure (AMI) present promising solutions and even greater potential as it provides adequate data for analytical inferences to achieving proactive measures against various cyber-attacks. This study suggests the sources of threats as the first step of such proactive measures of curbing electricity thefts. It provides a framework for monitoring, identifying and curbing the threats based on factors indicative of electricity thefts in a smart utility network. The proposed framework basically focuses on these symptoms of the identified threats indicative of possible electricity theft occurrence to decide on preventing thefts. This study gives a useful background to smart city planners in realising a more reliable, robust and secured energy management scheme required for a sustainable city.
Article
Full-text available
Recently, power systems are facing the challenges of growing power demand, depleting fossil fuel and aggravating environmental pollution (caused by carbon emission from fossil fuel based power generation). The incorporation of alternative low carbon energy generation, i.e., Renewable Energy Sources (RESs), become crucial for energy systems. Effective Demand Side Management (DSM) and RES incorporation enable power systems to maintain demand, supply balance and optimize energy in an environment friendly manner. The wind power is a popular energy source because of its environmental and economical benefits. However, the uncertainty of wind power, makes its incorporation in energy systems really difficult. To mitigate the risk of demand-supply imbalance, an accurate estimation of wind power is essential. Recognizing this challenging task, an efficient deep learning based prediction model is proposed for wind power forecasting. The proposed model has two stages. In the first stage, Wavelet Packet Transform (WPT) is used to decompose the past wind power signals. Other than decomposed signals and lagged wind power, multiple exogenous inputs (such as, calendar variable and Numerical Weather Prediction (NWP)) are also used as input to forecast wind power. In the second stage, a new prediction model Efficient Deep Convolution Neural Network (EDCNN) is employed to forecast wind power. A DSM scheme is formulated based on forecasted wind power, day-ahead demand and price. The proposed forecasting model's performance is evaluated on big data of Maine wind farm ISO NE, USA.
Article
Full-text available
As one of the major factors in nontechnical losses (NTLs) in distribution networks, electricity theft causes significant harm to power grids, influencing power supply quality and reducing operating profits. In order to help utility companies solve the problems of inefficient electricity inspection and irregular power consumption, a novel hybrid convolutional neural network-random forest (CNN-RF) model for automatic electricity theft detection is presented in this paper. In this model, a convolutional neural network (CNN) is first designed to learn the features between different hours of the day and different days from massive and varying smart meter data through convolution and downsampling operations. In addition, a dropout layer is added to reduce the risk of overfitting, and the backpropagation algorithm is applied to update network parameters in the training phase. Then, the random forest (RF) is trained on the obtained features to detect whether a consumer steals electricity. To build the RF in the hybrid model, a grid search algorithm is adopted to determine the optimal parameters. Finally, experiments are conducted on real energy consumption data, and the results show that the proposed detection model outperforms other methods in terms of accuracy and efficiency.
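A hedged sketch of the CNN-RF idea: train a small 1D CNN on load curves, reuse its penultimate layer as a feature extractor, and fit a grid-searched random forest on the extracted features. The architecture, grid values and weekly-load input shape are illustrative, not the paper's configuration.

```python
# Sketch of a CNN-RF hybrid: a CNN learns features from load curves, a
# grid-searched random forest classifies on the extracted features.
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.random.rand(2000, 168, 1)                 # placeholder: weekly hourly loads
y = np.random.binomial(1, 0.1, 2000)             # 1 = electricity theft
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(168, 1)),
    tf.keras.layers.Conv1D(32, 5, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.3),                 # dropout to curb overfitting
    tf.keras.layers.Dense(64, activation="relu", name="features"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy")
cnn.fit(X_tr, y_tr, epochs=5, batch_size=64, verbose=0)

# Reuse the penultimate layer as a feature extractor for the random forest.
extractor = tf.keras.Model(cnn.input, cnn.get_layer("features").output)
F_tr, F_te = extractor.predict(X_tr), extractor.predict(X_te)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=3)
grid.fit(F_tr, y_tr)
print("test accuracy:", grid.score(F_te, y_te))
```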
Conference Paper
Electricity theft can be considered as a Non-Technical Loss (NTL) in smart grids which is very harmful to the power system. Electricity Theft Detection (ETD) is a procedure to detect atypical behaviors in smart grids which can be achieved via the massive amount of data that is generated by these networks due to using smart meter tools and Information and Communications Technology (ICT). Since the existing methods are not exceptionally robust to detect this type of attack, also considering the strength of the convolutional neural network (CNN), an Ensemble Deep Convolutional Neural Network (EDCNN) algorithm for ETD in smart grids has been proposed. As the first layer of the model, a random under bagging technique is applied to deal with the imbalance data, then deep CNNs are utilized on each subset and finally, a voting system is embedded as the last part. This study has been conducted on a dataset which contains consumption information of more than 42,000 customers over a period of 24 months. Various performance parameters containing AUC, precision, recall, f1-score and accuracy have been reported as the results.
Conference Paper
In the smart grid scenario, monitoring and automation of the distribution system are required up to the low voltage level of the electric system. An option to monitor the low voltage grids is to use measurements collected from the end-user smart meters. Even in this case, a still open issue is the detection and identification of possible bad data, which cannot be easily achieved with traditional methods due to the low measurement redundancy. At low voltage level, a particular case of bad data is given by possible electricity thefts, which clearly need to be detected both to prevent revenue losses and to avoid an incorrect operation of the monitoring system. This paper deals with this topic and presents an approach to detect and identify smart meter bad data associated to electricity thefts. The proposed approach allows identifying the source of the bad data via the time series analysis of the measurement residuals obtained during the low voltage grid state estimation process. The investigations and tests carried out in this work show that the proposed method can be an effective way to discover electricity thefts in the grid also in cases where traditional methods fail.
Article
The stealthy false data injection (FDI) attacks in smart grids can bypass the bad data detection, and thus make an incorrect state estimate in the control center. In this short paper, a distributed data-driven intrusion detection approach is proposed to reveal the existence of the sparse stealthy FDI attack in a multi-area interconnected power system. The proposed distributed intrusion detection approach avoids the over-fitting issue that is extensively seen when implementing machine learning algorithms for large-scale systems. Firstly, each area estimates the entire system state based on a distributed state estimation algorithm. Then, the state of each local area is used as trained neural network input to detect the stealthy FDI attacks. Simulation results on the IEEE 118-bus system verify that the proposed method not only reduces the risk of over-fitting, but also can locate the areas which have been attacked.
Article
Load and price prediction are two critical tasks in power system planning and operation. Most recent works in this area forecast the load and price signals separately; however, in a smart grid, customers can react to price changes by shifting their electricity usage from expensive hours to cheaper ones. The load and price signals are therefore strongly coupled, which makes separate prediction models ineffective. In this research, a combined prediction approach is proposed that considers the load and price signals simultaneously. The method works as a multi-input multi-output (MIMO) model based on a least squares support vector machine (LSSVM) forecast engine. Furthermore, a dyadic wavelet transform (DWT) is used to decompose the original signal into smaller sub-signals. In addition, a modified mutual information (MMI) filter is used to select the best candidate inputs for the forecast engine. The learning stage is also coupled with a novel optimization algorithm based on the gravitational search algorithm (GSA), called the modified GSA (MGSA). Finally, forecasting errors such as the average mean absolute percentage error and the error variance are used to compare the performance of the forecasting approaches, with different markets considered as test cases to show the efficiency of the suggested approach.
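The decompose-then-forecast idea can be sketched roughly in Python: a wavelet transform removes the finest detail from each signal, lagged values of the denoised load and price form a multi-input feature matrix, and a kernel regressor predicts both signals jointly (MIMO). Kernel ridge regression stands in for LSSVM here, and the MMI input filter and MGSA tuning are omitted; all values are illustrative assumptions.

import numpy as np
import pywt
from sklearn.kernel_ridge import KernelRidge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
hours = np.arange(2000)
load = np.sin(2 * np.pi * hours / 24) + 0.1 * rng.standard_normal(hours.size)
price = 0.5 * load + 0.1 * rng.standard_normal(hours.size)

def denoise(signal, wavelet="db4", level=3):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    coeffs[-1] = np.zeros_like(coeffs[-1])          # drop the finest detail sub-signal
    return pywt.waverec(coeffs, wavelet)[:len(signal)]

load_d, price_d = denoise(load), denoise(price)

# Lagged, denoised load and price form the multi-input feature matrix.
lags = 24
rows = range(lags, len(hours) - 1)
X = np.array([np.r_[load_d[t - lags:t], price_d[t - lags:t]] for t in rows])
Y = np.array([[load[t + 1], price[t + 1]] for t in rows])   # next-hour load and price

model = MultiOutputRegressor(KernelRidge(kernel="rbf", alpha=1.0))
model.fit(X[:-100], Y[:-100])
pred = model.predict(X[-100:])                      # joint load/price forecast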
Article
The remarkable flexibility and adaptability of ensemble methods and deep learning models have led to the proliferation of their application in bioinformatics research. Traditionally, these two machine learning techniques have largely been treated as independent methodologies in bioinformatics applications. However, the recent emergence of ensemble deep learning—wherein the two machine learning techniques are combined to achieve synergistic improvements in model accuracy, stability and reproducibility—has prompted a new wave of research and application. Here, we share recent key developments in ensemble deep learning and look at how their contribution has benefited a wide range of bioinformatics research from basic sequence analysis to systems biology. While the application of ensemble deep learning in bioinformatics is diverse and multifaceted, we identify and discuss the common challenges and opportunities in the context of bioinformatics research. We hope this Review Article will bring together the broader community of machine learning researchers, bioinformaticians and biologists to foster future research and development in ensemble deep learning, and inspire novel bioinformatics applications that are unattainable by traditional methods. Recent developments in machine learning have seen the merging of ensemble and deep learning techniques. The authors review advances in ensemble deep learning methods and their applications in bioinformatics, and discuss the challenges and opportunities going forward.
Article
In research on computer-aided diagnosis, the curse of feature dimensionality in disease data and the imbalance of medical samples have always been central concerns for diagnostic decision support systems. To address these two problems, we propose a feature selection algorithm based on association rules and an ensemble classification algorithm based on random equilibrium sampling. We extracted and cleaned electronic medical record text obtained from a hospital to build a diabetes dataset. The proposed algorithms were validated on this dataset and on public UCI datasets. Experimental results show that the feature selection algorithm based on association rules outperforms the CART, ReliefF and RFE-SVM algorithms in terms of feature dimension and classification accuracy. The proposed ensemble classification algorithm based on random equilibrium sampling is superior to the SMOTE-Boost and SMOTE-RF baselines in macro precision, macro recall and macro F1, which reflects the robustness of the algorithm.
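As a loose illustration of the feature-selection idea, the sketch below keeps only the binary features that form high-support, high-confidence association rules with the class label; it considers single-antecedent rules only, and the feature names, thresholds and synthetic data are assumptions. The random equilibrium sampling step is not shown.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "high_glucose": rng.binomial(1, 0.3, n).astype(bool),
    "hypertension": rng.binomial(1, 0.4, n).astype(bool),
    "obesity":      rng.binomial(1, 0.2, n).astype(bool),
})
df["diabetes"] = df["high_glucose"] & (rng.random(n) < 0.8)   # synthetic label

# Keep features whose rule "feature -> label" is frequent and confident.
label, min_support, min_conf = "diabetes", 0.05, 0.6
selected = []
for col in ["high_glucose", "hypertension", "obesity"]:
    support = (df[col] & df[label]).mean()
    confidence = support / max(df[col].mean(), 1e-12)
    if support >= min_support and confidence >= min_conf:
        selected.append(col)
print(selected)                                               # likely ['high_glucose']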
Article
Crime linkage is a challenging task in crime analysis that aims to find serial crimes committed by the same offenders. It can be regarded as a binary classification task detecting serial case pairs. However, most case pairs in the real world are non-serial, so crime linkage suffers from a serious class imbalance. In this paper, we propose a novel random forest based on the information granule. The approach does not resample the minority or the majority class but concentrates on indistinguishable case pairs at the classification boundary. The information granule is used to identify case pairs that are difficult to distinguish in the dataset and to construct a nearly balanced dataset in the uncertainty region to deal with the imbalance problem. In the proposed approach, random trees are built from both the original dataset and the above-mentioned nearly balanced dataset. A real-world robbery dataset and several public imbalanced datasets are employed to measure the performance of the approach. The results show that the proposed approach is effective in dealing with class imbalance, and it can be combined with other methods for handling class imbalance.
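The boundary-focused idea can be loosely sketched as follows: a cross-validated preliminary forest scores every case pair, the positives plus the hardest negatives form a nearly balanced subset, and forests trained on the full data and on that subset are pooled. The uncertainty-region construction here is a simplification; the paper's information-granule machinery, the data and all parameter values are assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
X = rng.standard_normal((3000, 10))
score = X[:, 0] + 0.5 * rng.standard_normal(3000)
y = (score > np.quantile(score, 0.95)).astype(int)     # ~5% "serial" pairs (synthetic)

# Cross-validated scores locate pairs that are hard to distinguish.
p = cross_val_predict(RandomForestClassifier(n_estimators=100), X, y,
                      cv=3, method="predict_proba")[:, 1]
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
hard_neg = neg[np.argsort(p[neg])[-len(pos):]]         # negatives closest to the boundary
subset = np.concatenate([pos, hard_neg])               # nearly balanced boundary set

rf_full = RandomForestClassifier(n_estimators=100).fit(X, y)
rf_boundary = RandomForestClassifier(n_estimators=100).fit(X[subset], y[subset])

def predict(X_new):
    # Pool the two forests by averaging their probabilities.
    p_avg = (rf_full.predict_proba(X_new)[:, 1] +
             rf_boundary.predict_proba(X_new)[:, 1]) / 2
    return (p_avg > 0.5).astype(int)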
Article
Ensemble learning methods have already shown to be powerful techniques for creating classifiers. However, when dealing with real-world engineering problems, class imbalance is usually found. In such scenario, canonical machine learning algorithms may not present desirable solutions, and techniques for overcoming this problem must be used. In addition to using learning algorithms that alleviate the imbalance between classes, multi-objective optimization design (MOOD) approaches can be used to improve the prediction performance of ensembles of classifiers. This paper proposes a study of different MOOD approaches for ensemble learning. First, a taxonomy on multi-objective ensemble learning (MOEL) is proposed. In it, four types of existing approaches are defined: multi-objective ensemble member generation (MOEMG), multi-objective ensemble member selection (MOEMS), multi-objective ensemble member combination (MOEMC) and multi-objective ensemble member selection and combination (MOEMSC). Additionally, new approaches can be derived by combining the previous ones, such as multi-objective ensemble member generation and selection (MOEMGS), multi-objective ensemble member generation and combination (MOEMGC) and multi-objective ensemble member generation, selection and combination (MOEMGSC). With the given taxonomy, two experiments are conducted for (1) comparing the performance of the MOEL techniques for generating and aggregating base models on several imbalanced benchmark problems and (2) the performance of MOEL techniques against other machine learning techniques in a real-world imbalanced drinking-water quality anomaly detection problem. Finally, results indicate that MOOD is able to improve the predictive performance of existing ensemble learning techniques.
Article
Binary datasets are considered imbalanced when one of their two classes has less than 40% of the total number of data instances (i.e., the minority class). Existing classification algorithms are biased when applied to imbalanced binary datasets, as they misclassify instances of the minority class. Many techniques have been proposed to minimize this bias and increase classification accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach to this problem: it generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, which leads to useless new instances and wastes time and memory. Different SMOTE derivatives have been proposed to overcome this problem (such as Borderline SMOTE), yet the number of generated instances changes only slightly. To overcome this, this paper proposes a novel approach for generating synthesized data instances, known as Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE). It minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, to remove noisy majority instances, with oversampling to enhance the density of the borderline. It applies k-means clustering to the borderline area and identifies which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperformed SMOTE, Borderline SMOTE, AB-SMOTE and CAB-SMOTE, the approaches developed on the way to HCAB-SMOTE, as it provided the highest classification accuracy with the least number of generated instances.
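HCAB-SMOTE itself is not available in common libraries; as a loose analogue of its "clean the majority, then oversample near the borderline with clustering" recipe, the sketch below chains edited nearest neighbours with imbalanced-learn's cluster-based KMeansSMOTE. The parameters and synthetic data are assumptions, not the paper's settings.

import numpy as np
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import KMeansSMOTE

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 8))
y = (rng.random(2000) < 0.1).astype(int)               # ~10% minority class (synthetic)

# Step 1: remove noisy majority instances near the minority class.
X_clean, y_clean = EditedNearestNeighbours().fit_resample(X, y)

# Step 2: cluster the space with k-means and oversample inside suitable clusters.
sm = KMeansSMOTE(cluster_balance_threshold=0.05, random_state=0)
X_res, y_res = sm.fit_resample(X_clean, y_clean)
print(np.bincount(y), "->", np.bincount(y_res))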
Article
The advent of Big Data has ushered in a new era of scientific breakthroughs. One of the common issues that affects raw data is the class imbalance problem, which refers to an imbalanced distribution of values of the response variable. This issue is present in fraud detection, network intrusion detection, medical diagnostics, and a number of other fields where negatively labeled instances significantly outnumber positively labeled instances. Modern machine learning techniques struggle to deal with imbalanced data, as they focus on minimizing the error rate for the majority class while ignoring the minority class. The goal of our paper is to demonstrate the effects of class imbalance on classification models. Concretely, we study the impact of varying class imbalance ratios on classifier accuracy. By highlighting the precise nature of the relationship between the degree of class imbalance and the corresponding effects on classifier performance, we hope to help researchers better tackle the problem. To this end, we carry out extensive experiments using 10-fold cross-validation on a large number of datasets. In particular, we determine that the relationship between the class imbalance ratio and the accuracy is convex.
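A small version of such an experiment can be written directly: subsample one dataset at several imbalance ratios and record 10-fold cross-validated accuracy for each. The base classifier, the synthetic dataset and the chosen ratios are illustrative assumptions, not the paper's experimental setup.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.5, 0.5],
                           random_state=0)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

for ratio in [0.4, 0.2, 0.1, 0.05, 0.01]:              # target fraction of positives
    n_pos = min(len(pos), int(ratio * len(neg) / (1 - ratio)))
    idx = np.concatenate([neg, rng.choice(pos, n_pos, replace=False)])
    acc = cross_val_score(LogisticRegression(max_iter=1000), X[idx], y[idx],
                          cv=10, scoring="accuracy").mean()
    print(f"positive fraction {ratio:.2f}: 10-fold accuracy {acc:.3f}")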
Article
In binary classification, the class-imbalance problem occurs when the number of samples in one class is much larger than that in the other class. In such cases, the performance of a classifier is generally poor on the minority class. Classifier ensembles are used to tackle this problem, where each member is trained using a different balanced dataset computed by randomly undersampling the majority class and/or randomly oversampling the minority class. Although the primary target of imbalance learning is the minority class, undersampling-based schemes employ the same minority sample set for all members, whereas oversampling the minority is challenging due to its unclear structure. On the other hand, heterogeneous ensembles utilizing multiple learning algorithms have a higher potential for generating diverse members than homogeneous ones. In this study, the use of heterogeneous ensembles for imbalance learning is addressed. Experiments are conducted on 66 datasets to explore the relation between the heterogeneity of the ensemble and the performance scores, using the AUC and F1 measures. The results show that the performance scores improve as the number of classification methods is increased from one to five. Moreover, when compared with homogeneous ensembles, significantly higher scores are achieved using heterogeneous ones. It is also observed that multiple balancing schemes contribute to the performance scores of some homogeneous and heterogeneous ensembles; however, the improvements are not significant for either approach.
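A heterogeneous under-bagging ensemble of the kind studied here can be sketched briefly: each member uses a different learning algorithm and its own randomly under-sampled balanced subset, and the members vote. The five algorithms, the subset scheme and the synthetic data are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.standard_normal((5000, 12))
y = (rng.random(5000) < 0.08).astype(int)              # imbalanced labels (synthetic)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

algorithms = [DecisionTreeClassifier(), LogisticRegression(max_iter=500),
              GaussianNB(), KNeighborsClassifier(), SVC()]
members = []
for clf in algorithms:
    # Each member gets its own balanced subset via random undersampling.
    sub = np.concatenate([pos, rng.choice(neg, len(pos), replace=False)])
    members.append(clf.fit(X[sub], y[sub]))

# Majority vote across the heterogeneous members.
votes = np.mean([m.predict(X) for m in members], axis=0)
y_pred = (votes >= 0.5).astype(int)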