Analysis of Temperature and Humidity Data for Future value prediction

Authors:
  • Symbiosis Institute of Technology Nagpur

Analysis of Temperature and Humidity Data
for Future value prediction
Badhiye S. S.#1, Wakode B. V.#2, Chatur P. N.#3
#1. M. Tech Computer Science and Engineering Department,
#2. Asst. Professor Information Technology Department,
#3. Head Computer Science and Engineering Department,
Government College of Engineering, Amravati, Maharashtra, India.
Abstract - Knowledge of climate data in a region is essential for
business, society, agriculture, pollution and energy applications,
and it therefore demands particular attention from researchers.
Considerable progress has been made in this field over the past
few years. Among the seasonal climatic attributes, Sea Surface
Temperature (SST) is the main factor researchers use to develop
systems for temperature and humidity prediction. Data mining is
a technology for inferring useful knowledge from vast amounts of
data; techniques such as Classification, Prediction, Clustering
and Outlier analysis can be used for this purpose. The main aim
of this paper is to acquire temperature and humidity data and use
an efficient data mining technique to find the hidden patterns
inside the large dataset, so as to turn the retrieved information
into usable knowledge for classification and prediction of climate
conditions.
Keywords: Data Mining, Data Mining Techniques, Sea Surface
Temperature
I. INTRODUCTION
Weather data is classified as either synoptic data or climate
data. Synoptic data is the real-time data provided for use in
aviation safety and forecast modeling. Climate data is the
official data record, usually provided after some quality
control has been performed on it [1].
Climate and weather affect human society in many ways. In
agriculture, crop production depends on rain, which is the most
important factor for water resources and itself an element of
weather, and the proportion of such elements increases or
decreases as the climate changes [7]. Energy sources, e.g.
natural gas and electricity, depend greatly on weather
conditions. Climate is not fixed: it fluctuates from year to
year, and wet/dry and cold/warm seasons significantly influence
society.
Depending upon the techniques used, data mining can be divided
into three basic types: Association Rules Mining, Cluster
Analysis and Classification/Prediction [7]. This paper describes
how to use a data mining technique, k-Nearest Neighbor (KNN), to
develop a system that uses numeric historical data to forecast
the climate of a specific region or city. The main aim of this
paper is to acquire temperature and humidity data and use the
k-Nearest Neighbor algorithm to find hidden patterns inside a
large dataset, so as to turn the retrieved information into
usable knowledge for classification and prediction of
temperature and humidity.
II. LITERATURE SURVEY
Predicting future values by analyzing temperature and humidity
data is important work that can benefit both society and the
economy. Work has been done in this area for years, and
different techniques have been applied to predict temperature,
humidity and other weather parameters. Some of the work in this
area is as follows:
In data mining, the unsupervised learning technique of
clustering is a useful method for ascertaining trends and
patterns in data. Most general clustering techniques do not take
into consideration the time-order of data. Tasha R. Inniss used
mathematical programming and statistical techniques and
methodologies to develop a seasonal clustering technique for
determining clusters of time series data, and applied this
technique to weather and aviation data to determine
probabilistic distributions of arrival capacity scenarios, which
can be used for efficient traffic flow management. The
seasonal clustering technique is modeled as a set partitioning
integer programming problem, and the resulting clusterings are
evaluated using the mean square ratio criterion [2]. The
resulting seasonal distributions that satisfy the mean square
ratio criterion can be used as the required inputs
(distributions of airport arrival capacity scenarios) to
stochastic ground holding models. In combination, the results
would give the optimal number of flights to ground in a ground
delay program to aid more efficient traffic flow management
[2].
S. Kotsiantis, A. Kostoulas, S. Lykoudis, A. Argiriou and K.
Menagias investigated the efficiency of data mining techniques
in estimating minimum, maximum and mean temperature
values. Using temperature data from the city of Patras in
Greece, they conducted a number of experiments with well-known
regression algorithms. The performance of these algorithms was
evaluated using standard statistical indicators, such as
Correlation Coefficient, Root Mean Squared Error, etc. [1]
Godfrey C. Onwubolu, Petr Buryan, Sitaram Garimella,
Visagaperuman Ramachandran, Viti Buadromo and Ajith
Abraham presented the data mining activity that was
employed in weather data prediction or forecasting. The
approach employed is the enhanced Group Method of Data
Handling (e-GMDH). The weather data used for the research
include daily temperature, daily pressure and monthly rainfall
[3]. The results of e-GMDH were compared with those of PNN
and its variant, e-PNN. e-GMDH outperformed PNN and its
variant in most modeling and prediction problems. They showed
that end users of data mining should endeavor to follow the
methodologies of data mining, since suspicious data points or
outliers in a vast amount of data could give unrealistic results
which may affect knowledge inference.
Badhiye S. S et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 3 (1) , 2012, 3012 - 3014
S. Kotsiantis, A. Kostoulas, S. Lykoudis, A. Argiriou and K.
Menagias proposed a hybrid data mining technique that can be
used to predict the mean daily temperature values more
accurately [4]. It was found that regression algorithms could
enable experts to predict temperature values with satisfactory
accuracy, using the temperatures of previous years as input.
The hybrid data mining technique produced the most accurate
results.
Peter Eredics applied temperature prediction methods that mine
past weather data records to the development of intelligent
control solutions for greenhouses. The problem is closely
related to the prediction of the actual weather conditions
within the immediate environment of the greenhouse. An
intelligent greenhouse collects its own climate data, and over
time the weather records from a weather station located right
by the greenhouse are mined by the algorithm, increasing the
prediction accuracy. Eredics demonstrates the limited
performance of uninformed, simple methods for temperature
forecasts, and introduces more accurate solutions using
information from the problem domain [5].
A. Outlier Analysis
A database may contain data objects that do not comply with
the general behavior or model of the data. These data objects
are outliers.
Example: detecting fraudulent use of credit cards. Outlier
analysis may uncover fraudulent credit card use by detecting
purchases of extremely large amounts for a given account
number in comparison to the regular charges incurred by the same
account. Outlier values may also be detected with respect to the
location and type of purchase, or the purchase frequency.
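The credit card example can be illustrated with a minimal sketch. It is not part of the original study: the function name, the sample charges and the choice of a modified z-score based on the median absolute deviation (with the commonly used cutoff of 3.5) are assumptions made for illustration only.

```python
from statistics import median

def flag_large_purchases(amounts, threshold=3.5):
    """Return indices of charges that are unusually large for one account.

    Illustrative assumption: a modified z-score built on the median and the
    median absolute deviation (MAD), so the outliers themselves do not
    distort the baseline as much as they would with mean/stdev.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # charges are (almost) identical: nothing stands out
    flagged = []
    for i, amount in enumerate(amounts):
        score = 0.6745 * (amount - med) / mad
        if score > threshold:  # only extremely large amounts are of interest
            flagged.append(i)
    return flagged

# Hypothetical charges for a single account number.
charges = [42.0, 38.5, 51.0, 45.2, 40.0, 2500.0, 47.3]
print(flag_large_purchases(charges))  # [5]
```

The same idea extends to the other attributes mentioned above (location, purchase type, frequency) once they are encoded numerically.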
B. Clustering
Clustering analyses data objects without consulting a known
class label. The unsupervised learning technique of clustering
is a useful method for ascertaining trends and patterns in data,
when there are no pre-defined classes. There are two main
types of clustering, hierarchical and partitional [10]. In
agglomerative hierarchical clustering, each data point is
initially in its own cluster and clusters are then successively
joined to create a clustering structure. In partitional
clustering, the number of clusters must be
known a priori. The partitioning is done by minimizing a
measure of dissimilarity within each cluster and maximizing
the dissimilarity between different clusters.
Fig. 1 Figure of Cluster Analysis
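The agglomerative method described above can be sketched in a few lines. This is an illustrative sketch only: the single-linkage merge rule, the function name and the sample (temperature, humidity) readings are assumptions, not details taken from the paper.

```python
from math import dist

def agglomerative(points, num_clusters):
    """Join the two closest clusters repeatedly until num_clusters remain.

    Assumption: single linkage, i.e. the distance between two clusters is
    the smallest Euclidean distance between any pair of their members.
    """
    clusters = [[p] for p in points]  # every point starts in its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Hypothetical (temperature in deg C, relative humidity in %) readings.
readings = [(21.0, 40.0), (22.0, 42.0), (35.0, 80.0), (36.0, 78.0), (21.5, 41.0)]
for cluster in agglomerative(readings, 2):
    print(sorted(cluster))
```

With two clusters requested, the three cool, dry readings end up in one group and the two hot, humid readings in the other.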
C. Classification and Prediction
Classification is the process of finding a model that
describes and distinguishes data classes or concepts for the
purpose of being able to use the model to predict the class of
objects whose class label is unknown [6].
A classification model can be represented in various forms,
such as:
1) IF-THEN rules
2) A decision tree
3) A neural network
4) k-Nearest Neighbor, etc.
Fig. 2 Classification models
III. MOTIVATION
It has become important to find an effective and accurate
tool to analyze and extract hidden knowledge from climate data,
given its increasing availability during the last decade.
Knowledge of climate data in a region is essential for
business, society, agriculture, pollution and energy
applications, and for research and development.
Temperature and humidity data is also used in the estimation
of bio-meteorological parameters in a region. Data mining has
recently been applied to show the effect of climate variation on
vegetation. In this work, the statistical data miner software
STATISTICA 10, an intelligent data mining package in which
various Artificial Intelligence algorithms are implemented, is
used for the data mining.
IV. PROPOSED PLAN
The objective is to predict the values of the temperature and
humidity parameters of climate with higher accuracy, and to
demonstrate the prediction ability of the data mining technique
in this context.
The work will be implemented in several steps, with various
methodologies used in each step, as shown in the figure below:
Fig. 3 Design of Temperature and Humidity Data Analysis System
A. Data Collection
This is the most important part of implementing any data
mining technique. For this purpose we use a 10-channel midi
data logger system, which provides temperature and humidity
data in the form of Excel sheets.
A data logger is an electronic device, based on a digital
processor, that records data over time or in relation to
location, either with a built-in instrument or sensor or via
external instruments and sensors.
The primary and most important benefit of using data loggers
is that they can automatically collect data on a 24-hour
basis.
B. Data Pre-Processing
The next important step in data mining is data pre-processing.
The challenge faced in the knowledge discovery process for
temperature and humidity data is poor data quality, so the data
must be pre-processed to remove noisy and unwanted data. The
weather data used in this study consists of various parameters
such as temperature, humidity, rain, wind speed, etc., but only
temperature and humidity data is required for the proposed
analysis. Pre-processing here therefore means removing the
other, unwanted parameters from the dataset.
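A minimal sketch of such a pre-processing step is given below. It assumes the logger's sheets have been exported to CSV text, and the column names (temperature, humidity, rain, wind_speed) and the 0 to 100 % relative humidity check are hypothetical choices for illustration, not the format of the actual data logger.

```python
import csv
import io

def keep_temperature_humidity(csv_text):
    """Keep only (temperature, humidity) pairs and drop unusable rows.

    Assumptions: columns are named "temperature" and "humidity", and
    humidity is relative humidity in percent, so values outside 0..100
    are treated as noise.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        try:
            temperature = float(record["temperature"])
            humidity = float(record["humidity"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric reading: skip the row
        if not 0.0 <= humidity <= 100.0:
            continue  # physically impossible relative humidity: skip
        rows.append((temperature, humidity))  # rain, wind speed etc. are dropped
    return rows

# Hypothetical export with one missing and one noisy reading.
sample = """temperature,humidity,rain,wind_speed
30.5,62,0.0,3.1
,58,0.2,2.4
31.0,140,0.0,2.9
29.8,65,1.5,4.0
"""
print(keep_temperature_humidity(sample))  # [(30.5, 62.0), (29.8, 65.0)]
```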
C. Knowledge Discovery
For knowledge extraction, various data mining techniques
such as Outlier Analysis, Clustering, Prediction and
Classification, and Association Rules can be applied in
statistical data miner software such as Statistica 10.
The classification algorithm used is k-Nearest Neighbor, which
uses the Euclidean distance between two points to measure the
closeness between an unknown sample and samples of known
classes. The unknown sample is then assigned the most common
class among its k nearest neighbors.
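The classification rule just described can be sketched as follows. This is an illustration rather than the implementation used in Statistica: the function name, the value k = 3 and the labelled training readings are assumptions, while the "hot" and "humid" labels follow the decision shown in Fig. 3.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Assign query the most common label among its k nearest neighbors.

    train is a list of ((temperature, humidity), label) pairs; closeness
    is the Euclidean distance between the feature tuples.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labelled readings: (temperature in deg C, humidity in %).
train = [
    ((35.0, 30.0), "hot"),
    ((36.5, 35.0), "hot"),
    ((34.0, 40.0), "hot"),
    ((27.0, 85.0), "humid"),
    ((28.0, 90.0), "humid"),
    ((26.5, 80.0), "humid"),
]
print(knn_classify(train, (35.5, 33.0)))  # hot
print(knn_classify(train, (27.5, 88.0)))  # humid
```

Because temperature and humidity are measured on different scales, it is usually advisable to normalize both attributes before computing distances, so that the attribute with the larger numeric range does not dominate the vote.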
D. Result of Analysis
The future values of temperature and humidity are predicted
depending on the result of the classification algorithm.
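The paper does not specify how a future value is derived from the nearest neighbors. One possible reading, sketched here purely as an assumption, is nearest-neighbor regression on lagged windows: the most recent readings are compared with earlier windows of the same length, and the values that followed the k most similar windows are averaged. The function name, window length and sample series are hypothetical.

```python
from math import dist

def knn_forecast(series, window=3, k=2):
    """Predict the next value from the k past windows closest to the latest one.

    Assumption: the forecast is the plain mean of the values that followed
    the k most similar historical windows (Euclidean distance).
    """
    if len(series) <= window:
        raise ValueError("series must be longer than window")
    query = series[-window:]  # the most recent readings
    candidates = []
    for start in range(len(series) - window):
        past = series[start:start + window]
        following = series[start + window]  # value observed right after it
        candidates.append((dist(past, query), following))
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Hypothetical temperature series with a repeating pattern.
temperatures = [20.0, 22.0, 24.0, 20.0, 22.0, 24.0, 20.0, 22.0]
print(knn_forecast(temperatures, window=2, k=2))  # 24.0
```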
V. CONCLUSION
k-Nearest Neighbor classification is an easy-to-understand
and easy-to-implement classification technique. Despite its
simplicity, it can perform well in many situations. In particular,
a well-known result by Cover and Hart [11] shows that the
asymptotic error of the nearest neighbor rule is bounded above
by twice the Bayes error under certain reasonable assumptions.
Also, the error of the general k-Nearest Neighbor method
asymptotically approaches the Bayes error and can be used to
approximate it. k-Nearest Neighbor is particularly well suited
to multi-modal classes, as well as to applications in which an
object can have many class labels. Thus, the proposed method
aims at temperature and humidity prediction with accuracy
approaching 100%.
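For reference, the Cover and Hart bound [11] mentioned above can be written out explicitly. The symbols are chosen here for illustration: R* denotes the Bayes error, R_NN the asymptotic (large-sample) error of the single nearest neighbor rule, and M the number of classes.

```latex
R^{*} \;\le\; R_{\mathrm{NN}} \;\le\; R^{*}\left(2 - \frac{M}{M-1}\,R^{*}\right) \;\le\; 2R^{*}
```

The bound concerns the limit of infinitely many training samples; it does not by itself guarantee a particular accuracy on a finite temperature and humidity dataset.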
REFERENCES
[1] S. Kotsiantis et al., “Using Data Mining Techniques for Estimating
Minimum, Maximum and Average Daily Temperature Values”, World
Academy of Science, Engineering and Technology, 2007, pp. 450-454
[2] Tasha R. Inniss, “Seasonal clustering technique for time series data”,
European Journal of Operational Research (175), 2006, pp. 376-384
[3] Godfrey C. Onwubolu, Petr Buryan, Sitaram Garimella, Visagaperuman
Ramachandran, Viti Buadromo and Ajith Abraham, “Self-organizing data
mining for weather forecasting”, IADIS European Conference Data Mining,
2007, pp. 81-88
[4] S. Kotsiantis, A. Kostoulas, S. Lykoudis, A. Argiriou, K. Menagias, “A
Hybrid Data Mining Technique for Estimating Mean Daily Temperature
Values”, IJICT Journal, Volume 1 (5), pp. 54-59
[5] Peter Eredics, “Short-Term External Air Temperature Prediction for an
Intelligent Greenhouse by Mining Climatic Time Series”, WISP 2009, 6th IEEE
International Symposium on Intelligent Signal Processing, 26-28 August 2009,
Budapest, Hungary, pp. 317-322
[6] Thair Nu Phyu, “Survey of classification techniques in Data Mining”,
IMECS 2009, Volume 1, Hong Kong, pp. 1-5
[7] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier
Science and Technology, Amsterdam, 2006
[8] E. Fix, J. L. Hodges, Discriminatory Analysis - Nonparametric
Discrimination: Consistency Properties, USAF School of Aviation Medicine,
Randolph Field, Texas, 1951
[9] D. T. Larose, Discovering Knowledge in Data: An Introduction to Data
Mining, Wiley, Chichester, 2005
[10] B. S. Everitt, Cluster Analysis, Halsted Press, John Wiley and Sons, New
York, 1975
[11] T. Cover, P. Hart, “Nearest neighbor pattern classification”, IEEE
Trans. Inform. Theory, Volume 13(1), 1967, pp. 21-27
[Fig. 3 flowchart, recovered labels: Data Collection (temperature and humidity data from various sensor devices, e.g. data logger) -> Data Preprocessing (data transformation and data combination) -> Database Store (store data in database) -> Knowledge Discovery (data miner software: data analysis, data mining) -> Results of Analysis (classification result from climate data; helps to make a decision, e.g. hot or humid)]