Chapter 1
STATISTICAL METHODS FOR DATA MINING
Yoav Benjamini
Department of Statistics, School of Mathematical Sciences, Sackler Faculty for Exact
Sciences
Tel Aviv University
ybenja@post.tau.ac.il
Moshe Leshno
Faculty of Management and Sackler Faculty of Medicine
Tel Aviv University
leshnom@post.tau.ac.il
Abstract The aim of this chapter is to present the main statistical issues in Data Mining (DM) and Knowledge Data Discovery (KDD) and to examine whether the traditional statistics approach and methods substantially differ from the new trend of KDD and DM. We address and emphasize some central issues of statistics which are highly relevant to DM and have much to offer to DM.

Keywords: Statistics, Regression Models, False Discovery Rate (FDR), Model selection and False Discovery Rate (FDR)
1. Introduction
As an anonymous saying has it, there are two problems in modern science: too many people use different terminology to solve the same problems, and even more people use the same terminology to address completely different issues. This is particularly relevant to the relationship between traditional statistics and the new emerging field of knowledge data discovery (KDD) and data mining (DM). The explosive growth of interest and research in the domain of KDD and DM in recent years is not surprising given the proliferation of low-cost computers and the requisite software, low-cost database technology (for collecting and storing data) and the ample data that has been and continues to be collected and organized in databases and on the web. Indeed, the implementation of KDD and DM in business and industrial organizations has increased dramatically, although their impact on these organizations is not clear. The aim of this chapter is to present the main statistical issues in DM and KDD and to examine the role of the traditional statistics approach and methods in the new trend of KDD and DM. We argue that data miners should be familiar with statistical themes and models, and that statisticians should be aware of the capabilities and limitations of data mining and the ways in which data mining differs from traditional statistics.
Statistics is the traditional field that deals with the quantification,
collection, analysis, interpretation, and drawing conclusions from data.
Data mining is an interdisciplinary field that draws on computer science (databases, artificial intelligence, machine learning, graphical and visualization models), statistics, and engineering (pattern recognition, neural networks). DM involves the analysis of large existing databases in order to discover patterns and relationships in the data, and other findings (unexpected, surprising, and useful). Typically, it differs from traditional statistics on two issues: the size of the data set and the fact that the data were initially collected for a purpose other than that of the DM analysis. Thus, experimental design, a very important topic in traditional statistics, is usually irrelevant to DM. On the other hand,
asymptotic analysis, sometimes criticized in statistics as being irrelevant,
becomes very relevant in DM.
While in traditional statistics a data set of 100 to 10^4 entries is considered large, in DM even 10^4 entries may be considered a small set, fit to be used as an example rather than a problem encountered in practice. Problem sizes of 10^7 to 10^10 are more typical. It is important to emphasize, though, that data set sizes are not all created equal. One needs to distinguish between the number of cases (observations) in a large data set (n), and the number of features (variables) available for each case (m). In a large data set, n, m or both can be large, and it does matter which, a point on which we will elaborate in the continuation. Moreover, these definitions may change when the same data set is being used for two different purposes. A nice demonstration of such an instance can be found in the 2001 KDD competition, where in one task the number of cases was the number of purchasing customers, the click information being a subset of the features, and in the other task the clicks were the cases.
Our aim in this chapter is to indicate certain focal areas where sta-
tistical thinking and practice have much to offer to DM. Some of them
are well known, whereas others are not. We will cover some of them
in depth, and touch upon others only marginally. We will address the
following issues which are highly relevant to DM:
Size
Curse of Dimensionality
Assessing uncertainty
Automated analysis
Algorithms for data analysis in Statistics
Visualization
Scalability
Sampling
Modelling relationships
Model selection
We briefly discuss these issues in the next section and then devote special sections to three of them. In section 3 we present how the most basic of statistical methodologies, namely regression analysis, has developed over the years to create a very flexible tool to model relationships, in the form of the Generalized Linear Models (GLMs). In section 4 we discuss the False Discovery Rate (FDR) as a scalable approach to hypothesis testing. In section 5 we discuss how FDR ideas contribute to flexible model selection in GLM. We conclude the chapter by asking whether the concepts and methods of KDD and DM differ from those of traditional statistics, and how statistics and DM should act together.
2. Statistical Issues in DM
2.1 Size of the Data and Statistical Theory
Traditional statistics emphasizes the mathematical formulation and validation of a methodology, and views simulations and empirical or practical evidence as a lesser form of validation. The emphasis on rigor has required proof that a proposed method will work prior to its use. In contrast, computer science and machine learning use experimental validation methods. In many cases mathematical analysis of the performance of a statistical algorithm is not feasible in a specific setting, but becomes so when analyzed asymptotically. At the same time, when size becomes extremely large, studying performance by simulations is also not feasible. It is therefore in settings typical of DM problems that asymptotic analysis becomes both feasible and appropriate. Interestingly, in classical asymptotic analysis the number of cases n tends to infinity. In more contemporary literature there is a shift of emphasis to asymptotic analysis where the number of variables m tends to infinity. This shift occurred because of the interest of statisticians and applied mathematicians in wavelet analysis (see Chapter ...), where the number of parameters (wavelet coefficients) equals the number of cases, and it has proved highly successful in areas such as the analysis of gene expression data from microarrays.
2.2 The curse of dimensionality and approaches
to address it
The curse of dimensionality is a well documented and often cited
fundamental problem. Not only do algorithms face more difficulties as
the data increases in dimension, but the structure of the data it-
self changes. Take, for example, data uniformly distributed in a high-
dimensional ball. It turns out that (in some precise way, see Meilijson,
1991) most of the data points are very close to the surface of the ball.
This phenomenon becomes very evident when looking for the k-Nearest
Neighbors of a point in high-dimensional space. The points are so far
away from each other that the radius of the neighborhood becomes ex-
tremely large.
The main remedy offered for the curse of dimensionality is to use only
part of the available variables per case, or to combine variables in the
data set in a way that will summarize the relevant information with
fewer variables. This dimension reduction is the essence of what goes
on in the data warehousing stage of the DM process, along with the
cleansing of the data. It is an important and time-consuming stage of
the DM operations, accounting for 80-90% of the time devoted to the
analysis.
The dimension reduction comprises two types of activities: the first
is quantifying and summarizing information into a number of variables,
and the second is further reducing the variables thus constructed into a
workable number of combined variables. Consider, for instance, a phone
company that has at its disposal the entire history of calls made by a
customer. How should this history be reflected in just a few variables?
Should it be by monthly summaries of the number of calls per month
for each of the last 12 months, such as their means (or medians), their
maximal number, and a certain percentile? Maybe we should use the
mean, standard deviation and the number of calls below two standard
deviations from the mean? Or maybe we should use none of these but
rather variables capturing the monetary values of the activity? If we
take this last approach, should we work with the cost itself or will it be
more useful to transfer the cost data to the log scale? Statistical theory
and practice have much to offer in this respect, both in measurement
theory, and in data analysis practices and tools. The variables thus
constructed now have to be further reduced into a workable number of
combined variables. This stage may still involve judgmental combination of previously defined variables, such as cost per number of customers using a phone line, but more often will require more automatic methods such as principal components or independent components analysis (for a further discussion of principal component analysis see Roberts and Everson, 2001).
We cannot conclude the discussion of this topic without noting that occasionally we also enjoy the blessing of dimensionality, a term coined by David Donoho (Donoho, 2000) to describe the phenomenon, often encountered when working with very high dimensional data, of the high dimension helping rather than hurting. For example, if the data we study is pure noise and m is large, the i-th largest observation is very close to its expectation under the model for the noise! Another case in point is microarray analysis, where the many non-relevant genes analyzed give ample information about the distribution of the noise, making it easier to identify real discoveries. We shall see a third case below.
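A small simulation can make this concrete. The following sketch (our own illustration in Python; the sample size and the quantile approximation for the expectation are our choices, not part of the original text) compares the largest order statistics of pure Gaussian noise with their approximate expectations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m = 100_000                                   # number of pure-noise "features"
x = np.sort(rng.standard_normal(m))[::-1]     # order statistics, largest first

# Approximate expectation of the i-th largest of m standard normals,
# using the quantile approximation E[X_(i)] ~ Phi^{-1}(1 - i/(m+1)).
i = np.arange(1, 11)
expected = stats.norm.ppf(1 - i / (m + 1))

print(np.round(x[:10], 3))        # observed top order statistics
print(np.round(expected, 3))      # their approximate expectations: very close
```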
2.3 Assessing uncertainty
Assessing the uncertainty surrounding knowledge derived from data is recognized as a central theme in statistics. The concern about uncertainty is down-weighted in KDD, often because of the myth that all relevant data is available in DM. Thus, standard errors of averages, for example, will be ridiculously low, as will prediction errors.
On the other hand experienced users of DM tools are aware of the vari-
ability and uncertainty involved. They simply tend to rely on seemingly
”non-statistical” technologies such as the use of a training sample and
a test sample. Interestingly the latter is a methodology widely used in
statistics, with origins going back to the 1950s. The use of such valida-
tion methods, in the form of cross-validation for smaller data sets, has
been a common practice in exploratory data analysis when dealing with
medium size data sets.
Some of the insights gained over the years in statistics regarding the
use of these tools have not yet found their way into DM. Take, for
example, data on food store baskets, available for the last four years,
where the goal is to develop a prediction model. A typical analysis will involve taking a random training sample from the data, then testing the model on a test sample drawn from the same data, with the results guiding us as to the choice of the most appropriate model. However, the model will be used next
year, not last year. The main uncertainty surrounding its conclusions
may not stem from the person to person variability captured by the
differences between the values in the training sample, but rather follow
from the year to year variability. If this is the case, we have all the
data, but only four observations. The choice of the data for validation
and training samples should reflect the higher sources of variability in
the data, by each time setting the data of one year aside to serve as the
source for the test sample (for an illustrated yet profound discussion of
these issues in exploratory data analysis see Mosteller and Tukey, 1977,
Ch. 7,8).
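A hedged sketch of such a validation scheme, setting one year aside at a time, might look as follows (the column name, and the fit and score callables, are hypothetical placeholders):

```python
import pandas as pd

def year_wise_validation(df, fit, score, year_col="year"):
    """Leave one year out: train on the remaining years and test on the
    held-out year, so that the estimated error reflects year-to-year
    variability rather than only person-to-person variability."""
    results = {}
    for year in sorted(df[year_col].unique()):
        train = df[df[year_col] != year]
        test = df[df[year_col] == year]
        model = fit(train)                  # user-supplied model-fitting function
        results[year] = score(model, test)  # user-supplied evaluation function
    return results
```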
2.4 Automated analysis
The inherent dangers of the necessity to rely on automatic strategies
for analyzing the data, another main theme in DM, have been demon-
strated again and again. There are many examples where trivial non-
relevant variables, such as case number, turned out to be the best predic-
tors in automated analysis. Similarly, variables displaying a major role
in predicting a variable of interest in the past, may turn out to be useless
because they reflect some strong phenomenon not expected to occur in
the future (see for example the conclusions using the onion metaphor
from the 2002 KDD competition). In spite of these warnings, it is clear
that large parts of the analysis should be automated, especially at the
warehousing stage of the DM.
This may raise new dangers. It is well known in statistics that having
even a small proportion of outliers in the data can seriously distort its
numerical summary. Such unreasonable values, deviating from the main
structure of the data, can usually be identified by a careful human data
analyst, and excluded from the analysis. But once we have to warehouse
information about millions of customers, summarizing the information
about each customer by a few numbers has to be automated and the
analysis should rather deal automatically with the possible impact of a
few outliers.
Statistical theory and methodology supply the framework and the
tools for this endeavor. A numerical summary of the data that is not
unboundedly influenced by a negligible proportion of the data is called
a resistant summary. According to this definition the average is not re-
sistant, for even one straying data value can have an unbounded effect
on it. In contrast, the median is resistant. A resistant summary that
retains its good properties under less than ideal situations is called a
robust summary, the α-trimmed mean (rather than the median) being
an example of such. The concepts of robustness and resistance, and the development of robust statistical tools for summarizing location, scale, and relationships, were developed during the 1970s and the 1980s, and the resulting theory is quite mature (see, for instance, Ronchetti et al., 1986; Dell'Aquila and Ronchetti, 2004), even though robustness remains an active area of contemporary research in statistics. Robust summaries, rather than merely averages, standard deviations, and simple regression coefficients, are indispensable in DM. Here too, some adaptation of the computations to size may be needed, but efforts in this direction are being made in the statistical literature.
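As a minimal illustration (our own Python sketch; the contamination level is arbitrary), the median and an α-trimmed mean barely move when a small fraction of wild values is injected, whereas the plain average is carried away:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
clean = rng.normal(loc=100.0, scale=10.0, size=10_000)
contaminated = clean.copy()
contaminated[:50] = 1e6          # 0.5% gross outliers, e.g. data-entry errors

for name, x in [("clean", clean), ("contaminated", contaminated)]:
    print(name,
          "mean=%.1f" % x.mean(),                              # not resistant
          "median=%.1f" % np.median(x),                        # resistant
          "5%%-trimmed mean=%.1f" % stats.trim_mean(x, 0.05))  # robust
```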
2.5 Algorithms for data analysis in statistics
Computing has always been fundamental to statistics, and it remained so even in times when mathematical rigor was the most highly valued quality of a data analytic tool. Some of the important computa-
tional tools for data analysis, rooted in classical statistics, can be found
in the following list: efficient estimation by maximum likelihood, least
squares and least absolute deviation estimation, and the EM algorithm;
analysis of variance (ANOVA, MANOVA, ANCOVA), and the analy-
sis of repeated measurements; nonparametric statistics; log-linear anal-
ysis of categorical data; linear regression analysis, generalized additive
and linear models, logistic regression, survival analysis, and discrimi-
nant analysis; frequency domain (spectrum) and time domain (ARIMA)
methods for the analysis of time series; multivariate analysis tools such as
factor analysis, principal component and later independent component
analyses, and cluster analysis; density estimation, smoothing and de-
noising, and classification and regression trees (decision trees); Bayesian
networks and Markov chain Monte Carlo (MCMC) algorithms for
Bayesian inference.
For an overview of most of these topics, with an eye to the DM community, see Hastie, Tibshirani and Friedman, 2001. Some of the algorithms used in DM which were not included in classical statistics are nevertheless considered by some statisticians to be part of statistics (Friedman, 1998). For example, rule induction (AQ, CN2, Recon, etc.), association rules, neural networks, genetic algorithms and self-organizing maps may be attributed to statistics.
2.6 Visualization
Visualization of the data and its structure, as well as visualization of
the conclusions drawn from the data, are another central theme in DM.
Visualization of quantitative data as a major activity flourished in the
statistics of the 19th century, faded out of favor through most of the 20th
century, and began to regain importance in the early 1980s. This impor-
tance is reflected in the development of the Journal of Computational
and Graphical Statistics of the American Statistical Association. Both
the theory of visualizing quantitative data and the practice have dramat-
ically changed in recent years. Spinning data to gain a 3-dimensional
understanding of point clouds, or the use of projection pursuit, are just
two examples of visualization technologies that emerged from statistics.
It is therefore quite frustrating to see how much KDD software de-
viates from known principles of good visualization practices. Thus, for instance, the fundamental principle that the retinal variable in a graphical display (the length of a line, or the position of a point on a scale) should be proportional to the quantitative variable it represents is often violated by introducing a dramatic perspective. Add colors to the display and
the result is even harder to understand.
Much can be gained in DM by mining the knowledge about visualiza-
tion available in statistics, though the visualization tools of statistics are
usually not calibrated for the size of the data sets commonly dealt with in DM. Take for example the extremely effective boxplot display, used for the visual comparison of batches of data. A well-known rule determines two fences for each batch, and points outside the fences are individually displayed. This rule is the traditional default in most statistical software, even though it was developed with batches of very small size
in mind (in DM terms). In order to adapt the visualization technique
for routine use in DM, some other rule which will probably be adaptive
to the size of the batch should be developed. As this small example
demonstrates, visualization is an area where joint work may prove to be
extremely fruitful.
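To make the point concrete, here is a small sketch of the classical fence rule together with one possible size-adaptive variant (the adaptive rule below is purely illustrative, not a recommendation taken from the literature):

```python
import numpy as np

def boxplot_fences(x, k=1.5):
    """Classical fences: Q1 - k*IQR and Q3 + k*IQR, with k = 1.5 as the usual default."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def adaptive_fences(x):
    """Illustrative size-adaptive variant: widen the fences as the batch grows,
    so that the number of individually displayed points stays manageable."""
    k = 1.5 + 0.5 * np.log10(max(len(x) / 100, 1.0))
    return boxplot_fences(x, k=k)

rng = np.random.default_rng(2)
batch = rng.standard_normal(1_000_000)
for lo, hi in (boxplot_fences(batch), adaptive_fences(batch)):
    print(f"fences=({lo:.2f}, {hi:.2f}), points displayed individually:",
          int(np.sum((batch < lo) | (batch > hi))))
```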
2.7 Scalability
In machine learning and data mining scalability relates to the ability of
an algorithm to scale up with size, an essential condition being that the
storage requirement and running time should not become infeasible as
the size of the problem increases. Even simple problems like multivariate
histograms become a serious task, and may benefit from complex algo-
rithms that scale up with size. Designing scalable algorithms for more
complex tasks, such as decision tree modeling, optimization algorithms,
and the mining of association rules, has been the most active research
area in DM. Altogether, scalability is clearly a fundamental problem in
DM mostly viewed with regard to its algorithmic aspects. We want to
highlight the duality of the problem by suggesting that concepts should
be scalable as well. In this respect, consider the general belief that hy-
pothesis testing is a statistical concept that has nothing to offer in DM.
The usual argument is that data sets are so large that every hypothesis
tested will turn out to be statistically significant - even if differences or
relationships are minuscule. Using association rules as an example, one
may wonder whether an observed lift for a given rule is ”really differ-
ent from 1”, but then find that at the traditional level of significance
used (the mythological 0.05) an extremely large number of rules are in-
deed significant. Such findings brought David Hand (Hand, 1998) to
ask ”what should replace hypothesis testing?” in DM. We shall discuss
two such important scalable concepts in the continuation: the testing
of multiple hypotheses using the False Discovery Rate and the penalty
concept in model selection.
2.8 Sampling
Sampling is the ultimate scalable statistical tool: if the number of
cases n is very large the conclusions drawn from the sample depend only
on the size of the sample and not on the size of the data set. It is often
used to get a first impression of the data, visualize its main features, and
reach decisions as to the strategy of analysis. In spite of its scalability
and usefulness sampling has been attacked in the KDD community for its
inability to find very rare yet extremely interesting pieces of knowledge.
Sampling is a very well developed area of statistics (see for example
Cochran, 1977), but is usually used in DM at the very basic level. Strat-
ified sampling, where the probability of picking a case changes from one
stratum to another, is hardly ever used. But the questions are relevant
even in the simplest settings: should we sample from the few positive re-
sponses at the same rate that we sample from the negative ones? When
studying faulty loans, should we sample larger loans at a higher rate?
A thorough investigation of such questions, phrased in the realm of particular DM applications, may prove to be very beneficial.
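As a hedged illustration of such designs, stratified sampling with a different rate per stratum might be sketched as follows (the column names and rates are hypothetical):

```python
import pandas as pd

def stratified_sample(df, stratum_col, rates, random_state=0):
    """Sample each stratum at its own rate, e.g. oversampling the rare
    positive responses or the larger loans relative to the rest."""
    parts = [group.sample(frac=rates[name], random_state=random_state)
             for name, group in df.groupby(stratum_col)]
    return pd.concat(parts)

# Hypothetical usage: keep all positive responders, but only 1% of the negatives.
# sample = stratified_sample(customers, "response", {1: 1.0, 0: 0.01})
```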
Even greater benefits might be realized when more advanced sampling
models, especially those related to super populations, are utilized in
DM. The idea here is that the population of customers we view each
year, and from which we sample, can itself be viewed as a sample of
the same super population. Hence next year’s customers will again be a
population sampled from the super population. We leave this issue wide
open.
3. Modeling Relationships using Regression
Models
To demonstrate that statistics, like data mining, is concerned with turning data into information and knowledge, even though the terminology may differ, in this section we present a major statistical approach used in data mining, namely regression analysis. In the late 1990s, statistical methodologies such as regression analysis were not included in commercial data mining packages. Nowadays, most commercial data mining software includes many statistical tools and in particular regression analysis. Although regression analysis may seem simple and anachronistic, it is a very powerful tool in DM with large data sets, especially in the form of the generalized linear models (GLMs). We emphasize the assumptions of the models being used and how the underlying approach differs from that of machine learning. The reader is referred to McCullagh and Nelder, 1991 and Chapters ... for more detailed information on the specific statistical methods.
3.1 Linear Regression Analysis
Regression analysis is the process of determining how a variable y is related to one or more other variables x_1, \ldots, x_k. The y is usually called the dependent variable and the x_i's are called the independent or explanatory variables. In a linear regression model we assume that

y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} + \varepsilon_i, \qquad i = 1, \ldots, M        (1.1)
and that the \varepsilon_i's are independent and identically distributed as N(0, \sigma^2), where M is the number of data points. The expected value of y_i is given by

E(y_i) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}        (1.2)
To estimate the coefficients of the linear regression model we use least squares estimation, which gives results equivalent to the estimators obtained by the maximum likelihood method. Note that for the linear regression model there is an explicit formula for the \beta's. We can write (1.1) in matrix form as Y = X \cdot \beta + \varepsilon, where \beta is the transpose of the vector [\beta_0, \beta_1, \ldots, \beta_k], \varepsilon is the transpose of the vector [\varepsilon_1, \ldots, \varepsilon_M], and the matrix X is given by

X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1k} \\
1 & x_{21} & \cdots & x_{2k} \\
\vdots & \vdots & & \vdots \\
1 & x_{M1} & \cdots & x_{Mk}
\end{pmatrix}        (1.3)

The estimates of the \beta's are given (in matrix form) by \hat{\beta} = (X^t X)^{-1} X^t Y.
Note that in linear regression analysis we assume that for a given x_1, \ldots, x_k, y_i is distributed as N(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}, \sigma^2). There is a large class of general regression models where the relationship between the y_i's and the vector x is not assumed to be linear, but that can be converted to a linear model.
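A minimal numerical sketch of the least squares estimate \hat{\beta} = (X^t X)^{-1} X^t Y, on simulated data of our own (in practice a QR- or SVD-based solver such as numpy's lstsq is preferred for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(3)
M, k = 1_000, 3
X = np.column_stack([np.ones(M), rng.standard_normal((M, k))])  # intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=M)

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)   # (X^t X)^{-1} X^t Y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically safer route
print(np.round(beta_normal_eq, 3), np.round(beta_lstsq, 3))
```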
The machine learning approach, compared to regression analysis, aims to select a function f \in F from a given set of functions F that best approximates or fits the given data. Machine learning assumes that the given data (x_i, y_i), i = 1, \ldots, M, is obtained by a data generator producing the data according to an unknown distribution p(x, y) = p(x)p(y|x). Given a loss function \Psi(y - f(x)), the quality of an approximation produced by the machine learning algorithm is measured by the expected loss, the expectation being taken under the unknown distribution p(x, y). The subject of statistical machine learning is the following optimization problem:

\min_{f \in F} \int \Psi(y - f(x)) \, dp(x, y)        (1.4)

when the density function p(x, y) is unknown but a random independent sample of (x_i, y_i) is given. If F is the set of all linear functions of x and \Psi(y - f(x)) = (y - f(x))^2, and if p(y|x) is normally distributed, then the minimization of (1.4) is equivalent to linear regression analysis.
3.2 Generalized Linear Models
Although in many cases the set of linear functions is good enough to model the relationship between the stochastic response y and x, it may not always suffice to represent the relationship. The generalized linear model increases the family of functions F that may represent the relationship between the response y and x. The tradeoff is between having a simple model and a more complex model representing the relationship between y and x. In the generalized linear model the distribution of y given x does not have to be normal, but can be any of the distributions in the exponential family (see McCullagh and Nelder, 1991). Instead of the expected value of y|x being a linear function, we have

g(E(y_i)) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}        (1.5)
where g(·) is a monotone differentiable function.
In the generalized additive models, g(E(y_i)) need not be a linear function of x but has the form:

g(E(y_i)) = \beta_0 + \sum_{j=1}^{k} \sigma_j(x_{ji})        (1.6)
where the \sigma_j(\cdot)'s are smooth functions. Note that neural networks are a special case of the generalized additive models. For example, the function computed by a multilayer feedforward neural network with one hidden layer is (see Chapter ... for detailed information):

y_i = f(x) = \sum_{l=1}^{m} \beta_l \cdot \sigma\left( \sum_{j=1}^{k} w_{jl} x_{ji} - \theta_l \right)        (1.7)

where m is the number of processing units in the hidden layer. The family of functions that can be computed depends on the number of neurons in the hidden layer and on the activation function \sigma. Note that a standard multilayer feedforward network with a smooth activation function \sigma can approximate any continuous function on a compact set to any degree of accuracy if and only if the network's activation function \sigma is not a polynomial (Leshno et al., 1993).
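A small sketch of the function in (1.7), written as a plain numpy forward pass (the weights below are random placeholders, used only to show the shape of the computation):

```python
import numpy as np

def one_hidden_layer(x, W, theta, beta, activation=np.tanh):
    """f(x) = sum_l beta_l * sigma(sum_j w_{jl} x_j - theta_l),
    i.e. a single-hidden-layer feedforward network as in (1.7)."""
    hidden = activation(x @ W - theta)   # one value per hidden unit
    return hidden @ beta

rng = np.random.default_rng(4)
k, m = 5, 10                             # number of inputs and hidden units
W = rng.standard_normal((k, m))
theta = rng.standard_normal(m)
beta = rng.standard_normal(m)
print(one_hidden_layer(rng.standard_normal(k), W, theta, beta))
```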
There are methods for fitting generalized additive models. However, unlike linear models, for which there exists a framework of statistical inference, for machine learning algorithms as well as generalized additive models no such framework has yet been developed. For example, using the statistical inference framework in linear regression one can test the hypothesis that all or part of the coefficients are zero.
The total sum of squares (SST) is equal to the sum of squares due to regression (SSR) plus the residual sum of squares (RSS_k), i.e.

\underbrace{\sum_{i=1}^{M} (y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{M} (\hat{y}_i - \bar{y})^2}_{SSR} + \underbrace{\sum_{i=1}^{M} (y_i - \hat{y}_i)^2}_{RSS_k}        (1.8)
The percentage of variance explained by the regression is a very popular measure of the goodness-of-fit of the model. More specifically, R^2 and the adjusted R^2 defined below are used to measure the goodness of fit.

R^2 = \frac{\sum_{i=1}^{M} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{M} (y_i - \bar{y})^2} = 1 - \frac{RSS_k}{SST}        (1.9)

\text{Adjusted-}R^2 = 1 - (1 - R^2)\,\frac{M - 1}{M - k - 1}        (1.10)

We next turn to a special case of the generalized additive model that is a very popular and powerful tool in cases where the responses are binary values.
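A short sketch computing (1.9) and (1.10) for a fitted linear model (a hypothetical continuation of the earlier simulated-data example):

```python
import numpy as np

def r_squared(y, y_hat, k):
    """R^2 and adjusted R^2 as in (1.9)-(1.10); k is the number of predictors."""
    M = len(y)
    rss = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / sst
    adj_r2 = 1.0 - (1.0 - r2) * (M - 1) / (M - k - 1)
    return r2, adj_r2

# Continuing the earlier regression sketch (X, y as simulated there):
# y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
# print(r_squared(y, y_hat, k=X.shape[1] - 1))
```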
3.3 Logistic regression
In logistic regression the y_i's are binary variables and thus are not normally distributed. The distribution of y_i given x is assumed to follow a binomial distribution such that:

\log\left( \frac{p(y_i = 1|x)}{1 - p(y_i = 1|x)} \right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}        (1.11)

If we denote \pi(x) = p(y = 1|x) and the real valued function g(t) = \log\frac{t}{1-t}, then g(\pi(x)) is a linear function of x. Note that we can write y = \pi(x) + \varepsilon, such that if y = 1 then \varepsilon = 1 - \pi(x) with probability \pi(x), and if y = 0 then \varepsilon = -\pi(x) with probability 1 - \pi(x). Thus, \pi(x) = E(y|x) and

\pi(x) = \frac{e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_j}}{1 + e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_j}}        (1.12)
Of the several methods for estimating the \beta's, the method of maximum likelihood is the one most commonly used in the logistic regression routines of the major software packages.
In linear regression, interest focuses on the size of R^2 or adjusted R^2. The guiding principle in logistic regression is similar: the comparison of observed to predicted values is based on the log likelihood function. To compare two models, a full model and a reduced model, one uses the following likelihood ratio:

D = -2 \ln\left( \frac{\text{likelihood of the reduced model}}{\text{likelihood of the full model}} \right)        (1.13)

The statistic D in equation (1.13) is called the deviance (McCullagh and Nelder, 1991). Logistic regression is a very powerful tool for classification problems in discriminant analysis and is applied in many medical and clinical research studies.
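As a hedged illustration of maximum likelihood fitting of (1.11), here is a bare-bones Newton-Raphson (iteratively reweighted least squares) routine on simulated data (a sketch only; production work would rely on an established statistical package):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood estimation of the logistic model (1.11) by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # pi(x) as in (1.12)
        w = p * (1.0 - p)                       # weights of the IRLS step
        grad = X.T @ (y - p)                    # gradient of the log likelihood
        hess = X.T @ (X * w[:, None])           # observed information matrix
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(500), rng.standard_normal((500, 2))])
true_beta = np.array([-0.5, 1.0, 2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(np.round(fit_logistic(X, y), 2))
```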
3.4 Survival analysis
Survival analysis addresses the question of how long it takes for a
particular event to happen. In many medical applications the most im-
portant response variable often involves time; the event is some hazard
or death and thus we analyze the patient’s survival time. In business
applications the event may be the failure of a machine or the market entry of a competitor. There are two main characteristics of survival analysis that make it different from regression analysis. The first is the presence of censored observations, where the event (e.g. death) has not necessarily occurred by the end of the study. Censored observations may also occur when patients are lost to follow-up for one reason or another. If the output is censored, we do not have the value of the output, but we do have some information about it. The second is that the distribution of survival times is often skewed or far from normality. These features require special methods of analysis of survival data,
two functions describing the distribution of survival times being of central importance: the hazard function and the survival function. Using T to represent survival time, the survival function, denoted by S(t), is defined as the probability that the survival time is greater than t, i.e. S(t) = Pr(T > t) = 1 - F(t), where F(t) is the cumulative distribution function of the output. The hazard function, h(t), is defined as the probability density of the output at time t conditional upon survival to time t, that is h(t) = f(t)/S(t), where f(t) is the probability density of the output. It is also known as the instantaneous failure rate and gives the probability that the event will happen in a small time interval \Delta t, given that the individual has survived up to the beginning of this interval, i.e.

h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid t \le T)}{\Delta t} = f(t)/S(t).

The hazard function may remain constant, increase, decrease or take some more complex shape. Most modeling of survival data is done using a proportional-hazard model, which assumes that the hazard function is of the form
h(t) = \alpha(t) \exp\left( \beta_0 + \sum_{i=1}^{n} \beta_i x_i \right)        (1.14)
where \alpha(t) is a hazard function on its own, called the baseline hazard function, corresponding to that for the average value of all the covariates x_1, \ldots, x_n. The model is called a proportional-hazard model because the hazard functions for two different patients have a constant ratio. The interpretation of the \beta's in this model is that their effect is multiplicative.
There are several approaches to survival data analysis. The simplest is to assume that the baseline hazard function is constant, which is equivalent to assuming an exponential distribution. Another simple approach would be to assume that the baseline hazard function belongs to a two-parameter family of functions, like the Weibull distribution. In these cases the standard methods such as maximum likelihood can be used. In other cases one may restrict \alpha(t), for example by assuming it to be monotonic. In business applications the baseline hazard function can be determined by experimentation, but in medical situations it is not practical to carry out an experiment to determine the shape of the base-
line hazard function. The Cox proportional hazards model ( Cox, 1972),
introduced to overcome this problem, has become the most commonly
used procedure for modelling the relationship of covariates to a survival
outcome and it is used in almost all medical analyses of survival data.
Estimation of the β’s is based on the partial likelihood function intro-
duced by Cox ( Cox, 1972; Therneau and Grambsch, 2000).
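A brief numerical sketch of these quantities under a Weibull baseline hazard with a proportional covariate effect (all parameter values below are our own illustrative choices):

```python
import numpy as np

def weibull_baseline_hazard(t, shape=1.5, scale=10.0):
    """alpha(t) for a Weibull baseline: (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def proportional_hazard(t, x, beta):
    """h(t|x) = alpha(t) * exp(beta'x), the proportional-hazard form of (1.14)."""
    return weibull_baseline_hazard(t) * np.exp(x @ beta)

t = np.linspace(0.1, 20.0, 5)
beta = np.array([0.7, -0.3])
patient_a = np.array([1.0, 0.0])
patient_b = np.array([0.0, 1.0])
ratio = proportional_hazard(t, patient_a, beta) / proportional_hazard(t, patient_b, beta)
print(np.round(ratio, 3))   # constant in t: the "proportional" in proportional hazards
```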
There are many other important statistical themes that are highly
relevant to DM, among them: statistical classification methods, spline
and wavelets, decision trees and others (see Chapters ... for more detailed
information on these issues). In the next section we elaborate on the
False Discovery Rate (FDR) method (Benjamini and Hochberg, 1995),
a most salient feature of DM.
4. False Discovery Rate (FDR) Control in
Hypothesis Testing
As noted before, there is a feeling that the testing of a hypothesis is irrelevant in DM. However, the problem of separating a real phenomenon from its background noise is just as fundamental a concern in DM as in statistics. Take for example an association rule with an observed lift which is bigger than 1, as desired. Is it also significantly bigger than 1 in the statistical sense, that is, beyond what is expected to happen as a result of noise? The answer to this question is given by testing the hypothesis that the lift is 1. However, in DM a hypothesis is rarely tested alone, as the above point demonstrates. The tested hypothesis is always a member of a larger family of similar hypotheses, all association rules of at least a given support and confidence being tested simultaneously. Thus, the testing of hypotheses in DM always invokes the "Multiple Comparisons Problem" so often discussed in statistics. This problem is interesting in itself, as the first DM problem in the statistics of 50 years ago did just that: when a feature of interest (a variable) is measured on 10 subgroups (treatments), and the mean values are compared to some reference value (such as 0), the problem is a small one; but take these same means and search among all pairwise comparisons between the treatments to find a significant difference, and the number of comparisons increases to 10*(10-1)/2 = 45, which is in general quadratic in the number of treatments. It becomes clear that if we allow a .05 probability of deciding that a difference exists in a single comparison even when it really does not, thereby making a false discovery (or a type I error in statistical terms), we can expect to find on the average 2.25 such errors in our pool of discoveries. No wonder this DM activity is sometimes described in statistics as "post hoc analysis" - a nice definition for DM with a traditional flavor.
The attitude that has been taken during 45 years of statistical research is that in such problems the probability of making even one false discovery should be controlled, that is, controlling the Family Wise Error rate (FWE), as it is called. The simplest way to address the multiple comparisons problem, and offer FWE control at some desired level α, is to use the Bonferroni procedure: conduct each of the m tests at level α/m. In problems where m becomes very large the penalty to the researcher from the extra caution becomes heavy, in the sense that the probability of making any discovery becomes very small, and so it is not uncommon to simply ignore the need to adjust for multiplicity.
The False Discovery Rate (FDR), namely the expectation of the pro-
portion of false discoveries (rejected true null hypotheses) among the
discoveries (the rejected hypotheses), was developed by Benjamini and
Hochberg, 1995 to bridge these two extremes. When the null hypothesis
is true for all hypotheses - the FDR and FWE criteria are equivalent.
However, when there are some hypotheses for which the null hypotheses
are false, an FDR controlling procedure may yield many more discoveries
at the expense of having a small proportion of false discoveries.
Formally, let H_{0i}, i = 1, \ldots, m, be the tested null hypotheses. For i = 1, \ldots, m_0 the null hypotheses are true, and for the remaining m_1 = m - m_0 hypotheses they are not. Thus, any discovery about a hypothesis from the first set is a false discovery, while a discovery about a hypothesis from the second set is a true discovery. Let V denote the number of false discoveries and R the total number of discoveries. Let the proportion of false discoveries be

Q = \begin{cases} V/R & \text{if } R > 0 \\ 0 & \text{if } R = 0, \end{cases}

and define FDR = E(Q).
Benjamini and Hochberg advocated that the FDR should be controlled at some desirable level q, while maximizing the number of discoveries made. They offered the linear step-up procedure as a simple and general procedure that controls the FDR. The linear step-up procedure makes use of the m p-values, P = (P_1, \ldots, P_m), so in a sense it is very general, as it compares the ordered values P_{(1)} \le \cdots \le P_{(m)} to the set of constants linearly interpolated between q and q/m:
Definition 4.1 The Linear Step-Up Procedure: Let k = \max\{i : P_{(i)} \le iq/m\}, and reject the k hypotheses associated with P_{(1)}, \ldots, P_{(k)}. If no such k exists, reject none.
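A compact sketch of the linear step-up procedure, together with the FDR adjusted p-values discussed below, might read as follows (our own Python rendering of Definition 4.1, not taken from any particular package):

```python
import numpy as np

def linear_step_up(pvalues, q=0.05):
    """Linear step-up (BH) procedure at FDR level q.
    Returns a boolean rejection mask and the FDR adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # k = max{i : P_(i) <= i*q/m}, with i running over 1..m
    below = ranked <= np.arange(1, m + 1) * q / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1
        reject[order[:k]] = True
    # adjusted p-value of H_(i): min over j >= i of m * P_(j) / j
    adj_sorted = np.minimum.accumulate((m * ranked / np.arange(1, m + 1))[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adj_sorted, 1.0)
    return reject, adjusted
```

Rejecting exactly the hypotheses whose adjusted p-value is at most q reproduces the step-up decisions.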
The procedure was first suggested by Eklund (Seeger, 1968) and for-
gotten, then independently suggested by Simes (Simes, 1986). At both
points in time it went out of favor because it does not control the FWE.
Benjamini and Hochberg, 1995, showed that the procedure does control
the FDR, raising the interest in this procedure. Hence it is now referred
to as the Benjamini and Hochberg procedure (BH procedure), or (unfor-
tunately) the FDR procedure (e.g. in SAS). Here, we use the descriptive
term, i.e. the linear step-up procedure (for a detailed historical review
see Benjamini and Hochberg, 2000).
For the purpose of practical interpretation and flexibility in use, the results of the linear step-up procedure can also be reported in terms of the FDR adjusted p-values. Formally, the FDR adjusted p-value of H_{(i)} is p^{LSU}_{(i)} = \min\{\, mp_{(j)}/j \mid j \ge i \,\}. Thus the linear step-up procedure at level q is equivalent to rejecting all hypotheses whose FDR adjusted p-value is \le q.
It should also be noted that the dual linear step-down procedure, which uses the same constants but starts with the smallest p-value and stops at the last i for which P_{(i)} \le iq/m, also controls the FDR (Sarkar, 2002). Even though it is obviously less powerful, it is sometimes easier to calculate in very large problems.
The linear step-up procedure is quite striking in its ability to control the FDR at precisely q \cdot m_0/m, regardless of the distributions of the test statistics corresponding to false null hypotheses (when the distributions under the simple null hypotheses are independent and continuous).
Benjamini and Yekutieli (Benjamini and Yekutieli, 2001) studied the procedure under dependency. For some types of positive dependency they showed that the above remains an upper bound. Even under the most general dependence structure, where the FDR is controlled merely at level q(1 + 1/2 + 1/3 + \cdots + 1/m), it is again conservative by the same factor m_0/m (Benjamini and Yekutieli, 2001).
Knowledge of m_0 can therefore be very useful in this setting to improve upon the performance of the FDR controlling procedure. Were this information to be given to us by an "oracle", the linear step-up procedure with q_0 = q \cdot m/m_0 would control the FDR at precisely the desired level q in the independent and continuous case. It would then be more powerful in rejecting many of the hypotheses for which the alternative holds. In some precise asymptotic sense, Genovese and Wasserman (Genovese and Wasserman, 2002a) showed it to be the best possible procedure.
Schweder and Spjotvoll (Schweder and Spjotvoll, 1982) were the first to try and estimate this factor, albeit informally. Hochberg and Benjamini (Hochberg and Benjamini, 1990) formalized the approach. Benjamini and Hochberg (Benjamini and Hochberg, 2000) incorporated it into the linear step-up procedure, and other adaptive FDR controlling procedures make use of other estimators (see Efron et al., 2001; Storey, 2002, and Storey, Taylor and Siegmund, 2004). Benjamini, Krieger and Yekutieli, 2001, offer a very simple and intuitive two-stage procedure based on the idea that the value of m_0 can be estimated from the results of the linear step-up procedure itself, and prove that it controls the FDR at level q.
Definition 4.2 Two-Stage Linear Step-Up Procedure (TST):

1 Use the linear step-up procedure at level q_0 = q/(1 + q). Let r_1 be the number of rejected hypotheses. If r_1 = 0 reject no hypotheses and stop; if r_1 = m reject all m hypotheses and stop; otherwise continue.

2 Let \hat{m}_0 = m - r_1.

3 Use the linear step-up procedure with q^* = q_0 \cdot m / \hat{m}_0.
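Continuing the earlier sketch, the two-stage procedure of Definition 4.2 can be written on top of the linear_step_up function given above (again only an illustration):

```python
import numpy as np

def two_stage_linear_step_up(pvalues, q=0.05):
    """Two-stage (TST) procedure: estimate m0 from a first BH pass at level
    q0 = q/(1+q), then rerun the linear step-up procedure at q0 * m / m0_hat."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    q0 = q / (1.0 + q)
    reject1, _ = linear_step_up(p, q=q0)
    r1 = int(reject1.sum())
    if r1 == 0 or r1 == m:
        return reject1
    m0_hat = m - r1
    reject2, _ = linear_step_up(p, q=q0 * m / m0_hat)
    return reject2
```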
Recent papers have illuminated the FDR from many different points
of view: asymptotic, Bayesian, empirical Bayes, as the limit of empirical
processes, and in the context of penalized model selection (Efron et al.,
2001; Storey, 2002; Genovese and Wasserman, 2002a; Abramovich et al.,
2001). Some of the studies have emphasized variants of the FDR, such
as its conditional value given some discovery is made (the positive FDR
in Storey, 2002), or the distribution of the proportion of false discover-
ies itself (the FDR in Genovese and Wasserman, 2002a; Genovese and
Wasserman, 2002b).
Studies on FDR methodologies have become a very active area of re-
search in statistics, many of them making use of the large dimension of
the problems faced, and in that respect relying on the blessing of di-
mensionality. FDR methodologies have not yet found their way into the
practice and theory of DM, though it is our opinion that they have a lot
to offer there, as the following example shows.
Example: Zytkov and Zembowicz, 1997; Zembowicz and Zytkov, 1996,
developed the 49er software to mine association rules using chi-square
tests of significance for the independence assumption, i.e. by testing
whether the lift is significantly > 1. Finding that too many of the m
potential rules are usually significant, they used 1/m as a threshold for
significance, comparing each p-value to the threshold, and choosing only
the rules that pass the threshold. Note that this is a Bonferroni-like
treatment of the multiplicity problem, controlling the FWE at α = 1.
Still, they further suggest increasing the threshold if a few hypotheses are
rejected. In particular they note that the performance of the threshold is
especially good if the largest p-value of the selected k rules is smaller than
k times the original 1/m threshold. This is exactly the BH procedure
used at level q = 1 and they arrived at it by merely checking the actual
performance on a specific problem. In spite of this remarkable success,
theory further tells us that it is important to use q < 1/2, and not 1,
to always get good performance. The preferable values for q are, as far
as we know, between 0.05 and 0.2. Such values for q further allow us to
conclude that only approximately q of the discovered association rules
are not real ones. With q = 1 such a statement is meaningless.
5. Model (Variables or Features) Selection using
FDR Penalization in GLM
Most of the commonly used variable selection procedures in linear models choose the appropriate subset by minimizing a model selection criterion of the form RSS_k + \sigma^2 k \lambda, where RSS_k is the residual sum of squares for a model with k parameters as defined in the previous section, and \lambda is the penalization parameter. For the generalized linear models discussed above, minus twice the logarithm of the likelihood of the model takes on the role of RSS_k, but for simplicity of exposition we shall continue with the simple linear model. This penalized sum of squares might ideally be minimized over all k and all subsets of variables of size k, but practically in larger problems it is usually minimized either by forward selection or backward elimination, adding or dropping one variable at a time. The different selection criteria can be identified by the value of \lambda they
use. Most traditional model selection criteria make use of a fixed \lambda and can also be described as fixed level testing. The Akaike Information Criterion (AIC) and the C_p criterion of Mallows both make use of \lambda = 2, and are equivalent to testing at level 0.16 whether the coefficient of each newly included variable in the model is different from 0. The usual backward and forward algorithms use similar testing at the .05 level, which is approximately equivalent to using \lambda = 4.
Note that when the selection of the model is conducted over a large
number of potential variables m, the implications of the above approach
can be disastrous. Take for example m = 500 variables, not an unlikely
situation in DM. Even if there is no connection whatsoever between
the predicted variable and the potential set of predicting variables, you
should expect to get 65 variables into the selected model - an unaccept-
able situation.
More recently, model selection approaches have been examined in the statistical literature in settings where the number of variables is large, even tending to infinity. Such studies, usually held under an assumption of orthogonality of the variables, have brought new insight into the choice of \lambda. Donoho and Johnstone (Donoho and Johnstone, 1995) suggested using \lambda = 2\log(m), whose square root is called the "universal threshold" in wavelet analysis. Note that the larger the pool over which the model is searched, the larger is the penalty per variable included. This threshold can also be viewed as a multiple testing Bonferroni procedure at the level \alpha_m, with .2 \le \alpha_m \le .4 for 10 \le m \le 10000. More recent studies have emphasized that the penalty should also depend on the size of the already selected model k, \lambda = \lambda_{k,m}, increasing in m and decreasing in k. They include Abramovich and Benjamini, 1996; Birge and Massart, 2001; Abramovich, Bailey and Sapatinas, 2000; Tibshirani and Knight, 1999; George and Foster, 2000, and Foster and Stine, 2004.
As a full review is beyond our scope, we shall focus on the suggestion that
is directly related to FDR testing.
In the context of wavelet analysis, Abramovich and Benjamini, 1996, suggested using FDR testing, thereby introducing a threshold that increases in m and decreases with k. Abramovich, Bailey and Sapatinas, 2000, were able to prove, in an asymptotic setup where m tends to infinity and the model is sparse, that using FDR testing is asymptotically minimax in a very wide sense. Their argument hinges on expressing the FDR testing as a penalized RSS as follows:

RSS_k + \sigma^2 \sum_{i=1}^{k} z^2_{\frac{i}{m} \cdot \frac{q}{2}},        (1.15)
where z_\alpha is the 1 - \alpha percentile of a standard normal distribution. This is equivalent to using \lambda_{k,m} = \frac{1}{k} \sum_{i=1}^{k} z^2_{\frac{i}{m} \cdot \frac{q}{2}} in the general form of the penalty.
When the models considered are sparse, the penalty is approximately 2\sigma^2 \log\left(\frac{m}{k} \cdot \frac{2}{q}\right). The FDR level controlled is q, which should be kept at a level strictly less than 1/2.
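To give a feel for how this FDR-based penalty compares with the fixed penalties mentioned above, here is a small hedged computation (the specific values of m, k and q are our own choices):

```python
import numpy as np
from scipy import stats

def fdr_penalty(k, m, q=0.05):
    """lambda_{k,m} = (1/k) * sum_{i=1}^{k} z^2_{(i/m)(q/2)}, the per-variable
    penalty implied by FDR testing, cf. equation (1.15)."""
    i = np.arange(1, k + 1)
    z = stats.norm.ppf(1.0 - (i / m) * (q / 2.0))   # z_alpha is the 1-alpha percentile
    return float(np.mean(z ** 2))

m = 500
for k in (1, 5, 25, 100):
    print(f"k={k:3d}  FDR penalty={fdr_penalty(k, m):5.1f}  "
          f"AIC/Cp=2.0  universal threshold={2 * np.log(m):.1f}")
```

As described above, the penalty per included variable decreases as the selected model grows and increases with the size of the pool being searched.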
In a followup study Gavrilov, 2003, investigated the properties of
such penalty functions using simulations, in setups where the number
of variables is large but finite, and where the potential variables are
correlated rather than orthogonal. The results show the dramatic failure
of all traditional ”fixed penalty per-parameter” approaches. She found
the FDR-penalized selection procedure to have the best performance in
terms of minimax behavior over a large number of situations likely to arise in practice, when the number of potential variables was more than 32 (and a close second in smaller cases). Interestingly, she recommends using q = .05, which turned out to be a well calibrated value of q for problems with up to 200 variables (the largest investigated).
Example: Foster and Stine, 2004, developed these ideas for the case
when the predicted variable is 0-1, demonstrating their usefulness in DM,
in developing a prediction model for loan default. They started with
approximately 200 potential variables for the model, but then added all
pairwise interactions to reach a set of some 50,000 potential variables.
Their article discusses in detail some of the issues reviewed above, and has a very nice and useful discussion of important computational aspects of the application of these ideas in a real, large DM problem.
6. Concluding Remarks
KDD and DM form a vaguely defined field, in the sense that the definition largely depends on the background and views of the definer. Fayyad defined DM as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Some definitions of DM emphasize the connection of DM to databases containing ample data. Another definition of KDD and DM is the following: "Nontrivial extraction of implicit, previously unknown and potentially useful information from data, or the search for relationships and global patterns that exist in databases". Although mathematics,
like computing is a tool for statistics, statistics has developed over a
long time as a subdiscipline of mathematics. Statisticians have devel-
oped mathematical theories to support their methods and a mathemati-
cal formulation based on probability theory to quantify the uncertainty.
Traditional statistics emphasizes a mathematical formulation and vali-
dation of its methodology rather than empirical or practical validation.
The emphasis on rigor has required a proof that a proposed method will work prior to its use. In contrast, computer sci-
ence and machine learning use experimental validation methods. Statis-
tics has developed into a closed discipline, with its own scientific jargon
and academic objectives that favor analytic proofs rather than practi-
cal methods for learning from data. We need to distinguish between the
theoretical mathematical background of statistics and its use as a tool in
many experimental scientific research studies. We believe that comput-
ing methodology and many of the other related issues in DM should be
incorporated into traditional statistics. An effort has to be made to cor-
rect the negative connotations that have long surrounded data mining in
the statistics literature (Chatfield, 1995) and the statistical community
will have to recognize that empirical validation does constitute a form
of validation (Friedman, 1998).
Although the terminology used in DM and statistics may differ, in
many cases the concepts are the same. For example, in neural net-
works we use terms like ”learning”, ”weights” and ”knowledge” while in
statistics we use ”estimation”, ”parameters” and ”value of parameters”,
respectively. Not all statistical themes are relevant to DM. For example,
as DM analyzes existing databases, experimental design is not relevant
to DM. However, many of them, including those covered in this chapter,
are highly relevant to DM and any data miner should be familiar with
them.
In summary, there is a need to increase the interaction and collaboration between data miners and statisticians. This can be done by overcoming the terminology barriers and by working on problems stemming from large databases. A question that has often been raised among statisti-
cians is whether DM is not merely part of statistics. The point of this
chapter was to show how each can benefit from the other, making the
inquiry from data a more successful endeavor, rather than dwelling on
theoretical issues of dubious value.
References
Abramovich F. and Benjamini Y., (1996). Adaptive thresholding of wavelet
coefficients. Computational Statistics & Data Analysis, 22:351–361.
Abramovich F., Bailey T .C. and Sapatinas T., (2000). Wavelet analysis
and its statistical applications. Journal of the Royal Statistical Society
Series D-The Statistician, 49:1–29.
Abramovich F., Benjamini Y., Donoho D. and Johnstone I., (2000).
Adapting to unknown sparsity by controlling the false discovery rate.
Technical Report 2000-19, Department of Statistics, Stanford Univer-
sity.
Benjamini Y. and Hochberg Y., (1995). Controlling the false discovery
rate: A practical and powerful approach to multiple testing. J. R.
Statist. Soc. B, 57:289–300.
Benjamini Y. and Hochberg Y., (2000). On the adaptive control of the
false discovery rate in multiple testing with independent statistics.
Journal of Educational and Behavioral Statistics, 25:60–83.
Benjamini Y., Krieger A.M. and Yekutieli D., (2001). Two staged linear
step up FDR controlling procedure. Technical report, Department of
Statistics and O.R., Tel Aviv University.
Benjamini Y. and Yekutieli D., (2001). The control of the false discov-
ery rate in multiple testing under dependency. Annals of Statistics,
29:1165–1188.
Berthold M. and Hand D., (1999). Intelligent Data Analysis: An Intro-
duction. Springer.
Birge L. and Massart P., (2001). Gaussian model selection. Journal of
the European Mathematical Society, 3:203–268.
Chatfield C., (1995). Model uncertainty, data mining and statistical in-
ference. Journal of the Royal Statistical Society A, 158:419–466.
Cochran W.G., (1977). Sampling Techniques. Wiley.
Cox D.R., (1972). Regression models and life-tables. Journal of the Royal
Statistical Society B, 34:187–220.
Dell’Aquila R. and Ronchetti E.M., (2004). Introduction to Robust Statis-
tics with Economic and Financial Applications. Wiley.
Donoho D.L. and Johnstone I.M., (1995). Adapting to unknown smooth-
ness via wavelet shrinkage. Journal of the American Statistical Asso-
ciation, 90:1200–1224.
Donoho D., (2000). High-dimensional data analysis: The curses and blessings of dimensionality. American Mathematical Society lecture, Math Challenges of the 21st Century.
Efron B., Tibshirani R.J., Storey J.D. and Tusher V., (2001). Empirical
Bayes analysis of a microarray experiment. Journal of the American
Statistical Association, 96:1151–1160.
Friedman J.H., (1998). Data Mining and Statistics: What’s the connec-
tions?, Proc. 29th Symposium on the Interface (D. Scott, editor).
Foster D.P. and Stine R.A., (2004). Variable selection in data mining:
Building a predictive model for bankruptcy. Journal of the American
Statistical Association, 99:303–313.
Gavrilov Y., (2003). Using the false discovery rate criteria for model
selection in linear regression. M.Sc. Thesis, Department of Statistics,
Tel Aviv University.
Genovese C. and Wasserman L., (2002a). Operating characteristics and
extensions of the false discovery rate procedure. Journal of the Royal
Statistical Society Series B, 64:499–517.
Genovese C. and Wasserman L., (2002b). A stochastic process approach
to false discovery rates. Technical Report 762, Department of Statis-
tics, Carnegie Mellon University.
George E.I. and Foster D.P., (2000). Calibration and empirical Bayes
variable selection. Biometrika, 87:731–748.
Hand D., (1998). Data mining: Statistics and more? The American Statis-
tician, 52:112–118.
Hand D., Mannila H. and Smyth P., (2001). Principles of Data Mining.
MIT Press.
Han J. and Kamber M., (2001). Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers.
Hastie T., Tibshirani R. and Friedman J., (2001). The Elements of Sta-
tistical Learning: Data Mining, Inference, and Prediction. Springer.
Hochberg Y. and Benjamini Y., (1990). More powerful procedures for
multiple significance testing. Statistics in Medicine, 9:811–818.
Leshno M., Lin V.Y., Pinkus A. and Schocken S., (1993). Multilayer
feedforward networks with a nonpolynomial activation function can
approximate any function. Neural Networks, 6:861–867.
McCullagh P. and Nelder J.A., (1991). Generalized Linear Models. Chap-
man & Hall.
Meilijson I., (1991). The expected value of some functions of the convex
hull of a random set of points sampled in R^d. Israel Journal of
Mathematics, 72:341–352.
Mosteller F. and Tukey J.W., (1977). Data Analysis and Regression: A
Second Course in Statistics. Wiley.
Roberts S. and Everson R. (editors), (2001). Independent Component
Analysis: Principles and Practice. Cambridge University Press.
Ronchetti E.M., Hampel F.R., Rousseeuw P.J. and Stahel W.A., (1986).
Robust Statistics: The Approach Based on Influence Functions. Wiley.
Sarkar S.K., (2002). Some results on false discovery rate in stepwise
multiple testing procedures. Annals of Statistics, 30:239–257.
Schweder T. and Spjotvoll E., (1982). Plots of p-values to evaluate many
tests simultaneously. Biometrika, 69:493–502.
Seeger P., (1968). A note on a method for the analysis of significances
en masse. Technometrics, 10:586–593.
Simes R.J., (1986). An improved Bonferroni procedure for multiple tests
of significance. Biometrika, 73:751–754.
Storey J.D., (2002). A direct approach to false discovery rates. Journal
of the Royal Statistical Society Series B, 64:479–498.
Storey J.D., Taylor J.E. and Siegmund D., (2004). Strong control, con-
servative point estimation, and simultaneous conservative consistency
of false discovery rates: A unified approach. Journal of the Royal Sta-
tistical Society Series B, 66:187–205.
Therneau T.M. and Grambsch P.M., (2000). Modeling Survival Data:
Extending the Cox Model. Springer.
Tibshirani R. and Knight K., (1999). The covariance inflation criterion
for adaptive model selection. Journal of the Royal Statistical Society
Series B, 61(3):529–546.
Zembowicz R. and Zytkov J.M., (1996). From contingency tables to vari-
ous forms of knowledge in databases. In U.M. Fayyad, R. Uthurusamy,
G. Piatetsky-Shapiro and P. Smyth (editors), Advances in Knowledge
Discovery and Data Mining (pp. 329–349). MIT Press.
Zytkov J.M. and Zembowicz R., (1997). Contingency tables as the foun-
dation for concepts, concept hierarchies and rules: The 49er system
approach. Fundamenta Informaticae, 30:383–399.