An empirical study of the naive Bayes classifier
I. Rish
T.J. Watson Research Center
rish@us.ibm.com
Abstract
The naive Bayes classifier greatly simplifies learning by assuming that features are independent given the class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers.

Our broad goal is to understand the data characteristics which affect the performance of naive Bayes. Our approach uses Monte Carlo simulations that allow a systematic study of classification accuracy for several classes of randomly generated problems. We analyze the impact of the distribution entropy on the classification error, showing that low-entropy feature distributions yield good performance of naive Bayes. We also demonstrate that naive Bayes works well for certain nearly-functional feature dependencies, thus reaching its best performance in two opposite cases: completely independent features (as expected) and functionally dependent features (which is surprising). Another surprising result is that the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features. Instead, a better predictor of naive Bayes accuracy is the amount of information about the class that is lost because of the independence assumption.
1 Introduction
Bayesian classifiers assign the most likely class to a given example described by its feature vector. Learning such classifiers can be greatly simplified by assuming that features are independent given the class, that is, $P(\mathbf{X}|C) = \prod_{i=1}^{n} P(X_i|C)$, where $\mathbf{X} = (X_1, \dots, X_n)$ is a feature vector and $C$ is a class.
Despite this unrealistic assumption, the resulting classifier, known as naive Bayes, is remarkably successful in practice, often competing with much more sophisticated techniques [6; 8; 4; 2]. Naive Bayes has proven effective in many practical applications, including text classification, medical diagnosis, and systems performance management [2; 9; 5].
T.J. Watson Research Center, 30 Saw Mill River Road,
Hawthorne, NY 10532. Phone +1 (914) 784-7431
The success of naive Bayes in the presence of feature dependencies can be explained as follows: optimality in terms of zero-one loss (classification error) is not necessarily related to the quality of the fit to a probability distribution (i.e., the appropriateness of the independence assumption). Rather, an optimal classifier is obtained as long as both the actual and the estimated distributions agree on the most-probable class [2]. For example, [2] proves naive Bayes optimality for some classes of problems that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts.
However, this explanation is too general and therefore not very informative. Ultimately, we would like to understand the data characteristics which affect the performance of naive Bayes. While most of the work on naive Bayes compares its performance to that of other classifiers on particular benchmark problems (e.g., UCI benchmarks), our approach uses Monte Carlo simulations that allow a more systematic study of classification accuracy on parametric families of randomly generated problems. Also, our current analysis is focused only on the bias of the naive Bayes classifier, not on its variance. Namely, we assume an infinite amount of data (i.e., perfect knowledge of the data distribution), which allows us to separate the approximation error (bias) of naive Bayes from the error induced by the training sample size (variance).
We analyze the impact of the distribution entropy on the classification error, showing that certain almost-deterministic, or low-entropy, dependencies yield good performance of naive Bayes. (Note that almost-deterministic dependencies are a common characteristic in many practical problem domains, such as, for example, computer system management and error-correcting codes.) We show that the error of naive Bayes vanishes as the distribution entropy approaches zero. Another class of almost-deterministic dependencies generalizes functional dependencies between the features. Particularly, we show that naive Bayes works best in two cases: completely independent features (as expected) and functionally dependent features (which is less obvious), while reaching its worst performance between these extremes.
We also show that, surprisingly, the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features, $I(X_i; X_j|C)$ (where $X_i$ and $X_j$ are features and $C$ is the class). Instead, our experiments reveal that a better predictor of naive Bayes accuracy can be the loss of information that the features contain about the class when assuming the naive Bayes model, namely $I(C; \mathbf{X}) - I_{NB}(C; \mathbf{X})$, where $I_{NB}(C; \mathbf{X})$ is the mutual information between the features and the class under the naive Bayes assumption.
This paper is structured as follows. In the next section we provide the necessary background and definitions. Section 3 discusses naive Bayes performance for nearly-deterministic dependencies, while Section 4 demonstrates that the "information loss" criterion can be a better error predictor than the strength of the feature dependencies. A summary and conclusions are given in Section 5.
2 Definitions and Background
Let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector of observed random variables, called features, where each feature $X_i$ takes values from its domain $D_i$. The set of all feature vectors (examples, or states) is denoted $\Omega = D_1 \times \cdots \times D_n$. Let $C$ be an unobserved random variable denoting the class of an example, where $C$ can take one of $m$ values in $\{0, \dots, m-1\}$. Capital letters, such as $X_i$, will denote variables, while lower-case letters, such as $x_i$, will denote their values; boldface letters will denote vectors.
A function $f: \Omega \rightarrow \{0, \dots, m-1\}$, where $f(\mathbf{x})$ is the class assigned to example $\mathbf{x}$, denotes a concept to be learned. A deterministic $f$ corresponds to a concept without noise, which always assigns the same class to a given example (e.g., disjunctive and conjunctive concepts are deterministic). In general, however, a concept can be noisy, yielding a random function $f(\mathbf{x})$.

A classifier is defined by a (deterministic) function $h: \Omega \rightarrow \{0, \dots, m-1\}$ (a hypothesis) that assigns a class to any given example. A common approach is to associate each class $i$ with a discriminant function $f_i(\mathbf{x})$, $i = 0, \dots, m-1$, and let the classifier select the class with the maximum discriminant function on a given example: $h(\mathbf{x}) = \arg\max_{i \in \{0, \dots, m-1\}} f_i(\mathbf{x})$.
The Bayes classifier (that we also call the Bayes-optimal classifier and denote $h^*$) uses as discriminant functions the class posterior probabilities given a feature vector, i.e. $f_i^*(\mathbf{x}) = P(C = i \mid \mathbf{X} = \mathbf{x})$. Applying Bayes rule gives $P(C = i \mid \mathbf{X} = \mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid C = i) P(C = i) / P(\mathbf{X} = \mathbf{x})$, where $P(\mathbf{X} = \mathbf{x})$ is identical for all classes, and therefore can be ignored. This yields the Bayes discriminant functions

$f_i^*(\mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid C = i) P(C = i)$,   (1)

where $P(\mathbf{X} = \mathbf{x} \mid C = i)$ is called the class-conditional probability distribution (CPD). Thus, the Bayes classifier

$h^*(\mathbf{x}) = \arg\max_i P(\mathbf{X} = \mathbf{x} \mid C = i) P(C = i)$   (2)

finds the maximum a posteriori probability (MAP) hypothesis given example $\mathbf{x}$. However, direct estimation of $P(\mathbf{X} \mid C)$ from a given set of training examples is hard when the feature space is high-dimensional. Therefore, approximations are commonly used, such as the simplifying assumption that the features are independent given the class. This yields the naive Bayes classifier defined by the discriminant functions

$f_i^{NB}(\mathbf{x}) = \prod_{j=1}^{n} P(X_j = x_j \mid C = i) P(C = i)$.   (3)
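To make Eqs. (2) and (3) concrete, here is a minimal sketch of the two decision rules on a small discrete joint distribution. The XOR-like table `P` and the function names are illustrative assumptions of ours, not data from the paper:

```python
import numpy as np

# A hypothetical XOR-like joint distribution P(C, X1, X2) over a binary
# class and two binary features (the table itself is an illustrative
# assumption): class 0 concentrates on x1 == x2, up to noise.
P = np.array([[[0.20, 0.05],
               [0.05, 0.20]],
              [[0.05, 0.20],
               [0.20, 0.05]]])   # P[c, x1, x2]

def bayes_classify(x1, x2):
    """Bayes-optimal rule (2): argmax_i P(X=x | C=i) P(C=i)."""
    return int(np.argmax(P[:, x1, x2]))  # P(c, x) = P(x|c) P(c)

def naive_bayes_classify(x1, x2):
    """Naive Bayes rule (3): argmax_i P(x1|C=i) P(x2|C=i) P(C=i)."""
    prior = P.sum(axis=(1, 2))             # P(C=i)
    p_x1 = P.sum(axis=2) / prior[:, None]  # P(X1|C)
    p_x2 = P.sum(axis=1) / prior[:, None]  # P(X2|C)
    return int(np.argmax(p_x1[:, x1] * p_x2[:, x2] * prior))
```

Here the marginals $P(X_1|C)$ and $P(X_2|C)$ are uniform, so the naive Bayes discriminants tie on every example, while rule (2) classifies all four examples correctly; this is an instance of the XOR limitation discussed in Section 3.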
The probability of a classification error, or risk, of a classifier $h$ is defined as $R(h) = E_{P(\mathbf{x})}[P(h(\mathbf{x}) \neq C \mid \mathbf{X} = \mathbf{x})]$, where $E_{P(\mathbf{x})}$ is the expectation over $P(\mathbf{x})$. $R^*$ denotes the Bayes error (Bayes risk).

We say that a classifier $h$ is optimal on a given problem if its risk coincides with the Bayes risk. Assuming there is no noise (i.e. zero Bayes risk), a concept is called separable by a set of functions $\{g_i(\mathbf{x})\}_{i=0}^{m-1}$ if every example $\mathbf{x}$ is classified correctly when using each $g_i$ as the discriminant functions.
As a measure of dependence between two features $X_i$ and $X_j$ we use the class-conditional mutual information [1], which can be defined as

$I(X_i; X_j \mid C) = H(X_i \mid C) + H(X_j \mid C) - H(X_i, X_j \mid C)$,

where $H(X_i \mid C)$ is the class-conditional entropy of $X_i$, defined as

$H(X_i \mid C) = -\sum_{c} P(C = c) \sum_{x_i} P(x_i \mid c) \log P(x_i \mid c)$.

Mutual information is zero when $X_i$ and $X_j$ are mutually independent given the class $C$, and increases with the level of dependence, reaching its maximum when one feature is a deterministic function of the other.
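The quantity above can be computed directly from a joint probability table. The sketch below is our illustration, assuming a table layout `P[c, x1, x2]` and base-2 logarithms:

```python
import numpy as np

def cond_mutual_info(P):
    """I(X1; X2 | C) = H(X1|C) + H(X2|C) - H(X1,X2|C), computed from a
    joint table P[c, x1, x2] = P(C=c, X1=x1, X2=x2) (base-2 logs)."""
    pc = P.sum(axis=(1, 2))                      # P(C=c)

    def cond_entropy(p):
        # H(.|C) = -sum_c sum_x P(c, x) log2 P(x|c), for a table P(c, x)
        shape = (-1,) + (1,) * (p.ndim - 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            logp = np.where(p > 0, np.log2(p / pc.reshape(shape)), 0.0)
        return -np.sum(p * logp)

    h1 = cond_entropy(P.sum(axis=2))             # H(X1 | C)
    h2 = cond_entropy(P.sum(axis=1))             # H(X2 | C)
    h12 = cond_entropy(P)                        # H(X1, X2 | C)
    return h1 + h2 - h12
```

On a conditionally independent table this returns 0; when one feature is a deterministic copy of the other, it returns the full conditional entropy, matching the extremes described above.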
3 When does naive Bayes work well? Effects
of some nearly-deterministic dependencies
In this section, we discuss known limitations of naive Bayes and then some conditions of its optimality and near-optimality, which include low-entropy feature distributions and nearly-functional feature dependencies.
3.1 Concepts without noise
We focus first on concepts where $P(C = i \mid \mathbf{x})$ is either 0 or 1 for any $\mathbf{x}$ and $i$ (i.e. no noise), and which therefore have zero Bayes risk. The features are assumed to have finite domains (the $i$-th feature has $k_i$ values), and are often called nominal. (A nominal feature can be transformed into a numeric one by imposing an order on its domain.) Our attention will be restricted to binary classification problems where the class is either 0 or 1.
Some limitations of naive Bayes are well-known: in the case of binary features ($k_i = 2$ for all $i$), it can only learn linear discriminant functions [3], and thus it is always suboptimal for non-linearly separable concepts (the classical example is the XOR function; another one is $m$-of-$n$ concepts [7; 2]). When $k_i > 2$ for some features, naive Bayes is able to learn (some) polynomial discriminant functions [3]; thus, polynomial separability is a necessary, although not sufficient, condition of naive Bayes optimality for concepts with finite-domain features.
Despite its limitations, naive Bayes was shown to be optimal for some important classes of concepts that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts [2]. These results can be generalized to concepts with any nominal features (see [10] for details):
Theorem 1 [10] The naive Bayes classifier is optimal for any two-class concept with nominal features that assigns class 0 to exactly one example, and class 1 to all other examples, with probability 1.¹
The performance of naive Bayes degrades with an increasing number of class-0 examples (i.e., with increasing prior $P(C = 0)$, also denoted $P(0)$), as demonstrated in Figure 1a. This figure plots the average naive Bayes error computed over 1000 problem instances generated randomly for each value of $P(0)$. The problem generator, called ZeroBayesRisk, assumes $n$ features (here we only consider two features), each having $k$ values, and varies the number $l$ of class-0 examples from 1 to $k^n/2$ (so that $P(0) = l/k^n$ varies from $1/k^n$ to 0.5; the results for $P(0) > 0.5$ are symmetric).² As expected, a larger $l$ (equivalently, a larger $P(0)$) yields a wider range of problems with various dependencies among the features, which results in increased errors of naive Bayes; a closer look at the data shows no other cases of optimality besides $l = 1$.
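A minimal sketch of such a generator is below; the paper does not spell out its exact sampling scheme, so the uniform $P(\mathbf{x})$ and the `zero_bayes_risk_problem` interface are our assumptions:

```python
import itertools
import random

def zero_bayes_risk_problem(k=10, n=2, l=1, rng=random):
    """Sketch of ZeroBayesRisk (the sampling scheme and this interface
    are our assumptions): pick l of the k**n states to form class 0, so
    that P(0) = l / k**n under a uniform P(x); the remaining states form
    class 1. The concept is deterministic, hence zero Bayes risk."""
    states = list(itertools.product(range(k), repeat=n))
    class0 = set(rng.sample(states, l))
    concept = {x: (0 if x in class0 else 1) for x in states}
    p_x = {x: 1.0 / len(states) for x in states}   # uniform P(x)
    return concept, p_x
```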
Surprisingly, the strength of inter-feature dependencies, measured as the class-conditional mutual information $I(X_1; X_2 \mid C)$ (also denoted $I$), is not a good predictor of naive Bayes performance: while the average naive Bayes error increases monotonically with $P(0)$, the mutual information is non-monotone, reaching its maximum at an intermediate value of $P(0)$. This observation is consistent with previous empirical results on UCI benchmarks [2] that also revealed low correlation between the degree of feature dependence and the relative performance of naive Bayes with respect to other classifiers, such as C4.5, CN2, and PEBLS.
It turns out that the entropy of the class-conditional marginal distributions, $H(P(X_1 \mid c))$, is a better predictor of naive Bayes performance. Intuitively, low entropy of $P(X_1 \mid c)$ means that most of the class-0 examples are "concentrated around" one state (in the limit, this yields the optimality condition stated by Theorem 1). Indeed, plotting the average $H(P(X_1 \mid c))$ in Figure 1a demonstrates that both the average error and the average entropy increase monotonically in $P(0)$. Further discussion of low-entropy distributions is given next in the more general context of noisy (non-zero Bayes risk) classification problems.
3.2 Noisy concepts
Low-entropy feature distributions
Generally, concepts can be noisy, i.e. can have a non-deterministic $f(\mathbf{x})$ and thus a non-zero Bayes risk. A natural extension of the conditions of Theorem 1 to noisy concepts yields low-entropy, or "extreme", probability distributions, having almost all the probability mass concentrated in one state. Indeed, as shown in [10], the independence assumption becomes more accurate with decreasing entropy, which yields asymptotically optimal performance of naive Bayes. Namely,
Theorem 2 [10] Given that one of the following conditions holds:
¹Clearly, this also holds in the case of a single example of class 1.
²Note that in all experiments perfect knowledge of the data distribution (i.e., an infinite amount of data) is assumed in order to avoid the effect of finite sample size.
[Figure 1 appears here. Panel (a): NBerr, I(X1;X2|C), and H(P(x1|c)) vs. P(0) (n=2, m=2, k=10, N=1000). Panel (b): average Bayes (boptErr) and naive Bayes (NBerr) errors and I(X;Y|C) vs. delta (n=2, m=2, k=10). Panel (c): NBerr−boptErr and I(X;Y|C)/300 vs. delta (n=2, m=2, k=10).]
Figure 1: (a) Results for the generator ZeroBayesRisk (k=10, 1000 instances): average naive Bayes error (NBerr), class-conditional mutual information between the features, $I(X_1;X_2|C)$, and entropy of the marginal distribution, $H(P(X_1|c))$; the error bars correspond to the standard deviation of each measurement across the 1000 problem instances; (b) results for the generator EXTREME: average Bayes and naive Bayes errors and average $I(X_1;X_2|C)$; (c) results for the generator FUNC1: average difference between the naive Bayes error and the Bayes error ($R^*$, constant for all $\delta$), and scaled $I(X_1;X_2|C)$ (divided by 300).
1. the joint probability distribution $P(\mathbf{X} \mid C = i)$ is such that $P(\mathbf{x} \mid C = i) \geq 1 - \delta$ for some state $\mathbf{x}$, or
2. the set of marginal probability distributions $P(X_j \mid C = i)$ is such that, for each $j$, $P(x_j \mid C = i) \geq 1 - \delta$ for some value $x_j$,
then $R_{NB} - R^* \rightarrow 0$ as $\delta \rightarrow 0$.
The performance of naive Bayes on low-entropy distributions is demonstrated using a random problem generator called EXTREME. This generator takes the number of classes, $m$, the number of features, $n$, the number of values per feature, $k$, and the parameter $\delta$, and creates class-conditional feature distributions $P(\mathbf{X} \mid C = i)$, each satisfying the condition $P(\mathbf{x} \mid C = i) \geq 1 - \delta$ if $\mathbf{x} = \mathbf{x}_i$, where the $\mathbf{x}_i$ are different states randomly selected from the $k^n$ possible states. For each class $i$, the remaining probability mass in $P(\mathbf{X} \mid C = i)$ is randomly distributed among the remaining states. Class prior distributions are uniform. Once $P(\mathbf{X}, C)$ is generated, the naive Bayes classifier (NB) is compared against the Bayes-optimal classifier (BO).
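The EXTREME construction above might be sketched as follows; exactly how the residual $\delta$ mass is spread is an assumption on our part:

```python
import numpy as np

def extreme_cpd(m=2, n=2, k=10, delta=0.1, rng=None):
    """Sketch of EXTREME: for each class i, put mass (1 - delta) on one
    randomly chosen state x_i (distinct across classes) and spread the
    remaining delta mass randomly over the other states. (How the
    residual mass is spread is our assumption.)"""
    rng = rng or np.random.default_rng(0)
    size = k ** n
    P = np.zeros((m, size))                  # P[i, s] = P(x_s | C=i)
    peaks = rng.choice(size, size=m, replace=False)
    for i in range(m):
        rest = rng.random(size)
        rest[peaks[i]] = 0.0
        rest *= delta / rest.sum()           # residual mass sums to delta
        P[i] = rest
        P[i, peaks[i]] = 1.0 - delta
    return P                                 # each row sums to 1
```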
Figure 1b shows that, as expected, the naive Bayes error (both the average and the maximum) converges to zero as $\delta \rightarrow 0$ (simulation performed on a set of 500 problems with $n = 2$, $m = 2$, $k = 10$). Note that, similarly to the previous observations, the error of naive Bayes is not a monotone function of the strength of the feature dependencies; namely, the average class-conditional mutual information plotted in Figure 1b is a concave function reaching its maximum at an intermediate value of $\delta$, while the decrease of the average naive Bayes error is monotone in $\delta$.
Almost-functional feature dependencies
Another "counterintuitive" example that demonstrates the non-monotonic relation between feature dependence and naive Bayes accuracy is the case of certain functional and nearly-functional dependencies among the features. Formally,

Theorem 3 [10] Given equal class priors, naive Bayes is optimal if every feature satisfies $X_i = f_i(X_1)$, where each $f_i$ is a one-to-one mapping.³

Namely, naive Bayes can be optimal in situations just opposite to class-conditional feature independence (when mutual information is at its minimum), that is, in cases of completely deterministic dependence among the features (when mutual information achieves its maximum). For example, Figure 1c plots the simulation results obtained using a "nearly-functional" feature distribution generator called FUNC1, which assumes uniform class priors and two features, each having $k$ values, and "relaxes" the functional dependence between the features using a noise parameter $\delta$. Namely, this generator selects a random permutation of $k$ numbers, which corresponds to a one-to-one function $f$ that binds the two features: $X_2 = f(X_1)$. Then it randomly generates two class-conditional (marginal) distributions for the first feature, $P(X_1 \mid 0)$ and $P(X_1 \mid 1)$, for class 0 and class 1, respectively. Finally, it creates class-conditional joint feature distributions satisfying the following conditions:

$P(x_1, x_2 \mid c) = (1 - \delta) P(x_1 \mid c) + \frac{\delta}{k} P(x_1 \mid c)$ if $x_2 = f(x_1)$, and

$P(x_1, x_2 \mid c) = \frac{\delta}{k} P(x_1 \mid c)$ otherwise.

This way the states satisfying the functional dependence obtain at least a $(1 - \delta)$ fraction of the probability mass, so that by controlling $\delta$ we can get as close as we want to the functional dependence described before, i.e. the generator relaxes the conditions of Theorem 3. Note that, on the other hand, $\delta = 1$ gives us uniform distributions over the second feature, $P(x_2 \mid x_1, c) = 1/k$, which makes $X_2$ independent of $X_1$ (given class $C$). Thus varying $\delta$ from 0 to 1 explores the whole range from deterministic dependence to complete independence between the features given the class.

³A similar observation was made in [11], but the important "one-to-one" condition on the functional dependencies was not mentioned there. However, it is easy to construct an example of a non-one-to-one functional dependence between the features that yields a non-zero error of naive Bayes.
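A sketch of FUNC1 under one plausible reading of the construction (the exact way the $\delta$ mass is spread over non-functional states is our assumption) might look like:

```python
import numpy as np

def func1_cpd(k=10, delta=0.2, rng=None):
    """Sketch of FUNC1 (the delta-mass spreading rule is our assumption).
    Builds P(x1, x2 | c) for c = 0, 1 from a random one-to-one map f and
    random marginals P(x1 | c)."""
    rng = rng or np.random.default_rng(0)
    f = rng.permutation(k)                    # one-to-one map: x2 = f(x1)
    P = np.empty((2, k, k))
    for c in range(2):
        p1 = rng.random(k)
        p1 /= p1.sum()                        # random P(x1 | c)
        # a delta fraction of each row's mass spread uniformly over all x2,
        # the remaining (1 - delta) concentrated on the state x2 = f(x1):
        P[c] = delta * p1[:, None] / k
        P[c, np.arange(k), f] += (1.0 - delta) * p1
    return P   # delta=0: functional dependence; delta=1: independence
```

At $\delta = 0$ all mass sits on the functional states, and at $\delta = 1$ the second feature is uniform and independent of the first, matching the two extremes described in the text.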
The results for 500 problems with $k = 10$ are summarized in Figure 1c, which plots the difference between the average naive Bayes error and the average Bayes risk (which turned out to be a constant for all $\delta$) against $\delta$. We can see that naive Bayes is optimal when $\delta = 0$ (functional dependence) and when $\delta = 1$ (complete independence), while its maximum error is reached between the two extremes. On the other hand, the class-conditional mutual information decreases monotonically in $\delta$, from its maximum at $\delta = 0$ (functional dependence) to its minimum at $\delta = 1$ (complete independence).⁴
4 Information loss: a better error predictor
than feature dependencies?
As we observed before, the strength of the feature dependencies (i.e. the class-conditional mutual information between the features) "ignored" by naive Bayes is not a good predictor of its classification error. This makes us look for a better parameter that estimates the impact of the independence assumption on classification.
We start with a basic question: which dependencies between features can be ignored when solving a classification task? Clearly, the dependencies which do not help to distinguish between different classes, i.e. do not provide any information about the class. Formally, let $I(C; \mathbf{X})$ be the mutual information between the features and the class (note the difference from the class-conditional mutual information) given the "true" distribution $P(C, \mathbf{X})$, while $I_{NB}(C; \mathbf{X})$ is the same quantity computed for $P_{NB}(C, \mathbf{X}) = P(C) \prod_{i=1}^{n} P(X_i \mid C)$, the naive Bayes approximation of $P(C, \mathbf{X})$. Then the parameter $I_{diff} = I(C; \mathbf{X}) - I_{NB}(C; \mathbf{X})$ measures the amount of information about the class which is "lost" due to the naive Bayes assumption. Figure 2a shows that the average $I_{diff}$ ("information loss") increases monotonically with $P(0)$, just as the average error of naive Bayes. More interestingly, Figure 2b plots the average naive Bayes error versus the average $I_{diff}$ for three different values of $k$ ($k = 5, 10, 15$), which all yield
⁴Note that the mutual information in Figure 1c is scaled (divided by 300) to fit the error range.
almost the same curve, closely approximated by the quadratic function $y = 0.31x^2 + 0.098x + 0.00087$. Our results, not shown here due to space restrictions, also demonstrate that the variance of the error increases with $P(0)$ for each fixed $k$; however, the maximum variance decreases with $k$. While the dependence between the error and the information loss requires further study, it is clear that, for zero-Bayes-risk problems, information loss is a much better predictor of the error than the mutual dependence between the features (compare to Figure 1a).
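The quantity $I_{diff}$ can be computed directly from a joint table. The sketch below is our illustration, assuming a table layout `P[c, x1, x2]` and base-2 logarithms; it builds the naive Bayes approximation and subtracts the two mutual informations:

```python
import numpy as np

def mi(joint):
    """I(C; X1, X2) from a joint table joint[c, x1, x2] = P(c, x1, x2)."""
    pc = joint.sum(axis=(1, 2), keepdims=True)   # P(c)
    px = joint.sum(axis=0, keepdims=True)        # P(x1, x2)
    prod = pc * px                               # independence baseline
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / prod[nz])))

def info_loss(P):
    """Idiff = I(C; X1, X2) - I_NB(C; X1, X2) for a table P[c, x1, x2]."""
    pc = P.sum(axis=(1, 2))
    p1 = P.sum(axis=2) / pc[:, None]             # P(x1 | c)
    p2 = P.sum(axis=1) / pc[:, None]             # P(x2 | c)
    # naive Bayes approximation P_NB(c, x1, x2) = P(c) P(x1|c) P(x2|c):
    P_nb = pc[:, None, None] * p1[:, :, None] * p2[:, None, :]
    return mi(P) - mi(P_nb)
```

For a conditionally independent table the loss is zero, while for an XOR-like concept the naive Bayes approximation discards all class information and the loss is strictly positive.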
[Figure 2 appears here. Panel (a): NBerror and Idiff vs. P(class=0) (n=2, m=2, k=10). Panel (b): NBerror vs. Idiff = I(C;X1,X2) − INB(C;X1,X2) for k=5, 10, 15 (N=2000), with the quadratic fit y = 0.31x² + 0.098x + 0.00087 (norm of residuals = 0.010884).]
Figure 2: Results for the generator ZeroBayesRisk (13 values of $P(0)$, 2000 instances per value of $P(0)$): (a) average naive Bayes error and average information loss $I_{diff}$ versus $P(0)$; (b) average naive Bayes error versus average "information loss" $I_{diff}$ for k = 5, 10, and 15.
For non-zero Bayes risk, the picture is somewhat less clear. However, the information loss still seems to be a better error predictor than the class-conditional mutual information between the features. Figure 3a plots the average difference between the naive Bayes error and the Bayes risk, called $E_{diff}$, and the information loss $I_{diff}$, versus the parameter $\delta$.
[Figure 3 appears here. Panels (a) and (b): Ediff = RNB − R*, Idiff, and Idiff+ vs. delta (n=2, m=2, k=10). Panel (c): scatter of Ediff = RNB − R* vs. Idiff = I(C;X1,X2) − INB(C;X1,X2) and vs. MI = I(X1;X2|C) (n=2, m=2, k=15).]
Figure 3: Information loss on noisy concepts: average error difference between naive Bayes and optimal Bayes, $E_{diff}$, and average $I_{diff}$, for (a) the generator EXTREME and (b) the generator FUNC1; (c) scatter plot of $E_{diff}$ versus $I_{diff}$ and versus the mutual information $I(X_1;X_2|C)$ for the generator RANDOM.
At first sight, it looks like $E_{diff}$ is non-monotone in $\delta$ while $I_{diff}$ is monotone; particularly, while the error increases with $\delta$, the information loss decreases on an initial interval of $\delta$. Note, however, that this interval yields negative (!) values of $I_{diff}$. It appears that naive Bayes overestimates the amount of information the features have about the class (possibly by counting the same information twice due to the independence assumption), which results in negative $I_{diff}$. If we assume that such overestimation is not harmful, but simply equivalent to not losing any information, and plot instead the average of $\max(I_{diff}, 0)$ (denoted $I_{diff}^{+}$), we observe a monotone relationship between the average of $I_{diff}^{+}$ and the average naive Bayes error, as one would expect (i.e., both increase monotonically up to a certain value of $\delta$, and then decrease).
Similarly, in Figure 3b we plot the error difference $E_{diff}$ as well as $I_{diff}$ and $I_{diff}^{+}$ versus $\delta$ for our second generator of non-zero Bayes risk problems, FUNC1. In this case, naive Bayes always overestimates the amount of information about the class, and thus $I_{diff}$ is always non-positive, i.e. $I_{diff}^{+} = 0$. Its relation to the naive Bayes error, which reaches its maximum at some intermediate value of $\delta$, is thus not clear.
Finally, we used a "completely" random problem generator (called RANDOM) to compare the class-conditional mutual information between the features, $I(X_1;X_2|C)$, and the information loss, $I_{diff}$, on arbitrary noisy concepts. For each class, this generator samples each entry of the class-conditional joint distribution $P(x_1, x_2 \mid c)$ from a uniform distribution on the interval [0.0, 1.0]; the resulting probability table is then normalized (divided by the total sum over all entries). Figure 3c shows a scatter plot of $E_{diff}$, the error difference between the naive Bayes and optimal Bayes classifiers, versus the feature dependence $I(X_1;X_2|C)$ and versus the information loss $I_{diff}$. In this case, we can see that both parameters are correlated with the error; however, the variance is quite high, especially for $I(X_1;X_2|C)$. Further study of both parameters on different classes of noisy concepts is needed to gain a better understanding of their relevance to the classification error.
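The RANDOM generator admits a very short sketch; whether the normalization is per class or over the whole table is ambiguous in the description, and per class is our reading:

```python
import numpy as np

def random_problem(m=2, k=15, rng=None):
    """Sketch of RANDOM: for each class c, sample every entry of
    P(x1, x2 | c) uniformly from [0.0, 1.0] and normalize by the total
    sum of the table (per-class normalization is our assumption)."""
    rng = rng or np.random.default_rng(0)
    P = rng.uniform(0.0, 1.0, size=(m, k, k))
    P /= P.sum(axis=(1, 2), keepdims=True)   # each table sums to 1
    return P
```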
5 Conclusions
Despite its unrealistic independence assumption, the naive Bayes classifier is surprisingly effective in practice, since its classification decision may often be correct even if its probability estimates are inaccurate. Although some optimality conditions of naive Bayes have already been identified in the past [2], a deeper understanding of the data characteristics that affect the performance of naive Bayes is still required.
Our broad goal is to understand the data characteristics which affect the performance of naive Bayes. Our approach uses Monte Carlo simulations that allow a systematic study of classification accuracy for several classes of randomly generated problems. We analyze the impact of the distribution entropy on the classification error, showing that certain almost-deterministic, or low-entropy, dependencies yield good performance of naive Bayes. Particularly, we demonstrate that naive Bayes works best in two cases: completely independent features (as expected) and functionally dependent features (which is surprising). Naive Bayes has its worst performance between these extremes.
Surprisingly, the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features. Instead, a better predictor of accuracy is the loss of information that the features contain about the class when assuming the naive Bayes model. However, further empirical and theoretical study is required to better understand the relation between those information-theoretic metrics and the behavior of naive Bayes. Further directions also include analysis of naive Bayes on practical applications that have almost-deterministic dependencies, characterizing other regions of naive Bayes optimality, and studying the effect of various data parameters on the naive Bayes error. Finally, a better understanding of the impact of the independence assumption on classification can be used to devise better approximation techniques for learning efficient Bayesian net classifiers, and for probabilistic inference, e.g., for finding maximum-likelihood assignments.
Acknowledgements
We wish to thank Mark Brodie, Vittorio Castelli, Joseph Hellerstein, Jayram Thathachar, Daniel Oblinger, and Ricardo Vilalta for many insightful discussions that contributed to the ideas of this paper.
References
[1] T.M. Cover and J.A. Thomas. Elements of Information Theory. New York: John Wiley & Sons, 1991.
[2] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130, 1997.
[3] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. New York: John Wiley and Sons, 1973.
[4] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.
[5] J. Hellerstein, Jayram Thathachar, and I. Rish. Recognizing end-user transactions in performance management. In Proceedings of AAAI-2000, pages 596–602, Austin, Texas, 2000.
[6] J. Hilden. Statistical diagnosis based on conditional independence does not require it. Comput. Biol. Med., 14(4):429–435, 1984.
[7] R. Kohavi. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Department of Computer Science, Stanford University, Stanford, CA, 1995.
[8] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 399–406, San Jose, CA, 1992. AAAI Press.
[9] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[10] I. Rish, J. Hellerstein, and T. Jayram. An analysis of data characteristics that affect naive Bayes performance. Technical Report RC21993, IBM T.J. Watson Research Center, 2001.
[11] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In Proceedings of CVPR-2000, 2000.
- CitationsCitations724
- ReferencesReferences17
- The process uses natural language processing through a combination of pre-processing steps, filtering and stemming techniques, combined with relevant statistical and machine learning classification techniques[1],[2]. Common machine learning techniques such as Naïve Bayes[3], support vector machines (SVM)[4]and decision trees[5]are typically used as classifiers for identifying sentiment or opinion, and assigning a label (typically positive or negative) to a piece of text. These classifiers treat the classification of sentiment in a " black-and-white " manner, while in reality sentiment is rarely clear-cut.
[Show abstract] [Hide abstract] ABSTRACT: Sentiment analysis aims to identify the polarity of a document through natural language processing, text analysis and computational linguistics. Over the last decade, there has been much focus on sentiment analysis as the data available on-line has grown exponentially to include many sentiment based documents (reviews, feedback, articles). Many approaches consider machine learning techniques or statistical analysis, but there has been little use of the fuzzy classifiers in this field especially considering the ambiguity of language and the suitability of fuzzy approaches to deal with this ambiguity. This paper proposes a fuzzy rule based system for sentiment analysis, which can offer more refined outputs through the use of fuzzy membership degrees. We compare the performance of our proposed approach with commonly used sentiment classifiers (e.g. Decision Trees, Naïve Bayes) which are known to perform well in this task. The experimental results indicate that our fuzzy-based approach performs marginally better than the other algorithms. In addition, the fuzzy approach allows the definition of different degrees of sentiment without the need to use a larger number of classes.- Second, different classifiers are used as base learners. These are Support Vector Machine (SVM)[60], Fuzzy K-Nearest Neighbour (FKNN)[61], Naive Bayes (NB)[62], Kernel Ridge Regression Classifier (KRIDGE)[63], Random Vector Functional Link (RVFL)[64][65][66], Cascade-forward Neural Network (CFNN) and Feed-forward Neural Network (FFNN)[67]. Each of these unique classifiers is trained with different parameter settings and training functions.
[Show abstract] [Hide abstract] ABSTRACT: This paper introduces a two step algorithm for fault diagnosis of power transformers (2-ADOPT) using a binary version of the multi-objective particle swarm optimization (MOPSO) algorithm. Feature subset selection and ensemble classifier selection are implemented to improve the diagnosing accuracy for dissolved gas analysis (DGA) of power transformers. First, the proposed method selects the most effective features in a multi objective framework and the optimum number of features, simultaneously, which are used as inputs to train classifiers in the next step. The input features are composed of DGA performed on the oil of power transformers along with the various ratios of these gases. In the second step, the most accurate and diverse classifiers are selected to create a classifier ensemble. Finally, the outputs of selected classifiers are combined using the Dempster-Shafer combination rule in order to determine the actual faults of power transformers. In addition, the obtained results of the proposed method are compared to three other scenarios: 1) multi-objective ensemble classifier selection without any feature selection step which takes all the features to train classifiers and then applies MOPSO algorithm to find the best ensemble of classifiers, 2) a well-known classifier ensemble technique called random forests, and 3) another powerful decision tree ensemble which is called oblique random forests. The comparison results were favourable to the proposed method and showed the high reliability of this method for power transformers fault classification.- In this context, each attribute-value pair would be independently correlated to each of the possible classes, which means that a posterior probability is provided between the attribute-value pair and the class. A popular method of Bayesian learning is Naive Bayes [26]. 
This type of learning resembles the prior-knowledge-based character of human reasoning: people make decisions and judgments based on previously acquired knowledge, choosing the option in which they are most confident.
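The per-class posterior scoring that the citation context describes, one conditional probability per attribute-value pair combined with a class prior, can be sketched in plain Python. The priors and per-attribute likelihood tables below are illustrative values invented for the example, not figures from either paper:

```python
import math

def nb_log_posterior(features, priors, likelihoods):
    """Naive Bayes decision rule: argmax over classes of
    log P(c) + sum over attribute-value pairs of log P(x_i = v | c)."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for attr, value in features.items():
            # Tiny floor avoids log(0) for unseen attribute values.
            score += math.log(likelihoods[c][attr].get(value, 1e-9))
        scores[c] = score
    return max(scores, key=scores.get)

# Illustrative (made-up) two-class problem loosely themed on DGA gas levels.
priors = {"fault": 0.3, "normal": 0.7}
likelihoods = {
    "fault":  {"h2": {"high": 0.8, "low": 0.2}, "ch4": {"high": 0.6, "low": 0.4}},
    "normal": {"h2": {"high": 0.1, "low": 0.9}, "ch4": {"high": 0.2, "low": 0.8}},
}
print(nb_log_posterior({"h2": "high", "ch4": "high"}, priors, likelihoods))  # fault
```

Because each attribute contributes an independent log-likelihood term, training reduces to estimating one small conditional table per attribute, which is what makes the learner so simple.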
ABSTRACT: Big data can be generally characterised by 5 Vs: Volume, Velocity, Variety, Veracity and Variability. Many studies have focused on using machine learning as a powerful tool for big data processing. In a machine learning context, learning algorithms are typically evaluated in terms of accuracy, efficiency, interpretability and stability. These four dimensions are strongly related to veracity, volume, variety and variability, and are affected by both the nature of the learning algorithms and the characteristics of the data. This chapter analyses in depth how the quality of computational models is affected by data characteristics as well as by the strategies involved in learning algorithms. This chapter also introduces a unified framework for the control of machine learning tasks towards appropriate employment of algorithms and efficient processing of big data. In particular, this framework is designed to achieve effective selection of data pre-processing techniques: selecting relevant attributes, sampling representative training and test data, and appropriately handling missing values and noise. More importantly, this framework allows suitable machine learning algorithms to be employed, on the basis of the training data provided by the pre-processing stage, towards building accurate, efficient and interpretable computational models.
Citation context: In this paper, we used a supervised machine learning method with a probabilistic approach, the Naïve Bayes Classifier (NBC), for the classification system of the Indonesian news curator on Twitter. We chose NBC because the algorithm does not model the relations between features, and its implementation has a lower time complexity than other supervised algorithms [12, 13]. Other prior work argued that NBC can be used on classification problems, especially with data from Twitter, and produced higher accuracy than other classification algorithms [6, 7].
ABSTRACT: News curators on Twitter are users interested in following, spreading and giving feedback on recently popular articles. There are two kinds of such users: the news curator, a human user, and the news aggregator, a bot. In prior work on news curators, classification systems were built using follower, URL, mention and re-tweet features. However, there is little prior work on classifying Indonesian news curators on Twitter, and labeling data is still hard when only two features, followers and URL, are involved. In this paper, we propose a framework for classifying Indonesian news curators on Twitter using the Naïve Bayes Classifier (NBC) with added features such as location, bio profile and common tweets. A further purpose is to analyze the influential features of each class, making it easier to label data for this role in the future. Evaluation using a percentage split produced 87% accuracy. The most influential features for the news curator class are followers, bio profile, mentions and re-tweets; for the news aggregator class they are followers, location and URL. Only the common-tweet feature was uninfluential for both classes. We applied Feature Subset Selection (FSS) to increase system performance and avoid overfitting, which raised accuracy to 92.90%.
Citation context: Naïve Bayes streamlines learning by treating the features as independent given the class. The accuracy of Naïve Bayes is not directly correlated with the degree of feature dependencies (Rish, 2001). Text categorization can be viewed in terms of posterior document probabilities, i.e., P(c_j | d_i), the probability of the j-th category given the i-th document.
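The text-categorization view of P(c_j | d_i) corresponds to a multinomial Naïve Bayes over word counts. A minimal stdlib sketch, with a tiny made-up training set standing in for the curator/aggregator data (the tokens and labels are illustrative only):

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns class priors and
    Laplace-smoothed per-class word log-likelihoods."""
    labels = [y for _, y in docs]
    priors = {c: labels.count(c) / len(labels) for c in set(labels)}
    counts = {c: Counter() for c in priors}
    for tokens, y in docs:
        counts[y].update(tokens)
    vocab = {w for c in counts for w in counts[c]}
    loglik = {
        c: {w: math.log((counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab)))
            for w in vocab}
        for c in counts
    }
    return priors, loglik

def classify(tokens, priors, loglik):
    """argmax_j of log P(c_j) + sum_w log P(w | c_j), i.e. the MAP class P(c_j | d_i)."""
    return max(priors, key=lambda c: math.log(priors[c])
               + sum(loglik[c].get(w, 0.0) for w in tokens))

docs = [(["breaking", "news", "link"], "curator"),
        (["breaking", "story", "link"], "curator"),
        (["lunch", "today"], "human"),
        (["coffee", "today"], "human")]
priors, loglik = train_multinomial_nb(docs)
print(classify(["breaking", "link"], priors, loglik))  # curator
```

Training is a single counting pass over the documents, which is the low time complexity the citing authors point to.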
ABSTRACT: Virtual communities reflect worldwide connectivity and enable real-time information sharing and targeted advertising. Twitter has emerged as one of the most extensively used microblogging services. It is a platform for sharing ideas, feelings and views on any event; people are free to post Tweets about a particular event, and the success of an event can be predicted from users' responses. Individual interaction patterns can strongly indicate personalities. Garbage or nonsense replies can harm the fidelity of an event. To make it trustworthy, we performed sentiment analysis to predict offensiveness in Tweets. We collected data from the Twitter search and stream APIs. Text mining techniques (preprocessing, stemming, a negation rule, tokenization and stop-word removal) were used to clean the data. Our approach predicts offensiveness in Tweets effectively. We also performed a comparative analysis of different machine learning classifiers, i.e., Naïve Bayes (NB), Support Vector Machine (SVM) and Logistic Regression (LR), to find sentiment polarity, and found that SVM outperforms the others. An in-house tool, 'Interaction Pattern Predictor', was developed in the Python programming language. Our results are trustworthy, as we used three large data dictionaries to train the tool.
Citation context: Depending on the assumed distribution of p(x | Y), the naive Bayes method [40, 41] can be further divided into Gaussian naive Bayes and Bernoulli naive Bayes (see Additional file 1).
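The Gaussian/Bernoulli distinction the citation context draws comes down to the per-feature likelihood p(x | y) plugged into the same naive Bayes sum. A minimal stdlib sketch of the two likelihood terms (the parameter values in the usage line are arbitrary examples):

```python
import math

def gaussian_loglik(x, mu, sigma):
    """Gaussian naive Bayes: each continuous feature given the class
    is modelled as a normal N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def bernoulli_loglik(x, p):
    """Bernoulli naive Bayes: each binary feature given the class
    is a coin flip with success probability p."""
    return math.log(p) if x else math.log(1.0 - p)

def nb_score(features, log_prior, logliks):
    """Class score log P(y) + sum_i log p(x_i | y); the two variants
    differ only in which per-feature log-likelihood is supplied."""
    return log_prior + sum(ll(x) for x, ll in zip(features, logliks))

# One continuous feature (Gaussian term) and one binary feature (Bernoulli term).
score = nb_score([1.5, 1], math.log(0.5),
                 [lambda x: gaussian_loglik(x, 1.0, 1.0),
                  lambda x: bernoulli_loglik(x, 0.7)])
```

Mixing likelihood families per feature, as above, is also possible; the independence assumption keeps the score a simple sum either way.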
ABSTRACT: Background: The study of virus-host infectious associations is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word-frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem is significantly understudied. We hypothesize that machine learning methods based on word frequencies can be used efficiently to study virus-host infectious associations. Methods: We investigate four different representations of the word frequencies of viral sequences, including the relative word frequency and three normalized word frequencies obtained by subtracting the expected from the observed word counts. We also study five machine learning methods, logistic regression, support vector machine, random forest, Gaussian naive Bayes and Bernoulli naive Bayes, for separating infectious from non-infectious viruses for nine bacterial host genera with at least 45 infecting viruses each. The area under the receiver operating characteristic curve (AUC) is used to compare the performance of the different machine learning methods and feature combinations. We then evaluate the performance of the best method for identifying the hosts of contigs in metagenomic studies. We also develop a maximum likelihood method to estimate the fraction of truly infectious viruses for a given host in viral tagging experiments. Results: Based on nine bacterial host genera with at least 45 infectious viruses each, we show that random forest together with the relative word-frequency vector performs best in identifying viruses infecting particular hosts.
For all nine host genera the AUC is over 0.85, and for five of them the AUC is higher than 0.98 when the word size is 6, indicating the high accuracy of machine learning approaches for identifying viruses infecting particular hosts. We also show that our method can predict the hosts of viral contigs of length at least 1 kbp in metagenomic studies with high accuracy. The random forest together with the word-frequency vector outperforms currently available methods based on the Manhattan and d2* dissimilarity measures. Based on word frequencies, we estimate that about 95% of the identified T4-like viruses in the viral tagging experiment infect Synechococcus, while only about 29% of the identified non-T4-like viruses and 30% of the contigs in the study potentially infect Synechococcus. Conclusions: The random forest machine learning method, together with the relative word frequencies as viral features, can be used to predict viruses and viral contigs for specific bacterial hosts. The maximum likelihood approach can be used to estimate the fraction of truly infectious associated viruses in viral tagging experiments. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1473-7) contains supplementary material, which is available to authorized users.
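The relative word-frequency feature this abstract describes is just the normalized count of overlapping k-mers in a sequence. A minimal sketch (the paper's best results use word size 6; k = 2 here only to keep the example small, and the sequence is an arbitrary toy string):

```python
from collections import Counter

def relative_word_freq(seq, k):
    """Relative k-mer ('word') frequency vector of a sequence,
    the feature representation fed to the classifiers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

v = relative_word_freq("ACGTACGT", 2)
print(v["AC"])  # 2 of the 7 overlapping 2-mers are "AC"
```

In the study's pipeline these vectors (over all 4^k possible words) would then be the per-virus inputs to, e.g., a random forest; the normalized variants subtract expected counts before scaling.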