Multiple Instance Fuzzy Inference Neural Networks
Amine B. Khalifa and Hichem Frigui, Member, IEEE
Multimedia Research Lab
CECS Department
University of Louisville
Louisville, KY 40292, USA
Abstract
Fuzzy logic is a powerful tool to model knowledge uncertainty, measurements imprecision, and vagueness.
However, there is another type of vagueness that arises when data have multiple forms of expression that fuzzy logic
does not address quite well. This is the case for multiple instance learning problems (MIL). In MIL, an object is
represented by a collection of instances, called a bag. A bag is labeled negative if all of its instances are negative,
and positive if at least one of its instances is positive. Positive bags encode ambiguity since the instances themselves
are not labeled. In this paper, we introduce fuzzy inference systems and neural networks designed to handle bags of
instances as input and capable of learning from ambiguously labeled data. First, we introduce the Multiple Instance
Sugeno style fuzzy inference (MI-Sugeno) that extends the standard Sugeno style inference to handle reasoning
with multiple instances. Second, we use MI-Sugeno to define and develop Multiple Instance Adaptive Neuro Fuzzy
Inference System (MI-ANFIS). We expand the architecture of the standard ANFIS to allow reasoning with bags
and derive a learning algorithm using backpropagation to identify the premise and consequent parameters of the
network. The proposed inference system is tested and validated using synthetic and benchmark datasets suitable for
MIL problems. We also apply the proposed MI-ANFIS to fuse the output of multiple discrimination algorithms for
the purpose of landmine detection using Ground Penetrating Radar.
I. INTRODUCTION
Fuzzy inference is a powerful modeling framework that can handle computing with knowledge uncertainty
and measurements imprecision effectively [1]. It has been successfully applied to a wide range of problems,
mainly in system modeling and control [2]–[4]. Most of the proposed fuzzy inference methods gained
success because of their ability to leverage expert knowledge to identify the model parameters [5]. This
practice simplifies system design and ensures that the knowledge base (if-then rules) used by the system is
easy to interpret [6].
More recently, fuzzy inference has increasingly been applied to more advanced applications, such as
content-based information retrieval [7], image segmentation [8], image annotation [9], pattern recognition
[10], recommender systems [11], and multiple classifier fusion [12]. The aforementioned applications are
more challenging as they require an extensive knowledge base to accommodate various scenarios. Since
this diverse knowledge base cannot be fully captured by domain experts, data-driven techniques are typically
used to identify and learn the inference system’s parameters [13], [14]. One such technique is the Adaptive
Neuro-Fuzzy Inference System (ANFIS) [15]. ANFIS is a universal approximator that combines the learning
and modeling power of neural networks and fuzzy logic into an adaptive inference system. It is a hybrid
intelligent system and it provides a systematic approach to jointly learn the optimal input space partition
(rules) and the optimal output parameters using supervised learning.
Typically, in supervised learning, access to large labeled training datasets improves the performance of
the devised algorithms by increasing their robustness and generalization capabilities. Nowadays, access to
such large datasets is becoming more convenient. However, for a supervised learning method to benefit from
this data, it needs to be carefully preprocessed, filtered, and labeled. Unfortunately, this process can be too
Email: a0benk01@louisville.edu
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may
no longer be accessible.
tedious, as the vast majority of the collected data is unstructured and labeled ambiguously at a coarse level.
An alternative and relatively new learning framework that tackles this inherent ambiguity better than
supervised learning is the Multiple Instance Learning (MIL) paradigm [16].
A. Multiple Instance Learning
Unlike standard supervised learning, in MIL, an object is not represented by a simple data point, but
rather by a collection of instances, called a bag. Each bag can contain a different number of instances.
A bag is labeled negative if all of its instances are negative, and positive if at least one of its instances
is positive (note that positive bags may also contain negative instances). Positive bags can encode ambiguity since the instances themselves are not labeled. Given a
training set of labeled bags, the goal of MIL is to learn a concept that predicts the labels of training data
at the instance level and generalizes to predict the labels of testing bags and their instances [17]. We refer
to this definition as the standard MIL assumption. Multiple MIL paradigms have been proposed [18], but
for simplicity we focus our formulation on the standard MIL assumption.
MIL is a well-known problem that has been studied for over 20 years; it was first formalized by
Dietterich et al. [19], providing a solution to drug activity prediction. Ever since, it has increasingly been
applied to a wide variety of tasks including content-based information retrieval [20], drug discovery [21],
pattern recognition [22], image classification [23], region-based image categorization [24], image annotation
[25], object tracking [26] and time series prediction [16]. In general, MIL can be applied in two contexts
of ambiguity: "polymorphism ambiguity" and "part-whole ambiguity" [27]. In polymorphism ambiguity,
an object can have multiple forms of expression in the input space and it is not known which form is
responsible for the object label. In contrast, in part-whole ambiguity, an object can be broken into several
parts represented by different feature vectors in the input space. However, only a few parts are responsible for
the object label [28]. Polymorphism ambiguity arises more often in applications related to chemistry and
bioscience. The original MIL application of drug discovery [16], [17] is a case of polymorphism ambiguity.
Part-whole ambiguity is more common in pattern recognition problems. For example, in image annotation,
features are usually extracted locally (from patches) while the labels, or tags, are only available globally at
the image level. Another closely related application is object detection. In this application, objects of interest
may cover only a limited region of the image, the rest could be other objects or background. Traditional
supervised learning requires identifying image patches containing the object of interest only and labeling
them. As indicated by Viola et al. [29], placing bounding boxes around objects is an inherently ambiguous
task. Thus, to avoid the tedious task of object segmentation and annotation, the problem of object detection
can be addressed using an MIL paradigm. To illustrate the need for MIL further, in the following we analyze
how a multiple instance (MI) representation can be applied to image classification. More details about MIL
taxonomy have been reported by Amores [30].
Consider the simple example of classifying images that contain “sky”. Using an MIL approach, each
training image is represented by a bag of instances where each instance corresponds to features extracted
from a region of interest. These regions could be obtained by segmenting the image or simply by dividing
it into patches. A multiple instance representation is well suited for this purpose because only a few regions
may contain the object of interest (sky), that is, the positive class. Other patches will be from the background
or other classes. This representation is illustrated in Figure 1. Traditional single instance learning methods are based
on instance-level (patch-level) labels and would require each image region to be correctly segmented and
labeled prior to learning.
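To make the representation concrete, the sketch below (our illustration, not code from the paper; the patch size and the mean-color feature are placeholder assumptions) builds a bag of instances from image patches:

```python
import numpy as np

def image_to_bag(image, patch_size=32):
    """Divide an image into non-overlapping patches and return a bag
    of per-patch feature vectors (mean color here as a stand-in for
    the color/texture features used in the paper)."""
    H, W, _ = image.shape
    bag = []
    for r in range(0, H - patch_size + 1, patch_size):
        for c in range(0, W - patch_size + 1, patch_size):
            patch = image[r:r + patch_size, c:c + patch_size]
            bag.append(patch.mean(axis=(0, 1)))  # placeholder feature
    return np.array(bag)  # shape: (num_instances, D)

# The bag inherits a single image-level label (e.g., "sky"),
# while the individual patches remain unlabeled.
```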
B. Fuzzy Inference Systems
A Fuzzy Inference System (FIS) is a paradigm in soft computing which provides a means of approximate
reasoning [31]. A FIS is capable of handling computing with knowledge uncertainty and measurements
imprecision effectively [1]. It performs a non-linear mapping from an input space to an output space by
Fig. 1. Example of an image represented as a bag of 12 instances. Each instance corresponds to a feature vector (e.g., color, texture) extracted
from one patch. The bag is labeled "sky" because at least one of its instances is sky. However, many other instances are not "sky". Labels at
the instance level are not available.
deriving conclusions from a set of fuzzy if-then rules and known facts [32]. Fuzzy rules are condition/action
(if-then) rules composed of a set of linguistic variables (e.g. image patch). Each variable is assigned a
linguistic term (e.g. red, green, blue). For instance, the following rules could be used to identify patches
from the image in Figure 1:
If patch is blue and texture is smooth then region is sky.
If patch is blue and patch position is upper half then region is sky.
Typically, a FIS is composed of 5 components: (1) a Fuzzification unit that assigns a membership degree
to each crisp input dimension in the input fuzzy sets; (2) a Knowledge Base characterized by fuzzy sets of
linguistic terms; (3) a Rule Base containing a set of fuzzy if-then rules; (4) an Inference unit that performs
fuzzy reasoning; and (5) a Defuzzification unit that generates crisp output values. FIS has proven to be very
effective in various applications [2]–[4], [33]–[40]. However, it is not applicable to cases where objects are
represented by multiple instances.
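For reference, a standard (single instance) zero-order Sugeno FIS can be sketched in a few lines (our illustration, assuming Gaussian fuzzification, product conjunction, and weighted-average defuzzification):

```python
import numpy as np

def gauss_mf(x, c, sigma):
    """Gaussian membership function."""
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

def sugeno_infer(x, centers, sigmas, outputs):
    """Zero-order Sugeno inference for a single instance x (shape (D,)).
    centers, sigmas: (R, D) premise parameters; outputs: (R,) rule outputs."""
    # Firing strength of each rule: product of per-dimension memberships.
    w = np.prod(gauss_mf(x, centers, sigmas), axis=1)
    # Weighted average of the rule outputs (defuzzification).
    return np.sum(w * outputs) / np.sum(w)
```

The multiple instance extensions introduced next modify exactly this pipeline: the input becomes a bag, and an extra aggregation step is inserted between rule firing and defuzzification.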
C. Motivations For Multiple Instance Fuzzy Inference
There are two major limitations that prevent using standard FIS methods with multiple instance data.
First, due to the absence of labels at the instance level, we cannot use standard FIS learning methods to
construct the knowledge base. Second, we need an effective mechanism to aggregate instances’ confidences
and infer at the bag level. The above limitations are due mainly to the inherent architecture of fuzzy inference
systems. The standard inference systems reason with individual instances. First, the system’s input is an
individual instance. Second, the rules describe fuzzy regions within the instances space. Third, the output
of the system corresponds to the fuzzy inference using a single instance. Fourth, labels of the individual
instances are required when using learning techniques to identify the parameters of the system. In summary,
traditional fuzzy inference systems cannot be used effectively within the MIL framework.
To address the above limitations, we introduce two FIS designed to handle reasoning with bags of
instances and capable of learning from ambiguously labeled data. The first one, called Multiple Instance-
Sugeno (MI-Sugeno), extends the standard Sugeno system [41]. The second one, called Multiple Instance-
ANFIS (MI-ANFIS), extends the standard ANFIS [15] system and uses MI-Sugeno rules. We report results
on various experiments and discuss the advantages of using our proposed methods over closely related MIL
algorithms such as Multiple Instance Neural Networks [42] (MI-NN) and Multiple Instance RBF Neural
Networks [43] (RBF-MIP).
II. MULTIPLE INSTANCE FUZZY INFERENCE
In the following, let $B_p$ be a bag of $M_p$ instances, with the $j$th instance denoted as $\mathbf{x}_{pj} \in \mathbb{R}^D$ with
elements $x_{(p,j,k)}$ corresponding to features, i.e.,

$$B_p = \begin{bmatrix} \mathbf{x}_{p1} \\ \mathbf{x}_{p2} \\ \vdots \\ \mathbf{x}_{pM_p} \end{bmatrix} = \begin{bmatrix} x_{(p,1,1)} & x_{(p,1,2)} & \dots & x_{(p,1,D)} \\ x_{(p,2,1)} & x_{(p,2,2)} & \dots & x_{(p,2,D)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{(p,M_p,1)} & x_{(p,M_p,2)} & \dots & x_{(p,M_p,D)} \end{bmatrix}. \quad (1)$$
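In code, a bag is simply a variable-length matrix, and a dataset is a list of such matrices (a minimal numpy illustration of (1), ours):

```python
import numpy as np

# A bag B_p is an (M_p x D) matrix; M_p varies from bag to bag.
B1 = np.array([[0.4, 0.6], [1.7, 0.2]])              # M_1 = 2 instances, D = 2
B2 = np.array([[0.1, 0.9], [1.4, 1.6], [2.0, 0.3]])  # M_2 = 3 instances
bags = [B1, B2]   # dataset: a list of variable-length matrices
labels = [0, 1]   # labels exist only at the bag level
```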
Note that the number of instances can vary between bags ($M_p$ depends on $B_p$). A bag is labeled positive
if at least one of its instances is positive, and negative if all of its instances are negative.
A. Multiple Instance Sugeno Style Fuzzy Inference
To adapt Sugeno inference to problems where objects are described by multiple instances, we propose
a multiple instance Sugeno inference (MI-Sugeno) system that uses multiple instance fuzzy if-then rules.
Recall that a fuzzy if-then rule is expressed as

$$\text{if } x \text{ is } A \text{ then } y \text{ is } C \quad (2)$$

where $A$ and $C$ are fuzzy sets on universes of discourse $X$ and $Y$, respectively. The rule in (2) combines
the fuzzy propositions ($x$ is $A$, $y$ is $C$) into a logical implication abbreviated as $A \rightarrow C$ with membership
function $\mu_{A \rightarrow C}(x, y)$. The rule is defined using a premise part that is a single instance fuzzy proposition.
To generalize the rule in (2) to MI data, we define a multiple instance fuzzy rule as:

$$\text{if } B_i \text{ is } A \text{ then } y \text{ is } C \;\equiv\; \text{if } \bigvee_{j=1}^{M_i} (\mathbf{x}_{ij} \text{ is } A) \text{ then } y \text{ is } C \quad (3)$$

where, as in (2), $A$ and $C$ are fuzzy sets on the universes of discourse $X$ and $Y$, respectively. In (3), $B_i$ is a
bag of instances $\mathbf{x}_{ij}$ as defined in (1), and $M_i$ is the number of instances in $B_i$. The premise part of a multiple
instance fuzzy rule (i.e., $\bigvee_{j=1}^{M_i} (\mathbf{x}_{ij} \text{ is } A)$) is a multiple instance proposition, whereas the consequent part is
a traditional proposition. In (3), $\bigvee$ is a joint operator that can be any T-conorm (maximum, algebraic sum,
bounded sum, etc.). The reason for using a T-conorm to combine the individual instances' responses
goes back to the standard MIL assumption [16], [17], which states that a bag is positive if and only if
one or more of its instances are positive. Thus, the bag-level class label is determined by the disjunction
of the instance-level class labels. We note that the T-conorm can be designed to handle a broader set of
non-standard MIL problems, for example to allow the inference process to assign a higher degree of belief
to bags with more than one positive instance.
The proposed MI-Sugeno uses multiple instance fuzzy rules with a consequent part that is described by
means of a function $C$ that maps a bag of instances to a crisp numerical value. Specifically, we define a
multiple instance Sugeno rule as:

$$R_i(B_p): \bigvee_{j=1}^{M_p} \big(\text{If } x_{(p,j,1)} \text{ is } A_1^i \text{ and } x_{(p,j,2)} \text{ is } A_2^i \dots \text{ and } x_{(p,j,D)} \text{ is } A_D^i\big), \text{ then } o_i = C(\mathbf{x}_{p1} \cdot \mathbf{b}_i, \mathbf{x}_{p2} \cdot \mathbf{b}_i, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_i) \quad (4)$$

In (4), $\mathbf{b}_i = [b_0^i, \dots, b_D^i]$ is a set of polynomial coefficients. When the polynomial coefficients $\mathbf{b}_i$ are first
order, the MI-Sugeno fuzzy model is called first order, and zero order when the polynomial coefficients
are zero order.

Fig. 2. Illustration of the proposed multiple instance Sugeno fuzzy inference system with 2 rules.
Figure 2 illustrates the proposed MI-Sugeno system and its fuzzy inference mechanism to derive the output,
$o$, in response to a bag of $M$ instances for the simple case of two rules. The premise part of the rules
evaluates all the bag's instances simultaneously. The inference starts with the fuzzification of the instances $\mathbf{x}_{pm}$ of
the input bag $B_p$. Fuzzification assigns a membership degree to each input instance dimension in the rules'
input fuzzy sets. In Figure 2, instance $\mathbf{x}_{pm}$ activates the $i$th input fuzzy set of the $j$th rule by a degree of truth
$w_{(m,i,j)}$. Next, an implication process is executed to combine the activations of the instances within the bag,
resulting in the activation of the rules' outputs with different degrees. In this example, we use a simple min
operator, and the output of rule $R_j$ will be partially activated by a degree $w_{mj} = \min_{k=1,\dots,D} w_{(m,k,j)}$. The
$w_{mj}$ (truth instances) are combined in the premise part using the max T-conorm, resulting in the activation
of rule $R_j$ by a degree $w_j = \max_{m=1,\dots,M}\{w_{mj}\}$. To evaluate the consequent part, first the linear response of
each instance is computed, i.e., $\mathbf{x}_{pj} \cdot \mathbf{b}_i$. Then, a function $C$ is used to compute the final output by combining
the instances' responses. Many functions could be used and the choice should be domain-specific. The outputs
of each rule, $o_1$ and $o_2$, are crisp values. As in the traditional Sugeno fuzzy inference system, the overall
output of the system is obtained by taking the weighted average of the rules' outputs.
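The inference mechanism just described can be sketched as follows (our illustration of Figure 2, assuming Gaussian MFs, min for the conjunction, max as the premise T-conorm, and max as the combining function C):

```python
import numpy as np

def gauss_mf(x, c, sigma):
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

def mi_sugeno_infer(bag, centers, sigmas, b):
    """MI-Sugeno inference for one bag (M x D matrix).
    centers, sigmas: (R, D) premise parameters of R rules.
    b: (R, D+1) first-order consequent coefficients [b0, b1, ..., bD]."""
    R = centers.shape[0]
    w, o = np.zeros(R), np.zeros(R)
    for i in range(R):
        # Truth instances: min over dimensions for each instance.
        truth = gauss_mf(bag, centers[i], sigmas[i]).min(axis=1)
        w[i] = truth.max()                    # premise T-conorm (max here)
        resp = b[i, 0] + bag @ b[i, 1:]       # linear response per instance
        o[i] = resp.max()                     # combining function C (max here)
    return np.sum(w * o) / np.sum(w)          # weighted average of rule outputs
```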
The consequent part of the proposed MI-Sugeno style inference system is inspired by the work of Ray
and Page on multiple instance regression [44]. In their work, the authors proposed a regression framework
for predicting bags' labels. This formulation allows the linear coefficients $\mathbf{b}_i$ and the parameters of the
combining function $C$ to be learned using optimization techniques, as we will show in Section II-B.
Similar to traditional fuzzy inference, the premise part of a multiple instance rule defines a local fuzzy
region within the instance space, and the consequent part describes the characteristics of the system’s
output within each region. More specifically, in MIL problems, a local region describes a positive concept (also
called target concept), and the output of a rule represents the degree of “positivity” of the instances in that
target concept. A target concept is a region in the instances’ feature space that includes as many instances
from positive bags as possible and as few instances from negative bags as possible.
The Sugeno fuzzy model [41] was the first attempt at learning fuzzy rules from training data. It has
been used to develop the standard ANFIS, which combines the representation power of fuzzy inference and the
learning capability of neural networks to learn the rules. In the next section, we will use our MI-Sugeno
to develop a multiple instance extension of ANFIS (MI-ANFIS).

Fig. 3. Architecture of the proposed Multiple Instance Adaptive Neuro-Fuzzy Inference System
B. MI-ANFIS: A Multiple Instance Adaptive Neuro-Fuzzy Inference System
Let $B_i$ be a bag of $M_i$ instances as defined in (1). For simplicity, we introduce our MI-ANFIS for the
case of two rules. The generalization to an arbitrary number of rules is trivial. The MI-ANFIS with two
Sugeno rules can be described as:

$$R_1(B_p): \bigvee_{j=1}^{M_p} \big(\text{If } x_{(p,j,1)} \text{ is } A_{(1,1)} \text{ and } x_{(p,j,2)} \text{ is } A_{(1,2)}, \dots, \text{and } x_{(p,j,D)} \text{ is } A_{(1,D)}\big), \text{ then } f_1 = C(\mathbf{x}_{p1} \cdot \mathbf{b}_1, \mathbf{x}_{p2} \cdot \mathbf{b}_1, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_1)$$
$$R_2(B_p): \bigvee_{j=1}^{M_p} \big(\text{If } x_{(p,j,1)} \text{ is } A_{(2,1)} \text{ and } x_{(p,j,2)} \text{ is } A_{(2,2)}, \dots, \text{and } x_{(p,j,D)} \text{ is } A_{(2,D)}\big), \text{ then } f_2 = C(\mathbf{x}_{p1} \cdot \mathbf{b}_2, \mathbf{x}_{p2} \cdot \mathbf{b}_2, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_2) \quad (5)$$
Figure 3 illustrates the proposed MI-ANFIS architecture. As in the traditional ANFIS, nodes at the same
layer have similar functions. We denote the output of the $i$th node in layer $l$ as $O_{(l,i)}$.

Layer 1 is an adaptive layer; it calculates the degree to which an input instance satisfies a quantifier
$A$. Every node evaluates the membership degree $\mu_{A_{(k,j)}}$ of an input instance, $x$, in the fuzzy set
$A_{(k,j)}$. Generally, $\mu_{A_{(k,j)}}$ is a parameterized membership function (MF), for example a Gaussian
MF with

$$\mu_{A_{(k,j)}}(x) = \exp\left(-\frac{(x - c_{kj})^2}{2\sigma_{kj}^2}\right). \quad (6)$$

In (6), $c_{kj}$ and $\sigma_{kj}$ are the mean and the width (standard deviation) of the Gaussian function, and are referred to as the
premise parameters.
Layer 2 is a fixed layer where every node computes the product of all its incoming inputs. It evaluates the
degrees of truth of proposition instances, or simply, "truth instances". The output of this layer is
computed using:

$$O_{(2,i)} = r_{(\lceil i/M_p \rceil,\, i[M_p])} = \prod_{j=1}^{D} \mu_{A_{(\lceil i/M_p \rceil,\, j)}}\big(x_{(p,\, i[M_p],\, j)}\big), \quad (7)$$

where $\lceil \cdot \rceil$ is the ceiling operator and $i[M_p] = 1 + ((i-1) \bmod M_p)$. As in the traditional ANFIS,
any T-norm can replace the product as the node function in this layer.
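To make the indexing in (7) concrete, the following snippet (our illustration, with illustrative values for the number of rules and instances) enumerates which rule and which instance each Layer 2 node refers to:

```python
import math

M_p, R = 3, 2  # 3 instances per bag, 2 rules (illustrative values)
for i in range(1, R * M_p + 1):
    rule = math.ceil(i / M_p)       # ceil(i / M_p): the rule this node belongs to
    inst = 1 + ((i - 1) % M_p)      # i[M_p]: the instance it evaluates
    print(f"node {i}: rule {rule}, instance {inst}")
```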
Layer 3 is a new addition when compared to the traditional ANFIS. Every node in this layer aggregates
the truth instances (within each bag) of the previous layer by means of a smooth T-conorm. In
this paper, we use a "softmax" function ($S_\alpha$):

$$\text{softmax}_\alpha(x_1, x_2, \dots, x_n) = S_\alpha(x_1, x_2, \dots, x_n) = \sum_{i=1}^{n} x_i \cdot \frac{e^{\alpha x_i}}{\sum_{j=1}^{n} e^{\alpha x_j}}. \quad (8)$$

In (8), $\alpha$ determines the behavior of softmax. As $\alpha$ approaches $\infty$, softmax approaches the max
operator. When $\alpha = 0$, it calculates the mean. As $\alpha$ approaches $-\infty$, softmax approaches the
min operator. The outputs of this layer are the firing strengths of each input bag in each multiple
instance fuzzy rule, i.e.,

$$O_{(3,i)} = w_i = S_\alpha\big(\{r_{(i,j)}\}_{j=1}^{M_p}\big). \quad (9)$$

Layer 3 is also a fixed layer.
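A direct implementation of (8) (our sketch, with a standard numerical-stability shift that leaves the softmax weights unchanged) makes this limiting behavior easy to verify:

```python
import numpy as np

def softmax_T(x, alpha):
    """Smooth T-conorm S_alpha of (8): a weighted mean of x that
    approaches max(x) as alpha -> +inf and min(x) as alpha -> -inf."""
    x = np.asarray(x, dtype=float)
    e = np.exp(alpha * (x - x.max()))   # shift for numerical stability
    return np.sum(x * e / e.sum())

r = [0.1, 0.7, 0.3]
print(softmax_T(r, 0))     # 0.3667 (the mean)
print(softmax_T(r, 50))    # ~0.7   (approaches the max)
print(softmax_T(r, -50))   # ~0.1   (approaches the min)
```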
Layer 4 is a fixed layer. Every node in this layer calculates the normalized firing strength of each
rule, i.e.,

$$O_{(4,i)} = \overline{w}_i = \frac{w_i}{\sum_{j=1}^{|O_3|} w_j}, \quad (10)$$

where $|O_3|$ is the number of rules.
Layer 5 is an adaptive layer. Every node $i$ in this layer computes the output of the $i$th multiple instance
rule using

$$O_{(5,i)} = \overline{w}_i\, C(\mathbf{x}_{p1} \cdot \mathbf{b}_i, \mathbf{x}_{p2} \cdot \mathbf{b}_i, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_i). \quad (11)$$

The parameters $\{\mathbf{b}_i\}_{i=1}^{|O_3|}$ are referred to as the consequent parameters. The only constraint on
$C$ is that it has to be a smooth function to allow optimization techniques to be applied. We
use the "softmax" as the combining function for this layer. In this case, (11) is equivalent to:

$$O_{(5,i)} = \overline{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_i, \mathbf{x}_{p2} \cdot \mathbf{b}_i, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_i). \quad (12)$$

Note that the constant $\alpha$ here is not necessarily the same as in Layer 3.
Layer 6 is a fixed layer with a single node labeled $\Sigma$. It computes the overall output of the system
using

$$O_{(6,1)} = \sum_{i=1}^{|O_3|} O_{(5,i)} = \sum_{i=1}^{|O_3|} \overline{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_i, \mathbf{x}_{p2} \cdot \mathbf{b}_i, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_i). \quad (13)$$
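Putting Layers 1 through 6 together, a forward pass for one bag can be sketched as follows (our illustration; for simplicity it uses Gaussian MFs, the product T-norm, and the same α in Layers 3 and 5):

```python
import numpy as np

def mi_anfis_forward(bag, centers, sigmas, b, alpha=1.0):
    """Forward pass of MI-ANFIS for one bag (M x D matrix).
    centers, sigmas: (R, D) premise parameters; b: (R, D+1) consequents."""
    def S(x):  # softmax T-conorm of (8)
        e = np.exp(alpha * (x - x.max()))
        return np.sum(x * e / e.sum())
    R = centers.shape[0]
    w, f = np.zeros(R), np.zeros(R)
    for i in range(R):
        mu = np.exp(-(bag - centers[i]) ** 2 / (2 * sigmas[i] ** 2))  # Layer 1
        r = mu.prod(axis=1)                    # Layer 2: truth instances (7)
        w[i] = S(r)                            # Layer 3: firing strength (9)
        f[i] = S(b[i, 0] + bag @ b[i, 1:])     # consequent part of (12)
    w_bar = w / w.sum()                        # Layer 4: normalization (10)
    return np.sum(w_bar * f)                   # Layers 5-6: overall output (13)
```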
To learn the parameters of the proposed MI-ANFIS network, we propose a generalization of the basic
learning algorithm presented by Jang [45]. Our variation differs from the standard ANFIS backpropagation
learning rule due to the additional layers (Layers 3 and 5) and the use of the "softmax" function (in (9)
and (11)). Thus, all update equations need to be rederived.

BackPropagation Learning Rule: we assume that we have $N$ training bags, $\mathbf{B} = \{B_p \mid p = 1, \dots, N\}$,
and their corresponding labels $\mathbf{T} = \{t_p \mid p = 1, \dots, N\}$. After presenting the $p$th training bag, we compute
its squared error measure:

$$E_p = (t_p - O_p)^2. \quad (14)$$

In (14), $t_p$ is the desired bag output, and $O_p$ is the computed output of the network when presented with
training bag $B_p$. Recall that labels at the instance level are not available and errors can be computed only
at the bag level.
The overall error measure of the network after presenting all $N$ bags is

$$E = \sum_{p=1}^{N} E_p. \quad (15)$$
To develop the gradient descent optimization on $E$, we compute the error rate for the $p$th training bag at
each output node $O_{(l,i)}$. This error rate $\varepsilon_{(l,i)}$ (where $1 \le l \le 6$ indicates the MI-ANFIS layer) is defined as

$$\varepsilon_{(l,i)} = \frac{\partial E_p}{\partial O_{(l,i)}}. \quad (16)$$
At the output node, we have

$$\varepsilon_{(6,1)} = \frac{\partial E_p}{\partial O_{(6,1)}} = \frac{\partial E_p}{\partial O_p} = -2(t_p - O_p). \quad (17)$$
For non-output nodes (i.e., internal nodes, $l < 6$), we derive the error rate using the chain rule:

$$\varepsilon_{(l,i)} = \frac{\partial E_p}{\partial O_{(l,i)}} = \sum_{h=1}^{\text{Card}(l+1)} \frac{\partial E_p}{\partial O_{(l+1,h)}}\, \frac{\partial O_{(l+1,h)}}{\partial O_{(l,i)}}, \quad (18)$$

where $\text{Card}(l+1)$ refers to the number of nodes at layer $l+1$.
Next, we seek to minimize the network error with respect to the premise parameters $\{c_{kj}, \sigma_{kj} \mid 1 \le k \le |O_3|,\ 1 \le j \le D\}$, and with respect to the consequent parameters $\{\mathbf{b}_i\}_{i=1}^{|O_3|}$.

The error rate with respect to a generic parameter $\theta$ can be computed using

$$\frac{\partial E_p}{\partial \theta} = \sum_{O \in G} \frac{\partial E_p}{\partial O}\, \frac{\partial O}{\partial \theta}, \quad (19)$$

where $G$ is the set of nodes whose outputs depend on $\theta$.
Using (15), the total error rate is given by

$$\frac{\partial E}{\partial \theta} = \sum_{p=1}^{N} \frac{\partial E_p}{\partial \theta}. \quad (20)$$
Update Rule For Premise Parameters: First we compute the error rates for the premise parameters $c_{kj}$
and $\sigma_{kj}$ using

$$\frac{\partial E_p}{\partial c_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}\, \frac{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}}, \quad (21)$$

and

$$\frac{\partial E_p}{\partial \sigma_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}\, \frac{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial \sigma_{kj}}. \quad (22)$$
Using the chain rule defined in (18), it can be shown that (see derivation in Appendix A)

$$\begin{aligned}
\frac{\partial E_p}{\partial c_{kj}} = {} & -2(t_p - O_p) \times S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_k, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,\, i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha\big( r_{(k,\, i+(k-1)M_p)} - S_\alpha(\{r_{(k,m)}\}_{m=1}^{M_p}) \big) \Big] \\
& \quad\; \times \prod_{d=1,\, d\ne j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, d)}\big) \\
& \quad\; \times \frac{x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}}{\sigma_{kj}^2}\, \exp\Big(-\frac{\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\Big) \Bigg). \quad (23)
\end{aligned}$$
The center parameters $c_{kj}$ are then updated using

$$\Delta c_{kj} = -\eta \frac{\partial E}{\partial c_{kj}}, \quad (24)$$

where $\eta$ is the learning rate.
The update formula for $\sigma_{kj}$ can be derived in a similar manner. It can be shown that

$$\begin{aligned}
\frac{\partial E_p}{\partial \sigma_{kj}} = {} & -2(t_p - O_p) \times S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_k, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,\, i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha\big( r_{(k,\, i+(k-1)M_p)} - S_\alpha(\{r_{(k,m)}\}_{m=1}^{M_p}) \big) \Big] \\
& \quad\; \times \prod_{d=1,\, d\ne j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, d)}\big) \\
& \quad\; \times \frac{\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}\big)^2}{\sigma_{kj}^3}\, \exp\Big(-\frac{\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\Big) \Bigg). \quad (25)
\end{aligned}$$
The MF widths, $\sigma_{kj}$, are then updated using

$$\Delta \sigma_{kj} = -\eta \frac{\partial E}{\partial \sigma_{kj}}. \quad (26)$$
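Because the analytic gradients (23) and (25) are intricate, it is useful in practice to validate them against finite differences; a generic check of this kind (our sketch, not part of the paper's algorithm) could look like:

```python
import numpy as np

def numeric_grad(loss_fn, params, eps=1e-6):
    """Central finite-difference gradient of loss_fn with respect to a
    flat parameter vector; used to validate the analytic updates."""
    g = np.zeros_like(params)
    for k in range(params.size):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[k] += eps
        p_minus[k] -= eps
        g[k] = (loss_fn(p_plus) - loss_fn(p_minus)) / (2 * eps)
    return g

# Example: loss_fn would wrap the squared error (14) around a forward
# pass, e.g. (hypothetical shapes R, D from the network definition):
# loss_fn = lambda c: (t_p - mi_anfis_forward(bag, c.reshape(R, D),
#                                             sigmas, b)) ** 2
```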
Update Rule For Consequent Parameters: The error rate for the consequent parameters $\{\mathbf{b}_i = \{b_0^i, \dots, b_D^i\},\ i = 1, \dots, |O_3|\}$ is defined as

$$\frac{\partial E_p}{\partial \mathbf{b}_i} = \left[ \frac{\partial E_p}{\partial b_0^i}, \frac{\partial E_p}{\partial b_1^i}, \dots, \frac{\partial E_p}{\partial b_D^i} \right], \quad (27)$$

where

$$\frac{\partial E_p}{\partial b_j^i} = \frac{\partial E_p}{\partial O_{(5,i)}}\, \frac{\partial O_{(5,i)}}{\partial b_j^i}, \quad \text{for } j = 1, \dots, D. \quad (28)$$
Using (18), it can be shown that (see Appendix B)

$$\begin{aligned}
\frac{\partial E}{\partial b_j^i} = \sum_{p=1}^{N} \frac{\partial E_p}{\partial b_j^i} = \sum_{p=1}^{N} \Bigg( -2(t_p - O_p)\, \overline{w}_i \sum_{m=1}^{M_p} & \frac{1}{\Big(\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}_i - \mathbf{x}_{pm} \cdot \mathbf{b}_i)\big)\Big)^2} \\
\times \Big[ x_{(p,m,j)} & \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}_i - \mathbf{x}_{pm} \cdot \mathbf{b}_i)\big) \\
- \mathbf{x}_{pm} \cdot \mathbf{b}_i & \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}_i - \mathbf{x}_{pm} \cdot \mathbf{b}_i)\big)\, \alpha\big(x_{(p,h,j)} - x_{(p,m,j)}\big) \Big] \Bigg), \quad (29)
\end{aligned}$$

where the factor $-2(t_p - O_p)$ follows from (17) via (28).
The consequent parameters are then updated using

$$\Delta b_j^i = -\eta \frac{\partial E}{\partial b_j^i}. \quad (30)$$

Equations (24), (26), and (30) can be used to update the $c_{kj}$, $\sigma_{kj}$, and $b_j^i$ parameters either on-line, bag by bag
(we want to emphasize here that the on-line learning is not achieved instance by instance, but rather bag by
bag), or off-line in batch mode after presentation of the entire data set.

The proposed MI-ANFIS learning algorithm is summarized in Algorithm 1.
Algorithm 1 MI-ANFIS Basic Learning Algorithm

Inputs:  B: the set of training bags.
         T: labels of the training bags.
         M: the number of instances in each bag.
         α: the constant used in the "softmax" function.
         η: the learning rate.
         E_max: the maximum number of epochs.
         ε: the minimum parameter change value.
Outputs: b_i: the sets of consequent parameters.
         c_i: the set of membership function centers.
         σ_i: the set of membership function widths.

Initialize b_i, c_i, and σ_i.
repeat
    Update b_i using (30): b_i(new) = b_i(old) + Δb_i.
    Update c_i using (24): c_i(new) = c_i(old) + Δc_i.
    Update σ_i using (26): σ_i(new) = σ_i(old) + Δσ_i.
until max(‖b_i(new) − b_i(old)‖, ‖c_i(new) − c_i(old)‖, ‖σ_i(new) − σ_i(old)‖) < ε or number of epochs > E_max
return b_i, c_i, σ_i
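A bag-by-bag (on-line) rendering of Algorithm 1 is sketched below (our illustration; the helper functions grad_b, grad_c, and grad_s are hypothetical stand-ins for the gradients in equations (29), (23), and (25)):

```python
import numpy as np

def train_mi_anfis(bags, labels, centers, sigmas, b, grad_b, grad_c, grad_s,
                   eta=0.1, max_epochs=150, tol=1e-5):
    """On-line MI-ANFIS training: one gradient step per bag.
    grad_b, grad_c, grad_s implement equations (29), (23), and (25)."""
    for epoch in range(max_epochs):
        old = np.concatenate([b.ravel(), centers.ravel(), sigmas.ravel()])
        for bag, t in zip(bags, labels):
            b       -= eta * grad_b(bag, t, centers, sigmas, b)  # update (30)
            centers -= eta * grad_c(bag, t, centers, sigmas, b)  # update (24)
            sigmas  -= eta * grad_s(bag, t, centers, sigmas, b)  # update (26)
        new = np.concatenate([b.ravel(), centers.ravel(), sigmas.ravel()])
        if np.max(np.abs(new - old)) < tol:   # stopping test of Algorithm 1
            break
    return centers, sigmas, b
```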
III. PREVENTING OVERFITTING: RULE DROPOUT
Neural networks with a large number of parameters are susceptible to overfitting. MI-ANFIS is no exception,
particularly when using a large number of multiple instance fuzzy rules and relatively small training
datasets. In such scenarios, some rules could co-adapt to the training data and degrade the network's ability to
generalize to unseen examples. In this section, we present a technique, known as Dropout, used to prevent
overfitting and rules' co-adaptation.
Dropout is a regularization method that was introduced by Hinton et al. [46] to alleviate the serious problem
of overfitting in deep neural networks. Over the years, many methods have been developed to reduce
overfitting, including using a validation dataset to stop training as soon as the performance gets worse,
adding weight penalties using L1 and L2 regularization, or artificially augmenting the training dataset using
label-preserving transformations. However, as noted by Hinton [46], the best way to regularize a fixed-size
model is to average the predictions of all possible settings of the parameters, each weighted by its posterior
probability given the training data. This can be achieved by combining the predictions of an exponential
number of models. Combining several models with different architectures has the advantage of better
generalization and, consequently, better testing performance. While generating an ensemble of models is
trivial, training them all is prohibitively expensive.
Generally, Dropout works by setting to 0 the output of each node in a given layer with probability $1-p$
($p$ typically equals 0.5) during training. Nodes that are dropped out do not contribute to the parameter
updates. During testing, all nodes are used but their outputs are weighted by the probability $p$. Following
this strategy, every time a new training example is presented, the network samples and trains a different
architecture. In other words, Dropout trains an ensemble of $2^N$ networks ($N$ being the number of
nodes) simultaneously, leading to an important speedup in training time as compared to traditional ensemble
methods. Figure 4 and Figure 5 illustrate the Dropout model.

Fig. 4. Dropout neural network model. (a) is a standard neural network. (b) is the same network after applying dropout. Dotted lines indicate
a node that has been dropped. (source [46])

Fig. 5. Illustration of Dropout application. (a) a node is dropped with probability $1-p$ at training time. (b) at test time the node is always
present and its outputs are weighted by $p$. (source [46])
In this paper, we propose to adopt the Dropout strategy to regularize MI-ANFIS networks. Typically,
overfitting occurs in MI-ANFIS networks when a set of multiple instance rules co-adapt to the provided
data early during the training process and prevent the remaining rules from learning, thus degrading the
network's generalization capability. While the Dropout technique could be applied to MI-ANFIS as is (given
the inherited neural network nature of the architecture), care should be exercised when selecting nodes to
include in the list of randomly dropped out nodes. MI-ANFIS nodes are different from those of standard
neural networks as they are grouped into rules to model and express linguistic terms. Simply dropping a
few nodes from a given rule can change its role and could severely handicap the fuzzy inference process.
Hence, Dropout should be executed differently. In deep neural nets, Dropout is applied to selected layers
(vertically); for MI-ANFIS, we propose to apply Dropout on a rule by rule basis (i.e., horizontally). Either
the whole rule is included, or the whole rule is dropped. This can be achieved by applying Dropout to
Layer 5 (see Figure 6), i.e., setting to zero the output of the "to be dropped out" rules. We will refer to
this derived technique as "Rule Dropout". Using a Rule Dropout strategy to train MI-ANFIS networks is
approximately equivalent to sampling and training an ensemble of $2^R$ networks ($R$ being the number of rules).

Let $p$ be the probability with which a rule is present. Formally, Rule Dropout is applied to Layer 5 during
training as follows:

$$O_{(5,i)} = h_i\, \overline{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_i, \mathbf{x}_{p2} \cdot \mathbf{b}_i, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_i), \quad (31)$$

where

$$h_i \sim \text{Bernoulli}(p) \quad (32)$$

is a Bernoulli random variable with probability $p$ of being 1. During testing, the Layer 5 output is scaled by
$p$, i.e., $O_{(5,i)} = p\, \overline{w}_i\, S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_i, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_i)$. Figure 6 illustrates our MI-ANFIS network with 3
multiple instance fuzzy rules where, at a given iteration, rule 2 has been dropped out.
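In code, Rule Dropout amounts to masking whole rule outputs during training and rescaling them at test time; a minimal sketch of (31) and (32) (ours) is:

```python
import numpy as np

def rule_dropout_combine(w_bar, f, p=0.7, training=True, rng=np.random):
    """Apply Rule Dropout at Layer 5 and sum at Layer 6.
    w_bar: (R,) normalized firing strengths; f: (R,) rule consequents."""
    if training:
        h = rng.binomial(1, p, size=w_bar.shape)  # h_i ~ Bernoulli(p), (32)
        return np.sum(h * w_bar * f)              # dropped rules output 0, (31)
    return np.sum(p * w_bar * f)                  # test time: scale by p
```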
Fig. 6. Illustration of Rule Dropout application. Dotted lines indicate a rule that has been dropped.

Deriving the new update equations for the MI-ANFIS parameters requires taking into consideration the added
Bernoulli random variable, $h_i$. It is straightforward to show that the new gradients with respect to the premise
and consequent parameters are given by

$$\begin{aligned}
\frac{\partial E_p}{\partial c_{kj}} = {} & -2(t_p - O_p) \times h_k \times S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_k, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,\, i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha\big( r_{(k,\, i+(k-1)M_p)} - S_\alpha(\{r_{(k,m)}\}_{m=1}^{M_p}) \big) \Big] \\
& \quad\; \times \prod_{d=1,\, d\ne j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, d)}\big) \\
& \quad\; \times \frac{x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}}{\sigma_{kj}^2}\, \exp\Big(-\frac{\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\Big) \Bigg), \quad (33)
\end{aligned}$$

and
$$\begin{aligned}
\frac{\partial E_p}{\partial \sigma_{kj}} = {} & -2(t_p - O_p) \times h_k \times S_\alpha(\mathbf{x}_{p1} \cdot \mathbf{b}_k, \dots, \mathbf{x}_{pM_p} \cdot \mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,\, i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha\big( r_{(k,\, i+(k-1)M_p)} - S_\alpha(\{r_{(k,m)}\}_{m=1}^{M_p}) \big) \Big] \\
& \quad\; \times \prod_{d=1,\, d\ne j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, d)}\big) \\
& \quad\; \times \frac{\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}\big)^2}{\sigma_{kj}^3}\, \exp\Big(-\frac{\big(x_{(p,\,(i+(k-1)M_p)[M_p],\, j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\Big) \Bigg). \quad (34)
\end{aligned}$$
In a similar manner,

$$\begin{aligned}
\frac{\partial E}{\partial b_j^i} = \sum_{p=1}^{N} \Bigg( -2(t_p - O_p)\, h_i\, \overline{w}_i \sum_{m=1}^{M_p} & \frac{1}{\Big(\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}_i - \mathbf{x}_{pm} \cdot \mathbf{b}_i)\big)\Big)^2} \\
\times \Big[ x_{(p,m,j)} & \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}_i - \mathbf{x}_{pm} \cdot \mathbf{b}_i)\big) \\
- \mathbf{x}_{pm} \cdot \mathbf{b}_i & \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph} \cdot \mathbf{b}_i - \mathbf{x}_{pm} \cdot \mathbf{b}_i)\big)\, \alpha\big(x_{(p,h,j)} - x_{(p,m,j)}\big) \Big] \Bigg). \quad (35)
\end{aligned}$$
As can be seen, equations (33), (34), and (35) are zeroed when the rule is dropped out (i.e., $h_k = 0$
and $h_i = 0$). Thus, the premise and consequent parameters of a dropped rule are not updated.
IV. EXPERIMENTAL RESULTS
A. Synthetic Data
To illustrate the proposed multiple instance fuzzy inference and its ability to learn from data without
instance-level labels, we first use a simple 2-D synthetic dataset. These data were generated from a
distribution of two positive contexts with centers at (0.5, 0.5) and (1.5, 1.5), and with a fixed standard
deviation. These centers are marked with squares in Figure 7, and the circles around the centers indicate
regions within 1 standard deviation. These regions are considered the two target concepts. From each positive
concept we generated 50 bags. Each bag has a random number, between 2 and 10, of instances. Each bag
from concept 1 (or 2) will have at least one instance close to target concept 1 (or 2). We also generated 50
negative bags randomly from non-concept regions. Negative bags have all of their instances outside
both target concepts. In Figure 7, instances from negative bags are shown as ".", and instances from positive
bags are shown as "+" or "△" depending on the underlying concept. In Figure 7, we highlight one bag
from concept 1 by circling all of its instances. As can be seen, one of its instances is within the one standard
deviation region of target concept 1 while the other instances are scattered around. We should emphasize
here that the centers of the target concepts in Figure 7 are unknown and not used by the learning algorithm.
They are shown here for illustration and validation purposes only.
1) MI-ANFIS Rules Learning: In the following, we show that the MI-ANFIS Learning Algorithm (Algorithm 1)
is capable of identifying positive concepts as well as their corresponding multiple instance
fuzzy rules. To initialize the premise parameters, we partition the instances' space into 6 partitions generated
randomly (a grid or manual partitioning could also be used). We use the partitions' centers as initial centers for the Gaussian MFs, and we initialize all
standard deviation parameters to a default value of 0.5.

The initial fuzzy sets (MFs) of the rules, before training, are displayed in Figure 8 in dashed lines. As
can be seen, the initial 6 partitions simply cover random quadrants of the 2D instance space (if no label
information is used, as in this case, the data would appear to have a uniform distribution (refer to Figure 7)).
The learned fuzzy sets after convergence are shown in Figure 8 in bold lines. As can be seen, the system
has correctly identified the positive concepts, and at the same time identified irrelevant rules (MI-Rule 1,
MI-Rule 3 and MI-Rule 5) and assigned low output values to each: 0.3, 0.06, and 0.12, respectively.
B. Benchmark Datasets
To provide a quantitative evaluation of the proposed MI-ANFIS, we apply it to five benchmark data
sets commonly used to evaluate MIL methods: The MUSK1, MUSK2 [19], and Fox, Tiger, and Elephant
Fig. 7. Instances from positive and negative bags drawn from data that have 2 concepts. The center of each target concept is indicated by
a square and the circles indicate the region within 1 standard deviation from the mean. Instances from negative bags are shown as ".", and
instances from positive bags are shown as "+" or "△". Instances from one sample positive bag are circled.
from the COREL data set [47]. MUSK1 and MUSK2 data sets consist of descriptions of molecules and
the objective is to classify whether a molecule smells musky [48]. In these data sets, each bag represents a
molecule and instances within each bag represent the different low-energy conformations of the molecule.
Each instance is characterized by 166 features. MUSK1 has 92 bags, of which 47 are positive, and MUSK2
has 102 bags, of which 39 are positive. The other data sets (Fox, Tiger, and Elephant), classify whether an
image contains the corresponding animal. Each data set consists of 200 images (bags): 100 positive images
containing the target animal and 100 negative images containing other animals. Each image is represented
as a set of patches (instances) and each patch is in turn represented by a 230 dimensional feature vector
describing its color, texture and shape information. We note that the three data sets are independent and each is used
as a binary classification problem (positive vs. negative). Table I summarizes the characteristics of the five
data sets. It is to be noted that for each benchmark data set, PCA was applied to reduce the dimensionality
of the features in order to speed up MI-ANFIS training and increase the interpretability of the generated
multiple instance fuzzy rules.
For each experiment, we construct a zero-order MI-ANFIS with a given number of multiple instance rules.
For MI-ANFIS the number of rules is not critical. It should be large enough to cover the diverse regions of
the input space and the multiple concepts. If the specified number of rules is too large, some will vanish
as illustrated in Figure 8 for the example with the synthetic data. Also, a larger number of rules leads to
slower training. We use Gaussian MFs to describe the input fuzzy sets. For initialization, we use the FCM
algorithm to cluster the instances of the positive bags into a number of clusters equal to the number of fuzzy
rules, and we initialize the MFs' centers as the clusters' centers. MI-ANFIS was trained and tested using ten-fold
cross validation. Table II summarizes all parameters used in training the MI-ANFIS (parameters were
manually selected using trial and error). We note that the reason behind using larger standard deviations
for the MUSK1 and MUSK2 datasets is the higher dimensionality of these data sets. We expect the sparsity
to increase with the dimensions of the feature space, so we set the standard deviations to larger values to
allow the initial rules to cover the entirety of the input space.

Fig. 8. Learned MFs after convergence of the MI-ANFIS training algorithm. Initial MFs before training are marked with dashed lines. Learned
MFs are shown in solid bold lines.

TABLE I. BENCHMARK DATA SETS

Data set   dim. (PCA)   No. Bags   Positive   Negative   No. Instances
MUSK1      166 (25)     92         47         45         240
MUSK2      166 (25)     102        39         63         11044
Fox        230 (10)     200        100        100        213
Tiger      230 (10)     200        100        100        113
Elephant   230 (10)     200        100        100        213
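A minimal initialization sketch is given below (our illustration; we substitute scikit-learn's KMeans for the FCM clustering used in the paper, which only affects how the initial centers are obtained):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_premise_params(bags, labels, n_rules, default_sigma):
    """Initialize Gaussian MF centers by clustering the instances of
    the positive bags (KMeans here as a stand-in for FCM)."""
    pos = np.vstack([b for b, y in zip(bags, labels) if y == 1])
    centers = KMeans(n_clusters=n_rules, n_init=10).fit(pos).cluster_centers_
    sigmas = np.full_like(centers, default_sigma)  # e.g., 100 for MUSK, 10 otherwise
    return centers, sigmas
```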
First, to illustrate the advantage of using MI-ANFIS over the traditional ANFIS, we compare these
two algorithms on the two MUSK data sets. Since ANFIS cannot learn from ambiguously labeled data,
for the sake of comparison, we consider the naive MIL assumption where all instances from positive
bags are considered positive and all instances from negative bags are considered negative. We refer to this
implementation as Naive-ANFIS. The results are summarized in Table III where the performance is reported
in terms of prediction accuracy averaged over all 10 cross validation sets (% correct ± standard deviation).
TABLE II. MI-ANFIS TRAINING PARAMETERS

Parameter           MUSK1   MUSK2   Fox   Tiger   Elephant
No. of MI Rules     6       3       2     4       3
No. of Inputs       25      25      10    10      10
MF's σ              100     100     10    10      10
Output parameters   1s      1s      1s    1s      1s
Softmax's α         1       1       1     1       1
Learning rate       0.1     0.1     0.1   0.1     0.1
TABLE III. COMPARISON OF MI-ANFIS PREDICTION ACCURACY (IN PERCENT) TO NAIVE-ANFIS ON THE BENCHMARK DATA SETS.

Algorithms    MUSK1         MUSK2         Fox           Tiger         Elephant
MI-ANFIS      93.49 ±0.76   90.58 ±1.31   66.4 ±2.77    84.5 ±0.61    86.97 ±1.10
Naive-ANFIS   67.82 ±4.04   79.43 ±5.04   58.70 ±1.35   77.70 ±0.83   82.2 ±0.83
As can be seen, MI-ANFIS outperforms Naive-ANFIS significantly. This is because inaccurately labeled
instances within the positive bags were used for training the Naive-ANFIS. The difference in performance
between MI-ANFIS and Naive-ANFIS is greater for MUSK1 and MUSK2 because of the greater number
of instances per bag (more ambiguity).
TABLE IV. COMPARISON OF MI-ANFIS PREDICTION ACCURACY (IN PERCENT) TO OTHER METHODS ON THE BENCHMARK DATA SETS. RESULTS FOR THE 3 TOP PERFORMING METHODS ARE SHOWN IN BOLD FONT. WE USE REPORTED RESULTS; N/A INDICATES THAT A GIVEN ALGORITHM WAS NOT APPLIED TO THAT DATA SET.

Algorithms          MUSK1         MUSK2         Fox          Tiger        Elephant
MI-ANFIS            93.49 ±0.76   90.58 ±1.31   66.4 ±2.77   84.5 ±0.61   86.97 ±1.10
MILES [49]          86.3          87.7          N/A          N/A          N/A
APR [19]            92.4          89.2          N/A          N/A          N/A
DD [21]             88.9          82.5          N/A          N/A          N/A
DD-SVM [50]         85.8          91.3          N/A          N/A          N/A
EM-DD [51]          84.8          84.9          56.1         72.1         78.3
Citation-KNN [52]   92.4          86.3          N/A          N/A          N/A
MI-SVM [47]         77.9          84.3          57.8         84.0         81.4
mi-SVM [47]         87.4          83.6          58.2         78.4         82.2
MI-NN [53]          88.0          82.0          N/A          N/A          N/A
Bagging-APR [54]    92.8          93.1          N/A          N/A          N/A
RBF-MIP [43]        91.3 ±1.6     90.1 ±1.7     N/A          N/A          N/A
BP-MIP [42]         83.7          80.4          N/A          N/A          N/A
RBF-Bag-Unit [55]   90.3          86.6          N/A          N/A          N/A
MI-kernel [56]      88.0          89.3          60.3         84.2         84.3
PPPM-kernel [57]    95.6          81.2          60.3         80.2         82.4
MIGraph [56]        90.0          90.0          61.2         81.9         85.1
miGraph [56]        88.9          90.3          61.6         86.0         86.8
ALP-SVM [58]        86.3          86.2          66.0         86.0         83.5
MIForest [59]       85.0          82.0          64.0         82.0         84.0
Table IV compares the performance of the proposed algorithm to state of the art MIL algorithms on the
benchmark data sets.
Overall, MI-ANFIS is comparable to other MIL algorithms. In fact, on all tested data sets, MI-ANFIS
ranked consistently among the top three. For MUSK1, PPPM-kernel [57] performed the best (95.6%), but
this algorithm did not perform as well for the other sets. For MUSK2 Bagging-APR [54] achieved the best
accuracy, as reported by [49]. MI-ANFIS achieved the best average performance for the Fox and Elephant
data sets, and the second best performance after the miGraph [56] and ALP-SVM [58] methods for the Tiger
data set.

TABLE V. BENCHMARK DATA SETS

Data set   dim. (PCA)   No. Bags   Positive   Negative   No. Instances
Elephant   230 (10)     200        100        100        213
In order to demonstrate the gain in generalization acquired by MI-ANFIS when utilizing Rule Dropout,
we train an MI-ANFIS architecture for binary classification with and without Rule Dropout on a multiple
instance dataset sampled from COREL [47]. The dataset classifies whether an image contains an elephant or
not, and consists of 200 images (bags): 100 positive images containing the target animal and 100 negative
images containing other animals. Each image is represented as a set of patches (instances) and each patch
is in turn represented by 230 features describing color, texture and shape information. Before training, we
applied PCA to reduce the dimensionality of the features to 10 dimensions to speed up MI-ANFIS. Table
V summarizes the dataset characteristics. Two MI-ANFIS networks composed of 15 rules each, with one
network employing Rule Dropout (with p = 0.7; this hyper-parameter was selected based on trial and error),
were trained on 90% of the data, and the remaining 10% was used for testing (the split was done randomly).
Figure 9 shows the training and testing errors for both networks during 100 epochs. As can be seen,
without Rule Dropout, starting at epoch 20, testing performance begins to degrade while the training error
continues to decrease. In other words, overfitting begins to occur. Typically, using a cross validation data
set, this point can be detected and training would be stopped. However, this assumes that cross validation
data is available (or the training data is large enough to be split into training and testing) and, more importantly,
that the cross validation data is representative of the testing data. On the other hand, when using Rule
Dropout, overfitting is significantly reduced and MI-ANFIS achieved better testing performance at the end
of the training phase. Even though the training and testing error rates oscillate when using Rule Dropout
(due to the randomness of the dropout process), overall MI-ANFIS achieved 0.1123 testing SSE with Rule
Dropout compared to 0.1451 testing SSE without Rule Dropout.
C. Application To Landmine Detection
In this section, we report the results of applying the proposed Multiple Instance Inference to fuse the
output of multiple discrimination algorithms for the purpose of landmine detection using Ground Penetrating
Radar (GPR). GPR data collected at different locations and different dates were used to train and test the
proposed MI-ANFIS. The alarm collection covers 319 encounters of various anti-tank mines with high
metal content (ATHM) and 422 encounters of various anti-tank mines with low metal content (ATLM).
The vehicle-mounted GPR sensor collects 3-dimensional data as the vehicle moves (Figure 10). The three
dimensions correspond to the spatial location on the ground (down-track, cross-track, and depth), as
shown in Figure 11. Figure 11(b) shows 2-D views of (depth, down-track) and (depth, cross-track) slices
of GPR data. As can be seen, the target signature does not extend over all depth values. Thus, one global
feature vector may not discriminate between mines and clutter effectively. To overcome this limitation,
most classifiers developed for this application extract multiple features from small overlapping windows at
multiple depths. In the following, we assume that each training alarm (3-D data cube) has been divided
into 15 overlapping (depth-wise) patches. Each patch is processed by 2 discrimination algorithms. These
algorithms are based on the Edge Histogram Descriptor (EHD) [60]. The first one, called EHDDT, extracts
features from each 2-D (down-track, depth) patch. The second discrimination algorithm, called EHDCT,
extracts information for the 2-D (Cross-Track, depth) patch. In addition, auxiliary features are synthesized
from each patch. In particular, “SignatureWidth” in the Down-track direction and “SignatureWidth” in the
Cross-Track direction are used to capture the effective width of the hyperbolic shape within each patch.
These auxiliary features are intended to provide contextual information that can support the relevance of
the EHDDT and/or EHDCT.

Fig. 9. Training and testing errors for two MI-ANFIS networks with and without Rule Dropout.

Fig. 10. Vehicle mounted GPR system for detecting buried Landmines.

As a result, each alarm is represented by a bag of 15 instances and each
instance is a 4-dimensional feature vector. Each bag is labeled as positive (has a target) or negative (non-target),
but labels at the instance level are not available. The X-Y ground truth information about the target
is available (using GPS and known target position on calibration lanes). However, the depth position cannot
be easily identified as it depends on target size, burial depth, soil type, and other environmental conditions.
Manually extracting the depth location can be very tedious. Similarly, during testing, it is not trivial how
to combine partial confidence values from the multiple windows. Therefore, the MIL paradigm is suitable
to solve this problem.
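Schematically, each alarm then becomes a 15-instance bag of 4-dimensional vectors; the assembly code below is hypothetical (the feature names follow the text, but the actual extraction pipeline is not part of this paper's listings):

```python
import numpy as np

def alarm_to_bag(ehd_dt, ehd_ct, width_dt, width_ct):
    """Stack per-depth-patch features into a (15 x 4) bag.
    Each argument is a length-15 vector, one value per depth patch."""
    return np.column_stack([ehd_dt, ehd_ct, width_dt, width_ct])

# bag = alarm_to_bag(...); label = 1 (target) or 0 (clutter),
# with no labels available for the 15 individual depth patches.
```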
(a) 3D GPR Raw data. (b) 2-D views of the raw GPR data.
Fig. 11. 3-dimensional and 2-dimensional raw GPR data.
We construct a zero-order MI-ANFIS (constant consequent parameters) having 5 multiple instance rules,
and employing Gaussian MFs to describe the input fuzzy sets. To initialize the system’s parameters, first,
we use the FCM algorithm to cluster the instances that belong to positive bags into 5 clusters, and we
initialize the MFs’ centers as the clusters’ centers. Then, we initialize the standard deviations of the input
MFs and the output parameters to 1.
After initialization, we run MI-ANFIS basic learning algorithm (Algorithm 1) to jointly learn a fuzzy
description of the positive concepts as well as optimal rules’ output. Figure 12 is a graphical representation
of the 5 multiple instance rules prior to running the optimization process (dotted line curves) and the learned
rules after training (continuous curves). The fuzzy sets of the rules’ antecedents describe the location and
the extent of the positive concepts in the 4-D instance feature space. The rules’ consequent values can
be interpreted as an assessment of the “positivity” of each learned concept. For instance, the MI-ANFIS
learned the following two positive concepts to describe targets:
R1: If EHDDT is Medium and EHDCT is Medium and WidthDT is High and WidthCT is High then o1 = 1.15.
R2: If EHDDT is Medium and EHDCT is Low and WidthDT is High and WidthCT is High then o2 = 0.94.
1) Results: The proposed fusion method was trained and tested using 10-fold cross validation. Figure
14 displays the ROCs (averaged over the 10 folds). To provide a quantitative evaluation of the proposed
multiple instance fuzzy inference fusion method, we compare its performance to fusion methods based
on the standard Mamdani [12] and standard ANFIS [61]. Since the standard Mamdani and ANFIS cannot
learn from partially labeled data, an expert was used to label all instances of all bags within the training data.
We also compare MI-ANFIS performance to naive MIL implementations of Mamdani (NaiveMamdani)
and ANFIS (NaiveANFIS) where all instances from positive bags are considered positive and all instances
from negative bags are considered negative.
As can be seen in Figure 14, MI-ANFIS performed better than the standard ANFIS on the large dataset,
and as expected NaiveANFIS performed worse. The standard ANFIS performed better at low FAR (false
alarm rate); the reason is that strong mines are easy to identify manually and in this case the ground
truth helps. However, weaker mine signatures are not as easy to localize, so the truth may not be as accurate
and can degrade the performance. Overall, MI-ANFIS outperformed all presented fusion approaches and
the individual discriminators (EHDDT and EHDCT). This is due to the ability of MI-ANFIS to overcome
labeling ambiguity by learning meaningful concepts.

Fig. 12. MI-ANFIS fusion rules before (dotted lines) and after training (solid lines).

Fig. 13. A plot of MI-ANFIS RMSE during 150 training epochs.
Fig. 14. Comparison of the individual discriminators, the proposed MI-ANFIS, Mamdani, ANFIS, NaiveMamdani, and NaiveANFIS fusion
methods. Note that the Mamdani and ANFIS systems have the advantage of instance-level labels on training data.
As in standard ANFIS, we cannot prove convergence of the algorithm. However, in all conducted
experiments MI-ANFIS converged in less than 150 epochs. Figure 13 plots the root mean squared error
(RMSE) vs. the training epoch number.
V. RELATED WORK
MI-ANFIS deals with ambiguity by introducing the novel concept of truth instances: when carrying out
reasoning using a bag of instances at Layer 2 (Figure 3), a proposition will not have only one degree of
truth; it will have multiple degrees of truth ($r_{ij}$), which we call truth instances. This effectively encodes the
third vagueness component of ambiguity and increases the expressive power of traditional fuzzy logic. In
addition to effectively modeling ambiguity, MI-ANFIS has the inherited capability of assessing the prediction
quality by outputting soft values. For example, depending on the $\alpha$ parameter of the softmax in Layer 3, MI-ANFIS
can assign higher outputs to bags with more than one positive instance, thus giving the end user
a way to assess the positiveness of a given bag.
Learning positive target concepts from ambiguously labeled data has been the core task of various MIL
algorithms (e.g. Diverse Density [16]). MI-ANFIS has proven that it can learn positive concepts effectively
while jointly providing a fuzzy representation of such regions. The fuzzy representation is combined into
meaningful and simple multiple instance rules that can be easily visualized and interpreted.
Compared to previously proposed multiple instance neural networks, such as Multiple Instance Neural Networks
[42] (MI-NN) and Multiple Instance RBF Neural Networks [43] (RBF-MIP), MI-ANFIS's advantage
is the use of multiple instance fuzzy logic to learn a fuzzy representation of true positive concepts. MI-NN
only learns standard neural network weights that do not carry any information regarding target concepts.
On the other hand, while standard RBF neural networks have been shown to be equivalent to zero order
traditional Sugeno systems under certain constraints [62], and are thus capable of learning a fuzzy representation
of the inputs, RBF-MIP networks have a different architecture and do not employ adaptive radial basis
functions in the first layer. Instead, they represent the inputs by computing their distances to clusters of
training bags. This latter method is computationally expensive and its success depends greatly on the quality
of the training data, as it takes into consideration all the training examples, which may include wrongly
(noisily) labeled bags. RBF-MIP learns only discriminative regions of the bags space and does not learn
true positive concepts. Moreover, the MI-ANFIS learning algorithm can be updated to support a wide range
of loss functions (criteria) such as cross entropy [63], maximum margin [64], etc. MI-NN is designed
to use a handcrafted loss function which is largely responsible for the multiple instance behavior of the
system and cannot be changed without substantially changing the architecture of MI-NN. This could be
disadvantageous if MI-NN is to be used to solve multiple instance, multiple class classification problems.
VI. CONCLUSIONS
In this paper, we have introduced a new framework to accomplish fuzzy inference with multiple instance
data. Our work generalizes the Sugeno fuzzy inference style to reason with multiple instances; the new inference
style is called MI-Sugeno. We then used MI-Sugeno to develop MI-ANFIS, a novel neuro-fuzzy architecture
that extends the standard Adaptive Neuro-Fuzzy Inference System (ANFIS) to reason with bags of instances
in order to solve multiple instance learning problems. We developed a BackPropagation learning algorithm
and showed that the proposed system is capable of learning meaningful concepts from ambiguously labeled
data.
MI-ANFIS deals with ambiguity by introducing the novel concept of truth instances: when carrying out reasoning
using a bag of instances at Layer 2 (Figure 3), a proposition will not have only one degree of truth;
it will have multiple degrees of truth ($r_{ij}$), which we call truth instances. This effectively encodes the third
vagueness component of ambiguity and increases the expressive power of traditional fuzzy logic.
Learning positive concepts from ambiguously labeled data has been the core task of various MIL algorithms.
MI-ANFIS has proven that it can learn positive concepts effectively while jointly providing a fuzzy
representation of such regions. The fuzzy representation is combined into meaningful and simple multiple
instance rules that can be easily visualized and interpreted.
Using synthetic and benchmark data sets we showed that the proposed Multiple Instance Fuzzy Inference
is comparable to state of the art MI machine learning algorithms. We also used our framework for a
real application and applied it to fuse the output of multiple discrimination algorithms for the purpose of
landmine detection using Ground Penetrating Radar.
In situations where overfitting is imminent, for example when using relatively small datasets to learn very
large MI-ANFIS networks, we proposed a regularization technique, called Rule Dropout, and showed
that it can be used to train MI-ANFIS systems with better generalization.
In future work, we intend to develop a multiple class version of MI-ANFIS to be used to solve multiple
class classification problems. In addition, we will conduct a detailed analysis of MI-ANFIS convergence.
APPENDIX A
DERIVATION OF PREMISE PARAMETERS UPDATE RULES
From equations (21) and (22), the error rates for the premise parameters $c_{kj}$ and $\sigma_{kj}$ are defined as follows:

$$\frac{\partial E_p}{\partial c_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}\, \frac{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}},$$

and

$$\frac{\partial E_p}{\partial \sigma_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}\, \frac{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial \sigma_{kj}}.$$
Using the chain rule defined in (18), we have

$$\frac{\partial E_p}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}} = \frac{\partial E_p}{\partial O_{(6,1)}} \times \frac{\partial O_{(6,1)}}{\partial O_{(5,k)}} \times \frac{\partial O_{(5,k)}}{\partial O_{(4,k)}} \times \frac{\partial O_{(4,k)}}{\partial O_{(3,k)}} \times \frac{\partial O_{(3,k)}}{\partial O_{(2,\, i+(k-1)M_p)}} \times \frac{\partial O_{(2,\, i+(k-1)M_p)}}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}. \quad (36)$$
Hence, (21) is equivalent to

$$\begin{aligned}
\frac{\partial E_p}{\partial c_{kj}} = {} & \frac{\partial E_p}{\partial O_{(6,1)}} \times \frac{\partial O_{(6,1)}}{\partial O_{(5,k)}} \times \frac{\partial O_{(5,k)}}{\partial O_{(4,k)}} \times \frac{\partial O_{(4,k)}}{\partial O_{(3,k)}} \\
& \times \sum_{i=1}^{M_p} \left[ \frac{\partial O_{(3,k)}}{\partial O_{(2,\, i+(k-1)M_p)}} \times \frac{\partial O_{(2,\, i+(k-1)M_p)}}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}} \times \frac{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}} \right]. \quad (37)
\end{aligned}$$
From (17), we have

\frac{\partial E_p}{\partial O^{(6,1)}} = -2\,(t_p - O_p).   (38)

It is also straightforward to show that

\frac{\partial O^{(6,1)}}{\partial O^{(5,k)}} = \frac{\partial \left( \sum_{i=1}^{|O_3|} O^{(5,i)} \right)}{\partial O^{(5,k)}} = 1,   (39)

and

\frac{\partial O^{(5,k)}}{\partial O^{(4,k)}} = \frac{\partial \left( \bar{w}_k\, S_\alpha(x_{p1} \cdot b_k,\, x_{p2} \cdot b_k,\, \ldots,\, x_{pM_p} \cdot b_k) \right)}{\partial \bar{w}_k} = S_\alpha(x_{p1} \cdot b_k,\, x_{p2} \cdot b_k,\, \ldots,\, x_{pM_p} \cdot b_k).   (40)
Continuing with the derivation, we have

\frac{\partial O^{(4,k)}}{\partial O^{(3,k)}} = \frac{\partial \bar{w}_k}{\partial w_k} = \frac{\partial}{\partial w_k} \left( \frac{w_k}{\sum_{l=1}^{|O_3|} w_l} \right) = \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\left( \sum_{l=1}^{|O_3|} w_l \right)^2},   (41)
and

\frac{\partial O^{(3,k)}}{\partial O^{(2,\, i+(k-1)M_p)}} = \frac{\partial S_\alpha\!\left( \{ r_{(k,j)} \}_{j=1}^{M_p} \right)}{\partial r_{(k,\, i+(k-1)M_p)}} = \frac{e^{\alpha\, r_{(k,\, i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha\, r_{(k,m)}}} \left[ 1 + \alpha \left( r_{(k,\, i+(k-1)M_p)} - S_\alpha\!\left( \{ r_{(k,m)} \}_{m=1}^{M_p} \right) \right) \right].   (42)
The details of the derivation of the derivative of the “softmax” function can be found in [21].
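For readers who wish to check (42) numerically, the sketch below (ours; names are hypothetical) implements S_α and the closed-form gradient above, and verifies the gradient against central finite differences.

```python
import numpy as np

def softmax_S(r, alpha):
    """S_alpha(r) = sum_i r_i e^{alpha r_i} / sum_j e^{alpha r_j} (soft max)."""
    e = np.exp(alpha * (r - r.max()))      # shift by max(r) for stability
    return float((r * e).sum() / e.sum())

def softmax_S_grad(r, alpha):
    """Closed form of (42): (e^{alpha r_i} / Z) * (1 + alpha * (r_i - S_alpha))."""
    e = np.exp(alpha * (r - r.max()))
    s = (r * e).sum() / e.sum()
    return (e / e.sum()) * (1.0 + alpha * (r - s))

rng = np.random.default_rng(0)
r, alpha, h = rng.random(5), 4.0, 1e-6
num = np.array([(softmax_S(r + h * np.eye(5)[i], alpha)
                 - softmax_S(r - h * np.eye(5)[i], alpha)) / (2 * h)
                for i in range(5)])
assert np.allclose(num, softmax_S_grad(r, alpha), atol=1e-5)
```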
Next, we need to compute \partial O^{(2,\, i+(k-1)M_p)} / \partial O^{(1,\, i+[(k-1)D+(j-1)]M_p)}. We have

O^{(2,\, i+(k-1)M_p)} = \prod_{d=1}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil,\, d}}\!\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, d)} \right),   (43)

and

O^{(1,\, i+[(k-1)D+(j-1)]M_p)} = \mu_{A_{(k,j)}}\!\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, j)} \right).   (44)

Thus,

\frac{\partial O^{(2,\, i+(k-1)M_p)}}{\partial O^{(1,\, i+[(k-1)D+(j-1)]M_p)}} = \frac{\partial \left( \prod_{d=1}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil,\, d}}\!\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, d)} \right) \right)}{\partial \mu_{A_{(k,j)}}\!\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, j)} \right)} = \prod_{d=1,\, d \neq j}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil,\, d}}\!\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, d)} \right).   (45)
Finally, we have

\frac{\partial O^{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}} = \frac{\partial \mu_{A_{(k,j)}}\!\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, j)} \right)}{\partial c_{kj}} = \frac{x_{(p,\, (i+(k-1)M_p)[M_p],\, j)} - c_{kj}}{\sigma_{kj}^{2}} \times \exp\!\left( -\frac{\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, j)} - c_{kj} \right)^{2}}{2\sigma_{kj}^{2}} \right).   (46)

Substituting these derivatives in (37), we obtain (25).
The update formula for \sigma_{kj} can be derived in a similar manner. It can be shown that

\frac{\partial E_p}{\partial \sigma_{kj}} = -2\,(t_p - O_p) \times S_\alpha(x_{p1} \cdot b_k,\, \ldots,\, x_{pM_p} \cdot b_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\left( \sum_{l=1}^{|O_3|} w_l \right)^2} \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha\, r_{(k,\, i+(k-1)M_p)}}}{\sum_{m=1}^{M_p} e^{\alpha\, r_{(k,m)}}} \left[ 1 + \alpha \left( r_{(k,\, i+(k-1)M_p)} - S_\alpha\!\left( \{ r_{(k,m)} \}_{m=1}^{M_p} \right) \right) \right] \times \prod_{d=1,\, d \neq j}^{D} \mu_{A_{\lceil (i+(k-1)M_p)/M_p \rceil,\, d}}\!\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, d)} \right) \times \frac{\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, j)} - c_{kj} \right)^{2}}{\sigma_{kj}^{3}} \times \exp\!\left( -\frac{\left( x_{(p,\, (i+(k-1)M_p)[M_p],\, j)} - c_{kj} \right)^{2}}{2\sigma_{kj}^{2}} \right) \Bigg).   (47)
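The two membership-function derivatives appearing in (46) and (47) can be checked in isolation; the sketch below (ours; names are hypothetical) verifies the Gaussian derivatives with respect to the center c and width σ against central finite differences.

```python
import numpy as np

def mu(x, c, sigma):
    """Gaussian membership function."""
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def dmu_dc(x, c, sigma):
    """Derivative w.r.t. the center, matching (46): (x - c)/sigma^2 * mu."""
    return (x - c) / sigma ** 2 * mu(x, c, sigma)

def dmu_dsigma(x, c, sigma):
    """Derivative w.r.t. the width, the factor in (47): (x - c)^2/sigma^3 * mu."""
    return (x - c) ** 2 / sigma ** 3 * mu(x, c, sigma)

x, c, s, h = 0.7, 0.2, 0.5, 1e-6
assert abs((mu(x, c + h, s) - mu(x, c - h, s)) / (2 * h) - dmu_dc(x, c, s)) < 1e-6
assert abs((mu(x, c, s + h) - mu(x, c, s - h)) / (2 * h) - dmu_dsigma(x, c, s)) < 1e-6
```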
APPENDIX B
DERIVATION OF CONSEQUENT PARAMETERS UPDATE RULE
The error rates for the consequent parameters are defined in equations (27) and (28). Next, we compute
\partial E_p / \partial O^{(5,i)} using the previously defined chain rule in (18), and obtain

\frac{\partial E_p}{\partial O^{(5,i)}} = \frac{\partial E_p}{\partial O^{(6,1)}} \times \frac{\partial O^{(6,1)}}{\partial O^{(5,i)}}.   (48)

From (17), we have

\frac{\partial E_p}{\partial O^{(6,1)}} = -2\,(t_p - O_p),   (49)

and, as in (39),

\frac{\partial O^{(6,1)}}{\partial O^{(5,i)}} = 1.   (50)
Continuing with the derivation, we have

\frac{\partial O^{(5,i)}}{\partial b_j^i} = \frac{\partial \left( \bar{w}_i\, S_\alpha(x_{p1} \cdot b_i,\, x_{p2} \cdot b_i,\, \ldots,\, x_{pM_p} \cdot b_i) \right)}{\partial b_j^i} = \frac{\partial}{\partial b_j^i} \left( \bar{w}_i \sum_{m=1}^{M_p} \frac{(x_{pm} \cdot b_i)\, e^{\alpha\, x_{pm} \cdot b_i}}{\sum_{h=1}^{M_p} e^{\alpha\, x_{ph} \cdot b_i}} \right) = \bar{w}_i \sum_{m=1}^{M_p} \frac{\partial}{\partial b_j^i} \left( \frac{(x_{pm} \cdot b_i)\, e^{\alpha\, x_{pm} \cdot b_i}}{\sum_{h=1}^{M_p} e^{\alpha\, x_{ph} \cdot b_i}} \right)

= \bar{w}_i \sum_{m=1}^{M_p} \frac{1}{\left( \sum_{h=1}^{M_p} e^{\alpha (x_{ph} \cdot b_i - x_{pm} \cdot b_i)} \right)^2} \times \left[ x_{(p,m,j)} \sum_{h=1}^{M_p} e^{\alpha (x_{ph} \cdot b_i - x_{pm} \cdot b_i)} - (x_{pm} \cdot b_i) \sum_{h=1}^{M_p} e^{\alpha (x_{ph} \cdot b_i - x_{pm} \cdot b_i)}\, \alpha \left( x_{(p,h,j)} - x_{(p,m,j)} \right) \right].   (51)
Thus, the overall error rate with respect to the consequent parameter b_j^i is given, according to (20), in equation (29).
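The chain-rule structure underlying (51) can also be verified numerically. The sketch below (ours; it omits the \bar{w}_i factor and any bias handling inside x_pm · b_i, and uses hypothetical names) exploits the fact that ∂(x_pm · b_i)/∂b_j^i = x_(p,m,j), so the gradient of S_α with respect to b_i is simply Xᵀ (∂S_α/∂s).

```python
import numpy as np

def S(r, alpha):
    """Soft max S_alpha over instance-level consequent outputs."""
    e = np.exp(alpha * (r - r.max()))
    return float((r * e).sum() / e.sum())

def dS(r, alpha):
    """Gradient of S_alpha with respect to its arguments (same form as (42))."""
    e = np.exp(alpha * (r - r.max()))
    s = (r * e).sum() / e.sum()
    return (e / e.sum()) * (1.0 + alpha * (r - s))

def consequent_grad(X, b, alpha):
    """d S_alpha(x_1.b, ..., x_M.b) / d b via the chain rule: X^T dS/ds."""
    s = X @ b                 # instance-level consequent outputs x_pm . b
    return X.T @ dS(s, alpha)

rng = np.random.default_rng(1)
X, b, alpha, h = rng.random((4, 3)), rng.random(3), 3.0, 1e-6
num = np.array([(S(X @ (b + h * np.eye(3)[j]), alpha)
                 - S(X @ (b - h * np.eye(3)[j]), alpha)) / (2 * h)
                for j in range(3)])
assert np.allclose(num, consequent_grad(X, b, alpha), atol=1e-5)
```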
ACKNOWLEDGMENT
This work was supported in part by U.S. Army Research Office Grant Numbers W911NF-13-1-0066
and W911NF-14-1-0589. The views and conclusions contained in this document are those of the authors
and should not be interpreted as representing the official policies, either expressed or implied, of the Army
Research Office, or the U.S. Government.
REFERENCES
[1] L. A. Zadeh, A theory of approximate reasoning (AR). Berkeley: Electronics Research Laboratory, College of Engineering, University
of California, 1977.
[2] L. A. Zadeh, “Outline of a new approach to the analysis of complex systems and decision processes,” Systems, Man and Cybernetics, IEEE
Transactions on, vol. SMC-3, no. 1, pp. 28–44, 1973.
[3] C.-W. Xu and Y.-Z. Lu, “Fuzzy model identification and self-learning for dynamic systems,” Systems, Man and Cybernetics, IEEE
Transactions on, vol. 17, no. 4, pp. 683–689, July 1987.
[4] R. Jager, H. Verbruggen, and P. Bruijin, “Fuzzy inference in rule-based control systems,” in Intelligent Systems Engineering, 1992., First
International Conference on (Conf. Publ. No. 360), Aug 1992, pp. 232–237.
[5] C.-C. Lee, “Fuzzy logic in control systems: fuzzy logic controller, part II,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 20,
no. 2, pp. 419–435, Mar 1990.
[6] J. Casillas, Interpretability issues in fuzzy modeling. Springer, 2003, vol. 128.
[7] C.-Y. Chiu, H.-C. Lin, and S.-N. Yang, “A fuzzy logic cbir system,” in Fuzzy Systems, 2003. FUZZ ’03. The 12th IEEE International
Conference on, vol. 2, May 2003, pp. 1171–1176 vol.2.
[8] A. Othman, H. Tizhoosh, and F. Khalvati, “EFIS: evolving fuzzy image segmentation,” Fuzzy Systems, IEEE Transactions on, vol. 22,
no. 1, pp. 72–82, Feb 2014.
[9] S. Hajimirza and E. Izquierdo, “Gaze movement inference for implicit image annotation,” in Image Analysis for Multimedia Interactive
Services (WIAMIS), 2010 11th International Workshop on, April 2010, pp. 1–4.
[10] H. Kwan and L. Y. Cai, “Supervised fuzzy inference network for invariant pattern recognition,” in Circuits and Systems, 2000. Proceedings
of the 43rd IEEE Midwest Symposium on, vol. 2, 2000, pp. 850–854 vol.2.
[11] M. Adnan, M. Chowdury, I. Taz, T. Ahmed, and R. Rahman, “Content based news recommendation system based on fuzzy logic,” in
Informatics, Electronics Vision (ICIEV), 2014 International Conference on, May 2014, pp. 1–6.
[12] A. Ben Khalifa and H. Frigui, “Fusion of multiple algorithms for detecting buried objects using fuzzy inference,” Proc. SPIE, vol.
9072, p. 90720V, 2014. [Online]. Available: http://dx.doi.org/10.1117/12.2051217
[13] M.-Y. Chen and D. Linkens, “Rule-base self-generation and simplification for data-driven fuzzy models,” in Fuzzy Systems, 2001. The
10th IEEE International Conference on, vol. 1, 2001, pp. 424–427.
[14] D. Saletic, “On data-driven procedure for determining the number of rules in a takagi-sugeno fuzzy model,” in Computer as a Tool,
2005. EUROCON 2005.The International Conference on, vol. 2, Nov 2005, pp. 1132–1135.
[15] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, “Neuro-fuzzy and soft computing: a computational approach to learning and machine intelligence
[book review],” Automatic Control, IEEE Transactions on, vol. 42, no. 10, pp. 1482–1484, 1997.
[16] O. Maron, “Learning from ambiguity,” Ph.D. dissertation, Massachusetts Institute of Technology, 1998.
[17] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial
Intelligence, vol. 89, no. 1–2, pp. 31–71, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0004370296000343
[18] E. Alpaydın, V. Cheplygina, M. Loog, and D. M. Tax, “Single-vs. multiple-instance classification,” Pattern Recognition, vol. 48, no. 9,
pp. 2831–2838, 2015.
[19] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial
Intelligence, vol. 89, no. 1, pp. 31–71, 1997.
[20] C. Zhang, X. Chen, and W.-B. Chen, “An online multiple instance learning system for semantic image retrieval,” in Multimedia Workshops,
2007. ISMW ’07. Ninth IEEE International Symposium on, Dec 2007, pp. 83–84.
[21] O. Maron and T. Lozano-Pérez, “A framework for multiple-instance learning,” in Proceedings of the 1997 Conference on Advances in
Neural Information Processing Systems 10, ser. NIPS ’97. Cambridge, MA, USA: MIT Press, 1998, pp. 570–576. [Online]. Available:
http://dl.acm.org/citation.cfm?id=302528.302753
[22] A. Karem and H. Frigui, “A multiple instance learning approach for landmine detection using ground penetrating radar,” in Geoscience
and Remote Sensing Symposium (IGARSS), 2011 IEEE International. IEEE, 2011, pp. 878–881.
[23] R. Rahmani and S. A. Goldman, “Missl: Multiple-instance semi-supervised learning,” in Proceedings of the 23rd international conference
on Machine learning. ACM, 2006, pp. 705–712.
[24] Y. Chen, J. Bi, and J. Wang, “Miles: Multiple-instance learning via embedded instance selection,” Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 28, no. 12, pp. 1931–1947, Dec 2006.
[25] C. Yang, M. Dong, and F. Fotouhi, “Region based image annotation through multiple-instance learning,” in Proceedings of the 13th
annual ACM international conference on Multimedia. ACM, 2005, pp. 435–438.
[26] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 33, no. 8, pp. 1619–1632, 2011.
[27] S. Andrews and T. Hofmann, “Multiple-instance learning via disjunctive programming boosting,” Advances in Neural Information Processing Systems,
no. 16, pp. 65–72, 2004.
[28] B. Babenko, “Multiple instance learning: algorithms and applications,” 2008.
[29] C. Zhang, J. C. Platt, and P. A. Viola, “Multiple instance boosting for object detection,” in Advances in neural information processing
systems, 2005, pp. 1417–1424.
[30] J. Amores, “Multiple instance classification: Review, taxonomy and comparative study,” Artificial Intelligence, vol. 201, pp. 81–105,
2013.
[31] P. L. Lanzi, W. Stolzmann, and S. W. Wilson, Learning classifier systems: from foundations to applications. Springer, 2000, no. 1813.
[32] O. Cordón, “A historical review of evolutionary learning methods for Mamdani-type fuzzy rule-based systems: Designing interpretable
genetic fuzzy systems,” International Journal of Approximate Reasoning, vol. 52, no. 6, pp. 894–913, 2011.
[33] E. Mamdani, “Application of fuzzy algorithms for control of simple dynamic plant,” Electrical Engineers, Proceedings of the Institution
of, vol. 121, no. 12, pp. 1585–1588, December 1974.
[34] R. Babuška and H. Verbruggen, “An overview of fuzzy modeling for control,” Control Engineering Practice, vol. 4, no. 11, pp. 1593–
1606, 1996.
[35] M. Mizumoto, “Fuzzy controls under various fuzzy reasoning methods,” Information Sciences, vol. 45, no. 2, pp. 129 – 151, 1988.
[Online]. Available: http://www.sciencedirect.com/science/article/pii/0020025588900370
[36] C.-C. Lee, “Fuzzy logic in control systems: fuzzy logic controller, part I,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 20,
no. 2, pp. 404–418, Mar 1990.
[37] M. Sugeno and T. Yasukawa, “A fuzzy-logic-based approach to qualitative modeling,” Fuzzy Systems, IEEE Transactions on, vol. 1,
no. 1, pp. 7–31, Feb 1993.
[38] R. Yager and D. Filev, “Unified structure and parameter identification of fuzzy models,” Systems, Man and Cybernetics, IEEE Transactions
on, vol. 23, no. 4, pp. 1198–1205, Jul 1993.
[39] E. Tacker, “Modeling stabilization policies in financial systems,” in Decision and Control including the 16th Symposium on Adaptive
Processes and A Special Symposium on Fuzzy Set Theory and Applications, 1977 IEEE Conference on, Dec 1977, pp. 194–194.
[40] P. Singh, S. Bhanot, and H. Mohanta, “Optimized adaptive neuro-fuzzy inference system for ph control,” in Advanced Electronic Systems
(ICAES), 2013 International Conference on, Sept 2013, pp. 1–5.
[41] T. Takagi and M. Sugeno, “Fuzzy identification of systems and its applications to modeling and control,” Systems, Man and Cybernetics,
IEEE Transactions on, vol. SMC-15, no. 1, pp. 116–132, 1985.
[42] Z.-H. Zhou and M.-L. Zhang, “Neural networks for multi-instance learning,” Proceedings of the International Conference on Intelligent
Information Technology, Beijing, China, pp. 455–459, 2002.
[43] M.-L. Zhang and Z.-H. Zhou, “Adapting RBF neural networks to multi-instance learning,” Neural Processing Letters, vol. 23, no. 1, pp.
1–26, 2006.
[44] S. Ray and D. Page, “Multiple instance regression,” in ICML, vol. 1, 2001, pp. 425–432.
[45] J.-S. Jang, “ANFIS: adaptive-network-based fuzzy inference system,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 23, no. 3,
pp. 665–685, 1993.
[46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from
overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[47] S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support vector machines for multiple-instance learning,” in Advances in neural
information processing systems, 2002, pp. 561–568.
[48] Y. Li, D. M. Tax, R. P. Duin, and M. Loog, “Multiple-instance learning as a classifier combining problem,” Pattern Recognition, vol. 46,
no. 3, pp. 865–874, 2013.
[49] Y. Chen, J. Bi, and J. Z. Wang, “Miles: Multiple-instance learning via embedded instance selection,” Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 28, no. 12, pp. 1931–1947, 2006.
[50] Y. Chen and J. Z. Wang, “Image categorization by learning and reasoning with regions,” The Journal of Machine Learning Research,
vol. 5, pp. 913–939, 2004.
[51] Q. Zhang and S. A. Goldman, “Em-dd: An improved multiple-instance learning technique,” in Advances in neural information processing
systems, 2001, pp. 1073–1080.
[52] J. Wang and J.-D. Zucker, “Solving the multiple-instance problem: A lazy learning approach,” in Proc. 17th International Conf. on Machine Learning,
2000.
[53] J. Ramon and L. De Raedt, “Multi instance neural networks,” in Proceedings of the ICML-2000 workshop on attribute-value and
relational learning, 2000, pp. 53–60.
[54] Z.-H. Zhou and M.-L. Zhang, “Ensembles of multi-instance learners,” in Machine Learning: ECML 2003. Springer, 2003, pp. 492–502.
[55] A. Bouchachia, “Multiple instance learning with radial basis function neural networks,” in Neural Information Processing. Springer,
2002, Conference Proceedings, pp. 440–445.
[56] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li, “Multi-instance learning by treating instances as non-iid samples,” in Proceedings of the 26th annual
international conference on machine learning. ACM, 2009, pp. 1249–1256.
[57] H.-Y. Wang, Q. Yang, and H. Zha, “Adaptive p-posterior mixture-model kernels for multiple instance learning,” in Proceedings of the
25th international conference on Machine learning. ACM, 2008, pp. 1136–1143.
[58] P. V. Gehler and O. Chapelle, “Deterministic annealing for multiple-instance learning,” in International conference on artificial intelligence
and statistics, 2007, pp. 123–130.
[59] C. Leistner, A. Saffari, and H. Bischof, “Miforests: multiple-instance learning with randomized trees,” in Computer Vision–ECCV 2010.
Springer, 2010, pp. 29–42.
[60] H. Frigui and P. Gader, “Detection and discrimination of land mines in ground-penetrating radar based on edge histogram descriptors
and a possibilistic k-nearest neighbor classifier,” Fuzzy Systems, IEEE Transactions on, vol. 17, no. 1, pp. 185–199, Feb. 2009.
[61] A. B. Khalifa and H. Frigui, “Fusion of multiple landmine detection algorithms using an adaptive neuro fuzzy inference system,” in
Geoscience and Remote Sensing Symposium (IGARSS), 2014 IEEE International. IEEE, 2014, pp. 3148–3151.
[62] J.-S. Jang and C.-T. Sun, “Functional equivalence between radial basis function networks and fuzzy inference systems,” Neural Networks,
IEEE Transactions on, vol. 4, no. 1, pp. 156–159, Jan 1993.
[63] M. Yi-de, L. Qing, and Q. Zhi-Bai, “Automated image segmentation using improved pcnn model based on cross-entropy,” in Intelligent
Multimedia, Video and Speech Processing, 2004. Proceedings of 2004 International Symposium on. IEEE, 2004, pp. 743–746.
[64] H. Li, T. Jiang, and K. Zhang, “Efficient and robust feature extraction by maximum margin criterion,” Neural Networks, IEEE Transactions
on, vol. 17, no. 1, pp. 157–165, 2006.