All content in this area was uploaded by Amine ben khalifa on Jan 13, 2017


Multiple Instance Fuzzy Inference Neural Networks

Amine B. Khalifa, Hichem Frigui, Member, IEEE

Multimedia Research Lab

CECS Department

University of Louisville

Louisville, KY 40292, USA

Abstract

Fuzzy logic is a powerful tool to model knowledge uncertainty, measurement imprecision, and vagueness.

However, there is another type of vagueness that arises when data have multiple forms of expression that fuzzy logic

does not address quite well. This is the case for multiple instance learning problems (MIL). In MIL, an object is

represented by a collection of instances, called a bag. A bag is labeled negative if all of its instances are negative,

and positive if at least one of its instances is positive. Positive bags encode ambiguity since the instances themselves

are not labeled. In this paper, we introduce fuzzy inference systems and neural networks designed to handle bags of

instances as input and capable of learning from ambiguously labeled data. First, we introduce the Multiple Instance

Sugeno style fuzzy inference (MI-Sugeno) that extends the standard Sugeno style inference to handle reasoning

with multiple instances. Second, we use MI-Sugeno to define and develop the Multiple Instance Adaptive Neuro-Fuzzy

Inference System (MI-ANFIS). We expand the architecture of the standard ANFIS to allow reasoning with bags

and derive a learning algorithm using backpropagation to identify the premise and consequent parameters of the

network. The proposed inference system is tested and validated using synthetic and benchmark datasets suitable for

MIL problems. We also apply the proposed MI-ANFIS to fuse the output of multiple discrimination algorithms for

the purpose of landmine detection using Ground Penetrating Radar.

I. INTRODUCTION

Fuzzy inference is a powerful modeling framework that can handle computing with knowledge uncertainty

and measurement imprecision effectively [1]. It has been successfully applied to a wide range of problems,

mainly in system modeling and control [2]–[4]. Most of the proposed fuzzy inference methods gained

success because of their ability to leverage expert knowledge to identify the model parameters [5]. This

practice simplifies system design and ensures that the knowledge base (if-then rules) used by the system is

easy to interpret [6].

More recently, fuzzy inference has increasingly been applied to more advanced applications, such as

content-based information retrieval [7], image segmentation [8], image annotation [9], pattern recognition

[10], recommender systems [11], and multiple classifier fusion [12]. The aforementioned applications are

more challenging as they require an extensive knowledge base to accommodate various scenarios. Since

this diverse knowledge base cannot be fully captured by domain experts, data-driven techniques are typically

used to identify and learn the inference system’s parameters [13], [14]. One such technique is the Adaptive

Neuro-Fuzzy Inference System (ANFIS) [15]. ANFIS is a universal approximator that combines the learning

and modeling power of neural networks and fuzzy logic into an adaptive inference system. It is a hybrid

intelligent system and it provides a systematic approach to jointly learn the optimal input space partition

(rules) and the optimal output parameters using supervised learning.

Typically, in supervised learning, access to large labeled training datasets improves the performance of

the devised algorithms by increasing their robustness and generalization capabilities. Nowadays, access to

such large datasets is becoming more convenient. However, for a supervised learning method to benefit from this data, the data need to be carefully preprocessed, filtered, and labeled. Unfortunately, this process can be too tedious, as the vast portion of the collected data is unstructured and labeled ambiguously and at a coarse level. An alternative and relatively new learning framework that tackles this inherent ambiguity better than supervised learning is the Multiple Instance Learning (MIL) paradigm [16].

Email: a0benk01@louisville.edu

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

arXiv:1610.04973v1 [cs.NE] 17 Oct 2016

A. Multiple Instance Learning

Unlike standard supervised learning, in MIL, an object is not represented by a simple data point, but

rather by a collection of instances, called a bag. Each bag can contain a different number of instances.

A bag is labeled negative if all of its instances are negative, and positive if at least one of its instances

is positive¹. Positive bags can encode ambiguity since the instances themselves are not labeled. Given a

training set of labeled bags, the goal of MIL is to learn a concept that predicts the labels of training data

at the instance level and generalizes to predict the labels of testing bags and their instances [17]. We refer

to this definition as the standard MIL assumption. Multiple MIL paradigms have been proposed [18], but

for simplicity we focus our formulation on the standard MIL assumption.
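As a minimal sketch of the standard MIL assumption described above, the bag-level label is the disjunction of the hidden instance-level labels. The function name `bag_label` and the toy bags are illustrative assumptions and do not come from the original formulation.

```python
def bag_label(instance_labels):
    """Standard MIL assumption: positive (1) iff at least one instance is positive."""
    return 1 if any(instance_labels) else 0


# Instance labels are hidden in practice; they appear here only to illustrate the rule.
toy_bags = {
    "negative_bag": [0, 0, 0],
    "positive_bag": [0, 1, 0, 0],  # a positive bag may still contain negative instances
}
toy_labels = {name: bag_label(labels) for name, labels in toy_bags.items()}
print(toy_labels)
```

A learner only ever sees the values in `toy_labels`, never the per-instance lists, which is the source of the ambiguity in positive bags.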

MIL is a well-known problem that has been studied for the last 20 years. It was first formalized by

Dietterich et al. [19], who provided a solution to drug activity prediction. Since then, it has increasingly been

applied to a wide variety of tasks including content-based information retrieval [20], drug discovery [21],

pattern recognition [22], image classification [23], region-based image categorization [24], image annotation

[25], object tracking [26] and time series prediction [16]. In general, MIL can be applied in two contexts

of ambiguity: “polymorphism ambiguity” and “part-whole ambiguity” [27]. In polymorphism ambiguity,

an object can have multiple forms of expression in the input space and it is not known which form is

responsible for the object label. Whereas, in part-whole ambiguity, an object can be broken into several

parts represented by different feature vectors in the input space. However, only a few parts are responsible for

the object label [28]. Polymorphism ambiguity arises more often in applications related to chemistry and

bioscience. The original MIL application of drug discovery [16], [17] is a case of polymorphism ambiguity.

Part-whole ambiguity is more common in pattern recognition problems. For example, in image annotation,

features are usually extracted locally (from patches) while the labels, or tags, are only available globally at

the image level. Another closely related application is object detection. In this application, objects of interest

may cover only a limited region of the image, the rest could be other objects or background. Traditional

supervised learning requires identifying image patches containing the object of interest only and labeling

them. As indicated by Viola et al. [29], placing bounding boxes around objects is an inherently ambiguous

task. Thus, to avoid the tedious task of object segmentation and annotation, the problem of object detection

can be addressed using an MIL paradigm. To illustrate the need for MIL further, in the following we analyze

how a multiple instance (MI) representation can be applied to image classification. More details about MIL

taxonomy have been reported by Amores [30].

Consider the simple example of classifying images that contain “sky”. Using an MIL approach, each

training image is represented by a bag of instances where each instance corresponds to features extracted

from a region of interest. These regions could be obtained by segmenting the image or simply by dividing

it into patches. A multiple instance representation is well suited for this purpose because only a few regions

may contain the object of interest (sky), that is, the positive class. Other patches will be from the background

or other classes. This representation is illustrated in Figure 1. Traditional single instance learning methods are based

on instance-level (patch-level) labels and would require each image region to be correctly segmented and

labeled prior to learning.

Fig. 1. Example of an image represented as a bag of 12 instances. Each instance corresponds to a feature vector (e.g., color, texture) extracted from one patch. The bag is labeled “sky” because at least one of its instances is sky. However, many other instances are not “sky”. Labels at the instance level are not available.

¹ Note that positive bags may also contain negative instances.

B. Fuzzy Inference Systems

A Fuzzy Inference System (FIS) is a paradigm in soft computing which provides a means of approximate reasoning [31]. A FIS is capable of handling computing with knowledge uncertainty and measurement imprecision effectively [1]. It performs a non-linear mapping from an input space to an output space by deriving conclusions from a set of fuzzy if-then rules and known facts [32]. Fuzzy rules are condition/action (if-then) rules composed of a set of linguistic variables (e.g., image patch). Each variable is assigned a linguistic term (e.g., red, green, blue). For instance, the following rules could be used to identify patches from the image in Figure 1:

• If patch is blue and texture is smooth then region is sky.

• If patch is blue and patch position is upper half then region is sky.

Typically, a FIS is composed of 5 components: (1) a Fuzzification unit that assigns a membership degree to each crisp input dimension in the input fuzzy sets; (2) a Knowledge Base characterized by fuzzy sets of linguistic terms; (3) a Rule Base containing a set of fuzzy if-then rules; (4) an Inference unit that performs fuzzy reasoning; and (5) a Defuzzification unit that generates crisp output values. FIS has proven to be very effective in various applications [2]–[4], [33]–[40]. However, it is not applicable to cases where objects are represented by multiple instances.

C. Motivations For Multiple Instance Fuzzy Inference

There are two major limitations that prevent using standard FIS methods with multiple instance data.

First, due to the absence of labels at the instance level, we cannot use standard FIS learning methods to

construct the knowledge base. Second, we need an effective mechanism to aggregate instances’ confidences

and infer at the bag level. The above limitations are due mainly to the inherent architecture of fuzzy inference

systems. The standard inference systems reason with individual instances. First, the system’s input is an

individual instance. Second, the rules describe fuzzy regions within the instances space. Third, the output

of the system corresponds to the fuzzy inference using a single instance. Fourth, labels of the individual

instances are required when using learning techniques to identify the parameters of the system. In summary,

traditional fuzzy inference systems cannot be used effectively within the MIL framework.

To address the above limitations, we introduce two FIS designed to handle reasoning with bags of instances and capable of learning from ambiguously labeled data. The first one, called Multiple Instance Sugeno (MI-Sugeno), extends the standard Sugeno system [41]. The second one, called Multiple Instance ANFIS (MI-ANFIS), extends the standard ANFIS system [15] and uses MI-Sugeno rules. We report results on various experiments and discuss the advantages of using our proposed methods over closely related MIL algorithms such as Multiple Instance Neural Networks (MI-NN) [42] and Multiple Instance RBF Neural Networks (RBF-MIP) [43].

II. MULTIPLE INSTANCE FUZZY INFERENCE

In the following, let $B_p$ be a bag of $M_p$ instances with the $j$th instance denoted as $\mathbf{x}_{pj} \in \mathbb{R}^D$ with elements $x_{(p,j,k)}$ corresponding to features, i.e.,

$$
B_p =
\begin{bmatrix}
\mathbf{x}_{p1} \\
\mathbf{x}_{p2} \\
\vdots \\
\mathbf{x}_{pM_p}
\end{bmatrix}
=
\begin{bmatrix}
x_{(p,1,1)} & x_{(p,1,2)} & \cdots & x_{(p,1,D)} \\
x_{(p,2,1)} & x_{(p,2,2)} & \cdots & x_{(p,2,D)} \\
\vdots & \vdots & \ddots & \vdots \\
x_{(p,M_p,1)} & x_{(p,M_p,2)} & \cdots & x_{(p,M_p,D)}
\end{bmatrix}.
\tag{1}
$$

Note that the number of instances can vary between bags ($M_p$ depends on $B_p$). A bag is labeled positive if at least one of its instances is positive, and negative if all of its instances are negative.

A. Multiple Instance Sugeno Style Fuzzy Inference

To adapt Sugeno inference to problems where objects are described by multiple instances, we propose

a multiple instance Sugeno inference (MI-Sugeno) system that uses multiple instance fuzzy if-then rules.

Recall that a fuzzy if-then rule is expressed as

if x is A then y is C (2)

where $A$ and $C$ are fuzzy sets on universes of discourse $X$ and $Y$, respectively. The rule in (2) combines the fuzzy propositions ($x$ is $A$, $y$ is $C$) into a logical implication abbreviated as $A \to C$ with membership function $\mu_{A \to C}(x, y)$. The rule is defined using a premise part that is a single instance fuzzy proposition. To generalize the rule in (2) to MI data, we define a multiple instance fuzzy rule as:

$$
\text{if } B_i \text{ is } A \text{ then } y \text{ is } C
\iff
\text{if } \bigvee_{j=1}^{M_i} (\mathbf{x}_{ij} \text{ is } A) \text{ then } y \text{ is } C
\tag{3}
$$

where, as in (2), $A$ and $C$ are fuzzy sets on the universes of discourse $X$ and $Y$, respectively. In (3), $B_i$ is a bag of instances $\mathbf{x}_{ij}$ as defined in (1), and $M_i$ is the number of instances in $B_i$. The premise part of a multiple instance fuzzy rule (i.e., $\bigvee_{j=1}^{M_i} (\mathbf{x}_{ij} \text{ is } A)$) is a multiple instance proposition, whereas the consequent part is a traditional proposition. In (3), $\bigvee$ is a joint operator that can be any T-conorm (maximum, algebraic sum, bounded sum, etc.). The reason behind using a T-conorm for combining individual instances’ responses goes back to the standard MIL assumption [16], [17], which states that a bag is positive if and only if one or more of its instances are positive. Thus, the bag-level class label is determined by the disjunction of the instance-level class labels. We note that the T-conorm can be designed to handle a broader set of non-standard MIL problems, for example to allow the inference process to assign a higher degree of belief to bags with more than one positive instance.
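The role of the T-conorm in the premise of (3) can be sketched as follows. This is an illustrative example only: the function names and the membership values are assumptions, and the three operators are the textbook maximum, algebraic sum, and bounded sum T-conorms.

```python
from functools import reduce


def s_max(a, b):
    """Maximum T-conorm."""
    return max(a, b)


def s_algebraic_sum(a, b):
    """Algebraic (probabilistic) sum T-conorm."""
    return a + b - a * b


def s_bounded_sum(a, b):
    """Bounded sum T-conorm."""
    return min(1.0, a + b)


def bag_truth(instance_truths, t_conorm):
    """Fold a T-conorm over the instance-level degrees of truth (0 is the neutral element)."""
    return reduce(t_conorm, instance_truths, 0.0)


one_positive = [0.1, 0.7, 0.2]   # hypothetical degrees of truth of "x_ij is A"
two_positive = [0.1, 0.7, 0.7]
for name, s in [("max", s_max), ("algebraic sum", s_algebraic_sum), ("bounded sum", s_bounded_sum)]:
    print(name, bag_truth(one_positive, s), bag_truth(two_positive, s))
```

With the maximum, both bags receive the same degree of truth, whereas the algebraic sum assigns a higher degree to the bag with two strongly matching instances, which is the kind of non-standard behavior mentioned above.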

The proposed MI-Sugeno uses multiple instance fuzzy rules with a consequent part that is described by means of a function $C$ that maps a bag of instances to a crisp numerical value. Specifically, we define a multiple instance Sugeno rule as:

$$
\begin{aligned}
R_i(B_p):\ & \bigvee_{j=1}^{M_p} \big(\text{If } x_{(p,j,1)} \text{ is } A^i_1 \text{ and } x_{(p,j,2)} \text{ is } A^i_2 \ldots \text{ and } x_{(p,j,D)} \text{ is } A^i_D\big), \text{ then} \\
& o_i = C(\mathbf{x}_{p1}\cdot\mathbf{b}_i, \mathbf{x}_{p2}\cdot\mathbf{b}_i, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_i)
\end{aligned}
\tag{4}
$$

In (4), $\mathbf{b}_i = \{b^i_0, \ldots, b^i_D\}$ is a set of polynomial coefficients. When the polynomial defined by $\mathbf{b}_i$ is first order, the MI-Sugeno fuzzy model is called first order, and it is called zero order when the polynomial is zero order.

Fig. 2. Illustration of the proposed multiple instance Sugeno fuzzy inference system with 2 rules.

Figure 2 illustrates the proposed MI-Sugeno system and its fuzzy inference mechanism to derive the output, $o$, in response to a bag of $M$ instances for the simple case of two rules. The premise part of the rules evaluates all the bag’s instances simultaneously. The inference starts with the fuzzification of the instances $\mathbf{x}_{pm}$ of input bag $B_p$. Fuzzification assigns a membership degree to each input instance dimension in the rules’ input fuzzy sets. In Figure 2, instance $\mathbf{x}_{pm}$ activates the $i$th input fuzzy set of the $j$th rule by a degree of truth $w_{(m,i,j)}$. Next, an implication process is executed to combine the activations of the instances within the bag, resulting in the activation of the rules’ outputs with different degrees. In this example, we use a simple min operator, and the output of rule $R_j$ will be partially activated by a degree $w_{mj} = \min_{k=1,\ldots,D} w_{(m,k,j)}$. The $w_{mj}$ (truth instances) are combined in the premise part using the max T-conorm, resulting in the activation of rule $R_j$ by a degree $w_j = \max_{m=1,\ldots,M} \{w_{mj}\}$. To evaluate the consequent part, first the linear response of each instance is computed, i.e., $\mathbf{x}_{pj}\cdot\mathbf{b}_i$. Then, a function $C$ is used to compute the final output by combining the instances’ responses. Many functions could be used and the choice should be domain-specific. The outputs of the rules, $o_1$ and $o_2$, are crisp values. As in the traditional Sugeno fuzzy inference system, the overall output of the system is obtained by taking the weighted average of the rules’ outputs.
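The inference steps above can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the Gaussian membership functions, the choice of max for the combining function $C$, the treatment of `b[0]` as a bias term, and all parameter values are assumptions made for the example.

```python
import math


def gaussian(x, center, width):
    """Gaussian membership degree (an assumed membership function shape)."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


def mi_sugeno(bag, rules):
    """Evaluate a bag (list of D-dimensional instances) with MI-Sugeno style rules."""
    weighted_sum, weight_total = 0.0, 0.0
    for rule in rules:
        # Premise: min over dimensions gives each instance's degree of truth w_mj,
        # then the max T-conorm over instances gives the rule activation w_j.
        truths = [
            min(gaussian(x, c, s) for x, c, s in zip(inst, rule["centers"], rule["widths"]))
            for inst in bag
        ]
        w = max(truths)
        # Consequent: linear response of each instance, combined by C (here: max).
        b0, b_rest = rule["b"][0], rule["b"][1:]
        responses = [b0 + sum(bk * xk for bk, xk in zip(b_rest, inst)) for inst in bag]
        o = max(responses)
        weighted_sum += w * o
        weight_total += w
    # Weighted average of the rule outputs; guard against all-zero activations.
    return weighted_sum / weight_total if weight_total > 0.0 else 0.0


# Two hypothetical zero-order rules: rule 1 outputs 1, rule 2 outputs 0.
rules = [
    {"centers": [0.5, 0.5], "widths": [0.2, 0.2], "b": [1.0, 0.0, 0.0]},
    {"centers": [1.5, 1.5], "widths": [0.2, 0.2], "b": [0.0, 0.0, 0.0]},
]
near_rule_1 = mi_sugeno([[0.5, 0.5], [3.0, 3.0]], rules)
far_from_rule_1 = mi_sugeno([[3.0, 3.0]], rules)
print(near_rule_1, far_from_rule_1)
```

One instance that matches the premise of rule 1 is enough to drive the bag output towards that rule's consequent, regardless of the remaining instances.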

The consequent part of the proposed MI-Sugeno style inference system is inspired by the work of Ray and Page on multiple instance regression [44]. In their work, the authors proposed a regression framework for predicting bags’ labels. This formulation allows the linear coefficients $\mathbf{b}_i$ and the parameters of the combining function $C$ to be learned using optimization techniques, as we will show in Section II-B.

Similar to traditional fuzzy inference, the premise part of a multiple instance rule defines a local fuzzy region within the instance space, and the consequent part describes the characteristics of the system’s output within each region. More specifically, in MIL problems, a local region describes a positive concept (also called target concept), and the output of a rule represents the degree of “positivity” of the instances in that target concept. A target concept is a region in the instances’ feature space that includes as many instances from positive bags as possible and as few instances from negative bags as possible.

The Sugeno fuzzy model [41] was the first attempt at learning fuzzy rules from training data. It has been used to develop the standard ANFIS, which combines the representation power of fuzzy inference and the learning capability of neural networks to learn the rules. In the next section, we will use our MI-Sugeno to develop a multiple instance extension of ANFIS (MI-ANFIS).

Fig. 3. Architecture of the proposed Multiple Instance Adaptive Neuro-Fuzzy Inference System.

B. MI-ANFIS: A Multiple Instance Adaptive Neuro-Fuzzy Inference System

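Let $B_p$ be a bag of $M_p$ instances as defined in (1). For simplicity, we introduce our MI-ANFIS for the case of two rules. The generalization to an arbitrary number of rules is trivial. The MI-ANFIS with two Sugeno rules can be described as:

$$
\begin{aligned}
R_1(B_p):\ & \bigvee_{j=1}^{M_p} \big(\text{If } x_{(p,j,1)} \text{ is } A_{(1,1)} \text{ and } x_{(p,j,2)} \text{ is } A_{(1,2)}, \ldots, \text{ and } x_{(p,j,D)} \text{ is } A_{(1,D)}\big), \\
& \text{then } f_1 = C(\mathbf{x}_{p1}\cdot\mathbf{b}_1, \mathbf{x}_{p2}\cdot\mathbf{b}_1, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_1) \\
R_2(B_p):\ & \bigvee_{j=1}^{M_p} \big(\text{If } x_{(p,j,1)} \text{ is } A_{(2,1)} \text{ and } x_{(p,j,2)} \text{ is } A_{(2,2)}, \ldots, \text{ and } x_{(p,j,D)} \text{ is } A_{(2,D)}\big), \\
& \text{then } f_2 = C(\mathbf{x}_{p1}\cdot\mathbf{b}_2, \mathbf{x}_{p2}\cdot\mathbf{b}_2, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_2)
\end{aligned}
\tag{5}
$$

Figure 3 illustrates the proposed MI-ANFIS architecture. As in the traditional ANFIS, nodes at the same layer have similar functions. We denote the output of the $i$th node in layer $l$ as $O_{(l,i)}$.

Layer 1 is an adaptive layer. It calculates the degree to which an input instance satisfies a quantifier $A$. Every node evaluates the membership degree $\mu_{A_{(k,j)}}$ of an input instance, $x$, in the fuzzy set $A_{(k,j)}$. Generally, $\mu_{A_{(k,j)}}$ is a parameterized membership function (MF), for example a Gaussian MF with

$$
\mu_{A_{(k,j)}}(x) = \exp\left(-\frac{(x - c_{kj})^2}{2\sigma_{kj}^2}\right).
\tag{6}
$$

In (6), $c_{kj}$ and $\sigma_{kj}$ are the mean (center) and standard deviation (width) of the Gaussian function, and are referred to as the premise parameters.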

Layer 2 is a fixed layer where every node computes the product of all incoming inputs. It evaluates the degree of truth of proposition instances, or simply, “truth instances”. The output of this layer is computed using:

$$
O_{(2,i)} = r_{(\lceil i/M_p \rceil,\, i[M_p])} = \prod_{j=1}^{D} \mu_{A_{(\lceil i/M_p \rceil,\, j)}}\big(x_{(p,\, i[M_p],\, j)}\big),
\tag{7}
$$

where $\lceil \cdot \rceil$ is the ceiling operator, and $i[M_p]$ is $1 + ((i-1) \bmod M_p)$. As in the traditional ANFIS, any T-norm can replace the product as the node function in this layer.

Layer 3 is a new addition when compared to the traditional ANFIS. Every node in this layer aggregates the truth instances (within each bag) of the previous layer by means of a smooth T-conorm. In this paper, we use a “softmax” function ($S_\alpha$):

$$
\mathrm{softmax}_\alpha(x_1, x_2, \ldots, x_n) = S_\alpha(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} \frac{x_i\, e^{\alpha x_i}}{\sum_{j=1}^{n} e^{\alpha x_j}}.
\tag{8}
$$

In (8), $\alpha$ determines the behavior of softmax. As $\alpha$ approaches $\infty$, softmax approaches the max operator. When $\alpha = 0$, it calculates the mean. As $\alpha$ approaches $-\infty$, softmax approaches the min operator. The outputs of this layer are the firing strengths of each input bag in each multiple instance fuzzy rule, i.e.,

$$
O_{(3,i)} = w_i = S_\alpha\big(\{r_{(i,j)}\}_{j=1}^{M_p}\big).
\tag{9}
$$

Layer 3 is also a fixed layer.

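Layer 4 is a fixed layer. Every node in this layer calculates the normalized firing strength of each rule, i.e.,

$$
O_{(4,i)} = \bar{w}_i = \frac{w_i}{\sum_{j=1}^{|O_3|} w_j},
\tag{10}
$$

where $|O_3|$ is the number of rules.

Layer 5 is an adaptive layer. Every node $i$ in this layer computes the output of the $i$th multiple instance rule using

$$
O_{(5,i)} = \bar{w}_i\, C(\mathbf{x}_{p1}\cdot\mathbf{b}_i, \mathbf{x}_{p2}\cdot\mathbf{b}_i, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_i).
\tag{11}
$$

The parameters $\{\mathbf{b}_i\}_{i=1}^{|O_3|}$ are referred to as the consequent parameters. The only constraint on $C$ is that it has to be a smooth function to allow for optimization techniques to be applied. We use the “softmax” as the combining function for this layer. In this case, (11) is equivalent to:

$$
O_{(5,i)} = \bar{w}_i\, S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_i, \mathbf{x}_{p2}\cdot\mathbf{b}_i, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_i).
\tag{12}
$$

Note that the constant $\alpha$ here is not necessarily the same as in Layer 3.

Layer 6 is a fixed layer with a single node labeled $\Sigma$. It computes the overall output of the system using

$$
O_{(6,1)} = \sum_{i=1}^{|O_3|} O_{(5,i)} = \sum_{i=1}^{|O_3|} \bar{w}_i\, S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_i, \mathbf{x}_{p2}\cdot\mathbf{b}_i, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_i).
\tag{13}
$$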
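The six layers can be traced end to end with a short forward-pass sketch. It follows (6) to (13) under a few stated assumptions: `b[k][0]` is treated as a bias coefficient (the instance is implicitly augmented with a constant 1), the two softmax constants `alpha3` and `alpha5` and all parameter values are hypothetical, and the function names are illustrative rather than taken from the paper.

```python
import math


def gaussian_mf(x, c, sigma):
    """Layer 1: Gaussian membership function, as in (6)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def softmax_alpha(values, alpha):
    """Smooth aggregation S_alpha of (8); the shift only improves numerical stability."""
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def linear_response(instance, b):
    """x . b with b[0] used as a bias term (an assumption of this sketch)."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], instance))


def mi_anfis_forward(bag, centers, sigmas, b, alpha3=5.0, alpha5=5.0):
    n_rules, m_p, dim = len(centers), len(bag), len(bag[0])
    # Layers 1-2: flat node index i = 1..R*M_p mapped to (rule, instance) as in (7).
    r = [[0.0] * m_p for _ in range(n_rules)]
    for i in range(1, n_rules * m_p + 1):
        k = (i - 1) // m_p + 1      # equals ceil(i / M_p): rule index, 1-based
        m = 1 + ((i - 1) % m_p)     # i[M_p]: instance index, 1-based
        truth = 1.0
        for j in range(dim):
            truth *= gaussian_mf(bag[m - 1][j], centers[k - 1][j], sigmas[k - 1][j])
        r[k - 1][m - 1] = truth
    # Layer 3: firing strength of each rule for the whole bag, as in (9).
    w = [softmax_alpha(r_k, alpha3) for r_k in r]
    # Layer 4: normalized firing strengths, as in (10).
    total = sum(w)
    w_bar = [w_k / total for w_k in w]
    # Layer 5: rule outputs, as in (12).
    o5 = [
        w_bar[k] * softmax_alpha([linear_response(x, b[k]) for x in bag], alpha5)
        for k in range(n_rules)
    ]
    # Layer 6: overall output, as in (13).
    return sum(o5)


centers = [[0.5, 0.5], [1.5, 1.5]]
sigmas = [[0.2, 0.2], [0.2, 0.2]]
b = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # zero-order consequents: rule 1 -> 1, rule 2 -> 0
positive_out = mi_anfis_forward([[0.5, 0.5], [3.0, 3.0]], centers, sigmas, b)
other_out = mi_anfis_forward([[3.0, 3.0], [4.0, 4.0]], centers, sigmas, b)
print(positive_out, other_out)
```

The first bag contains one instance at the center of rule 1, so the output is close to that rule's consequent; the second bag activates rule 2 relatively more strongly and the output stays near 0.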

To learn the parameters of the proposed MI-ANFIS network, we propose a generalization to the basic

learning algorithm presented by Jang [45]. Our variation is different from the standard ANFIS backpropagation

learning rule due to the additional layers (Layers 3 and 5) and the use of the “softmax” function (in (9)

and (11)). Thus, all update equations need to be rederived.
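Since the softmax function $S_\alpha$ of (8) is what changes the update equations, its limiting behavior is worth checking numerically. The sketch below is illustrative: the input values and the large constants standing in for $\alpha \to \pm\infty$ are arbitrary choices, and the shift inside the exponentials is a standard numerical-stability device that does not change the value of (8).

```python
import math


def softmax_alpha(values, alpha):
    """S_alpha(x_1..x_n) = sum_i x_i * exp(alpha*x_i) / sum_j exp(alpha*x_j)."""
    shift = max(alpha * v for v in values)          # avoids overflow in exp
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


xs = [0.1, 0.4, 0.9]
near_max = softmax_alpha(xs, 200.0)    # large positive alpha: behaves like max
mean = softmax_alpha(xs, 0.0)          # alpha = 0: arithmetic mean
near_min = softmax_alpha(xs, -200.0)   # large negative alpha: behaves like min
print(near_max, mean, near_min)
```

Because $S_\alpha$ is differentiable for any finite $\alpha$, it can replace the max T-conorm in Layer 3 while still allowing gradients to flow to every instance in the bag.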

BackPropagation Learning Rule: we assume that we have $N$ training bags, $\mathcal{B} = \{B_p \mid p = 1, \ldots, N\}$, and their corresponding labels $T = \{t_p \mid p = 1, \ldots, N\}$. After presenting the $p$th training bag, we compute its squared error measure:

$$
E_p = (t_p - O_p)^2.
\tag{14}
$$

In (14), $t_p$ is the desired bag output, and $O_p$ is the computed output of the network when presented with training bag $B_p$. Recall that labels at the instance level are not available and errors can be computed only at the bag level.

The overall error measure of the network after presenting all $N$ bags is

$$
E = \sum_{p=1}^{N} E_p.
\tag{15}
$$

To develop the gradient descent optimization on $E$, we compute the error rate for the $p$th training bag at each output node $O_{(l,i)}$. This error rate $\varepsilon_{(l,i)}$ (where $1 \le l \le 6$ indicates the MI-ANFIS layer) is defined as

$$
\varepsilon_{(l,i)} = \frac{\partial E_p}{\partial O_{(l,i)}}.
\tag{16}
$$

At the output node, we have

$$
\varepsilon_{(6,1)} = \frac{\partial E_p}{\partial O_{(6,1)}} = \frac{\partial E_p}{\partial O_p} = -2(t_p - O_p).
\tag{17}
$$

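For non-output nodes (i.e., internal nodes, $l < 6$), we derive the error rate using the chain rule

$$
\varepsilon_{(l,i)} = \frac{\partial E_p}{\partial O_{(l,i)}} = \sum_{h=1}^{\mathrm{Card}(l+1)} \frac{\partial E_p}{\partial O_{(l+1,h)}} \frac{\partial O_{(l+1,h)}}{\partial O_{(l,i)}},
\tag{18}
$$

where $\mathrm{Card}(l+1)$ refers to the number of nodes at layer $l+1$.

Next, we seek to minimize the network error with respect to the premise parameters $\{c_{kj}, \sigma_{kj} \mid 1 \le k \le |O_3|,\ 1 \le j \le D\}$, and with respect to the consequent parameters $\{\mathbf{b}_i\}_{i=1}^{|O_3|}$.

The error rate with respect to a generic parameter $\theta$ can be computed using

$$
\frac{\partial E_p}{\partial \theta} = \sum_{O^* \in G} \frac{\partial E_p}{\partial O^*} \frac{\partial O^*}{\partial \theta},
\tag{19}
$$

where $G$ is the set of nodes whose outputs depend on $\theta$.

Using (15), the total error rate is given by

$$
\frac{\partial E}{\partial \theta} = \sum_{p=1}^{N} \frac{\partial E_p}{\partial \theta}.
\tag{20}
$$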

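Update Rule For Premise Parameters: First we compute the error rate for the premise parameters $c_{kj}$ and $\sigma_{kj}$ using

$$
\frac{\partial E_p}{\partial c_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}} \frac{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}}
\tag{21}
$$

and

$$
\frac{\partial E_p}{\partial \sigma_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}} \frac{\partial O_{(1,\, i+[(k-1)D+(j-1)]M_p)}}{\partial \sigma_{kj}}.
\tag{22}
$$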

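Using the chain rule defined in (18), it can be shown that (see derivation in Appendix A)

$$
\begin{aligned}
\frac{\partial E_p}{\partial c_{kj}} ={} & -2(t_p - O_p) \times S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_k, \mathbf{x}_{p2}\cdot\mathbf{b}_k, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\left(\sum_{l=1}^{|O_3|} w_l\right)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,(i+(k-1)M_p))}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha \Big( r_{(k,(i+(k-1)M_p))} - S_\alpha\big(\{r_{(k,m)}\}_{m=1}^{M_p}\big) \Big) \Big] \\
& \qquad \times \prod_{d=1,\, d \neq j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,(i+(k-1)M_p)[M_p],d)}\big) \\
& \qquad \times \frac{x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}}{\sigma_{kj}^2} \times \exp\left(-\frac{\big(x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right) \Bigg).
\end{aligned}
\tag{23}
$$

The center parameters $c_{kj}$ are then updated using

$$
\Delta c_{kj} = -\eta \frac{\partial E}{\partial c_{kj}},
\tag{24}
$$

where $\eta$ is the learning rate.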
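The inner sum of (23) is driven by the derivative of $S_\alpha$ with respect to one of its arguments, which follows from (8) by the quotient rule: $\partial S_\alpha / \partial x_i = \frac{e^{\alpha x_i}}{\sum_j e^{\alpha x_j}} \left[ 1 + \alpha (x_i - S_\alpha) \right]$. The sketch below compares this closed form with a central finite difference. The function names and test values are assumptions chosen for illustration.

```python
import math


def softmax_alpha(values, alpha):
    """S_alpha as defined in (8), with a stabilizing shift."""
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def softmax_alpha_grad(values, alpha):
    """Analytic partial derivatives dS_alpha/dx_i = p_i * (1 + alpha * (x_i - S_alpha))."""
    s = softmax_alpha(values, alpha)
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    z = sum(weights)
    return [(w / z) * (1.0 + alpha * (v - s)) for v, w in zip(values, weights)]


def numerical_grad(values, alpha, step=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(values)):
        up, down = list(values), list(values)
        up[i] += step
        down[i] -= step
        grads.append((softmax_alpha(up, alpha) - softmax_alpha(down, alpha)) / (2.0 * step))
    return grads


truth_instances = [0.05, 0.60, 0.30, 0.90]   # hypothetical r_(k,m) values for one rule
analytic = softmax_alpha_grad(truth_instances, 4.0)
numeric = numerical_grad(truth_instances, 4.0)
print(analytic)
print(numeric)
```

A check of this kind is a cheap way to validate each factor of a hand-derived update equation before assembling the full gradient.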

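The update formula for $\sigma_{kj}$ can be derived in a similar manner. It can be shown that

$$
\begin{aligned}
\frac{\partial E_p}{\partial \sigma_{kj}} ={} & -2(t_p - O_p) \times S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_k, \mathbf{x}_{p2}\cdot\mathbf{b}_k, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\left(\sum_{l=1}^{|O_3|} w_l\right)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,(i+(k-1)M_p))}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha \Big( r_{(k,(i+(k-1)M_p))} - S_\alpha\big(\{r_{(k,m)}\}_{m=1}^{M_p}\big) \Big) \Big] \\
& \qquad \times \prod_{d=1,\, d \neq j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,(i+(k-1)M_p)[M_p],d)}\big) \\
& \qquad \times \frac{\big(x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}\big)^2}{\sigma_{kj}^3} \times \exp\left(-\frac{\big(x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right) \Bigg).
\end{aligned}
\tag{25}
$$

The MF widths, $\sigma_{kj}$, are then updated using

$$
\Delta \sigma_{kj} = -\eta \frac{\partial E}{\partial \sigma_{kj}}.
\tag{26}
$$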
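The last two factors of (23) and (25) are the partial derivatives of the Gaussian MF in (6) with respect to its center and width: $\partial \mu / \partial c = \mu\,(x - c)/\sigma^2$ and $\partial \mu / \partial \sigma = \mu\,(x - c)^2/\sigma^3$. The following sketch verifies both against finite differences. The helper names and the numeric values are illustrative assumptions.

```python
import math


def gaussian_mf(x, c, sigma):
    """Gaussian membership function of (6)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def d_mf_d_center(x, c, sigma):
    """Partial derivative of the MF with respect to its center c."""
    return gaussian_mf(x, c, sigma) * (x - c) / sigma ** 2


def d_mf_d_sigma(x, c, sigma):
    """Partial derivative of the MF with respect to its width sigma."""
    return gaussian_mf(x, c, sigma) * (x - c) ** 2 / sigma ** 3


def central_difference(f, value, step=1e-6):
    """Numerical derivative of a one-argument function, for checking only."""
    return (f(value + step) - f(value - step)) / (2.0 * step)


x, c, sigma = 0.8, 0.5, 0.3
num_dc = central_difference(lambda v: gaussian_mf(x, v, sigma), c)
num_ds = central_difference(lambda v: gaussian_mf(x, c, v), sigma)
print(d_mf_d_center(x, c, sigma), num_dc)
print(d_mf_d_sigma(x, c, sigma), num_ds)
```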

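Update Rule For Consequent Parameters: The error rate for the consequent parameters $\{\mathbf{b}_i = \{b^i_0, \ldots, b^i_D\},\ i = 1, \ldots, |O_3|\}$ is defined as

$$
\frac{\partial E_p}{\partial \mathbf{b}_i} = \left[ \frac{\partial E_p}{\partial b^i_0}, \frac{\partial E_p}{\partial b^i_1}, \ldots, \frac{\partial E_p}{\partial b^i_D} \right],
\tag{27}
$$

where

$$
\frac{\partial E_p}{\partial b^i_j} = \frac{\partial E_p}{\partial O_{(5,i)}} \frac{\partial O_{(5,i)}}{\partial b^i_j}, \quad \text{for } j = 1, \ldots, D.
\tag{28}
$$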

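Using (18), it can be shown that (see Appendix B)

$$
\begin{aligned}
\frac{\partial E}{\partial b^i_j} = \sum_{p=1}^{N} \frac{\partial E_p}{\partial b^i_j} = \sum_{p=1}^{N} \Bigg( & -2(t_p - O_p)\, \bar{w}_i \sum_{m=1}^{M_p} \frac{1}{\left(\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph}\cdot\mathbf{b}_i - \mathbf{x}_{pm}\cdot\mathbf{b}_i)\big)\right)^2} \\
& \times \Big[ x_{(p,m,j)} \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph}\cdot\mathbf{b}_i - \mathbf{x}_{pm}\cdot\mathbf{b}_i)\big) \\
& \qquad - \mathbf{x}_{pm}\cdot\mathbf{b}_i \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph}\cdot\mathbf{b}_i - \mathbf{x}_{pm}\cdot\mathbf{b}_i)\big)\, \alpha\, \big(x_{(p,h,j)} - x_{(p,m,j)}\big) \Big] \Bigg).
\end{aligned}
\tag{29}
$$

The factor $-2(t_p - O_p)$ is the output error rate of (17), which enters through $\partial E_p / \partial O_{(5,i)}$ in (28).

The consequent parameters are then updated using

$$
\Delta b^i_j = -\eta \frac{\partial E}{\partial b^i_j}.
\tag{30}
$$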
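The bracketed term of (29) is the derivative of $S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_i, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_i)$ with respect to $b^i_j$, written via the quotient rule. The sketch below implements that term and checks it against a finite difference. It leaves out the factors $-2(t_p - O_p)$ and $\bar{w}_i$, and it assumes that each instance is augmented with a leading constant 1 so that `b[0]` acts as the bias coefficient; all names and values are illustrative.

```python
import math


def softmax_alpha(values, alpha):
    """S_alpha as defined in (8), with a stabilizing shift."""
    shift = max(alpha * v for v in values)
    weights = [math.exp(alpha * v - shift) for v in values]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def augment(bag):
    """Assumption: prepend a constant 1 so that b[0] multiplies a bias feature."""
    return [[1.0] + list(x) for x in bag]


def consequent(bag, b, alpha):
    """S_alpha over the linear responses x_pm . b of all instances in the bag."""
    return softmax_alpha([sum(bj * xj for bj, xj in zip(b, x)) for x in augment(bag)], alpha)


def consequent_grad(bag, b, alpha):
    """Quotient-rule form of dS_alpha/db_j, mirroring the bracketed term of (29)."""
    aug = augment(bag)
    u = [sum(bj * xj for bj, xj in zip(b, x)) for x in aug]
    grad = []
    for j in range(len(b)):
        total = 0.0
        for m, xm in enumerate(aug):
            e = [math.exp(alpha * (uh - u[m])) for uh in u]
            denom = sum(e)
            d_denom = sum(eh * alpha * (xh[j] - xm[j]) for eh, xh in zip(e, aug))
            total += (xm[j] * denom - u[m] * d_denom) / denom ** 2
        grad.append(total)
    return grad


bag = [[0.2, 0.4], [1.0, -0.5], [0.3, 0.9]]   # hypothetical bag with three instances
b = [0.1, 0.5, -0.3]
alpha = 2.0
analytic = consequent_grad(bag, b, alpha)
numeric = []
for j in range(len(b)):
    up, down = list(b), list(b)
    up[j] += 1e-6
    down[j] -= 1e-6
    numeric.append((consequent(bag, up, alpha) - consequent(bag, down, alpha)) / 2e-6)
print(analytic)
print(numeric)
```

For $\alpha = 0$ the term reduces to the mean of feature $j$ over the bag, which is a quick sanity check of the expression.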

Equations (24), (26) and (30) can be used to update the $c_{kj}$, $\sigma_{kj}$ and $b^i_j$ parameters either on-line, bag by bag (we want to emphasize here that the on-line learning is not achieved instance by instance, but rather bag by bag), or off-line in batch mode after presentation of the entire data.

The proposed MI-ANFIS learning algorithm is summarized in Algorithm 1.

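Algorithm 1 MI-ANFIS Basic Learning Algorithm

Inputs:
    $\mathcal{B}$: the set of training bags.
    $T$: labels of the training bags.
    $M$: the number of instances in each bag.
    $\alpha$: the constant used in the “softmax” function.
    $\eta$: the learning rate.
    $E_{max}$: number of epochs.
    $\epsilon$: minimum parameters change value.

Outputs:
    $\mathbf{b}_i$: the sets of consequent parameters.
    $\mathbf{c}_i$: the set of membership functions’ centers.
    $\boldsymbol{\sigma}_i$: the set of membership functions’ widths.

Initialize $\mathbf{b}_i$, $\mathbf{c}_i$, and $\boldsymbol{\sigma}_i$.
repeat
    Update $\mathbf{b}_i$ using (30) and $\mathbf{b}_i^{(new)} = \mathbf{b}_i^{(old)} + \Delta\mathbf{b}_i$.
    Update $\mathbf{c}_i$ using (24) and $\mathbf{c}_i^{(new)} = \mathbf{c}_i^{(old)} + \Delta\mathbf{c}_i$.
    Update $\boldsymbol{\sigma}_i$ using (26) and $\boldsymbol{\sigma}_i^{(new)} = \boldsymbol{\sigma}_i^{(old)} + \Delta\boldsymbol{\sigma}_i$.
until $\max(\|\mathbf{b}_i^{(new)} - \mathbf{b}_i^{(old)}\|, \|\mathbf{c}_i^{(new)} - \mathbf{c}_i^{(old)}\|, \|\boldsymbol{\sigma}_i^{(new)} - \boldsymbol{\sigma}_i^{(old)}\|) < \epsilon$ or number of epochs $> E_{max}$
return $\mathbf{b}_i$, $\mathbf{c}_i$, $\boldsymbol{\sigma}_i$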
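The control flow of Algorithm 1 (gradient steps on three parameter groups, stopped by a minimum-change threshold or an epoch limit) can be sketched generically. This is not the MI-ANFIS trainer itself: the gradient function is supplied by the caller, all groups are updated from gradients evaluated at the same point, and a toy quadratic objective stands in for the network error so that the example stays self-contained. The names `train`, `min_change`, and `toy_grad` are assumptions.

```python
import math


def train(params, grad_fn, eta=0.1, max_epochs=1000, min_change=1e-6):
    """Gradient descent over named parameter groups with the two stopping rules.

    params:  dict mapping a group name (e.g. "b", "c", "sigma") to a list of floats.
    grad_fn: callable returning dE/dtheta in the same dict-of-lists layout.
    """
    epoch = 0
    while True:
        epoch += 1
        grads = grad_fn(params)
        updated, largest_change = {}, 0.0
        for name, values in params.items():
            deltas = [-eta * g for g in grads[name]]           # Delta theta = -eta * dE/dtheta
            updated[name] = [v + d for v, d in zip(values, deltas)]
            largest_change = max(largest_change, math.sqrt(sum(d * d for d in deltas)))
        params = updated
        # Stop when the largest parameter change is small or the epoch budget is spent.
        if largest_change < min_change or epoch >= max_epochs:
            return params, epoch


# Toy stand-in objective: E = sum over all parameters of (theta - target)^2.
targets = {"b": [3.0, -1.0], "c": [0.5], "sigma": [2.0]}


def toy_grad(params):
    return {name: [2.0 * (v - t) for v, t in zip(params[name], targets[name])] for name in params}


initial = {"b": [0.0, 0.0], "c": [0.0], "sigma": [1.0]}
final, epochs_used = train(initial, toy_grad)
print(final, epochs_used)
```

In an actual MI-ANFIS implementation, `grad_fn` would evaluate (23), (25), and (29) summed over the training bags (batch mode) or for a single bag (on-line mode).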

III. PREVENTING OVERFITTING: RULE DROPOUT

Neural networks with a large number of parameters are susceptible to overfitting. MI-ANFIS is no exception, particularly when using a large number of multiple instance fuzzy rules and relatively small training datasets. In such a scenario, some rules could co-adapt to the training data and degrade the network’s ability to generalize to unseen examples. In this section, we present a technique, known as Dropout, used to prevent overfitting and rules’ co-adaptation.

Dropout is a regularization method that was introduced by Hinton et al. [46] to alleviate the serious problem of overfitting in deep neural networks. Over the years, many methods have been developed to reduce overfitting, including using a validation dataset to stop the training as soon as the performance gets worse, adding weight penalties using L1 and L2 regularization, or artificially augmenting the training dataset using label-preserving transformations. However, as noted by Hinton et al. [46], the best way to regularize a fixed-size model is to average the predictions of all possible settings of the parameters, weighting each setting by its posterior probability given the training data. This can be achieved by combining the predictions of an exponential number of models. Combining several models with different architectures has the advantage of better generalization and, consequently, better testing performance. While generating an ensemble of models is trivial, training them all is prohibitively expensive.

Generally, Dropout works by setting to 0 the output of each node in a given layer with probability $1-p$ ($p$ typically equals 0.5) during training. Nodes that are dropped out do not contribute to the parameter updates. During testing, all nodes are used but the outputs are weighted by the probability $p$. Following this strategy, every time a new training example is presented, the network samples and trains a different architecture. In other words, Dropout trains an ensemble of networks ($2^N$ networks, $N$ being the number of nodes) simultaneously, leading to an important speedup in training time as compared to traditional ensemble methods. Figure 4 and Figure 5 illustrate the Dropout model.

Fig. 4. Dropout neural network model. (a) is a standard neural network. (b) is the same network after applying dropout. Dotted lines indicate a node that has been dropped. (source [46])

Fig. 5. Illustration of Dropout application. (a) a node is dropped with probability $1-p$ at training time. (b) at test time the node is always present and its outputs are weighted by $p$. (source [46])
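The train-time and test-time behavior described above can be sketched for a single layer as follows. This is a generic illustration of node-level Dropout, not code from [46]; the function name, the seed, and the layer size are assumptions.

```python
import random


def dropout_forward(outputs, p, training, rng):
    """Keep each node output with probability p during training; scale by p at test time."""
    if training:
        return [o if rng.random() < p else 0.0 for o in outputs]
    return [p * o for o in outputs]


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
layer_outputs = [1.0] * 10000
train_out = dropout_forward(layer_outputs, 0.5, True, rng)
test_out = dropout_forward(layer_outputs, 0.5, False, rng)
kept_fraction = sum(1 for o in train_out if o != 0.0) / len(train_out)
print(kept_fraction, test_out[0])
```

Scaling by $p$ at test time makes the expected test-time output of a node match its expected output under the random training-time mask.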

In this paper, we propose to adopt the Dropout strategy to regularize MI-ANFIS networks. Typically, overfitting occurs in MI-ANFIS networks when a set of multiple instance rules co-adapt to the provided data early during the training process and prevent the remaining rules from learning, thus degrading the network’s generalization capability. While the Dropout technique could be applied to MI-ANFIS as is (given the inherited neural network nature of the architecture), care should be exercised when selecting nodes to include in the list of the randomly dropped out nodes. MI-ANFIS nodes are different from those of standard neural networks as they are grouped into rules to model and express linguistic terms. Simply dropping a few nodes from a given rule can change its role and could severely handicap the fuzzy inference process. Hence, Dropout should be executed differently. In deep neural nets, Dropout is applied to selected layers (vertically). For MI-ANFIS, we propose to apply Dropout on a rule by rule basis (i.e., horizontally). Either the whole rule is included, or the whole rule is dropped. This can be achieved by applying Dropout to Layer 5 (see Figure 6), i.e., setting to zero the output of the “to be dropped out” rules. We will refer to this derived technique as “Rule Dropout”. Using a Rule Dropout strategy to train MI-ANFIS networks is approximately equivalent to sampling and training an ensemble of $2^R$ networks ($R$ is the number of rules).

Let $p$ be the probability with which a rule is present. Formally, Rule Dropout is applied to Layer 5 during training as follows:

$$
O_{(5,i)} = h_i\, \bar{w}_i\, S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_i, \mathbf{x}_{p2}\cdot\mathbf{b}_i, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_i),
\tag{31}
$$

where

$$
h_i \sim \mathrm{Bernoulli}(p)
\tag{32}
$$

is a Bernoulli random variable with probability $p$ of being 1. During testing, the Layer 5 output is scaled by $p$, i.e., $O_{(5,i)} = p\, \bar{w}_i\, S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_i, \mathbf{x}_{p2}\cdot\mathbf{b}_i, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_i)$. Figure 6 illustrates our MI-ANFIS network with 3 multiple instance fuzzy rules where, at a given iteration, rule 2 has been dropped out.
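Rule Dropout as defined in (31) and (32) amounts to masking whole rule outputs at Layer 5. A minimal sketch follows; the function name, the example rule outputs, and the seed are assumptions, and each entry of `rule_outputs` stands for the already computed product $\bar{w}_i S_\alpha(\cdot)$ of one rule.

```python
import random


def rule_dropout(rule_outputs, p, training, rng=None):
    """Apply Rule Dropout to Layer 5 outputs.

    Training: each rule is kept with probability p (h_i ~ Bernoulli(p)), else zeroed.
    Testing:  every rule output is scaled by p.
    """
    if not training:
        return [p * o for o in rule_outputs]
    mask = [1 if rng.random() < p else 0 for _ in rule_outputs]
    return [h * o for h, o in zip(mask, rule_outputs)]


rule_outputs = [0.20, 0.50, 0.30]           # hypothetical Layer 5 outputs of 3 rules
rng = random.Random(1)
dropped = rule_dropout(rule_outputs, 0.5, True, rng)
scaled = rule_dropout(rule_outputs, 0.5, False)
network_output_test = sum(scaled)           # Layer 6 sums the (scaled) rule outputs
print(dropped, scaled, network_output_test)
```

Because the mask is applied per rule rather than per node, a rule is either fully present or fully absent in a given training step, which keeps its linguistic interpretation intact.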

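Fig. 6. Illustration of Rule Dropout application. Dotted lines indicate a rule that has been dropped.

Deriving the new update equations for MI-ANFIS parameters requires taking into consideration the added Bernoulli random variable, $h_i$. It is straightforward to show that the new gradients with respect to premise and consequent parameters are given by

$$
\begin{aligned}
\frac{\partial E_p}{\partial c_{kj}} ={} & -2(t_p - O_p) \times h_k \times S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_k, \mathbf{x}_{p2}\cdot\mathbf{b}_k, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\left(\sum_{l=1}^{|O_3|} w_l\right)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,(i+(k-1)M_p))}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha \Big( r_{(k,(i+(k-1)M_p))} - S_\alpha\big(\{r_{(k,m)}\}_{m=1}^{M_p}\big) \Big) \Big] \\
& \qquad \times \prod_{d=1,\, d \neq j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,(i+(k-1)M_p)[M_p],d)}\big) \\
& \qquad \times \frac{x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}}{\sigma_{kj}^2} \times \exp\left(-\frac{\big(x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right) \Bigg)
\end{aligned}
\tag{33}
$$

and

$$
\begin{aligned}
\frac{\partial E_p}{\partial \sigma_{kj}} ={} & -2(t_p - O_p) \times h_k \times S_\alpha(\mathbf{x}_{p1}\cdot\mathbf{b}_k, \mathbf{x}_{p2}\cdot\mathbf{b}_k, \ldots, \mathbf{x}_{pM_p}\cdot\mathbf{b}_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\left(\sum_{l=1}^{|O_3|} w_l\right)^2} \\
& \times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,(i+(k-1)M_p))}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha \Big( r_{(k,(i+(k-1)M_p))} - S_\alpha\big(\{r_{(k,m)}\}_{m=1}^{M_p}\big) \Big) \Big] \\
& \qquad \times \prod_{d=1,\, d \neq j}^{D} \mu_{A_{(\lceil (i+(k-1)M_p)/M_p \rceil,\, d)}}\big(x_{(p,(i+(k-1)M_p)[M_p],d)}\big) \\
& \qquad \times \frac{\big(x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}\big)^2}{\sigma_{kj}^3} \times \exp\left(-\frac{\big(x_{(p,(i+(k-1)M_p)[M_p],j)} - c_{kj}\big)^2}{2\sigma_{kj}^2}\right) \Bigg).
\end{aligned}
\tag{34}
$$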

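In a similar manner,

$$
\begin{aligned}
\frac{\partial E}{\partial b^i_j} = \sum_{p=1}^{N} \Bigg( & -2(t_p - O_p) \times h_i\, \bar{w}_i \sum_{m=1}^{M_p} \frac{1}{\left(\sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph}\cdot\mathbf{b}_i - \mathbf{x}_{pm}\cdot\mathbf{b}_i)\big)\right)^2} \\
& \times \Big[ x_{(p,m,j)} \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph}\cdot\mathbf{b}_i - \mathbf{x}_{pm}\cdot\mathbf{b}_i)\big) \\
& \qquad - \mathbf{x}_{pm}\cdot\mathbf{b}_i \sum_{h=1}^{M_p} \exp\big(\alpha(\mathbf{x}_{ph}\cdot\mathbf{b}_i - \mathbf{x}_{pm}\cdot\mathbf{b}_i)\big)\, \alpha\, \big(x_{(p,h,j)} - x_{(p,m,j)}\big) \Big] \Bigg).
\end{aligned}
\tag{35}
$$

As can be seen, equations (33), (34), and (35) are zeroed when the rule is dropped out (i.e., $h_k = 0$ and $h_i = 0$). Thus, the premise and consequent parameters of a dropped rule are not updated.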

IV. EXPERIMENTAL RESULTS

A. Synthetic Data

To illustrate the proposed multiple instance fuzzy inference and its ability to learn from data without instance-level labels, we first use a simple 2-D synthetic dataset. These data were generated from a distribution of two positive concepts with centers at (0.5, 0.5) and (1.5, 1.5), and with a fixed standard deviation. These centers are marked with squares in Figure 7, and the circles around the centers indicate regions within 1 standard deviation. These regions are considered the two target concepts. From each positive concept we generated 50 bags. Each bag has a random number of instances, between 2 and 10. Each bag from concept 1 (or 2) has at least one instance close to target concept 1 (or 2). We also generated 50 negative bags randomly from non-concept regions. Negative bags have all of their instances outside both target concepts. In Figure 7, instances from negative bags are shown as “.”, and instances from positive bags are shown as “+” or “M” depending on the underlying concept. In Figure 7, we highlight one bag from Concept 1 by circling all of its instances. As can be seen, one of its instances is within the one standard deviation region of target concept 1 while the other instances are scattered around. We should emphasize here that the centers of the target concepts in Figure 7 are unknown to, and not used by, the learning algorithm. They are shown here for illustration and validation purposes only.
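A data set of this kind can be generated along the following lines. This is only a sketch: the value of `STD`, the extent of the instance space (`LOW`, `HIGH`), and the choice to place exactly one instance inside the concept region of each positive bag are assumptions, since the text specifies only the centers, the bag counts, and the 2 to 10 instances per bag.

```python
import numpy as np

CENTERS = np.array([[0.5, 0.5], [1.5, 1.5]])  # concept centers given in the text
STD = 0.1               # assumption: the text only says "a fixed standard deviation"
LOW, HIGH = 0.0, 2.0    # assumption: extent of the 2-D instance space

def in_concept(x):
    """True if x lies within one standard deviation of either concept center."""
    return bool(np.any(np.linalg.norm(CENTERS - x, axis=1) <= STD))

def sample_outside(rng):
    """Uniform sample from the instance space, rejected while inside a concept."""
    while True:
        x = rng.uniform(LOW, HIGH, size=2)
        if not in_concept(x):
            return x

def sample_inside(rng, center):
    """Gaussian sample around a center, rejected while farther than one STD."""
    while True:
        x = rng.normal(center, STD)
        if np.linalg.norm(x - center) <= STD:
            return x

def make_bag(rng, concept=None):
    """Negative bag if concept is None, otherwise a positive bag for that concept."""
    n = rng.integers(2, 11)  # 2 to 10 instances (upper bound is exclusive)
    bag = [sample_outside(rng) for _ in range(n)]
    if concept is not None:
        bag[0] = sample_inside(rng, CENTERS[concept])  # the one "true" positive instance
    return np.array(bag)

rng = np.random.default_rng(0)
pos_bags = [make_bag(rng, c) for c in (0, 1) for _ in range(50)]
neg_bags = [make_bag(rng) for _ in range(50)]
print(len(pos_bags), "positive bags,", len(neg_bags), "negative bags")
```

Only bag-level labels (positive or negative) would be passed to the learner; the position of the positive instance inside each bag stays hidden.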

1) MI-ANFIS Rules Learning: In the following, we show that the MI-ANFIS Learning Algorithm (Algorithm 1) is capable of identifying positive concepts as well as their corresponding multiple instance fuzzy rules. To initialize the premise parameters, we partition the instance space into 6 randomly generated partitions (see footnote 2). We use the partitions’ centers as initial centers for the Gaussian MFs, and we initialize all standard deviation parameters to a default value of 0.5.

The initial fuzzy sets (MFs) of the rules, before training, are displayed in Figure 8 as dashed lines. As can be seen, the initial 6 partitions simply cover random quadrants of the 2-D instance space. (If no label information is used, as in this case, the data appear to be uniformly distributed; refer to Figure 7.) The learned fuzzy sets after convergence are shown in Figure 8 as bold lines. The system has correctly identified the positive concepts and, at the same time, identified irrelevant rules (MI-Rule 1, MI-Rule 3, and MI-Rule 5) and assigned low output values to each: −0.3, −0.06, and −0.12, respectively.

B. Benchmark Datasets

Fig. 7. Instances from positive and negative bags drawn from data that have 2 concepts. The center of each target concept is indicated by a square and the circles indicate the region within 1 standard deviation from the mean. Instances from negative bags are shown as “.”, and instances from positive bags are shown as “+” or “M”. Instances from one sample positive bag are circled.

Footnote 2: A grid or manual partitioning could also be used.

To provide a quantitative evaluation of the proposed MI-ANFIS, we apply it to five benchmark data sets commonly used to evaluate MIL methods: MUSK1 and MUSK2 [19], and Fox, Tiger, and Elephant from the COREL data set [47]. The MUSK1 and MUSK2 data sets consist of descriptions of molecules, and the objective is to classify whether a molecule smells musky [48]. In these data sets, each bag represents a molecule and the instances within each bag represent the different low-energy conformations of the molecule. Each instance is characterized by 166 features. MUSK1 has 92 bags, of which 47 are positive, and MUSK2 has 102 bags, of which 39 are positive. The other data sets (Fox, Tiger, and Elephant) are used to classify whether an image contains the corresponding animal. Each data set consists of 200 images (bags): 100 positive images containing the target animal and 100 negative images containing other animals. Each image is represented as a set of patches (instances), and each patch is in turn represented by a 230-dimensional feature vector describing its color, texture, and shape information. We note that the three data sets are independent and each is treated as a binary classification problem (positive vs. negative). Table I summarizes the characteristics of the five data sets. For each benchmark data set, PCA was applied to reduce the dimensionality of the features in order to speed up MI-ANFIS training and increase the interpretability of the generated multiple instance fuzzy rules.
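The PCA step can be sketched as follows for bag-structured data. Fitting the projection on the pooled instances of all bags is an assumption (the text does not say how PCA was fitted), the helper names `fit_pca` and `project_bag` are hypothetical, and the random bags only stand in for the 230-dimensional COREL features.

```python
import numpy as np

def fit_pca(instances, n_components):
    """Return the mean and the top principal directions of the pooled instances."""
    mean = instances.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(instances - mean, full_matrices=False)
    return mean, vt[:n_components]

def project_bag(bag, mean, components):
    """Project every instance of a bag onto the retained components."""
    return (bag - mean) @ components.T

rng = np.random.default_rng(0)
# Toy stand-in: 20 bags with 2 to 13 instances of 230 features each
bags = [rng.normal(size=(rng.integers(2, 14), 230)) for _ in range(20)]

mean, comps = fit_pca(np.vstack(bags), n_components=10)
reduced = [project_bag(b, mean, comps) for b in bags]
print(reduced[0].shape)  # (number of instances in the first bag, 10)
```

The bag structure is preserved: each bag keeps its instances, and only the feature dimension shrinks (230 to 10 for the image data sets, 166 to 25 for MUSK, as listed in Table I).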

Fig. 8. Learned MFs after convergence of the MI-ANFIS training algorithm. Initial MFs before training are marked with dashed lines. Learned MFs are shown in solid bold lines.

TABLE I. BENCHMARK DATA SETS

Data set    Dim. (PCA)    No. bags    Positive    Negative    Instances per bag
MUSK1       166 (25)      92          47          45          2 to 40
MUSK2       166 (25)      102         39          63          1 to 1044
Fox         230 (10)      200         100         100         2 to 13
Tiger       230 (10)      200         100         100         1 to 13
Elephant    230 (10)      200         100         100         2 to 13

For each experiment, we construct a zero-order MI-ANFIS with a given number of multiple instance rules. For MI-ANFIS, the number of rules is not critical. It should be large enough to cover the diverse regions of the input space and the multiple concepts. If the specified number of rules is too large, some will vanish, as illustrated in Figure 8 for the example with the synthetic data. Also, a larger number of rules leads to slower training. We use Gaussian MFs to describe the input fuzzy sets. For initialization, we use the FCM algorithm to cluster the instances of the positive bags into a number of clusters equal to the number of fuzzy rules, and we initialize the MFs’ centers as the cluster centers. MI-ANFIS was trained and tested using ten-fold cross validation. Table II summarizes all parameters used in training the MI-ANFIS (parameters were manually selected using trial and error). We note that the reason for using larger standard deviations for the MUSK1 and MUSK2 data sets is their higher dimensionality. We expect the sparsity to increase with the dimensionality of the feature space, so we set the standard deviations to larger values to allow the initial rules to cover the entirety of the input space.
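The FCM-based initialization can be sketched with a compact NumPy implementation of standard fuzzy c-means. The fuzzifier `m = 2`, the iteration limits, and the toy instances are assumptions; the paper states only that FCM is run on the instances of the positive bags and that the cluster centers become the initial MF centers.

```python
import numpy as np

def fuzzy_c_means(x, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means; returns (centers, membership matrix)."""
    rng = np.random.default_rng(seed)
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)          # memberships of each point sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)         # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u

# Toy stand-in for the pooled instances of the positive bags
rng = np.random.default_rng(1)
pos_instances = np.vstack([rng.normal(loc, 0.05, size=(40, 2))
                           for loc in ((0.5, 0.5), (1.5, 1.5))])

n_rules = 2
mf_centers, memberships = fuzzy_c_means(pos_instances, n_clusters=n_rules)
mf_sigmas = np.full_like(mf_centers, 10.0)     # one sigma per rule and input (cf. Table II)
print(mf_centers)
```

Each row of `mf_centers` would initialize the Gaussian MF centers of one multiple instance rule, one value per input dimension.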

First, to illustrate the advantage of using MI-ANFIS over the traditional ANFIS, we compare these two algorithms on the benchmark data sets. Since ANFIS cannot learn from ambiguously labeled data, for the sake of comparison, we consider the naive MIL assumption where all instances from positive bags are considered positive and all instances from negative bags are considered negative. We refer to this implementation as Naive-ANFIS. The results are summarized in Table III, where the performance is reported in terms of prediction accuracy averaged over all 10 cross validation sets (% correct ± standard deviation).
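The naive MIL assumption amounts to copying each bag label to every instance in the bag. A minimal sketch follows; the function name `naive_instance_labels` and the toy bags are hypothetical.

```python
import numpy as np

def naive_instance_labels(bags, bag_labels):
    """Flatten bags into an instance matrix and give each instance its bag's label."""
    x = np.vstack(bags)
    y = np.concatenate([np.full(len(bag), label)
                        for bag, label in zip(bags, bag_labels)])
    return x, y

# Two toy bags: a positive bag with 3 instances and a negative bag with 2
bags = [np.array([[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]),
        np.array([[0.3, 0.9], [0.7, 0.1]])]
x, y = naive_instance_labels(bags, bag_labels=[1, 0])
print(y)  # [1 1 1 0 0]
```

In the positive bag above, possibly only one instance is truly positive, so the other two receive wrong labels. This is the label noise that a single-instance learner such as Naive-ANFIS has to absorb.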


TABLE II. MI-ANFIS TRAINING PARAMETERS

Parameter            MUSK1    MUSK2    Fox     Tiger    Elephant
No. of MI rules      6        3        2       4        3
No. of inputs        25       25       10      10       10
MF’s σ               100      100      10      10       10
Output parameters    1s       1s       1s      1s       1s
Softmax’s α          1        1        1       1        1
Learning rate        0.1      0.1      0.1     0.1      0.1

TABLE III. COMPARISON OF MI-ANFIS PREDICTION ACCURACY (IN PERCENT) TO NAIVE-ANFIS ON THE BENCHMARK DATA SETS.

Algorithm      MUSK1           MUSK2           Fox             Tiger           Elephant
MI-ANFIS       93.49 ± 0.76    90.58 ± 1.31    66.4 ± 2.77     84.5 ± 0.61     86.97 ± 1.10
Naive-ANFIS    67.82 ± 4.04    79.43 ± 5.04    58.70 ± 1.35    77.70 ± 0.83    82.2 ± 0.83

As can be seen, MI-ANFIS outperforms Naive-ANFIS significantly on all five data sets. This is because inaccurately labeled instances within the positive bags were used for training the Naive-ANFIS. The difference in performance between MI-ANFIS and Naive-ANFIS is greater for MUSK1 and MUSK2 because of the greater number of instances per bag (more ambiguity).

TABLE IV. COMPARISON OF MI-ANFIS PREDICTION ACCURACY (IN PERCENT) TO OTHER METHODS ON THE BENCHMARK DATA SETS. RESULTS FOR THE 3 TOP PERFORMING METHODS ARE MARKED WITH AN ASTERISK. WE USE REPORTED RESULTS; N/A INDICATES THAT A GIVEN ALGORITHM WAS NOT APPLIED TO THAT DATA SET.

Algorithm            MUSK1            MUSK2            Fox             Tiger           Elephant
MI-ANFIS             93.49 ± 0.76*    90.58 ± 1.31*    66.4 ± 2.77*    84.5 ± 0.61*    86.97 ± 1.10*
MILES [49]           86.3             87.7             N/A             N/A             N/A
APR [19]             92.4             89.2             N/A             N/A             N/A
DD [21]              88.9             82.5             N/A             N/A             N/A
DD-SVM [50]          85.8             91.3*            N/A             N/A             N/A
EM-DD [51]           84.8             84.9             56.1            72.1            78.3
Citation-KNN [52]    92.4             86.3             N/A             N/A             N/A
MI-SVM [47]          77.9             84.3             57.8            84.0            81.4
mi-SVM [47]          87.4             83.6             58.2            78.4            82.2
MI-NN [53]           88.0             82.0             N/A             N/A             N/A
Bagging-APR [54]     92.8*            93.1*            N/A             N/A             N/A
RBF-MIP [43]         91.3 ± 1.6       90.1 ± 1.7       N/A             N/A             N/A
BP-MIP [42]          83.7             80.4             N/A             N/A             N/A
RBF-Bag-Unit [55]    90.3             86.6             N/A             N/A             N/A
MI-kernel [56]       88.0             89.3             60.3            84.2            84.3
PPPM-kernel [57]     95.6*            81.2             60.3            80.2            82.4
MIGraph [56]         90.0             90.0             61.2            81.9            85.1*
miGraph [56]         88.9             90.3             61.6            86.0*           86.8*
ALP-SVM [58]         86.3             86.2             66.0*           86.0*           83.5
MIForest [59]        85.0             82.0             64.0*           82.0            84.0

TABLE V. BENCHMARK DATA SET

Data set    Dim. (PCA)    No. bags    Positive    Negative    Instances per bag
Elephant    230 (10)      200         100         100         2 to 13

Table IV compares the performance of the proposed algorithm to state of the art MIL algorithms on the benchmark data sets.

Overall, MI-ANFIS is comparable to other MIL algorithms. In fact, on all tested data sets, MI-ANFIS ranked consistently among the top three. For MUSK1, PPPM-kernel [57] performed the best (95.6%), but this algorithm did not perform as well on the other sets. For MUSK2, Bagging-APR [54] achieved the best accuracy, as reported by [49]. MI-ANFIS achieved the best average performance for the Fox and Elephant data sets, and the second best performance, after the miGraph [56] and ALP-SVM [58] methods (which are tied), for the Tiger data set.
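The ranking statement can be checked directly from the accuracies in Table IV. The sketch below transcribes the table and computes a dense rank (tied methods share a rank), which is an assumed convention chosen to match the statement that MI-ANFIS is second on Tiger behind two tied methods.

```python
# Accuracies (percent) transcribed from Table IV; None marks N/A entries.
DATASETS = ("MUSK1", "MUSK2", "Fox", "Tiger", "Elephant")
TABLE_IV = {
    "MI-ANFIS":     (93.49, 90.58, 66.4, 84.5, 86.97),
    "MILES":        (86.3, 87.7, None, None, None),
    "APR":          (92.4, 89.2, None, None, None),
    "DD":           (88.9, 82.5, None, None, None),
    "DD-SVM":       (85.8, 91.3, None, None, None),
    "EM-DD":        (84.8, 84.9, 56.1, 72.1, 78.3),
    "Citation-KNN": (92.4, 86.3, None, None, None),
    "MI-SVM":       (77.9, 84.3, 57.8, 84.0, 81.4),
    "mi-SVM":       (87.4, 83.6, 58.2, 78.4, 82.2),
    "MI-NN":        (88.0, 82.0, None, None, None),
    "Bagging-APR":  (92.8, 93.1, None, None, None),
    "RBF-MIP":      (91.3, 90.1, None, None, None),
    "BP-MIP":       (83.7, 80.4, None, None, None),
    "RBF-Bag-Unit": (90.3, 86.6, None, None, None),
    "MI-kernel":    (88.0, 89.3, 60.3, 84.2, 84.3),
    "PPPM-kernel":  (95.6, 81.2, 60.3, 80.2, 82.4),
    "MIGraph":      (90.0, 90.0, 61.2, 81.9, 85.1),
    "miGraph":      (88.9, 90.3, 61.6, 86.0, 86.8),
    "ALP-SVM":      (86.3, 86.2, 66.0, 86.0, 83.5),
    "MIForest":     (85.0, 82.0, 64.0, 82.0, 84.0),
}

def dense_rank(method, column):
    """1 + number of distinct accuracies strictly higher than the method's own."""
    own = TABLE_IV[method][column]
    higher = {row[column] for row in TABLE_IV.values()
              if row[column] is not None and row[column] > own}
    return 1 + len(higher)

ranks = {name: dense_rank("MI-ANFIS", i) for i, name in enumerate(DATASETS)}
print(ranks)  # {'MUSK1': 2, 'MUSK2': 3, 'Fox': 1, 'Tiger': 2, 'Elephant': 1}
```

Since most competing results are reported without standard deviations, these ranks compare mean accuracies only.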

In order to demonstrate the gain in generalization acquired by MI-ANFIS when utilizing Rule Dropout, we train an MI-ANFIS architecture for binary classification with and without Rule Dropout on a multiple instance dataset sampled from COREL [47]. The task is to classify whether an image contains an elephant or not. The dataset consists of 200 images (bags): 100 positive images containing the target animal and 100 negative images containing other animals. Each image is represented as a set of patches (instances) and each patch is in turn represented by 230 features describing color, texture, and shape information. Before training, we applied PCA to reduce the dimensionality of the features to 10 in order to speed up MI-ANFIS training. Table V summarizes the dataset characteristics. Two MI-ANFIS networks composed of 15 rules each, with one network employing Rule Dropout (with p = 0.7; this hyper-parameter was selected by trial and error), were trained on 90% of the data, and the remaining 10% was used for testing (the split was done randomly). Figure 9 shows the training and testing errors for both networks during 100 epochs. As can be seen, without Rule Dropout, starting at epoch 20, testing performance begins to degrade while the training error continues to decrease. In other words, overfitting begins to occur. Typically, using a cross validation data set, this point can be detected and training would be stopped. However, this assumes that cross validation data are available (or that the training data are large enough to be split into training and testing sets) and, more importantly, that the cross validation data are representative of the testing data. On the other hand, when using Rule Dropout, overfitting is significantly reduced and MI-ANFIS achieved better testing performance at the end of the training phase. Even though the training and testing error rates oscillate when using Rule Dropout (due to the randomness of the dropout process), overall MI-ANFIS achieved a testing SSE of 0.1123 with Rule Dropout compared to 0.1451 without Rule Dropout.

C. Application To Landmine Detection

In this section, we report the results of applying the proposed Multiple Instance Inference to fuse the output of multiple discrimination algorithms for the purpose of landmine detection using Ground Penetrating Radar (GPR). GPR data collected at different locations and on different dates were used to train and test the proposed MI-ANFIS. The alarm collection covers 319 encounters of various anti-tank mines with high metal content (ATHM) and 422 encounters of various anti-tank mines with low metal content (ATLM). The vehicle-mounted GPR sensor collects 3-dimensional data as the vehicle moves (Figure 10). The 3 dimensions correspond to the spatial location on the ground (down-track, cross-track, and depth) and are shown in Figure 11. Figure 11(b) shows 2-D views of (depth, down-track) and (depth, cross-track) slices of GPR data. As can be seen, the target signature does not extend over all depth values. Thus, one global feature vector may not discriminate between mines and clutter effectively. To overcome this limitation, most classifiers developed for this application extract multiple features from small overlapping windows at multiple depths.

In the following, we assume that each training alarm (3-D data cube) has been divided into 15 overlapping (depth-wise) patches. Each patch is processed by 2 discrimination algorithms. These algorithms are based on the Edge Histogram Descriptor (EHD) [60]. The first one, called EHDDT, extracts features from each 2-D (down-track, depth) patch. The second discrimination algorithm, called EHDCT, extracts information from the 2-D (cross-track, depth) patch. In addition, auxiliary features are synthesized from each patch. In particular, “SignatureWidth” in the down-track direction and “SignatureWidth” in the cross-track direction are used to capture the effective width of the hyperbolic shape within each patch. These auxiliary features are intended to provide contextual information that can support the relevance of the EHDDT and/or EHDCT. As a result, each alarm is represented by a bag of 15 instances and each instance is a 4-dimensional feature vector. Each bag is labeled as positive (has a target) or negative (non-target), but labels at the instance level are not available. The X-Y ground truth information about the target is available (using GPS and known target positions on calibration lanes). However, the depth position cannot be easily identified as it depends on target size, burial depth, soil type, and other environmental conditions. Manually extracting the depth location can be very tedious. Similarly, during testing, it is not trivial to combine partial confidence values from the multiple windows. Therefore, the MIL paradigm is suitable to solve this problem.

Fig. 9. Training and testing errors for two MI-ANFIS networks with and without Rule Dropout.

Fig. 10. Vehicle mounted GPR system for detecting buried landmines.


(a) 3D GPR Raw data. (b) 2-D views of the raw GPR data.

Fig. 11. 3-dimensional and 2-dimensional raw GPR data.

We construct a zero-order MI-ANFIS (constant consequent parameters) having 5 multiple instance rules,

and employing Gaussian MFs to describe the input fuzzy sets. To initialize the system’s parameters, ﬁrst,

we use the FCM algorithm to cluster the instances that belong to positive bags into 5 clusters, and we

initialize the MFs’ centers as the clusters’ centers. Then, we initialize the standard deviations of the input

MFs and the output parameters to 1.

After initialization, we run MI-ANFIS basic learning algorithm (Algorithm 1) to jointly learn a fuzzy

description of the positive concepts as well as optimal rules’ output. Figure 12 is a graphical representation

of the 5 multiple instance rules prior to running the optimization process (dotted line curves) and the learned

rules after training (continuous curves). The fuzzy sets of the rules’ antecedents describe the location and

the extent of the positive concepts in the 4-D instance feature space. The rules’ consequent values can

be interpreted as an assessment of the “positivity” of each learned concept. For instance, the MI-ANFIS

learned the following two positive concepts to describe targets:

R1: If EHDDT is Medium and EHDCT is Medium and WidthDT is High and WidthCT is High then o1 = 1.15.

R2: If EHDDT is Medium and EHDCT is Low and WidthDT is High and WidthCT is High then o2 = 0.94.
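To make the role of such rules concrete, the sketch below evaluates R1 and R2 on one bag of 15 four-dimensional instances, following the zero-order MI-ANFIS structure described in the paper: Gaussian memberships, a product over inputs to obtain one truth instance per bag instance, softmax aggregation over the bag, and a firing-strength-weighted average of the rule outputs. The centers assigned to Low, Medium, and High, the shared `SIGMA`, and the scaling of features to [0, 1] are hypothetical, since the learned MF parameters are only shown graphically in Figure 12, and only two of the five learned rules are used here.

```python
import numpy as np

def gaussian_mf(x, c, sigma):
    """Gaussian membership function."""
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def softmax_agg(r, alpha=1.0):
    """S_alpha(r) = sum_i r_i e^{alpha r_i} / sum_m e^{alpha r_m}."""
    e = np.exp(alpha * (r - r.max()))  # shifted for numerical stability; the ratio is unchanged
    return float(np.sum(r * e) / np.sum(e))

# Hypothetical linguistic terms on features scaled to [0, 1]
TERMS = {"Low": 0.2, "Medium": 0.5, "High": 0.8}
SIGMA = 0.15

# Antecedent terms for (EHDDT, EHDCT, WidthDT, WidthCT) and the rule output
RULES = [
    (("Medium", "Medium", "High", "High"), 1.15),  # R1
    (("Medium", "Low", "High", "High"), 0.94),     # R2
]

def infer(bag, alpha=1.0):
    """Return the fused output and the per-rule firing strengths for one bag."""
    strengths = []
    for terms, _ in RULES:
        centers = np.array([TERMS[t] for t in terms])
        # One truth instance per bag instance: product of memberships over the 4 inputs
        truth = gaussian_mf(bag, centers, SIGMA).prod(axis=1)
        strengths.append(softmax_agg(truth, alpha))
    w = np.array(strengths)
    w_bar = w / w.sum()                           # normalized firing strengths
    outputs = np.array([o for _, o in RULES])
    return float(w_bar @ outputs), w

# A bag of 15 patches in which only the first one matches R1
bag = np.zeros((15, 4))
bag[0] = [0.5, 0.5, 0.8, 0.8]
out, w = infer(bag)
print("firing strengths:", w, "fused output:", out)
```

Even though 14 of the 15 instances match neither rule, the single matching patch gives R1 a much larger firing strength than R2, which is the intended multiple instance behavior.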

1) Results: The proposed fusion method was trained and tested using 10-fold cross validation. Figure 14 displays the ROCs (averaged over the 10 folds). To provide a quantitative evaluation of the proposed multiple instance fuzzy inference fusion method, we compare its performance to fusion methods based on the standard Mamdani [12] and standard ANFIS [61] systems. Since the standard Mamdani and ANFIS systems cannot learn from partially labeled data, an expert is used to label all instances of all bags within the training data. We also compare MI-ANFIS performance to naive MIL implementations of Mamdani (NaiveMamdani) and ANFIS (NaiveANFIS), where all instances from positive bags are considered positive and all instances from negative bags are considered negative.

As can be seen in Figure 14, MI-ANFIS performed better than the standard ANFIS on the large dataset, and as expected NaiveANFIS performed worse. The standard ANFIS performed better at low FAR (False Alarm Rate). The reason is that strong mines are easy to identify manually, and in this case the ground truth helps. However, weaker mine signatures are not as easy to localize, so the truth may not be as accurate and can degrade the performance. Overall, MI-ANFIS outperformed all presented fusion approaches and the individual discriminators (EHDDT and EHDCT). This is due to the ability of MI-ANFIS to overcome labeling ambiguity by learning meaningful concepts.

Fig. 12. MI-ANFIS fusion rules before (dotted lines) and after training (solid lines).

Fig. 13. A plot of MI-ANFIS RMSE during 150 training epochs.

Fig. 14. Comparison of the individual discriminators, the proposed MI-ANFIS, Mamdani, ANFIS, NaiveMamdani, and NaiveANFIS fusion

methods. Note that the Mamdani and ANFIS systems have the advantage of instance-level labels on training data.

As in standard ANFIS, we cannot prove convergence of the algorithms. However, in all conducted

experiments MI-ANFIS converged in less than 150 epochs. Figure 13 plots the root mean squared error

(RMSE) vs. the training epoch number.

V. RELATED WORK

MI-ANFIS deals with ambiguity by introducing the novel concept of truth instances: when carrying out reasoning using a bag of instances at Layer 2 (Figure 3), a proposition will not only have one degree of truth, it will have multiple degrees of truth ($r_{ij}$), which we call truth instances. This effectively encodes the third vagueness component of ambiguity and increases the expressive power of traditional fuzzy logic. In addition to effectively modeling ambiguity, MI-ANFIS has the inherent capability of assessing the prediction quality by outputting soft values. For example, depending on the α parameter of the softmax in Layer 3, MI-ANFIS can assign higher outputs to bags with more than one positive instance, thus giving the end user a way to assess the positiveness of a given bag.
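The effect of α on the softmax aggregation can be seen in a short numerical sketch. The truth values of the two toy bags are made up for illustration. The aggregation used is $S_\alpha(r) = \sum_i r_i e^{\alpha r_i} / \sum_m e^{\alpha r_m}$, which reduces to the mean at α = 0 and approaches the maximum as α grows.

```python
import numpy as np

def softmax_agg(r, alpha):
    """S_alpha(r) = sum_i r_i e^{alpha r_i} / sum_m e^{alpha r_m}."""
    r = np.asarray(r, dtype=float)
    e = np.exp(alpha * (r - r.max()))  # shifted for numerical stability; the ratio is unchanged
    return float(np.sum(r * e) / np.sum(e))

# Toy truth instances: one strongly positive instance vs. two
one_positive = [0.9, 0.1, 0.1, 0.1]
two_positive = [0.9, 0.9, 0.1, 0.1]

for alpha in (0.0, 1.0, 5.0, 50.0):
    print(alpha, softmax_agg(one_positive, alpha), softmax_agg(two_positive, alpha))
```

For small and moderate α the bag with two positive instances receives the larger aggregated value, while for very large α both values approach the maximum of 0.9 and the distinction fades.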

Learning positive target concepts from ambiguously labeled data has been the core task of various MIL

algorithms (e.g. Diverse Density [16]). MI-ANFIS has proven that it can learn positive concepts effectively

while jointly providing a fuzzy representation of such regions. The fuzzy representation is combined into

meaningful and simple multiple instance rules that can be easily visualized and interpreted.

Compared to previously proposed multiple instance neural networks, such as Multiple Instance Neural Networks [42] (MI-NN) and Multiple Instance RBF Neural Networks [43] (RBF-MIP), the advantage of MI-ANFIS is the use of multiple instance fuzzy logic to learn a fuzzy representation of true positive concepts. MI-NN only learns standard neural network weights that do not carry any information regarding target concepts. On the other hand, standard RBF neural networks have been shown to be equivalent to zero-order traditional Sugeno systems under certain constraints [62] and are thus capable of learning a fuzzy representation of the inputs. RBF-MIP networks, however, have a different architecture and do not employ adaptive radial basis functions in the first layer. Instead, they represent the inputs by computing their distances to clusters of training bags. This latter method is computationally expensive and its success depends greatly on the quality of the training data, as it takes into consideration all the training examples, which may include wrongly (noisily) labeled bags. RBF-MIP learns only discriminative regions of the bag space and does not learn true positive concepts. Moreover, MI-ANFIS learning algorithms can be updated to support a wide range of loss functions (criteria) such as cross entropy [63], maximum margin [64], etc. MI-NN is designed to use a handcrafted loss function which is largely responsible for the multiple instance behavior of the system and cannot be changed without substantially changing the architecture of MI-NN. This could be disadvantageous if MI-NN is to be used to solve multiple instance, multiple class classification problems.

VI. CONCLUSIONS

In this paper, we have introduced a new framework to accomplish fuzzy inference with multiple instance

data. Our work generalizes Sugeno fuzzy inference style to reason with multiple instances, the new inference

style is called MI-Sugeno. We then used MI-Sugeno to develop MI-ANFIS, a novel neuro-fuzzy architecture

that extends the standard Adaptive Neuro-Fuzzy Inference System (ANFIS) to reason with bags of instances

in order to solve multiple instance learning problems. We developed a BackPropagation learning algorithm

and showed that the proposed system is capable of learning meaningful concepts from ambiguously labeled

data.

MI-ANFIS deals with ambiguity by introducing the novel concept of truth instances: when carrying out reasoning using a bag of instances at Layer 2 (Figure 3), a proposition will not only have one degree of truth, it will have multiple degrees of truth ($r_{ij}$), which we call truth instances. This effectively encodes the third vagueness component of ambiguity and increases the expressive power of traditional fuzzy logic.

Learning positive concepts from ambiguously labeled data has been the core task of various MIL algorithms.

MI-ANFIS has proven that it can learn positive concepts effectively while jointly providing a fuzzy

representation of such regions. The fuzzy representation is combined into meaningful and simple multiple

instance rules that can be easily visualized and interpreted.

Using synthetic and benchmark data sets we showed that the proposed Multiple Instance Fuzzy Inference

is comparable to state of the art MI machine learning algorithms. We also used our framework for a

real application and applied it to fuse the output of multiple discrimination algorithms for the purpose of

landmine detection using Ground Penetrating Radar.

In situations where overfitting is imminent, for example when using relatively small datasets to learn very large MI-ANFIS networks, we proposed a regularization technique, which we call Rule Dropout, and showed that it can be used to train MI-ANFIS systems with better generalization.

In future work, we intend to develop a multiple class version of MI-ANFIS to be used to solve multiple

class classiﬁcation problems. In addition, we will conduct a detailed analysis of MI-ANFIS convergence.

APPENDIX A

DERIVATION OF PREMISE PARAMETERS UPDATE RULES

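From equations (21) and (22), the error rates for the premise parameters $c_{kj}$ and $\sigma_{kj}$ are defined as follows:

$$\frac{\partial E_p}{\partial c_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} \, \frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}},$$

and

$$\frac{\partial E_p}{\partial \sigma_{kj}} = \sum_{i=1}^{M_p} \frac{\partial E_p}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} \, \frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial \sigma_{kj}}.$$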

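Using the chain rule defined in (18), we have

$$
\begin{aligned}
\frac{\partial E_p}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} ={}& \frac{\partial E_p}{\partial O_{(6,1)}} \times \frac{\partial O_{(6,1)}}{\partial O_{(5,k)}} \times \frac{\partial O_{(5,k)}}{\partial O_{(4,k)}} \times \frac{\partial O_{(4,k)}}{\partial O_{(3,k)}} \\
&\times \frac{\partial O_{(3,k)}}{\partial O_{(2,\,i+(k-1)M_p)}} \times \frac{\partial O_{(2,\,i+(k-1)M_p)}}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}.
\end{aligned}
\tag{36}
$$

Hence, (21) is equivalent to

$$
\begin{aligned}
\frac{\partial E_p}{\partial c_{kj}} ={}& \frac{\partial E_p}{\partial O_{(6,1)}} \times \frac{\partial O_{(6,1)}}{\partial O_{(5,k)}} \times \frac{\partial O_{(5,k)}}{\partial O_{(4,k)}} \times \frac{\partial O_{(4,k)}}{\partial O_{(3,k)}} \\
&\times \sum_{i=1}^{M_p} \Bigg[ \frac{\partial O_{(3,k)}}{\partial O_{(2,\,i+(k-1)M_p)}} \times \frac{\partial O_{(2,\,i+(k-1)M_p)}}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} \times \frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}} \Bigg].
\end{aligned}
\tag{37}
$$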

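From (17), we have

$$\frac{\partial E_p}{\partial O_{(6,1)}} = -2(t_p - O_p). \tag{38}$$

It is also straightforward to show that

$$\frac{\partial O_{(6,1)}}{\partial O_{(5,k)}} = \frac{\partial \big(\sum_{i=1}^{|O_3|} O_{(5,i)}\big)}{\partial O_{(5,k)}} = 1, \tag{39}$$

and

$$\frac{\partial O_{(5,k)}}{\partial O_{(4,k)}} = \frac{\partial \big(\bar{w}_k\, S_\alpha(x_{p1}\cdot b_k,\; x_{p2}\cdot b_k,\; \ldots,\; x_{pM_p}\cdot b_k)\big)}{\partial \bar{w}_k} = S_\alpha(x_{p1}\cdot b_k,\; x_{p2}\cdot b_k,\; \ldots,\; x_{pM_p}\cdot b_k). \tag{40}$$

Continuing with the derivation, we have

$$\frac{\partial O_{(4,k)}}{\partial O_{(3,k)}} = \frac{\partial \bar{w}_k}{\partial w_k} = \frac{\partial}{\partial w_k}\left(\frac{w_k}{\sum_{l=1}^{|O_3|} w_l}\right) = \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2}, \tag{41}$$

and

$$\frac{\partial O_{(3,k)}}{\partial O_{(2,\,i+(k-1)M_p)}} = \frac{\partial S_\alpha\big(\{r_{(k,m)}\}_{m=1}^{M_p}\big)}{\partial r_{(k,(i+(k-1)M_p))}} = \frac{e^{\alpha r_{(k,(i+(k-1)M_p))}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha r_{(k,(i+(k-1)M_p))} - S_\alpha\big(\{r_{(k,m)}\}_{m=1}^{M_p}\big) \Big]. \tag{42}$$

The details of the derivation of the “softmax” function can be found in [21].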

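Next, we need to compute $\partial O_{(2,\,i+(k-1)M_p)} / \partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}$. We have

$$O_{(2,\,i+(k-1)M_p)} = \prod_{d=1}^{D} \mu_{A_{(i+(k-1)M_p)/M_p,\,d}}\big(x_{(p,(i+(k-1)M_p)[M_p],\,d)}\big), \tag{43}$$

and

$$O_{(1,\,i+[(k-1)D+(j-1)]M_p)} = \mu_{A_{(k,j)}}\big(x_{(p,(i+(k-1)M_p)[M_p],\,j)}\big). \tag{44}$$

Thus,

$$\frac{\partial O_{(2,\,i+(k-1)M_p)}}{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}} = \frac{\partial \Big( \prod_{d=1}^{D} \mu_{A_{(i+(k-1)M_p)/M_p,\,d}}\big(x_{(p,(i+(k-1)M_p)[M_p],\,d)}\big) \Big)}{\partial \mu_{A_{(k,j)}}\big(x_{(p,(i+(k-1)M_p)[M_p],\,j)}\big)} = \prod_{d=1,\, d\neq j}^{D} \mu_{A_{(i+(k-1)M_p)/M_p,\,d}}\big(x_{(p,(i+(k-1)M_p)[M_p],\,d)}\big). \tag{45}$$

Finally, we have

$$\frac{\partial O_{(1,\,i+[(k-1)D+(j-1)]M_p)}}{\partial c_{kj}} = \frac{\partial \mu_{A_{(k,j)}}\big(x_{(p,(i+(k-1)M_p)[M_p],\,j)}\big)}{\partial c_{kj}} = \frac{x_{(p,(i+(k-1)M_p)[M_p],\,j)} - c_{kj}}{\sigma_{kj}^2} \exp\Big(-\frac{(x_{(p,(i+(k-1)M_p)[M_p],\,j)} - c_{kj})^2}{2\sigma_{kj}^2}\Big). \tag{46}$$

Substituting the derivatives in (37), we obtain (25).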

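The update formula for $\sigma_{kj}$ can be derived in a similar manner. It can be shown that

$$
\begin{aligned}
\frac{\partial E_p}{\partial \sigma_{kj}} ={}& -2(t_p - O_p) \times S_\alpha(x_{p1}\cdot b_k,\; x_{p2}\cdot b_k,\; \ldots,\; x_{pM_p}\cdot b_k) \times \frac{\sum_{l=1}^{|O_3|} w_l - w_k}{\big(\sum_{l=1}^{|O_3|} w_l\big)^2} \\
&\times \sum_{i=1}^{M_p} \Bigg( \frac{e^{\alpha r_{(k,(i+(k-1)M_p))}}}{\sum_{m=1}^{M_p} e^{\alpha r_{(k,m)}}} \Big[ 1 + \alpha r_{(k,(i+(k-1)M_p))} - S_\alpha\big(\{r_{(k,m)}\}_{m=1}^{M_p}\big) \Big] \\
&\times \prod_{d=1,\, d\neq j}^{D} \mu_{A_{(i+(k-1)M_p)/M_p,\,d}}\big(x_{(p,(i+(k-1)M_p)[M_p],\,d)}\big) \\
&\times \frac{(x_{(p,(i+(k-1)M_p)[M_p],\,j)} - c_{kj})^2}{\sigma_{kj}^3} \exp\Big(-\frac{(x_{(p,(i+(k-1)M_p)[M_p],\,j)} - c_{kj})^2}{2\sigma_{kj}^2}\Big) \Bigg).
\end{aligned}
\tag{47}
$$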
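The two Gaussian membership derivatives that appear as the last factors of the center and standard deviation gradients can be checked numerically. The sketch below compares the closed forms with central finite differences; the test point and the step size `eps` are arbitrary choices.

```python
import numpy as np

def mf(x, c, s):
    """Gaussian membership function exp(-(x - c)^2 / (2 s^2))."""
    return np.exp(-((x - c) ** 2) / (2.0 * s ** 2))

def dmf_dc(x, c, s):
    """Closed-form derivative with respect to the center c."""
    return (x - c) / s ** 2 * mf(x, c, s)

def dmf_ds(x, c, s):
    """Closed-form derivative with respect to the standard deviation s."""
    return (x - c) ** 2 / s ** 3 * mf(x, c, s)

x, c, s, eps = 0.8, 0.5, 0.4, 1e-6
num_dc = (mf(x, c + eps, s) - mf(x, c - eps, s)) / (2 * eps)
num_ds = (mf(x, c, s + eps) - mf(x, c, s - eps)) / (2 * eps)

print("d/dc   :", dmf_dc(x, c, s), "vs numeric", num_dc)
print("d/dsig :", dmf_ds(x, c, s), "vs numeric", num_ds)
```

Agreement to within the finite-difference error supports the $(x - c_{kj})/\sigma_{kj}^2$ and $(x - c_{kj})^2/\sigma_{kj}^3$ factors used in the update rules.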

APPENDIX B

DERIVATION OF CONSEQUENT PARAMETERS UPDATE RULE

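The error rate for the consequent parameters is defined in equations (27) and (28). Next, we compute $\partial E_p / \partial O_{(5,i)}$ using the previously defined chain rule in (18), and obtain

$$\frac{\partial E_p}{\partial O_{(5,i)}} = \frac{\partial E_p}{\partial O_{(6,1)}} \times \frac{\partial O_{(6,1)}}{\partial O_{(5,i)}}. \tag{48}$$

From (17), we have

$$\frac{\partial E_p}{\partial O_{(6,1)}} = -2(t_p - O_p). \tag{49}$$

And, as in (39), we have

$$\frac{\partial O_{(6,1)}}{\partial O_{(5,i)}} = 1. \tag{50}$$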

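Continuing with the derivation, we have

$$
\begin{aligned}
\frac{\partial O_{(5,i)}}{\partial b^i_j} ={}& \frac{\partial \big(\bar{w}_i\, S_\alpha(x_{p1}\cdot b_i,\; x_{p2}\cdot b_i,\; \ldots,\; x_{pM_p}\cdot b_i)\big)}{\partial b^i_j} \\
={}& \frac{\partial}{\partial b^i_j} \left( \bar{w}_i \sum_{m=1}^{M_p} \frac{x_{pm}\cdot b_i\, \exp(\alpha\, x_{pm}\cdot b_i)}{\sum_{h=1}^{M_p} \exp(\alpha\, x_{ph}\cdot b_i)} \right)
= \bar{w}_i \sum_{m=1}^{M_p} \frac{\partial}{\partial b^i_j} \left( \frac{x_{pm}\cdot b_i\, \exp(\alpha\, x_{pm}\cdot b_i)}{\sum_{h=1}^{M_p} \exp(\alpha\, x_{ph}\cdot b_i)} \right) \\
={}& \bar{w}_i \sum_{m=1}^{M_p} \frac{1}{\Big(\sum_{h=1}^{M_p} \exp\big(\alpha(x_{ph}\cdot b_i - x_{pm}\cdot b_i)\big)\Big)^2} \\
&\times \Big[ x_{(p,m,j)} \sum_{h=1}^{M_p} \exp\big(\alpha(x_{ph}\cdot b_i - x_{pm}\cdot b_i)\big) - x_{pm}\cdot b_i \sum_{h=1}^{M_p} \exp\big(\alpha(x_{ph}\cdot b_i - x_{pm}\cdot b_i)\big)\,\alpha\,(x_{(p,h,j)} - x_{(p,m,j)}) \Big].
\end{aligned}
\tag{51}
$$

Thus, the overall error rate with respect to the consequent parameter $b^i_j$ is given according to (20) in equation (29).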
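The closed form for the derivative of the softmax-aggregated consequent with respect to $b^i_j$ can be verified against finite differences. The sketch below implements the bracketed expression without the constant factor $\bar{w}_i$; the bag `X`, the vector `b`, and the value of `alpha` are random or arbitrary test inputs, and the function names are hypothetical.

```python
import numpy as np

def s_alpha_linear(X, b, alpha):
    """S_alpha(x_p1.b, ..., x_pM.b) for a bag X with one instance per row."""
    z = X @ b
    e = np.exp(alpha * (z - z.max()))  # shifted for numerical stability; the ratio is unchanged
    return float(np.sum(z * e) / np.sum(e))

def grad_closed_form(X, b, alpha):
    """Gradient of s_alpha_linear with respect to b, term by term over instances m."""
    z = X @ b
    grad = np.zeros_like(b)
    for m in range(X.shape[0]):
        ex = np.exp(alpha * (z - z[m]))  # exp(alpha (x_ph.b - x_pm.b)) for every h
        denom = ex.sum()
        first = X[m] * denom
        second = z[m] * alpha * (ex[:, None] * (X - X[m])).sum(axis=0)
        grad += (first - second) / denom ** 2
    return grad

def grad_numeric(X, b, alpha, eps=1e-6):
    """Central finite-difference gradient, one coordinate of b at a time."""
    grad = np.zeros_like(b)
    for j in range(b.size):
        step = np.zeros_like(b)
        step[j] = eps
        grad[j] = (s_alpha_linear(X, b + step, alpha)
                   - s_alpha_linear(X, b - step, alpha)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # a bag of 5 instances with 3 features
b = rng.normal(size=3)
analytic = grad_closed_form(X, b, alpha=2.0)
numeric = grad_numeric(X, b, alpha=2.0)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print("max |analytic - numeric| =", max_abs_diff)
```

A difference at the level of floating-point and truncation error indicates that the closed form matches the derivative of the softmax-weighted sum.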


ACKNOWLEDGMENT

This work was supported in part by U.S. Army Research Ofﬁce Grants Number W911NF-13-1-0066

and W911NF-14-1-0589. The views and conclusions contained in this document are those of the authors

and should not be interpreted as representing the ofﬁcial policies, either expressed or implied, of the Army

Research Ofﬁce, or the U.S. Government.

REFERENCES

[1] L. A. Zadeh, A theory of approximate reasoning (AR). Berkeley: Electronics Research Laboratory, College of Engineering, University

of California, 1977.

[2] L. A. Zadeh, “Outline of a new approach to the analysis of complex systems and decision processes,” Systems, Man and Cybernetics, IEEE Transactions on, no. 1, pp. 28–44, 1973.

[3] C.-W. Xu and Y.-Z. Lu, “Fuzzy model identiﬁcation and self-learning for dynamic systems,” Systems, Man and Cybernetics, IEEE

Transactions on, vol. 17, no. 4, pp. 683–689, July 1987.

[4] R. Jager, H. Verbruggen, and P. Bruijin, “Fuzzy inference in rule-based control systems,” in Intelligent Systems Engineering, 1992., First

International Conference on (Conf. Publ. No. 360), Aug 1992, pp. 232–237.

[5] C.-C. Lee, “Fuzzy logic in control systems: fuzzy logic controller. ii,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 20,

no. 2, pp. 419–435, Mar 1990.

[6] J. Casillas, Interpretability issues in fuzzy modeling. Springer, 2003, vol. 128.

[7] C.-Y. Chiu, H.-C. Lin, and S.-N. Yang, “A fuzzy logic cbir system,” in Fuzzy Systems, 2003. FUZZ ’03. The 12th IEEE International

Conference on, vol. 2, May 2003, pp. 1171–1176 vol.2.

[8] A. Othman, H. Tizhoosh, and F. Khalvati, “Efis: evolving fuzzy image segmentation,” Fuzzy Systems, IEEE Transactions on, vol. 22, no. 1, pp. 72–82, Feb 2014.

[9] S. Hajimirza and E. Izquierdo, “Gaze movement inference for implicit image annotation,” in Image Analysis for Multimedia Interactive

Services (WIAMIS), 2010 11th International Workshop on, April 2010, pp. 1–4.

[10] H. Kwan and L. Y. Cai, “Supervised fuzzy inference network for invariant pattern recognition,” in Circuits and Systems, 2000. Proceedings

of the 43rd IEEE Midwest Symposium on, vol. 2, 2000, pp. 850–854 vol.2.

[11] M. Adnan, M. Chowdury, I. Taz, T. Ahmed, and R. Rahman, “Content based news recommendation system based on fuzzy logic,” in

Informatics, Electronics Vision (ICIEV), 2014 International Conference on, May 2014, pp. 1–6.

[12] A. Ben Khalifa and H. Frigui, “Fusion of multiple algorithms for detecting buried objects using fuzzy inference,” Proc. SPIE, vol.

9072, pp. 90 720V–90720V–10, 2014. [Online]. Available: http://dx.doi.org/10.1117/12.2051217

[13] M.-Y. Chen and D. Linkens, “Rule-base self-generation and simpliﬁcation for data-driven fuzzy models,” in Fuzzy Systems, 2001. The

10th IEEE International Conference on, vol. 1, 2001, pp. 424–427.

[14] D. Saletic, “On data-driven procedure for determining the number of rules in a takagi-sugeno fuzzy model,” in Computer as a Tool,

2005. EUROCON 2005.The International Conference on, vol. 2, Nov 2005, pp. 1132–1135.

[15] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, “Neuro-fuzzy and soft computing-a computational approach to learning and machine intelligence

[book review],” Automatic Control, IEEE Transactions on, vol. 42, no. 10, pp. 1482–1484, 1997.

[16] O. Maron, “Learning from ambiguity,” Ph.D. dissertation, Massachusetts Institute of Technology, 1998.

[17] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial Intelligence, vol. 89, no. 1-2, pp. 31–71, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0004370296000343

[18] E. Alpaydın, V. Cheplygina, M. Loog, and D. M. Tax, “Single-vs. multiple-instance classiﬁcation,” Pattern Recognition, vol. 48, no. 9,

pp. 2831–2838, 2015.

[19] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial Intelligence, vol. 89, no. 1, pp. 31–71, 1997.

[20] C. Zhang, X. Chen, and W.-B. Chen, “An online multiple instance learning system for semantic image retrieval,” in Multimedia Workshops,

2007. ISMW ’07. Ninth IEEE International Symposium on, Dec 2007, pp. 83–84.

[21] O. Maron and T. Lozano-Pérez, “A framework for multiple-instance learning,” in Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, ser. NIPS ’97. Cambridge, MA, USA: MIT Press, 1998, pp. 570–576. [Online]. Available: http://dl.acm.org/citation.cfm?id=302528.302753

[22] A. Karem and H. Frigui, “A multiple instance learning approach for landmine detection using ground penetrating radar,” in Geoscience

and Remote Sensing Symposium (IGARSS), 2011 IEEE International. IEEE, 2011, pp. 878–881.

[23] R. Rahmani and S. A. Goldman, “Missl: Multiple-instance semi-supervised learning,” in Proceedings of the 23rd international conference

on Machine learning. ACM, 2006, pp. 705–712.

[24] Y. Chen, J. Bi, and J. Wang, “Miles: Multiple-instance learning via embedded instance selection,” Pattern Analysis and Machine

Intelligence, IEEE Transactions on, vol. 28, no. 12, pp. 1931–1947, Dec 2006.

[25] C. Yang, M. Dong, and F. Fotouhi, “Region based image annotation through multiple-instance learning,” in Proceedings of the 13th

annual ACM international conference on Multimedia. ACM, 2005, pp. 435–438.

[26] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” Pattern Analysis and Machine

Intelligence, IEEE Transactions on, vol. 33, no. 8, pp. 1619–1632, 2011.

[27] S. Andrews and T. Hofmann, “Multiple-instance learning via disjunctive programming boosting,” Advances in neural information processing systems,

no. 16, pp. 65–72, 2004.

[28] B. Babenko, “Multiple instance learning: algorithms and applications,” 2008.

[29] C. Zhang, J. C. Platt, and P. A. Viola, “Multiple instance boosting for object detection,” in Advances in neural information processing

systems, 2005, pp. 1417–1424.

[30] J. Amores, “Multiple instance classification: Review, taxonomy and comparative study,” Artificial Intelligence, vol. 201, pp. 81–105,

2013.

[31] P. L. Lanzi, W. Stolzmann, and S. W. Wilson, Learning classifier systems: from foundations to applications. Springer, 2000, no. 1813.

[32] O. Cordón, “A historical review of evolutionary learning methods for mamdani-type fuzzy rule-based systems: Designing interpretable

genetic fuzzy systems,” International Journal of Approximate Reasoning, vol. 52, no. 6, pp. 894–913, 2011.

[33] E. Mamdani, “Application of fuzzy algorithms for control of simple dynamic plant,” Electrical Engineers, Proceedings of the Institution

of, vol. 121, no. 12, pp. 1585–1588, December 1974.

[34] R. Babuška and H. Verbruggen, “An overview of fuzzy modeling for control,” Control Engineering Practice, vol. 4, no. 11, pp. 1593–

1606, 1996.

[35] M. Mizumoto, “Fuzzy controls under various fuzzy reasoning methods,” Information Sciences, vol. 45, no. 2, pp. 129–151, 1988.

[Online]. Available: http://www.sciencedirect.com/science/article/pii/0020025588900370

[36] C.-C. Lee, “Fuzzy logic in control systems: fuzzy logic controller. i,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 20,

no. 2, pp. 404–418, Mar 1990.

[37] M. Sugeno and T. Yasukawa, “A fuzzy-logic-based approach to qualitative modeling,” Fuzzy Systems, IEEE Transactions on, vol. 1,

no. 1, pp. 7–, Feb 1993.

[38] R. Yager and D. Filev, “Unified structure and parameter identification of fuzzy models,” Systems, Man and Cybernetics, IEEE Transactions

on, vol. 23, no. 4, pp. 1198–1205, Jul 1993.

[39] E. Tacker, “Modeling stabilization policies in financial systems,” in Decision and Control including the 16th Symposium on Adaptive

Processes and A Special Symposium on Fuzzy Set Theory and Applications, 1977 IEEE Conference on, Dec 1977, pp. 194–194.

[40] P. Singh, S. Bhanot, and H. Mohanta, “Optimized adaptive neuro-fuzzy inference system for ph control,” in Advanced Electronic Systems

(ICAES), 2013 International Conference on, Sept 2013, pp. 1–5.

[41] T. Takagi and M. Sugeno, “Fuzzy identification of systems and its applications to modeling and control,” Systems, Man and Cybernetics,

IEEE Transactions on, vol. SMC-15, no. 1, pp. 116–132, 1985.

[42] Z.-H. Zhou and M.-L. Zhang, “Neural networks for multi-instance learning,” Proceedings of the International Conference on Intelligent

Information Technology, Beijing, China, pp. 455–459, 2002.

[43] M.-L. Zhang and Z.-H. Zhou, “Adapting rbf neural networks to multi-instance learning,” Neural Processing Letters, vol. 23, no. 1, pp.

1–26, 2006.

[44] S. Ray and D. Page, “Multiple instance regression,” in ICML, vol. 1, 2001, pp. 425–432.

[45] J.-S. Jang, “Anfis: adaptive-network-based fuzzy inference system,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 23, no. 3,

pp. 665–685, 1993.

[46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from

overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[47] S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support vector machines for multiple-instance learning,” in Advances in neural

information processing systems, 2002, pp. 561–568.

[48] Y. Li, D. M. Tax, R. P. Duin, and M. Loog, “Multiple-instance learning as a classifier combining problem,” Pattern Recognition, vol. 46,

no. 3, pp. 865–874, 2013.

[49] Y. Chen, J. Bi, and J. Z. Wang, “Miles: Multiple-instance learning via embedded instance selection,” Pattern Analysis and Machine

Intelligence, IEEE Transactions on, vol. 28, no. 12, pp. 1931–1947, 2006.

[50] Y. Chen and J. Z. Wang, “Image categorization by learning and reasoning with regions,” The Journal of Machine Learning Research,

vol. 5, pp. 913–939, 2004.

[51] Q. Zhang and S. A. Goldman, “Em-dd: An improved multiple-instance learning technique,” in Advances in neural information processing

systems, 2001, pp. 1073–1080.

[52] J. Wang, “Solving the multiple-instance problem: A lazy learning approach,” in Proc. 17th International Conf. on Machine Learning,

2000.

[53] J. Ramon and L. De Raedt, “Multi instance neural networks,” in Proceedings of the ICML-2000 workshop on attribute-value and

relational learning, 2000, pp. 53–60.

[54] Z.-H. Zhou and M.-L. Zhang, “Ensembles of multi-instance learners,” in Machine Learning: ECML 2003. Springer, 2003, pp. 492–502.

[55] A. Bouchachia, “Multiple instance learning with radial basis function neural networks,” in Neural Information Processing. Springer,

2002, Conference Proceedings, pp. 440–445.

[56] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li, “Multi-instance learning by treating instances as non-iid samples,” in Proceedings of the 26th annual

international conference on machine learning. ACM, 2009, pp. 1249–1256.

[57] H.-Y. Wang, Q. Yang, and H. Zha, “Adaptive p-posterior mixture-model kernels for multiple instance learning,” in Proceedings of the

25th international conference on Machine learning. ACM, 2008, pp. 1136–1143.

[58] P. V. Gehler and O. Chapelle, “Deterministic annealing for multiple-instance learning,” in International conference on artificial intelligence

and statistics, 2007, pp. 123–130.

[59] C. Leistner, A. Saffari, and H. Bischof, “Miforests: multiple-instance learning with randomized trees,” in Computer Vision–ECCV 2010.

Springer, 2010, pp. 29–42.

[60] H. Frigui and P. Gader, “Detection and discrimination of land mines in ground-penetrating radar based on edge histogram descriptors

and a possibilistic k-nearest neighbor classifier,” Fuzzy Systems, IEEE Transactions on, vol. 17, no. 1, pp. 185–199, Feb. 2009.

[61] A. B. Khalifa and H. Frigui, “Fusion of multiple landmine detection algorithms using an adaptive neuro fuzzy inference system,” in

Geoscience and Remote Sensing Symposium (IGARSS), 2014 IEEE International. IEEE, 2014, pp. 3148–3151.

[62] J.-S. Jang and C.-T. Sun, “Functional equivalence between radial basis function networks and fuzzy inference systems,” Neural Networks,

IEEE Transactions on, vol. 4, no. 1, pp. 156–159, Jan 1993.

[63] M. Yi-de, L. Qing, and Q. Zhi-Bai, “Automated image segmentation using improved pcnn model based on cross-entropy,” in Intelligent

Multimedia, Video and Speech Processing, 2004. Proceedings of 2004 International Symposium on. IEEE, 2004, pp. 743–746.

[64] H. Li, T. Jiang, and K. Zhang, “Efficient and robust feature extraction by maximum margin criterion,” Neural Networks, IEEE Transactions

on, vol. 17, no. 1, pp. 157–165, 2006.