All content in this area was uploaded by Misha Pavel on Sep 29, 2017

Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree

Abstract

Unexpected stimuli are a challenge to any machine learning algorithm. Here we

identify distinct types of unexpected events, focusing on ’incongruent events’ -

when ’general level’ and ’speciﬁc level’ classiﬁers give conﬂicting predictions.

We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how a partial order on labels can be deduced from such hierarchies. For each event, we compute its

probability in different ways, based on adjacent levels (according to the partial

order) in the label hierarchy. An incongruent event is an event where the proba-

bility computed based on some more speciﬁc level (in accordance with the partial

order) is much smaller than the probability computed based on some more general

level, leading to conﬂicting predictions. We derive algorithms to detect incongru-

ent events from different types of hierarchies, corresponding to class membership

or part membership. Respectively, we show promising results with real data on

two speciﬁc problems: Out Of Vocabulary words in speech recognition, and the

identiﬁcation of a new sub-class (e.g., the face of a new individual) in audio-visual

facial object recognition.

1 Introduction

Machine learning builds models of the world using training data from the application domain and

prior knowledge about the problem. The models are later applied to future data in order to estimate

the current state of the world. An implied assumption is that the future is stochastically similar to

the past. The approach fails when the system encounters situations that are not anticipated from the

past experience. In contrast, successful natural organisms identify new unanticipated stimuli and

situations and frequently generate appropriate responses.

By definition, an unexpected event is one that the system is unlikely to encounter, based

on the data that has been observed previously. In line with this observation, much of the computa-

tional work on novelty detection focused on the probabilistic modeling of known classes, identifying

outliers of these distributions as novel events (see e.g. [1, 2] for recent reviews). More recently, one-

class classiﬁers have been proposed and used for novelty detection without the direct modeling of

data distribution [3, 4]. There are many studies on novelty detection in biological systems [5], often

focusing on regions of the hippocampus [6].

To advance beyond the detection of outliers, we observe that there are many different reasons why

some stimuli could appear novel. Our work, presented in Section 2, focuses on unexpected events

which are indicated by the incongruence between prediction induced by prior experience (training)

and the evidence provided by the sensory data. To identify an item as incongruent, we use two

parallel classiﬁers. One of them is strongly constrained by speciﬁc knowledge (both prior and data-

derived), the other classiﬁer is more general and less constrained. Both classiﬁers are assumed

to yield class-posterior probability in response to a particular input signal. A sufﬁciently large

discrepancy between posterior probabilities induced by input data in the two classiﬁers is taken as

indication that an item is incongruent.

Thus, in comparison with most existing work on novelty detection, one new and important charac-

teristic of our approach is that we look for a level of description where the novel event is highly

probable. Rather than simply respond to an event which is rejected by all classiﬁers, which more

often than not requires no special attention (as in pure noise), we construct and exploit a hierarchy of representations. We attend to those events which are recognized (or accepted) at some more abstract

levels of description in the hierarchy, while being rejected by the more concrete classiﬁers.

There are various ways to incorporate prior hierarchical knowledge and constraints within different

classiﬁer levels, as discussed in Section 3. One approach, used to detect images of unexpected in-

congruous objects, is to train the more general, less constrained classiﬁer using a larger more diverse

set of stimuli, e.g., the facial images of many individuals. The second classiﬁer is trained using a

more speciﬁc (i.e. smaller) set of speciﬁc objects (e.g., the set of Einstein’s facial images). An

incongruous item (e.g., a new individual) could then be identiﬁed by a smaller posterior probability

estimated by the more speciﬁc classiﬁers relative to the probability from the more general classiﬁer.

A different approach is used to identify unexpected (out-of-vocabulary) lexical items. The more

general classifier is trained to sequentially classify speech sounds (phonemes) from relatively short

segments of the input speech signal (thus yielding an unconstrained sequence of phoneme labels);

the more constrained classiﬁer is trained to classify a particular set of words (highly constrained

sequences of phoneme labels) from the information available in the whole speech sentence. A word

that did not belong to the expected vocabulary of the more constrained recognizer could then be

identiﬁed by discrepancy in posterior probabilities of phonemes derived from both classiﬁers.

Our second contribution in Section 2 is the presentation of a unifying theoretical framework for

these two approaches. Speciﬁcally, we consider two kinds of hierarchies: Part membership as in

biological taxonomy or speech, and Class membership, as in human categorization (or levels of

categorization). We deﬁne a notion of partial order on such hierarchies, and identify those events

whose probability as computed using different levels of the hierarchy does not agree. In particular,

we are interested in those events that receive high probability at more general levels (for example,

the system is certain that the new example is a dog), but low probability at more speciﬁc levels (in the

same example, the system is certain that the new example is not any known dog breed). Such events

correspond to many interesting situations that are worthy of special attention, including incongruous

scenes and new sub-classes, as shown in Section 3.

2 Incongruent Events - uniﬁed approach

2.1 Introducing label hierarchy

The set of labels represents the knowledge base about stimuli, which is either given (by a teacher in

supervised learning settings) or learned (in unsupervised or semi-supervised settings). In cognitive

systems such knowledge is hardly ever a set; often, in fact, labels are given (or can be thought of) as

a hierarchy. In general, hierarchies can be represented as directed graphs. The nodes of the graphs

may be divided into distinct subsets that correspond to different entities (e.g., all objects that are

animals); we call these subsets “levels”. We identify two types of hierarchies:

Part membership, as in biological taxonomy or speech. For example, eyes, ears, and nose combine

to form a head; head, legs and tail combine to form a dog.

Class membership, as in human categorization – where objects can be classiﬁed at different levels

of generality, from sub-ordinate categories (most speciﬁc level), to basic level (intermediate level),

to super-ordinate categories (most general level). For example, a Beagle (sub-ordinate category) is

also a dog (basic level category), and it is also an animal (super-ordinate category).

The two hierarchies deﬁned above induce constraints on the observed features in different ways. In

the class-membership hierarchy, a parent class admits a larger number of combinations of features

than any of its children, i.e., the parent category is less constrained than its children classes. In

contrast, a parent node in the part-membership hierarchy imposes stricter constraints on the observed

features than a child node. This distinction is illustrated by the simple ”toy” example shown in Fig. 1.

Roughly speaking, in the class-membership hierarchy (right panel), the parent node is the disjunction

of the child categories. In the part-membership hierarchy (left panel), the parent category represents

a conjunction of the children categories. This difference in the effect of constraints between the two

representations is, of course, reﬂected in the dependency of the posterior probability on the class,

conditioned on the observations.

Figure 1: Examples. Left: part-membership hierarchy, the concept of a dog requires a conjunction of parts -

a head, legs and tail. Right: class-membership hierarchy, the concept of a dog is deﬁned as the disjunction of

more speciﬁc concepts - Afghan, Beagle and Collie.

In order to treat different hierarchical representations uniformly we invoke the notion of partial

order. Intuitively speaking, different levels in each hierarchy are related by a partial order: the more

speciﬁc concept, which corresponds to a smaller set of events or objects in the world, is always

smaller than the more general concept, which corresponds to a larger set of events or objects.

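To illustrate this point, consider Fig. 1 again. For the part-membership hierarchy example (left panel), the concept of 'dog' requires a conjunction of parts, as in $\mathrm{DOG} = \mathrm{LEGS} \cap \mathrm{HEAD} \cap \mathrm{TAIL}$, and therefore, for example, $\mathrm{DOG} \subset \mathrm{LEGS} \Rightarrow \mathrm{DOG} \prec \mathrm{LEGS}$. Thus

$$\mathrm{DOG} \prec \mathrm{LEGS}, \quad \mathrm{DOG} \prec \mathrm{HEAD}, \quad \mathrm{DOG} \prec \mathrm{TAIL}$$

In contrast, for the class-membership hierarchy (right panel), the class of dogs requires the disjunction of the individual members, as in $\mathrm{DOG} = \mathrm{AFGHAN} \cup \mathrm{BEAGLE} \cup \mathrm{COLLIE}$, and therefore, for example, $\mathrm{DOG} \supset \mathrm{AFGHAN} \Rightarrow \mathrm{DOG} \succ \mathrm{AFGHAN}$. Thus

$$\mathrm{DOG} \succ \mathrm{AFGHAN}, \quad \mathrm{DOG} \succ \mathrm{BEAGLE}, \quad \mathrm{DOG} \succ \mathrm{COLLIE}$$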
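The set-inclusion argument above can be sketched in a few lines, assuming each concept is modeled as the set of events in which it is present. The event identifiers and the helper `more_specific` are hypothetical illustrations, not part of the original framework.

```python
# Toy universe: each concept is the set of events (e.g., image IDs) in which it holds.
# All event names below are hypothetical.
LEGS = {"e1", "e2", "e3", "e4"}
HEAD = {"e1", "e2", "e3", "e5"}
TAIL = {"e1", "e2", "e6"}

# Part membership: a dog requires the conjunction (intersection) of its parts.
DOG_FROM_PARTS = LEGS & HEAD & TAIL

AFGHAN = {"e1"}
BEAGLE = {"e2"}
COLLIE = {"e7"}

# Class membership: a dog is the disjunction (union) of its sub-classes.
DOG_FROM_CLASSES = AFGHAN | BEAGLE | COLLIE


def more_specific(a: set, b: set) -> bool:
    """True if concept a is strictly smaller (more specific) than concept b."""
    return a < b  # proper-subset test on Python sets
```

In the part hierarchy the parent ('dog') is the smaller set, while in the class hierarchy the parent is the larger one; the partial order treats both cases uniformly.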

2.2 Deﬁnition of Incongruent Events

Notations

We assume that the data is represented as a Graph $\{G, E\}$ of Partial Orders (GPO). Each node in $G$ is a random variable which corresponds to a class or concept (or event). Each directed link in $E$ corresponds to a partial order relationship as defined above, where there is a link from node $a$ to node $b$ iff $a \prec b$.

For each node (concept) $a$, define $A^s = \{b \in G,\ b \prec a\}$ - the set of all nodes (concepts) $b$ more specific (smaller) than $a$ in accordance with the given partial order; similarly, define $A^g = \{b \in G,\ a \prec b\}$ - the set of all nodes (concepts) $b$ more general (larger) than $a$ in accordance with the given partial order.
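As an illustration of these definitions, the following sketch stores the links of a graph of partial orders and collects, for a concept, the set of more specific and the set of more general concepts. The function names and the edge-list encoding (a pair `(a, b)` meaning that `a` is more specific than `b`) are assumptions made for this example only.

```python
from collections import defaultdict


def build_gpo(edges):
    """edges: iterable of (a, b) pairs, each meaning 'a is more specific than b'."""
    up = defaultdict(set)    # node -> directly more general nodes
    down = defaultdict(set)  # node -> directly more specific nodes
    for a, b in edges:
        up[a].add(b)
        down[b].add(a)
    return up, down


def reachable(start, neighbors):
    """All nodes reachable from start by following links (start itself excluded)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Part-membership toy hierarchy of Fig. 1 (left): dog is smaller than each part.
part_up, part_down = build_gpo([("dog", "head"), ("dog", "legs"), ("dog", "tail")])
A_g_dog = reachable("dog", part_up)    # more general concepts
A_s_dog = reachable("dog", part_down)  # more specific concepts (none here)
```

For the toy part hierarchy this gives three more general concepts and no more specific ones, matching the counts used in the examples of this section.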

For each concept $a$ and training data $T$, we train up to 3 probabilistic models which are derived from $T$ in different ways, in order to determine whether the concept $a$ is present in a new data point $X$:

• $Q_a(X)$: a probabilistic model of class $a$, derived from training data $T$ without using the partial order relations in the GPO.

• If $|A^s| > 1$, $Q^s_a(X)$: a probabilistic model of class $a$ which is based on the probability of concepts in $A^s$, assuming their independence of each other. Typically, the model incorporates some relatively simple conjunctive and/or disjunctive relations among concepts in $A^s$.

• If $|A^g| > 1$, $Q^g_a(X)$: a probabilistic model of class $a$ which is based on the probability of concepts in $A^g$, assuming their independence of each other. Here too, the model typically incorporates some relatively simple conjunctive and/or disjunctive relations among concepts in $A^g$.

Examples

To illustrate, we use the simple examples shown in Fig. 1, where our concept of interest ais the

concept ‘dog’:

In the part-membership hierarchy (left panel), $|A^g| = 3$ (head, legs, tail). We can therefore learn 2 models for the class 'dog' ($Q^s_{\mathrm{dog}}$ is not defined):

1. $Q_{\mathrm{dog}}$ - obtained using training pictures of 'dogs' and 'not dogs' without body part labels.

2. $Q^g_{\mathrm{dog}}$ - obtained using the outcome of models for head, legs and tail, which were trained on the same training set $T$ with body part labels. For example, if we assume that concept $a$ is the conjunction of its part member concepts as defined above, and assuming that these part concepts are independent of each other, we get

$$Q^g_{\mathrm{dog}} = \prod_{b \in A^g} Q_b = Q_{\mathrm{Head}} \cdot Q_{\mathrm{Legs}} \cdot Q_{\mathrm{Tail}} \qquad (1)$$

In the class-membership hierarchy (right panel), $|A^s| = 3$ (Afghan, Beagle, Collie). If we further assume that a class-membership hierarchy is always a tree, then $|A^g| = 1$. We can therefore learn 2 models for the class 'dog' ($Q^g_{\mathrm{dog}}$ is not defined):

1. $Q_{\mathrm{dog}}$ - obtained using training pictures of 'dogs' and 'not dogs' without breed labels.

2. $Q^s_{\mathrm{dog}}$ - obtained using the outcome of models for Afghan, Beagle and Collie, which were trained on the same training set $T$ with only specific dog type labels. For example, if we assume that concept $a$ is the disjunction of its sub-class concepts as defined above, and assuming that these sub-class concepts are independent of each other, we get

$$Q^s_{\mathrm{dog}} = \sum_{b \in A^s} Q_b = Q_{\mathrm{Afghan}} + Q_{\mathrm{Beagle}} + Q_{\mathrm{Collie}}$$
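The two combination rules in these examples can be written down directly. The sketch below assumes the per-concept probabilities are already available as plain numbers; the function names and the numeric values are hypothetical. Summing sub-class probabilities corresponds to treating the sub-classes as mutually exclusive, and the cap at 1 is a safeguard added here, not something stated in the text.

```python
import math


def q_from_parts(part_probs):
    """Eq. (1): conjunction of independent parts, i.e., the product of their probabilities."""
    return math.prod(part_probs.values())


def q_from_subclasses(subclass_probs):
    """Disjunction of sub-classes: the sum of their probabilities.

    The cap at 1.0 is an assumption of this sketch.
    """
    return min(1.0, sum(subclass_probs.values()))


# Hypothetical classifier outputs for a single observation X.
q_g_dog = q_from_parts({"head": 0.9, "legs": 0.8, "tail": 0.5})
q_s_dog = q_from_subclasses({"afghan": 0.1, "beagle": 0.2, "collie": 0.05})
```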

Incongruent events

In general, we expect the different models to provide roughly the same probability for the presence of concept $a$ in data $X$. A mismatch between the predictions of the different models should raise a red flag, possibly indicating that something new and interesting has been observed. In particular, we are interested in the following discrepancy:

Definition: Observation $X$ is incongruent if there exists a concept $a$ such that

$$Q^g_a(X) \gg Q_a(X) \quad \text{or} \quad Q_a(X) \gg Q^s_a(X). \qquad (2)$$

Alternatively, observation $X$ is incongruent if a discrepancy exists between the inference of the two classifiers: either the classifier based on the more general descriptions from level $g$ accepts $X$ while the direct classifier rejects it, or the direct classifier accepts $X$ while the classifier based on the more specific descriptions from level $s$ rejects it. In either case, the concept receives high probability at the more general level (according to the GPO), but much lower probability when relying only on the more specific level.

Let us discuss again the examples we have seen before, to illustrate why this deﬁnition indeed

captures interesting “surprises”:

• In the part-membership hierarchy (left panel of Fig. 1), we have

$$Q^g_{\mathrm{dog}} = Q_{\mathrm{Head}} \cdot Q_{\mathrm{Legs}} \cdot Q_{\mathrm{Tail}} \gg Q_{\mathrm{dog}}$$

In other words, while the probability of each part is high (since the product of those probabilities is high), the 'dog' classifier is rather uncertain about the existence of a dog in this data.

How can this happen? Maybe the parts are configured in an unusual arrangement for a dog (as in a 3-legged cat), or maybe we encounter a donkey with a cat's tail (as in Shrek 3). Those are two examples of the kind of unexpected events we are interested in.

• In the class-membership hierarchy (right panel of Fig. 1), we have

$$Q^s_{\mathrm{dog}} = Q_{\mathrm{Afghan}} + Q_{\mathrm{Beagle}} + Q_{\mathrm{Collie}} \ll Q_{\mathrm{dog}}$$

In other words, while the probability of each sub-class is low (since the sum of these probabilities is low), the 'dog' classifier is certain about the existence of a dog in this data.

How may such a discrepancy arise? Maybe we are seeing a new type of dog that we haven't seen before - a Pointer. The dog model, if correctly capturing the notion of 'dogness', should be able to identify this new object, while models of previously seen dog breeds (Afghan, Beagle and Collie) correctly fail to recognize the new object.
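A minimal sketch of the test in Definition (2) follows. Since the text does not quantify "much larger", the sketch reads it as "larger by at least a factor `ratio`"; this factor, the default value 10, and the function name are assumptions made for illustration.

```python
def is_incongruent(q_direct, q_general=None, q_specific=None, ratio=10.0, eps=1e-12):
    """Definition (2), with '>>' interpreted as 'larger by at least a factor `ratio`'.

    q_direct   : Q_a(X), the direct model of concept a
    q_general  : Q^g_a(X), or None if that model is not defined
    q_specific : Q^s_a(X), or None if that model is not defined
    """
    general_conflict = q_general is not None and q_general > ratio * max(q_direct, eps)
    specific_conflict = q_specific is not None and q_direct > ratio * max(q_specific, eps)
    return general_conflict or specific_conflict


# New dog breed: the 'dog' model is confident, the known breeds are not.
new_breed = is_incongruent(q_direct=0.9, q_specific=0.03)
# Unusual arrangement of parts: the parts are confident, the 'dog' model is not.
odd_parts = is_incongruent(q_direct=0.02, q_general=0.6)
```

In practice the threshold would have to be tuned on held-out data; the fixed ratio is only meant to make the two branches of the definition explicit.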

3 Incongruent events: algorithms

Our definition for incongruent events in the previous section is indeed unified, but as a result quite abstract. In this section we discuss two different algorithmic implementations, one generative and one discriminative, which were developed for the part membership and class membership hierarchies respectively (see definition in Section 2.1). In both cases, we use the notation $Q(x)$ for the class probability as defined above, and $p(x)$ for the estimated probability.

3.1 Part membership - a generative algorithm

Consider the left panel of Fig. 1. The event in the top node is incongruent if its probability is low,

while the probability of all its descendants is high.

In many applications, such as speech recognition, one computes the probability of events (sentences)

based on a generative model (corresponding to a speciﬁc language) which includes a dictionary of

parts (words). At the top level the event probability is computed conditional on the model; in which

case typically the parts are assumed to be independent, and the event probability is computed as

the multiplication of the parts probabilities conditioned on the model. For example, in speech pro-

cessing and assuming a speciﬁc language (e.g., English), the probability of the sentence is typically

computed by multiplying the probability of each word using an HMM model trained on sentences

from a speciﬁc language. At the bottom level, the probability of each part is computed independently

of the generative model.

More formally, consider an event $u$ composed of parts $w_k$. Using the generative model of events and assuming the conditional independence of the parts given this model, the prior probability of the event is given by the product of prior probabilities of the parts,

$$p(u|L) = \prod_k p(w_k|L) \qquad (3)$$

where $L$ denotes the generative model (e.g., the language).

For measurement $X$, we compute $Q(X)$ as follows

$$Q(X) = p(X|L) = \sum_u p(X|u,L)\,p(u|L) \approx p(X|\bar{u},L)\,p(\bar{u}|L) = p(X|\bar{u}) \prod_k p(w_k|L) \qquad (4)$$

using $p(X|u,L) = p(X|u)$ and (3), and where $\bar{u} = \arg\max_u p(u|L)$ is the most likely interpretation. At the risk of notation abuse, $\{w_k\}$ now denote the parts which compose the most likely event $\bar{u}$. We assume that the first sum is dominated by the maximal term.

Given a part-membership hierarchy, we can use (1) to compute the probability $Q^g(X)$ directly, without using the generative model $L$.

$$Q^g(X) = p(X) = \sum_u p(X|u)\,p(u) \geq p(X|\bar{u})\,p(\bar{u}) = p(X|\bar{u}) \prod_k p(w_k) \qquad (5)$$

It follows from (4) and (5) that

$$\frac{Q(X)}{Q^g(X)} \leq \prod_k \frac{p(w_k|L)}{p(w_k)} \qquad (6)$$

We can now conclude that $X$ is an incongruent event according to our definition if there exists at least one part $k$ in the final event $\bar{u}$, such that $p(w_k) \gg p(w_k|L)$ (assuming all other parts have

roughly the same conditional and unconditional probabilities). In speech processing, a sentence is

incongruent if it includes an incongruent word - a word whose probability based on the generative

language model is low, but whose direct probability (not constrained by the language model) is high.
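To make the bound in (6) concrete, the sketch below evaluates its right-hand side in the log domain and lists the parts that drive it down. The probability values, the factor `ratio`, and the function names are hypothetical; they only illustrate how a single word with a very low language-model probability dominates the product.

```python
import math


def log_ratio_bound(p_lm, p_direct):
    """Log of the right-hand side of (6): sum over k of log(p(w_k|L) / p(w_k))."""
    return sum(math.log(pl / pd) for pl, pd in zip(p_lm, p_direct))


def incongruent_parts(p_lm, p_direct, ratio=100.0):
    """Indices k where the direct probability p(w_k) exceeds p(w_k|L) by a factor `ratio`."""
    return [k for k, (pl, pd) in enumerate(zip(p_lm, p_direct)) if pd > ratio * pl]


# Hypothetical three-word sentence; the third word is unknown to the language model.
p_lm = [0.02, 0.01, 1e-7]        # p(w_k | L), constrained by the language model
p_direct = [0.02, 0.012, 0.005]  # p(w_k), not constrained by the language model

bound = log_ratio_bound(p_lm, p_direct)
suspects = incongruent_parts(p_lm, p_direct)
```

Working with logarithms avoids numerical underflow when the event contains many parts.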

Example: Out Of Vocabulary (OOV) words

For the detection of OOV words, we performed experiments using a Large Vocabulary Continuous

Speech Recognition (LVCSR) system on the Wall Street Journal Corpus (WSJ). The evaluation

set consists of 2.5 hours. To introduce OOV words, the vocabulary was restricted to the 4968 most

frequent words from the language training texts, leaving the remaining words unknown to the model.

A more detailed description is given in [7].

In this task, we have shown that the comparison between two parallel classiﬁers, based on strong

and weak posterior streams, is effective for the detection of OOV words, and also for the detection

of recognition errors. Speciﬁcally, we use the derivation above to detect out of vocabulary words,

by comparing their probability when computed based on the language model, and when computed

based on mere acoustic modeling. The best performance was obtained by the system when a Neural

Network (NN) classiﬁer was used for the direct estimation of frame-based OOV scores. The network

was directly fed by posteriors from the strong and the weak systems. For the WSJ task, we achieved

performance of around 11% Equal-Error-Rate (EER) (Miss/False Alarm probability), see Fig. 2.

Figure 2: Several techniques used to detect OOV: (i) Cmax: Conﬁdence measure computed ONLY from

strongly constrained Large Vocabulary Continuous Speech Recognizer (LVCSR), with frame-based posteriors.

(ii) LVCSR+weak features: Strongly and weakly constrained recognizers, compared via the KL-divergence

metric. (iii) LVCSR+NN posteriors: Combination of strong and weak phoneme posteriors using NN classiﬁer.

(iv) all features: fusion of (ii) and (iii) together.
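The comparison of strongly and weakly constrained posterior streams via the KL-divergence, mentioned in the caption of Fig. 2, might look as follows on a per-frame basis. This is a sketch under stated assumptions: the direction of the divergence, the smoothing constant `eps`, and the toy posterior values are not given in the text and are chosen here for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions over the same phoneme set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def frame_scores(strong, weak):
    """One divergence score per frame, computed as KL(weak || strong).

    strong, weak: lists of per-frame phoneme posterior vectors of equal length.
    """
    return [kl_divergence(w, s) for s, w in zip(strong, weak)]


# Two hypothetical frames over three phonemes.
strong = [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05]]
weak = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
scores = frame_scores(strong, weak)
```

Frames where the two streams agree receive a score near zero, while frames where the weakly constrained recognizer prefers a different phoneme receive a large score.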

3.2 Class membership - a discriminative algorithm

Consider the right panel of Fig. 1. The general class in the top node is incongruent if its probability

is high, while the probability of all its sub-classes is low. In other words, the classiﬁer of the

parent object accepts the new observation, but all the children object classiﬁers reject it. Brute

force computation of this deﬁnition may follow the path taken by traditional approaches to novelty

detection, e.g., looking for rejection by all one class classiﬁers corresponding to sub-class objects.

The results we obtained with this method were mediocre, probably because generative models are

not well suited for the task. Instead, it seems like discriminative classiﬁers, trained to discriminate

between objects at the sub-class level, could be more successful. We note that unlike traditional

approaches to novelty detection, which must use generative models or one-class classiﬁers in the

absence of appropriate discriminative data, our dependence on object hierarchy provides discrimi-

native data as a by-product. In other words, after the recognition by a parent-node classiﬁer, we may

use classiﬁers trained to discriminate between its children to implement a discriminative novelty

detection algorithm.

Speciﬁcally, we used the approach described in [8] to build a uniﬁed representation for all objects

in the sub-class level, which is the representation computed for the parent object whose classiﬁer

had accepted (positively recognized) the object. In this feature space, we build a classiﬁer for each

sub-class based on the majority vote between pairwise discriminative classiﬁers. Based on these

classiﬁers, each example (accepted by the parent classiﬁer) is assigned to one of the sub-classes, and

the average margin over classiﬁers which agree with the ﬁnal assignment is calculated. The ﬁnal

classiﬁer then uses a threshold on this average margin to identify each object as known sub-class or

new sub-class. Previous research in the area of face identiﬁcation can be viewed as an implicit use

of this proposed framework, see e.g. [9].
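The decision rule described in this paragraph (majority vote among pairwise classifiers, average margin over the classifiers that agree with the vote, then a threshold) can be sketched as below. The pairwise classifiers are abstracted as a function `pairwise_margin(a, b)` whose sign picks the winner; this interface, the tie-breaking rule, and the label `"new sub-class"` are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations


def classify_with_margin(pairwise_margin, classes):
    """pairwise_margin(a, b) > 0 votes for a, otherwise for b; |value| is the margin."""
    votes = Counter()
    margins = {c: [] for c in classes}
    for a, b in combinations(classes, 2):
        m = pairwise_margin(a, b)
        winner = a if m > 0 else b
        votes[winner] += 1
        margins[winner].append(abs(m))
    # Majority vote; ties are broken by order in `classes` (an arbitrary choice).
    label = max(classes, key=lambda c: votes[c])
    agreeing = margins[label]
    avg_margin = sum(agreeing) / len(agreeing) if agreeing else 0.0
    return label, avg_margin


def known_or_new(pairwise_margin, classes, threshold):
    """Return the assigned sub-class, or 'new sub-class' if the average margin is too small."""
    label, avg_margin = classify_with_margin(pairwise_margin, classes)
    return (label if avg_margin >= threshold else "new sub-class"), avg_margin
```

A confidently recognized example yields a large average margin and keeps its label; an example from an unseen sub-class tends to win its pairwise contests only narrowly and falls below the threshold.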

Example: new face recognition from audio-visual data

We tested our algorithm on audio-visual speaker veriﬁcation. In this setup, the general parent cate-

gory level is the ‘speech’ (audio) and ‘face’ (visual), and the different individuals are the offspring

(sub-class) levels. The task is to identify an individual as belonging to the trusted group of individ-

uals vs. being unknown, i.e. known sub-class vs. new sub-class in a class membership hierarchy.

The uniﬁed representation of the visual cues was built using the approach described in [8]. All

objects in the sub-class level (different individuals) were represented using the representation learnt for the parent level ('face'). For the audio cues we used the Perceptual linear predictive (PLP) Cepstral features [10] as the unified representation. We used SVM classifiers with RBF kernel as the

pairwise discriminative classiﬁers for each of the different audio/visual representations separately.

Data was collected for our experiments using a wearable device, which included stereo panoramic

vision sensors and microphone arrays. In the recorded scenario, individuals walked towards the

device and then read aloud an identical text; we acquired 30 sequences with 17 speakers (see Fig. 3

for an example). We tested our method by choosing six speakers as members of the trusted group,

while the rest were assumed unknown.

The method was applied separately using each one of the different modalities, and also in an in-

tegrated manner using both modalities. For this fusion the audio signal and visual signal were

synchronized, and the winning classiﬁcation margins of both signals were normalized to the same

scale and averaged to obtain a single margin for the combined method.
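The fusion step can be illustrated as follows. The text states only that the margins were normalized to the same scale and averaged; the min-max normalization used here is an assumption, as are the function names and the toy values.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant sequence maps to 0.5 (arbitrary choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_margins(audio_margins, visual_margins):
    """Per-frame average of the normalized winning margins.

    Both sequences are assumed to be synchronized, i.e., index i refers to the same
    time instant in the audio and the visual stream.
    """
    a = min_max_normalize(audio_margins)
    v = min_max_normalize(visual_margins)
    return [(x + y) / 2 for x, y in zip(a, v)]


fused = fuse_margins([0.0, 1.0, 2.0], [10.0, 30.0, 20.0])
```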

Since the goal is to identify novel incongruent events, true positive and false positive rates were

calculated by considering all frames from the unknown test sequences as positive events and the

known individual test sequences as negative events. We compared our method to novelty detection

based on one-class SVM [3] extended to our multi-class case. Decision was obtained by comparing

the maximal margin over all one-class classiﬁers to a varying threshold. As can be seen in Fig. 3,

our method performs substantially better in both modalities as compared to the “standard” one class

approach for novelty detection. Performance is further improved by fusing both modalities.
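The evaluation protocol above, in which frames from unknown individuals are the positive events and a threshold on the margin is varied, can be sketched as follows. The rule that a frame is flagged as unknown when its margin falls below the threshold, and all names and values, are assumptions of this example.

```python
def roc_points(margins, is_unknown, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold.

    margins    : per-frame margin (low margin suggests an unknown individual)
    is_unknown : per-frame ground truth, True for frames of unknown individuals
    """
    pos = sum(is_unknown)
    neg = len(is_unknown) - pos
    points = []
    for t in thresholds:
        flagged = [m < t for m in margins]
        tp = sum(f and u for f, u in zip(flagged, is_unknown))
        fp = sum(f and not u for f, u in zip(flagged, is_unknown))
        points.append((fp / neg, tp / pos))
    return points


# Hypothetical frames: two from unknown speakers, two from trusted speakers.
points = roc_points([0.1, 0.2, 0.9, 0.8], [True, True, False, False], [0.0, 0.5, 1.0])
```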

4 Summary

Unexpected events are typically identiﬁed by their low posterior probability. In this paper we em-

ployed label hierarchy to obtain a few probability values for each event, which allowed us to tease

apart different types of unexpected events. In general there are 4 possibilities, based on the classi-

ﬁers’ response at two adjacent levels:

   Specific level   General level   Possible reason
1  reject           reject          noisy measurements, or a totally new concept
2  reject           accept          incongruent concept
3  accept           reject          inconsistent with partial order, models are wrong
4  accept           accept          known concept
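The four cases can be encoded as a small lookup, shown here only as an illustration; the function name and its boolean interface are hypothetical.

```python
def interpret(specific_accepts: bool, general_accepts: bool) -> str:
    """Map the responses of classifiers at two adjacent levels to the four cases."""
    if not specific_accepts and not general_accepts:
        return "noisy measurements, or a totally new concept"
    if not specific_accepts and general_accepts:
        return "incongruent concept"
    if specific_accepts and not general_accepts:
        return "inconsistent with partial order, models are wrong"
    return "known concept"
```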

[Figure 3, right panel: plot of True positive rate (vertical axis) against False positive rate (horizontal axis), both ranging from 0 to 1. Curves: audio, visual, audio-visual, audio (OC-SVM), visual (OC-SVM).]

Figure 3: Left: Example: one frame used for the visual veriﬁcation task. Right: True Positive vs. False

Positive rates when detecting unknown vs. trusted individuals. The unknown are regarded as positive events.

Results are shown for the proposed method using both modalities separately and the combined method (solid

lines). For comparison, we show results with a more traditional novelty detection method using One Class

SVM (dashed lines).

We focused above on the second type of events - incongruent concepts, which have not been studied previously in isolation. Such events are characterized by some discrepancy between the responses of two classifiers, which can occur for a number of different reasons. Context: in a given context such as the English language, a sentence containing a Czech word is assigned low probability. In the visual domain, in a given context such as a street scene, otherwise high probability events such as "car" and "elephant" are not likely to appear together. New sub-class: a new object has been encountered, of some known generic type but unknown specifics.

We described how our approach can be used to design new algorithms to address these problems,

showing promising results on real speech and audio-visual facial datasets.

References

[1] Markou, M., Singh, S.: Novelty detection: a review-part 1: statistical approaches. Signal Processing 83

(2003) 2499 – 2521

[2] Markou, M., Singh, S.: Novelty detection: a review-part 2: neural network based approaches. Signal

Processing 83 (2003) 2481–2497

[3] Scholkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.: Support vector method for

novelty detection. In: Proc. NIPS. Volume 12. (2000) 582–588

[4] Lanckriet, G.R.G., Ghaoui, L.E., Jordan, M.I.: Robust novelty detection with single-class mpm. In:

Proc. NIPS. Volume 15. (2003) 929–936

[5] Berns, G.S., Cohen, J.D., Mintun, M.A.: Brain regions responsive to novelty in the absence of awareness.

Science 276 (1997) 1272 – 1275

[6] Rokers, B., Mercado, E., Allen, M.T., Myers, C.E., Gluck, M.A.: A connectionist model of septohip-

pocampal dynamics during conditioning: Closing the loop. Behavioral Neuroscience 116 (2002) 48–62

[7] Burget, L., Schwarz, P., Matejka, P., Hannemann, M., Rastrow, A., White, C., Khudanpur, S., Hermansky,

H., Cernocky, J.: Combination of strongly and weakly constrained recognizers for reliable detection of

oovs. In: Proceedings of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP). (2008)

[8] Bar-Hillel, A., Weinshall, D.: Subordinate class recognition using relational object models. Proc. NIPS

19 (2006)

[9] Lanitis, A., Taylor, C.J., Cootes, T.F.: A uniﬁed approach to coding and interpreting face images. In:

Proc. ICCV. (1995) 368–373

[10] Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical

Society of America 87 (1990) 1738
