Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree

Abstract

Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on 'incongruent events': cases where 'general level' and 'specific level' classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of a label hierarchy, we show how a partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. We show promising results with real data on two specific problems, one for each hierarchy type: Out Of Vocabulary words in speech recognition (part membership), and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition (class membership).
1 Introduction
Machine learning builds models of the world using training data from the application domain and
prior knowledge about the problem. The models are later applied to future data in order to estimate
the current state of the world. An implied assumption is that the future is stochastically similar to
the past. The approach fails when the system encounters situations that are not anticipated from the
past experience. In contrast, successful natural organisms identify new unanticipated stimuli and
situations and frequently generate appropriate responses.
By definition, an unexpected event is one that the system is unlikely to encounter, based
on the data that has been observed previously. In line with this observation, much of the computational
work on novelty detection has focused on the probabilistic modeling of known classes, identifying
outliers of these distributions as novel events (see e.g. [1, 2] for recent reviews). More recently,
one-class classifiers have been proposed and used for novelty detection without the direct modeling of
the data distribution [3, 4]. There are many studies on novelty detection in biological systems [5], often
focusing on regions of the hippocampus [6].
To advance beyond the detection of outliers, we observe that there are many different reasons why
some stimuli could appear novel. Our work, presented in Section 2, focuses on unexpected events
which are indicated by the incongruence between prediction induced by prior experience (training)
and the evidence provided by the sensory data. To identify an item as incongruent, we use two
parallel classifiers. One of them is strongly constrained by specific knowledge (both prior and data-
derived), the other classifier is more general and less constrained. Both classifiers are assumed
to yield class-posterior probability in response to a particular input signal. A sufficiently large
discrepancy between posterior probabilities induced by input data in the two classifiers is taken as
indication that an item is incongruent.
Thus, in comparison with most existing work on novelty detection, one new and important characteristic
of our approach is that we look for a level of description where the novel event is highly
probable. Rather than simply respond to an event which is rejected by all classifiers, which more
often than not requires no special attention (as in pure noise), we construct and exploit a hierarchy of
representations. We attend to those events which are recognized (or accepted) at some more abstract
levels of description in the hierarchy, while being rejected by the more concrete classifiers.
There are various ways to incorporate prior hierarchical knowledge and constraints within different
classifier levels, as discussed in Section 3. One approach, used to detect images of unexpected in-
congruous objects, is to train the more general, less constrained classifier using a larger more diverse
set of stimuli, e.g., the facial images of many individuals. The second classifier is trained using a
more specific (i.e. smaller) set of specific objects (e.g., the set of Einstein’s facial images). An
incongruous item (e.g., a new individual) could then be identified by a smaller posterior probability
estimated by the more specific classifiers relative to the probability from the more general classifier.
A different approach is used to identify unexpected (out-of-vocabulary) lexical items. The more
general classifier is trained to classify speech sounds (phonemes) sequentially from relatively short
segments of the input speech signal (thus yielding an unconstrained sequence of phoneme labels);
the more constrained classifier is trained to classify a particular set of words (highly constrained
sequences of phoneme labels) from the information available in the whole speech sentence. A word
that did not belong to the expected vocabulary of the more constrained recognizer could then be
identified by discrepancy in posterior probabilities of phonemes derived from both classifiers.
Our second contribution in Section 2 is the presentation of a unifying theoretical framework for
these two approaches. Specifically, we consider two kinds of hierarchies: Part membership as in
biological taxonomy or speech, and Class membership, as in human categorization (or levels of
categorization). We define a notion of partial order on such hierarchies, and identify those events
whose probability as computed using different levels of the hierarchy does not agree. In particular,
we are interested in those events that receive high probability at more general levels (for example,
the system is certain that the new example is a dog), but low probability at more specific levels (in the
same example, the system is certain that the new example is not any known dog breed). Such events
correspond to many interesting situations that are worthy of special attention, including incongruous
scenes and new sub-classes, as shown in Section 3.
2 Incongruent Events - unified approach
2.1 Introducing label hierarchy
The set of labels represents the knowledge base about stimuli, which is either given (by a teacher in
supervised learning settings) or learned (in unsupervised or semi-supervised settings). In cognitive
systems such knowledge is hardly ever a set; often, in fact, labels are given (or can be thought of) as
a hierarchy. In general, hierarchies can be represented as directed graphs. The nodes of the graphs
may be divided into distinct subsets that correspond to different entities (e.g., all objects that are
animals); we call these subsets “levels”. We identify two types of hierarchies:
Part membership, as in biological taxonomy or speech. For example, eyes, ears, and nose combine
to form a head; head, legs and tail combine to form a dog.
Class membership, as in human categorization where objects can be classified at different levels
of generality, from sub-ordinate categories (most specific level), to basic level (intermediate level),
to super-ordinate categories (most general level). For example, a Beagle (sub-ordinate category) is
also a dog (basic level category), and it is also an animal (super-ordinate category).
The two hierarchies defined above induce constraints on the observed features in different ways. In
the class-membership hierarchy, a parent class admits a higher number of combinations of features
than any of its children, i.e., the parent category is less constrained than its children classes. In
contrast, a parent node in the part-membership hierarchy imposes stricter constraints on the observed
features than a child node. This distinction is illustrated by the simple 'toy' example shown in Fig. 1.
Roughly speaking, in the class-membership hierarchy (right panel), the parent node is the disjunction
of the child categories. In the part-membership hierarchy (left panel), the parent category represents
a conjunction of the children categories. This difference in the effect of constraints between the two
representations is, of course, reflected in the dependency of the posterior probability on the class,
conditioned on the observations.
Figure 1: Examples. Left: part-membership hierarchy, the concept of a dog requires a conjunction of parts -
a head, legs and tail. Right: class-membership hierarchy, the concept of a dog is defined as the disjunction of
more specific concepts - Afghan, Beagle and Collie.
In order to treat different hierarchical representations uniformly we invoke the notion of partial
order. Intuitively speaking, different levels in each hierarchy are related by a partial order: the more
specific concept, which corresponds to a smaller set of events or objects in the world, is always
smaller than the more general concept, which corresponds to a larger set of events or objects.
To illustrate this point, consider Fig. 1 again. For the part-membership hierarchy example (left
panel), the concept of 'dog' requires a conjunction of parts as in $DOG = LEGS \wedge HEAD \wedge TAIL$,
and therefore, for example, $DOG \subset LEGS \Rightarrow DOG \prec LEGS$. Thus
$$DOG \prec LEGS, \quad DOG \prec HEAD, \quad DOG \prec TAIL$$
In contrast, for the class-membership hierarchy (right panel), the class of dogs requires the disjunction
of the individual members as in $DOG = AFGHAN \vee BEAGLE \vee COLLIE$, and therefore,
for example, $DOG \supset AFGHAN \Rightarrow DOG \succ AFGHAN$. Thus
$$DOG \succ AFGHAN, \quad DOG \succ BEAGLE, \quad DOG \succ COLLIE$$
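The set-inclusion reading of the partial order can be checked on a toy universe. The following is only an illustrative sketch: the object names and their attributes are invented, and "more specific" is read as "is a subset of".

```python
# Toy universe: each object lists the parts it shows and, if known, its breed.
# All names and attributes here are invented for illustration.
objects = {
    "o1": {"parts": {"head", "legs", "tail"}, "breed": "afghan"},
    "o2": {"parts": {"head", "legs", "tail"}, "breed": "beagle"},
    "o3": {"parts": {"head", "legs"}, "breed": None},  # e.g. a tailless animal
    "o4": {"parts": {"legs"}, "breed": None},          # e.g. a table
}

def having_part(part):
    """Set of objects that show the given part."""
    return {name for name, o in objects.items() if part in o["parts"]}

def of_breed(breed):
    """Set of objects labelled with the given breed."""
    return {name for name, o in objects.items() if o["breed"] == breed}

# Part membership: DOG is the conjunction (intersection) of its parts.
dog_by_parts = having_part("head") & having_part("legs") & having_part("tail")

# Class membership: DOG is the disjunction (union) of its sub-classes.
dog_by_breeds = of_breed("afghan") | of_breed("beagle") | of_breed("collie")

def more_specific(a, b):
    """a precedes b in the partial order when a is a subset of b."""
    return a <= b
```

In this toy universe the parent 'dog' is smaller than each of its parts in the part-membership case, and larger than each of its breeds in the class-membership case.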
2.2 Definition of Incongruent Events
Notations
We assume that the data is represented as a Graph $\{G, E\}$ of Partial Orders (GPO). Each node in
$G$ is a random variable which corresponds to a class or concept (or event). Each directed link in $E$
corresponds to a partial order relationship as defined above, where there is a link from node $a$ to node
$b$ iff $a \prec b$.
For each node (concept) $a$, define $A^s = \{b \in G,\ b \prec a\}$ - the set of all nodes (concepts) $b$ more
specific (smaller) than $a$ in accordance with the given partial order; similarly, define
$A^g = \{b \in G,\ a \prec b\}$ - the set of all nodes (concepts) $b$ more general (larger) than $a$ in accordance with the
given partial order.
For each concept $a$ and training data $T$, we train up to 3 probabilistic models which are derived from
$T$ in different ways, in order to determine whether the concept $a$ is present in a new data point $X$:
$Q_a(X)$: a probabilistic model of class $a$, derived from training data $T$ without using the
partial order relations in the GPO.
If $|A^s| > 1$:
$Q^s_a(X)$: a probabilistic model of class $a$ which is based on the probability of concepts in
$A^s$, assuming their independence of each other. Typically, the model incorporates some
relatively simple conjunctive and/or disjunctive relations among concepts in $A^s$.
If $|A^g| > 1$:
$Q^g_a(X)$: a probabilistic model of class $a$ which is based on the probability of concepts in
$A^g$, assuming their independence of each other. Here too, the model typically incorporates
some relatively simple conjunctive and/or disjunctive relations among concepts in $A^g$.
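As a rough illustration of the notation, the sets of more specific and more general concepts can be read off a small graph. This sketch assumes that links are stored only between adjacent levels, so that each set is taken from the direct neighbours of a node; the function name `build_gpo` and the example edges are hypothetical.

```python
from collections import defaultdict

def build_gpo(edges):
    """edges: iterable of (a, b) pairs, each meaning 'a is more specific than b'.

    Returns two maps: node -> more specific neighbours, node -> more general
    neighbours. Only adjacent levels are considered (an assumption).
    """
    more_general = defaultdict(set)
    more_specific = defaultdict(set)
    for a, b in edges:
        more_general[a].add(b)
        more_specific[b].add(a)
    return more_specific, more_general

# Hypothetical class-membership tree around the concept 'dog'.
edges = [("afghan", "dog"), ("beagle", "dog"), ("collie", "dog"), ("dog", "animal")]
more_specific, more_general = build_gpo(edges)

A_s = more_specific["dog"]   # concepts more specific than 'dog'
A_g = more_general["dog"]    # concepts more general than 'dog'

# A level-based model is only built when the corresponding set has more than one member.
can_build_Qs = len(A_s) > 1
can_build_Qg = len(A_g) > 1
```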
Examples
To illustrate, we use the simple examples shown in Fig. 1, where our concept of interest $a$ is the
concept 'dog':
In the part-membership hierarchy (left panel), $|A^g| = 3$ (head, legs, tail). We can therefore learn 2
models for the class 'dog' ($Q^s_{dog}$ is not defined):
1. $Q_{dog}$ - obtained using training pictures of 'dogs' and 'not dogs' without body part labels.
2. $Q^g_{dog}$ - obtained using the outcome of models for head, legs and tail, which were trained on
the same training set $T$ with body part labels. For example, if we assume that concept $a$ is
the conjunction of its part member concepts as defined above, and assuming that these part
concepts are independent of each other, we get
$$Q^g_{dog} = \prod_{b \in A^g} Q_b = Q_{Head} \cdot Q_{Legs} \cdot Q_{Tail} \qquad (1)$$
In the class-membership hierarchy (right panel), $|A^s| = 3$ (Afghan, Beagle, Collie). If we further
assume that a class-membership hierarchy is always a tree, then $|A^g| = 1$. We can therefore learn 2
models for the class 'dog' ($Q^g_{dog}$ is not defined):
1. $Q_{dog}$ - obtained using training pictures of 'dogs' and 'not dogs' without breed labels.
2. $Q^s_{dog}$ - obtained using the outcome of models for Afghan, Beagle and Collie, which were
trained on the same training set $T$ with only specific dog type labels. For example, if we
assume that concept $a$ is the disjunction of its sub-class concepts as defined above, and
assuming that these sub-class concepts are independent of each other, we get
$$Q^s_{dog} = \sum_{b \in A^s} Q_b = Q_{Afghan} + Q_{Beagle} + Q_{Collie}$$
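The two combination rules can be written down in a few lines. This is a minimal sketch with made-up probability values; the function names `q_general` and `q_specific` are illustrative, not part of the original formulation.

```python
import math

def q_general(part_probs):
    """Conjunction of independent concepts: product of their probabilities, as in Eq. (1)."""
    return math.prod(part_probs.values())

def q_specific(subclass_probs):
    """Disjunction of sub-class concepts: sum of their probabilities."""
    return sum(subclass_probs.values())

# Made-up model outputs for a single observation X.
parts = {"head": 0.9, "legs": 0.8, "tail": 0.5}
breeds = {"afghan": 0.1, "beagle": 0.2, "collie": 0.3}

qg_dog = q_general(parts)    # 0.9 * 0.8 * 0.5
qs_dog = q_specific(breeds)  # 0.1 + 0.2 + 0.3
```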
Incongruent events
In general, we expect the different models to provide roughly the same probability for the presence
of concept $a$ in data $X$. A mismatch between the predictions of the different models should raise a
red flag, possibly indicating that something new and interesting has been observed. In particular, we
are interested in the following discrepancy:
Definition: Observation $X$ is incongruent if there exists a concept $a$ such that
$$Q^g_a(X) \gg Q_a(X) \quad \text{or} \quad Q_a(X) \gg Q^s_a(X). \qquad (2)$$
Alternatively, observation $X$ is incongruent if a discrepancy exists between the inference of the two
classifiers: either the classifier based on the more general descriptions from level $g$ accepts $X$
while the direct classifier rejects it, or the direct classifier accepts $X$ while the classifier based on the
more specific descriptions from level $s$ rejects it. In either case, the concept receives high probability
at the more general level (according to the GPO), but much lower probability when relying only on
the more specific level.
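One possible way to operationalise the definition is to read "much larger" as "larger by at least a fixed factor". The sketch below makes that assumption explicit; the factor `ratio` and the function name `is_incongruent` are illustrative choices, not values taken from the experiments.

```python
def is_incongruent(q_direct, q_general=None, q_specific=None, ratio=10.0, eps=1e-12):
    """Check the two conditions of the definition for one concept.

    '>>' is interpreted as 'larger by at least the factor `ratio`', which is an
    assumption of this sketch. A model that is not defined is passed as None.
    """
    # General-level model accepts while the direct model rejects.
    if q_general is not None and q_general >= ratio * max(q_direct, eps):
        return True
    # Direct model accepts while the specific-level model rejects.
    if q_specific is not None and q_direct >= ratio * max(q_specific, eps):
        return True
    return False
```

For example, a 'dog' model that outputs 0.9 while the summed breed models output 0.03 would be flagged under the default factor, whereas 0.9 against 0.8 would not.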
Let us discuss again the examples we have seen before, to illustrate why this definition indeed
captures interesting “surprises”:
In the part-membership hierarchy (left panel of Fig. 1), we have
$$Q^g_{dog} = Q_{Head} \cdot Q_{Legs} \cdot Q_{Tail} \gg Q_{dog}$$
In other words, while the probability of each part is high (since the multiplication of those
probabilities is high), the 'dog' classifier is rather uncertain about the existence of a dog in
this data.
How can this happen? Maybe the parts are configured in an unusual arrangement for a dog
(as in a 3-legged cat), or maybe we encounter a donkey with a cat's tail (as in Shrek 3).
Those are two examples of the kind of unexpected events we are interested in.
In the class-membership hierarchy (right panel of Fig. 1), we have
$$Q^s_{dog} = Q_{Afghan} + Q_{Beagle} + Q_{Collie} \ll Q_{dog}$$
In other words, while the probability of each sub-class is low (since the sum of these
probabilities is low), the 'dog' classifier is certain about the existence of a dog in this data.
How may such a discrepancy arise? Maybe we are seeing a new type of dog that we haven't
seen before - a Pointer. The dog model, if correctly capturing the notion of 'dogness',
should be able to identify this new object, while models of previously seen dog breeds
(Afghan, Beagle and Collie) correctly fail to recognize the new object.
3 Incongruent events: algorithms
Our definition for incongruent events in the previous section is indeed unified, but as a result quite
abstract. In this section we discuss two different algorithmic implementations, one generative and
one discriminative, which were developed for the part membership and class membership hierarchies
respectively (see definition in Section 1). In both cases, we use the notation Q(x) for the class
probability as defined above, and p(x) for the estimated probability.
3.1 Part membership - a generative algorithm
Consider the left panel of Fig. 1. The event in the top node is incongruent if its probability is low,
while the probability of all its descendants is high.
In many applications, such as speech recognition, one computes the probability of events (sentences)
based on a generative model (corresponding to a specific language) which includes a dictionary of
parts (words). At the top level the event probability is computed conditional on the model; in this
case the parts are typically assumed to be independent, and the event probability is computed as
the product of the part probabilities conditioned on the model. For example, in speech processing
and assuming a specific language (e.g., English), the probability of the sentence is typically
computed by multiplying the probability of each word using an HMM model trained on sentences
from that language. At the bottom level, the probability of each part is computed independently
of the generative model.
More formally, consider an event $u$ composed of parts $w_k$. Using the generative model of events
and assuming the conditional independence of the parts given this model, the prior probability of the
event is given by the product of prior probabilities of the parts,
$$p(u|L) = \prod_k p(w_k|L) \qquad (3)$$
where $L$ denotes the generative model (e.g., the language).
For measurement $X$, we compute $Q(X)$ as follows
$$Q(X) = p(X|L) = \sum_u p(X|u,L)\, p(u|L) \approx p(X|\bar{u},L)\, p(\bar{u}|L) = p(X|\bar{u}) \prod_k p(w_k|L) \qquad (4)$$
using $p(X|u,L) = p(X|u)$ and (3), and where $\bar{u} = \arg\max_u p(u|L)$ is the most likely interpretation.
At the risk of notation abuse, $\{w_k\}$ now denote the parts which compose the most likely event
$\bar{u}$. We assume that the first sum is dominated by the maximal term.
Given a part-membership hierarchy, we can use (1) to compute the probability $Q^g(X)$ directly,
without using the generative model $L$.
$$Q^g(X) = p(X) = \sum_u p(X|u)\, p(u) \approx p(X|\bar{u})\, p(\bar{u}) = p(X|\bar{u}) \prod_k p(w_k) \qquad (5)$$
It follows from (4) and (5) that
$$\frac{Q(X)}{Q^g(X)} \approx \prod_k \frac{p(w_k|L)}{p(w_k)} \qquad (6)$$
We can now conclude that $X$ is an incongruent event according to our definition if there exists at
least one part $k$ in the final event $\bar{u}$, such that $p(w_k) \gg p(w_k|L)$ (assuming all other parts have
roughly the same conditional and unconditional probabilities). In speech processing, a sentence is
incongruent if it includes an incongruent word - a word whose probability based on the generative
language model is low, but whose direct probability (not constrained by the language model) is high.
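The ratio of the model-constrained and direct probabilities decomposes into one term per part, so the offending part can be located by inspecting the individual terms. The sketch below works in the log domain; the probabilities are made up, and the gap threshold (a factor of 10) is an assumption of this example.

```python
import math

def log_ratio_per_part(p_given_model, p_direct):
    """Per-part terms log p(w_k|L) - log p(w_k).

    Their sum is the logarithm of the product of per-part ratios.
    """
    return [math.log(pl) - math.log(pd) for pl, pd in zip(p_given_model, p_direct)]

def incongruent_parts(p_given_model, p_direct, min_log_gap=math.log(10.0)):
    """Indices k for which p(w_k) exceeds p(w_k|L) by at least the assumed factor."""
    terms = log_ratio_per_part(p_given_model, p_direct)
    return [k for k, t in enumerate(terms) if -t >= min_log_gap]

# Made-up example: the second word is unlikely under the language model
# but well supported when scored without it.
p_lm = [0.2, 0.001, 0.3]
p_free = [0.25, 0.2, 0.3]
flagged = incongruent_parts(p_lm, p_free)
```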
Example: Out Of Vocabulary (OOV) words
For the detection of OOV words, we performed experiments using a Large Vocabulary Continuous
Speech Recognition (LVCSR) system on the Wall Street Journal Corpus (WSJ). The evaluation
set consists of 2.5 hours. To introduce OOV words, the vocabulary was restricted to the 4968 most
frequent words from the language training texts, leaving the remaining words unknown to the model.
A more detailed description is given in [7].
In this task, we have shown that the comparison between two parallel classifiers, based on strong
and weak posterior streams, is effective for the detection of OOV words, and also for the detection
of recognition errors. Specifically, we use the derivation above to detect out of vocabulary words,
by comparing their probability when computed based on the language model, and when computed
based on mere acoustic modeling. The best performance was obtained by the system when a Neural
Network (NN) classifier was used for the direct estimation of frame-based OOV scores. The network
was directly fed by posteriors from the strong and the weak systems. For the WSJ task, we achieved
performance of around 11% Equal-Error-Rate (EER) (Miss/False Alarm probability), see Fig. 2.
Figure 2: Several techniques used to detect OOV: (i) Cmax: Confidence measure computed ONLY from
strongly constrained Large Vocabulary Continuous Speech Recognizer (LVCSR), with frame-based posteriors.
(ii) LVCSR+weak features: Strongly and weakly constrained recognizers, compared via the KL-divergence
metric. (iii) LVCSR+NN posteriors: Combination of strong and weak phoneme posteriors using NN classifier.
(iv) all features: fusion of (ii) and (iii) together.
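To make the comparison of strong and weak posterior streams concrete, a frame-level divergence score can be sketched as follows. This is not the system described in [7]: the direction of the KL divergence, the smoothing constant and the toy posteriors are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions over the same phoneme set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def frame_scores(strong, weak):
    """One score per frame: divergence of the weak posteriors from the strong ones.

    strong, weak: lists of per-frame phoneme posterior vectors of equal length.
    """
    return [kl_divergence(w, s) for s, w in zip(strong, weak)]

# Two frames over a two-phoneme alphabet: the recognizers agree on the first
# frame and disagree on the second.
strong = [[0.9, 0.1], [0.9, 0.1]]
weak = [[0.9, 0.1], [0.1, 0.9]]
scores = frame_scores(strong, weak)
```

Frames where the two streams disagree receive a high score and are candidates for belonging to an out-of-vocabulary word.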
3.2 Class membership - a discriminative algorithm
Consider the right panel of Fig. 1. The general class in the top node is incongruent if its probability
is high, while the probability of all its sub-classes is low. In other words, the classifier of the
parent object accepts the new observation, but all the children object classifiers reject it. Brute
force computation of this definition may follow the path taken by traditional approaches to novelty
detection, e.g., looking for rejection by all one class classifiers corresponding to sub-class objects.
The results we obtained by this method were mediocre, probably because generative models are
not well suited for the task. Instead, it seems that discriminative classifiers, trained to discriminate
between objects at the sub-class level, could be more successful. We note that unlike traditional
approaches to novelty detection, which must use generative models or one-class classifiers in the
absence of appropriate discriminative data, our dependence on object hierarchy provides discrimi-
native data as a by-product. In other words, after the recognition by a parent-node classifier, we may
use classifiers trained to discriminate between its children to implement a discriminative novelty
detection algorithm.
Specifically, we used the approach described in [8] to build a unified representation for all objects
in the sub-class level, which is the representation computed for the parent object whose classifier
had accepted (positively recognized) the object. In this feature space, we build a classifier for each
sub-class based on the majority vote between pairwise discriminative classifiers. Based on these
classifiers, each example (accepted by the parent classifier) is assigned to one of the sub-classes, and
the average margin over classifiers which agree with the final assignment is calculated. The final
classifier then uses a threshold on this average margin to identify each object as known sub-class or
new sub-class. Previous research in the area of face identification can be viewed as an implicit use
of this proposed framework, see e.g. [9].
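The decision rule described above (majority vote over pairwise classifiers, then a threshold on the average margin of the agreeing classifiers) might be sketched as follows. The input format, the tie-breaking rule and the function names are assumptions of this example; the margins would come from whatever pairwise classifiers are trained in the shared feature space.

```python
from collections import Counter

def assign_and_score(pairwise_margins):
    """pairwise_margins maps (class_i, class_j) to a signed margin:
    positive favours class_i, negative favours class_j.

    Returns the majority-vote label and the average absolute margin over the
    pairwise classifiers that voted for that label.
    """
    votes = Counter()
    winners = {}
    for (ci, cj), margin in pairwise_margins.items():
        winner = ci if margin >= 0 else cj
        winners[(ci, cj)] = winner
        votes[winner] += 1
    # Ties are broken by class name, an arbitrary choice made for this sketch.
    label = max(votes, key=lambda c: (votes[c], c))
    agreeing = [abs(m) for pair, m in pairwise_margins.items() if winners[pair] == label]
    return label, sum(agreeing) / len(agreeing)

def is_new_subclass(pairwise_margins, threshold):
    """Flag the example as a new sub-class when the average margin is below the threshold."""
    _, avg_margin = assign_and_score(pairwise_margins)
    return avg_margin < threshold
```

A confident assignment to a known sub-class yields large agreeing margins, whereas an example from an unseen sub-class tends to fall near the decision boundaries and produces a small average margin.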
Example: new face recognition from audio-visual data
We tested our algorithm on audio-visual speaker verification. In this setup, the general parent cate-
gory level is the ‘speech’ (audio) and ‘face’ (visual), and the different individuals are the offspring
(sub-class) levels. The task is to identify an individual as belonging to the trusted group of individ-
uals vs. being unknown, i.e. known sub-class vs. new sub-class in a class membership hierarchy.
The unified representation of the visual cues was built using the approach described in [8]. All
objects in the sub-class level (different individuals) were represented using the representation learnt
for the parent level ('face'). For the audio cues we used the Perceptual linear predictive (PLP)
Cepstral features [10] as the unified representation. We used SVM classifiers with RBF kernel as the
pairwise discriminative classifiers for each of the different audio/visual representations separately.
Data was collected for our experiments using a wearable device, which included stereo panoramic
vision sensors and microphone arrays. In the recorded scenario, individuals walked towards the
device and then read aloud an identical text; we acquired 30 sequences with 17 speakers (see Fig. 3
for an example). We tested our method by choosing six speakers as members of the trusted group,
while the rest were assumed unknown.
The method was applied separately using each one of the different modalities, and also in an in-
tegrated manner using both modalities. For this fusion the audio signal and visual signal were
synchronized, and the winning classification margins of both signals were normalized to the same
scale and averaged to obtain a single margin for the combined method.
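The fusion step can be illustrated with a short sketch. The text does not specify how the margins were brought to the same scale, so min-max normalisation is used here purely as an assumption; the function names are illustrative.

```python
def min_max_normalise(values):
    """Rescale a list of margins to [0, 1]; a constant list maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def fuse_margins(audio_margins, visual_margins):
    """Average the normalised winning margins of synchronised audio and visual frames."""
    a = min_max_normalise(audio_margins)
    v = min_max_normalise(visual_margins)
    return [(x + y) / 2.0 for x, y in zip(a, v)]

# Made-up margins for three synchronised frames.
fused = fuse_margins([0.0, 5.0, 10.0], [1.0, 2.0, 3.0])
```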
Since the goal is to identify novel incongruent events, true positive and false positive rates were
calculated by considering all frames from the unknown test sequences as positive events and the
known individual test sequences as negative events. We compared our method to novelty detection
based on one-class SVM [3] extended to our multi-class case. Decision was obtained by comparing
the maximal margin over all one-class classifiers to a varying threshold. As can be seen in Fig. 3,
our method performs substantially better in both modalities as compared to the “standard” one class
approach for novelty detection. Performance is further improved by fusing both modalities.
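The evaluation protocol, with unknown individuals treated as positive events, can be sketched as a sweep over thresholds on the margin. The data below are made up, and the convention that a frame is flagged as novel when its margin falls below the threshold is an assumption consistent with the decision rule of Section 3.2.

```python
def roc_points(margins, is_unknown):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep.

    margins: one average margin per frame.
    is_unknown: True for frames of unknown individuals (the positive events).
    """
    n_pos = sum(is_unknown)
    n_neg = len(is_unknown) - n_pos
    points = []
    for thr in sorted(set(margins)) + [float("inf")]:
        flagged = [m < thr for m in margins]
        tp = sum(f and u for f, u in zip(flagged, is_unknown))
        fp = sum(f and not u for f, u in zip(flagged, is_unknown))
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up margins: the two unknown frames have the smallest margins.
points = roc_points([0.1, 0.2, 0.8, 0.9], [True, True, False, False])
```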
4 Summary
Unexpected events are typically identified by their low posterior probability. In this paper we em-
ployed label hierarchy to obtain a few probability values for each event, which allowed us to tease
apart different types of unexpected events. In general there are 4 possibilities, based on the classi-
fiers’ response at two adjacent levels:
    Specific level   General level   Possible reason
1   reject           reject          noisy measurements, or a totally new concept
2   reject           accept          incongruent concept
3   accept           reject          inconsistent with partial order, models are wrong
4   accept           accept          known concept
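The four cases can be expressed as a small lookup, shown here only to make the case analysis explicit; the function name `interpret` is illustrative.

```python
def interpret(specific_accepts, general_accepts):
    """Map the accept/reject decisions at two adjacent levels to a possible reason."""
    if not specific_accepts and not general_accepts:
        return "noisy measurements, or a totally new concept"
    if not specific_accepts and general_accepts:
        return "incongruent concept"
    if specific_accepts and not general_accepts:
        return "inconsistent with partial order, models are wrong"
    return "known concept"
```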
[Plot: True positive rate vs. False positive rate, both axes from 0 to 1. Curves: audio, visual, audio-visual, audio (OC-SVM), visual (OC-SVM).]
Figure 3: Left: Example: one frame used for the visual verification task. Right: True Positive vs. False
Positive rates when detecting unknown vs. trusted individuals. The unknown are regarded as positive events.
Results are shown for the proposed method using both modalities separately and the combined method (solid
lines). For comparison, we show results with a more traditional novelty detection method using One Class
SVM (dashed lines).
We focused above on the second type of events - incongruent concepts, which have not been studied
previously in isolation. Such events are characterized by some discrepancy between the response of
two classifiers, which can occur for a number of different reasons:
Context: in a given context such as the English language, a sentence containing a Czech word is
assigned low probability. In the visual domain, in a given context such as a street scene, otherwise
high probability events such as "car" and "elephant" are not likely to appear together.
New sub-class: a new object has been encountered, of some known generic type but unknown specifics.
We described how our approach can be used to design new algorithms to address these problems,
showing promising results on real speech and audio-visual facial datasets.
References
[1] Markou, M., Singh, S.: Novelty detection: a review - part 1: statistical approaches. Signal Processing 83 (2003) 2499–2521
[2] Markou, M., Singh, S.: Novelty detection: a review - part 2: neural network based approaches. Signal Processing 83 (2003) 2481–2497
[3] Scholkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.: Support vector method for novelty detection. In: Proc. NIPS. Volume 12. (2000) 582–588
[4] Lanckriet, G.R.G., Ghaoui, L.E., Jordan, M.I.: Robust novelty detection with single-class MPM. In: Proc. NIPS. Volume 15. (2003) 929–936
[5] Berns, G.S., Cohen, J.D., Mintun, M.A.: Brain regions responsive to novelty in the absence of awareness. Science 276 (1997) 1272–1275
[6] Rokers, B., Mercado, E., Allen, M.T., Myers, C.E., Gluck, M.A.: A connectionist model of septohippocampal dynamics during conditioning: Closing the loop. Behavioral Neuroscience 116 (2002) 48–62
[7] Burget, L., Schwarz, P., Matejka, P., Hannemann, M., Rastrow, A., White, C., Khudanpur, S., Hermansky, H., Cernocky, J.: Combination of strongly and weakly constrained recognizers for reliable detection of OOVs. In: Proceedings of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP). (2008)
[8] Bar-Hillel, A., Weinshall, D.: Subordinate class recognition using relational object models. Proc. NIPS 19 (2006)
[9] Lanitis, A., Taylor, C.J., Cootes, T.F.: A unified approach to coding and interpreting face images. In: Proc. ICCV. (1995) 368–373
[10] Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America 87 (1990) 1738
8
... In spite of its importance, the topic is little researched. Recently, a new theoretical framework has emerged [7], that defines rareness as an incongruence compared to the prior knowledge of the system. The model has shown to work on several applications, from audio-visual persons identification [7] to detection of incongruent human actions [5]. ...
... Recently, a new theoretical framework has emerged [7], that defines rareness as an incongruence compared to the prior knowledge of the system. The model has shown to work on several applications, from audio-visual persons identification [7] to detection of incongruent human actions [5]. ...
... For the experiments reported in Section 4.1 and 4.3 we used subsets of the Caltech-256 database [4] together with the features described in [3], available on the authors' website 1 . For the experiments reported in Section 4.2 we used the audio-visual database and features described in [7] using only the face images. All the experiments are defined as "object vs background" where the background corresponds respectively to the Caltech-256 clutter class and to a synthetically defined non-face, obtained scrumbling the face feature vector elements. ...
Article
Full-text available
Within the context of detection of incongruent events, an often overlooked aspect is how a system should react to the detection. The set of all the possible actions is certainly conditioned by the task at hand, and by the embodiment of the artificial cognitive system under consideration. Still, we argue that a desirable action that does not depend from these factors is to update the internal model and learn the new detected event. This paper proposes a recent transfer learning algorithm as the way to address this issue. A notable feature of the proposed model is its capability to learn from small samples, even a single one. This is very desirable in this context, as we cannot expect to have too many samples to learn from, given the very nature of incongruent events. We also show that one of the internal parameters of the algorithm makes it possible to quantitatively measure incongruence of detected events. Experiments on two different datasets support our claim.
... Uncertainty based stopping criteria based on Renyi entropy H α (p) family with α ≥ 0 includes, as special cases, Shannon entropy (in the limit as α → 1), and confidence thresholding (in the limit as α → ∞) methods. In contrast to these uncertainty based stopping criteria, With a diminishing-returns perspective, Weinshall [14] and Geisser [8] apply a threshold to the Kullback-Leibler (KL) divergence between two consecutive posterior distributions to terminate evidence collection: [15] proposes a chained-KL divergence to monitor posterior progression in the probability simplex. Banerjee [16] proposes using Bregman divergences in the context of clustering. ...
Preprint
Systems that are based on recursive Bayesian updates for classification limit the cost of evidence collection through certain stopping/termination criteria and accordingly enforce decision making. Conventionally, two termination criteria based on pre-defined thresholds are used: one over (i) the maximum of the state posterior distribution, the other over (ii) the state posterior uncertainty. In this paper, we propose a geometric interpretation of the state posterior progression and accordingly provide a point-by-point analysis of the disadvantages of using such conventional termination criteria. For example, through the proposed geometric interpretation we show that confidence thresholds defined over the maximum of the state posteriors suffer from stiffness that results in unnecessary evidence collection, whereas uncertainty-based thresholding methods are fragile to the number of categories and terminate prematurely if some state candidates are already discovered to be unfavorable. Moreover, both types of termination methods neglect the evolution of posterior updates. We then propose a new stopping/termination criterion with a geometrical insight to overcome the limitations of these conventional methods and provide a comparison in terms of decision accuracy and speed. We validate our claims using simulations and real experimental data obtained through a brain-computer interface typing system.
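The two conventional termination criteria named in this abstract can be illustrated with a minimal sketch. It is not the geometric criterion the preprint proposes; the function names, the default thresholds, and the choice of Shannon entropy (the α → 1 member of the Rényi family) as the uncertainty measure are all assumptions made for illustration.

```python
import math


def bayes_update(prior, likelihoods):
    """One recursive Bayesian step: posterior is proportional to prior * likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha(p) in nats.

    alpha == 1 gives Shannon entropy; alpha == inf gives -log(max p),
    which is equivalent to thresholding the maximum posterior.
    """
    p = [x for x in p if x > 0]
    if alpha == 1:
        return -sum(x * math.log(x) for x in p)
    if math.isinf(alpha):
        return -math.log(max(p))
    return math.log(sum(x ** alpha for x in p)) / (1 - alpha)


def collect_evidence(prior, evidence_stream, conf_threshold=0.9,
                     entropy_threshold=None):
    """Update the posterior until a stopping criterion fires.

    Criterion (i): the maximum posterior reaches conf_threshold.
    Criterion (ii), optional: Shannon entropy drops to entropy_threshold.
    Returns the final posterior and the number of evidence items used.
    """
    posterior = list(prior)
    steps = 0
    for likelihoods in evidence_stream:
        posterior = bayes_update(posterior, likelihoods)
        steps += 1
        if max(posterior) >= conf_threshold:
            break
        if (entropy_threshold is not None
                and renyi_entropy(posterior, 1) <= entropy_threshold):
            break
    return posterior, steps


# Two states, each observation favours state 0 with likelihood ratio 4:1.
posterior, steps = collect_evidence([0.5, 0.5], [[0.8, 0.2]] * 5)
```

With a uniform prior over two states, the posterior for state 0 is 0.8 after one observation and about 0.94 after two, so the confidence criterion stops collection at the second step.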
... Possible solutions for this problem are currently investigated in the European research project Detection and Identification Of Rare Audio-visual Cues (DIRAC). The main idea of the DIRAC approach for event detection is, in contrast to existing holistic models, to make use of the discrepancy between more general and more specific information available about reality [2]. This approach heavily reduces the amount of training data necessary to model rare and incongruent events and can be generalized in such a way that information from different modalities becomes applicable as well. ...
Conference Paper
A catalog of basic audio-visual recordings containing rare and incongruent events for security and in-home-care scenarios for the European research project Detection and Identification of Rare Audio-visual Cues is presented in this paper. The purpose of this catalog is to provide a basic and atomistic testbed to the scientific community in order to validate methods for rare and incongruent event detection. The recording equipment and setup are defined to minimize the influence of errors that might degrade the overall quality of the recordings. Additional metadata, such as a defined format for scene descriptions, comments, labels and physical parameters of the recording setup, is presented as a basis for evaluation of the utilized multimodal detectors, classifiers and combined methods for rare and incongruent event detection. The recordings presented in this work are available online on the DIRAC project website [1].
... Unlike the traditional approach to novelty detection (see [4,5] for recent reviews), we would like to utilize the natural hierarchy of objects, and develop a more selective constructive approach to novel class identification. Our proposed algorithm uses a recently proposed approach to novelty detection, based on the detection of incongruencies [6]. The basic observation is that, while a new object should be correctly rejected by all existing models, it can be recognized at some more abstract level of description. ...
Article
For novel class identification we propose to rely on the natural hierarchy of object classes, using a new approach to detect incongruent events. Here detection is based on the discrepancy between the responses of two different classifiers trained at different levels of generality: novelty is detected when the general level classifier accepts and the specific level classifier rejects. Thus our approach is arguably more robust than traditional approaches to novelty detection, and more amenable to effective information transfer between known and new classes. We present an algorithmic implementation of this approach, show experimental results of its performance, analyze the effect of the underlying hierarchy on the task, and show the benefit of using discriminative information for the training of the specific level classifier.
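The decision rule stated in this abstract (the general level classifier accepts, the specific level classifier rejects) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `classify_event`, the use of scores in [0, 1], the two thresholds, and the example labels are hypothetical and are not taken from the paper.

```python
def classify_event(general_score, specific_scores,
                   accept_threshold=0.5, reject_threshold=0.5):
    """Label one observation using two levels of a class hierarchy.

    general_score: confidence of the general level classifier
        (e.g. "this is a face"), assumed to lie in [0, 1].
    specific_scores: non-empty dict mapping each known sub-class
        (e.g. an individual) to its specific level confidence.
    Returns "rejected", "known:<label>", or "incongruent".
    """
    # The general level rejects: the observation is not an object of interest.
    if general_score < accept_threshold:
        return "rejected"
    # The general level accepts: look for a known sub-class that also accepts.
    best_label, best_score = max(specific_scores.items(), key=lambda kv: kv[1])
    if best_score >= reject_threshold:
        return "known:" + best_label
    # General accepts while every specific classifier rejects: incongruent event.
    return "incongruent"


novel = classify_event(0.9, {"alice": 0.1, "bob": 0.2})
known = classify_event(0.9, {"alice": 0.8, "bob": 0.1})
noise = classify_event(0.2, {"alice": 0.1, "bob": 0.2})
```

Keeping the two thresholds separate lets the general level be permissive while the specific level stays strict, which is one plausible way to trade false novelty alarms against missed new sub-classes.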
... In this paper we only focus on abnormality predictions. Abnormality prediction has been extensively explored in temporal domains [35,29,14,33]. Our focus in this paper is on abnormalities of objects in images. ...
Conference Paper
When describing images, humans tend not to talk about the obvious, but rather mention what they find interesting. We argue that abnormalities and deviations from typicalities are among the most important components that form what is worth mentioning. In this paper we introduce the abnormality detection as a recognition problem and show how to model typicalities and, consequently, meaningful deviations from prototypical properties of categories. Our model can recognize abnormalities and report the main reasons of any recognized abnormality. We also show that abnormality predictions can help image categorization. We introduce the abnormality detection dataset and show interesting results on how to reason about abnormalities.
... Our chosen framework for anomaly detection is that advocated in [8,13] which distinguishes outliers from anomalies via the disparity between a generalised context classifier (when giving a low confidence output) and a combination of 'specific-level' classifiers (generating a high confidence output). The classifier disparity leading to the anomaly detection can equally be characterised as being between strongly constrained (contextual) and weakly constrained (non-contextual) classifiers [2]. ...
Chapter
A key question in machine perception is how to adaptively build upon existing capabilities so as to permit novel functionalities. Implicit in this are the notions of anomaly detection and learning transfer. A perceptual system must firstly determine at what point the existing learned model ceases to apply, and secondly, what aspects of the existing model can be brought to bear on the newly defined learning domain. Anomalies must thus be distinguished from mere outliers, i.e. cases in which the learned model has failed to produce a clear response; it is also necessary to distinguish novel (but meaningful) input from misclassification error within the existing models. We thus apply a methodology of anomaly detection based on comparing the outputs of strong and weak classifiers [8] to the problem of detecting the rule-incongruence involved in the transition from singles to doubles tennis videos. We then demonstrate how the detected anomalies can be used to transfer learning from one (initially known) rule-governed structure to another. Our ultimate aim, building on existing annotation technology, is to construct an adaptive system for court-based sport video annotation.
... We now show how the hierarchical model paves the way for a more sophisticated and efficient analysis. Since the hierarchy consists of a set of more general and more specific models, we can apply the anomaly reasoning as proposed in [19]. To this end, we first need to determine if an observation is well described by a certain node in the hierarchy. ...
Article
Figure 1: In videos, each frame strongly correlates with its neighbors. Our approach exploits this fact and enables the segmentation of the video and the interpretation of unseen sequences.
Temporal consistency is a strong cue in continuous data streams and especially in videos. We exploit this concept and encode temporal relations between consecutive frames using discriminative slow feature analysis. Activities are automatically segmented and represented in a hierarchical coarse-to-fine structure. Simultaneously, they are modeled in a generative manner, in order to analyze unseen data. This analysis supports the detection of previously learned activities and of abnormal, novel patterns. Our technique is purely data-driven and feature-independent. Experiments validate the approach in several contexts, such as traffic flow analysis and the monitoring of human behavior. The results are competitive with the state-of-the-art in all cases.
Article
A common trend in machine learning and pattern classification research is the exploitation of massive amounts of information in order to achieve an increase in performance. In particular, learning from huge collections of data obtained from the web, and using multiple features generated from different sources, have led to a significant boost in performance on problems that have been considered very hard for several years. In this thesis, we present two ways of using this information to build learning systems with robust performance and some degree of autonomy. These ways are Cue Integration and Cue Exploitation, and constitute the two building blocks of this thesis. In the first block, we introduce several algorithms to answer the research question of how to optimally integrate multiple features. We first present a simple online learning framework, a wrapper algorithm based on the high-level integration approach in the cue integration literature. It can be implemented with existing online learning algorithms, and preserves the theoretical properties of the algorithms being used. We then extend the Multiple Kernel Learning (MKL) framework, where each feature is converted into a kernel and the system learns the cue integration classifier by solving a joint optimization problem. To make the problem practical, we have designed two new regularization functions that make it possible to optimize the problem efficiently. This results in the first online method for MKL. We also show two algorithms to solve the batch problem of MKL. Both of them have a guaranteed convergence rate. These approaches achieve state-of-the-art performance on several standard benchmark datasets, and are orders of magnitude faster than other MKL solvers. In the second block, we present two examples of how to exploit information between different sources, in order to reduce the effort of labeling a large amount of training data.
The first example is an algorithm to learn from partially annotated data, where each data point is tagged with a few possible labels. We show that it is possible to train a face classification system from data gathered from the Internet, without any human labeling, by automatically generating lists of possible labels from the captions of the images. Another example is under the transfer learning setting. The system uses existing models from potentially correlated tasks as experts, and transfers their outputs over the incoming samples of a new learning task where very few labeled data are available, to boost the performance.
Article
Learning from sensory patterns associated with different kinds of sensors is paramount for biological systems, as it permits them to cope with complex environments where events rarely appear twice in the same way. In this paper we investigate how perceptual categories formed in one modality can be transferred to another modality in biological and artificial systems. We first present a study on Mongolian gerbils that shows clear evidence of transfer of knowledge for a perceptual category from the auditory modality to the visual modality. We then introduce an algorithm that mimics the behavior of the rodents within the online learning framework. Experiments on simulated data produced promising results, showing the pertinence of our approach.
Article
The healthcare system is in crisis due to challenges including escalating costs, the inconsistent provision of care, an aging population, and high burden of chronic disease related to health behaviors. Mitigating this crisis will require a major transformation of healthcare to be proactive, preventive, patient-centered, and evidence-based with a focus on improving quality-of-life. Information technology, networking, and biomedical engineering are likely to be essential in making this transformation possible with the help of advances, such as sensor technology, mobile computing, machine learning, etc. This paper has three themes: 1) motivation for a transformation of healthcare; 2) description of how information technology and engineering can support this transformation with the help of computational models; and 3) a technical overview of several research areas that illustrate the need for mathematical modeling approaches, ranging from sparse sampling to behavioral phenotyping and early detection. A key tenet of this paper concerns complementing prior work on patient-specific modeling and simulation by modeling neuropsychological, behavioral, and social phenomena. The resulting models, in combination with frequent or continuous measurements, are likely to be key components of health interventions to enhance health and wellbeing and the provision of healthcare.
Conference Paper
The minimax probability machine (MPM) considers a binary classification problem, where the mean and covariance matrix of each class are assumed to be known. Without making any further distributional assumptions, the MPM minimizes the worst-case probability 1 − α of misclassification of future data points. However, the validity of the upper bound 1 − α depends on the accuracy of the estimates of the real but unknown means and covariances. First, we show how to make this minimax approach robust against certain estimation errors: for unknown but bounded means and covariance matrices, we guarantee a robust upper bound. Secondly, the robust minimax approach for supervised learning is extended in a very natural way to the unsupervised learning problem of quantile estimation: computing a minimal region in input space where at least a fraction α of the total probability mass lives. Mercer kernels can be exploited in this setting to obtain nonlinear regions. Positive empirical results are obtained when comparing this approach to single class SVM and a 2-class SVM approach.
Conference Paper
We address the problem of sub-ordinate class recognition, like the distinction between different types of motorcycles. Our approach is motivated by observations from cognitive psychology, which identify parts as the defining component of basic level categories (like motorcycles), while sub-ordinate categories are more often defined by part properties (like 'jagged wheels'). Accordingly, we suggest a two-stage algorithm: First, a relational part-based object model is learnt using unsegmented object images from the inclusive class (e.g., motorcycles in general). The model is then used to build a class-specific vector representation for images, where each entry corresponds to a model's part. In the second stage we train a standard discriminative classifier to classify subclass instances (e.g., cross motorcycles) based on the class-specific vector representation. We describe extensive experimental results with several subclasses. The proposed algorithm typically gives better results than a competing one-step algorithm, or a two-stage algorithm where classification is based on a model of the sub-ordinate class.
Conference Paper
Suppose you are given some dataset drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified ν between 0 and 1. We propose a method to approach this problem by trying to estimate a function f which is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. We provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.
Conference Paper
This paper addresses the detection of OOV segments in the output of a large vocabulary continuous speech recognition (LVCSR) system. First, standard confidence measures from frame-based word- and phone-posteriors are investigated. Substantial improvement is obtained when posteriors from two systems, one strongly constrained (LVCSR) and one weakly constrained (phone posterior estimator), are combined. We show that this approach is also suitable for detection of general recognition errors. All results are presented on the WSJ task with reduced recognition vocabulary.
Article
Septohippocampal interactions determine how stimuli are encoded during conditioning. This study extends a previous neurocomputational model of corticohippocampal processing to incorporate hippocamposeptal feedback and examines how the presence or absence of such feedback affects learning in the model. The effects of septal modulation in conditioning were simulated by dynamically adjusting the hippocampal learning rate on the basis of how well the hippocampal system encoded stimuli. The model successfully accounts for changes in behavior and septohippocampal activity observed in studies of the acquisition, retention, and generalization of conditioned responses and accounts for the effects of septal disruption on conditioning. The model provides a computational, neurally based synthesis of prior learning theories that predicts changes in medial septal activity based on the novelty of stimulus events.
Conference Paper
Face images are difficult to interpret because they are highly variable. Sources of variability include individual appearance, 3D pose, facial expression and lighting. We describe a compact parametrised model of facial appearance which takes into account all these sources of variability. The model represents both shape and grey-level appearance and is created by performing a statistical analysis over a training set of face images. A robust multi-resolution search algorithm is used to fit the model to faces in new images. This allows the main facial features to be located and a set of shape and grey-level appearance parameters to be recovered. A good approximation to a given face can be reconstructed using fewer than 100 of these parameters. This representation can be used for tasks such as image coding, person identification, pose recovery, gender recognition and expression recognition. The system performs well on all the tasks listed above.
Article
Novelty detection is the identification of new or unknown data or signals that a machine learning system is not aware of during training. Novelty detection is one of the fundamental requirements of a good classification or identification system, since the test data sometimes contains information about objects that were not known at the time of training the model. In this paper we provide a state-of-the-art review of novelty detection based on statistical approaches. The Part 2 paper details novelty detection using neural networks. As discussed, there are a multitude of applications where novelty detection is extremely important, including signal processing, computer vision, pattern recognition, data mining, and robotics.
Article
Novelty detection is the identification of new or unknown data or signals that a machine learning system is not aware of during training. In this paper we focus on neural network-based approaches for novelty detection. Statistical approaches are covered in the Part 1 paper.
Article
A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A 5th-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic-speech recognition.
Article
Brain regions responsive to novelty, without awareness, were mapped in humans by positron emission tomography. Participants performed a simple reaction-time task in which all stimuli were equally likely but, unknown to them, followed a complex sequence. Measures of behavioral performance indicated that participants learned the sequences even though they were unaware of the existence of any order. Once the participants were trained, a subtle and unperceived change in the nature of the sequence resulted in increased blood flow in a network comprising the left premotor area, left anterior cingulate, and right ventral striatum. Blood flow decreases were observed in the right dorsolateral prefrontal and parietal areas. The time course of these changes suggests that the ventral striatum is responsive to novel information, and the right prefrontal area is associated with the maintenance of contextual information, and both processes can occur without awareness.