IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 15, NO. 5, OCTOBER 2007
Semantic Understanding of General Linguistic Items
by Means of Fuzzy Set Theory
Richard Khoury, Fakhri Karray, Senior Member, IEEE, Yu Sun, Mohamed Kamel, Fellow, IEEE, and
Otman Basir, Member, IEEE
Abstract—Modern statistical techniques used in the field of natural language processing are limited in their applications by the fact they suffer from the loss of most of the semantic information contained in text documents. Fuzzy techniques have been proposed as a way to correct this problem through the modelling of the relationships between words while accommodating the ambiguities of natural languages. However, these techniques are currently either restricted to modelling the effects of simple words or are specialized in a single domain. In this paper, we propose a novel statistical-fuzzy methodology to represent the actions described in a variety of text documents by modelling the relationships between subject-verb-object triplets. The research will focus in the first place on the technique used to accurately extract the triplets from the text, on the necessary equations to compute the statistics of the subject-verb and verb-object pairs, and on the formulas needed to interpolate the fuzzy membership functions from these statistics and on those needed to defuzzify the membership value of unseen triplets. Taken together, these sets of equations constitute a comprehensive system that allows the quantification and evaluation of the meaning of text documents, while being general enough to be applied to any domain. In the second phase, this paper will proceed to experimentally demonstrate the validity of our new methodology by applying it to the implementation of a fuzzy classifier conceived especially for this research. This classifier is trained using a section of the Brown Corpus, and its efficiency is tested with a corpus of 20 unseen documents drawn from three different domains. The positive results obtained from these experimental tests confirm the soundness of our new approach and show that it is a promising avenue of research.
Manuscript received July 22, 2005; revised November 15, 2005.
The authors are with the Pattern Analysis and Machine Intelligence Laboratory, Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada.
Digital Object Identifier 10.1109/TFUZZ.2006.889817

I. INTRODUCTION

IN THIS paper, we present a new approach to model the relationship between words. It is related to our previous work. Our fundamental intuition is that most of the semantic information of a sentence lies in the action described by that sentence. For example, the meaning of the sentence “the excited man drives the red car” is best captured by “man drives car” rather than “excited man” or “red car.” Consequently, we focus on modelling relationships between words where those relationships represent actions. In more precise terms, our focus will be on modelling the relationships between the subject, verb, and object of the sentence. Our methodology assumes no prior knowledge of the text to be
analyzed, save that most of it follows the subject-verb-object
structure, which is a fundamental sentence structure in English
discourse. Consequently, our approach remains general, unlike
the many domain-specific solutions proposed in the natural
language processing (NLP) literature reviewed in Section II.
To develop our model, we propose to use a hybrid statistical-fuzzy methodology in order to represent the subject-verb-object triplets. As we will show, this methodology computes the membership of the noun-verb pairs in the various domains. The membership of a given triplet can then be inferred based on its two composing pairs, thus allowing the system to know not only what domain the triplet belongs to but also how confident it is of that knowledge. The statistical-fuzzy representation of the noun-verb pairs is computed in various domains using information extracted from domain-specific training corpora. This allows our methodology either to be specialized to a particular domain or to remain general through the use of a general corpus such as the Brown Corpus.
Moreover, this paper seeks to demonstrate the usefulness
of our statistical-fuzzy methodology by applying it to create
a cross-domain text classifier. The remainder of this paper is
organized as follows. The next section gives an overview of
statistical and fuzzy NLP techniques. Section III then presents
the details of our statistical-fuzzy technique as well as those
of the classifier. In Section IV, we give some classification
results that are then thoroughly studied in Section V and which
confirm the classifier’s usefulness. The concluding remarks are
summarized in Section VI.
II. OVERVIEW AND BACKGROUND
Modern studies in the field of NLP often rely on a statistical approach. The assumption underlying this approach, as well as other approaches in which a document is reduced to a set of keywords selected by a certain metric, is that such a reduced representation is sufficient for most tasks. These approaches have been labelled “bag of word” techniques because, in a sense, they treat the document like a bag of words, by making the assumption that the order of the words is irrelevant and their frequency is all that matters. Despite their successes, these approaches have several drawbacks, the most prominent of which is that they discard all syntactic and semantic information contained in the text. Without this information, word synonymy and polysemy in the English language become difficult problems to handle. For example, knowing that the word “bank” occurs in a document does not tell us if the topic is finance or rivers, and knowing that the
word “duck” occurs is meaningless if we do not know if it de-
scribes the animal or the action.
In order to bypass the limits of the bag-of-word approach, some researchers have enhanced it by disambiguating the semantic meaning of words using a variety of machine-readable dictionaries, such as WordNet, the Oxford Advanced Learner's Dictionary of Current English, or the Funk and Wagnalls Dictionary. The first such disambiguation technique became known as the Lesk algorithm, after its creator. Lesk proposed to compare the dictionary definitions of the words in the text. In the case of words with several definitions, the correct sense is taken to be the one whose definition has the greatest overlap with the definitions of the other words. In essence, Lesk pushed the bag-of-words approach down one level: instead of treating the document as a bag of words, he replaced the words of the document with their dictionary definitions and treated those definitions as bags of words. This simple setup can correctly recognize on average 50% of the ambiguous words it encounters, but when enhanced with collocates or specialized dictionaries such as those mentioned above, its success rate can reach 70%.
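To make the definition-overlap idea concrete, the following sketch (ours, not the authors') implements a simplified Lesk-style procedure in Python, using WordNet through the NLTK toolkit as the machine-readable dictionary; the function and variable names are illustrative, and the gloss-overlap scoring shown is the textbook simplification of Lesk rather than the exact variant evaluated above.

# Simplified Lesk-style disambiguation: pick the sense whose dictionary
# definition (gloss) shares the most words with the surrounding context.
# Assumes NLTK and its WordNet data are installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def simplified_lesk(target_word, context_words):
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(target_word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)      # size of the bag-of-words overlap
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Example: "bank" in a river context.
print(simplified_lesk("bank", "the fisherman sat on the muddy bank of the river".split()))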
Other researchers have instead introduced syntactic knowledge into their systems by using predefined syntactic patterns. The basic methodology consists in matching a sentence to a pattern, then using words from that sentence to fill slots in the pattern. A set of conditions is defined for each slot, and must be met by the word considered for that slot. More advanced techniques have been used to evolve the patterns and the conditions, by taking one sentence as an initial strict pattern and generalizing it to encompass similar sentences in a training corpus. However, the use of predefined patterns implies a very precise a priori knowledge of the document being analyzed, and as such is inherently limited to domain-specific applications.
In order to represent syntactic and semantic information
without being limited by domain-specific assumptions, some
studies have explored the possibility of modelling the relation-
ship between individual words –. Indeed, modelling
the relationships between all parts of a sentence may require
specialized knowledge, but the relationships between individual
nouns, adjectives, and adverbs are mostly independent of con-
text. However, because of the vagueness of natural languages,
as in the case of English, for example, those relationships are
rarely defined in a clear fashion, and cannot be easily repre-
sented using statistical tools. For this reason, researchers have
turned their attention to fuzzy logic.
In his paper on the foundations of fuzzy logic, Zadeh states that fuzzy logic has four principal facets. The first facet is the set-theoretic one. In this perspective, fuzzy logic is used
to define fuzzy sets, or sets with noncrisp boundaries. As a
matter of fact, this aspect of fuzzy logic is the first one that re-
searchers focused on when they initially explored this field, and
it is the one most current applications of fuzzy logic rely on.
However, fuzzy logic can also be seen, from a second angle,
as a multiple-valued logical system with its own set of infer-
ence rules. These rules allow it to represent and manipulate in-
formation that is uncertain or partially true. A third aspect of
fuzzy logic is its epistemic facet, which is related to the log-
ical-system perception. This facet, however, focuses rather on
knowledge applications, such as knowledge representation and
information systems, where the knowledge is incomplete or un-
certain. Finally, there is a relational aspect to fuzzy logic, which
is mainly concerned with the representation and manipulation
of fuzzy relations. It is this last facet that deals with the impre-
cise or uncertain relationships that exist between crisp or fuzzy
elements. In practice, these fuzzy relations are most commonly
represented as fuzzy if-then rules. This ability to deal with im-
precision and approximate reasoning that is intrinsic to fuzzy
logic, and more importantly the usefulness of this characteristic
in handling uncertain relationships, makes fuzzy logic an ideal
tool to model and handle the vagueness and uncertainty of nat-
ural languages .
Researchers have focused on modelling the impact of adverbs on the meaning of associated words in a sentence, using fuzzy sets. A good deal of this work was pioneered by Zadeh. His initial research showed that adverbs could be modelled as operators acting on fuzzy sets. He went on to define a series of fuzzy operators and to develop the theoretical groundwork needed to match specific adverbs to the appropriate operations. These accomplishments were the first steps on the road towards the fuzzy representation and evaluation of the meaning of words. This work was, however, limited to representing adverbs. Representing the meaning of nouns is a far more arduous task. In fact, numerous psychologists and linguists have studied the notion of meaning without coming to a single definitive definition of this concept and what it means to humans. For his part, Zadeh
observed that most words, such as “green,” or sentences, such
as “large integers,” do not have a clear definition but rather have
an imprecise and subjective meaning. As such, he proposed to
quantify the meaning of these words as fuzzy subsets of a uni-
verse of discourse and developed the basics of a mathematical
framework to represent and evaluate the value of these subsets. He later expanded on this concept by creating a new fuzzy method for representing the meaning of words in a natural language, known as PRUF. According to Zadeh, PRUF is to fuzzy
logic what predicate calculus is to binary logic. That is to say,
it is a methodology to translate natural-language premises into
PRUF expressions, which can be manipulated and transformed
using fuzzy rules of inference in order to yield new PRUF
expressions, which can then be translated back to natural
language as conclusions inferred from the original premises.
PRUF not only differs from other methods of representing
knowledge because it uses fuzzy logic but also improves
on other fuzzy logic–based approaches by making the basic
assumption that the vagueness intrinsic to natural languages is
possibilistic in nature, as opposed to the probabilistic nature
that is assumed by other meaning representation techniques.
By taking possibility distributions as its starting point, the
PRUF methodology allows for a uniform treatment of the
true meaning, probable meaning, and possible meaning of
propositions and makes it possible to manipulate them all in
a manner similar to predicate calculus .
Zadeh’s next breakthrough followed from his theory of fuzzy
information granulation (TFIG). Starting from the idea that human reasoning operates on granules of information rather than on precise values, Zadeh extended fuzzy logic, and by so doing laid the foundation for TFIG. It is clear that this theory can be easily applied to NLP, by defining words as fuzzy granules. Then, by combining TFIG with his previous work on generalized constraints, Zadeh was able to create the concept of precisiated natural language (PNL). A statement in PNL is composed of fuzzy granules, and the relationships between them are governed by the rules of generalized constraints. Furthermore, it is possible to manipulate PNL statements translated from natural languages, with the help of operators defined in the framework, and to implement inference rules, which can be applied to extract new information from a database of PNL statements and to answer queries. So far, however, PNL has dealt mainly with simple words such as “very,” “somewhat,” “red,” and “tall,” and simple sentences such as “most Swedes are honest” or “getting there takes about an hour,” and it is still far from handling all but the simplest of natural sentences.
More recently, researchers have considered the possibility of
representing semantic information in the form of fuzzy rela-
tions. This differs from Zadeh’s approach in that it does not at-
tempt to represent the meaning or the effect of a word directly
with fuzzy sets but rather models the word’s meaning in terms
of relationships with other words, and uses fuzzy logic to quan-
tify the degree of those relationships. The resulting database of
words and relationships is usually called a fuzzy thesaurus, be-
cause the act of looking up a word in this database will return
a list of words with a related meaning. In one study, researchers built a fuzzy thesaurus of soccer-related words using the following three relationships.
1) Equivalence: This relationship means that two words are synonymous.
2) Inclusion: The asymmetric relationship that represents the case where one word is a generalization of another, more specific word.
3) Association: This relationship means that two words are related in meaning without one being a synonym or a generalization of the other.
Another study took a similar approach while focusing on affect-related information, such as emotions, feel-
ings, attitudes, temperaments, humors, frame of minds, moods,
spirits, morale, and dispositions. Their intuition was that, since
affect-related information is pervasive in all human documents
and is critical in human thinking, then modelling this infor-
mation will help extract important information from the texts,
which can then be usefully integrated to all types of NLP appli-
cations. They went on to define 83 basic affect categories and to
create a fuzzy thesaurus of affect-related words using these two
1) Centrality: This denotes how much a word belongs to a
2) Intensity: This denotes the strength of the word in the cat-
However, fuzzy thesauri such as those presented above re-
main a means of representing words within a specific domain,
and as such are of limited usefulness in most general, context-
III. PROPOSED APPROACH
As explained above, the focus of this paper is twofold. First,
it introduces our original hybrid statistical-fuzzy representation
of actions based on subject-verb-object triplets. Secondly, it demonstrates the usefulness of this representation by applying it to implement a general fuzzy text classifier. The classifier is
designed to recognize specific text domains that are present in
a training corpus. The training corpus is simply composed of
unannotated texts and of their matching domain. Once trained,
the system will accurately classify unseen texts in one of the
domains it was trained for and will also give a measure of the
certainty of its classification.
The training data used in this paper come from the Brown
Corpus . This data source is a corpus of American English
written texts compiled in 1961. It is composed of 500 sample
documents, selected to reflect the spread of domains Americans
were reading at about that time. Documents in the corpus thus
cover a wide range of topics, from news coverage to religious
texts, from industrial reports to detective fiction. To set off the
research on our classifier, we have decided to limit our scope
initially to three domains of the Brown Corpus:
1) business domain: samples A26-A28 in the Brown Corpus;
2) medical domain: samples J13-J17;
3) science-fiction domain: samples M01-M06.
It should be noted, however, that the approach presented in
this paper can be expanded to cover all the domains indexed in
the Brown Corpus. In fact, one of the reasons motivating the
choice of this corpus is the fact that it is a general corpus that
covers a wide variety of domains. Thus, once the classifier is expanded to the entire corpus, it will be able to handle
texts from most domains.
A. Fuzzy Text Classification
Although the statistical-fuzzy methodology we develop
in this research is general enough to be applied to any NLP
problem, we have decided to apply it in this paper only to
the implementation of a text classifier. Text classification is a
problem that has been studied extensively over the past few
years, as reported in –, and several approaches have
been adopted to tackle this complex issue.
Most of the techniques of text classification proposed so far
in the literature rely on a document vector to represent the texts
to classify. Such a vector may simply contain the frequency of each word in the text or, in a more sophisticated way, it may quantify relevant text features. The document vector is then compared with vectors representing archetypal documents of each domain. Several methods designed to perform this comparison exist. In the first of these, the entire vector is compared to all vectors in the training corpus and
the k most similar vectors retained. The domains represented
by those k vectors are then evaluated by performing a weighted
sum of their corresponding vectors, and the domain with the
highest score is assumed to be the correct one. The main draw-
backs of this approach arise from the fact that the comparison
of the entire vector to the entire training corpus simultaneously
imposes inherently high requirements in terms of memory and
computation. Other comparison techniques have been proposed
to circumvent this problem. For example, using a decision tree
elements of the vector one by one in a specified order and to ter-
minate the search once it encounters a combination of features
that only occurs in a particular domain. This technique, how-
ever, is subject to its own drawbacks, namely, the difficulty of extracting the relevant features and
the problem of overfitting. This problem occurs when a deci-
sion tree has been built using overly specific, irrelevant, or co-
incidental features of the domains found in the training corpus.
In these cases, it generates a very accurate classification on the
training data but has a very poor performance on unseen test documents.
Given the noncrisp and unstructured nature of natural lan-
guages, as is the case of English, for example, it seems only
natural to use fuzzy logic tools to deal with text classification.
Some researchers have explored this possibility, and in so doing
they have come up with some interesting classifiers. One such
system worth noting is based on a fuzzy similarity metric .
At its core, this system is an implementation of the Rocchio al-
gorithm that clusters the vectors of each domain in the training
corpus and replaces them by a single vector representing the
center of each cluster. The correct domain of a test vector is
then computed as the nearest cluster center using a similarity metric such as the cosine coefficient. Widyantoro and Yen
improved on this algorithm by replacing the crisp values of the
cluster center vectors by fuzzy sets representing the member-
ship of each term to the domain. They also computed the sim-
ilarity between a test vector and the cluster centers using ordi-
nary fuzzy operators. Their results show that using fuzzy sets improves the performance of the classification done with crisp and fuzzy
clustering methods . They also found that there is a notable
improvement in the results when fuzzy clustering is used.
The text classifier we develop in this paper is radically different from those described above, most notably because it does not rely on a document vector at all. As we will show in the
following sections, instead of classifying texts using the fre-
quencies of words or features, our text classifier bases its clas-
sification on the actions described in the text. These actions
are modeled during the training stage of our classifier with our
statistical-fuzzy methodology, which uses subject-verb-object
triplets. In the testing stage, the membership of each triplet of
an unseen test document will be computed in each domain as a
function of the membership of the subject-verb and verb-object
pairs, and a text’s membership in a given domain will be de-
fuzzified from its triplets’ membership in that domain. The cor-
rect domain will be the one with the highest defuzzified value,
and the difference between the values of the correct and incor-
rect domains will allow us to model the certainty of the classi-
fication. This certainty measure will give the system a tool to
distinguish between its correct and incorrect classifications.
B. Triplet Generation and Noun Categories
The first step for both the training and testing stages of
our system is to extract the subject-verb-object triplets from
the training and from the testing documents, respectively.
This task is akin to semantic tagging, which is a problem that
has been extensively studied in the literature, and for which
several interesting solutions have been suggested. For example,
the FrameNet Project  has been proposed to construct a
database that indexes the descriptions of lexical items. These
descriptions include the context in which each item is used, as
well as the semantic role of co-occurring lexical items. This
database would be a valuable tool for several NLP applications
like word sense disambiguation , text understanding ,
and even machine translation provided the database is available
in all the languages involved . However, the construction
of such a database is a time-consuming endeavor that requires
considerable effort on the part of human annotators and re-
viewers . Furthermore, the FrameNet database goes into far
more detail than what is required for our needs. For instance, in
the case of a sentence such as “John purchased a car,” FrameNet
would annotate “buyer purchase good” instead of the simple
“subject purchase object” that we seek. These inherent aspects
of FrameNet disqualify it as a research tool for our purposes.
Another project that matches our needs more closely is that of
PropBank , which is an evolution of the Penn TreeBank
, . The objective of PropBank is to add a layer of
semantic annotation to the Penn TreeBank in order to extract
relational data from the text. The semantic information added,
which is in the form of verb-argument relations, is similar to the
information we seek to extract from our documents. However,
PropBank, like FrameNet, relies on a manual annotation of the
data, which is not a procedure suitable for our usage. It also
makes use of verb classes , which is something we wish to
avoid for reasons that will be explained later.
Numerous other small-scale semantic annotation projects
also exist in the research framework of the Semantic Web .
But, given that the manual annotation of the Internet is an
unthinkable task, these projects invariably focus on the automa-
tion of the annotation process. To accomplish this automation,
the projects then rely on various preexisting ontologies, such
as the ontology of concepts used by , the ontology of an-
notated poems proposed in , the sharable domain ontology
encouraged by , or the bridge-ontology designed to link
multiple ontologies and presented in . However, since our
system does not rely on a predefined ontology, such approaches
cannot be used.
In view of the limits of existing semantic annotation systems
and the specific needs of our research, we have developed an
appropriate system specifically designed to identify the sub-
jects, verbs, and objects in each sentence. To be sure, the most
accurate results can be obtained only if the triplet extraction
is done manually. However, an automated system can provide
fairly accurate results nonetheless. In the system developed for
this paper, the triplets are extracted from the text automatically,
with a simple setup in which we use a part-of-speech tagger to
identify the verbs and nouns of each sentence. We rely on an
implementation of the Brill tagger  for this task. The Brill
tagger is a simple yet efficient tagger that operates by assigning
each word its most commonly used part-of-speech tag and then
uses nine transformation rules to correct common tagging mis-
takes. Brill reports that this system achieves a tagging accuracy
of 95.6% . Once the tagging is known, we can extract the
triplets. We do so by assuming that, for each verb in the sentence, the possible subjects are among the nouns preceding it and the possible objects are among all the nouns following it. Some simple syntactic heuristics are
then used to remove obviously incorrect triplets, such as those
involving the verb “prepared” in the sentence, “The company is
prepared to buy the stock.” In that example, the triplet extracted
would be “company buy stock.” It is also worth noting that the
verb “is” in the example will not be marked as a regular verb
by the Brill tagger but as an auxiliary verb, and that we ignore
auxiliary verbs when extracting the triplets.
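As an informal illustration of this extraction step, the sketch below approximates it with an off-the-shelf part-of-speech tagger (NLTK's Penn Treebank tagger rather than the Brill implementation used in the paper); the heuristics shown (nearest preceding noun as subject, first following noun as object, auxiliaries skipped, no lemmatization or noun-category mapping) are our reading of the description above, not the authors' exact rules.

# Naive subject-verb-object triplet extraction from POS tags.
# Assumes NLTK with the 'punkt' and 'averaged_perceptron_tagger' data installed.
import nltk

AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being", "has", "have", "had"}

def extract_triplets(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)                    # [(word, Penn tag), ...]
    triplets = []
    for i, (word, tag) in enumerate(tagged):
        if tag.startswith("VB") and word.lower() not in AUXILIARIES:
            subjects = [w for w, t in tagged[:i] if t.startswith("NN")]
            objects = [w for w, t in tagged[i + 1:] if t.startswith("NN")]
            if subjects and objects:
                # Nearest preceding noun as subject, first following noun as object.
                triplets.append((subjects[-1], word, objects[0]))
    return triplets

print(extract_triplets("The excited man drives the red car"))
# typically: [('man', 'drives', 'car')]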
In this paper, we make the assumption that the most infor-
mative part of an action is the act itself, rather than the ob-
jects performing the act or being acted upon, and rather than the
time (past, present, or future) in which the act takes place. Con-
sequently, we give more importance to the verb of the triplet,
rather than the subject or the object, and ignore the verb tense.
For all practical purposes, we eliminate the verb tense by sub-
stituting all verbs with their infinitive form, and we use these in-
finitive verbs in the triplets directly. In the same vein, nouns are
replaced with noun categories. It is also because of the impor-
tance we place on verbs that we choose not to make use of verb
classes, such as those of  and . Indeed, while, for ex-
ample, Levin considers that verbs such as consult, debate, meet,
and visit all belong to a single class , we believe the differ-
ences between them carry valuable information that should not be discarded.
Another reason for not keeping nouns in the triplets is that,
given the rich English lexicon, retaining them would cause a
data sparseness problem. For this reason, it is necessary to in-
stead represent nouns with more general categories in which
they belong. However, grouping nouns in categories has an im-
portant drawback in that some contextual information will be
lost. Indeed, a balance must be struck between overly general
categories that would cause the loss of important contextual in-
formation and overly specific categories that would reintroduce
the data sparseness problem. Moreover, as the categories be-
come more specific, some polysemic nouns may turn out to be-
long to multiple categories. No formal framework exists to deal
with either of these problems.
In the system developed in this paper, each noun is represented by one of 45 categories. Although the system treats all categories in the same manner, we can nevertheless differentiate between two types. The first type consists of 35 specific categories representing well-defined concepts such as finance, justice, or emotion. We list those categories in Table I. The second type consists of ten more general categories designed to give a coarse classification for nouns that do not fit in any category of the first type. These categories are listed in Table II. Although there are fewer categories of the second type than of the first type, each second-type category encompasses more nouns.

The categories presented in Tables I and II are those that were found to strike the best balance between generality and specificity given the training data used in this paper. They do not by any means represent the only possible clustering of these nouns. On the other hand, the problem of polysemy was solved simply by placing nouns in the category representing their most frequently used meaning.
As mentioned above, our system considers the verbs to be the most significant part of the triplet. However, not all verbs are significant. Some very common verbs, such as be and have, carry little significance, yet they will have a major impact on the results by virtue of the fact that they are used very frequently in all domains. It is thus necessary to detect and filter out these verbs from the triplets. This detection system is based on the concept that the common verbs we are looking for will occur with comparable frequency in every domain. The system therefore considers the frequency of each verb in each domain and computes the relative difference between its least common and most common occurrences. Verbs for which the computed difference is less than 33% are considered too common and are consequently filtered out. By using this approach and the 33% threshold, our system has detected and eliminated the verbs be, become, consist, do, give, have, maintain, make, mean, report, see, serve, and start.
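The filtering rule above can be sketched as follows. Since the exact form of the relative-difference measure is not reproduced in the text, the sketch assumes the common definition (maximum minus minimum, divided by maximum) over a verb's per-domain frequencies; that choice, and the sample frequencies, are our assumptions.

# Filter out verbs whose frequency varies little across domains.
# verb_freq maps verb -> {domain: relative frequency of the verb in that domain}.
def common_verbs(verb_freq, threshold=0.33):
    filtered = set()
    for verb, by_domain in verb_freq.items():
        freqs = list(by_domain.values())
        f_max, f_min = max(freqs), min(freqs)
        if f_max == 0:
            continue
        relative_difference = (f_max - f_min) / f_max   # assumed definition
        if relative_difference < threshold:
            filtered.add(verb)   # too evenly spread: carries little domain information
    return filtered

verb_freq = {"be":     {"business": 0.050, "medicine": 0.048, "scifi": 0.052},
             "invest": {"business": 0.020, "medicine": 0.001, "scifi": 0.002}}
print(common_verbs(verb_freq))   # -> {'be'}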
In the remaining parts of the system, we assume that the
meaning of the subject-verb pair of a triplet is independent from
the object-verb pair, and vice versa. Consequently, they will
be treated as two pairs in the subsequent stages of the system.
To simplify the notation, we will refer to these pairs as the
noun-verb pairs in the remainder of this paper, with the under-
standing that we distinguishbetween a noun used as subject and
a noun used as object.
C. Noun-Verb Statistics
The training process of the system is based on the triplets
representing the training corpus. These triplets are grouped ac-
cording to the domain they came from, and are separated in
noun-verb pairs. The first step of the training process is to rep-
resent the meaning of each noun-verb pair. This is done using a
technique similar to the one proposed by Rieger.
The basis of Rieger’s work is twofold. To start with, he draws
on the notion of situation semantics , which holds that the
meaning of an expression is based on two situations.
1) Discourse Situation: This notion of situation is what al-
lows the expression’s meaning to be interpreted.
2) Described Situation: This notion of situation allows the
expression’s truth-value to be evaluated.
Following situation semantics, it appears that the meaning
of an expression can be discovered by recognizing similarities
and invariants between situations in which the expression ap-
pears. In other words, the important information we must extract is the set of regularities and invariants in the expression's usage across situations. In the second place, Rieger linked the notion of situation to the idea of
language games . To be sure, Wittgenstein had previously
introduced language games, or the contextual-usage-meaning
view, as an explanation for the way children begin to under-
stand and use words. In his theory, children learn not by assim-
ilating the definitions of words but by discovering patterns in
their usage. However, Wittgenstein’s work on language games
was done at the level of philosophical discourse. By linking it to
the formally defined notion of situation semantics, Rieger laid
the theoretical basis for an empirical approach to learn word
meaning using its observed usage in a normal text.
Starting from the assumption that the analysis of a number of
texts in order to isolate regularities in a word's usage can reveal that word's meaning, Rieger
developed an empirical model to discover these regularities. In
fact, since Rieger wanted his model to rely only on observable
regularities in the text, he was in essence trying to isolate lex-
ical items that are regularly associated with the target word and
may give insight into its meaning. The core of his methodology,
which we have adapted for our purposes, was a two-level ab-
straction process that produced a set of usage regularities and a
set of meaning points.
TABLE I
35 SPECIFIC CATEGORIES USED IN OUR SYSTEM

To extract the usage regularities, Rieger used a modified correlation coefficient. This coefficient computes the interdependence between any two lexical items on the basis of their frequencies in the texts of the training corpus. Rieger then used this coefficient to measure the interdependence between the target word whose meaning he wanted to discover and any other relevant word. For the purpose of this paper, however, the correlation analysis must be more specific. Indeed, since we want to find the meaning of noun-verb pairs, we will focus on finding the correlation between the noun and the verb of each pair. The equation we use to compute each noun-verb pair's correlation coefficient is given by

\[
\alpha(n,v) = \frac{\sum_{d=1}^{D}\left(H_{n,d}-E_{n,d}\right)\left(H_{v,d}-E_{v,d}\right)}{\sqrt{\sum_{d=1}^{D}\left(H_{n,d}-E_{n,d}\right)^{2}\,\sum_{d=1}^{D}\left(H_{v,d}-E_{v,d}\right)^{2}}}
\]

where α(n,v) is the correlation coefficient of the pair composed of noun n and verb v; D denotes the total number of domains forming the training corpus; and H(n,d) and H(v,d) are the number of occurrences of noun n and verb v in domain d, respectively. The mean E(x,d) of a word x in domain d is defined as E(x,d) = (H(x)/L) l(d), where H(x) is the total number of occurrences of the word x in the training corpus; l(d) is the length of domain d, or, said differently, the number of triplets representing domain d; and L is the length of the training corpus. Mathematically, noun-verb pairs that have similar occurrences in the training corpus, in the sense that they are both present in or absent from the same texts, have a positive correlation coefficient. These are called affined pairs. On the other hand, noun-verb pairs with different occurrences, in the sense that one word often appears in texts without the other, will have a negative correlation coefficient. Those noun-verb pairs are called repugnant.
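A small numerical sketch of this coefficient is given below; the per-domain counts are invented for illustration, and the formula follows our reading of Rieger's modified correlation coefficient as reconstructed above rather than a verified transcription of the paper's equation.

import math

def correlation(counts_x, counts_y, domain_lengths):
    # counts_x[d], counts_y[d]: occurrences of word x and word y in domain d.
    # domain_lengths[d]: number of triplets in domain d.
    L = sum(domain_lengths.values())
    Hx, Hy = sum(counts_x.values()), sum(counts_y.values())
    num, den_x, den_y = 0.0, 0.0, 0.0
    for d, l_d in domain_lengths.items():
        dx = counts_x.get(d, 0) - Hx * l_d / L      # deviation from expected count
        dy = counts_y.get(d, 0) - Hy * l_d / L
        num += dx * dy
        den_x += dx * dx
        den_y += dy * dy
    return num / math.sqrt(den_x * den_y) if den_x and den_y else 0.0

lengths = {"business": 1000, "medicine": 800, "scifi": 600}
noun = {"business": 40, "medicine": 2, "scifi": 1}    # e.g., a "finance" category noun
verb = {"business": 35, "medicine": 3, "scifi": 2}    # e.g., the verb "invest"
print(round(correlation(noun, verb, lengths), 3))     # close to +1: an affined pair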
TABLE II
TEN GENERAL CATEGORIES USED IN OUR SYSTEM

Once the usage regularities are known, the second level of Rieger's model uses them to extract the meaning points. Rieger defined these meaning points as the difference between the usage regularities of two lexical items. The smaller this difference, the more similar the usage of the two lexical items is, and therefore the closer their meaning must be. This distance can be measured using a Euclidean metric that will compute the difference of usage regularities of a lexical item against all other lexical items. In our methodology, we have computed the semantic similarity Sim(n,v) of each noun-verb pair using the following equation:

\[
\mathrm{Sim}(n,v) = \sqrt{\sum_{w}\left(\alpha(n,w)-\alpha(v,w)\right)^{2}}
\]

where the sum runs over all lexical items w of the training corpus. Its value therefore lies between zero and 2\sqrt{V+N}, where V and N are the number of different verbs and of different nouns in the corpus. We should note, however, that, since this semantic similarity is in fact a measure of difference, noun-verb pairs that are very similar will have a lower value of Sim(n,v) than pairs that are very different.
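Continuing the previous sketch, the Euclidean distance between the correlation profiles of a noun and a verb can be computed as below; the set of lexical items against which the profiles are built is assumed to be all nouns and verbs of the training corpus, per our reading of the text, and the toy correlation table is invented.

import math

def semantic_similarity(alpha, noun, verb, lexicon):
    # alpha(a, b): correlation coefficient between lexical items a and b.
    # Sim is a distance: lower values mean the noun and verb are used more similarly.
    return math.sqrt(sum((alpha(noun, w) - alpha(verb, w)) ** 2 for w in lexicon))

table = {("finance", "invest"): 0.9, ("finance", "cure"): -0.4,
         ("invest", "invest"): 1.0, ("invest", "cure"): -0.3}
alpha = lambda a, b: table.get((a, b), table.get((b, a), 0.0))
print(round(semantic_similarity(alpha, "finance", "invest", ["invest", "cure"]), 3))  # 0.141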
Apart from the two above concepts that we have adapted from Rieger's work, we also need the probability of observing each noun-verb pair in each domain. Indeed, while the semantic similarity measures the relationship between the noun-verb pairs themselves, the probability measures the relationship between each pair and the domains of the training corpus. Both pieces of information will be crucial in the subsequent stages of the system, where we build the fuzzy membership functions. The probability of observing each noun-verb pair in each domain is computed by simply calculating the frequency of occurrences of each noun-verb pair in each domain d, and dividing it by the total of frequencies of subject-verb pairs or object-verb pairs in that domain, as follows:

\[
P(n,v \mid d) = \frac{f(n,v,d)}{\sum_{(n',v')} f(n',v',d)} \qquad (4)
\]

where f(n,v,d) is the number of occurrences of the pair (n,v) in domain d and the sum in the denominator runs over all subject-verb pairs (respectively, all object-verb pairs) of that domain.
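The per-domain pair probabilities of (4) reduce to relative frequencies. A minimal sketch, assuming the triplets have already been split into (noun category, verb) pairs per domain for one pair type (subject-verb or object-verb):

from collections import Counter

def pair_probabilities(pairs_by_domain):
    # pairs_by_domain: {domain: [(noun_category, verb), ...]}.
    # Returns {domain: {pair: probability}} within each domain.
    probs = {}
    for domain, pairs in pairs_by_domain.items():
        counts = Counter(pairs)
        total = sum(counts.values())
        probs[domain] = {pair: c / total for pair, c in counts.items()}
    return probs

sv_pairs = {"business": [("organization", "buy"), ("organization", "buy"), ("person", "sell")],
            "medicine": [("person", "treat")]}
print(pair_probabilities(sv_pairs)["business"][("organization", "buy")])   # 2/3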
D. Fuzzy Membership Functions
If the training process were to end at this point, or possibly with a smoothing algorithm applied to the statistics computed in (4), the system would be a regular statistical classifier. To become a hybrid statistical-fuzzy classifier, the system needs to build fuzzy membership functions from these statistics. To this end, the system generates a two-dimensional (2-D) graph for each noun-domain combination, where the x-axis of the graph is the noun-verb similarity and the y-axis is the noun-verb probability in that domain. The system can then position each verb in every graph with its appropriate coordinates and is ready to compute an equivalent fuzzy membership function to represent this data.
Studies have shown that the shape and tuning of the membership functions have a significant impact on the performance of a fuzzy controller. Regarding the shape, we have opted for a trapezoid membership function, based on the assumption that verbs at the extremities of the 2-D graph should carry less weight in our decision than those in the center. The tuning of the membership function's parameters was done based on each graph's similarity and probability values, as we will show in the following. The top plateau of the trapezoid function is centered, along the x-axis of the graph, on the average similarity ASim of the noun-verb pairs, which is computed as follows:

\[
\mathrm{ASim} = \frac{1}{N}\sum_{k=1}^{N}\mathrm{Sim}_{k}
\]

where N is the number of noun-verb pairs positioned in the graph and Sim_k is the similarity of the kth pair.
The height of the plateau will be the normalized average probability (AP) of all noun-verb pairs in that region. It is necessary to normalize that measure because the high number of noun-verb pairs whose probability is zero would otherwise drag the height of all the membership functions near zero. These noun-verb pairs come from two sources: domain-dependent verbs of one domain that do not appear in other domains and rare verbs that occur once in the training corpus, thus generating a probability of 1 in one domain and 0 in all others. Obviously, the latter source will skew the results, but it must be dealt with in a way that does not affect the former source, which contains some very important information. The solution is to give each verb v a weight depending on its frequency in the training corpus. This verb weight VW(v) is computed in two steps. In the first step, we compute the relative frequency RF(v) of the verb by dividing its frequency in the corpus by the total frequency of all verbs:

\[
\mathrm{RF}(v) = \frac{f(v)}{\sum_{u} f(u)}.
\]

Since there are several thousand verbs in total, that relative frequency will be quite close to zero. To avoid this result, the relative frequency is normalized by dividing it by the most frequent verb's relative frequency:

\[
\mathrm{VW}(v) = \frac{\mathrm{RF}(v)}{\max_{u}\mathrm{RF}(u)}.
\]

Thus, the most frequent verbs will have a VW close to 1, while verbs that seldom appear in the corpus will have a VW around 0. In the second step, we compute the normalized AP. We define the normalized AP as the sum of the probabilities of the noun-verb pairs in the region of the plateau divided by the total VW of those verbs. This is given mathematically as

\[
\mathrm{AP} = \frac{\sum_{k=1}^{N_{p}} P_{k}}{\sum_{k=1}^{N_{p}} \mathrm{VW}(v_{k})}
\]

where N_p is the number of elements contained within the plateau region, P_k is the probability of the kth noun-verb pair of that region, and v_k is its verb. The ramp sections of the trapezoid extend to three standard deviations on each side of the mean. A graphical representation of this setup is shown in Fig. 1, while examples of actual 2-D graphs and their membership functions are shown in Fig. 2, with the noun lifeform acting as subject in all three domains the system is trained for.
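As an informal illustration of the construction just described, the sketch below builds the trapezoid parameters for one noun-domain graph: a plateau centered on the average similarity, ramps reaching three standard deviations on each side of the mean, and a plateau height equal to the average probability normalized by the verb weights. The plateau half-width (one standard deviation here) and the handling of empty plateaus are our assumptions, since those details are not reproduced above.

import statistics

def build_trapezoid(pairs, verb_weight):
    # pairs: list of (verb, similarity, probability) for one noun in one domain.
    # verb_weight: {verb: VW in [0, 1]} as defined above.
    sims = [s for _, s, _ in pairs]
    mean, std = statistics.mean(sims), statistics.pstdev(sims)
    a, d = mean - 3 * std, mean + 3 * std          # feet of the ramps
    b, c = mean - std, mean + std                  # plateau boundaries (assumed width)
    plateau = [(v, p) for v, s, p in pairs if b <= s <= c]
    total_vw = sum(verb_weight.get(v, 0.0) for v, _ in plateau)
    height = sum(p for _, p in plateau) / total_vw if total_vw else 0.0
    return a, b, c, d, height

def membership(x, trapezoid):
    # Piecewise-linear trapezoid evaluated at similarity value x.
    a, b, c, d, h = trapezoid
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return h
    return h * (x - a) / (b - a) if x < b else h * (d - x) / (d - c)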
E. Text Classification
Once the system is fully trained, it can be used to classify an
unseen test document. This classification can be done by com-
puting the membership of each of the triplets of the document in
every domain and summing all these membership values. The
correct class will be the one with the highest membership de-
gree. Furthermore, as we will show in Section IV, we can define
the certainty of the classification as the difference between the
highest and second highest membership values.
Fig. 1. Graphical representation of a membership function.
The membership M_d of a text in domain d is given by

\[
M_{d} = \sum_{t=1}^{T} \mu_{d}(t) \qquad (10)
\]

where T is the number of triplets in the test document and μ_d(t) is the membership of triplet t in domain d, written in (11) as a combination of four quantities. The first is P(v_t | d), the value of the verb of triplet t (i.e., verb v_t), defined as the number of occurrences of that verb in the domain divided by the total number of occurrences of all verbs in that domain. The second and third, noted SV and OV in (11), are the defuzzified values of the subject-verb and object-verb pairs, respectively; we will show how to compute them in the next section. Finally, P(d) is the probability of the domain, defined as the number of triplets belonging to that domain in the training corpus divided by the total number of triplets in the training corpus (or L, the length of the training corpus).

Although a triplet's meaning is dependent on its verb, subject, and object, we have assumed in Section III-B that the meanings of the subject-verb and object-verb pairs are independent of each other. For that reason we can compute the values of SV and OV independently, then combine them together in (11) to get the value of the corresponding triplet.
The values of SV and OV need to be defuzzified from the appropriate membership functions in order to be used in (11). Numerous defuzzification techniques have been proposed and tested in the literature, and though the centroid method is generally considered more effective than the other methods, it is recognized that different problems and different function sets call for different defuzzification methods. More specifically, the centroid method is not appropriate for our problem, and we will develop the theoretical justifications of this fact and demonstrate it experimentally.
Fig. 2. Two-dimensional graphs and membership functions for the noun lifeform as subject in the domain (top) business, (middle) medicine, and (bottom) science-
fiction. Each dot represents a verb.
The problem with the centroid defuzzification technique is that it reduces each membership function to its centroid point, and then uses only the similarity coordinate of that point to compute the SV and OV values. In so doing, it loses too much information to be able to give an accurate measure. Since our methodology cannot afford such information loss, we have decided to defuzzify the membership functions by computing their area, using a process that we outline here and present graphically in Fig. 3. The process begins by performing the α-cut of the noun's membership function at the height corresponding to the noun-verb similarity abscissa coordinate [Fig. 3(a)]. The cropped membership functions of the subject-verb pair and object-verb pair of each triplet are then merged together using an AND function [Fig. 3(b)], and the height of the resulting triplet membership function is multiplied and divided by the same factors as in (11) [Fig. 3(c)]. The triplet's membership function is defuzzified by computing its area in order to obtain the crisp membership value of the triplet [Fig. 3(d)], and the values of all the triplets are summed up to obtain the membership of the document in a domain. This procedure is then repeated for all domains independently.

Fig. 3. Illustration of the defuzzification technique.
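Putting the pieces together, a sketch of the scoring loop is given below, operating on membership functions sampled over a common similarity grid. Where the paper leaves the combination in (11) implicit, the sketch simply multiplies the merged heights by the verb and domain probabilities; that multiplicative scaling, and the use of min as the AND operator, are assumptions on our part. The remaining steps follow Fig. 3: an α-cut of each pair's trapezoid at the membership degree of its similarity value, a merge of the two cropped functions, and the area of the result.

def alpha_cut(curve, level):
    # Clip a sampled membership function (list of heights) at 'level'.
    return [min(h, level) for h in curve]

def area(curve, step):
    # Rectangle-rule area under a sampled membership function.
    return sum(curve) * step

def triplet_membership(sv_curve, sv_level, ov_curve, ov_level, p_verb, p_domain, step):
    # sv_level / ov_level: membership degree of each pair's own similarity value,
    # i.e., the height at which its trapezoid is alpha-cut [Fig. 3(a)].
    sv = alpha_cut(sv_curve, sv_level)
    ov = alpha_cut(ov_curve, ov_level)
    merged = [min(x, y) for x, y in zip(sv, ov)]       # AND of the two pairs [Fig. 3(b)]
    scaled = [h * p_verb * p_domain for h in merged]   # assumed scaling, as in (11) [Fig. 3(c)]
    return area(scaled, step)                          # crisp value by area [Fig. 3(d)]

def classify(triplet_values_by_domain):
    # triplet_values_by_domain: {domain: [crisp value of each triplet of the document]}.
    scores = {d: sum(v) for d, v in triplet_values_by_domain.items()}
    best = max(scores, key=scores.get)
    runner_up = max((v for d, v in scores.items() if d != best), default=0.0)
    certainty = (scores[best] - runner_up) / scores[best] if scores[best] else 0.0
    return best, certainty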
IV. EXPERIMENTAL RESULTS
As mentioned in the previous section, the system has been
trained to recognize documents from three different domains. Following this training, it should be able to classify any unseen document belonging to one or another of these three domains, provided they are not part of the Brown Corpus. The testing corpus we created is largely varied, in terms of both the source of the documents and the length of these documents. It is detailed in Table III.

TABLE III
COMPOSITION OF THE TESTING CORPUS
In order to evaluate our defuzzification method, we have used three supplementary techniques to compute the values of SV and OV, in addition to the one presented in Section III-F. In the first technique, the defuzzified values are the heights of the membership functions obtained after the α-cut from Section III-F. In other words, the membership degree of a triplet in a domain is computed by multiplying these two heights by each other and by the other factors of (11). This first alternative thus replaces the merging of both membership functions with a scalar multiplication and does not take into account the area of the functions at all. It will allow us to verify the usefulness of those extra steps. The second technique is the centroid defuzzification mentioned at the beginning of Section III-F. In this case, the membership degree of a triplet in a domain will be computed as the centroid of the two merged membership functions. This will allow us to prove the point we made in Section III-F, to the effect that the centroid technique is not appropriate for our methodology. As for the third and last technique to compute SV and OV, the membership functions were entirely ignored. In this technique, the values of SV and OV are simply the noun-verb probabilities. This last technique will permit us to see the impact of using membership functions, as opposed to using simple probabilities.

TABLE IV
EXPERIMENTAL RESULTS (SEE TEXT FOR DETAILS)

In order to analyze the three classification techniques discussed above and the one presented in Section III-F, we have generated five interesting statistics that are illustrated in
Table IV. The first statistic, presented in column A of the table,
refers to the number of documents classified correctly by the
system. The second and third statistics are the average relative
difference (ARD) between the membership value of the correct
domain and the membership value of the highest-ranking
incorrect domain, with the standard deviation (STD) of that
difference, computed for the correctly classified documents.
The ARD value appears in column B, and the STD is listed in
column C. The fourth and fifth statistics found in columns D
and E, respectively, are identical to the second and third statis-
tics but are computed for the incorrectly classified documents.
These last four values reflect our confidence in the classification
of the documents. Indeed, in the case of correctly classified
documents, the system has to maximize the difference in order
to indicate a high confidence in the classification, while in the
case of incorrectly classified documents, it has to minimize
it. In both cases, the standard deviation must be as small as
possible for it to indicate that the values of the ARD do not
fluctuate wildly. Also, a low standard deviation coupled with
high and low confidence values would indicate that the groups
of correctly classified and incorrectly classified documents do not overlap. Ideally, the classifier will exhibit a high confidence for the documents it has classified correctly and a low confidence for the documents it has classified incorrectly, and there will be no overlap between these confidence values. This will allow the system to single out the documents classified with a low confidence.
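For completeness, the confidence statistics of Table IV can be reproduced from the classifier's per-document scores as sketched below; taking the relative difference as (top score minus runner-up) divided by the top score matches the proportional-difference reading discussed after the table, but it is our assumption.

import statistics

def confidence_statistics(results):
    # results: list of (true_domain, {domain: membership score}) per test document.
    # Returns count, average relative difference (ARD), and its standard deviation
    # (STD), separately for correctly and incorrectly classified documents.
    groups = {"correct": [], "incorrect": []}
    for true_domain, scores in results:
        ranked = sorted(scores, key=scores.get, reverse=True)
        top, second = ranked[0], ranked[1]
        rd = (scores[top] - scores[second]) / scores[top] if scores[top] else 0.0
        groups["correct" if top == true_domain else "incorrect"].append(rd)
    summary = {}
    for label, values in groups.items():
        ard = statistics.mean(values) if values else 0.0
        std = statistics.pstdev(values) if values else 0.0
        summary[label] = (len(values), ard, std)
    return summary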
Table IV presents these statistics for each of the four techniques used in this paper to compute SV and OV. The values in
that table are expressed in percentages. This particular transfor-
mation is necessary because the membership values, and there-
fore the ARD and STD values, vary greatly from one tech-
nique to the other. Indeed, the area of the membership func-
tions is measured on a scale that can reach in the tens of thou-
sands, whereas the height of the membership functions and the
noun-verb probabilities are measured on a much lower scale
of less than 1. With such differences in the order of magni-
tudes, it becomes quite difficult to make comparisons between
the first technique and the other three. It should be noted, how-
ever,thatthemembership valuesare notimportant; onlythedif-
ference between them is. For example, a document ranked at
0.08 should be quite different from another document ranked at
0.01, whereas a document ranked at 105 should be almost iden-
tical to another one ranked at 104, even though the difference
between their two rankings is 1, as opposed to a difference of
only 0.07 in the rankings of the first two documents. The concept of proportional difference in ranking can give us an insight into these comparisons: the proportional difference in ranking of the first pair of documents reaches 88% ((0.08 − 0.01)/0.08), indicating a dissimilarity between them greater than that between the documents of the second pair, where the proportional difference in ranking stands at only 1% ((105 − 104)/105). For this reason, we will henceforth analyze the classification statistics in terms of proportional difference.
V. COMMENTS AND DISCUSSION
When examining the number of documents correctly classi-
fied, three of the four techniques to compute SV and OV give very
similar results. This similarity is, however, superficial since all
three techniques fail to correctly classify different documents.
Referring to Table III, the first technique fails to correctly clas-
sify documents 5, 7, 10, 14, and 20; the second technique fails
to classify documents 1, 3, 5, 10, and 11; and the fourth tech-
nique fails to classify documents 10, 14, 18, and 20. There is no
common pattern or characteristic to the documents that are mis-
classified. Rather, each technique has its own particular weak-
nesses and a special set of documents for which it fails. Docu-
ments for which several techniques fail, such as document 10,
are only coincidental.
In comparison, using the centroid technique to perform
the defuzzification gives a correct classification for only five
documents. The problem with the centroid technique is that it
minimizes the importance of the pair-domain probabilities (the
y-axis of the trapezoid plots) and focuses on the noun-verb similarities (the x-axis of the trapezoid plots). But since our basic membership function shape is a trapezoid, the merged membership functions will have nearly symmetric shapes, and their centroids will fall near the center of the similarity range with little regard to their height. However, as we can see in Fig. 2, the similarity
of a noun-verb pair is the same in all domains, and its height
is what distinguishes one domain from another. Using the
centroids eliminates almost completely this distinction, leaving
verylittle informationto differentiateone domainfrom another.
The effect of this loss of information becomes apparent when
we consider the low ARD values obtained with the centroid defuzzification for the correctly classified documents. That is why we declared in Section III-F that this
defuzzification technique is inadequate for our methodology.
By contrast, if the defuzzification is performed by computing
the area of the membership function, both the probability
and the similarity of the membership functions are taken into
account. However, in this case, the defuzzification is based on
the range of similarities covered by the trapezoid rather than
its actual similarity value. Yet, by combining both sources of
information, the area defuzzification gives considerably better
results than the centroid defuzzification.
The results obtained from the area defuzzification and those
obtained by using the probabilities directly show some significant differences. The ARD of the correctly classified documents obtained by using the probabilities is significantly higher than the ARD ob-
tained by using the defuzzification, but so is the ARD of the
documents classified incorrectly. Furthermore, the STD of the
documents classifiedincorrectly is much largerwhen the proba-
bilitiesareused.Thesetwo findingsnegatethegains ofhavinga
higher correct ARD. Indeed, the overlap between the regions at
one STD of the ARD value of the correctly and incorrectly clas-
sified documents is around 1.3% with the defuzzification tech-
nique, which is much lower than the 20.4% obtained with the
probabilities, despite the lower ARD values of the former.
The above discussion indicates that it is preferable to use the area defuzzification rather than the probabilities directly, notably because of the high STD of the incorrectly classified documents that the latter produces. This problem
is commonly encountered in statistical NLP and is due to the
sparseness of the training data. Indeed, some rare noun-verb
pairs occur only once or twice in the training data, which gener-
ates 100% spikes of probabilities in one domain, while a large
number of rare noun-verb pairs will not be encountered at all,
which generates a large number of 0% probability points in all
domains. These 100% and 0% probabilities are obviously false,
since no noun-verb combination in English is used only in one
domain, or is never used at all. They only occur because examples of these rare pairs are under-represented in the training corpus.
Instinctively, the solution to this problem seems to consist in
increasing the size of the training corpus in order to include a
representative sample of the rare pairs. However, researchers
in statistical NLP have shown that creating a training corpus
with a representative sample of all possible pairs is impossible,
short of using the whole universe of written texts as a corpus
, . This observation has led to the development of a
number of algorithms to smooth the co-occurrence probabili-
ties, such as Laplace’s law , Lidstone’s law of successions
, the expected likelihood estimation, the held out estimation, and others. These algorithms operate by subtracting a portion of the probability mass of the observed pairs and redistributing it to absent pairs that would otherwise have had a zero prob-
ability. In our system, the construction of the fuzzy member-
ship functions serves as a smoothing algorithm for the probabilities, since it averages the probabilities of the noun-verb pairs according to their similarity, decreasing the 100% spikes and increasing the
zero-probability pairs. This smoothing significantly improves
the results of our methodology.
The results obtained when the defuzzification is performed
using only the heights of the membership functions are very
poor compared to those of the area defuzzification. Indeed, al-
though the results of the area defuzzification technique show a
small overlap between the correctly and incorrectly classified
documents, those of the height defuzzification show that the re-
gion of incorrectly classified documents is entirely contained
within the correctly classified documents. This is due to the fact
that the ARDs of the correctly and incorrectly classified docu-
ments are nearly identical for the height defuzzification. In this
case, there is no distinction between the correctly and incor-
rectly classified documents.
The problem with the height defuzzification is similar to that
of the centroid defuzzification, in that it discards a lot of impor-
tant information from the membership functions. In this case,
however, it is the range of similarities covered by the member-
ship functions that is lost, in favor of putting all the emphasis on
their height. However, two membership functions of the same
height may not be equal. One of them can cover verbs spread across a large range of similarities and be accordingly wide, while the other can cover only a few closely clustered verbs and have, consequently, a very narrow shape. This distinction is lost when only the height of the membership functions is taken into consideration, which leads directly to a deterioration in the results.
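The point can be made concrete with a small sketch. The trapezoid parameters below are hypothetical, and the area expression is the ordinary trapezoid area scaled by the function's height, not necessarily the exact formula of Section III; the two functions have the same height but very different areas, so only the area-based score preserves the distinction.

def trapezoid_area(a, b, c, d, h):
    # Area of a trapezoidal membership function with support [a, d],
    # core [b, c], and height h.
    return 0.5 * h * ((d - a) + (c - b))

wide = (0.0, 0.2, 0.8, 1.0, 0.6)     # covers a wide range of similarities
narrow = (0.4, 0.5, 0.5, 0.6, 0.6)   # same height, very narrow shape

print(wide[4] == narrow[4])                             # True: height alone cannot tell them apart
print(trapezoid_area(*wide), trapezoid_area(*narrow))   # 0.48 versus 0.06: area keeps the distinction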
It is interesting to consider how our domain classifier per-
forms relative to other text classifiers. To this end, we can com-
pare the percentage of test documents correctly classified using
the area defuzzification with some of the results reported in
the literature. We have selected as benchmarks the improved Bayes algorithm proposed by , the sentence-space model classification introduced in , the combined layered clustering and K-means algorithm put forward in , the Bayesian classification approach integrating compound words that was the focus of , and a classifier designed using the feature selection approach of . The percentage of test documents correctly classified by our classifier and by each of these six classifiers is presented in Table V. This comparison shows that our classifier's performance ranks well compared to the others. Its success rate is close
to that of most classifiers in the literature, and even surpasses
some of them. This is a very positive result, considering that
our classifier is still in its prototype stage. As our methodology
becomes more refined and our implementation more sophisti-
cated, we can expect our results to improve further and equal,
maybe even surpass, the current top classifiers.
Finally, it is worth noting that the positive results of the area
defuzzification were obtained despite the fact that a very simple
triplet-extraction setup, described in Section III-B, was used. No additional filtering mechanism has been adopted to filter out some of the erroneous triplets, and this could lead to the presence of a number of these triplets in the training and testing data. However, the erroneous triplets do not appear to degrade the results significantly, which shows that our technique is tolerant to a certain level of imprecision in the
triplets. This resilience may be due to three main factors. First,
the membership function has a normalizing effect on the prob-
ability values, smoothing out the erratic values that these erro-
neous triplets can take. Secondly, the fact that the triplets are decomposed into two noun-verb pairs helps dampen the impact of the errors. To be sure, in the training data, an erroneous triplet may contain a valid pair, whose statistics will be reinforced.
On the other hand, in the testing data, an incorrect triplet may
be made up of an incorrect pair, whose low membership value
will decrease the overall value of the triplet. Lastly, the fact that the system uses only a small set of noun categories limits the impact that the erroneous triplets can have. Indeed, because the small number of categories means that each noun-verb pair is observed very frequently, an erroneous triplet causes only a small relative increase in the corresponding pairs' statistics. In other words, if the number of
erroneous triplets is small relative to that of the correct triplets,
their impact on the statistics will be negligible.
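The second factor can be illustrated with a toy example. The membership values and the min-combination rule below are hypothetical stand-ins for the system's actual scores; the sketch only shows how one low-valued pair pulls down the value of an erroneous triplet in the testing data.

pair_membership = {
    ("man", "purchase"): 0.80,     # valid subject-verb pair
    ("purchase", "house"): 0.75,   # valid verb-object pair
    ("purchase", "table"): 0.05,   # pair produced by an erroneous triplet
}

def triplet_value(subject, verb, obj):
    # Combine the two noun-verb pair memberships; a single incorrect pair
    # is enough to keep the triplet's value low.
    sv = pair_membership.get((subject, verb), 0.0)
    vo = pair_membership.get((verb, obj), 0.0)
    return min(sv, vo)

print(triplet_value("man", "purchase", "house"))  # 0.75: both pairs valid
print(triplet_value("man", "purchase", "table"))  # 0.05: the erroneous pair dominates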
VI. CONCLUSION AND FUTURE WORK
The purpose of this paper is to develop and test a new
methodology for the statistical-fuzzy representation of actions
using subject-verb-object triplets. In line with this objective,
the research has focused on the necessary equations to compute
the statistics of the subject-verb and verb-object pairs, as well
as on the formulas needed to interpolate the fuzzy membership
functions from these statistics and on those needed to defuzzify
the membership value of unseen triplets. Taken together, these
three sets of equations constitute a comprehensive system that
allows the quantification and evaluation of the meaning of text
documents by using an original approach based on the theory
of fuzzy sets. This system of equations draws in part on the
formulas developed by Rieger , ,  to compute the
semantic similarity between words. We have, however, adapted
Rieger’s formulas to fit this research by specializing them to
noun-verb pairs and have added many new features to our
system of equations, namely, those regarding the construction
and handling of the fuzzy membership functions, as well as
those needed to extract the triplets from the text documents.
In the second phase, this paper proceeded to experimentally demonstrate the validity of our new methodology by applying it to the implementation of a fuzzy classifier conceived especially for this research. This application is at the core of our work, and its efficiency was tested with a corpus of 20 unseen documents drawn
from three different domains. The experimental tests show that
15 documents have been correctly classified with an average
degree of confidence of 42%. Even the remaining five incorrectly classified documents exhibit only a weak degree of confidence. The positive results obtained from these experimental tests confirm the soundness of our new approach and show that it is a promising avenue of research. It should be noted, however, that this research is still in its early stages, and more work remains to be done before a more mature application is developed.
On the basis of the encouraging results obtained from this
new technique, it is possible to reflect upon areas of future re-
search. To be sure, this future research should address both the
limits of the new methodology we have proposed and the pos-
sible weaknesses of the classifier we have implemented.
The first methodological point that will require future
scrutiny relates to the fuzzy membership functions generated
in Section III-D. Some studies  have shown that the shape
and tuning of the membership functions play an important role
in the behavior of the fuzzy controller. Yet, our decision to use
a trapezoid membership function was reached intuitively and is
only justified by the fact that it gives good experimental results.
Clearly, this aspect of our methodology should be analyzed in
more depth, and other membership function shapes should be
examined in order to determine which shape is most appro-
priate for our approach. Moreover, as can be seen in Fig. 2, the
membership functions’ maximum value is not always 1, as it
should theoretically be. Some kind of normalization step may
be needed to fix this problem.
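One simple realization of such a normalization step, sketched below under the assumption that a membership function is available as a list of sampled values, is to rescale the whole function by its peak.

def normalize(membership_values):
    # Rescale so that the maximum membership value is exactly 1.
    peak = max(membership_values)
    return [v / peak for v in membership_values] if peak > 0 else membership_values

print(normalize([0.0, 0.3, 0.7, 0.7, 0.2]))  # peak rescaled from 0.7 to 1.0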
The second methodological point that should capture
our attention is the triplet generation stage highlighted in
Section III-B. As mentioned before, the technique adopted in
this paper allows a number of incorrect triplets to be included
in the training and testing data. For example, a sentence such
as “the house was purchased by the man” will yield the triplet
“house purchase man,” which is syntactically accurate. How-
ever, from a semantic point of view, the action described in
that sentence would be better represented by the triplet “man
purchase house.” Unless a human reader manually extracts
the triplets from the corpus, we can expect that such incorrect
triplets, brought about by these or by other unexpected sentence
structures, will always be part of the data. Consequently, future
studies can investigate the impact of these incorrect triplets
on our proposed methodology. In this regard, it would be
interesting to know what percentage of incorrect triplets our
methodology can tolerate, and what impact these incorrect
triplets can have on the accuracy of the results. This particular question would help determine whether it is worthwhile to invest more time and resources into a more accurate triplet extraction system.
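As a hint of what such a more accurate system could look like, the sketch below uses a single hypothetical regular-expression pattern to detect one simple passive construction and swap the subject and object before the triplet is stored. It is not part of the extraction setup of Section III-B, handles only regular "-ed" verbs, and is meant purely as an illustration of one possible remedy.

import re

# Hypothetical pattern for a simple passive sentence: "the X was VERBed by the Y".
PASSIVE = re.compile(r"^the (\w+) was (\w+)d by the (\w+)$", re.IGNORECASE)

def extract_triplet(sentence):
    match = PASSIVE.match(sentence.strip())
    if match:
        obj, verb, subject = match.groups()
        return (subject, verb, obj)  # reorder to reflect the semantic roles
    return None                      # fall back to the normal extraction

print(extract_triplet("The house was purchased by the man"))  # ('man', 'purchase', 'house')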
From an implementation point of view, the triplet extraction
system we introduced in Section III-B presents a number of
clear weaknesses that could be improved upon. First and fore-
most is the fact that there is no formal way to cluster the nouns
in categories. The categories we selected were simply those that
seemed most appropriate given the training corpus. This is a clear weakness, and future work should address it in order to formalize the choice of noun categories. Moreover, it
is necessary to find a way to handle polysemy, or nouns that can
be mapped to several different categories at once. One possible
approach has been alluded to at the end of Section V. There, it
would be possible to consider a number of different triplets, one
for each possible meaning of a noun. This would guarantee that
a triplet that uses the correct sense of the noun will be present.
On the other hand, this solution will also cause a major increase in the number of false triplets in the data. We have shown that our system is resistant to a small number of false triplets; however, the impact
of a number of false triplets of the magnitude we just proposed
remains to be studied.
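The expansion just described could be sketched as follows; the category assignments and the noun "bank" are hypothetical, and the snippet simply enumerates one candidate triplet per combination of senses.

from itertools import product

# Hypothetical mapping from polysemous nouns to candidate categories.
noun_categories = {
    "bank": ["institution", "landform"],
    "money": ["asset"],
}

def expand_triplet(subject, verb, obj):
    subject_senses = noun_categories.get(subject, [subject])
    object_senses = noun_categories.get(obj, [obj])
    return [(s, verb, o) for s, o in product(subject_senses, object_senses)]

print(expand_triplet("bank", "lend", "money"))
# [('institution', 'lend', 'asset'), ('landform', 'lend', 'asset')]

As the paragraph above notes, each polysemous noun multiplies the number of candidate triplets in this way, which is precisely the cost whose impact remains to be studied.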
Another practical improvement that future work should seek
to achieve consists in reaching the ultimate goal mentioned ear-
lier in this paper for the classifier, which is to expand it so that
it can cover the entire Brown Corpus, and thus be able to clas-
sify texts covering all domains. Although the system’s frame-
work is complete, a lot of work remains to be done in order to
integrate in it the entire corpus. To our knowledge, a fuzzy clas-
sifier on such a scale has never been devised. Alternatively, an-
other corpus could be envisioned as training data, in lieu of the
Brown Corpus. Indeed, the Brown Corpus is already 40 years
old, and even though variations in English syntax and text struc-
ture are eliminated when the text is converted into triplets, new
and potentially crucial verbs and nouns that have been added
to the English lexicon these past decades are missing from it.
Moreover, the Brown Corpus consists of only 1 million words; this size has become somewhat limited by today's standards, particularly since the idea of learning to extract a word's meaning from regularities in its usage, on which our methodology is based, requires the analysis of a great number of texts in order to obtain the best results . For this
reason, it is imperative to examine the possibility of replacing
the Brown Corpus with a larger and more up-to-date training corpus.
In addition to expanding the system horizontally by adding
new domains, we should also examine the possibility of ex-
panding it vertically, by adding subdomains to the domains al-
ready implemented. This would transform our classifier into a hierarchical classifier, first making a broad classification and refining it subsequently. With this improvement, the classifier could, for example, begin by analyzing a test document and classify it as belonging to the business domain. Then it
would subclassify the document as a text dealing with a corpo-
rate takeover, and further subclassify it as describing either a
friendly or a hostile takeover. The implementation of this hier-
archical structure would be quite different from the single-level
system architecture presented in this paper and would raise a
number of new concerns. To illustrate, one such concern relates
to the statistics of the new children categories, which could be
computed either in relation to each other or in relation to their
parent category only. Each scheme has its own advantages and
drawbacks, and further efforts are needed to figure out the most appropriate one. Another concern deals with how far down in the children categories the system should try to classify a test document, and what its cutoff conditions to end the classification should be. Most importantly, the level of granularity of the hierarchy, that is, how precise the children categories can get, also becomes a concern.
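A minimal sketch of such a hierarchical classifier is given below. The category tree, the cutoff value, and the keyword-overlap scoring stand in for the real fuzzy classifier and are purely hypothetical; the sketch only illustrates descending the tree level by level and stopping when no child reaches the confidence cutoff.

# Hypothetical domain/subdomain tree.
CATEGORY_TREE = {
    "business": {"corporate takeover": {"friendly takeover": {}, "hostile takeover": {}}},
    "sports": {},
}

def classify_within(document, candidates):
    # Stand-in for the fuzzy classifier restricted to sibling categories:
    # here, a crude keyword-overlap score, for illustration only.
    scores = {c: sum(w in document.lower() for w in c.split()) / len(c.split()) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores[best]

def classify_hierarchically(document, children=CATEGORY_TREE, cutoff=0.3, path=()):
    if not children:
        return path
    label, confidence = classify_within(document, list(children))
    if confidence < cutoff:
        return path  # cutoff condition: stop at the current level
    return classify_hierarchically(document, children[label], cutoff, path + (label,))

print(classify_hierarchically("A hostile takeover shook the business world"))
# ('business', 'corporate takeover', 'hostile takeover')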
Finally, it is likely that this new technique has the potential
to be applied positively in all fields of NLP, from text classifi-
cation to semantic search engines to text labelling and summa-
rization. Although the basic principles presented in this paper
do bear scrutiny, the work has so far focused on one specific application, and it remains to apply the methodology to other NLP tasks in order to make it application-independent and prove its generality.
REFERENCES
 R. E. Madsen, S. Sigurdsson, L. K. Hansen, and J. Larsen, “Pruning
the vocabulary for better context recognition,” in Proc. 2004 IEEE Int.
Joint Conf. Neural Netw., Jul. 25–29, 2004, vol. 2, pp. 1439–1444.
 T. Yoshioka, Y. Takata, M. Ito, and S. Ishii, “A neural visualization
method for WWW document clusters,” in Proc. Int. Joint Conf. Neural
Netw. (IJCNN’01), Jul. 15–19, 2001, vol. 3, pp. 2270–2275.
 K. Aas and L. Eikvil, “Text categorisation: A survey,” Tech. Rep. 941, Norwegian Computing Center, 1999.
 R. Menon, S. S. Keerthi, H. T. Loh, and A. C. Brombacher, “On the
effectiveness of latent semantic analysis for the categorization of call
centre records,” in Proc. 2004 IEEE Int. Eng. Manage. Conf., Oct.
18–21, 2004, vol. 2, pp. 546–550.
 H. Uejima, T. Miura, and I. Shioya, “Improving text categorization by
resolving semantic ambiguity,” in Proc. 2003 IEEE Pacific Rim Conf.
Commun.,Comput. SignalProcess. (PACRIM),Aug.28–30, 2003, vol.
2, pp. 796–799.
 S. Chua and N. Kulathuramaiyer, “Semantic feature selection using
WordNet,” in Proc. IEEE/WIC/ACM Int. Conf. Web Intelligence (WI
2004), Sep. 20–24, 2004, pp. 166–172.
 C. Hung, S. Wermter, and P. Smith, “Hybrid neural document clus-
tering using guided self-organization and WordNet,” in IEEE Intell.
Syst., 2004, vol. 19, no. 2, pp. 68–77.
 M. Lesk, “Automatic sense disambiguation using machine readable
dictionaries: How to tell a pine cone from an ice cream cone,” in Proc.
SIGDOC, 1986, pp. 24–26.
 S. Banerjee and T. Pedersen, “An adapted Lesk algorithm for word sense disambiguation using WordNet,” in Proc. 3rd Int. Conf. Intell. Text Process. Comput. Linguistics (CICLING-02), Feb. 2002, pp. 17–22.
 R. Krovetz and W. B. Croft, “Lexical ambiguity and information re-
trieval,” ACM Trans. Inf. Syst., vol. 10, no. 2, pp. 115–141, 1992.
 G. J. Wilms, “Using an on-line dictionary to extract a list of sense-dis-
ambiguated synonyms,” in Proc. 30th Annu. Southeast Regional Conf.,
1992, pp. 15–22.
 W. Lehnert, “Symbolic/subsymbolic sentence analysis: Exploiting the
best of two worlds,” in Advances in Connectionist and Neural Compu-
tation Theory, J. Barnden and J. Pollack, Eds. Norwood, NJ: Ablex, 1990, pp. 135–164.
 S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert, “CRYSTAL: In-
ducing a conceptual dictionary,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann, 1995, pp. 1314–1319.
 S. G. Soderland, Learning text analysis rules for domain-specific nat-
ural language process Tech. Rep. UM-CS-1996- 087, 1996.
 L. A. Zadeh, “Toward a restructuring of the foundations of fuzzy logic
(FL),” in Proc. 1998 IEEE Int. Conf. Fuzzy Syst. , May 4–9, 1998, vol.
2, pp. 1676–1677.
 L. A. Zadeh, “The roles of soft computing and fuzzy logic in the con-
ception, design and deployment of intelligent system,” in Proc. IEEE
Asia Pacific Conf. Circuits Syst., Nov. 18–21, 1996, pp. 3–4.
 L. A. Zadeh, “A fuzzy-set-theoretic interpretation of linguistic hedges,” J. Cybern., vol. 2, pp. 4–34, 1972.
 L. A. Zadeh, “Quantitative fuzzy semantics,” Inf. Sci., vol. 3, pp. 159–176, 1971.
 L. A. Zadeh, “Pruf and its application to inference from fuzzy propo-
sitions,” in Proc. IEEE Conf. Decision Contr., 1977, pp. 1359–1360.
 L. A. Zadeh, “Pruf-a meaning representation language for natural lan-
guage,” in Fuzzy Reason. Applicat., 1981, pp. 1–66.
 G. Akrivas and G. Stamou, “Fuzzy semantic association of audiovisual
document descriptions,” in Proc. Int.Workshop Very LowBitrate Video
Coding (VLBV), Oct. 2001.
 P. Subasic and A. Huettner, “Affect analysis of text using fuzzy se-
mantic typing,” IEEE Trans. Fuzzy Syst., Aug. 2001.
 W. N. Francis and H. Kucera, “Manual of information to accompany a
standard corpus of present-day edited American English, for use with
digital computers,” Dept. of Linguistics, Brown Univ., Providence, RI.
 B. B. Rieger, “Semiotic cognitive information processing: Learning
to understand discourse. A systemic model of meaning constitution,”
in Perspectives on Adaptivity and Learning, R. Kühn, R. Menzel, W.
Menzel, U. Ratsch, M. M. Richter, and I. O. Stamatescu, Eds. Heidelberg, Germany: Springer, 2003, pp. 347–403.
parative study on different learning methods for fuzzy inference sys-
tems,” in Proc. 3rd Conf. IEEE Conf. Fuzzy Syst., June 1994, vol. 3,
 S. Zhou, T. W. Ling, J. Guan, J. Hu, and A. Zhou, “Fast text
classification: A training-corpus pruning based approach,” in Proc.
8th Int. Conf. Database Syst. Adv. Applicat., Mar. 26–28, 2003,
 S. Hong-bo, S. Zhi-Hai, H. Hou-Kuan, and J. Li-Ping, “Text clas-
sification based on the TAN model,” in Proc. IEEE Region 10 Conf.
Comput., Commun., Contr. Power Eng., Oct. 28–31, 2002, vol. 1, pp.
 C. Apte, P. Damerau, and S. Weiss, “Text mining with decision trees
and decision rules,” in Proc. Conf. Autom. Learning Discovery, Jun.
 D. H. Widyantoro and J. Yen, “A fuzzy similarity approach in text clas-
sification task,” in Proc. 9th IEEE Int. Conf. Fuzzy Syst., May 7–10,
2000, vol. 2, pp. 653–658.
 G. Keswani and L. O. Hall, “Text classification with enhanced semi-supervised fuzzy clustering,” in Proc. IEEE Int. Conf. Fuzzy Syst., May 12–17, 2002, vol. 1, pp. 621–626.
 P. S. Laplace, Philosophical Essay on Probabilities.
 G. J. Lidstone, “Note on the general case of the Bayes-Laplace for-
mulae for inductive or a posteriori probabilities,” Trans. Faculty Actu-
aries, vol. 8, pp. 182–192, 1920.
 G. E. P. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley, 1973.
 F. Jelinek and R. Mercer, “Probability distribution estimation from
sparse data,” in IBM Tech. Disclosure Bull., 1985, vol. 28, pp.
 I. J. Good, “The population frequencies of species and the estimation
of population parameters,” in Biometrika, 1953, vol. 40, pp. 237–264.
 J. Rocchio, “Relevance feedback information retrieval,” in The Smart
Retrieval System—Experiments in Automated Document Processing,
G. Salton, Ed. Englewood Cliffs, NJ: Prentice-Hall, 1971, pp.
 E. Brill, “Transformation-based error-driven learning and natural lan-
guage processing: A case study in part-of-speech tagging,” in Comp.
Linguistics, 1995, vol. 21, pp. 543–565.
 E. Brill, “Unsupervised learning of disambiguation rules for part of
speech tagging,” in Proc. 3rd Workshop Very Large Corpora (WVLC
3), Jun. 30, 1995, pp. 1–13.
 B. B. Rieger, “On understanding understanding. Perception-based pro-
cessing of NL texts in SCIP systems, or meaning constitution as visu-
alized learning,” IEEE Trans. Syst., Man, Cybern. Part C, vol. 34, no.
4, pp. 425–438, 2004.
 B. B. Rieger, “Distributed semantic representation of word meanings,”
in Parallelism, Learning, Evolution, J. D. Becker, I. Eisele, and F.
W. Mundemann, Eds. London, U.K.: Springer-Verlag, 1991, pp.
 J. Barwise and J. Perry, Situations and Attitudes. Cambridge, MA: MIT Press, 1983.
 L. Wittgenstein, The Blue and Brown Books, R. Rhees, Ed. Oxford, U.K.: Blackwell, 1958.
 J. J. Saade and H. B. Diab, “Defuzzification techniques for fuzzy
controllers,” IEEE Trans. Syst., Man, Cybern. B, vol. 30, no. 1, pp.
 G. K. Zipf, Human Behaviour and the Principle of Least Effort.
Reading, MA: Addison-Wesley, 1949.
vol. 10, pp. 1–27, 1954.
 C. F. Baker, C. J. Fillmore, and J. B. Lowe, “The berkeley FrameNet
project,” in Proc. COLING-ACL, Montreal, PQ, Canada, 1998.
 C. J. Fillmore and C. F. Baker, “Frame Semantics for Text Un-
derstanding,” in Proc. WordNet Other Lexical Resources Workshop
(NAACL), Pittsburgh, Jun. 2001.
 H. C. Boas, “BilingualFrameNet dictionaries for machine translation,”
in Proc. 3rd Int. Conf. Language Resources Eval. (LREC-2002), M.
Gonzalez Rodriguez and C. Paz Suarez Araujo, Eds., Las Palmas,
Spain, 2002, vol. IV, pp. 1364–1371.
 C. J. Fillmore, C. F. Baker, and H. Sato, “Seeing Arguments through
Transparent Structures,” in Proc. 3rd Int. Conf. Language Resources
Eval. (LREC-2002), Las Palmas, Spain, 2002, pp. 787–791.
 P. Kingsbury and M. Palmer, “From Treebank to PropBank,” in Proc.
3rd Int. Conf. Language Resources Eval. (LREC-2002), Las Palmas, Spain, 2002.
 M. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large
annotated corpus of English: The Penn TreeBank,” in Comp. Linguis-
tics, 1993, vol. 19, no. 2, pp. 313–330.
predicate-argument structure,” in Proc. ARPA Human Language Tech-
nology Workshop, Princeton, NJ, 1994.
 H. T. Dang, K. Kipper, M. Palmer, and J. Rosenzweig, “Investigating
regular sense extensions based on intersective Levin classes,” in Proc.
Coling-ACL98, Montreal, PQ, Canada, 1998.
 B. Levin, English Verb Classes and Alternations: A Preliminary Investigation. Chicago, IL: Univ. of Chicago Press, 1993.
 I. Navas-Delgado, N. Moreno-Vergara, A. C. Gomez-Lora, M. del Mar
Roldan-Garcia, I. Ruiz-Mostazo, and J. F. Aldana-Montes, “Embed-
ding semantic annotations into dynamic Web contents,” in Proc. 15th
Int. Workshop Database Expert Syst. Applicat., 30 Aug.–3 Sept. 2004,
 P. Wang, B.-W. Xu, J.-J. Lu, D.-Z. Kang, and Y.-H. Li, “A novel ap-
proach to semantic annotation based on multi-ontologies,” in Proc.
2004 Int. Conf. Machine Learn. Cybern., Aug. 26–29, 2004, vol. 3,
 V.-W. Soo, C.-Y. Lee, C.-C. Li, S. L. Chen, and C.-C. Chen, “Auto-
mated semantic annotation and retrieval based on sharable ontology
and case-based learning techniques,” in Proc. 2003 Joint Conf. Digital
Libraries, May 27–31, 2003, pp. 61–72.
 V.-W. Soo, S.-Y. Yang, S.-L. Chen, and Y.-T. Fu, “Ontology acquisi-
Proc. 2004 Joint ACM/IEEE Conf. Digital Libraries, Jun. 7–11, 2004,
 T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web,” Sci.
Amer., vol. 284, pp. 34–43, 2001.
 L. A. Zadeh, “Toward a theory of fuzzy information granulation and its
centrality in human reasoning and fuzzy logic,” Fuzzy Sets Syst., vol.
90, no. 2, pp. 111–127, 1997.
 L. A. Zadeh, “Outline of a computational approach to meaning and
knowledge representation based on a concept of a generalized assign-
ment statement,”in Proceedings of the International Seminar onArtifi-
cial Intelligence and Man-Machine Systems, M. Thoma and A. Wyner,
Eds. Heidelberg, Germany: Springer, 1986, pp. 198–211.
ment of the role of natural languages in scientific theories,” in Proc.
2004 IEEE Int. Conf. Fuzzy Syst., Jul. 25–29, 2004, vol. 1, pp. 1–2.
 Y. Sun, R. Khoury, F. Karray, and O. Basir, “Semantic context classi-
fication by means of fuzzy set theory,” in Proc. 2005 IEEE Int. Conf.
Natural Language Process. Knowledge Eng. (IEEE NLP-KE 2005),
Wuhan, China, 30 Oct.–1 Nov. 2005, pp. 250–255.
 F. Yuan, L. Yang, and G. Yu, “Improving the k-NN and applying it to
Chinese text classification,” in Proc. 2005 Int. Conf. Machine Learn.
Cybern., Aug. 18–21, 2005, vol. 3, pp. 1547–1553.
 B. Wang and S. Zhang, “A novel text classification algorithm based
on Naïve bayes and KL-divergence,” in Proc. 6th Int. Conf. Parallel
 T.-D.Zhu, X.-X.Zhao,andY.-S. Liu,“A newtextclassification model
based on the sentence space,” in Proc. 2005 Int. Conf. Machine Learn.
Cybern., Aug. 18–21, 2005, vol. 3, pp. 1774–1777.
 L.-X. Wang, J.-M. Han, Z. Wei, and G.-C. Zhou, “Application of lay-
ered clustering and plane partition in web page classification,” in Proc.
2005 Int. Conf. Machine Learn. Cybern., Aug. 18–21, 2005, vol. 4, pp.
text classification,” in Proc. 2005 IEEE/WIC/ACM Int. Conf. Web In-
tell., Sept. 19–22, 2005, pp. 598–601.
 Y. Wang and X.-J. Wang, “A new approach to feature selection in text
classification,” in Proc. 2005 Int. Conf. Machine Learn. Cybern., Aug.
18–21, 2005, vol. 6, pp. 3814–3819.
Richard Khoury received the B.Ing. degree in com-
puter engineering and the M.Sc. degree in electrical
engineering from Laval University, Quebec City, PQ,
Canada, in 2002 and 2004, respectively. He is cur-
rently pursuing the Ph.D. degree in computer engi-
neering at the University of Waterloo, Waterloo, ON, Canada.
He is currently a member of the Pattern Analysis
and Machine Intelligence Laboratory, University of
Waterloo. His publications include “A Methodology
for Extracting and Representing Actions in Texts”
presented at the 2006 IEEE International Conference on Fuzzy Systems and
“A Fuzzy Classifier for Natural Language Text using Automatically Learned
Fuzzy Rules” presented at the second International Conference on Artificial
and Computational Intelligence for Decision, Control and Automation—Inter-
national Conference on Machine Intelligence. His research interests include
machine intelligence and natural language processing.
Fakhri Karray (M’98–SM’01) received the Ing. de-
gree from the University of Tunis, Tunisia, in 1984
and the Ph.D. degree from the University of Illinois,
Urbana-Champaign, in 1989.
He is a Professor of electrical and computer engi-
neering at the University of Waterloo, Waterloo, ON,
Canada, and Associate Director of the Pattern Anal-
ysis and Machine Intelligence Laboratory. His cur-
rent research interests are in the areas of autonomous
systems and intelligent man–machine interfacing de-
sign. He is the coauthor of Soft Computing and Intel-
ligent Systems Design (Reading, MA: Addison-Wesley, 2004). He is Associate
Editor of the International Journal of Robotics and Automation and the Journal
of Control and Intelligent Systems. He is a member of the Editorial Advisory
Board of Inhenierya.
Prof. Karray is Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS,
MAN, AND CYBERNETICS—PART B: CYBERNETICS. He was Program Chair
of the 2002 IEEE International Symposium on Intelligent Control. He is KW
Chapter Chair of the IEEE Control Systems Society.
Yu Sun received the B.Sc. degree in computer sci-
ence from Harbin Institute of Technology, China, in
1995 and the M.A.Sc. and Ph.D. degrees in systems
design engineering from the University of Waterloo,
Waterloo, ON, Canada, in 2001 and 2005, respectively. His research interests include artificial intelli-
gence and natural language study through fuzzy set theory.
Mohamed Kamel received the B.Sc. (Hons.) degree in electrical
engineering from Alexandria University, Egypt,
the M.A.Sc. degree from McMaster University,
Hamilton, ON, Canada, and the Ph.D. degree from
the University of Toronto, Toronto, ON, Canada.
He is a Professor and Director of the Pattern
Analysis and Machine Intelligence Laboratory,
the Department of Electrical and Computer Engi-
neering, University of Waterloo. He holds a Canada
Research Chair in Cooperative Intelligent Systems.
His research interests are in machine intelligence, neural networks, and pattern
recognition with applications in robotics and manufacturing. He has authored
or coauthored more than 300 papers in journals and conference proceedings
and numerous technical and industrial project reports. He has received two
patents. Under his supervision, 60 Ph.D. and M.A.Sc. students have completed
their degrees. He is Editor-in-Chief of the International Journal of Robotics
and Automation and Associate Editor of Cognitive Neurodynamics. He is
a member of the Editorial Advisory Board of the International Journal of
Image and Graphics and Intelligent Automation and Soft Computing. He was
a Consultant for General Motors, NCR, IBM, Northern Telecom, and Spar
Aerospace. He is a member of the Board of Directors and Cofounder of Virtek
Vision International, Waterloo.
Dr. Kamel is a member of ACM, AAAI, and APEO. He is an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A and of Pattern Recognition Letters.
Otman Basir (M’89) received the B.Sc. degree in
computer engineering from Al-Fateh University,
Libya, in 1984, the M.Sc. degree in electrical en-
gineering from Queens University, Kingston, ON,
Canada, in 1989, and the Ph.D. degree in systems
design engineering from the University of Waterloo,
Waterloo, ON, Canada, in 1993.
He is currently an Associate Professor with the
Department of Electrical and Computer Engineering,
University of Waterloo, and an Associate Director
of the Pattern Analysis and Machine Intelligence
Laboratory and the Waterloo Institute for Health Informatics Research. His
research interests include embedded real-time systems, sensor design, fusion
of sensory and decision information, biologically inspired intelligence, and
intelligent transportation systems.