Article

Abstract

Historically, information retrieval (IR) has followed two principally different paths that we call syntactic IR and semantic IR. In syntactic IR, terms are represented as arbitrary sequences of characters and IR is performed through the computation of string similarity. In semantic IR, instead, terms are represented as concepts and IR is performed through the computation of semantic relatedness between concepts. Semantic IR, in general, demonstrates lower recall and higher precision than syntactic IR. However, in practical applications syntactic IR has so far been the clear winner. In this paper we present a novel approach which allows us to extend syntactic IR with semantics, thus leveraging the advantages of both syntactic and semantic IR. First experimental results, reported in the paper, show that the combined approach performs at least as well as syntactic IR, often improving results where semantics can be exploited.
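The contrast drawn in the abstract can be sketched in a few lines of code. The toy lexicon below is a hypothetical stand-in for a real resource such as WordNet; the function names are illustrative, not the authors' API.

```python
# Toy lexicon mapping terms to concept ids (hypothetical stand-in for WordNet).
CONCEPTS = {
    "dog": "C1", "canine": "C1",
    "cat": "C2", "feline": "C2",
}

def syntactic_match(query_term, doc_term):
    """Syntactic IR: terms are arbitrary strings; match on string identity."""
    return query_term == doc_term

def semantic_match(query_term, doc_term):
    """Semantic IR: terms are mapped to concepts; match on concept identity."""
    q, d = CONCEPTS.get(query_term), CONCEPTS.get(doc_term)
    return q is not None and q == d

def combined_match(query_term, doc_term):
    """Combined approach: use semantics when available,
    fall back to syntactic matching otherwise."""
    if query_term in CONCEPTS and doc_term in CONCEPTS:
        return semantic_match(query_term, doc_term)
    return syntactic_match(query_term, doc_term)
```

Here "dog" and "canine" match semantically (same concept) although they fail syntactically, while unknown terms such as arbitrary identifiers still match by string equality, which is the fallback behavior the paper describes.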


... An apparently more successful approach was presented by Fausto Giunchiglia et al. (cf. [13]); their prototype system also uses WordNet data in a GATE architecture, but instead creates a "conceptual index" from natural language phrases linked to WordNet concepts. The evaluation of their system showed better precision/recall performance (on a 29506-document corpus) than an Apache Lucene based search engine. ...
... Unfortunately, current search engine technology is mostly based on (syntactic) open web search (cf. [13], [16], etc.), which in turn is based on common information retrieval techniques. These provide only basic tools, which are not very effective in a highly socialized and information-wise fine-grained environment. ...
Article
Full-text available
In this paper, we propose semantic enterprise search as a promising technical methodology for improving access to institutional knowledge. We briefly discuss the nature of knowledge and ignorance with respect to web-based information retrieval before introducing our particular view of semantic search as a tight fusion of search engine and semantic web technologies, based on semantic annotations and the concept of intra-institutionwise distributed extensibility while still maintaining free keyword search functionality. Consequently, our architecture implementation makes strong use of the Aperture and Lucene software frameworks but introduces the novel concept of "RDF documents". Because our prototype system is not yet complete, we are not able to provide performance statistics; instead, we present a concise example scenario.
... Pre-coordination clearly leads to an exponential explosion in the number of subjects, while in the faceted approach subjects are instead created by composing the atomic concepts from the facets. A faceted representation scheme corresponds to what in our previous work we call the background knowledge [4,5], i.e., the a-priori knowledge which must exist to make semantics effective. Each facet corresponds to what in logic is called a logical theory [23,24] and to what in computer science is called an ontology, or more precisely a lightweight ontology [6]. ...
... From their integration we created GeoWordNet, one of the biggest multi-lingual geo-spatial ontologies currently available and therefore particularly suitable for providing semantic support to spatial applications. A large subset of GeoWordNet is available as open source in plain CSV and RDF formats. This mapping allowed, among other things, the identification of the main subtrees of WordNet containing synsets representing geographical classes. ...
Conference Paper
Full-text available
Space, together with time, is one of the two fundamental dimensions of the universe of knowledge. Geo-spatial ontologies are essential for our shared understanding of the physical universe and for achieving semantic interoperability between people and between software agents. In this paper we propose a methodology and a minimal set of guiding principles, mainly inspired by the faceted approach, to produce high quality ontologies in terms of robustness, extensibility, reusability, compactness and flexibility. We demonstrate, with step-by-step examples, that by applying the methodology and those principles we can model the space domain and produce a high quality facet-based large scale geo-spatial ontology comprising entities, entity classes, spatial relations and attributes.
... The context always makes clear what we mean. Lightweight ontologies allow for automated document classification [1,16], query answering [1,21] and also for solving the semantic heterogeneity problem among multiple ontologies [15,18,19,20]. They are definitely a very powerful tool which can be exploited towards the automation of reasoning in data and knowledge management. ...
... These two steps construct a faceted representation scheme and correspond to what in our previous work we call the definition and construction of the so-called background knowledge [17,21], namely the a-priori knowledge which must exist in order to make semantics effective. Notice that the groupings of terms in step 2 have real-world semantics, namely, they are descriptive ontologies which are formed using part-of, is-a and instance-of. ...
Article
Full-text available
We concentrate on the use of ontologies for the categorization of objects, e.g., photos, books, web pages. Lightweight ontologies are ontologies with a tree structure where each node is associated with a natural language label. Faceted lightweight ontologies are lightweight ontologies where the labels of nodes are organized according to certain predefined patterns which capture different aspects of meaning, i.e., facets. We introduce facets based on the Analytico-Synthetic approach, a well-established methodology from Library Science which has been successfully used for decades for the classification of books. Faceted lightweight ontologies have a well-defined structure and, as such, they are easier to create and to share among users, and they also provide more organized input to semantics-based applications, such as semantic search and navigation. To appear in: "Conceptual Modeling: Foundations and Applications", Alex Borgida, Vinay Chaudhri, Paolo Giorgini, Eric Yu (Eds.), LNCS 5600, Springer. The original publication will be available at www.springerlink.com
... According to the foregoing, few ontological search engines support the Arabic language and QA. The fundamental reasons can be traced back to natural language processing (NLP) (Giunchiglia et al., 2008) and the challenges of addressing syntactic search and producing synonymous meanings of words. Another point is particularities such as short vowels, the absence of capital letters, grammar, and morphological complexity such as diacritics. ...
... These advances are illustrated by Boughanem and Savoy [BS08], who give an overview of the state of the art and perspectives of IR. We finally adopt the following definition from Giunchiglia et al. [GKZ08]: "the goal of an information retrieval system is to map a natural language query q, which specifies a user's information needs, to a set of documents in a document collection D which meet these needs, and (optionally) to rank these documents according to their degree of relevance". ...
Article
Information systems face a relevance problem in retrieval due to the huge increase in available data. Moreover, the number of networked devices is growing and jeopardizes the client/server architecture model. A new architecture is thus emerging: peer-to-peer (P2P) networks. But they are greedy in network resources (queries flood the network) and offer limited functionality (keyword search). In both fields, IR and P2P systems, research is going deeper into the use of semantics. In computer science, semantics-based approaches generally rely on the definition of ontologies. The large-scale and distributed development of ontologies leads to semantic heterogeneity. A classical solution relies on the use of mappings between parts of two ontologies. But this solution is difficult to obtain and not always complete. Unshared parts of two ontologies are often not managed, which leads to a loss of information. Our solution, EXSI2D, uses a special query expansion, called structuring expansion, on the query initiator's side. She can then specify the dimensions of her query without any modification of the query itself. The information provider is also allowed to interpret the structuring expansion within her own ontologies. Thus each participant in a semantically heterogeneous information system is able to use all of her ontology, including the unshared parts. We also present a solution for the use of EXSI2D in a P2P system, thanks to SPARTANBFS, a "frugal" protocol for unstructured P2P systems.
... The problem with sense enumerations in compound nouns is that they are a source of noise rather than a source of knowledge when using WordNet as a resource for natural language processing (NLP) and knowledge-based applications, especially Information Retrieval (IR) [8] and semantic search [9]. Although specific instances of compound noun polysemy have been addressed when solving the problem of specialization polysemy [13], little or no research has been done towards solving the problem of compound noun polysemy as a problem of sense enumeration in WordNet. ...
Conference Paper
Full-text available
Sense enumeration in WordNet is one of the main reasons behind the highly polysemous nature of WordNet. Sense enumeration refers to a misconstruction that results in the wrong assignment of a synset to a term. In this paper, we propose a semi-automatic process to discover and solve the problem of sense enumerations in compound noun polysemy in WordNet. The proposed solution reduces the number of sense enumerations in WordNet, and thus its highly polysemous nature, without affecting its efficiency as a lexical resource for natural language processing.
... In [7], we showed how searching for complex concepts can be implemented by indexing documents directly by these concepts. The problem with this method is that the size of the inverted index vocabulary, in the worst case, is exponential with respect to the size of the terminology T. ...
Conference Paper
Full-text available
In this paper we present a novel approach, called Concept Search, which extends syntactic search, i.e., search based on the computation of string similarity between words, with semantic search, i.e., search based on the computation of semantic relations between concepts. The key idea of Concept Search is to operate on complex concepts and to maximally exploit the semantic information available, reducing to syntactic search only when necessary, i.e., when no semantic information is available. The experimental results show that Concept Search performs at least as well as syntactic search, improving the quality of results as a function of the amount of available semantics.
... Moreover, [21] does not support advanced search operations related to ontology semantics. Additionally, an interesting but less related approach [22,23] analyzes the meaning of words and phrases to define semantic relations between lexicalized concepts. In that case, syntactic search is extended with semantics by converting words into concepts and exploiting the resulting semantics. ...
Conference Paper
Full-text available
This paper describes GoNTogle, a framework for document annotation and retrieval, built on top of Semantic Web and IR technologies. GoNTogle supports ontology-based annotation for documents of several formats, in a fully collaborative environment. It provides both manual and automatic annotation mechanisms. Automatic annotation is based on a learning method that exploits user annotation history and textual information to automatically suggest annotations for new documents. GoNTogle also provides search facilities beyond the traditional keyword-based search. A flexible combination of keyword-based and semantic-based search over documents is proposed in conjunction with advanced ontology-based search operations. The proposed methods are implemented in a fully functional tool and their effectiveness is experimentally validated.
... In this section, we describe how the document representations (see Figure 3) can be indexed and retrieved using a (record-level) inverted index (as proposed in [13]). In the inverted index, as used in syntactic search, there are two parts: the dictionary, i.e., a set of terms (t) used for indexing, and a set of posting lists P(t). ...
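The dictionary/posting-list structure described in this snippet can be sketched as follows. This is a minimal illustration, not the cited implementation; the point it makes is that the same machinery works whether the dictionary terms are words or concept identifiers.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a record-level inverted index: term -> posting list of doc ids.
    `docs` maps a document id to its list of index terms; terms may equally
    be words (syntactic search) or concept ids (semantic search)."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for t in set(terms):        # record-level: one posting per document
            index[t].append(doc_id)
    return index

def retrieve(index, query_terms):
    """Return the doc ids whose posting lists contain all query terms
    (conjunctive retrieval over the inverted index)."""
    postings = [set(index.get(t, [])) for t in query_terms]
    return set.intersection(*postings) if postings else set()
```

Replacing the word terms in `docs` with concept identifiers leaves both functions unchanged, which illustrates why extending syntactic search with semantics can reuse the standard inverted-index machinery.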
Article
Full-text available
The goal of information retrieval (IR) is to map a natural language query, which specifies the user information needs, to a set of objects in a given collection which meet these needs. Historically, there have been two major approaches to IR that we call syntactic IR and semantic IR. In syntactic IR, search engines use words or multi-word phrases that occur in document and query representations. The search procedure used by these search engines is principally based on the syntactic matching of document and query representations. The precision and recall achieved by these search engines might be negatively affected by the problems of (i) polysemy, (ii) synonymy, (iii) complex concepts, and (iv) related concepts. Semantic IR is based on building document and query representations through a semantic analysis of their contents using natural language processing techniques, and then retrieving documents by matching these semantic representations. Semantic IR approaches are developed to improve the quality of syntactic approaches but, in practice, the results of semantic IR are often inferior to those of syntactic IR. In this thesis, we propose a novel approach to IR which extends syntactic IR with semantics, thus addressing the problem of low precision and low recall of syntactic IR. The main idea is to keep the same machinery which has made syntactic IR so successful, but to modify it so that, whenever possible (and useful), syntactic IR is substituted by semantic IR, thus improving the system performance. As instances of the general approach, we describe the semantics enabled approaches to: (i) document retrieval, (ii) document classification, and (iii) peer-to-peer search.
Chapter
The amount of unstructured data has grown exponentially during the past two decades and continues to grow at even faster rates. As a consequence, the efficient management of this kind of data has come to play an important role in almost all organizations. Up to now, approaches from many different research fields, like information search and retrieval, text mining, or query expansion and reformulation, have enabled us to extract and learn patterns in order to improve the management, retrieval and recommendation of documents. However, there are still many open questions, limitations and vulnerabilities that need to be addressed. This paper aims at identifying the current major challenges and research gaps in the field of "document enrichment, retrieval and recommendation", introduces innovative ideas towards overcoming these limitations and weaknesses, and shows the benefits of adopting these ideas in real enterprise content management systems.
Article
Full-text available
(Full text available here: http://eprints.biblio.unitn.it/2104/) The availability of a priori knowledge, also called background knowledge, is fundamental for the functioning of semantics-based systems. In this paper we introduce a faceted knowledge organization framework called DERA (for Domain, Entity, Relation, Attribute) and describe its implementation inside a system, called UK (for Universal Knowledge), which is extensible and scalable and which allows for fully automated reasoning via a direct encoding into Description Logics (DL). Extensibility and scalability are obtained by allowing the definition of any number of domains, where a domain is taken to be "an area of knowledge or field of study that we are interested in or that we are communicating about". In turn, a domain is organized into a number of facets, where a facet is taken to be "a hierarchy of homogeneous terms describing an aspect of the knowledge being codified, where each term denotes a primitive atomic concept". Domains, facets and terms can be added at any time, and different applications can use any subset of them. The direct encoding of DERA into DL is obtained by allowing only three types of facets (i.e., Entity, Relation, Attribute), which can be directly translated into DL concepts, roles and attributes, or into instances whose properties are encoded using the terms occurring in the facets themselves. The current implementation of UK contains around 377 domains, out of which 115 are prioritized for development, more than 150,000 terms (encoding concepts, relations and attributes), around 10,000,000 instances and more than 93,000,000 axioms expressed using the terms codified in the DERA facets.
Conference Paper
In this paper, I present a brief view of my research on a Vietnamese concept-based information retrieval model and a result of this research: a concept identification method for Vietnamese text. This method, which is the key to the model, has to solve the problems of identifying phrases, synonyms and homonyms in a sentence. The problems of identifying synonyms and homonyms are focused on and solved according to the concept of semantic memory. Experimental results for the method are also presented.
Conference Paper
In this paper, an ontology-based annotation method for Vietnamese text documents and a method of indexing annotated documents are presented. The annotation method aims at annotating a text document in order to decrease the number of comparison operations and increase accuracy in retrieval tasks. To accomplish this purpose, the kernel phrase concept is proposed to transform groups of synonymous phrases into a single phrase which has the same meaning as the synonymous phrases. The transformation from a phrase to its kernel phrase is based on an ontology of Vietnamese words according to a different language model. This paper focuses on the annotation method using an ontology to identify kernel phrases in a document. Some problems of the Vietnamese language related to the annotation method are also presented.
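The kernel-phrase idea described above amounts to normalizing synonymous phrases to one canonical form before indexing. A minimal sketch, assuming a hypothetical ontology given as a plain dictionary (the English example phrases are illustrative; the paper works on Vietnamese):

```python
# Hypothetical ontology fragment: each phrase maps to its kernel phrase,
# so that all synonymous phrases index identically.
KERNEL = {
    "automobile": "car",
    "motor car": "car",
    "car": "car",
}

def annotate(phrases):
    """Replace each phrase by its kernel phrase when the ontology knows it;
    unknown phrases are left unchanged."""
    return [KERNEL.get(p, p) for p in phrases]
```

After this normalization, a query for "automobile" and a document mentioning "motor car" both reduce to the kernel phrase "car", which is how the method decreases comparison operations at retrieval time.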
Conference Paper
Full-text available
Multi-domain search answers queries spanning multiple entities, like "Find an affordable house in a city with a low criminality index, good schools and medical services", by producing ranked sets of entity combinations that maximize relevance, measured by a function expressing the user's preferences. Due to the combinatorial nature of the results, good entity instances (e.g., inexpensive houses) tend to appear repeatedly in top-ranked combinations. To improve the quality of the result set, it is important to balance relevance (i.e., high values of the ranking function) with diversity, which promotes different, yet almost equally relevant, entities in the top-k combinations. This paper explores two different notions of diversity for multi-domain result sets, experimentally compares alternative algorithms for the trade-off between relevance and diversity, and performs a user study evaluating the utility of diversification in multi-domain queries.
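One common way to realize the relevance/diversity trade-off described here is greedy MMR-style selection; the sketch below is a generic illustration under that assumption, not the specific algorithms compared in the paper.

```python
def greedy_diversify(candidates, relevance, similarity, k, lam=0.7):
    """Greedily select k results, each time picking the candidate that
    maximizes  lam * relevance(c) - (1 - lam) * max_similarity(c, selected).
    `relevance` maps a candidate to a score; `similarity` maps a pair of
    candidates to [0, 1]; `lam` balances relevance against diversity."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            penalty = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance(c) - (1 - lam) * penalty
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lam = 1.0` this degenerates to pure relevance ranking, where near-duplicate top combinations reappear; lowering `lam` penalizes combinations similar to ones already selected, which is exactly the diversification effect the paper studies.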
Article
Full-text available
In this article we focus on the use of ontologies for the organization of objects, such as photos, books and Web pages. Lightweight ontologies are ontologies with a hierarchical tree structure where each node is associated with a natural language label. In faceted lightweight ontologies, the labels are organized according to well-defined patterns which capture specific aspects of knowledge, i.e., facets. To this end, we build on the Analytico-Synthetic approach, a well-established methodology used successfully for decades in library science, especially in India, for the classification of books. Faceted lightweight ontologies have a well-defined structure and, as such, are easier to create and to share among users, and are better suited to semantic applications, i.e., applications where the ontological meaning of terms is automatically analyzed and exploited.
Conference Paper
Full-text available
We think of Match as an operator which takes two graph-like structures (e.g., conceptual hierarchies or ontologies) and produces a mapping between those nodes of the two graphs that correspond semantically to each other. Semantic matching is a novel approach where semantic correspondences are discovered by computing, and returning as a result, the semantic information implicitly or explicitly codified in the labels of nodes and arcs. In this paper we present an algorithm implementing semantic matching, and we discuss its implementation within the S-Match system. We also test S-Match against three state-of-the-art matching systems. The results, though preliminary, look promising, in particular with regard to precision and recall.
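The Match operator described above can be illustrated with a toy sketch. The subsumption table and labels below are hypothetical; a real system such as S-Match derives the relations between node labels from lexical resources and a reasoner rather than a hand-written table.

```python
# Hypothetical is-a facts: (child, parent) means the child concept is
# subsumed by the parent concept.
IS_A = {("dog", "animal"), ("cat", "animal")}

def relation(a, b):
    """Return the semantic relation between two labels:
    '=' equivalence, '<' less general, '>' more general, None if unknown."""
    if a == b:
        return "="
    if (a, b) in IS_A:
        return "<"
    if (b, a) in IS_A:
        return ">"
    return None

def match(labels1, labels2):
    """Match operator: map each pair of nodes from the two structures to
    the semantic relation holding between their labels, when one exists."""
    return [(a, r, b) for a in labels1 for b in labels2
            if (r := relation(a, b)) is not None]
```

The output is a set of correspondences annotated with semantic relations, which is the form of mapping the paper describes, rather than a bare list of similar node pairs.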