Article

Dipe-D: A Tool for Knowledge-Based Query Formulation in Information Retrieval


Abstract

The paper reports the development of Dipe-D, a knowledge-based procedure for the formulation of Boolean queries in information retrieval. Dipe-D creates a query in two steps: (1) the user's information need is developed interactively, while identifying the concepts of the information need, and subsequently (2) the collection of concepts identified is automatically transformed into a Boolean query. In the first step, the subject area—as represented in a knowledge base—is explored by the user. He does this by specifying the (concepts that meet his) information need in an artificial language and looking through the solution provided by the computer. The specification language allows one to specify concepts by their features, both in precise terms and vaguely. By repeating the process of specifying the information need and exploring the resulting concepts, the user can precisely single out the concepts that describe his information need. In the second step, the program provides the designations (and variants) for the concepts identified and connects them by appropriate operators. Dipe-D is meant to improve on existing procedures that identify the concepts less systematically, create a query manually, and then sometimes expand that query. Experiments are reported on each of the two steps; they indicate that the first step identifies only relevant concepts, though not all of them, and that the second step performs (at least) as well as human beings do.
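Dipe-D's second step (providing designations for the identified concepts and connecting them with appropriate operators) follows a common pattern: OR together the variant designations of each concept, then AND the per-concept clauses. A minimal sketch of that pattern; the function and the example concepts and designations are hypothetical, not taken from Dipe-D itself:

```python
def build_boolean_query(concepts):
    """Combine concepts with AND; within each concept, combine its
    designations (names and variants) with OR."""
    clauses = []
    for designations in concepts.values():
        clauses.append("(" + " OR ".join(f'"{d}"' for d in designations) + ")")
    return " AND ".join(clauses)

# Hypothetical output of step 1: concepts mapped to their designations.
concepts = {
    "myocardial infarction": ["myocardial infarction", "heart attack", "MI"],
    "aspirin": ["aspirin", "acetylsalicylic acid"],
}
query = build_boolean_query(concepts)
```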


... Therefore, query formulation needs to be done according to both the user's information need and the type of system in which the information is stored. Van Der Pol (2003) also suggests that formulating a query requires knowledge of the subject area (or domain knowledge). In fact, domain knowledge is important because the information need "comprises concepts having several features" and also "several concepts having identical features" (Van Der Pol, 2003). ...
... Van Der Pol (2003) also suggests that formulating a query requires knowledge of the subject area (or domain knowledge). In fact, domain knowledge is important because the information need "comprises concepts having several features" and also "several concepts having identical features" (Van Der Pol, 2003). Thus, with domain knowledge, one can either enlarge the domain of search (this process is referred to as query expansion) ...
Thesis
Full-text available
Nowadays, more and more companies use Enterprise Models to integrate and coordinate their business processes with the aim of remaining competitive in the market. Consequently, Enterprise Models play a critical role in this integration, enabling companies to improve the objectives of the enterprise and the ways to reach them in a given period of time. Through Enterprise Models, companies are able to improve the management of their operations, actors and processes, and also to improve communication within the organisation. This thesis describes another use of Enterprise Models. In this work, we intend to apply Enterprise Models as an interface for information searching. The underlying motivation for this project lies in the fact that we would like to show that Enterprise Models can be more than just models: they can be used in a more dynamic way, namely through a software program for information searching. The software program aims first at extracting the information contained in the Enterprise Models (which are stored in an XML file on the system). Once the information is extracted, it is used to express a query which is sent to a search engine to retrieve documents relevant to the query and return them to the user. The thesis was carried out over an entire academic semester. The results of this work are a report which summarizes all the knowledge gained in the field of study, and a software program built as a proof of concept for testing the theories.
... Other work has been developed in this context; we can cite R. W. Van Der Pol (Van Der Pol, 2003), who proposed a pre-retrieval query reformulation system based on a representation of a medical domain. This domain is organized into concepts linked by a certain number of binary relations (i.e. ...
Article
Full-text available
Retrieval of pertinent information on the web is a recent concern of the Information Society. Processing based on statistics alone is no longer enough to handle (i.e. to search, translate, summarize...) relevant information from texts. The problem is how to extract knowledge while taking into account document contents as well as the context of the query. Queries looking for events taking place during a certain time period (e.g. between 1990 and 2001) cannot yet provide the expected results. We propose a method to transform the query in order to "understand" its context and its temporal framework. Our method is validated by the SORTWEB system.
... The terms were then aggregated into "sub-requests" and combined into a Boolean expression in disjunctive normal form (DNF). Other algorithms that have been proposed to translate a query to DNF are based on classification [8], decision-trees [3], and thesauri [24]. In Hearst [10], keyphrases within the same category are associated with a facet of the concept; therefore, the keyphrases within a category are translated into DNF, with the keyphrases OR-ed together. ...
Conference Paper
Full-text available
In pseudo-relevance feedback, the two key factors affecting retrieval performance most are the source from which expansion terms are generated and the method of ranking those expansion terms. In this paper, we present a novel unsupervised query expansion technique that utilizes keyphrases and POS phrase categorization. The keyphrases are extracted from the retrieved documents and weighted with an algorithm based on information gain and co-occurrence of phrases. The selected keyphrases are translated into Disjunctive Normal Form (DNF) based on the POS phrase categorization technique for better query reformulation. Furthermore, we study whether ontologies such as WordNet and MeSH improve the retrieval performance in conjunction with the keyphrases. We test our techniques on TREC 5, 6, and 7 as well as a MEDLINE collection. The experimental results show that the use of keyphrases with POS phrase categorization produces the best average precision.
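The DNF translation mentioned above can be illustrated with a small sketch: taking one keyphrase from each POS-based category (facet) yields one conjunct per combination, and the conjuncts are OR-ed together. The facets shown are invented examples, and this is the generic construction, not the authors' exact algorithm:

```python
from itertools import product

def to_dnf(categories):
    """Build a DNF expression: one conjunct per combination of
    keyphrases, taking one keyphrase from each category (facet)."""
    conjuncts = [" AND ".join(combo) for combo in product(*categories)]
    return " OR ".join(f"({c})" for c in conjuncts)

# Hypothetical POS-based keyphrase categories (facets):
facets = [["query expansion", "query reformulation"], ["medline"]]
dnf = to_dnf(facets)
# (query expansion AND medline) OR (query reformulation AND medline)
```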
... The term "knowledge-based queries" may also be used to mean a repository of meaningful terms or keywords that help the user to construct more meaningful queries regarding the data (Catarci et al. 2004;Van Der Pol 2003; KB_SQL™, developed by the Knowledge Based System Company, http://www.kbsystems.com/ products_kbsql.html). ...
Article
Time-oriented domains with large volumes of time-stamped information, such as medicine, security information and finance, require useful, intuitive intelligent tools to process large amounts of time-oriented multiple-subject data from multiple sources. We designed and developed a new architecture, the VISualizatIon of Time-Oriented RecordS (VISITORS) system, which combines intelligent temporal analysis and information visualization techniques. The VISITORS system includes tools for intelligent selection, visualization, exploration, and analysis of raw time-oriented data and of derived (abstracted) concepts for multiple subject records. To derive meaningful interpretations from raw time-oriented data (known as temporal abstractions), we use the knowledge-based temporal-abstraction method. A major task in the VISITORS system is the selection of the appropriate subset of the subject population on which to focus during the analysis. Underlying the VISITORS population-selection module is our ontology-based temporal-aggregation (OBTAIN) expression-specification language which we introduce in this study. The OBTAIN language was implemented by a graphical expression-specification module integrated within the VISITORS system. The module enables construction of three types of expressions supported by the language: Select Subjects, Select Time Intervals, and Get Subjects Data. These expressions retrieve a list of subjects, a list of relevant time intervals, and a list of time-oriented subjects’ data sets, respectively. In particular, the OBTAIN language enables population-specification, through the Select Subjects expression, by using an expressive set of time and value constraints. We describe the syntax and semantics of the OBTAIN language and of the expression-specification module. The OBTAIN expressions constructed by the expression-specification module, are computed by a temporal abstraction mediation framework that we have previously developed. 
To evaluate the expression-specification module, five clinicians and five medical informaticians defined ten expressions, using the expression-specification module, on a database of more than 1,000 oncology patients. After a brief training session, both user groups were able in a short time (mean = 3.3 ± 0.53 min) to construct ten complex expressions using the expression-specification module, with high accuracy (mean = 95.3 ± 4.5 on a predefined scale of 0 to 100). When grouped by time and value constraint subtypes, five groups of expressions emerged. Only one of the five groups (expressions using time-range constraints), led to a significantly lower accuracy of constructed expressions. The five groups of expressions could be clustered into four homogenous groups, ordered by increasing construction time of the expressions. A system usability scale questionnaire filled by the users demonstrated the expression-specification module to be usable (mean score for the overall group = 68), but the clinicians’ usability assessment (60.0) was significantly lower than that of the medical informaticians (76.1).
Chapter
Keyphrase Extraction-Based Pseudo-Relevance Feedback Query Expansion with WordNet: Experiments on Medline Data Sets
Article
The paper reports the design of Dipe-R, a knowledge representation language. It meets two requirements: (1) Dipe-R should enable the appropriate expression of mental states and processes, viz. in line with basic insights on the nature of (human) communication and knowledge, and (2) Dipe-R should offer the possibility of identifying concepts by describing them by their features. The first requirement means that Dipe-R is capable of representing knowledge adequately. The second reflects a common way of exploring knowledge; this seems useful in various applications, e.g., in Information Retrieval. For meeting these requirements, Dipe-R has as main characteristics: (a) two types of expressions for representing thoughts (using binary relations), (b) for each expression Dipe-R creates at least one sentence in natural language, thus distinguishing language from ‘meaning’, and (c) each expression in Dipe-R is provided with information on its origin. Some relation types are pre-represented, e.g., for creating type hierarchies. Experiments (in Information Retrieval) indicate that Dipe-R indeed meets its requirements.
Conference Paper
Full-text available
Enterprise knowledge modelling in general offers methods, tools and approaches for capturing knowledge about processes and products in formalized models in order to support organizational challenges. From the perspective of information systems development, the use of enterprise models traditionally is very strong in the earlier development phases; approaches for applying enterprise models in later development phases exist, but are not as intensely researched and elaborated. We argue that enterprise models should be considered as carriers of enterprise knowledge, which can be used to a larger extent in creating actable solutions for enterprise problems. With competence supply as an example, the focus of the paper is on using enterprise models as an interface for information searching. The main contributions of the paper are (1) showing that using enterprise models for expressing information demand is feasible and produces pertinent results, (2) the architecture and implementation of an IT solution for this purpose, and (3) lessons learned from validating this approach.
Conference Paper
In the domain of bioinformatics, extracting a relation such as protein-protein interactions from a large database of text documents is a challenging task. One major issue with biomedical information extraction is how to efficiently digest the sheer size of the unstructured biomedical data corpus. Often, among these huge biomedical data, only a small fraction of the documents contain information that is relevant to the extraction task. We propose a novel query expansion algorithm to automatically discover the characteristics of documents that are useful for extraction of a target relation. Our technique introduces a hybrid query re-weighting algorithm combining the modified Robertson-Sparck-Jones query ranking algorithm with a keyphrase extraction algorithm. Our technique also adopts a novel query translation technique that incorporates POS categories into query translation. We conduct a series of experiments and report the experimental results. The results show that our technique is able to retrieve more documents that contain protein-protein pairs from MEDLINE as the number of iterations increases. Our technique is also compared with SLIPPER, a supervised rule-based query expansion technique. The results show that our technique outperforms SLIPPER by 17.90% to 29.98% over four iterations.
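The (unmodified) Robertson-Sparck-Jones relevance weight that the hybrid re-weighting builds on has a standard closed form; a sketch with the usual 0.5 smoothing constant (the authors' modified variant may differ):

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson-Sparck-Jones relevance weight for a term:
    r = relevant docs containing the term, R = total relevant docs,
    n = docs containing the term, N = collection size.
    0.5 is the conventional smoothing constant."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term in 8 of 10 relevant docs but only 50 of 10,000 docs overall
# gets a strongly positive weight:
w = rsj_weight(8, 10, 50, 10000)
```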
Article
This article investigates a new, effective browsing approach called metabrowsing. It is an alternative to current information retrieval systems, which still face six prominent difficulties. We identify and classify the difficulties and show that the metabrowsing approach alleviates those associated with query formulation and missing domain knowledge. Metabrowsing is a high-level way of browsing through information: instead of browsing through document contents or document surrogates, the user browses through a graphical representation of the documents and their relations to the domain. The approach requires different cognitive skills from the user than are currently required. Yet, a user evaluation in which the metabrowsing system was compared with an ordinary query-oriented system showed only some small indicative differences in effectiveness, efficiency, and user satisfaction. We expect that more experience with metabrowsing will result in a significantly better performance difference. Hence, our conclusion is that the development of new cognitive skills requires some time before the technologies are ready to be used.
Article
Full-text available
Graphical Markov models are a powerful tool for the description of complex interactions between the variables of a domain. They provide a succinct description of the joint distribution of the variables. This feature has led to the most successful application of graphical Markov models, that is as the core component of probabilistic expert systems. The fascinating theory behind this type of models arises from three different disciplines, viz., Statistics, Graph Theory and Artificial Intelligence. This interdisciplinary origin has given rich insight from different perspectives. There are two main ways to find the qualitative structure of graphical Markov models. Either the structure is specified by a domain expert or "structural learning" is applied, i.e., the structure is automatically recovered from data. For structural learning, one has to compare how well different models describe the data. This is easy for, e.g., acyclic digraph Markov models. However, structural learning is still a hard problem because the number of possible models grows exponentially with the number of variables. The main contributions of this thesis are as follows. Firstly, a new class of graphical Markov models, called TCI models, is introduced. These models can be represented by labeled trees and form the intersection of two previously well-known classes. Secondly, the inclusion order of graphical Markov models is studied. From this study, two new learning algorithms are derived. One for heuristic search and the other for the Markov Chain Monte Carlo Method. Both algorithms improve the results of previous approaches without compromising the computational cost of the learning process. Finally, new diagnostics for convergence assessment of the Markov Chain Monte Carlo Method in structural learning are introduced. The results of this thesis are illustrated using both synthetic and real world datasets.
Article
Due to technological developments, we foresee future systems where groups of actors coordinate their actions in a dynamic manner to reach their goals. Our aim is to develop a reasoning model for artificial actors in such systems. The starting point is the relation between the autonomy of individuals and the coordination of group behavior. We adopt the agent paradigm as the basis for the actors. Although autonomy is a key concept in agent research, there is no common definition of agent autonomy and adjustable autonomy. Our agents should be autonomous such that they can be held responsible for their actions. Furthermore, they should be able to work together following different types of coordination in order to achieve dynamic coordination. In this research, we define being autonomous as having control over external influences on the decision-making process. Adjustable autonomy in our context means dynamically dealing with external influences on the decision-making process based on internal motivations. An agent controls to what extent it is being influenced by the environment and by other agents. Agents achieve coordination by allowing influences of others on their decision-making process. We present an abstract reasoning model that facilitates agent autonomy. The two main components of the reasoning model are influence control and decision-making. In the component for influence control, the agent deals with new observations and messages. We propose to use reasoning rules to process those external events. The reasoning rules specify the effects of external events on the beliefs and goals of the agent. The beliefs and goals, then, are used for the decision-making process. With this reasoning model we conduct a simulation experiment of a firefighter organization extinguishing several fires. The organizational goal can be reached via different types of coordination, ranging from emergent to centralized coordination.
The experiment shows that the perspective on autonomy as influence control provides a promising way to facilitate dynamic coordination. Our reasoning model separates event-processing and decision-making. This allows the development of domain-independent heuristics for event-processing. In this thesis we mention three heuristics: information relevance, organizational knowledge and trust. The first two are discussed in further detail. We propose an algorithm for relevance determination. We argue that the potential benefits of using information relevance for influence control lie in controlling the decision quality and preventing information overload. Furthermore, we work out the use of organizational knowledge to process external events. We show how organizational norms can be translated into event-processing rules. We describe how our approach facilitates organizational dynamics. The modularity of our reasoning model ensures that the event-processing rules are explicitly defined. This allows for metareasoning about the event-processing rules. We present a metareasoning model with which the agent can select and take up the desired attitude with respect to the environment and to other agents. This process gives the agent control over external influences, and therewith it meets the requirements for autonomy and adjustable autonomy. In a simulation experiment we implement a firefighter organization that exhibits dynamic coordination via self-adjustable autonomy of the firefighters.
Article
Decision-support systems are used in a large variety of domains. In the medical domain, such systems can be equipped with a patient-specific facility that indicates which diagnostic test or combination of tests should be performed. Current systems, however, do not take into account that tests might be ordered in packages rather than one by one. Current systems furthermore do not take into account that physicians would prefer to gather information for a specific purpose, such as the general condition of the patient, prior to gathering information about another topic, such as, for instance, the presence or absence of metastases. In this thesis, we describe an automatic test-selection facility we designed that fits better with daily medical practice than current facilities do. Our facility is capable of selecting diagnostic tests while taking into account the specific characteristics of the patient's illness, the characteristics of the test, such as its sensitivity and specificity or its predictive value, and the order in which various subgoals are investigated by the physicians. It is furthermore able to select multiple tests rather than just one test. By extending a decision-support system with a test-selection facility, the system becomes more attractive for use in daily medical practice. A decision-support system with a good test-selection facility could lead to ordering fewer tests. This might imply lower financial costs and shorter waiting periods before visiting the physician or receiving therapy. Furthermore, it might lead to better patient care and a higher quality of care in general.
Article
Rheumatic and, in particular, connective tissue diseases involve many organ systems in addition to the joints. These include the skin and underlying tissues, muscles, salivary glands, nerves, kidneys and blood vessels. Biopsy specimens of these tissues can be helpful in establishing a diagnosis, assessing the extent and severity of organ involvement and monitoring therapy. This article describes the practical information on a clinical diagnosis that might indicate a need for biopsy. It outlines how to prepare the patient, how to perform the biopsy, what results are to be expected and which sequelae are possible.
Article
Full-text available
A research prototype software system for conceptual information retrieval has been developed. The goal of the system, called RUBRIC, is to provide more automated and relevant access to unformatted textual databases. The approach is to use production rules from artificial intelligence to define a hierarchy of retrieval subtopics, with fuzzy context expressions and specific word phrases at the bottom. RUBRIC allows the definition of detailed queries starting at a conceptual level, partial matching of a query and a document, selection of only the highest ranked documents for presentation to the user, and detailed explanation of how and why a particular document was selected. Initial experiments indicate that a RUBRIC rule set better matches human retrieval judgment than a standard Boolean keyword expression, given equal amounts of effort in defining each. The techniques presented may be useful in stand-alone retrieval systems, front-ends to existing information retrieval systems, or real-time document filtering and routing.
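RUBRIC's rule hierarchy bottoms out in specific word phrases whose evidence is combined upward through the subtopic tree. A toy sketch of such rule evaluation, assuming the common fuzzy conventions of max for OR and min for AND with per-rule weights; the topic, phrases, and weights are invented, and RUBRIC's actual combination functions may differ:

```python
def evaluate(rule, evidence):
    """Evaluate a topic-rule tree over per-phrase evidence scores.
    Leaves are phrase names; internal nodes combine their children
    with fuzzy OR (max) or AND (min), scaled by a rule weight."""
    if isinstance(rule, str):                 # leaf: a word phrase
        return evidence.get(rule, 0.0)
    op, weight, children = rule
    scores = [evaluate(c, evidence) for c in children]
    combined = max(scores) if op == "or" else min(scores)
    return weight * combined

# Invented topic hierarchy: the topic is evidenced by either subtopic,
# each a weighted conjunction of phrases.
topic = ("or", 1.0, [
    ("and", 0.9, ["bomb", "explosion"]),
    ("and", 0.8, ["hostage", "takeover"]),
])
doc_evidence = {"bomb": 1.0, "explosion": 0.7}
score = evaluate(topic, doc_evidence)   # 0.9 * min(1.0, 0.7) = 0.63
```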
Article
Full-text available
The amount of information available to information workers recently has becomeoverwhelming. This confronts information workers with two majorproblems: finding the information needed, and accessing it; they arecalled the search problem and the access problem, respectively. Asthe main result of our research an architecture is specified of anautomated tool that provides integrated support for searching andaccessing multimedia documents that may be located at arbitraryplaces. The architecture contains a database with information aboutthe documents and with thesaurus-like information. The architecturealso contains a browse mechanism and a query mechanism for inspectingthe database. In the design process of the architecture, severalfundamental questions arose, like What is a document?and What is a medium kind?. The developed answers tosome of these questions are considered to have a general characterand thus to be useful also outside the scope of the research at hand.The paper concludes with an overview of the current status of theproject and a discussion of future work.
Conference Paper
Full-text available
Applications such as office automation, news filtering, help facilities in complex systems, and the like require the ability to retrieve documents from full-text databases where vocabulary problems can be particularly severe. Experiments performed on small collections with single-domain thesauri suggest that expanding query vectors with words that are lexically related to the original query words can ameliorate some of the problems of mismatched vocabularies. This paper examines the utility of lexical query expansion in the large, diverse TREC collection. Concepts are represented by WordNet synonym sets and are expanded by following the typed links included in WordNet. Experimental results show this query expansion technique makes little difference in retrieval effectiveness if the original queries are relatively complete descriptions of the information being sought even when the concepts to be expanded are selected by hand. Less well developed queries can be significantly improved by expansion of hand-chosen concepts. However, an automatic procedure that can approximate the set of hand picked synonym sets has yet to be devised, and expanding by the synonym sets that are automatically generated can degrade retrieval performance.
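The expansion mechanism studied above can be sketched with a stand-in synonym table; a real system would follow WordNet synonym sets and their typed links rather than this small hypothetical table:

```python
# Stand-in synonym table; a real system would consult WordNet
# synsets and follow their typed links (hypernyms, etc.).
SYNONYMS = {
    "car": ["automobile", "motorcar"],
    "fast": ["quick", "rapid"],
}

def expand_query(terms, table=SYNONYMS):
    """Expand each query term with its lexically related words,
    keeping the original term first and avoiding duplicates."""
    expanded = []
    for t in terms:
        for word in [t] + table.get(t, []):
            if word not in expanded:
                expanded.append(word)
    return expanded

expanded = expand_query(["fast", "car"])
# ['fast', 'quick', 'rapid', 'car', 'automobile', 'motorcar']
```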
Conference Paper
Full-text available
We present a deductive data model for concept-based query expansion. It is based on three abstraction levels: the conceptual, the expression and the occurrence level. Concepts and their relationships are represented at the conceptual level. The expression level represents natural language expressions for concepts. Each expression has one or more matching models at the occurrence level. Each model specifies the matching of the expression in database indices built in varying ways. The data model supports a concept-based query expansion and formulation tool, the ExpansionTool, for environments providing heterogeneous IR systems. Expansion is controlled by adjustable matching reliability.
Conference Paper
Full-text available
Most casual users of IR systems type short queries. Recent research has shown that adding new words to these queries via ad-hoc feedback improves the retrieval effectiveness of such queries. We investigate ways to improve this query expansion process by refining the set of documents used in feedback. We start by using manually formulated Boolean filters along with proximity constraints. Our approach is similar to the one proposed by Hearst [12]. Next, we investigate a completely automatic method that makes use of term co-occurrence information to estimate word correlation. Experimental results show that refining the set of documents used in query expansion often prevents the query drift caused by blind expansion and yields substantial improvements in retrieval effectiveness, both in terms of average precision and precision in the top twenty documents. More importantly, the fully automatic approach developed in this study performs competitively with the best manual approach and requires little computational overhead.
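The automatic method described above relies on term co-occurrence statistics gathered from the feedback documents. A rough sketch of the idea, with invented documents and a simplified scoring rule (the paper's actual correlation estimate is more elaborate):

```python
from collections import Counter

def cooccurrence_scores(docs, query_terms):
    """Score candidate expansion terms by their co-occurrence with
    the original query terms inside the feedback documents."""
    query_terms = set(query_terms)
    scores = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        hits = words & query_terms
        if hits:                        # document mentions the query
            for w in words - query_terms:
                scores[w] += len(hits)
    return scores

# Invented feedback documents:
docs = [
    "query expansion improves retrieval",
    "blind expansion causes query drift",
    "unrelated document about cooking",
]
scores = cooccurrence_scores(docs, {"query", "expansion"})
```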
Conference Paper
Full-text available
In practical retrieval environments the assumption is normally made that the terms assigned to the documents of a collection occur independently of each other. The term independence assumption is unrealistic in many cases, but its use leads to a simple retrieval algorithm. More realistic retrieval systems take into account dependencies between certain term pairs and possibly between term triples. In this study, methods are outlined for generating dependency factors for term pairs and term triples and for using them in retrieval. Evaluation output is included to demonstrate the effectiveness of the suggested methodologies.
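A simple dependency factor for a term pair compares the observed joint document frequency with the frequency expected under the independence assumption; values above 1 indicate positive dependence. A sketch under that definition (the exact factors used in the study may differ):

```python
def dependency_factor(docs, a, b):
    """Ratio of the observed joint document frequency of terms a and b
    to the frequency expected if the terms occurred independently."""
    N = len(docs)
    df_a = sum(1 for d in docs if a in d)
    df_b = sum(1 for d in docs if b in d)
    df_ab = sum(1 for d in docs if a in d and b in d)
    if df_a == 0 or df_b == 0:
        return 0.0
    return (df_ab / N) / ((df_a / N) * (df_b / N))

# Invented collection: each document is its set of index terms.
docs = [{"new", "york"}, {"new", "york", "times"}, {"new"}, {"deal"}]
factor = dependency_factor(docs, "new", "york")   # > 1: positive dependence
```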
Conference Paper
Full-text available
The paper describes how 3D-based visualization and interaction techniques can be used in information retrieval user interfaces. It demonstrates that information retrieval systems based on state-of-the-art retrieval models can be supplied with intuitive interface functionality by applying the cone-tree metaphor for the visualization of content spaces. The natural ability of humans for spatial perception, orientation and spatial memory is outlined as an advantage in the process of perceiving information spaces by means of spatial metaphors. It is shown how the models, concepts and mechanisms of the retrieval system underlying its user interface can become more transparent and perceptible for the user at the interface level, and how some of the cognitive costs of navigation in information spaces can be reduced. This goal is achieved by transforming document-term networks with two levels of abstraction into hierarchical and directed cone trees that use the spatial depth of 3D to achieve an easy perception of their topological structure. Finally, the paper presents a 3D-based information system user interface which gives the user intuitive control over content-oriented search paths, resulting in query generation for the underlying automatic retrieval mechanisms. In order to improve the presentation of the query results as well, it is shown that 3D-based graphical metaphors can provide very intuitive ways of perceiving the relevance of the result in accordance with the query.
Article
Full-text available
We analyzed transaction logs of a set of 51,473 queries posed by 18,113 users of Excite, a major Internet search service. We provide data on: (i) queries --- the number of search terms, and the use of logic and modifiers, (ii) sessions --- changes in queries during a session, number of pages viewed, and use of relevance feedback, and (iii) terms --- their rank/frequency distribution and the most highly used search terms. Common mistakes are also observed. Implications are discussed.
Article
Full-text available
The results are presented of an experimental investigation into the effects that some forms of query expansion, by term addition or term deletion, have on the retrieval effectiveness of a document retrieval system. The overall search strategy used by a user is an iterative process whereby a set of user-judged relevant documents at any point in the search is used to refine and improve the remainder of the user's search. At some point during the search, the set of relevant documents found so far can be used to modify the original query, either by the addition or deletion of search terms. This process is called 'query modification' or 'query expansion'. A number of different types of query modification strategies are tried and the results obtained are presented and analyzed.
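The query-modification loop described above (adding terms that are frequent in the user-judged relevant documents, and deleting original terms those documents do not support) can be sketched as follows; the function, thresholds, and example are illustrative, not the paper's exact strategies:

```python
from collections import Counter

def modify_query(query_terms, relevant_docs, add_n=2, min_df=2):
    """Modify a query using user-judged relevant documents:
    drop original terms absent from all of them (term deletion) and
    add new terms frequent in them (term addition)."""
    df = Counter()
    for doc in relevant_docs:
        df.update(set(doc.lower().split()))
    kept = [t for t in query_terms if df[t] > 0]
    candidates = [t for t, c in df.most_common()
                  if c >= min_df and t not in query_terms]
    return kept + candidates[:add_n]

relevant = ["neural retrieval models", "neural ranking models rock"]
new_query = modify_query(["retrieval", "obsolete"], relevant)
```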
Article
An evaluation of a large, operational full-text document-retrieval system (containing roughly 350,000 pages of text) shows the system to be retrieving less than 20 percent of the documents relevant to a particular search. The findings are discussed in terms of the theory and practice of full-text document retrieval.
Article
As information is both the source and the product of all legal work, legal practitioners are crucially dependent on good access to legal information. The legal databases that are currently available to store and retrieve statutes, judicial decisions, and legal literature, however, often pose problems to the users with the effect that they cannot find the information they need. To a large extent, these problems can be attributed to the limitations of the traditional Boolean query mechanism used in text databases which is difficult for users to operate. As a possible solution for these problems, in this book an architecture is proposed for an intelligent interface to legal databases. An intelligent interface uses knowledge of the task domain of legal practitioners to operate as an intelligent intermediary between the user and the database. Interfacing between Lawyers and Computers addresses the most pressing issues that need to be resolved to allow for the development of advanced user-friendly legal databases. It involves a study into the nature of legal information handling and the representation of legal knowledge. In addition to the theoretical study, it is demonstrated, with the development of a prototype, how the architecture can be used for the development of practical legal information retrieval systems. The results of these studies are interpreted to discuss a number of possible applications for the intelligent interface architecture, including the publication of government information and the organisation of legal information on the Internet. Luuk Matthijssen, who studied business informatics, worked at the law faculty of Tilburg University and the computer science department of Eindhoven University of Technology. The author considers the problems of legal information retrieval from the perspective of an information analyst.
Article
We investigated the sources and effectiveness of search terms used during mediated on-line searching under real-life (as opposed to laboratory) circumstances. A stratified model of information retrieval (IR) interaction served as a framework for the analysis. For the analysis, we used the on-line transaction logs, videotapes, and transcribed dialogue of the presearch and on-line interaction between 40 users and 4 professional intermediaries. Each user provided one question and interacted with one of the four intermediaries. Searching was done using DIALOG. Five sources of search terms were identified: (1) the users' written question statements, (2) terms derived from users' domain knowledge during the interaction, (3) terms extracted from retrieved items as relevance feedback, (4) database thesaurus, and (5) terms derived by intermediaries during the interaction. Distribution, retrieval effectiveness, transition sequences, and correlation of search terms from different sources were investigated. Search terms from users' written question statements and term relevance feedback were the most productive sources of terms contributing to the retrieval of items judged relevant by users. Implications of the findings are discussed. © 1997 John Wiley & Sons, Inc.
Article
The most effective method of improving the retrieval performance of a document retrieval system is to acquire a detailed specification of the user's information need. The system described in this article, I3R, provides a number of facilities and search strategies based on this approach. The system uses a novel architecture to allow more than one system facility to be used at a given stage of a search session. Users influence the system actions by stating goals they wish to achieve, by evaluating system output, and by choosing particular facilities directly. The other main features of I3R are an emphasis on domain knowledge used for refining the model of the information need, and the provision of a browsing mechanism that allows the user to navigate through the knowledge base. © 1987 John Wiley & Sons, Inc.
Article
Automatic feedback methods may be used in online information retrieval to generate improved query statements based on information contained in previously retrieved documents. In this study automatic relevance feedback techniques are applied to Boolean query statements. The feedback operations are carried out using both the conventional Boolean logic, as well as an extended logic producing improved retrieval effectiveness. Experimental output is included to evaluate the automatic feedback operations.
Chapter
This paper presents an integrated Four Tools approach to providing an ER-based Information Retrieval Interface for mixed database environments, including relational, text, and picture databases. The Four Tools approach supports several types of interaction: formalised querying, exploratory ad-hoc browsing, schema navigation, and entity inspection. A combination of ER-models and hypermedia is applied in order to support exploratory searches following hyperlinks, while maintaining orientation in the information space using the ER-model. The paper shows how previous research on ER-based Query Interfaces can be extended to the field of multimedia databases.
Article
Authors and searchers usually express the same things in many different ways, which causes problems in free text searching of text databases. Thus, a switching tool connecting the different names of one concept is needed. This study tests the effectiveness of a thesaurus as a search-aid in free text searching of a full text database. A set of queries was searched against a large full text database of newspaper articles. The search-aid thesaurus constructed for the test contains the usual relationships of a thesaurus, namely equivalence, hierarchical, and associative relationships. Each query was searched in five distinct modes: basic search, synonym search, narrower term search, related term search, and union of all previous searches. The basic searches contained only terms included in the original query statements. In the synonym searches, the terms of the basic search were extended by disjunction of the synonyms given by the search-aid thesaurus without modifying the overall logic of the basic search. Likewise, the basic search was extended in turn with the narrower terms and with the related terms given by the search-aid thesaurus. The last search mode included the basic terms and all the terms used in the previous searches. The searches were analyzed in terms of relative recall and precision; relative recall was estimated by setting the recall of the union search to 100%. On the average the value of relative recall was 47.2% in the basic search, compared with 100% in the union search; the average value of precision decreased only from 62.5% in the basic search to 51.2% in the union search.
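The evaluation measures used in this study can be sketched directly. Below is a minimal Python illustration with hypothetical document IDs; `relative_recall` treats the relevant documents found by the union search as the 100% recall base, as the study does:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def relative_recall(retrieved, relevant, recall_base):
    """Recall estimated against the relevant documents found by the
    union search, whose recall is set to 100%."""
    base = set(relevant) & set(recall_base)
    if not base:
        return 0.0
    return len(set(retrieved) & base) / len(base)

# Hypothetical document IDs: the union search defines the recall base.
union_hits = [1, 2, 3, 4, 5, 6, 7, 8]
relevant   = [1, 2, 3, 4, 5]
basic_hits = [1, 2, 9, 10]

print(precision(basic_hits, relevant))                    # 0.5
print(relative_recall(basic_hits, relevant, union_hits))  # 0.4
```

Because the recall base is only an estimate of all relevant documents, relative recall can overstate true recall; the study's 47.2% versus 100% figures should be read with this in mind.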
Article
We present in this paper a navigation approach using a combination of functionalities encountered in classification processes, Hypertext Systems, and Information Retrieval Systems. Its originality lies in the cooperation of these mechanisms to restrict the consultation universe, to locate the searched information faster, and to tackle the problem of disorientation when consulting the restricted Hypergraph of retrieved information. A first version of the SYRIUS system, integrating both Hypertext and Information Retrieval functionalities, has been developed; we have called it a Hypertext Information Retrieval System (H.I.R.S.). This version has been extended using classification mechanisms. The graphic interface of this new system version is presented here. Querying the system is done through a common visual representation of the database Hypergraph. The visualization of the Hypergraph can be parameterized, focusing on several levels (classes, links, ...).
Article
Information retrieval systems often rely on thesauri or semantic nets in indexing documents and in helping users search for documents. Reasoning with these thesauri resembles traversing a graph. Several algorithms for matching documents to queries based on the distances between nodes on the graph (terms in the thesaurus) are compared to the evaluations of people. The “broader-than” relationships of both a medical and a computer science thesaurus when coupled with a simple average path length algorithm are able to simulate the decisions of people regarding the conceptual similarity of documents and queries. A graphical presentation of a thesaurus is connected to a multi-window document retrieval system and its ease of use is compared to a more traditional thesaurus-based information retrieval system. While substantial evidence exists that the graphics and multiple windows can be useful, our experiments have shown, as have many other human-computer interface experiments, that a multitude of factors come into play in determining the value of a particular interface.
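The simple average-path-length matching described above can be sketched as breadth-first search over the thesaurus graph. The mini-thesaurus below is hypothetical; its "broader-than" links are stored symmetrically so the graph can be traversed in both directions:

```python
from collections import deque

def shortest_path_length(graph, start, goal):
    """BFS distance between two terms in the thesaurus graph."""
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # terms are not connected in the thesaurus

def average_path_length(graph, query_terms, doc_terms):
    """Mean pairwise distance; smaller means conceptually closer."""
    dists = [shortest_path_length(graph, q, d)
             for q in query_terms for d in doc_terms]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists) if dists else float("inf")

# Hypothetical mini-thesaurus; broader/narrower links stored both ways.
thesaurus = {
    "disease": ["infection", "cancer"],
    "infection": ["disease", "influenza"],
    "influenza": ["infection"],
    "cancer": ["disease"],
}
print(average_path_length(thesaurus, ["influenza"], ["cancer"]))  # 3.0
```

Ranking documents by ascending average path length is then one way to simulate the human similarity judgments the article reports.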
Conference Paper
The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (INQUERY). Query structure means the use of operators to express the relations between search keys. Eight different structures were tested, representing weak structures (averages and weighted averages of the weights of the keys) and strong structures (e.g., queries with more elaborated search key relations). QE was based on concepts, which were first selected from a conceptual model, and then expanded by semantic relationships given in the model. The expansion levels were (a) no expansion, (b) a synonym expansion, (c) a narrower concept expansion, (d) an associative concept expansion, and (e) a cumulative expansion of all other expansions. With weak structures and Boolean structured queries, QE was not very effective. The best performance was achieved with one of the strong structures at the largest expansion level.
Conference Paper
To improve information retrieval effectiveness, research into both algorithmic and human approaches to query expansion is required. This paper uses the human approach to examine the selection and effectiveness of sources of search terms for query expansion. The results show that the most effective sources were the users' written question statements, terms derived by users during the interaction, and terms selected from particular database fields. These findings indicate the need for the design and testing of automatic relevance feedback techniques that place greater emphasis on these sources.
Article
Neural networks have recently been proposed for the construction of navigation interfaces for Information Retrieval systems. In this paper, we give an overview of some current research in this area. Most of the cited approaches use (variants) of the well-known Kohonen network. The Kohonen network implements a topology-preserving dimensionality-reducing mapping, which can be applied for information visualization. We identify a number of problems in the application of Kohonen networks for Information Retrieval, most notably scalability, reliability and retrieval effectiveness. To solve these problems we propose to use the Growing Cell Structures network, a variant of the Kohonen network which shows a more flexible adaptation to the domain structure. This network was tested on two standard test-collections, using a combined recall and precision measure, and compared to traditional IR methods such as the Vector Space Model and various clustering algorithms. The network performs at a competitive level of effectiveness, and is suitable for visualization purposes. However, the incremental training procedures for the networks result in a reliability problem, and the approach is computationally intensive. Also, the utility of the resulting maps for navigation will need further improvement.
Article
Relevance feedback is an automatic process, introduced over 20 years ago, designed to produce improved query formulations following an initial retrieval operation. The principal relevance feedback methods described over the years are examined briefly, and evaluation data are included to demonstrate the effectiveness of the various methods. Prescriptions are given for conducting text retrieval operations iteratively using relevance feedback. © 1990 John Wiley & Sons, Inc.
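Among the principal relevance feedback methods the article surveys is Rocchio-style query modification, which moves the query term weights toward the centroid of judged-relevant documents and away from judged-non-relevant ones. Below is a minimal sketch over term-weight dictionaries; the example weights and documents are hypothetical:

```python
def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """One round of Rocchio feedback:
    q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant),
    with negative weights clipped to zero."""
    terms = set(query)
    for d in relevant_docs + nonrelevant_docs:
        terms |= set(d)

    def centroid(docs, t):
        return sum(d.get(t, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * centroid(relevant_docs, t)
             - gamma * centroid(nonrelevant_docs, t))
        if w > 0:
            new_query[t] = w
    return new_query

# Hypothetical feedback round: one relevant, one non-relevant document.
q = {"retrieval": 1.0}
new_q = rocchio(q, [{"retrieval": 1.0, "feedback": 1.0}], [{"hardware": 1.0}])
print(sorted(new_q.items()))  # [('feedback', 0.75), ('retrieval', 1.75)]
```

Iterating this process over successive retrieval rounds is the prescription the article gives for conducting text retrieval operations with relevance feedback.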
Article
We discuss whether it is feasible to build intelligent rule- or weight-based algorithms into general-purpose software for interactive thesaurus navigation. We survey some approaches to the problem reported in the literature, particularly those involving the assignment of "link weights" in a thesaurus network, and point out some problems of both principle and practice. We then describe investigations which entailed logging the behavior of thesaurus users and testing the effect of thesaurus-based query enhancement in an IR system using term weighting, in an attempt to identify successful strategies to incorporate into automatic procedures. The results cause us to question many of the assumptions made by previous researchers in this area. © 1995 John Wiley & Sons, Inc.
Article
Natural language processing will grow into a vital industrial technology in the next five to 10 years. But this growth depends on the development of large linguistic databases that capture natural language phenomena [1, 2]. Another important theme for future work is development of large knowledge bases that are shared widely by different groups. One promising approach to such knowledge bases draws on natural language processing and linguistic knowledge. This article describes the EDR Electronic Dictionary [3], which seeks to provide a foundation for linguistic databases, and explains the relation of electronic dictionaries to very large knowledge bases.
Article
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
Article
...queries using all search facets identified from requests, low complexity was achieved by formulating queries with major facets only. Query expansion was based on a thesaurus, from which the expansion keys were elicited for queries. There were five expansion types: (1) the first query version was an unexpanded, original query with one search key for each search concept (original search concepts) elicited from the test thesaurus; (2) the synonyms of the original search keys were added to the original query; (3) search keys representing the narrower concepts of the original search concepts were added to the original query; (4) search keys representing the associative concepts of the original search concepts were added to the original query; (5) all previous expansion keys were cumulatively added to the original query. Query structure refers to the syntactic structure of a query expression, marked with query operators and parentheses. The structure of queries was either weak (queries with n...
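The expansion types described here add thesaurus variants to each search concept by disjunction, leaving the overall conjunctive logic of the query unchanged. A minimal sketch of this construction (the thesaurus entries below are hypothetical):

```python
def expand_query(concepts, thesaurus):
    """AND the search concepts together; OR each concept with its
    thesaurus expansions (synonyms, narrower or associative terms),
    without modifying the overall query logic."""
    groups = []
    for concept in concepts:
        variants = [concept] + thesaurus.get(concept, [])
        groups.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(groups)

# Hypothetical thesaurus entries for a two-concept query.
syn = {"car": ["automobile"], "tax": ["levy", "duty"]}
print(expand_query(["car", "tax"], syn))
# (car OR automobile) AND (tax OR levy OR duty)
```

Running the same function with synonym, narrower-term, and associative-term entries in turn reproduces the study's distinct expansion modes.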
Article
Dictionary methods for cross-language information retrieval have shown performance below that for monolingual retrieval. Failure to translate multi-term phrases has been shown to be one of the factors responsible for the errors associated with dictionary methods. First, we study the importance of phrasal translation for this approach. Second, we explore the role of phrases in query expansion via local context analysis and local feedback and show how they can be used to significantly reduce the error associated with automatic dictionary translation. Introduction The development of IR systems for languages other than English has focused on building mono-lingual systems. Increased availability of on-line text in languages other than English and increased multi-national collaboration have motivated research in cross-language information retrieval (CLIR) - the development of systems to perform retrieval across languages. There have been three main approaches to CLIR: text translation via m...
Article
Visualization is a promising technique for both enhancing users' perception of structure in the Internet and providing navigation facilities for its large information spaces. This paper describes an application of the Document Explorer to the visualization of WWW content structure. The system provides visualization, browsing, and query formulation mechanisms based on documents' semantic content. These mechanisms complement text and link based search by supplying a visual search and query formulation environment using semantic associations among documents. The user can view and interact with visual representations of WWW document relations to traverse this derived document space. The relationships among individual keywords in the documents are also represented visually to support query formulation by direct manipulation of content words in the document set. A suite of navigation and orientation tools is provided which focuses on orientation and navigation using the visual representation...
Article
As larger and more heterogeneous text databases become available, information retrieval research will depend on the development of powerful, efficient and flexible retrieval engines. In this paper, we describe a retrieval system (INQUERY) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation. INQUERY has been used successfully with databases containing nearly 400,000 documents. 1 Introduction The increasing interest in sophisticated information retrieval (IR) techniques has led to a number of large text databases becoming available for research. The size of these databases, both in terms of the number of documents in them, and the length of the documents that are typically full text, has presented significant challenges to IR researchers who are used to experimenting with two or three thousand document abstracts. In order to carry out research with different types of text representations, retrieval mo...
Article
The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 3, performing runs in the routing, ad-hoc, and foreign language environments. Our major focus is massive query expansion: adding from 300 to 530 terms to each query. These terms come from known relevant documents in the case of routing, and from just the top retrieved documents in the case of ad-hoc and Spanish. This approach improves effectiveness from 7% to 25% in the various experiments. Other ad-hoc work extends our investigations into combining global similarities, giving an overall indication of how a document matches a query, with local similarities identifying a smaller part of the document which matches the query. Using an overlapping text window definition of "local", we achieve a 16% improvement. Introduction For over 30 years, the Smart project at Cornell University has been interested in the analy...
Article
Bilingual transfer dictionaries are an important resource for query translation in cross-language text retrieval. However, term translation is not an isomorphic process, so dictionary-based systems must address the problem of ambiguity in language translation. In this paper, we claim that Boolean conjunction (the AND operator) provides simple and automatic disambiguation in the target language. We derive a new weighted Boolean model based on a probabilistic formulation and apply it to the cross-language text retrieval problem. The results suggest that the weighted Boolean model is highly effective for general text retrieval, but more experimental evidence is needed to conclude that it is particularly advantageous for cross-language application. Nonetheless, the preliminary results are quite promising. 1 Introduction With the ongoing development of multilingual information retrieval systems, researchers are becoming increasingly interested in the problem of cross-language information retrie...
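The disambiguating effect of conjunction can be illustrated with a soft-AND score. This is an illustrative simplification, not the paper's exact probabilistic model, and the translation alternatives below are hypothetical:

```python
def soft_and_score(doc_terms, translation_sets):
    """Soft conjunction: a document scores the product, over the
    source-language query terms, of the best-matching weight among that
    term's translation alternatives. A facet with no matching alternative
    zeroes the score, which is what gives AND its disambiguating effect:
    wrong translations simply fail to co-occur."""
    score = 1.0
    for alternatives in translation_sets:
        best = max((doc_terms.get(t, 0.0) for t in alternatives), default=0.0)
        score *= best
    return score

# Hypothetical translation alternatives for a two-word query.
translations = [{"bank", "shore"}, {"interest", "curiosity"}]
doc = {"bank": 0.8, "interest": 0.5}
print(soft_and_score(doc, translations))  # 0.4
```

A document matching only one facet (e.g. "shore" without any sense of "interest") scores zero, so spurious translation senses are filtered out without explicit sense disambiguation.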
Spink A. Term relevance and query expansion: Relation to design. In: Croft WB and van Rijsbergen CJ (eds).
Salton G and McGill MJ (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY.
Smeaton AF and van Rijsbergen CJ (1983) The retrieval effects of query expansion on a feedback document retrieval system. Computer Journal, 26:239–246.
Chong A. Topic: A concept-base document retrieval system.
Van Der Pol RW. Dipe-R: A knowledge representation language for formulating queries in Information Retrieval.
Wiesman F, Hasman A and van der Meulen M. Uniform access to multiple sources of information.