Compression algorithms are important for data-oriented tasks, especially in
the era of Big Data. Modern processors equipped with powerful SIMD instruction
sets give us an opportunity to achieve better compression performance.
Previous research has shown that SIMD-based optimizations can multiply decoding
speeds. Following these pioneering studies, we propose a general approach to
accelerate compression algorithms. By instantiating the approach, we have
developed several novel integer compression algorithms, called Group-Simple,
Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding
vectorized versions. We evaluate the proposed algorithms on two public TREC
datasets, a Wikipedia dataset and a Twitter dataset. With competitive
compression ratios and encoding speeds, our SIMD-based algorithms outperform
state-of-the-art non-vectorized algorithms with respect to decoding speeds.
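The abstract does not spell out the Group-Simple, Group-Scheme, Group-AFOR, or Group-PFD formats, but they belong to the same family as the well-known group varint encoding, in which one descriptor byte records the byte lengths of a group of four integers so that a decoder can unpack the group without per-integer branching. The following is an illustrative scalar (non-SIMD) group codec of our own, not the authors' algorithms:

```python
def group_encode(nums):
    """Encode 4 unsigned ints (< 2**32): one descriptor byte holding
    2 bits per value (byte length minus 1), followed by the payloads."""
    assert len(nums) == 4
    out = bytearray(1)  # reserve the descriptor byte
    desc = 0
    for i, n in enumerate(nums):
        length = max(1, (n.bit_length() + 7) // 8)
        assert length <= 4
        desc |= (length - 1) << (2 * i)
        out += n.to_bytes(length, "little")
    out[0] = desc
    return bytes(out)

def group_decode(buf):
    """Decode one group: read the descriptor, then slice each payload."""
    desc, pos, nums = buf[0], 1, []
    for i in range(4):
        length = ((desc >> (2 * i)) & 3) + 1
        nums.append(int.from_bytes(buf[pos:pos + length], "little"))
        pos += length
    return nums
```

Because all four lengths are known from a single descriptor byte, a vectorized decoder can turn the length pattern into a shuffle mask and unpack the whole group with a handful of SIMD instructions, which is the general effect the abstract describes.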
Recall, the proportion of relevant documents retrieved, is an important
measure of effectiveness in information retrieval, particularly in the legal,
patent, and medical domains. Where document sets are too large for exhaustive
relevance assessment, recall can be estimated by assessing a random sample of
documents; but an indication of the reliability of this estimate is also
required. In this article, we examine several methods for estimating two-tailed
recall confidence intervals. We find that the normal approximation in current
use provides poor coverage in many circumstances, even when adjusted to correct
its inappropriate symmetry. Analytic and Bayesian methods based on the ratio of
binomials are generally more accurate, but are inaccurate on small populations.
The method we recommend derives beta-binomial posteriors on retrieved and
unretrieved yield, with fixed hyperparameters, and a Monte Carlo estimate of
the posterior distribution of recall. We demonstrate that this method gives
mean coverage at or near the nominal level, across several scenarios, while
being balanced and stable. We offer advice on sampling design, including the
allocation of assessments to the retrieved and unretrieved segments, and
compare the proposed beta-binomial with the officially reported normal
intervals for recent TREC Legal Track iterations.
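As a rough sketch of the recommended method, the recall posterior can be simulated by drawing segment prevalences from Beta posteriors and converting them to yields. The code below is our own illustration with hypothetical segment sizes, the common uniform Beta(1, 1) hyperparameters, and a simplification: it uses each segment's expected yield rather than a full beta-binomial draw of the unassessed counts.

```python
import random

def recall_interval(N_ret, n_ret, r_ret, N_unret, n_unret, r_unret,
                    alpha=1.0, beta=1.0, draws=20000, level=0.95, seed=0):
    """Monte Carlo posterior interval on recall.

    N_*: segment sizes; n_*: sampled documents; r_*: relevant in sample.
    Simplified sketch: yields are taken as prevalence times segment size
    instead of being drawn beta-binomially."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        p_ret = rng.betavariate(alpha + r_ret, beta + n_ret - r_ret)
        p_unret = rng.betavariate(alpha + r_unret, beta + n_unret - r_unret)
        yield_ret = p_ret * N_ret        # relevant documents retrieved
        yield_unret = p_unret * N_unret  # relevant documents missed
        samples.append(yield_ret / (yield_ret + yield_unret))
    samples.sort()
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi
```

Note how the width of the interval is dominated by the unretrieved segment: with few assessments there, the Beta posterior on its prevalence is wide, which is why assessment allocation between the segments matters.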
In spite of its tremendous value, metadata is generally sparse and incomplete, thereby hampering the effectiveness of digital information services. Many of the existing mechanisms for the automated creation of metadata rely primarily on content analysis which can be costly and inefficient. The automatic metadata generation system proposed in this article leverages resource relationships generated from existing metadata as a medium for propagation from metadata-rich to metadata-poor resources. Because of its independence from content analysis, it can be applied to a wide variety of resource media types and is shown to be computationally inexpensive. The proposed method operates through two distinct phases. Occurrence and co-occurrence algorithms first generate an associative network of repository resources leveraging existing repository metadata. Second, using the associative network as a substrate, metadata associated with metadata-rich resources is propagated to metadata-poor resources by means of a discrete-form spreading activation algorithm. This article discusses the general framework for building associative networks, an algorithm for disseminating metadata through such networks, and the results of an experiment and validation of the proposed method using a standard bibliographic dataset.
We consider information retrieval when the data, for instance multimedia, is computationally expensive to fetch. Our approach uses "information filters" to considerably narrow the universe of possibilities before retrieval. We are especially interested in redundant information filters that save time over more general but more costly filters. Efficient retrieval requires that decisions be made about the necessity, order, and concurrent processing of proposed filters (an "execution plan"). We develop simple polynomial-time local criteria for optimal execution plans, and show that most forms of concurrency are suboptimal with information filters. Although the general problem of finding an optimal execution plan is likely exponential in the number of filters, we show experimentally that our local optimality criteria, used in a polynomial-time algorithm, nearly always find the global optimum with 15 filters or fewer, a sufficient number of filters for most applications. Our methods do not require special hardware and avoid the high processor idleness that is characteristic of massively parallel solutions to this problem. We apply our ideas to an important application, information retrieval of captioned data using natural-language understanding, a problem for which the natural-language processing can be the bottleneck if not implemented well.
The multinomial language model has been one of the most effective models of retrieval for more than a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term dependency—that is, the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Pólya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly.
We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is more robust than that in the multinomial language model. Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model adheres to the constraint. Finally, we show that the new language model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes.
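The Dirichlet compound multinomial has a closed-form likelihood in terms of gamma functions. The sketch below is our own illustration with made-up hyperparameters; it shows the burstiness property the abstract emphasizes: under the DCM, a repeated term is more probable than two distinct terms with the same hyperparameters.

```python
from math import lgamma

def dcm_loglik(counts, alpha):
    """Log-likelihood of a bag of term counts under the Dirichlet
    compound multinomial (multivariate Polya) with per-term
    hyperparameters alpha, up to the multinomial coefficient."""
    A = sum(alpha.values())
    n = sum(counts.values())
    ll = lgamma(A) - lgamma(A + n)
    for w, a in alpha.items():
        ll += lgamma(a + counts.get(w, 0)) - lgamma(a)
    return ll
```

With small symmetric hyperparameters (e.g. 0.1 per term), `dcm_loglik({'a': 2}, alpha)` exceeds `dcm_loglik({'a': 1, 'b': 1}, alpha)`: once a term has appeared, its reinforced probability of appearing again rises, which a plain multinomial cannot capture.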
The standard approach for term frequency normalization is based only on the
document length. However, it does not distinguish verbosity from scope, the
two main factors that determine document length. Because the
verbosity and scope have largely different effects on the increase in term
frequency, the standard approach can easily suffer from insufficient or
excessive penalization depending on the specific type of long document. To
overcome these problems, this paper proposes two-stage normalization by
performing verbosity and scope normalization separately, and by employing
different penalization functions. In verbosity normalization, each document is
pre-normalized by dividing the term frequency by the verbosity of the document.
In scope normalization, an existing retrieval model is applied in a
straightforward manner to the pre-normalized document, finally leading us to
formulate our proposed verbosity normalized (VN) retrieval model. Experimental
results carried out on standard TREC collections demonstrate that the VN model
leads to marginal but statistically significant improvements over standard
retrieval models.
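A minimal sketch of the two-stage idea follows, assuming a simple verbosity proxy (the mean frequency of a document's distinct terms) and BM25 as the existing retrieval model for the scope stage; both choices are illustrative, not the paper's exact formulation.

```python
from math import log

def vn_tf(tf, doc_counts):
    """Stage 1 (verbosity normalization): divide the raw term frequency
    by a verbosity proxy, here the mean frequency of the document's
    distinct terms (a hypothetical choice)."""
    verbosity = sum(doc_counts.values()) / len(doc_counts)
    return tf / verbosity

def bm25_score(tf_norm, scope, avg_scope, df, N, k1=1.2, b=0.75):
    """Stage 2 (scope normalization): apply an existing model, BM25 here,
    with scope (number of distinct terms) in place of document length."""
    K = k1 * (1 - b + b * scope / avg_scope)
    idf = log((N - df + 0.5) / (df + 0.5))
    return idf * tf_norm * (k1 + 1) / (tf_norm + K)
```

Scaling every term frequency of a document by the same factor leaves the stage-1 output unchanged, so a document that is merely more verbose, rather than broader in scope, is not penalized twice.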
In this report we present the language TROLL. It is a language particularly suited for use in the early stages of information system design, where the problem domain or Universe of Discourse must be described. In TROLL, the description of the static and dynamic aspects of entities is integrated in object descriptions. TROLL is based on sublanguages for data terms, for first-order and temporal assertions, and for processes. These sublanguages are the tools to describe static properties, behavior, and evolution over time of objects. Object descriptions can be related in many ways. The abstraction mechanisms provided by TROLL are classification, specialization, generalization, roles, and aggregation. Abstraction mechanisms together with the basic structuring in objects help in organizing the system design. In order to support the composition of system descriptions from component descriptions, TROLL provides language features to state interactions and dependencies between components ...
ing methods; fractals; information visualization; program display; UI theory. 1. INTRODUCTION As computer systems evolve, the capacity for storing and managing information keeps increasing. At the same time, computer users must view growing amounts of information through video displays that are physically limited in size. Displaying information effectively is a main concern in many software applications. For example, in visual programming systems [Shu 1988], graphic representations become very complex as the number of visual elements increases. In hypertext ... (Here "information" means a structured set of primitive elements, specific to each application.)
"Fish tank virtual reality" refers to the use of a standard graphics workstation to achieve real-time display of three-dimensional scenes using stereopsis and dynamic head-coupled perspective. Fish tank VR has a number of advantages over head-mounted immersion VR which make it more practical for many applications. After discussing the characteristics of fish tank VR, we describe a set of three experiments conducted to study the benefits of fish tank VR over a traditional workstation graphics display. These experiments tested user performance under two conditions: (a) whether or not stereoscopic display was used and (b) whether or not the perspective display was coupled dynamically to the positions of a user's eyes. Subjects using a comparison protocol consistently preferred headcoupling without stereo over stereo without head-coupling. Error rates in a tree tracing task similar to one used by Sollenberger and Milgram showed an order of magnitude improvement for headcoupled stereo over ...
Digital libraries (DLs) are complex information systems and therefore demand formal foundations lest development efforts diverge and interoperability suffers. In this paper, we propose the fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which help define digital libraries rigorously and usefully. Streams are sequences of abstract items used to describe static and dynamic content. Structures can be defined as labeled directed graphs, which impose organization. Spaces are sets of abstract items and operations on those sets that obey certain rules. Scenarios consist of sequences of events or actions that modify states of a computation in order to accomplish a functional requirement. Societies comprehend entities and the relationships between and among them. Together these abstractions relate and unify concepts, among others, of digital objects, metadata, collections, and services required to formalize and elucidate "digital libraries". The applicability, versatility, and unifying power of the theory are demonstrated through its use in three distinct applications: building and interpretation of a DL taxonomy, analysis of case studies of digital libraries, and utilization as a formal basis for a DL description language. Keywords: digital libraries, theory, foundations, definitions, applications. 1 Motivation Digital libraries are extremely complex information systems. The proper concept of a digital library seems hard to completely understand and evades definitional consensus. Different views (e.g., historical, technological) and perspectives (e.g., from the library and information science, information retrieval, or human-computer interaction communities) have led to a myriad of differing definitions. Licklider, in his seminal ...
In this article, we propose to use inductive evaluation for functional benchmarking of IR models as a complement to the traditional experiment-based performance benchmarking. We define a functional benchmark suite in two stages: the evaluation criteria, based on the notion of "aboutness," and the formal evaluation methodology using the criteria. The proposed benchmark has been successfully applied to evaluate various well-known classical and logic-based IR models. The functional benchmarking results allow us to compare and analyze the functionality of the different IR models.
Abstraction, Inductive Learning and Probabilistic Assumptions. Norbert Fuhr, Ulrich Pfeifer, University of Dortmund, Dortmund, Germany. Categories and Subject Descriptors: G.1.2 [Numerical Analysis] Approximation -- nonlinear approximation, least squares approximation; H.3.1 [Information Storage and Retrieval] Content Analysis and Indexing -- indexing methods; H.3.3 [Information Storage and Retrieval] Information Search and Retrieval -- retrieval models; I.2.6 [Artificial Intelligence] Learning -- parameter learning. General Terms: Experimentation, Theory. Additional Keywords and Phrases: logistic regression, probabilistic indexing, probabilistic retrieval, controlled vocabulary. Abstract: We show that former approaches in probabilistic information retrieval are based on one or two of the three concepts abstraction, inductive learning, and probabilistic assumptions, and we propose a new approach which combines all three concepts. This approach is illustrated for the case of indexing with a controlled ...
This paper presents a customizable architecture for software agents that capture and access information in large, heterogeneous, distributed electronic repositories. The key idea is to exploit underlying structure at various levels of granularity to build high-level indices with task-specific interpretations. Information agents construct such indices and are configured as a network of reusable modules called structure detectors and segmenters. We illustrate our architecture with the design and implementation of smart information filters in two contexts: retrieving stock market data from Internet newsgroups, and retrieving technical reports from Internet ftp sites. Computing Reviews Categories: software agents, information retrieval General Terms: software agents, information retrieval, information gathering Keywords: software agents, information retrieval, information gathering, table recognition Submitted: December 1994 Revised: August 1995 Accepted: February 1996 1 Introduction...
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing , which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
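Dirichlet-prior smoothing, one of the popular methods compared in such studies, interpolates document counts with the collection model. A minimal query-likelihood scorer under this smoothing (our own sketch, with an illustrative value of the prior mass mu) might look like:

```python
from math import log

def query_loglik(query, doc_counts, coll_counts, mu=2000.0):
    """Dirichlet-smoothed query log-likelihood:
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu),
    where p(w|C) is the collection language model."""
    dlen = sum(doc_counts.values())
    clen = sum(coll_counts.values())
    ll = 0.0
    for w in query:
        pw_coll = coll_counts.get(w, 0) / clen
        ll += log((doc_counts.get(w, 0) + mu * pw_coll) / (dlen + mu))
    return ll
```

The smoothing parameter mu plays two roles at once: it avoids zero probabilities for unseen query terms, and it implicitly penalizes common terms, which is part of why retrieval performance is sensitive to its setting.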
Embedded Video. ACM Transactions on Information Systems. ... characteristics of the target workstation, much in the same way that multiple projections could be generated from a single encoding of text. 4.2 Adaptive Information Object Model One of the principal control operations in the extended environment is the freedom to select individual data representations of an adaptive information object (AIO) at runtime. Our abstraction for the AIO is shown in Fig. 12. In this illustration, three alternative representations of a video clip are shown: the video clip itself (R0), an audio description with a set of still images (R1), or a text block with captioned stills (R2). Selection of a particular data set is supported through a set of control interfaces. These interfaces include representation control: the choice of a particular AIO data projection is governed by the constraints of the application and the nature of the environment. In general, support is provided for ...
In response to a query a search engine returns a ranked list of documents. If the query is on a popular topic (i.e., it matches many documents) then the returned list is usually too long to view fully. Studies show that users usually look at only the top 10 to 20 results. However, the best targets for popular topics are usually linked to by enthusiasts in the same domain which can be exploited. In this paper, we propose a novel ranking scheme for popular topics that places the most authoritative pages on the query topic at the top of the ranking. Our algorithm operates on a special index of "expert documents." These are a subset of the pages on the WWW identified as directories of links to non-affiliated sources on specific topics. Results are ranked based on the match between the query and relevant descriptive text for hyperlinks on expert pages pointing to a given result page. We present a prototype search engine that implements our ranking scheme and discuss its performance. With a relatively small (2.5 million page) expert index, our algorithm was able to perform comparably on popular queries with the best of the mainstream search engines. 1
The use of markup languages like SGML, HTML, or XML for encoding the structure of documents or linguistic data has led to many databases where entries are adequately described as trees. In this context, querying formalisms that offer the possibility to refer both to textual content and to logical structure are of interest. We consider models where the structure specified in a query is not only used as a filter, but also for selecting and presenting different parts of the data. If answers are formalized as mappings from query nodes to the database, a simple enumeration of all mappings in the answer set will often suffer from the effect that many answers have common subparts. From a theoretical point of view this may lead to an exponential time complexity of the computation and presentation of all answers. Concentrating on the language of so-called tree queries - a variant and extension of Kilpeläinen's Tree Matching formalism - we introduce the notion of a "complete answer aggregate" for a given query. This new data structure offers a compact view of the set of all answers and supports active exploration of the answer space. Since complete answer aggregates use a powerful structure-sharing mechanism, their maximal size is of order O(d · h · q), where d and q respectively denote the size of the database and the query, and h is the maximal depth of a path of the database. An algorithm is given that computes a complete answer aggregate for a given tree query in time O(d · log(d) · h · q). For the sublanguage of so-called rigid tree queries, as well as for so-called "nonrecursive" databases, an improved bound of O(d · log(d) · q) is obtained. The algorithm is based on a specific index structure that supports practical efficiency.
Similarity-based retrieval of images is an important task in many image database applications. A major class of users' requests requires retrieving those images in the database that are spatially similar to the query image. We propose an algorithm for computing the spatial similarity between two symbolic images. A symbolic image is a logical representation of the original image where the image objects are uniquely labeled with symbolic names. Spatial relationships in a symbolic image are represented as edges in a weighted graph referred to as spatial-orientation graph. Spatial similarity is then quantified in terms of the number of, as well as the extent to which, the edges of the spatial-orientation graph of the database image conform to the corresponding edges of the spatial-orientation graph of the query image.
The proposed algorithm is robust in the sense that it can deal with translation, scale, and rotational variances in images. The algorithm has quadratic time complexity in terms of the total number of objects in both the database and query images. We also introduce the idea of quantifying a system's retrieval quality by having an expert specify the expected rank ordering with respect to each query for a set of test queries. This enables us to assess the quality of algorithms comprehensively for retrieval in image databases. The characteristics of the proposed algorithm are compared with those of the previously available algorithms using a testbed of images. The comparison demonstrated that our algorithm is not only more efficient but also provides a rank ordering of images that consistently matches with the expert's expected rank ordering.
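The paper's full algorithm is not reproduced in the abstract, but the core idea of checking how edges of the database image's spatial-orientation graph conform to the corresponding edges of the query image's graph can be sketched as follows. The coordinate representation and cosine scoring here are our own illustrative choices; this simple version is translation- and scale-invariant, and handling rotational variance would require an additional alignment step.

```python
from math import atan2, cos
from itertools import combinations

def spatial_similarity(db_img, query_img):
    """Average orientation agreement, in [-1, 1], over edges between
    objects common to both symbolic images. Each image maps a symbolic
    object name to its (x, y) position."""
    common = sorted(set(db_img) & set(query_img))
    edges = list(combinations(common, 2))
    if not edges:
        return 0.0
    total = 0.0
    for a, b in edges:
        ax, ay = db_img[a]; bx, by = db_img[b]
        qax, qay = query_img[a]; qbx, qby = query_img[b]
        theta_db = atan2(by - ay, bx - ax)   # edge orientation in database image
        theta_q = atan2(qby - qay, qbx - qax)  # edge orientation in query image
        total += cos(theta_db - theta_q)     # 1 when the orientations agree
    return total / len(edges)
```

Since every pair of common objects contributes one edge, the comparison is quadratic in the number of objects, matching the complexity stated above.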
A new text compression scheme is presented in this paper. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box, and any such program can be used. Instead, the pattern is modified; only the outcome of the matching of the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster both in terms of I/O time and processing time than a search in the original file. For typical text files, we achieve about a 30% reduction in space and slightly less in search time. Compression to 70% of the original size is not competitive with good text compression schemes, so the method should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not ...
Lexical ambiguity is a pervasive problem in natural language processing. However, little quantitative information is available about the extent of the problem, or about the impact that it has on information retrieval systems. We report on an analysis of lexical ambiguity in information retrieval test collections, and on experiments to determine the utility of word meanings for separating relevant from non-relevant documents. The experiments show that there is considerable ambiguity even in a specialized database.
A growing concern for organizations and groups has been to augment their knowledge and expertise. One such augmentation is to provide an organizational memory, some record of the organization's knowledge. However, relatively little is known about how computer systems might enhance organizational, group, or community memory. This paper presents findings from a field study of one such organizational memory system, the Answer Garden. The paper discusses the usage data and qualitative evaluations from the field study, and then draws a set of lessons for next-generation organizational memory systems.
The wealth of information on the web makes it an attractive resource for seeking quick answers to simple, factual questions such as "who was the first American in space?" or "what is the second tallest mountain in the world?" Yet today's most advanced web search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate answers to such questions. In this paper, we extend question-answering techniques, first studied in the information retrieval literature, to the web and experimentally evaluate their performance. First, we introduce MULDER, which we believe to be the first general-purpose, fully-automated question-answering system available on the web. Second, we describe MULDER's architecture, which relies on multiple search-engine queries, natural-language parsing, and a novel voting procedure to yield reliable answers coupled with high recall. Finally, we compare MULDER's performance to that of Google and AskJeeves on questions drawn from the TREC-8 question answering track. We find that MULDER's recall is more than a factor of three higher than that of AskJeeves. In addition, we find that Google requires 6.6 times as much user effort to achieve the same level of recall as MULDER.
This paper proposes a framework for the handling of spatio-temporal queries with inexact matches, using the concept of relation similarity. We initially describe a binary string encoding for 1D relations that permits the automatic derivation of similarity measures. We then extend this model to various granularity levels and many dimensions, and show that reasoning on spatio-temporal structure is significantly facilitated in the new framework. Finally, we provide algorithms and optimization methods for four types of queries: (i) object retrieval based on some spatio-temporal relations with respect to a reference object, (ii) spatial joins, i.e., retrieval of object pairs that satisfy some input relation, (iii) structural queries, which retrieve configurations matching a particular spatio-temporal structure, and (iv) special cases of motion queries. Considering the current large availability of multidimensional data and the increasing need for flexible query-answering mechanisms, our techniques can be used as the core of spatio-temporal query processors
B. Cahoon, K.S. McKinley, and Z. Lu. General Terms: Experimentation, Performance. Additional Key Words and Phrases: Distributed information retrieval architectures. 1. INTRODUCTION The increasing number of large, unstructured text collections requires full-text information retrieval (IR) systems in order for users to access them effectively. Current systems typically only allow users to connect to a single database, either locally or perhaps on another machine. A distributed IR system should be able to provide multiple users with concurrent, efficient access to multiple text collections located on disparate sites. Since the d...
We describe PicASHOW, a fully automated WWW image retrieval system that is based on several link-structure analyzing algorithms. Our basic premise is that a page p displays (or links to) an image when the author of p considers the image to be of value to the viewers of the page. We thus extend some well-known link-based WWW page ranking schemes to the context of image retrieval. PicASHOW's analysis of the link structure enables it to retrieve relevant images even when those are stored in files with meaningless names. The same analysis also allows it to identify image hubs and image containers. We define these as Web pages that are rich in relevant images, or from which many images are readily accessible. PicASHOW requires no image analysis whatsoever, and no creation of taxonomies for pre-classification of the Web's images. It can be implemented by standard WWW search engines with reasonable overhead, in terms of both computations and storage, and with no change to user query formats. It can thus be used to easily add image retrieving capabilities to standard search engines. Our results demonstrate that PicASHOW, while relying almost exclusively on link analysis, compares well with dedicated WWW image retrieval systems. We conclude that link analysis, a bona-fide effective technique for Web page search, can improve the performance of Web image retrieval, as well as extend its definition to include the retrieval of image hubs and containers. Keywords: Image Retrieval; Link Structure Analysis; Hubs and Authorities; Image Hubs.
this article, we present an authorization model that can be used to express a number of discretionary access control policies for relational data management systems. The model permits both positive and negative authorizations and supports exceptions at the same time. The model is flexible in that the users can specify, for each authorization they grant, whether the authorization can allow for exceptions or whether it must be strongly obeyed. It provides authorization management for groups with exceptions at any level of the group hierarchy, and temporary suspension of authorizations. The model supports ownership together with decentralized administration of authorizations. Administrative privileges can also be restricted so that owners retain control over their tables.
We describe the results of extensive experiments on large document collections using optimized rule-based induction methods. The goal of these methods is to automatically discover classification patterns that can be used for general document categorization or personalized filtering of free text. Previous reports indicate that human-engineered rule-based systems, requiring many man-years of developmental effort, have been successfully built to "read" documents and assign topics to them. In this paper, we show that machine-generated decision rules appear comparable to human performance, while using the identical rule-based representation. In comparison with other machine learning techniques, results on a key benchmark from the Reuters collection show a large gain in performance, from a previously reported 65% recall/precision breakeven point to 80.5%. In the context of a very high dimensional feature space, several methodological alternatives are examined, including universal versu...
Searching online text collections can be both rewarding and frustrating. While valuable information can be found, typically many irrelevant documents are also retrieved and many relevant ones are missed. Terminology mismatches between the user's query and document contents are a main cause of retrieval failures. Expanding a user's query with related words can improve search performance, but finding and using related words is an open problem. This research uses corpus analysis techniques to automatically discover similar words directly from the contents of the databases which are not tagged with part-ofspeech labels. Using these similarities, user queries are automatically expanded, resulting in conceptual retrieval rather than requiring exact word matches between queries and documents. We are able to achieve a 7.6% improvement for TREC5 queries and up to a 28.5% improvement on the narrow-domain Cystic Fibrosis collection. This work has been extended to multi-database collections where ...
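A toy version of such corpus-based expansion can be built from co-occurrence alone, with no part-of-speech tagging. The sketch below uses cosine similarity over binary term-document occurrence vectors as an illustrative stand-in for the paper's corpus analysis techniques:

```python
from math import sqrt
from collections import defaultdict

def term_vectors(docs):
    """Represent each term by the set of documents it occurs in."""
    vec = defaultdict(set)
    for i, doc in enumerate(docs):
        for w in doc:
            vec[w].add(i)
    return vec

def expand_query(query, docs, k=2):
    """Append to the query the k terms most similar to each query term,
    using cosine similarity on binary occurrence vectors."""
    vec = term_vectors(docs)
    expanded = list(query)
    for q in query:
        scored = []
        for w, postings in vec.items():
            if w in query:
                continue
            overlap = len(vec[q] & postings)
            if overlap:
                sim = overlap / sqrt(len(vec[q]) * len(postings))
                scored.append((sim, w))
        for _, w in sorted(scored, reverse=True)[:k]:
            if w not in expanded:
                expanded.append(w)
    return expanded
```

Terms that consistently co-occur with a query term (e.g. "auto" alongside "car") are pulled in, so documents using either wording can match, which is the conceptual retrieval effect described above.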
From an analysis of actual cases, three categories of bias in computer systems have been developed: preexisting, technical, and emergent. Preexisting bias has its roots in social institutions, practices, and attitudes. Technical bias arises from technical constraints or considerations. Emergent bias arises in a context of use. Although others have pointed to bias in particular computer systems and have noted the general problem, we know of no comparable work that examines this phenomenon comprehensively and which offers a framework for understanding and remedying it. We conclude by suggesting that freedom from bias should be counted among the select set of criteria - including reliability, accuracy, and efficiency -according to which the quality of systems in use in society should be judged.
With the profusion of text databases on the Internet, it is becoming increasingly hard to find the most useful databases for a given query. To attack this problem, several existing and proposed systems employ brokers to direct user queries, using a local database of summary information about the available databases. This summary information must effectively distinguish relevant databases, and must be compact while allowing efficient access. We offer evidence that one broker, GlOSS, can be effective at locating databases of interest even in a system of hundreds of databases, and examine the performance of accessing the GlOSS summaries for two promising storage methods: the grid file and partitioned hashing. We show that both methods can be tuned to provide good performance for a particular workload (within a broad range of workloads), and discuss the tradeoffs between the two data structures. As a side effect of our work, we show that grid files are more broadly applicable than previous...
We present a formal definition of the Trellis model of hypertext and describe an authoring and browsing prototype called αTrellis that is based on the model. The Trellis model not only represents the relationships that tie individual pieces of information together into a document (i.e., the adjacencies) but specifies the browsing semantics to be associated with the hypertext as well (i.e., the manner in which the information is to be visited and presented). The model is based on Petri nets, and is a generalization of existing directed graph based forms of hypertext. The Petri net basis permits more powerful specification of what is to be displayed when a hypertext is browsed and permits application of previously-developed Petri net analysis techniques to verify properties of the hypertext. A number of useful hypertext constructs, easily described in the Trellis model, are presented. These include the synchronization of simultaneous traversals of separate paths through a hyper...
We identify crucial design issues in building a distributed inverted index for a large collection of Web pages. We introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. We also propose a storage scheme for creating and managing inverted files using an embedded database system. We suggest and compare different strategies for collecting global statistics from distributed inverted indexes. Finally, we present performance results from experiments on a testbed distributed Web indexing system that we have implemented.
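The core structure this abstract builds at scale can be illustrated with a minimal, single-machine sketch (the distributed pipelining and embedded-database storage are beyond a few lines; the function and corpus here are hypothetical, not the authors' system):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of (doc_id, term_frequency)."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

docs = ["web pages link to web pages", "indexing web pages at scale"]
index = build_inverted_index(docs)
# Postings for "web": twice in doc 0, once in doc 1.
print(index["web"])  # [(0, 2), (1, 1)]
```

A production index-builder differs mainly in scale: postings are built in memory-sized batches, sorted, and merged to disk, which is exactly the phase the paper's pipelining technique overlaps.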
Two recently implemented machine-learning algorithms, RIPPER and sleeping-experts for phrases, are evaluated on a number of large text categorization problems. These algorithms both construct classifiers that allow the "context" of a word w to affect how (or even whether) the presence or absence of w will contribute to a classification. However, RIPPER and sleeping-experts differ radically in many other respects: differences include different notions as to what constitutes a context, different ways of combining contexts to construct a classifier, different methods to search for a combination of contexts, and different criteria as to what contexts should be included in such a combination. In spite of these differences, both RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods. We view this result as a confirmation of the usefulness of classifiers that represent contextual information.
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 MB long, more than a CD-ROM can hold. But in addition the dictionary and concordance needed to access this data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.
text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
Text compression is of considerable theoretical and practical interest. It is, for example, becoming increasingly important for satisfying the requirements of fitting a large database onto a single CD-ROM. Many of the compression techniques discussed in the literature are model based. We here propose the notion of a formal grammar as a flexible model of text generation that encompasses most of the models offered before as well as, in principle, extending the possibility of compression to a much more general class of languages. Assuming a general model of text generation, a derivation is given of the well known Shannon entropy formula, making possible a theory of information based upon text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure. This algorithm allows us to cluster Markov states, and thereby base our compression algorithm on a smaller number of probability distributions than would otherwise have been required. A number of theoretical consequences of this approach to compression are explored, and a detailed example is given.
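The Shannon entropy formula the abstract derives, H(p) = -Σᵢ pᵢ log₂ pᵢ, gives the lower bound on average bits per symbol for any code over a given symbol distribution. A quick numeric sketch:

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum(p_i * log2(p_i)): lower bound on mean bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform 4-symbol alphabet needs exactly 2 bits per symbol.
print(shannon_entropy([0.25] * 4))  # 2.0
# A skewed distribution admits shorter codes on average, which is
# precisely what Huffman coding exploits.
print(shannon_entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```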
Most computing serves as a resource or tool to support other work: performing complex analyses for engineering projects, preparing documents, or sending electronic mail using office automation equipment, etc. To improve the character, quality, and ease of computing work, we must understand how automated systems actually are integrated into the work they support. How do people actually adapt to computing as a resource? How do they deal with the unreliability in hardware, software, or operations; data inaccuracy; system changes; poor documentation; inappropriate designs; etc.; which are present in almost every computing milieu, even where computing is widely used and considered highly successful? This paper presents some results of a detailed empirical study of routine computer use in several organizations. We present a theoretical account of computing work and use it to explain a number of observed phenomena, such as:
How people knowingly use “false” data to obtain desired analytical results by tricking their systems.
How organizations come to rely upon complex, critical computer systems despite significant, recurrent, known errors and inaccurate data.
How people work around inadequate computing systems by using manual or duplicate systems, rather than changing their systems via maintenance or enhancement.
In addition, the framework for analyzing computing and routine work presented here proves useful for representing and reasoning about activity in multiactor systems in general, and in understanding how better to integrate organizations of people and computers in which work is coordinated.
Specifying exact query concepts has become increasingly challenging to end-users. This is because many query concepts (e.g., those for looking up a multimedia object) can be hard to articulate, and articulation can be subjective. In this study, we propose a query-concept learner that learns query criteria through an intelligent sampling process. Our concept learner aims to fulfill two primary design objectives: (1) it has to be expressive in order to model most practical query concepts, and (2) it must learn a concept quickly and with a small number of labeled data since online users tend to be too impatient to provide much feedback. To fulfill the first goal, we model query concepts in k-CNF, which can express almost all practical query concepts. To fulfill the second design goal, we propose our maximizing expected generalization algorithm (MEGA), which converges to target concepts quickly by its two complementary steps: sample selection and concept refinement. We also propose a divide-and-conquer method that divides the concept-learning task into G subtasks to achieve speedup. We notice that a task must be divided carefully, or search accuracy may suffer. We thus employ a genetic-based mining algorithm to discover good feature groupings. Through analysis and mining results, we observe that organizing image features in a multi-resolution manner, and minimizing intragroup feature correlation, can speed up query-concept learning substantially while maintaining high search accuracy. Through examples, analysis, experiments, and a prototype implementation, we show that MEGA converges to query concepts significantly faster than traditional methods. Keywords: query concept, relevance feedback, active learning, data mining.
... of the Internet as well as advances in telecommunications technologies. Nonetheless, this increased physical connectivity (the ability to exchange bits and bytes) does not necessarily lead to logical connectivity (the ability to exchange information meaningfully). This problem is sometimes referred to as the need for semantic interoperability [Sheth and Larson 1990] among autonomous and heterogeneous systems. The Context Interchange strategy [Siegel and Madnick 1991; Sciore et al. 1994] is a mediator-based approach [Wiederhold 1992] for achieving semantic interoperability among heterogeneous sources and ...
The Context Interchange strategy presents a novel perspective for mediated data access in which semantic conflicts among heterogeneous systems are not identified a priori, but are detected and reconciled by a context mediator through comparison of the context axioms corresponding to the systems engaged in data exchange. In this article, we show that queries formulated on shared views, export schemas, and shared "ontologies" can be mediated in the same way using the Context Interchange framework. The proposed framework provides a logic-based object-oriented formalism for representing and reasoning about data semantics in disparate systems, and has been validated in a prototype implementation providing mediated data access to both traditional and web-based information sources.
Research has reported that about 10% of Web searchers utilize advanced query operators, with the other 90% using extremely simple queries. It is often assumed that the use of query operators, such as Boolean operators and phrase searching, improves the effectiveness of Web searching. We test this assumption by examining the effects of query operators on the performance of three major Web search engines. We selected one hundred queries from the transaction log of a Web search service. Each of these original queries contained query operators such as AND, OR, MUST APPEAR (+), or PHRASE (" "). We then removed the operators from these one hundred advanced queries. We submitted both the original and modified queries to three major Web search engines; a total of 600 queries were submitted and 5,748 documents evaluated. We compared the results from the original queries with the operators to the results from the modified queries without the operators. We examined the results for changes in coverage, relative precision, and ranking of relevant documents. The use of most query operators had no significant effect on coverage, relative precision, or ranking, although the effect varied depending on the search engine. We discuss implications for the effectiveness of searching techniques as currently taught, for future information retrieval system design, and for future research.
In networked IR, a client submits a query to a broker, which is in contact with a large number of databases. In order to yield a maximum number of documents at minimum cost, the broker has to make estimates about the retrieval cost of each database, and then decide for each database whether or not to use it for the current query, and if so, how many documents to retrieve from it. For this purpose, we develop a general decision-theoretic model and discuss different cost structures. Besides cost for retrieving relevant versus nonrelevant documents, we consider the following parameters for each database: expected retrieval quality, expected number of relevant documents in the database, and cost factors for query processing and document delivery. For computing the overall optimum, a divide-and-conquer algorithm is given. If there are several brokers knowing different databases, a preselection of brokers can only be performed heuristically, but the computation of the optimum can be done similarly to the single-broker case. In addition, we derive a formula which estimates the number of relevant documents in a database based on dictionary information.
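The flavor of such a decision-theoretic selection can be shown with a deliberately simplified linear utility model (all names, cost values, and the greedy per-database choice are illustrative assumptions, not the paper's actual formulation):

```python
def expected_benefit(n, precision, gain_rel=1.0, cost_nonrel=0.3,
                     cost_per_doc=0.05, cost_query=0.2):
    """Toy linear utility of retrieving n documents from one database:
    relevant documents yield gain; nonrelevant documents, per-document
    delivery, and fixed query processing all incur cost."""
    rel = n * precision          # expected relevant documents among the n
    nonrel = n - rel
    return rel * gain_rel - nonrel * cost_nonrel - n * cost_per_doc - cost_query

def best_allocation(databases, max_docs=20):
    """For each database, pick the document count with highest expected
    benefit; n = 0 skips the database and avoids its fixed query cost."""
    plan = {}
    for name, precision in databases.items():
        plan[name] = max(
            range(max_docs + 1),
            key=lambda n: expected_benefit(n, precision) if n else 0.0)
    return plan

# A high-precision source is queried in depth; a poor one is skipped.
print(best_allocation({"news": 0.6, "archive": 0.1}))  # {'news': 20, 'archive': 0}
```

The real model additionally weighs expected retrieval quality and database-specific delivery costs, and optimizes across databases jointly rather than one at a time.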
Queries to text collections are resolved by ranking the documents in the collection and returning the highest-scoring documents to the user. An alternative retrieval method is to rank passages, that is, short fragments of documents, a strategy that can improve effectiveness and identify relevant material in documents that are too large for users to consider as a whole. However, ranking of passages can considerably increase retrieval costs. In this paper we explore alternative query evaluation techniques, and develop new techniques for evaluating queries on passages. We show experimentally that, appropriately implemented, effective passage retrieval is practical in limited memory on a desktop machine. Compared to passage ranking with adaptations of current document ranking algorithms, our new "DO-TOS" passage ranking algorithm requires only a fraction of the resources, at the cost of a small loss of effectiveness.
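The basic operation of passage ranking, scoring short overlapping fragments instead of whole documents, can be sketched minimally (fixed word windows and simple term-overlap scoring are assumptions for illustration; the paper's DO-TOS algorithm is a far more careful implementation):

```python
def best_passage(doc_words, query_terms, size=8, stride=4):
    """Score fixed-size, overlapping word windows by query-term overlap
    and return the highest-scoring passage with its score."""
    query = set(query_terms)
    best, best_score = doc_words[:size], -1
    for start in range(0, max(1, len(doc_words) - size + 1), stride):
        window = doc_words[start:start + size]
        score = sum(1 for w in window if w in query)
        if score > best_score:
            best, best_score = window, score
    return " ".join(best), best_score

doc = ("many sections discuss storage however the passage about ranking "
       "passages improves retrieval effectiveness in long documents").split()
print(best_passage(doc, ["passage", "ranking"]))
```

This locates the relevant fragment inside a document too long to return whole, which is the effectiveness argument the abstract makes; the cost argument concerns doing this efficiently at index scale.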
this article we propose replacing the SVD with the semidiscrete decomposition (SDD). We will describe the SDD approximation, show how to compute it, and compare the SDD-based LSI method to the SVD-based LSI method. We will show that SDD-based LSI does as well as SVD-based LSI in terms of document retrieval while requiring only one-twentieth the storage and one-half the time to compute each query. We will also show how to update the SDD approximation when documents are added or deleted from the document collection.
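For contrast, the SVD-based LSI baseline the article compares against can be sketched with NumPy (the tiny term-document matrix and query fold-in below are illustrative assumptions; the SDD replaces the dense factors with ones whose entries are restricted to {-1, 0, 1}):

```python
import numpy as np

# Tiny term-document matrix (rows = terms, columns = documents).
terms = ["ship", "boat", "ocean", "tree", "wood"]
A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# Rank-2 truncated SVD: A ~ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Fold a query into the latent space: q_hat = S_k^{-1} U_k^T q.
q = np.array([1, 1, 0, 0, 0], dtype=float)   # "ship boat"
q_hat = np.diag(1 / sk) @ Uk.T @ q
doc_vecs = Vtk.T                              # documents in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(q_hat, d) for d in doc_vecs]
print([round(x, 3) for x in scores])
```

The storage and query-time savings claimed for the SDD come from this same pipeline with much cheaper factor matrices.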
Spatial relations often are desired answers that a geographic information system (GIS) should generate in response to a user's query. Current GISs provide only rudimentary support for processing and interpreting natural-language-like spatial relations, because their models and representations are primarily quantitative, while natural-language spatial relations are usually dominated by qualitative properties. Studies of the use of spatial relations in natural language showed that topology accounts for a significant portion of the geometric properties. This paper develops a formal model that captures metric details for the description of natural-language spatial relations. The metric details are expressed as refinements of the categories identified by the 9-intersection, a model for topological spatial relations, and provide a more precise measure than does topology alone as to whether a geometric configuration matches with a spatial term or not. Similarly, these measures help in identifying the spatial term that describes a particular configuration. Two groups of metric details are derived: splitting ratios as the normalized values of lengths and areas of intersections; and closeness measures as the normalized distances between disjoint object parts. The resulting model of topological and metric properties was calibrated for sixty-four spatial terms in English, providing values for the best fit as well as value ranges for the significant parameters of each term. Three examples demonstrate how the framework and its calibrated values are used to determine the best spatial term for a relationship between two geometric objects.
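A splitting ratio, the normalized area of intersection, is easy to illustrate if we restrict ourselves to axis-aligned rectangles (a simplification for illustration only; the paper works with general regions and the full 9-intersection machinery):

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = r
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(a, b):
    """Axis-aligned intersection rectangle of a and b (may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]),
            min(a[2], b[2]), min(a[3], b[3]))

def splitting_ratio(a, b):
    """Area of a's interior covered by b, normalized by a's area."""
    return rect_area(intersection(a, b)) / rect_area(a)

field = (0.0, 0.0, 10.0, 10.0)
lake = (5.0, 0.0, 15.0, 10.0)
print(splitting_ratio(field, lake))  # 0.5
```

Calibration then maps such normalized values onto natural-language terms: a ratio near 1 suggests "inside", an intermediate value like 0.5 suggests "overlaps", and 0 leaves only topologically disjoint terms.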
We present an analysis of word senses that provides a fresh insight into the impact of word ambiguity on retrieval effectiveness, with potential broader implications for other processes of information retrieval. Using a methodology of forming artificially ambiguous words, known as pseudo-words, and through reference to other researchers' work, the analysis illustrates that the distribution of the frequency of occurrence of the senses of a word plays a strong role in ambiguity's impact on effectiveness. Further investigation shows that this analysis may also be applicable to other processes of retrieval, such as Cross Language Information Retrieval, query expansion, retrieval of OCR'ed texts, and stemming. The analysis appears to provide a means of explaining, at least in part, reasons for the processes' impact (or lack of it) on effectiveness.
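The pseudo-word methodology itself is mechanically simple: two distinct words are conflated into a single artificial token, so the token behaves like an ambiguous word whose "senses" have known, controllable frequencies. A minimal sketch (function name and token format are illustrative assumptions):

```python
def make_pseudoword(text, word_a, word_b):
    """Conflate two words into one artificially ambiguous token,
    e.g. 'bank' and 'river' -> 'bank/river', simulating a word with
    two senses whose relative frequencies we can measure exactly."""
    pseudo = f"{word_a}/{word_b}"
    return " ".join(pseudo if w in (word_a, word_b) else w
                    for w in text.split())

doc = "the bank raised interest rates near the river bank"
print(make_pseudoword(doc, "bank", "river"))
# the bank/river raised interest rates near the bank/river bank/river
```

Because the original word is known for each occurrence, retrieval runs on the conflated corpus can be scored against the unambiguous ground truth, which is what lets the analysis isolate the effect of sense-frequency skew.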
We describe a method for probabilistic document indexing using relevance feedback data that has been collected from a set of queries. Our approach is based on three new concepts: (1) Abstraction from specific terms and documents, which overcomes the restriction of limited relevance information for parameter estimation. (2) Flexibility of the representation, which allows the integration of new text analysis and knowledge-based methods in our approach as well as the consideration of document structures or different types of terms. (3) Probabilistic learning or classification methods for the estimation of the indexing weights making better use of the available relevance information. Our approach can be applied under restrictions that hold for real applications. We give experimental results for five test collections which show improvements over other indexing methods.
A signature file organization, called the weight-partitioned signature file, for supporting document ranking is proposed. It employs multiple signature files, each of which corresponds to one term frequency, to represent terms with different term frequencies. Words with the same term frequency in a document are grouped together and hashed into the signature file corresponding to that term frequency. This eliminates the need to explicitly record the term frequency for each word. We investigate the effect of false drops on retrieval effectiveness if they are not eliminated in the search process. We have shown that false drops introduce insignificant degradation on precision and recall when the false drop probability is below a certain threshold. This is an important result since false drop elimination could become the bottleneck in systems using fast signature file search techniques. We perform an analytical study on the performance of the weight-partitioned signature file un...
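The superimposed-coding mechanism behind signature files, and the false drops it admits, can be demonstrated with a small simulation (parameters and hashing scheme are illustrative assumptions; the paper's weight-partitioned organization keeps one such signature file per term frequency):

```python
import random

def make_signature(words, bits=64, hashes=3, seed=0):
    """Superimpose each word's hash-selected bits onto one bit signature."""
    sig = 0
    for w in words:
        rng = random.Random(f"{seed}:{w}")   # deterministic per-word bits
        for _ in range(hashes):
            sig |= 1 << rng.randrange(bits)
    return sig

def maybe_contains(sig, word, bits=64, hashes=3, seed=0):
    """True iff all of the word's bits are set in sig. Words actually
    present always pass; absent words occasionally pass too - these are
    the false drops whose effect on precision the paper analyzes."""
    return make_signature([word], bits, hashes, seed) & ~sig == 0

words = ["signature", "file", "ranking"]
sig = make_signature(words)
assert all(maybe_contains(sig, w) for w in words)   # no false dismissals
print(maybe_contains(sig, "retrieval"))             # usually False
```

With more superimposed words or fewer signature bits, the probability of a false drop rises, which is why the paper characterizes the threshold below which false drops can safely be left uneliminated.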
Document properties are a compelling infrastructure on which to develop document management applications. A property-based approach avoids many of the problems of traditional hierarchical storage mechanisms, reflects document organizations meaningful to user tasks, provides a means to integrate the perspectives of multiple individuals and groups, and does this all within a uniform interaction framework. Document properties can reflect not only categorizations of documents and document use, but also expressions of desired system activity, such as sharing criteria, replication management and versioning. Augmenting property-based document management systems with active properties that carry executable code enables the provision of document-based services on a property infrastructure. The combination of document properties as a uniform mechanism for document management, and active properties as a way of delivering document services, represents a new paradigm for document management infras...
This article describes and evaluates SavvySearch, a metasearch engine designed to intelligently select and interface with multiple remote search engines. The primary metasearch issue examined is the importance of carefully selecting and ranking remote search engines for user queries. We studied the efficacy of SavvySearch's incrementally acquired metaindex approach to selecting search engines by analyzing the effect of time and experience on performance. We also compared the metaindex approach to the simpler categorical approach and showed how much experience is required to surpass the simple scheme.