Automatic review assignment can significantly improve the productivity of many people such as conference organizers, journal editors and grant administrators. A general setup of the review assignment problem involves assigning a set of reviewers on a committee to a set of documents to be reviewed under the constraint of review quota so that the reviewers assigned to a document can collectively cover multiple topic aspects of the document. No previous work has addressed such a setup of committee review assignments while also considering matching multiple aspects of topics and expertise. In this paper, we tackle the problem of committee review assignment with multi-aspect expertise matching by casting it as an integer linear programming problem. The proposed algorithm can naturally accommodate any probabilistic or deterministic method for modeling multiple aspects to automate committee review assignments. Evaluation using a multi-aspect review assignment test set constructed using ACM SIGIR publications shows that the proposed algorithm is effective and efficient for committee review assignments based on multi-aspect expertise matching.
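Since the assignment task above is cast as an integer linear program, a minimal sketch of one such formulation may help; it uses the third-party PuLP solver, and the reviewers, papers, aspect scores, quota, and the simple additive coverage objective are illustrative assumptions rather than the paper's exact model, which targets collective coverage of a document's aspects by the assigned panel.

```python
# Hypothetical ILP sketch for committee review assignment (not the paper's
# exact formulation). Requires the third-party PuLP package.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

reviewers = ["r1", "r2", "r3"]
papers = ["p1", "p2"]
aspects = ["a1", "a2"]

# expertise[r][p][a]: how well reviewer r covers aspect a of paper p
# (e.g., a topic-model probability); toy numbers for illustration.
expertise = {
    "r1": {"p1": {"a1": 0.9, "a2": 0.1}, "p2": {"a1": 0.2, "a2": 0.3}},
    "r2": {"p1": {"a1": 0.2, "a2": 0.8}, "p2": {"a1": 0.7, "a2": 0.1}},
    "r3": {"p1": {"a1": 0.3, "a2": 0.3}, "p2": {"a1": 0.1, "a2": 0.9}},
}
quota = 2        # maximum papers per reviewer (review quota)
panel_size = 2   # reviewers assigned to each paper

prob = LpProblem("committee_review_assignment", LpMaximize)
x = {(r, p): LpVariable(f"x_{r}_{p}", cat=LpBinary)
     for r in reviewers for p in papers}

# Objective: total aspect coverage delivered by the chosen assignments.
prob += lpSum(expertise[r][p][a] * x[r, p]
              for r in reviewers for p in papers for a in aspects)

for r in reviewers:   # review quota constraint per reviewer
    prob += lpSum(x[r, p] for p in papers) <= quota
for p in papers:      # fixed panel size per paper
    prob += lpSum(x[r, p] for r in reviewers) == panel_size

prob.solve()
print([(r, p) for (r, p) in x if x[r, p].value() == 1])
```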
We present an image retrieval framework based on automatic query expansion in a concept feature space by generalizing the vector space model of information retrieval. In this framework, images are represented by vectors of weighted concepts similar to the keyword-based representation used in text retrieval. To generate the concept vocabularies, a statistical model is built by utilizing Support Vector Machine (SVM)-based classification techniques. The images are represented as a "bag of concepts" comprising perceptually and/or semantically distinguishable color and texture patches from local image regions in a multi-dimensional feature space. To explore the correlation between the concepts and overcome the assumption of feature independence in this model, we propose query expansion techniques in the image domain from a new perspective based on both local and global analysis. For the local analysis, the correlations between the concepts based on the co-occurrence pattern, and the metrical constraints based on the neighborhood proximity between the concepts in encoded images, are analyzed by considering local feedback information. We also analyze the concept similarities in the collection as a whole in the form of a similarity thesaurus and propose an efficient query expansion based on the global analysis. The experimental results on a photographic collection of natural scenes and a biomedical database of different imaging modalities demonstrate the effectiveness of the proposed framework in terms of precision and recall.
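To make the global-analysis expansion concrete, here is a hedged sketch in which a query expressed as weighted concepts is enriched with neighbours from a precomputed concept-similarity thesaurus; the concept names, similarity values, and mixing weight are invented, and the SVM-based concept encoding and local-feedback analysis are omitted.

```python
# Illustrative query expansion over a "bag of concepts" representation using a
# global concept-similarity thesaurus (made-up concepts and similarities).
query = {"sky": 0.7, "water": 0.3}   # concept -> weight

# Global analysis: pairwise concept similarities precomputed on the collection.
thesaurus = {
    "sky":   {"cloud": 0.8, "sunset": 0.5},
    "water": {"wave": 0.7, "sand": 0.4},
}

def expand(query, thesaurus, top_k=2, mix=0.5):
    """Add the top_k most similar concepts for each query concept,
    down-weighted by the original concept weight and the similarity."""
    expanded = dict(query)
    for concept, weight in query.items():
        neighbours = sorted(thesaurus.get(concept, {}).items(),
                            key=lambda kv: kv[1], reverse=True)[:top_k]
        for neighbour, sim in neighbours:
            expanded[neighbour] = expanded.get(neighbour, 0.0) + mix * weight * sim
    return expanded

print(expand(query, thesaurus))
```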
The research looks at a Web page as a graph structure, or a Web graph, and tries to classify different Web graphs in the new coordinate space: Out-Degree, In-Degree. The Out-degree coordinate is defined as the number of Web pages a given Web page links to. The In-degree coordinate is the number of Web pages that point to a given Web page. J. Kleinberg's (1998) Web algorithm for discovering “hub Web pages” and “authority Web pages” is applied in this new coordinate space. Some very uncommon phenomena have been discovered and new, interesting results are interpreted. The author believes that understanding the underlying Web page as a graph will help design better Web algorithms, enhance retrieval and Web performance, and recommends using graphs as part of a visual aid for search engine designers.
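For reference, a compact sketch of the hubs-and-authorities iteration on a toy Web graph follows; the graph is invented, and the classification of Web graphs in the Out-Degree/In-Degree coordinate space itself is not reproduced here.

```python
# Minimal hubs-and-authorities (HITS-style) iteration on a small Web graph
# given as an adjacency list of outgoing links; illustrative only.
import math

links = {          # page -> pages it links to (its out-degree edges)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):
    # authority score: sum of hub scores of pages pointing in (in-degree side)
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub score: sum of authority scores of pages pointed to (out-degree side)
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalize so the scores stay bounded
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(sorted(auth, key=auth.get, reverse=True))  # candidate authority pages
```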
This paper proposes knowledge map creation and maintenance approaches that utilize information retrieval and data mining techniques to facilitate knowledge management in virtual communities of practice. Besides an evaluation of their performance using synthesized data, the knowledge maps generated for documents collected from the teachers' cyber community, SCTNet, and from the master thesis repository at Taiwan's National Central Library are evaluated by domain experts. Domain experts are asked to revise the obtained knowledge maps, and the proportion of modifications is small and acceptable. Therefore, the developed approaches are suitable for supporting knowledge management of professional communities on the Internet.
A perspective is presented to bring into focus the state-of-the-art and development of methods and criteria which relate to the problem of design and performance evaluation of information systems. Two major aspects of design and evaluation are considered. These are the initiation, planning, development and testing of new information systems, to include modification of existing structures; and the appraisals and measurement of operational systems and their components. A taxonomy of information systems is presented in order to provide for a basis of organized evaluation of system performance.
The purpose of the present study is to analyse and map the trends in research on prion diseases by applying bibliometric tools to the scientific literature published between 1973 and 2002. The data for the study were obtained from the Medline database. The aim is to determine the volume of scientific output in the above period, the countries involved and the trends in the subject matters addressed. Significant growth is observed in scientific production since 1991 and particularly in the period 1996–2001. The countries found to have the highest output are the United States, the United Kingdom, Japan, France and Germany. The collaboration networks established by scientists are also analysed in this study, as well as the evolution in the subject matters addressed in the papers they published, that are observed to remain essentially constant in the three sub-periods into which the study is divided.
Multimedia is proliferating on Web sites, as the Web continues to enhance the integration of multimedia and textual information. In this paper we examine trends in multimedia Web searching by Excite users from 1997 to 2001. Results from an analysis of 1,025,910 Excite queries from 2001 are compared to similar Excite datasets from 1997 to 1999. Findings include: (1) queries per multimedia session have decreased since 1997 as a proportion of general queries due to the introduction of multimedia buttons near the query box, (2) multimedia queries identified are longer than non-multimedia queries, and (3) audio queries are more prevalent than image or video queries in identified multimedia queries. Overall, we see multimedia Web searching undergoing major changes as Web content and searching evolves.
Search engines play an essential role in the usability of Internet-based information systems and without them the Web would be much less accessible, and at the very least would develop at a much slower rate. Given that non-English users now tend to make up the majority in this environment, our main objective is to analyze and evaluate the retrieval effectiveness of various indexing and search strategies based on test-collections written in four different languages: English, French, German, and Italian. Our second objective is to describe and evaluate various approaches that might be implemented in order to effectively access document collections written in another language. As a third objective, we will explore the underlying problems involved in searching document collections written in the four different languages, and we will suggest and evaluate different database merging strategies capable of providing the user with a single unique result list.
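As a concrete illustration of the merging step, here is a hedged sketch of one simple database-merging strategy, min-max score normalization followed by a merge into a single result list; the runs and scores are made up, and the article evaluates several strategies rather than endorsing this particular one.

```python
# Toy merging of per-language result lists into one ranked list via
# min-max score normalization (illustrative scores only).
def normalize(run):
    scores = list(run.values())
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return {doc: (score - lo) / span for doc, score in run.items()}

runs = {   # retrieval scores from separate monolingual collections
    "english": {"en_12": 14.2, "en_07": 9.8},
    "french":  {"fr_03": 5.1, "fr_22": 4.4},
    "german":  {"de_41": 21.0, "de_05": 13.3},
    "italian": {"it_09": 7.7, "it_15": 3.2},
}

merged = {}
for language, run in runs.items():
    merged.update(normalize(run))

ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # a single unique result list across the four collections
```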
The Web, and consequently the information contained in it, is growing rapidly. Every day a huge amount of newly created information is electronically published in Digital Libraries, whose aim is to satisfy users' information needs. In this paper, we envisage a Digital Library not only as an information resource where users may submit queries to satisfy their daily information need, but also as a collaborative working and meeting space of people sharing common interests. Indeed, we will present a personalized collaborative Digital Library environment, where users may organize the information space according to their own subjective view, may build communities, may become aware of each other, may exchange information and knowledge with other users, and may get recommendations based on preference patterns of users.
Citation analysis is performed in order to evaluate authors and scientific collections, such as journals and conference proceedings. Currently, two major systems exist that perform citation analysis: Science Citation Index (SCI) by the Institute for Scientific Information (ISI) and CiteSeer by the NEC Research Institute. The SCI, mostly a manual system up until recently, is based on the notion of the ISI Impact Factor, which has been used extensively for citation analysis purposes. On the other hand, the CiteSeer system is an automatically built digital library using agent technology, also based on the notion of the ISI Impact Factor. In this paper, we investigate new alternative notions besides the ISI Impact Factor, in order to provide a novel approach aiming at ranking scientific collections. Furthermore, we present a web-based system that has been built by extracting data from the Databases and Logic Programming (DBLP) website of the University of Trier. Our system, by using the new citation metrics, emerges as a useful tool for ranking scientific collections. In this respect, some first remarks are presented, e.g. on ranking conferences related to databases.
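For reference, the conventional two-year ISI Impact Factor that the alternative notions are contrasted with can be written as below; the paper's own citation metrics are not reproduced here.

```latex
\mathrm{IF}_J(y) \;=\; \frac{C_J(y;\, y-1,\, y-2)}{P_J(y-1) + P_J(y-2)}
```

where C_J(y; y-1, y-2) is the number of citations received in year y by items that venue J published in years y-1 and y-2, and P_J(·) is the number of citable items J published in the given year.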
The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term-weighting systems. This article summarizes the insights gained in automatic term weighting, and provides baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared.
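As one concrete member of the family of single-term weighting schemes surveyed here, the sketch below computes tf-idf weights with cosine length normalization on toy documents; it is illustrative and not presented as the article's recommended scheme.

```python
# Minimal single-term weighting: tf-idf with cosine length normalization.
import math
from collections import Counter

docs = [
    "automatic term weighting in text retrieval",
    "single term indexing produces effective retrieval",
    "elaborate text representations versus weighted single terms",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequencies

def tfidf_vector(doc):
    tf = Counter(doc)                                   # raw term frequencies
    weights = {t: tf[t] * math.log(N / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}    # cosine normalization

print(tfidf_vector(tokenized[0]))
```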
This paper deals with information needs, seeking, searching, and uses within scholarly communities by introducing theory from the field of science and technology studies. In particular it contributes to the domain-analytic approach in information science by showing that Whitley’s theory of ‘mutual dependence’ and ‘task uncertainty’ can be used as an explanatory framework in understanding similarity and difference in information practices across intellectual fields. Based on qualitative case studies of three specialist scholarly communities across the physical sciences, applied sciences, social sciences and arts and humanities, this paper extends Whitley’s theory into the realm of information communication technologies. The paper adopts a holistic approach to information practices by recognising the interrelationship between the traditions of informal and formal scientific communication and how it shapes digital outcomes across intellectual fields. The findings show that communities inhabiting fields with a high degree of ‘mutual dependence’ coupled with a low degree of ‘task uncertainty’ are adept at coordinating and controlling channels of communication and will readily co-produce field-based digital information resources, whereas communities that inhabit fields characterised by the opposite cultural configuration, a low degree of ‘mutual dependence’ coupled with a high degree of ‘task uncertainty’, are less successful in commanding control over channels of communication and are less concerned with co-producing field-based digital resources and integrating them into their epistemic and social structures. These findings have implications for the culturally sensitive development and provision of academic digital resources such as digital libraries and web-based subject portals.
In earlier papers the authors focused on differences in the ageing of journal literature in science and the social sciences. It was shown that for several fields and topics bibliometric standard indicators based on journal articles need to be modified in order to provide valid results. In fields where monographs, books or reports are important means of scientific information, standard models of scientific communication are not reflected by journal literature alone. To identify fields where the role of non-serial literature is considerable or critical in terms of bibliometric standard methods, the totality of the bibliographic citations indexed in the 1993 annual cumulation of the SCI and SSCI databases has been processed. The analysis is based on three indicators: the percentage of references to serials, the mean reference age, and the mean reference rate. Applications of these measures at different levels of aggregation (i.e., to journals in selected science and social science fields) lead to the following conclusions. (1) The percentage of references to serials proved to be a sensitive measure to characterise typical differences in the communication behaviour between the sciences and the social sciences. (2) However, there is an overlap zone which includes fields like mathematics, technology oriented science, and some social science areas. (3) In certain social sciences part of the information seems even to originate in non-scientific sources: references to non-serials do not always represent monographs, pre-prints or reports. Consequently, the model of information transfer from scientific literature to scientific (journal) literature assumed by standard bibliometrics requires substantial revision before valid results can be expected through its application to social science areas.
Analyzing actions to be supported by information and information retrieval (IR) systems is vital for understanding the needs of different types of information, search strategies and relevance assessments, in short, understanding IR. A necessary condition for this understanding is to link results from information seeking studies to the body of knowledge by IR studies. The actions to be focused on in this paper are tasks from the angle of problem solving. I will analyze certain features of work tasks and relate these features to types of information people are looking for and using in their tasks, patterning of search strategies for obtaining information and relevance assessments in choosing retrieved documents. The major claim is that these information activities are systematically connected to task complexity and structure of the problem at hand. The argumentation is based on both theoretical and empirical results from studies on information retrieval and seeking.
Test collections have traditionally been used by information retrieval researchers to improve their retrieval strategies. To be viable as a laboratory tool, a collection must reliably rank different retrieval variants according to their true effectiveness. In particular, the relative effectiveness of two retrieval strategies should be insensitive to modest changes in the relevant document set since individual relevance assessments are known to vary widely. The test collections developed in the TREC workshops have become the collections of choice in the retrieval research community. To verify their reliability, NIST investigated the effect changes in the relevance assessments have on the evaluation of retrieval results. Very high correlations were found among the rankings of systems produced using different relevance judgment sets. The high correlations indicate that the comparative evaluation of retrieval performance is stable despite substantial differences in relevance judgments, and thus reaffirm the use of the TREC collections as laboratory tools.
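One common way to quantify how similar two system rankings are is a rank correlation such as Kendall's tau; the abstract does not name the statistic used, so the sketch below is purely illustrative, with invented system names and rank positions.

```python
# Kendall's tau between two rankings of the same systems, e.g. rankings
# produced with two different relevance judgment sets (toy data).
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping system -> rank position (1 = best)."""
    systems = list(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        agreement = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / pairs

official = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
alternate = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}
print(kendall_tau(official, alternate))   # 1.0 would mean identical orderings
```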
Cross-language information retrieval (CLIR) systems allow users to find documents written in different languages from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but comparison across techniques has been difficult because evaluation results often span only a limited range of conditions. This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.
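To ground the discussion, here is a hedged sketch of the plainest form of dictionary-based query translation, in which every source term is replaced by all of its bilingual-term-list translations; the tiny term list is invented, and the term selection, pre/post-translation expansion, and structured-query weighting examined in the article are not reproduced.

```python
# Naive dictionary-based query translation with a made-up bilingual term list.
term_list = {                  # source-language term -> candidate translations
    "maison": ["house", "home"],
    "blanche": ["white"],
}

def translate_query(source_terms, term_list):
    translated = []
    for term in source_terms:
        # keep untranslatable terms as-is (often useful for proper names)
        translated.extend(term_list.get(term, [term]))
    return translated

print(translate_query(["maison", "blanche"], term_list))
# ['house', 'home', 'white']; structured-query methods would instead treat
# {house, home} as alternatives for one source term rather than separate terms.
```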
Environmental scanning is the acquisition and use of information about events and trends in an organization's external environment, the knowledge of which would assist management in planning the organization's future courses of action. This paper reports a study of how 13 chief executives in the Canadian publishing and telecommunications industries scan their environments and use the information in decision making. Each respondent was asked to relate two critical incidents of information use. The incidents were analyzed according to their environmental sectors, the information sources, and their use in decision making. The interview data suggest that the chief executives concentrate their scanning on the competition, customer, regulatory, and technological sectors of the environment. In the majority of cases, the chief executives used environmental information in the Entrepreneur decisional role, initiating new products, projects, or policies. The chief executives acquire or receive environmental information from multiple, complementary sources. Personal sources are important for information on customers and competitors, whereas printed or formal sources are also important for information on technological and regulatory matters.
This paper describes algorithms and data structures for applying a parallel computer to information retrieval. Previous work has described an implementation based on overlap encoded signatures. That system was limited by (a) the necessity of keeping the signatures in primary memory and (b) the difficulties involved in implementing document-term weighting. Overcoming these limitations required adapting the inverted index techniques used on serial machines. The most obvious adaptation, also previously described, suffers from the fact that data must be sent between processors at query time. Since interprocessor communication is generally slower than local computation, this suggests that an algorithm which does not perform such communication might be faster. This paper presents a data structure, called a partitioned posting file, in which the interprocessor communication takes place at database-construction time, so that no data movement is needed at query-time. Performance characteristics and storage overhead are established by benchmarking against a synthetic database. Based on these figures, it appears that currently available hardware can deliver interactive document ranking on databases containing between 1 and 8192 Gigabytes of text.
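The central idea, partitioning the postings at database-construction time so that no posting data moves between processors at query time, can be sketched roughly as follows; the classes, toy documents, and term-frequency scoring are illustrative and say nothing about the actual layout on the parallel hardware described in the paper.

```python
# Rough sketch of a document-partitioned inverted index: each partition holds
# complete postings for its own documents and scores them locally, so only
# small per-partition result lists are merged at query time.
from collections import defaultdict

class IndexPartition:
    """One processor's local slice of the collection."""
    def __init__(self, docs):
        self.postings = defaultdict(dict)          # term -> {doc_id: tf}
        for doc_id, text in docs.items():
            for term in text.split():
                self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def score(self, query_terms):
        scores = defaultdict(int)
        for term in query_terms:
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf               # toy term-frequency scoring
        return scores

# Partitioning happens once, at database-construction time ...
partitions = [
    IndexPartition({"d1": "parallel information retrieval", "d2": "inverted index structures"}),
    IndexPartition({"d3": "partitioned posting file", "d4": "interactive document ranking"}),
]

# ... so at query time each partition ranks locally and only result lists merge.
query = ["posting", "retrieval"]
merged = {}
for partition in partitions:
    merged.update(partition.score(query))
print(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))
```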
While research into theoretical models of information retrieval (IR), IR in libraries, and the testing of search algorithms has been a cornerstone of IR research for decades, there has been comparatively little research into the problems of IR in business. Because of the growing magnitude and urgency of these problems, it is important to assess them more completely. This paper is an essay that draws on interviews with over 40 people experiencing real-life retrieval problems to better characterize these problems in the context of the work their organizations perform.
This paper presents a Foreign-Language Search Assistant that uses noun phrases as fundamental units for document translation and query formulation, translation and refinement. The system (a) supports the foreign-language document selection task providing a cross-language indicative summary based on noun phrase translations, and (b) supports query formulation and refinement using the information displayed in the cross-language document summaries. Our results challenge two implicit assumptions in most of cross-language Information Retrieval research: first, that once documents in the target language are found, Machine Translation is the optimal way of informing the user about their contents; and second, that in an interactive setting the optimal way of formulating and refining the query is helping the user to choose appropriate translations for the query terms.
Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being thus spelling variants of each other. In this paper we present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first step, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second step, the intermediate forms obtained in the first step are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The two-step technique performed better, in some cases considerably better, than fuzzy matching alone. Even using the first step as such showed promising results.
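The two steps lend themselves to a small sketch: first apply spelling-transformation rules to the source word, then match the intermediate form against a target-language lexicon by n-gram similarity; the rules, lexicon, and similarity measure below are invented stand-ins, not the rules learned from translation dictionaries in the paper.

```python
# Two-step fuzzy translation of a cross-lingual spelling variant (toy example).
def apply_rules(word, rules):
    for source, target in rules:       # step 1: rule-based transformation
        word = word.replace(source, target)
    return word

def ngrams(word, n=2):
    padded = f"_{word}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice_similarity(a, b):             # step 2: fuzzy n-gram matching
    grams_a, grams_b = ngrams(a), ngrams(b)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

rules = [("k", "c"), ("f", "ph")]      # invented source -> English-like rules
target_lexicon = ["morphology", "cardiology", "philosophy"]

source_word = "kardiologia"
intermediate = apply_rules(source_word, rules)
best_match = max(target_lexicon, key=lambda t: dice_similarity(intermediate, t))
print(intermediate, "->", best_match)  # cardiologia -> cardiology
```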
Networked information retrieval aims at the interoperability of heterogeneous information retrieval (IR) systems. In this paper, we show how differences concerning search operators and database schemas can be handled by applying data abstraction concepts in combination with uncertain inference. Different data types with vague predicates are required to allow for queries referring to arbitrary attributes of documents. Physical data independence separates search operators from access paths, thus solving text search problems related to noun phrases, compound words and proper nouns. Projection and inheritance on attributes support the creation of unified views on a set of IR databases. Uncertain inference allows for query processing even on incompatible database schemas.
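A minimal illustration of a data type with a vague predicate may help: rather than a Boolean match, the predicate returns a degree of match in [0, 1] that can feed into uncertain inference; the "approximately this year" predicate and its tolerance are invented for illustration and are not drawn from the paper.

```python
# Toy vague predicate on a Year attribute: graded rather than Boolean matching.
def year_about(doc_year, query_year, tolerance=3):
    """Return 1.0 for an exact match, decaying linearly to 0.0 once the
    difference exceeds the tolerance (an invented decay model)."""
    difference = abs(doc_year - query_year)
    return max(0.0, 1.0 - difference / tolerance)

docs = {"d1": 1994, "d2": 1996, "d3": 2001}
match_degrees = {doc: year_about(year, 1995) for doc, year in docs.items()}
print(match_degrees)   # e.g. {'d1': 0.67, 'd2': 0.67, 'd3': 0.0} (rounded)
```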
We present a new paradigm for the automatic creation of document headlines that is based on direct transformation of relevant textual information into well-formed textual output. Starting from an input document, we automatically create compact representations of weighted finite sets of strings, called WIDL-expressions, which encode the most important topics in the document. A generic natural language generation engine performs the headline generation task, driven by both statistical knowledge encapsulated in WIDL-expressions (representing topic biases induced by the input document) and statistical knowledge encapsulated in language models (representing biases induced by the target language). Our evaluation shows similar performance in quality with a state-of-the-art, extractive approach to headline generation, and significant improvements in quality over previously proposed solutions to abstractive headline generation.
Studies on contrastive genre analysis have become a current issue in research on languages for specific purposes (LSP) and are intended to economize specialist communication. The present article compares the formal schemata and linguistic devices of German abstracts and their English equivalents, written by German medical scholars, with English native speaker (NS) abstracts. The source material is a corpus of 20 abstracts taken from German medical journals representing different degrees of specialism/professionalism. The method of linguistic analysis includes (1) the overall length of articles/abstracts, (2) the representation/arrangement of “moves”, and (3) the linguistic means (complexity of sentences, finite verb forms, active and passive voice, tenses, linking words, and lexical hedging). Results show no correlation between the length of articles and the length of abstracts. In contrast to NS author abstracts, the move “Background information” predominated in the structure of the studied German non-native speaker (GNNS) abstracts, whereas “Purpose of study” and “Conclusions” were not clearly stated. In linguistic terms, the German abstracts frequently contained lexical hedges, complex and enumerating sentence structures, passive voice and past tense as well as linkers of adversative, concessive and consecutive character. The GNNS English equivalent abstracts were author translations and contained structural and linguistic inadequacies which may hamper the general readability for the scientific community. Therefore abstracting should be systematically incorporated into language courses for the medical profession and for technical translators.
Free-text retrieval is less effective than it might be because of its dependence on notions that evolved with controlled vocabulary representation and searching. The structure and nature of the discourse level features of natural language text types are not incorporated. In an attempt to address this problem, an exploratory study was conducted for the purpose of determining whether information abstracts reporting on empirical work do possess a predictable discourse-level structure and whether there are lexical clues that reveal this structure. A three phase study was conducted, with Phase I making use of four tasks to delineate the structure of empirical abstracts based on the internalized notions of 12 expert abstractors. Phase II consisted of a linguistic analysis of 276 empirical abstracts that suggested a linguistic model of an empirical abstract, which was tested in Phase III with a two stage validation procedure using 68 abstracts and four abstractors. Results indicate that expert abstractors do possess an internalized structure of empirical abstracts, whose components and relations were confirmed repeatedly over the four tasks. Substantively the same structure revealed by the experts was manifested in the sample of abstracts, with a relatively small set of recurring lexical clues revealing the presence and nature of the text components. Abstractors validated the linguistic model at an average level of 86%. Results strongly support the presence of a detectable structure in the text-type of empirical abstracts. Such a structure may be of use in a variety of text-based information processing systems. The techniques developed for analyzing natural language texts for the purpose of providing more useful representations of their semantic content offer potential for application to other types of natural language texts.