Conference Paper

A Relevance Feedback Model for Fractal Summarization

Authors:
  • Hong Kong Metropolitan University

Abstract

The recent information explosion has created an increasing demand for automatic summarization, and human abstractors often synthesize summaries from sentences that have been extracted by machine. However, the quality of machine-generated summaries is not high. Automatic summarization is a special application of information retrieval, so its precision can be improved by user relevance feedback: the human abstractor directs the sentence-extraction process, and useful information is retrieved efficiently. Automatic summarization with relevance feedback is therefore a helpful tool for assisting professional abstractors in generating summaries. In this work we propose a relevance feedback model for fractal summarization. Experimental results show that relevance feedback effectively improves the performance of automatic fractal summarization.
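The following is a minimal sketch, not the authors' implementation, of how relevance feedback could steer an extraction-based summarizer: sentences the abstractor accepts boost the weights of their terms, rejected sentences lower them, and candidate sentences are re-scored for the next extraction round. All names and the Rocchio-style constants (alpha, beta, gamma) are illustrative assumptions.

    from collections import Counter

    def update_term_weights(weights, relevant, non_relevant,
                            alpha=1.0, beta=0.75, gamma=0.25):
        """Rocchio-style update: reinforce terms from sentences the
        abstractor accepted, penalize terms from rejected ones.
        Sentences are token lists; weights is a {term: weight} dict."""
        new = {t: alpha * w for t, w in weights.items()}
        for sent in relevant:
            for term, freq in Counter(sent).items():
                new[term] = new.get(term, 0.0) + beta * freq / len(relevant)
        for sent in non_relevant:
            for term, freq in Counter(sent).items():
                new[term] = new.get(term, 0.0) - gamma * freq / len(non_relevant)
        return {t: max(w, 0.0) for t, w in new.items()}

    def score_sentence(sentence, weights):
        """Re-score a tokenized sentence with the updated term weights."""
        return sum(weights.get(term, 0.0) for term in sentence)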


... Many methods have been used in automatic extraction, but most of them do not take into account the hierarchical structure of documents. A novel method using the structure of the document was introduced by Yang and Wang [15]. It is based on a fractal view method for controlling the information. ...
... For that, we need a distance or similarity measure between two portions of a document. We then present a novel method to summarize documents using the fractal dimension, eliminating the drawbacks of the method introduced by Yang and Wang [15]. This paper begins with the concept of fractal dimension and the techniques used to calculate it. ...
... We adapt one of them to calculate the fractal dimension of text documents, and we present some results for web pages. We go on to discuss the drawbacks of the fractal summarization model [14,15], and we propose some changes to the model using the fractal dimension of a text document, which make fractal summarization work better. The last section presents the conclusions obtained. ...
Article
The calculation of dimensions is a useful tool to quantify structural information of artificial and natural objects. There are several types of dimension (9): the Euclidean dimension, the Hausdorff-Besicovitch dimension, and so on. We work with the fractal dimension in the special case of text documents. There are many objects whose fractal dimension cannot be determined analytically, but estimators exist for those cases. We review some of them and choose the best for our purpose: calculating the fractal dimension of text documents. Every day we search for new information on the web and find many documents whose pages contain a great amount of information. There is a large demand for rapid and precise automatic summarization. Many methods have been used in automatic extraction, but most of them do not take into account the hierarchical structure of documents. A novel method using the structure of the document was introduced by Yang and Wang (15). It is based on a fractal view method for controlling the information. It has some drawbacks, which we solve with a new adaptation of the fractal view method. We also use the new concept of the fractal dimension of a text document to achieve a better diversification of the extracted sentences.
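For concreteness, the standard box-counting estimator, one of the dimension estimators discussed here, fits the slope of log N(eps) against log(1/eps), where N(eps) is the number of boxes of size eps the object occupies. The sketch below estimates it for a set of positions in [0, 1); how to adapt such an estimator to text documents is the article's contribution and is not reproduced here. All names are illustrative.

    import math

    def box_counting_dimension(points, scales=(2, 4, 8, 16, 32, 64)):
        """Estimate the box-counting dimension of a set of positions in
        [0, 1): count occupied boxes N(eps) at several scales, then fit
        the slope of log N(eps) versus log(1/eps) by least squares."""
        xs, ys = [], []
        for n in scales:
            eps = 1.0 / n
            occupied = {int(p / eps) for p in points}
            xs.append(math.log(1.0 / eps))
            ys.append(math.log(max(len(occupied), 1)))
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        den = sum((x - mean_x) ** 2 for x in xs)
        return num / den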
Conference Paper
Every day we search for new information on the web and find many documents whose pages contain a great amount of information. There is a large demand for rapid and precise automatic summarization. Many methods have been used in automatic extraction, but most of them do not take into account the hierarchical structure of documents. A novel method using the structure of the document was introduced by Yang and Wang in 2004. It is based on a fractal view method for controlling the information displayed. We explain its drawbacks and solve them using the new concept of the fractal dimension of a text document, achieving a better diversification of the extracted sentences and improving the performance of the method.
Conference Paper
Full-text available
Query expansion is an effective relevance feedback technique for improving performance in information retrieval. In general, query-expansion methods select terms from the complete contents of relevant documents. One problem with this approach is that expansion terms unrelated to document relevance can be introduced into the modified query due to their presence in the relevant documents and their distribution in the document collection. Motivated by the hypothesis that query-expansion terms should only be sought from the most relevant areas of a document, this investigation explores the use of document summaries in query expansion. The investigation explores the use of both context-independent standard summaries and query-biased summaries. Experimental results using the Okapi BM25 probabilistic retrieval model on the TREC-8 ad hoc retrieval task show that query expansion using document summaries can be considerably more effective than full-document expansion. The paper also presents a novel approach to term selection that separates the choice of relevant documents from the selection of a pool of potential expansion terms. Again, this technique is shown to be more effective than standard methods.
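A minimal sketch of the central idea, assuming summaries are already available as token lists: expansion terms are pooled only from the summaries of the top-ranked documents, scored here by simple pooled frequency rather than the paper's Okapi-based selection. All names are illustrative.

    from collections import Counter

    def expand_query(query_terms, top_doc_summaries, n_terms=10):
        """Pick expansion terms from the summaries of assumed-relevant
        documents instead of their full text, so off-topic sections of a
        relevant document cannot contribute noise terms."""
        pool = Counter()
        for summary in top_doc_summaries:
            pool.update(t for t in summary if t not in query_terms)
        expansion = [t for t, _ in pool.most_common(n_terms)]
        return list(query_terms) + expansion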
Conference Paper
Full-text available
In this paper, we introduce the fractal summarization model based on fractal theory. In fractal summarization, important information is captured from the source text by exploring the hierarchical structure and salient features of the document. A condensed version of the document that is informatively close to the original is produced iteratively using the contractive transformation of fractal theory. User evaluation has shown that fractal summarization outperforms traditional summarization.
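The published fractal summarization papers allocate a sentence quota down the document hierarchy (document, chapters, sections, paragraphs) in proportion to each node's fractal value and extract the best sentences within each quota. The following is a simplified sketch of that allocation; the Node structure, the uniform default weight standing in for the computed fractal value, and the score callback are assumptions, not the authors' code.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        sentences: list = field(default_factory=list)   # leaf payload
        children: list = field(default_factory=list)
        weight: float = 1.0   # stand-in for the node's computed fractal value

    def allocate_quota(node, quota, score):
        """Recursively split the sentence quota among children in proportion
        to their weights; at the leaves, extract the highest-scoring sentences."""
        if not node.children:
            ranked = sorted(node.sentences, key=score, reverse=True)
            return ranked[:max(1, round(quota))]
        total = sum(c.weight for c in node.children)
        summary = []
        for child in node.children:
            summary += allocate_quota(child, quota * child.weight / total, score)
        return summary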
Conference Paper
Full-text available
Wireless access with mobile (or handheld) devices is a promising addition to the WWW and traditional electronic business. Mobile devices provide convenient and portable access to the huge information space on the Internet without requiring users to be stationary with a network connection. However, the limited screen size, narrow network bandwidth, small memory capacity and low computing power are the shortcomings of handheld devices. Loading and visualizing large documents on handheld devices becomes impossible: the limited resolution restricts the amount of information that can be displayed, and the download time is intolerably long. In this paper, we introduce the fractal summarization model for document summarization on handheld devices. Fractal summarization is developed based on fractal theory. It generates a brief skeleton of the summary at the first stage, and the details of the summary at different levels of the document are generated on demand. Such interactive summarization reduces the computational load compared with generating the entire summary in one batch, as traditional automatic summarization does, which is ideal for wireless access. A three-tier architecture, with the middle tier conducting the major computation, is also discussed. Visualization of summaries on handheld devices is also investigated.
Article
Full-text available
Human-quality text summarization systems are difficult to design, and even more difficult to evaluate, in part because documents can differ along several dimensions, such as length, writing style and lexical usage. Nevertheless, certain cues can often help suggest the selection of sentences for inclusion in a summary. This paper presents our analysis of news-article summaries generated by sentence selection. Sentences are ranked for potential inclusion in the summary using a weighted combination of statistical and linguistic features. The statistical features were adapted from standard IR methods. The potential linguistic ones were derived from an analysis of news-wire summaries. To evaluate these features we use a normalized version of precision-recall curves, with a baseline of random sentence selection, as well as analyze the properties of such a baseline. We illustrate our discussions with empirical results showing the importance of corpus-dependent baseline summarization standards, compression ratios and carefully crafted long queries.
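A minimal sketch of sentence selection by a weighted combination of features, as described above; the feature functions, weights and compression ratio are placeholders, not the paper's trained or derived values.

    def rank_sentences(sentences, features, weights, ratio=0.2):
        """Score each sentence as a weighted sum of feature functions
        f(sentence, index, all_sentences), then keep the top `ratio` of
        the document, returned in original order."""
        scored = []
        for i, sent in enumerate(sentences):
            score = sum(w * f(sent, i, sentences)
                        for f, w in zip(features, weights))
            scored.append((score, i))
        keep = max(1, int(len(sentences) * ratio))
        chosen = sorted(scored, reverse=True)[:keep]
        return [sentences[i] for _, i in sorted(chosen, key=lambda t: t[1])]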
Article
Full-text available
We argue that the advent of large volumes of full-length text, as opposed to short texts like abstracts and newswire, should be accompanied by corresponding new approaches to information access. Toward this end, we discuss the merits of imposing structure on full-length text documents; that is, a partition of the text into coherent multi-paragraph units that represent the pattern of subtopics that comprise the text. Using this structure, we can make a distinction between the main topics, which occur throughout the length of the text, and the subtopics, which are of only limited extent. We discuss why recognition of subtopic structure is important and how, to some degree of accuracy, it can be found. We describe a new way of specifying queries on full-length documents and then describe an experiment in which making use of the recognition of local structure achieves better results on a typical information retrieval task than does a standard IR measure.
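One standard way to recover multi-paragraph subtopic structure of this kind is to compare the vocabulary of adjacent text blocks and place boundaries where lexical similarity dips. The sketch below uses cosine similarity over word counts; it illustrates the idea and is not the paper's exact procedure, and the block size and threshold are assumptions.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two word-count Counters."""
        num = sum(a[t] * b[t] for t in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def subtopic_boundaries(paragraphs, block=2, threshold=0.1):
        """Mark a subtopic boundary wherever the similarity between the
        blocks of paragraphs before and after a gap drops below threshold."""
        bags = [Counter(p.lower().split()) for p in paragraphs]
        boundaries = []
        for gap in range(block, len(bags) - block + 1):
            before = sum(bags[gap - block:gap], Counter())
            after = sum(bags[gap:gap + block], Counter())
            if cosine(before, after) < threshold:
                boundaries.append(gap)
        return boundaries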
Article
Full-text available
In this paper we discuss research designed to investigate the ability of users to find information in texts written in languages unknown to them. One study shows how document thumbnail visualizations can be used effectively to choose potentially relevant documents. Another study shows how a user of a cross-language text retrieval system who has no foreign-language knowledge can nevertheless choose relevant documents using a variety of interactive techniques and Web resources. We review our current research on cross-language visualizations, and discuss the design of user-assisted methods for cross-language query term disambiguation and of automatic techniques, including Machine Translation, to display texts in languages other than the language of the query. We review work we have done on language-independent summarization and document summary translation. We then describe how these results are being applied and extended in the design of a new demonstration interface, ...
Article
Experimental subjects wrote abstracts of articles using a simplified version of the TEXNET abstracting assistance software. In addition to the full text, subjects were presented with either keywords or phrases extracted automatically. The resulting abstracts, and the times taken, were recorded automatically; some additional information was gathered by oral questionnaire. Selected abstracts were evaluated on various criteria by independent raters. Results showed considerable variation among subjects, but 37% found the keywords or phrases "quite" or "very" useful in writing their abstracts. Statistical analysis failed to support several hypothesized relations: phrases were not viewed as significantly more helpful than keywords, and abstracting experience did not correlate with originality of wording, approximation of the author abstract, or greater conciseness. Requiring further study are some unanticipated strong correlations, including the following: Windows experience and writing an abstract like the author's; experience reading abstracts and thinking one had written a good abstract; gender and abstract length; gender and use of words and phrases from the original text. The results have also suggested possible modifications to the TEXNET software.
Article
A new fractal technique for the analysis and compression of digital images is presented. It is shown that a family of contours extracted from an image can be modelled geometrically as a single entity, based on the theory of recurrent iterated function systems (RIFS). RIFS structures are a rich source of deterministic images, including curves which cannot be generated by standard techniques. Control and stability properties are investigated. We state a control theorem - the recurrent collage theorem - and show how to use it to constrain a recurrent IFS structure so that its attractor is close to a given family of contours. This closeness is not only perceptual; it is measured by means of a min-max distance, for which shape and geometry are important but slight shifts are not. It is therefore the right distance to use for geometric modelling. We show how a very intricate geometric structure, at all scales, is inherently encoded in a few parameters that entirely describe the recurrent structures. Very high data compression ratios can be obtained. The decoding phase is achieved through a simple and fast reconstruction algorithm. Finally, we suggest how higher-dimensional structures could be designed to model a wide range of digital textures, thus leading our research towards a complete image compression system that will take its input from some low-level image segmenter.
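For reference, the classical (non-recurrent) collage theorem that the recurrent version generalizes bounds the distance between a target set T and the attractor A of a contractive transformation W with contractivity factor s < 1:

    d(T, A) \le \frac{1}{1 - s}\, d\bigl(T, W(T)\bigr)

In practice one therefore searches for a W that maps T close to itself; the recurrent collage theorem stated in the paper plays the same role for recurrent IFS structures.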
Conference Paper
As a result of the rapid growth in Internet access, significantly more information has become available online in real time. However, users do not have sufficient time to read large volumes of information and make decisions accordingly. The problem of information overload can be alleviated through automatic summarization. Many summarization systems for documents in different languages have been implemented. However, the performance of summarization systems on documents in different languages has not yet been investigated. In this paper, we compare the results of the fractal summarization technique on parallel documents in Chinese and English. The grammatical and lexical differences between Chinese and English have a significant effect on the summarization processes. Their impact on the performance of summarization for Chinese and English parallel documents is compared.
Article
Excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts have been created entirely by automatic means. In the exploratory research described, the complete text of an article in machine-readable form is scanned by an IBM 704 data-processing machine and analyzed in accordance with a standard program. Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the “auto-abstract.”
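Luhn's significance measure can be sketched compactly: the most frequent content words are marked significant, and each sentence is scored by its densest cluster of significant words, (number of significant words in the cluster) squared divided by the cluster span. The version below is a simplified reading of the method (no stopword handling); top_k and max_gap are assumed parameters.

    from collections import Counter

    def luhn_scores(sentences, top_k=50, max_gap=4):
        """Luhn-style scoring over tokenized sentences: the top_k most
        frequent words are 'significant'; each sentence scores
        len(cluster)**2 / span for its densest cluster of significant
        words separated by at most max_gap insignificant ones."""
        freq = Counter(w for s in sentences for w in s)
        significant = {w for w, _ in freq.most_common(top_k)}
        scores = []
        for sent in sentences:
            positions = [i for i, w in enumerate(sent) if w in significant]
            best, cluster = 0.0, []
            for pos in positions:
                if cluster and pos - cluster[-1] > max_gap:
                    cluster = []          # gap too large: start a new cluster
                cluster.append(pos)
                span = cluster[-1] - cluster[0] + 1
                best = max(best, len(cluster) ** 2 / span)
            scores.append(best)
        return scores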
Article
Machine techniques for reducing technical documents to their essential discriminating indices are investigated. Human scanning patterns in selecting “topic sentences” and phrases composed of nouns and modifiers were simulated by computer program. The amount of condensation resulting from each method and the relative uniformity in indices are examined. It is shown that the coordinated index provided by the phrase is the more meaningful and discriminating.
Chapter
"...a blend of erudition (fascinating and sometimes obscure historical minutiae abound), popularization (mathematical rigor is relegated to appendices) and exposition (the reader need have little knowledge of the fields involved) ...and the illustrations include many superb examples of computer graphics that are works of art in their own right." Nature
Article
The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term-weighting systems. This article summarizes the insights gained in automatic term weighting, and provides baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared.
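The baseline single-term weightings referred to are typically of the tf-idf family; a minimal sketch (raw term frequency times inverse document frequency, without the normalization variants the article surveys):

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs: list of token lists. Returns one {term: weight} dict per
        document, weighting each term by its frequency in the document
        times the log inverse of its document frequency."""
        n = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
        return weights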
Conference Paper
To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. This paper focuses on document extracts, a particular kind of computed document summary. Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries. The trends in our results are in agreement with those of Edmundson, who used a subjectively weighted combination of features as opposed to training the feature weights using a corpus. We have developed a trainable summarization program that is grounded in a sound statistical framework.
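A compact sketch of the trainable approach described: treat extraction as classification and estimate, from document/abstract training pairs, how likely a sentence with a given boolean feature vector is to appear in a summary, under a naive independence assumption. Names and the add-one smoothing are assumptions.

    def train_feature_probs(examples):
        """examples: (boolean_feature_tuple, in_summary) pairs built from
        document/abstract pairs. Returns P(feature|in), P(feature|out)
        with add-one smoothing, plus a smoothed prior."""
        pos = [f for f, y in examples if y]
        neg = [f for f, y in examples if not y]
        k = len(examples[0][0])
        p_pos = [(sum(f[i] for f in pos) + 1) / (len(pos) + 2) for i in range(k)]
        p_neg = [(sum(f[i] for f in neg) + 1) / (len(neg) + 2) for i in range(k)]
        prior = (len(pos) + 1) / (len(examples) + 2)
        return p_pos, p_neg, prior

    def summary_odds(features, p_pos, p_neg, prior):
        """Posterior odds that a sentence with these features belongs in
        the summary, assuming feature independence; rank by this value."""
        odds = prior / (1 - prior)
        for f, pp, pn in zip(features, p_pos, p_neg):
            odds *= (pp / pn) if f else ((1 - pp) / (1 - pn))
        return odds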
Article
Relevance feedback is an automatic process, introduced over 20 years ago, designed to produce improved query formulations following an initial retrieval operation. The principal relevance feedback methods described over the years are examined briefly, and evaluation data are included to demonstrate the effectiveness of the various methods. Prescriptions are given for conducting text retrieval operations iteratively using relevance feedback. © 1990 John Wiley & Sons, Inc.
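The best known of these methods is the Rocchio reformulation, which moves the query vector toward the centroid of the judged-relevant documents D_R and away from the judged-non-relevant ones D_N:

    \vec{q}\,' = \alpha \vec{q} + \frac{\beta}{|D_R|} \sum_{\vec{d} \in D_R} \vec{d} - \frac{\gamma}{|D_N|} \sum_{\vec{d} \in D_N} \vec{d}

with the constants \alpha, \beta, \gamma tuned empirically.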
Article
As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications in cross-lingual information retrieval and natural language processing. In order to cross the boundaries that exist between different languages, dictionaries are the most typical tools. However, the general-purpose dictionary is less sensitive to both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. There are many domain-specific parallel or comparable corpora that are employed in machine translation and cross-lingual information retrieval. Most of these are corpora between Indo-European languages, such as English/French and English/Spanish. The Asian/Indo-European corpus, especially the English/Chinese corpus, is relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method is presented which is based on dynamic programming to identify the one-to-one Chinese and English title pairs. The method includes alignment at title level, word level and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. As one word in a language may translate into two or more repeated words in another language, the deletion edit operation is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method using the daily press release articles by the Hong Kong SAR government as the test bed. The precision of the result is 0.998 while the recall is 0.806. The release articles and speech articles published by Hongkong & Shanghai Banking Corporation Limited are also used to test our method; the precision is 1.00 and the recall is 0.948.
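The longest common subsequence at the core of the alignment can be computed with the textbook dynamic program; a minimal token-level version (the paper applies LCS at title, word and character level), with the normalized similarity as an assumed illustration of how candidate pairs might be scored:

    def lcs_length(a, b):
        """Length of the longest common subsequence of sequences a and b,
        via the standard O(len(a) * len(b)) dynamic program."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j],
                                                               dp[i][j-1])
        return dp[len(a)][len(b)]

    def lcs_similarity(a, b):
        """Symmetric similarity in [0, 1] for scoring candidate pairs."""
        return 2 * lcs_length(a, b) / (len(a) + len(b)) if (a or b) else 0.0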
Article
This paper describes new methods of automatically extracting documents for screening purposes, i.e. the computer selection of sentences having the greatest potential for conveying to the reader the substance of the document. While previous work has focused on one component of sentence significance, namely, the presence of high-frequency content words (key words), the methods described here also treat three additional components: pragmatic words (cue words); title and heading words; and structural indicators (sentence location). The research has resulted in an operating system and a research methodology. The extracting system is parameterized to control and vary the influence of the above four components. The research methodology includes procedures for the compilation of the required dictionaries, the setting of the control parameters, and the comparative evaluation of the automatic extracts with manually produced extracts. The results indicate that the three newly proposed components dominate the frequency component in the production of better extracts.
Article
Four working steps taken from a comprehensive empirical model of expert abstracting are studied in order to prepare an explorative implementation of a simulation model. It aims at explaining the knowledge processing activities during professional summarizing. Following the case-based and holistic strategy of qualitative empirical research, we develop the main features of the simulation system by investigating in detail a small but central test case—four working steps where an expert abstractor discovers what the paper is about and drafts the topic sentence of the abstract. Following the KADS methodology of knowledge engineering, our discussion begins with the empirical model (a conceptual model in KADS terms) and aims at a computational model which is implementable without determining the concrete implementation tools (the design model according to KADS). The envisaged solution uses a blackboard system architecture with cooperating object-oriented agents representing cognitive strategies and a dynamic text representation which borrows its conceptual relations in particular from RST (Rhetorical Structure Theory). As a result of the discussion we feel that a small simulation model of professional summarizing is feasible.
Article
The optimal amount of information needed in a given decision-making situation lies somewhere along a continuum from "not enough" to "too much". Ackoff proposed that information systems often hinder the decision-making process by creating information overload. To deal with this problem, he called for systems that could filter and condense data so that only relevant information reached the decision maker. The potential for information overload is especially critical in text-based information. The purpose of this research is to investigate the effects and theoretical limitations of extract condensing as a text processing tool in terms of recipient performance. In the experiment described here, an environment is created in which the effects of text condensing are isolated from the effects of message and individual recipient differences. The data show no difference in reading comprehension performance between the condensed forms and the original document. This indicates that condensed forms can be produced that are equally as informative as the original document. These results suggest that it is possible to apply a relatively simple computer algorithm to text and produce extracts that capture enough of the information contained in the original document so that the recipient can perform as if he or she had read the original. These results also identify a methodology for assessing the effectiveness of text condensing schemes. The research presented here contributes to a small but growing body of work on text-based information systems and, specifically, text condensing.
Book
Most writing on sociological method has been concerned with how accurate facts can be obtained and how theory can thereby be more rigorously tested. In The Discovery of Grounded Theory, Barney Glaser and Anselm Strauss address the equally important enterprise of how the discovery of theory from data--systematically obtained and analyzed in social research--can be furthered. The discovery of theory from data--grounded theory--is a major task confronting sociology, for such a theory fits empirical situations, and is understandable to sociologists and laymen alike. Most important, it provides relevant predictions, explanations, interpretations, and applications. In Part I of the book, "Generating Theory by Comparative Analysis," the authors present a strategy whereby sociologists can facilitate the discovery of grounded theory, both substantive and formal. This strategy involves the systematic choice and study of several comparison groups. In Part II, "The Flexible Use of Data," the generation of theory from qualitative, especially documentary, and quantitative data is considered. In Part III, "Implications of Grounded Theory," Glaser and Strauss examine the credibility of grounded theory. The Discovery of Grounded Theory is directed toward improving social scientists' capacity for generating theory that will be relevant to their research. While aimed primarily at sociologists, it will be useful to anyone interested in studying social phenomena--political, educational, economic, industrial--especially if their studies are based on qualitative data.
Article
After introductory training, a research assistant used the TEXNET abstracting assistance software to create abstracts of articles available via the World Wide Web. The assistant also compiled introductory documentation, including a guide to abstracting using computer-assistance tools. This article discusses problems encountered, the tools selected for preferred use, and implications for future software development.
Conference Paper
A document understanding method based on the tree representation of document structures is proposed. It is shown that documents have an obvious hierarchical structure in their geometry which is represented by a tree. A small number of rules are introduced to transform the geometric structure into the logical structure which represents the semantics. The virtual field separator technique is employed to utilize the information carried by special constituents of documents such as field separators and frames, keeping the number of transformation rules small. Experimental results on a variety of document formats have shown that the proposed method is applicable to most of the documents commonly encountered in daily use, although there is still room for further refinement of the transformation rules
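As a toy illustration of rule-based structure transformation in this spirit (not the paper's rule set or its virtual field separator technique), a geometric block tree can be mapped to a logical one by a single recursive rule:

    def to_logical(block):
        """Toy rule: within each geometric block, the first text child is
        read as a heading and the remaining children as its body, applied
        recursively. Blocks are plain dicts: {"text": ..., "children": [...]}."""
        kids = block.get("children", [])
        if not kids:
            return {"role": "paragraph", "text": block.get("text", "")}
        return {
            "role": "section",
            "heading": kids[0].get("text", ""),
            "body": [to_logical(k) for k in kids[1:]],
        }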
Article
An approach to image coding based on a fractal theory of iterated contractive transformations defined piecewise is described. The main characteristics of this approach are that it relies on the assumption that image redundancy can be efficiently captured and exploited through piecewise self-transformability on a block-wise basis, and that it approximates an original image by a fractal image, obtained from a finite number of iterations of an image transformation called a fractal code. This approach is referred to as fractal block coding. The general coding-decoding system is based on the construction, for an image to be encoded, of a fractal code (a contractive image transformation for which the original image is an approximate fixed point) which, when applied iteratively to any initial image of the decoder, produces a sequence of images that converges to a fractal approximation of the original. The design of a system for the encoding of monochrome digital images at rates below 1 bit/pixel is described. Ideas and extensions from the work of other researchers are presented.
Article
Keywords: …ing methods; fractals; information visualization; program display; UI theory. From the introduction: As computer systems evolve, the capability of storing and managing information increases more and more. At the same time, computer users must view increasing amounts of information through video displays which are physically limited in size. Displaying information (used here to mean a structured set of primitive elements specific to each application) effectively is a main concern in many software applications. For example, in visual programming systems [Shu 1988], graphic representations become very complex as the number of visual elements increases. In hypertext ...
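Koike's fractal view, which the fractal summarization model above builds on, propagates a degree-of-interest value from the focus node downward so that the amount of displayed information stays roughly constant: each child of a node with value v and N children receives v * C * N^(-1/D), with scale factor C and fractal dimension D. A minimal sketch over dict-based trees; all names and defaults are assumptions.

    def propagate_fractal_value(node, value, threshold, C=1.0, D=1.0):
        """Assign a fractal (degree-of-interest) value to each node from
        the focus downward; subtrees falling below the threshold are
        abstracted away at this view scale."""
        node["fv"] = value
        children = node.get("children", [])
        if not children:
            return
        child_value = value * C * len(children) ** (-1.0 / D)
        if child_value < threshold:
            return  # children stay hidden until the user zooms in
        for child in children:
            propagate_fractal_value(child, child_value, threshold, C, D)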
Article
This paper addresses the problem of identifying likely topics of texts by their position in the text. It describes the automated training and evaluation of an Optimal Position Policy, a method of locating the likely positions of topic-bearing sentences based on genre-specific regularities of discourse structure. This method can be used in applications such as information retrieval, routing, and text summarization.
Article
A useful first step in document summarisation is the selection of a small number of 'meaningful' sentences from a larger text.
MINDS - Multilingual Interactive Document Summarization
  • J Cowie
  • K Mahesh
  • S Nirenburg
  • R Zajac