Conference Paper

Automatic Summarization of Chinese and English Parallel Documents

Affiliation:
  • Hong Kong Metropolitan University

Abstract

As a result of the rapid growth in Internet access, significantly more information has become available online in real time. However, users do not have sufficient time to read large volumes of information and make decisions accordingly. This problem of information overload can be alleviated by automatic summarization. Many summarization systems have been implemented for documents in individual languages; however, the performance of a summarization system across documents in different languages has not yet been investigated. In this paper, we compare the results of the fractal summarization technique on parallel documents in Chinese and English. The grammatical and lexical differences between Chinese and English have a significant effect on the summarization process, and we compare their impact on the performance of summarization for Chinese and English parallel documents.

... The document hierarchy is language independent. The fractal summarization model has been implemented for English and Chinese (Wang & Yang, 2003). ...
Article
Many automatic text summarization models have been developed in the last decades. Related research in information science has shown that human abstractors extract sentences for summaries based on the hierarchical structure of documents; however, the existing automatic summarization models do not take into account the human abstractor's behavior of sentence extraction and only consider the document as a sequence of sentences during the process of extraction of sentences as a summary. In general, a document exhibits a well-defined hierarchical structure that can be described as fractals—mathematical objects with a high degree of redundancy. In this article, we introduce the fractal summarization model based on the fractal theory. The important information is captured from the source document by exploring the hierarchical structure and salient features of the document. A condensed version of the document that is informatively close to the source document is produced iteratively using the contractive transformation in the fractal theory. The fractal summarization model is the first attempt to apply fractal theory to document summarization. It significantly improves the divergence of information coverage of summary and the precision of summary. User evaluations have been conducted. Results have indicated that fractal summarization is promising and outperforms current summarization techniques that do not consider the hierarchical structure of documents.
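To make the contractive quota-allocation idea concrete, here is a minimal Python sketch. The names (Node, summarize) and the single per-unit weight are invented for illustration; the actual model weights document units by a much richer set of salient features.

```python
# Sketch of fractal summarization's quota allocation (names are invented;
# the real model weights units by many salient features, not one score).

class Node:
    def __init__(self, children=None, sentences=None, weight=1.0):
        self.children = children or []    # sub-units: chapters, sections, ...
        self.sentences = sentences or []  # (weight, text) pairs at the leaves
        self.weight = weight              # salience of this document unit

def summarize(node, quota):
    """Recursively distribute a sentence quota over the document tree."""
    if not node.children:
        # Leaf: keep the highest-weighted sentences up to the quota.
        ranked = sorted(node.sentences, key=lambda s: s[0], reverse=True)
        return [text for _, text in ranked[:max(1, round(quota))]]
    total = sum(c.weight for c in node.children)
    summary = []
    for child in node.children:
        # Contractive step: each child receives a quota share proportional
        # to its salience, with the same rule applied at every depth.
        summary.extend(summarize(child, quota * child.weight / total))
    return summary
```

The proportional split at each level is what mirrors fractal self-similarity: one extraction rule, reapplied at every depth of the hierarchy.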
Conference Paper
As a result of the recent information explosion, there is an increasing demand for automatic summarization, and human abstractors often synthesize summaries that are based on sentences that have been extracted by machine. However, the quality of machine-generated summaries is not high. As a special application of information retrieval systems, the precision of automatic summarization can be improved by user relevance feedback, in which the human abstractor can direct the sentence extraction process and useful information can be retrieved efficiently. Automatic summarization with relevance feedback is a helpful tool to assist professional abstractors in generating summaries, and in this work we propose a relevance feedback model for fractal summarization. The results of the experiment show that relevance feedback effectively improves the performance of automatic fractal summarization.
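The abstract does not spell out the feedback update rule, so the following is only a plausible Rocchio-style reading, not the paper's method: terms from sentences the abstractor accepts are boosted, and terms from rejected sentences are demoted, before re-extraction.

```python
# Hypothetical Rocchio-style reweighting to illustrate relevance feedback
# in sentence extraction; the paper's actual feedback model may differ.
from collections import Counter

def refine_term_weights(weights, accepted, rejected,
                        alpha=1.0, beta=0.75, gamma=0.25):
    """Boost terms from sentences the abstractor kept; demote rejected ones."""
    updated = Counter({t: alpha * w for t, w in weights.items()})
    for sentence in accepted:
        for term in sentence.lower().split():
            updated[term] += beta / max(1, len(accepted))
    for sentence in rejected:
        for term in sentence.lower().split():
            updated[term] -= gamma / max(1, len(rejected))
    return dict(updated)
```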
Conference Paper
Full-text available
Query-expansion is an effective Relevance Feedback technique for improving performance in Information Retrieval. In general, query-expansion methods select terms from the complete contents of relevant documents. One problem with this approach is that expansion terms unrelated to document relevance can be introduced into the modified query due to their presence in the relevant documents and their distribution in the document collection. Motivated by the hypothesis that query-expansion terms should only be sought from the most relevant areas of a document, this investigation explores the use of document summaries in query-expansion. The investigation explores the use of both context-independent standard summaries and query-biased summaries. Experimental results using the Okapi BM25 probabilistic retrieval model with the TREC-8 ad hoc retrieval task show that query-expansion using document summaries can be considerably more effective than using full-document expansion. The paper also presents a novel approach to term-selection that separates the choice of relevant documents from the selection of a pool of potential expansion terms. Again, this technique is shown to be more effective than standard methods.
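A minimal sketch of the summary-based expansion idea, assuming plain frequency counts for term scoring (the paper's actual term-selection values, such as Okapi offer weights, are not reproduced here):

```python
# Sketch: pool expansion terms from summaries of relevant documents only,
# scored here by simple frequency rather than a probabilistic offer weight.
from collections import Counter

def expand_query(query_terms, relevant_summaries, k=10):
    pool = Counter()
    for summary in relevant_summaries:
        pool.update(t for t in summary.lower().split() if t not in query_terms)
    return list(query_terms) + [t for t, _ in pool.most_common(k)]
```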
Conference Paper
In this paper, we introduce the fractal summarization model based on the fractal theory. In fractal summarization, the important information is captured from the source text by exploring the hierarchical structure and salient features of the document. A condensed version of the document that is informatively close to the original is produced iteratively using the contractive transformation in the fractal theory. User evaluation has shown that fractal summarization outperforms traditional summarization.
Article
We argue that the advent of large volumes of full-length text, as opposed to short texts like abstracts and newswire, should be accompanied by corresponding new approaches to information access. Toward this end, we discuss the merits of imposing structure on full-length text documents; that is, a partition of the text into coherent multi-paragraph units that represent the pattern of subtopics that comprise the text. Using this structure, we can make a distinction between the main topics, which occur throughout the length of the text, and the subtopics, which are of only limited extent. We discuss why recognition of subtopic structure is important and how, to some degree of accuracy, it can be found. We describe a new way of specifying queries on full-length documents and then describe an experiment in which making use of the recognition of local structure achieves better results on a typical information retrieval task than does a standard IR measure.
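A small illustration of how such subtopic boundaries can be detected by block comparison, in the spirit of this line of work rather than the paper's exact procedure: compute lexical similarity between adjacent blocks of sentences, and treat pronounced dips as boundary candidates.

```python
# Illustrative block comparison: a dip in lexical similarity between the
# sentence blocks on either side of a gap suggests a subtopic boundary.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_scores(sentences, block=3):
    """Similarity across each gap; local minima are boundary candidates."""
    scores = []
    for i in range(block, len(sentences) - block + 1):
        left = Counter(w for s in sentences[i - block:i] for w in s.split())
        right = Counter(w for s in sentences[i:i + block] for w in s.split())
        scores.append((i, cosine(left, right)))
    return scores
```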
Article
We propose a method of paraphrasing a Japanese noun modifier into a noun phrase of the form "A no B." The semantic structures of "A no B" are sometimes recognized by supplementing some abbreviated predicate. We define these abbreviated verbs as "deletable verbs" in two ways: 1. We choose verbs matched with the semantic relations of "A no B" by using a thesaurus. 2. We choose verbs associated with specific nouns; if a verb frequently co-occurs with a noun in newspaper articles, we conclude that the verb is associated with the noun. By defining "deletable verbs" and utilizing the variety of semantic structures of "A no B," we accomplish this paraphrasing using only surface linguistic characteristics. In natural language, various expressions can be used to denote an identical object, and a human can paraphrase an expression into other expressions with the same meaning. Paraphrasing is an essential human skill in using natural language (Sato, 1999), thus its rea...
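The second criterion lends itself to a one-line sketch; the data layout and threshold below are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch of the co-occurrence criterion: treat a verb as
# "deletable" for a noun if it co-occurs with that noun often enough.
def deletable_verbs(noun, cooccurrence, threshold=5):
    """cooccurrence: {noun: {verb: count}} gathered from newspaper text."""
    return {v for v, c in cooccurrence.get(noun, {}).items() if c >= threshold}
```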
Article
In this paper we discuss research designed to investigate the ability of users to find information in texts written in languages unknown to them. One study shows how document thumbnail visualizations can be used effectively to choose potentially relevant documents. Another study shows how a user of a cross-language text retrieval system who has no foreign language knowledge can nevertheless choose relevant documents using a variety of interactive techniques and Web resources. We review our current research on cross-language visualizations, and discuss the design of user-assisted methods for cross-language query term disambiguation and of automatic techniques, including Machine Translation, to display texts in languages other than the language of the query. We review work we have done concerning language-independent summarization and document summary translation. We then describe how these results are being applied and extended in the design of a new demonstration interface, ...
Book
Communication Systems and Information Theory. A Measure of Information. Coding for Discrete Sources. Discrete Memoryless Channels and Capacity. The Noisy-Channel Coding Theorem. Techniques for Coding and Decoding. Memoryless Channels with Discrete Time. Waveform Channels. Source Coding with a Fidelity Criterion. Index.
Article
A new fractal technique for the analysis and compression of digital images is presented. It is shown that a family of contours extracted from an image can be modelled geometrically as a single entity, based on the theory of recurrent iterated function systems (RIFS). RIFS structures are a rich source for deterministic images, including curves which cannot be generated by standard techniques. Control and stability properties are investigated. We state a control theorem - the recurrent collage theorem - and show how to use it to constrain a recurrent IFS structure so that its attractor is close to a given family of contours. This closeness is not only perceptual; it is measured by means of a min-max distance, for which shape and geometry is important but slight shifts are not. It is therefore the right distance to use for geometric modeling. We show how a very intricate geometric structure, at all scales, is inherently encoded in a few parameters that describe entirely the recurrent structures. Very high data compression ratios can be obtained. The decoding phase is achieved through a simple and fast reconstruction algorithm. Finally, we suggest how higher dimensional structures could be designed to model a wide range of digital textures, thus leading our research towards a complete image compression system that will take its input from some low-level image segmenter.
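For orientation, the standard (non-recurrent) collage theorem that the recurrent variant generalizes can be stated as follows, with h the Hausdorff distance, W a contractive collage map with contractivity factor s < 1, L the target set, and A the attractor of W (standard notation, not necessarily the paper's):

```latex
\[
  h\bigl(L, W(L)\bigr) \le \varepsilon
  \quad\Longrightarrow\quad
  h(L, A) \le \frac{\varepsilon}{1 - s}
\]
```

Making the collage of L close to L thus guarantees the attractor is close too, which is what licenses encoding a family of contours in a few RIFS parameters.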
Article
The geometry of fractal shapes, developed to describe complex natural shapes, is presented. The topics covered include: classic fractals, galaxies and eddies, scaling fractals, nonscaling fractals, self-mapping fractals, randomness, stratified random fractals, fractional brown fractals, random tremas, and texture.
Book
Information retrieval is a sub-field of computer science that deals with the automated storage and retrieval of documents. Providing the latest information retrieval techniques, this guide discusses Information Retrieval data structures and algorithms, including implementations in C. Aimed at software engineers building systems with book processing components, it provides a descriptive and evaluative explanation of storage and retrieval systems, file structures, term and query operations, document operations and hardware. Contains techniques for handling inverted files, signature files, and file organizations for optical disks. Discusses such operations as lexical analysis and stoplists, stemming algorithms, thesaurus construction, and relevance feedback and other query modification techniques. Provides information on Boolean operations, hashing algorithms, ranking algorithms and clustering algorithms. In addition to being of interest to software engineering professionals, this book will be useful to information science and library science professionals who are interested in text retrieval technology.
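As a taste of the material, a minimal inverted file (one of the structures the book treats) can be built in a few lines; this Python sketch is illustrative, not the book's C implementation.

```python
# Minimal inverted file: each term maps to a posting list of
# (document id, position) pairs. Illustrative only.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

index = build_inverted_index(["fractal summarization of documents",
                              "chinese document segmentation"])
# index["documents"] == [(0, 3)]
```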
Article
Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. Digital library research projects (DLI-1) sponsored by NSF/DARPA/NASA have a common theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects. Indexing is an important task for searching: the better the indexing, the better the searching result. Developing a universal digital library has been the dream of many researchers; however, there are still many problems to be solved before such a vision is fulfilled. The most critical is to support cross-lingual retrieval, or a multilingual digital library. Much work has been done on English information retrieval, but there is relatively little work on Chinese information retrieval. In this article, we focus on Chinese indexing, which is the foundation of Chinese and cross-lingual information retrieval. The smallest indexing units in Chinese digital libraries are words, while the smallest units in a Chinese sentence are characters. However, Chinese text has no delimiter to mark word boundaries, as there is in English text. In English and other languages using Roman- or Greek-based orthographies, spacing usually indicates word boundaries reliably. In Chinese, characters run together without any delimiters to indicate the boundaries between consecutive words. In this article, we investigate combination and boundary detection approaches based on mutual information for segmentation. The combination approach combines n-grams to form words with more characters; within it, Algorithm 1 does not allow overlapping of n-grams while Algorithm 2 does. The boundary detection approach detects the segmentation points of a sentence based on the values, and the change of values, of the mutual information. Experiments are conducted to evaluate their performance. An interface of the system is also presented to show how a Chinese web page is downloaded and the text in the page filtered and segmented into words. The segmented words can be submitted for indexing, or new unknown words can be identified and submitted to a dictionary.
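A simplified Python sketch of the boundary-detection idea: pointwise mutual information between adjacent characters is computed, and a weak association suggests a segmentation point. The counts, the default of 1, and the threshold are illustrative assumptions, not the article's tuned values.

```python
# Simplified boundary detection: a weak pointwise-MI association between
# adjacent characters suggests a word boundary.
import math

def pointwise_mi(pair_count, left_count, right_count, total):
    p_xy = pair_count / total
    p_x, p_y = left_count / total, right_count / total
    return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

def segment(sentence, char_freq, pair_freq, total, threshold=0.0):
    words, start = [], 0
    for i in range(len(sentence) - 1):
        mi = pointwise_mi(pair_freq.get(sentence[i:i + 2], 0),
                          char_freq.get(sentence[i], 1),
                          char_freq.get(sentence[i + 1], 1), total)
        if mi < threshold:            # weak association: cut between i, i+1
            words.append(sentence[start:i + 1])
            start = i + 1
    words.append(sentence[start:])
    return words
```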
Article
Machine techniques for reducing technical documents to their essential discriminating indices are investigated. Human scanning patterns in selecting “topic sentences” and phrases composed of nouns and modifiers were simulated by computer program. The amount of condensation resulting from each method and the relative uniformity in indices are examined. It is shown that the coordinated index provided by the phrase is the more meaningful and discriminating.
Article
The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term-weighting systems. This article summarizes the insights gained in automatic term weighting, and provides baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared.
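One common member of the weighting family the article surveys is the tf-idf product, shown here as a representative baseline rather than the article's single recommendation:

```latex
\[
  w_{t,d} \;=\; \mathit{tf}_{t,d} \cdot \log\frac{N}{n_t}
\]
% tf_{t,d}: frequency of term t in document d; N: collection size;
% n_t: number of documents containing t.
```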
Conference Paper
With the explosion in the quantity of on-line text and multimedia information in recent years, demand for text summarization technology is growing. Increased pressure for technology advances is coming from users of the web, on-line information sources, and new mobile devices, as well as from the need for corporate knowledge management. Commercial companies are increasingly starting to offer text summarization capabilities, often bundled with information retrieval tools. In this paper, I will discuss the significance of some recent developments in summarization technology.
Article
As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications in cross-lingual information retrieval and natural language processing. In order to cross the boundaries that exist between different languages, dictionaries are the most typical tools. However, the general-purpose dictionary is less sensitive to both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. There are many domain-specific parallel or comparable corpora that are employed in machine translation and cross-lingual information retrieval. Most of these are corpora between Indo-European languages, such as English/French and English/Spanish. The Asian/Indo-European corpus, especially the English/Chinese corpus, is relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method is presented which is based on dynamic programming to identify the one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. As one word in a language may translate into two or more words repetitively in another language, the edit operation of deletion is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method using the daily press release articles by the Hong Kong SAR government as the test bed. The precision of the result is 0.998 while the recall is 0.806. The release articles and speech articles published by Hongkong & Shanghai Banking Corporation Limited are also used to test our method; the precision is 1.00, and the recall is 0.948.
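The LCS primitive at the core of the alignment is the textbook dynamic program below; the title-, word- and character-level scoring and the deletion handling described above are omitted.

```python
# Textbook longest-common-subsequence DP, the primitive underlying the
# alignment method (multi-level scoring details omitted).
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y \
                       else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

# lcs_length("press release", "press releases") == 13
```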
Article
This paper describes new methods of automatically extracting documents for screening purposes, i.e. the computer selection of sentences having the greatest potential for conveying to the reader the substance of the document. While previous work has focused on one component of sentence significance, namely, the presence of high-frequency content words (key words), the methods described here also treat three additional components: pragmatic words (cue words); title and heading words; and structural indicators (sentence location). The research has resulted in an operating system and a research methodology. The extracting system is parameterized to control and vary the influence of the above four components. The research methodology includes procedures for the compilation of the required dictionaries, the setting of the control parameters, and the comparative evaluation of the automatic extracts with manually produced extracts. The results indicate that the three newly proposed components dominate the frequency component in the production of better extracts.
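The four-component score can be sketched as a weighted sum; the feature definitions below (for instance, counting only first and last positions as location hits) are simplifications, and a..d stand for the parameterized controls the paper describes.

```python
# Sketch of the four-component sentence score: cue words, key words,
# title/heading words, and sentence location, linearly combined.
def sentence_score(terms, position, n_sentences, cue_words, key_weights,
                   title_words, a=1.0, b=1.0, c=1.0, d=1.0):
    cue = sum(1 for t in terms if t in cue_words)
    key = sum(key_weights.get(t, 0) for t in terms)
    title = sum(1 for t in terms if t in title_words)
    location = 1.0 if position in (0, n_sentences - 1) else 0.0
    return a * cue + b * key + c * title + d * location
```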
Article
This paper proposes that the process of language understanding can be modeled as a collective phenomenon that emerges from a myriad of microscopic and diverse activities. The process is analogous to the crystallization process in chemistry. The essential features of this model are: asynchronous parallelism; temperature-controlled randomness; and statistically emergent active symbols. A computer program that tests this model on the task of capturing the effect of context on the perception of ambiguous word boundaries in Chinese sentences is presented. The program adopts a holistic approach in which word identification forms an integral component of sentence analysis. Various types of knowledge, from statistics to linguistics, are seamlessly integrated for the tasks of word boundary disambiguation as well as sentential analysis. Our experimental results showed that the model is able to address the word boundary ambiguity problems effectively.
Article
Four working steps taken from a comprehensive empirical model of expert abstracting are studied in order to prepare an explorative implementation of a simulation model. It aims at explaining the knowledge processing activities during professional summarizing. Following the case-based and holistic strategy of qualitative empirical research, we develop the main features of the simulation system by investigating in detail a small but central test case—four working steps where an expert abstractor discovers what the paper is about and drafts the topic sentence of the abstract. Following the KADS methodology of knowledge engineering, our discussion begins with the empirical model (a conceptual model in KADS terms) and aims at a computational model which is implementable without determining the concrete implementation tools (the design model according to KADS). The envisaged solution uses a blackboard system architecture with cooperating object-oriented agents representing cognitive strategies and a dynamic text representation which borrows its conceptual relations in particular from RST (Rhetorical Structure Theory). As a result of the discussion we feel that a small simulation model of professional summarizing is feasible.
Book
Most writing on sociological method has been concerned with how accurate facts can be obtained and how theory can thereby be more rigorously tested. In The Discovery of Grounded Theory, Barney Glaser and Anselm Strauss address the equally important enterprise of how the discovery of theory from data--systematically obtained and analyzed in social research--can be furthered. The discovery of theory from data--grounded theory--is a major task confronting sociology, for such a theory fits empirical situations and is understandable to sociologists and laymen alike. Most important, it provides relevant predictions, explanations, interpretations, and applications. In Part I of the book, "Generating Theory by Comparative Analysis," the authors present a strategy whereby sociologists can facilitate the discovery of grounded theory, both substantive and formal. This strategy involves the systematic choice and study of several comparison groups. In Part II, "The Flexible Use of Data," the generation of theory from qualitative, especially documentary, and quantitative data is considered. In Part III, "Implications of Grounded Theory," Glaser and Strauss examine the credibility of grounded theory. The Discovery of Grounded Theory is directed toward improving social scientists' capacity for generating theory that will be relevant to their research. While aimed primarily at sociologists, it will be useful to anyone interested in studying social phenomena--political, educational, economic, industrial--especially if their studies are based on qualitative data.
Article
Keywords: ...ing methods; fractals; information visualization; program display; UI theory. As computer systems evolve, the capability of storing and managing information increases more and more. At the same time, computer users must view increasing amounts of information through video displays which are physically limited in size. Displaying information effectively is a main concern in many software applications. (The word "information" is used here to mean a structured set of primitive elements specific to each application.) For example, in visual programming systems [Shu 1988], graphic representations become very complex if the number of visual elements increases. In hypertext ...
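As this line of work is usually presented, the fractal view propagates a focus value down the display tree by a contraction at each level; treat the exact form below as an assumption rather than a quotation from the article.

```latex
\[
  F_{\text{child}} = r_x \, F_{\text{parent}},
  \qquad r_x = C \, N_x^{-1/D}
\]
% N_x: number of children of node x; D: a fractal dimension controlling
% the decay; C: a constant. Nodes whose F drops below a threshold are
% abstracted away from the display.
```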
Article
This paper addresses the problem of identifying likely topics of texts by their position in the text. It describes the automated training and evaluation of an Optimal Position Policy, a method of locating the likely positions of topic-bearing sentences based on genre-specific regularities of discourse structure. This method can be used in applications such as information retrieval, routing, and text summarization.
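A toy version of learning and applying such a policy; the keyword-hit training signal below is an assumption for illustration, not the paper's evaluation measure.

```python
# Toy Optimal Position Policy: rank sentence positions by how often they
# carried topic keywords in training documents, then extract from the
# best-ranked positions of a new document.
from collections import Counter

def learn_policy(training_docs, topic_words):
    hits = Counter()
    for sentences in training_docs:          # each document: list of sentences
        for pos, s in enumerate(sentences):
            if any(w in s.lower() for w in topic_words):
                hits[pos] += 1
    return [pos for pos, _ in hits.most_common()]

def extract(sentences, policy, k=3):
    chosen = [p for p in policy if p < len(sentences)][:k]  # best k positions
    return [sentences[p] for p in sorted(chosen)]           # document order
```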
Article
The majority of information retrieval experiments are evaluated by measures such as average precision and average recall. Fundamental decisions about the superiority of one retrieval technique over another are made solely on the basis of these measures. We claim that average performance figures need to be validated with a careful statistical analysis and that there is a great deal of additional information that can be uncovered by looking closely at the results of individual queries. This paper is a case study of stemming algorithms which describes a number of novel approaches to evaluation and demonstrates their value.
MINDS-Multilingual Interactive Document Summarization
  • J Cowie
  • K Mahesh
  • S Nirenburg
  • R Zajac
Combining Dictionary, Rules and Statistical Information in Segmentation of Chinese
  • J Y Nie
  • M L Hannan
  • W Jin