Conference Paper

Chinese-English Term Translation Mining Based on Semantic Prediction

Abstract

Mining Chinese term translations from abundant Web resources can be applied in many fields, such as reading/writing assistants, machine translation, and cross-language information retrieval. In mining English translations of Chinese terms, obtaining effective Web pages and evaluating translation candidates are two challenging issues. In this paper, an approach based on semantic prediction is first proposed to obtain effective Web pages. The proposed method predicts possible English meanings according to each constituent unit of the Chinese term, and expands these English items using semantically relevant knowledge for searching. The refined related terms are extracted from the top retrieved documents through feedback learning to construct a new query expansion for acquiring more effective Web pages. To obtain a correct translation list, a translation evaluation method based on the weighted sum of multiple features is presented to rank the candidates estimated from effective Web pages. Experimental results demonstrate that the proposed method performs well in Chinese-English term translation acquisition, achieving 82.9% accuracy.
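The weighted-sum ranking mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the features or their weights, so the feature names (`frequency`, `proximity`, `dictionary_overlap`), the weight values, and the example candidates below are all hypothetical and are not the authors' actual model.

```python
from typing import Dict, List, Tuple

# Hypothetical feature weights; the abstract does not list the real features or weights.
WEIGHTS: Dict[str, float] = {
    "frequency": 0.5,
    "proximity": 0.3,
    "dictionary_overlap": 0.2,
}


def score(features: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of feature values; a feature missing from `features` counts as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank_candidates(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from highest to lowest score."""
    scored = [(cand, score(feats)) for cand, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative candidates with made-up, already normalised feature values.
candidates = {
    "meaning forecast": {"frequency": 0.2, "proximity": 0.5, "dictionary_overlap": 0.5},
    "semantic prediction": {"frequency": 0.9, "proximity": 0.8, "dictionary_overlap": 1.0},
}
ranking = rank_candidates(candidates)
for candidate, value in ranking:
    print(f"{candidate}: {value:.2f}")
```

In a sketch like this, the weights would normally be tuned on held-out term pairs rather than fixed by hand.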
... [Fei Huang et al. 2005] took continuous English strings as candidates and possible translations [1] . [Gaolin Fang et al. 2006] proposed a method to segment the source terms, and expand the terms to collect web pages. Each English word was built as a beginning index, and ...
... then the string candidates are constructed with the increase of string in the form of an English word unit in a 100-byte window with the keyword at the center [2] . [Sun Jun et al. 2008] proposed a forward-backward maximum matching method to segment the source term. ...
... , ( _ (2) Where is the source term, t is one candidate. ...
Article
Most researchers extract candidate terms using unsupervised methods. In this paper, a supervised candidate term extraction method is proposed. It combines English part-of-speech information with a headword expansion chunking strategy. First, it retrieves bilingual snippets from the web by term expansion; then it crawls Chinese-English pages and screens out the Chinese words from them; finally, a headword expansion chunking strategy is used to identify English phrases, and the English noun phrases and verb phrases are selected. These selected English phrases serve as the final candidate terms for term translation mining. Experimental results show that the supervised candidate term extraction method improves the top-10 inclusion rate by 1.6% over the baseline system, which verifies that the method is effective. (C) 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [CEIS 2011]
... Using punctuations to divide snippets and taking continual English strings as candidates, they evaluated potential translations by a combined model. Developing an approach for partitioning source terms, Fang et al. [24] expanded these terms in order to extract web pages. Building each English term as a beginning index, candidate strings were constructed as the string of a unit of English words increases in a 100-byte window centered on the keyword. ...
... According to Fang et al. [24], the overall feature cost is the linear combination of all features used. This research proposes a different strategy for integrating features. ...
Article
Bilingual web pages are widely used to mine translations of unknown terms. This study focused on an effective solution for obtaining relevant web pages, extracting translations with correct lexical boundaries, and ranking the translation candidates. This research adopted co-occurrence information to obtain the subject terms and then expanded the source query with the translation of the subject terms to collect effective bilingual search engine snippets. Afterwards, valid candidates were extracted from small-sized, noisy bilingual corpora using an improved frequency change measurement that combines adjacent information. This research developed a method that considers surface patterns, frequency–distance, and phonetic features to select an appropriate translation. The experimental results revealed that the proposed method performed remarkably well for mining translations of unknown terms.
... There are various studies that attempted to translate English to the local dialect. For example, Fang et al. [3] have explored methods for translation and disambiguation for out-of-vocabulary (OOV) terms when using a Chinese query on an English text collection. Their Chinese-English Term Translation system has two parts, i.e., web page handling and term translation mining. ...
... In SLT, the central issue is that the quality of the translation must be extremely high [7]. The related study is summarized in table 1. Fang et al. [3] Chinese-English cross-lingual information retrieval (CLIR). ...
Conference Paper
This paper reports the design and development of the Sarawak Malay Dialect (SMD) Online Translation Tool. This web-based tool provides a translation process for Sarawak Malay Dialect words to Bahasa Malaysia words or vice versa. The site is implemented using a combination of Hypertext Preprocessor (PHP), JavaScript, and HyperText Markup Language (HTML). The database is created and managed using an integrated server package of Apache, MySQL, PHP and Perl (XAMPP). In this paper we also report a usability study of the tool. We believe this is the first attempt at building up a comprehensive documentation of the Sarawak Malay Dialect.
... Li, Cao and Li (2003) present an English reading-assistance system that suggests translations of words and phrases based on mining techniques. Gaolin, Hao and Fumihito (2006) show a method to predict possible English meanings according to each component of a Chinese term. ...
Article
In this paper we present a methodology that makes it possible to mine a document collection from a domain without knowing the language in which the documents are written. We describe in detail a method, tools and results that can be used within a digital library context for Science Watch and Competitive Intelligence. We consider a collection associated with the aquaculture domain written in Chinese and extracted from a digital library. Based on the original coding (UNICODE) of the data and the tag marking the structure of the documents, we extract key elements (authors, phrases, etc.) from within the domain and analyse them. The results are displayed in the form of graphs and networks. We extract people networks and semantic networks before examining their evolution over a period of several years. The principles developed in this paper can be applied to any language.
Chapter
This paper focuses on the Web-based Chinese-English Out-of-Vocabulary (OOV) term translation pattern, and emphasizes translation selection based on multiple feature fusion and ranking based on the Ranking Support Vector Machine (Ranking SVM). Using the SIGHAN2005 corpus for the Chinese Named Entity Recognition (NER) task together with selected new terms, experiments on different data sources show consistent results. Experiments combining our model with Chinese-English Cross-Language Information Retrieval (CLIR) on the TREC data sets show clear performance improvements for both query translation and CLIR.
Article
This paper introduces a method for translating Chinese terms into English. Our motivation is to provide deep semantic-level information for term translation by analyzing the semantic structure of terms. Using the contextual information in the term and the first sememe of each word in HowNet as features, we trained a Support Vector Machine (SVM) model to identify the dependencies among words in a term. A Conditional Random Field (CRF) model is then trained to mark semantic relations for term dependencies. During translation, the semantic relations within the Chinese terms are identified, and three features based on semantic structure are integrated into the phrase-based statistical machine translation system. Experimental results show that the proposed method achieves an improvement of 1.58 BLEU points over the baseline system.
Conference Paper
The traditional bilingual snippet retrieval method selects the top N snippets returned from a web search engine as bilingual corpora. In this paper, an improved bilingual snippet retrieval method is proposed. It combines term expansion, a surface pattern matching model, and top-N bilingual snippet selection to retrieve bilingual snippets. More relevant bilingual snippets can be found by using term expansion, and surface pattern matching is useful for selecting bilingual snippets according to the constitution of unknown term translations. The top 100 bilingual snippets and those that satisfy the surface pattern matching model serve as the final bilingual corpora for term translation mining. Experimental results show that the improved bilingual snippet retrieval method improves the top-100 inclusion rate by 2.3% over the baseline system, which verifies that the method is effective.
Article
This paper proposes a simple but powerful approach for automatically obtaining technical term translation pairs in the patent domain from the Web. First, several technical terms are used as seed queries and submitted to a search engine. Second, an extraction algorithm is proposed to extract key word translation pairs from the returned web pages. Finally, a multi-feature based evaluation method is proposed to pick out those translation pairs that are true technical term translation pairs in the patent domain. With this method, we obtain about 8,890,000 key word translation pairs which can be used to translate the technical terms in patent documents. Experimental results show that the precision of these translation pairs is more than 99%, and that their coverage of the technical terms in patent documents is more than 84%.
Article
Due to the limited coverage of existing bilingual dictionaries, it is often difficult to translate Out-Of-Vocabulary (OOV) terms in many natural language processing tasks. In this paper, we propose a general three-step cascade mining technique that leverages the OOV category to optimize the effectiveness of each step. An OOV category based expansion policy is suggested to get more relevant mixed-language documents. An OOV category based hybrid extraction approach is suggested to perform robust extraction. A more flexible model combination based on OOV category is also suggested. Moreover, we conducted experiments to evaluate the effectiveness of each step and the overall performance of the mining technique. The experimental results show a significant performance improvement over existing methods.
Article
In the Cross-Language Information Retrieval (CLIR) process, Out-Of-Vocabulary (OOV) or unknown word translation is a significant and challenging issue. Specifically, for English-Chinese OOV translation, OOV term detection and the extraction of translation pairs remain key problems. In this paper, an English-Chinese OOV translation pattern based on PAT-Tree is proposed. Web mining is utilized as the corpus source to collect translation pairs, and translation candidates are acquired by Chinese OOV term extraction based on PAT-Tree. The experimental results show that the proposed approach can outperform some of the current translation engines, and is especially efficient in English-Chinese OOV translation.
Article
The web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is a fabulous linguists' playground. The Special Issue explores ways in which this dream is being explored.
Article
UC Berkeley participated in the pivot bilingual task of the CLIR track at NTCIR Workshop 4. Our focus was on Chinese and Korean searches against the Japanese News document collection, using English as a pivot language. For comparison of our pivot techniques, we submitted Japanese monolingual and English→Japanese bilingual search rankings as well. Two different commercial translation software packages were used in quite different ways: one did standard query translation from Chinese or Korean topics to English and then to Japanese, while the other was used to translate the Japanese corpus to English word-by-word using 'fast document translation'. Another interesting search approach was to segment and use Chinese search topics directly as if they were Japanese topics.
Article
This paper describes our Korean-English cross-language information retrieval system for NTCIR-4. Our system is based on a query translation approach with a bilingual dictionary and co-occurrence information between English terms in an English corpus. This year, we have focused on the translation of unknown words. We have expanded the existing bilingual dictionary by manually gathering some Korean-English translation pairs for Korean words from the Web. Other unknown words not contained in the expanded bilingual dictionary are automatically transliterated into English using a pre-constructed mapping table. Some issues in processing Korean queries and documents are also described, such as the identification of Korean phrases. On the evaluation collections for NTCIR-4, the performance of our system is 30.25% for the description query type, 33.33% for the title query type, and 32.47% for the combined description and narrative query type in relax scoring. Post-submission experiments show that our expanded dictionary and transliteration mechanism improve the performance of our system.
Conference Paper
Mining terminology translation from a large amount of Web data can be applied in many fields, such as reading/writing assistants, machine translation, and cross-language information retrieval. How to find more comprehensive results from the Web and obtain the boundaries of candidate translations, and how to remove irrelevant noise and rank the remaining candidates, are the challenging issues. In this paper, after reviewing and analyzing all possible methods of acquiring translations, a feasible statistics-based method is proposed to mine terminology translation from the Web. In the proposed method, on the basis of an analysis of different forms of term translation distributions, character-based string frequency estimation is presented to construct term translation candidates for exploring more translations and their boundaries, and then sort-based subset deletion and mutual information methods are respectively proposed to deal with the subset redundancy information and prefix/suffix redundancy information formed in the process of estimation. Extensive experiments on two test sets of 401 and 3511 English terms validate that our system has better performance.
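One possible reading of the sort-based subset deletion step named in this abstract is sketched below. The abstract gives no algorithmic detail, so the rule used here (drop a candidate when it is a substring of a longer kept candidate and occurs no more often than that candidate, within a `tolerance`) is an assumption, and the function name `delete_subsets` and the example counts are hypothetical.

```python
from typing import Dict


def delete_subsets(freq: Dict[str, int], tolerance: int = 0) -> Dict[str, int]:
    """Drop candidates whose occurrences are explained by a longer kept candidate.

    Assumed rule (not taken from the paper): a candidate is redundant when it is
    a substring of an already kept, longer candidate and its frequency exceeds
    that candidate's frequency by at most `tolerance`.
    """
    kept: Dict[str, int] = {}
    # Sort longest first so every possible superstring is examined before its substrings.
    for cand in sorted(freq, key=len, reverse=True):
        redundant = any(
            cand in longer and freq[cand] - kept[longer] <= tolerance
            for longer in kept
        )
        if not redundant:
            kept[cand] = freq[cand]
    return kept


# Made-up string frequencies for illustration.
counts = {"information retrieval": 12, "information": 12, "retrieval": 30}
result = delete_subsets(counts)
print(result)  # {'information retrieval': 12, 'retrieval': 30}
```

Here "information" is dropped because it never occurs outside "information retrieval", while "retrieval" survives because it also occurs 18 times on its own.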
Conference Paper
There have been significant advances in Cross-Language Information Retrieval (CLIR) in recent years. One of the major remaining reasons that CLIR does not perform as well as monolingual retrieval is the presence of out of vocabulary (OOV) terms. Previous work has either relied on manual intervention or has only been partially successful in solving this problem. We use a method that extends earlier work in this area by augmenting this with statistical analysis, and corpus-based translation disambiguation to dynamically discover translations of OOV terms. The method can be applied to both Chinese-English and English-Chinese CLIR, correctly extracting translations of OOV terms from the Web automatically, and thus is a significant improvement on earlier work.
Conference Paper
We present a system for extracting an English translation of a given Japanese technical term by collecting and scoring translation candidates from the web. We first show that there are many partially bilingual documents on the web that could be useful for term translation, discovered by using a commercial technical term dictionary and an Internet search engine. We then present an algorithm for obtaining translation candidates based on the distance between Japanese and English terms in web documents, and report the results of a preliminary experiment.
Article
We present a statistical word feature, the Word Relation Matrix, which can be used to find translated pairs of words and terms from non-parallel corpora, across language groups. Online dictionary entries are used as seed words to generate Word Relation Matrices for the unknown words according to correlation measures. Word Relation Matrices are then mapped across the corpora to find translation pairs. Translation accuracies are around 30% when only the top candidate is counted. Nevertheless, the top-20 candidate output gives a 50.9% average increase in accuracy on human translator performance.
Article
Algorithms for the alignment of words in translated texts are well established. However, only recently new approaches have been proposed to identify word translations from non-parallel or even unrelated texts. This task is more difficult, because most statistical clues useful in the processing of parallel texts cannot be applied to non-parallel texts. Whereas for parallel texts in some studies up to 99% of the word alignments have been shown to be correct, the accuracy for non-parallel texts has been around 30% up to now. The current study, which is based on the assumption that there is a correlation between the patterns of word co-occurrences in corpora of different languages, makes a significant improvement to about 72% of word translations identified correctly.
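The co-occurrence assumption behind this study can be illustrated with a small sketch. It assumes each word is represented by its co-occurrence counts with a shared list of seed concepts whose translations are already known, and it picks the target word whose vector is most similar. Cosine similarity is used here only for illustration; the abstract does not say which similarity measure the study uses, and the names `cosine` and `best_translation` and all counts are hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_translation(source_vec: List[float], target_vecs: Dict[str, List[float]]) -> str:
    """Return the target word whose co-occurrence vector is closest to `source_vec`."""
    return max(target_vecs, key=lambda word: cosine(source_vec, target_vecs[word]))


# Made-up co-occurrence counts against three seed concepts (school, pupil, water).
# The source-language counts are assumed to be already mapped onto the same
# seed dimensions via a seed dictionary.
source_word_vec = [8.0, 1.0, 0.0]
target_word_vecs = {
    "teacher": [7.0, 2.0, 0.0],
    "river": [0.0, 1.0, 9.0],
}
match = best_translation(source_word_vec, target_word_vecs)
print(match)  # teacher
```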
Conference Paper
The paper presents an approach to automatically extracting translations of Web query terms through mining of Web anchor texts and link structures. One of the existing difficulties in cross-language information retrieval (CLIR) and Web search is the lack of appropriate translations of new terminology and proper names. This problem can be effectively alleviated by our proposed approach, and anchor texts on the Web prove to be a valuable corpus for this kind of term translation.
Article
The English Reading Wizard uses bilingual Web and local-dictionary data to help readers understand foreign languages by translating words and phrases. Methods include the expectation maximization algorithm and bilingual bootstrapping.