ABSTRACT: To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation and performed the Patent Translation Task at the Seventh NTCIR Workshop. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2 000 000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and evaluation results for the research groups that participated in our task. Our research is the first significant exploration into utilizing patent information for the evaluation of machine translations.
ABSTRACT: We organized a machine translation (MT) task at the Seventh NTCIR Workshop. Participating groups were requested to machine translate sentences in patent documents and also search topics for retrieving patent documents across languages. We analyzed the relationship between the accuracy of MT and its effects on the retrieval accuracy.
Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009; 01/2009
ABSTRACT: We have produced a test collection for machine translation (MT). Our test collection includes approximately 2000000 sentence pairs in Japanese and English, which were extracted from patent documents and can be used to train and evaluate MT systems. Our test collection also includes search topics for cross-lingual information retrieval, to evaluate the contribution of MT to retrieving patent documents across languages. We performed a task for MT at the NTCIR workshop and used our test collection to evaluate participating groups. This paper describes scientific knowledge obtained through our task.
ABSTRACT: In Mongolian, two different alphabets are used, Cyrillic and Mongolian. In this paper, we focus solely on the Mongolian language using the Cyrillic alphabet, in which a content word can be inflected when concatenated with one or more suffixes. Identifying the original form of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Mongolian. The advantage of our lemmatization method is that it does not rely on noun dictionaries, enabling us to lemmatize out-of-dictionary words. We also apply our method to indexing for information retrieval. We use newspaper articles and technical abstracts in experiments that show the effectiveness of our method. Our research is the first significant exploration of the effectiveness of lemmatization for information retrieval in Mongolian.
ABSTRACT: This paper introduces the Patent Mining Task of the Seventh NTCIR Workshop and the test collections produced in this task. The task's goal was the classification of research papers written in either Japanese or English in terms of the International Patent Classification (IPC) system, which is a global standard. For this task, 12 participant groups submitted 49 runs. In this paper, we also report the evaluation results of the task.
Proceedings of the 1st ACM workshop on Patent Information Retrieval, PaIR 2008, Napa Valley, California, USA, October 30, 2008; 01/2008
ABSTRACT: Although the World Wide Web has of late become an important source to consult for the meaning of words, a number of technical terms related to high technology are not found on the Web. This paper describes a method to produce an encyclopedic dictionary for high-tech terms from patent information. We used a collection of unexamined patent applications published by the Japanese Patent Office as a source corpus. Given this collection, we extracted terms as headword candidates and retrieved applications including those headwords. Then, we extracted paragraph-style descriptions and categorized them into technical domains. We also extracted related terms for each headword. We have produced a dictionary including approximately 400000 Japanese terms as headwords. We have also implemented an interface with which users can explore our dictionary by reading text descriptions and viewing a related-term graph.
Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco; 01/2008
ABSTRACT: To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2000000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and preliminary experiments.
ABSTRACT: In Modern Mongolian, a content word can be inflected when concatenated with suffixes. Identifying the original forms of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Modern Mongolian and apply our method to indexing for information retrieval. We use technical abstracts to show the effectiveness of our method experimentally.
ABSTRACT: Several types of queries are widely used on the World Wide Web and the expected retrieval method can vary depending on the query type. We propose a method for classifying queries into informational and navigational types. Because terms in navigational queries often appear in anchor text for links to other pages, we analyze the distribution of query terms in anchor texts on the Web for query classification purposes. While content-based retrieval is effective for informational queries, anchor-based retrieval is effective for navigational queries. Our retrieval system combines the results obtained with the content-based and anchor-based retrieval methods, in which the weight for each retrieval result is determined automatically depending on the result of the query classification. We also propose a method for improving anchor-based retrieval. Our retrieval method, which computes the probability that a document is retrieved in response to the given query, identifies synonyms of query terms in the anchor texts on the Web and uses these synonyms for smoothing purposes in the probability estimation. We use the NTCIR test collections and show the effectiveness of individual methods and the entire Web retrieval system experimentally.
Proceedings of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21-25, 2008; 01/2008
ABSTRACT: To support research and development on machine translation, we produced a test collection for Japanese-English machine translation in the Seventh NTCIR Workshop. This paper describes details of our test collection. From patent documents published in Japan and the United States, we extracted patent families as a parallel corpus. A patent family is a set of patent documents for the same or related invention, and these documents are usually filed in more than one country in different languages. In the parallel corpus, we aligned Japanese sentences with their counterpart English sentences. Our test collection, which includes approximately 2000000 sentence pairs, can be used to train and test machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, so the contribution of machine translation to a patent retrieval task can also be evaluated. Our test collection will be available to the public for research purposes after the NTCIR final meeting.
Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco; 01/2008
ABSTRACT: The processing of intellectual property documents, such as patents, has been important to the industry, business, and law communities. Recently, the importance of patent processing has also been recognized in academic research communities, particularly by information retrieval and natural language processing researchers. In addition, large test collections that include patents have recently become available, to enable the systematic evaluation of methodologies from a scientific point of view. In the light of these activities, this special issue is intended to collect advanced research papers on patent processing. As an introduction to the special issue on patent processing, this paper surveys the relevant literature and outlines the papers selected for the special issue.
ABSTRACT: In the Sixth NTCIR Workshop, we organized the Patent Retrieval Task and performed three subtasks: Japanese Retrieval, English Retrieval, and Classification. This paper describes the Japanese Retrieval Subtask and English Retrieval Subtask, both of which were intended for the patent-to-patent invalidity search task. We report the evaluation results of the groups participating in those subtasks.
ABSTRACT: To transliterate foreign words, in Japanese and Korean, phonograms, such as Katakana and Hangul, are used. In Chinese, the pronunciation of a source word is spelled out using Kanji characters. Because Kanji is an ideogrammatic representation, different Kanji characters are associated with the same pronunciation, but can potentially convey different meanings and impressions. To select appropriate Kanji characters, an existing method requests the user to provide one or more related terms for a source word, which is time-consuming and expensive. In this paper, to reduce this human effort, we use the World Wide Web to extract related terms for source words. We show the effectiveness of our method experimentally.
ABSTRACT: In this paper, we propose a novel approach for Cross-Lingual Question Answering (CLQA). In the proposed method, statistical machine translation (SMT) is deeply incorporated into the question answering process, instead of being used as a pre-processing step for the mono-lingual QA process as in previous work. The proposed method can be considered as exploiting SMT-based passage retrieval for the CLQA task. We applied our method to an English-to-Japanese CLQA system and evaluated the performance by using the NTCIR CLQA 1 and 2 test collections. The results showed that the proposed method outperformed the previous pre-translation approach.
ABSTRACT: This paper proposes a method to combine text-based and citation-based retrieval methods in the invalidity patent search. Using the NTCIR-6 test collection including eight years of USPTO patents, we show the effectiveness of our method experimentally.
SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007; 01/2007
ABSTRACT: This paper proposes methods for extracting loanwords from Cyrillic Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary. We extract loanwords from Mongolian corpora using our own handcrafted rules. To complement the rule-based extraction, we also extract words in Mongolian corpora that are phonetically similar to Japanese Katakana words as loanwords. In addition, we map the extracted loanwords to Japanese words and produce a bilingual dictionary. We propose a stemming method for Mongolian to extract loanwords correctly. We verify the effectiveness of our methods experimentally.
ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006; 01/2006
ABSTRACT: For transliterating foreign words into Chinese, the pronunciation of a source word is spelled out with Kanji characters. Because Kanji comprises ideograms, an individual pronunciation may be represented by more than one character. However, because different Kanji characters convey different meanings and impressions, characters must be selected carefully. In this paper, we propose a transliteration method that models both pronunciation and impression, whereas existing methods do not model impression. Given a source word and impression keywords related to the source word, our method derives possible transliteration candidates and sorts them according to their probability. We evaluate our method experimentally.
EMNLP 2006, Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 22-23 July 2006, Sydney, Australia; 01/2006
ABSTRACT: On the World Wide Web, the volume of subjective information, such as opinions and reviews, has been increasing rapidly. The trends and rules latent in a large set of subjective descriptions can potentially be useful for decision-making purposes. In this paper, we propose a method for summarizing subjective descriptions, specifically opinions in Japanese. We visualize the pro and con arguments for a target topic, such as "Should Japan introduce the summertime system?" Users can summarize the arguments about the topic in order to choose a more reasonable standpoint for decision making. We evaluate our system, called "OpinionReader", experimentally.