ABSTRACT: To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation and performed the Patent Translation Task at the Seventh NTCIR Workshop. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2,000,000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and evaluation results for the research groups that participated in our task. Our research is the first significant exploration into utilizing patent information for the evaluation of machine translation.
ABSTRACT: In Mongolian, two different alphabets are used, Cyrillic and Mongolian. In this paper, we focus solely on the Mongolian language using the Cyrillic alphabet, in which a content word can be inflected when concatenated with one or more suffixes. Identifying the original form of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Mongolian. The advantage of our lemmatization method is that it does not rely on noun dictionaries, enabling us to lemmatize out-of-dictionary words. We also apply our method to indexing for information retrieval. We use newspaper articles and technical abstracts in experiments that show the effectiveness of our method. Our research is the first significant exploration of the effectiveness of lemmatization for information retrieval in Mongolian.
Information Processing & Management 07/2009; 45:438-451. · 1.07 Impact Factor
ABSTRACT: We organized a machine translation (MT) task at the Seventh NTCIR Workshop. Participating groups were requested to machine translate sentences in patent documents and also search topics for retrieving patent documents across languages. We analyzed the relationship between the accuracy of MT and its effects on the retrieval accuracy.
Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009; 01/2009
ABSTRACT: We have produced a test collection for machine translation (MT). Our test collection includes approximately 2,000,000 sentence pairs in Japanese and English, which were extracted from patent documents and can be used to train and evaluate MT systems. Our test collection also includes search topics for cross-lingual information retrieval, to evaluate the contribution of MT to retrieving patent documents across languages. We performed a task for MT at the NTCIR workshop and used our test collection to evaluate participating groups. This paper describes scientific knowledge obtained through our task.
ABSTRACT: Reflecting the rapid growth of information technology, the configuration of software applications, such as word processors and spreadsheets, is both sophisticated and complicated. It is often difficult for users to identify the relevant functions in an online manual of a target application. In this paper, we propose a question answering system that finds functions related to a user's request. To enhance our system, we addressed two "mismatch" problems. The first problem is a mismatch in vocabulary, in which the same concept is represented by different words in the manual and in the user's question. The second problem is a mismatch in function. Although a user may have a hypothetical function for their purpose in mind, the purpose can sometimes be accomplished by other functions. To resolve these mismatch problems, we use the World Wide Web to extract related terms for software functions, so that a user's question can be matched to the relevant function with high accuracy. We show the effectiveness of our system experimentally.
Advanced Information Networking and Applications - Workshops, 2008. AINAW 2008. 22nd International Conference on; 04/2008
ABSTRACT: To support research and development on machine translation, we produced a test collection for Japanese-English machine translation at the Seventh NTCIR Workshop. This paper describes details of our test collection. From patent documents published in Japan and the United States, we extracted patent families as a parallel corpus. A patent family is a set of patent documents for the same or related invention, and these documents are usually filed in more than one country in different languages. In the parallel corpus, we aligned Japanese sentences with their counterpart English sentences. Our test collection, which includes approximately 2,000,000 sentence pairs, can be used to train and test machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, so the contribution of machine translation to a patent retrieval task can also be evaluated. Our test collection will be available to the public for research purposes after the NTCIR final meeting.
Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco; 01/2008
ABSTRACT: This paper introduces the Patent Mining Task of the Seventh NTCIR Workshop and the test collections produced in this task. The task's goal was the classification of research papers written in either Japanese or English in terms of the International Patent Classification (IPC) system, which is a global standard. For this task, 12 participant groups submitted 49 runs. In this paper, we also report the evaluation results of the task.
Proceedings of the 1st ACM workshop on Patent Information Retrieval, PaIR 2008, Napa Valley, California, USA, October 30, 2008; 01/2008
ABSTRACT: Although the World Wide Web has of late become an important source to consult for the meaning of words, a number of technical terms related to high technology are not found on the Web. This paper describes a method to produce an encyclopedic dictionary for high-tech terms from patent information. We used a collection of unexamined patent applications published by the Japanese Patent Office as a source corpus. Given this collection, we extracted terms as headword candidates and retrieved applications including those headwords. Then, we extracted paragraph-style descriptions and categorized them into technical domains. We also extracted related terms for each headword. We have produced a dictionary including approximately 400,000 Japanese terms as headwords. We have also implemented an interface with which users can explore our dictionary by reading text descriptions and viewing a related-term graph.
Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco; 01/2008
ABSTRACT: To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2,000,000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and preliminary experiments.
ABSTRACT: Several types of queries are widely used on the World Wide Web and the expected retrieval method can vary depending on the query type. We propose a method for classifying queries into informational and navigational types. Because terms in navigational queries often appear in anchor text for links to other pages, we analyze the distribution of query terms in anchor texts on the Web for query classification purposes. While content-based retrieval is effective for informational queries, anchor-based retrieval is effective for navigational queries. Our retrieval system combines the results obtained with the content-based and anchor-based retrieval methods, in which the weight for each retrieval result is determined automatically depending on the result of the query classification. We also propose a method for improving anchor-based retrieval. Our retrieval method, which computes the probability that a document is retrieved in response to the given query, identifies synonyms of query terms in the anchor texts on the Web and uses these synonyms for smoothing purposes in the probability estimation. We use the NTCIR test collections and show the effectiveness of individual methods and the entire Web retrieval system experimentally.
Proceedings of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21-25, 2008; 01/2008
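The weighted combination described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not give the classifier or the weighting formula, so the function names (`navigational_weight`, `combine`, `rank`), the frequency-ratio estimate, and the linear interpolation below are all assumptions chosen for illustration.

```python
from typing import Dict, List


def navigational_weight(terms: List[str],
                        anchor_freq: Dict[str, int],
                        body_freq: Dict[str, int]) -> float:
    """Toy estimate (an assumption): the share of query-term occurrences
    that fall in anchor text rather than in page bodies."""
    anchor = sum(anchor_freq.get(t, 0) for t in terms)
    body = sum(body_freq.get(t, 0) for t in terms)
    total = anchor + body
    # With no evidence either way, weight both retrieval methods equally.
    return anchor / total if total else 0.5


def combine(content_scores: Dict[str, float],
            anchor_scores: Dict[str, float],
            w: float) -> Dict[str, float]:
    """Linear interpolation: w weights the anchor-based score and
    (1 - w) weights the content-based score."""
    docs = set(content_scores) | set(anchor_scores)
    return {d: (1 - w) * content_scores.get(d, 0.0)
               + w * anchor_scores.get(d, 0.0)
            for d in docs}


def rank(scores: Dict[str, float]) -> List[str]:
    """Document ids by descending score, ties broken by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical counts: the query term appears mostly in anchor text,
# so the anchor-based results dominate the combined ranking.
w = navigational_weight(["ntcir"], {"ntcir": 30}, {"ntcir": 10})
combined = combine({"a": 0.8, "b": 0.2}, {"b": 0.9, "c": 0.4}, w)
ranking = rank(combined)
```

A query whose terms rarely occur in anchor text would get a small `w`, so the same code would fall back toward the content-based ranking.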
ABSTRACT: In Modern Mongolian, a content word can be inflected when concatenated with suffixes. Identifying the original forms of content words is crucial for natural language processing and information retrieval. We propose a lemmatization method for Modern Mongolian and apply our method to indexing for information retrieval. We use technical abstracts to show the effectiveness of our method experimentally.
ABSTRACT: The processing of intellectual property documents, such as patents, has been important to the industry, business, and law communities. Recently, the importance of patent processing has also been recognized in academic research communities, particularly by information retrieval and natural language processing researchers. In addition, large test collections that include patents have recently become available, to enable the systematic evaluation of methodologies from a scientific point of view. In the light of these activities, this special issue is intended to collect advanced research papers on patent processing. As an introduction to the special issue on patent processing, this paper surveys the relevant literature and outlines the papers selected for the special issue.
Information Processing & Management 09/2007; · 1.07 Impact Factor
ABSTRACT: This paper proposes a method to combine text-based and citation-based retrieval methods in invalidity patent search. Using the NTCIR-6 test collection including eight years of USPTO patents, we show the effectiveness of our method experimentally.
SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007; 01/2007
ABSTRACT: In the Sixth NTCIR Workshop, we organized the Patent Retrieval Task and performed three subtasks: Japanese Retrieval, English Retrieval, and Classification. This paper describes the Japanese Retrieval Subtask and English Retrieval Subtask, both of which were intended for patent-to-patent invalidity search. We report the evaluation results of the groups participating in those subtasks.
ABSTRACT: In this paper, we propose a novel approach for Cross-Lingual Question Answering (CLQA). In the proposed method, statistical machine translation (SMT) is deeply incorporated into the question answering process, instead of being used as pre-processing for the monolingual QA process as in previous work. The proposed method can be considered as exploiting SMT-based passage retrieval for the CLQA task. We applied our method to an English-to-Japanese CLQA system and evaluated the performance by using the NTCIR CLQA 1 and 2 test collections. The results showed that the proposed method outperformed the previous pre-translation approach.
ABSTRACT: To transliterate foreign words, in Japanese and Korean, phonograms, such as Katakana and Hangul, are used. In Chinese, the pronunciation of a source word is spelled out using Kanji characters. Because Kanji is an ideographic representation, different Kanji characters are associated with the same pronunciation, but can potentially convey different meanings and impressions. To select appropriate Kanji characters, an existing method requests the user to provide one or more related terms for a source word, which is time-consuming and expensive. In this paper, to reduce this human effort, we use the World Wide Web to extract related terms for source words. We show the effectiveness of our method experimentally.
ABSTRACT: We propose a cross-media lecture-on-demand system, called lodem, which searches a lecture video for specific segments in response to a text query. We utilize the benefits of text, audio, and video data corresponding to a single lecture. lodem extracts the audio track from a target lecture video, generates a transcription by large-vocabulary continuous speech recognition, and produces a text index. A user can formulate text queries using the textbook related to the target lecture and can selectively view specific video segments by submitting those queries. Experimental results showed that by adapting speech recognition to the lecturer and the topic of the target lecture, the recognition accuracy was increased and consequently the retrieval accuracy was comparable with that obtained by human transcription. lodem is implemented as a client–server system on the Web to facilitate e-learning.
Speech Communication 05/2006; · 1.55 Impact Factor
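The indexing and retrieval step of a system like the one in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not lodem's implementation: the transcript is assumed to be already produced and split into timestamped segments, and the `Segment` type, `build_index`, `search`, and the term-overlap ranking are hypothetical names and choices.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Segment:
    start: float  # seconds from the beginning of the lecture video
    end: float
    text: str     # transcript text for this time span (assumed given)


def build_index(segments: List[Segment]) -> Dict[str, Set[int]]:
    """Inverted index from lowercased token to segment positions."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for i, seg in enumerate(segments):
        for tok in seg.text.lower().split():
            index[tok].add(i)
    return index


def search(query: str, segments: List[Segment],
           index: Dict[str, Set[int]]) -> List[Segment]:
    """Rank segments by the number of distinct query terms they contain;
    ties go to the earlier segment."""
    hits: Dict[int, int] = defaultdict(int)
    for tok in set(query.lower().split()):
        for i in index.get(tok, ()):
            hits[i] += 1
    ranked = sorted(hits, key=lambda i: (-hits[i], segments[i].start))
    return [segments[i] for i in ranked]


# Hypothetical transcript segments for one lecture.
segments = [
    Segment(0, 30, "today we introduce hidden markov models"),
    Segment(30, 60, "the viterbi algorithm decodes markov models"),
    Segment(60, 90, "questions and homework"),
]
index = build_index(segments)
results = search("viterbi markov", segments, index)
```

Each returned `Segment` carries the `start` and `end` times a player would need to jump to the matching part of the video.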