Sadao Kurohashi

Kyoto University, Kyoto, Kyōto, Japan

Publications (162) · 6.89 Total impact

  • Mo Shen, Daisuke Kawahara, Sadao Kurohashi
    ABSTRACT: In pursuing machine understanding of human language, highly accurate syntactic analysis is a crucial step. In this work, we focus on dependency grammar, which models syntax by encoding transparent predicate-argument structures. Recent advances in dependency parsing have shown that employing higher-order subtree structures in graph-based parsers can substantially improve the parsing accuracy. However, the inefficiency of this approach increases with the order of the subtrees. This work explores a new reranking approach for dependency parsing that can utilize complex subtree representations by applying efficient subtree selection methods. We demonstrate the effectiveness of the approach in experiments conducted on the Penn Treebank and the Chinese Treebank. Our system achieves the best performance among known supervised systems evaluated on these datasets, improving the baseline accuracy from 91.88% to 93.42% for English, and from 87.39% to 89.25% for Chinese.
    07/2014; 22(7):1208-1218. DOI:10.1109/TASLP.2014.2327295
  • Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 06/2014
  • ABSTRACT: This article proposes a new distortion model for phrase-based statistical machine translation. In decoding, a distortion model estimates the source word position to be translated next (subsequent position; SP) given the last translated source word position (current position; CP). We propose a distortion model that can simultaneously consider the word at the CP, the word at an SP candidate, the context of the CP and an SP candidate, relative word order among the SP candidates, and the words between the CP and an SP candidate. These considered elements are called rich context. Our model considers rich context by discriminating label sequences that specify spans from the CP to each SP candidate. It enables our model to learn the effect of relative word order among SP candidates as well as to learn the effect of distances from the training data. In contrast to the learning strategy of existing methods, our learning strategy is that the model learns preference relations among SP candidates in each sentence of the training data. This learning strategy enables consideration of all of the rich context simultaneously. In our experiments, our model had higher BLEU and RIBES scores for Japanese-English, Chinese-English, and German-English translation compared to the lexical reordering models.
    ACM Transactions on Asian Language Information Processing 02/2014; 13(1):2:1--2:21. DOI:10.1145/2537128
  • Gongye Jin, Daisuke Kawahara, Sadao Kurohashi
    ABSTRACT: Many knowledge acquisition tasks are tightly dependent on fundamental analysis technologies, such as part of speech (POS) tagging and parsing. Dependency parsing, in particular, has been widely employed for the acquisition of knowledge related to predicate-argument structures. For such tasks, the dependency parsing performance can determine quality of acquired knowledge, regardless of target languages. Therefore, reducing dependency parsing errors and selecting high quality dependencies is of primary importance. In this study, we present a language-independent approach for automatically selecting high quality dependencies from automatic parses. By considering several aspects that affect the accuracy of dependency parsing, we created a set of features for supervised classification of reliable dependencies. Experimental results on seven languages show that our approach can effectively select high quality dependencies from dependency parses.
    01/2014; 21(6):1163-1182. DOI:10.5715/jnlp.21.1163
  • 01/2014; 21(2):213-247. DOI:10.5715/jnlp.21.213
  • Ryohei Sasano, Sadao Kurohashi, Manabu Okumura
    ABSTRACT: This paper presents a simple but effective approach to unknown word processing in Japanese morphological analysis, which handles 1) unknown words that are derived from words in a pre-defined lexicon and 2) unknown onomatopoeias. Our approach leverages derivation rules and onomatopoeia patterns, and correctly recognizes certain types of unknown words. Experiments revealed that our approach recognized about 4,500 unknown words in 100,000 Web sentences with only roughly 80 harmful side effects and a 6% loss in speed.
    01/2014; 21(6):1183-1205. DOI:10.5715/jnlp.21.1183
  • ABSTRACT: The Chinese and Japanese languages share Chinese characters. Since the Chinese characters in Japanese originated from ancient China, many common Chinese characters exist between these two languages. Since Chinese characters contain significant semantic information and common Chinese characters share the same meaning in the two languages, they can be quite useful in Chinese-Japanese machine translation (MT). We therefore propose a method for creating a Chinese character mapping table for Japanese, traditional Chinese, and simplified Chinese, with the aim of constructing a complete resource of common Chinese characters. Furthermore, we point out two main problems in Chinese word segmentation for Chinese-Japanese MT, namely, unknown words and word segmentation granularity, and propose an approach exploiting common Chinese characters to solve these problems. We also propose a statistical method for detecting semantically equivalent Chinese characters other than the common ones and a method for exploiting shared Chinese characters in phrase alignment. Results of the experiments carried out on a state-of-the-art phrase-based statistical MT system and an example-based MT system show that our proposed approaches can improve MT performance significantly, thereby verifying the effectiveness of shared Chinese characters for Chinese-Japanese MT.
    ACM Transactions on Asian Language Information Processing 10/2013; 12(4). DOI:10.1145/2523057.2523059
  • ACL2013; 08/2013
  • Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
    Proceedings of the Sixth Workshop on Building and Using Comparable Corpora; 08/2013
  • Jun Harashima, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • Tomohide Shibata, Sadao Kurohashi
    ABSTRACT: This article proposes a predicate-argument structure based Textual Entailment Recognition system exploiting wide-coverage lexical knowledge. Different from conventional machine learning approaches where several features obtained from linguistic analysis and resources are utilized, our proposed method regards a predicate-argument structure as a basic unit, and performs the matching/alignment between a text and hypothesis. In matching between predicate-arguments, wide-coverage relations between words/phrases such as synonym and is-a are utilized, which are automatically acquired from a dictionary, Web corpus, and Wikipedia.
    ACM Transactions on Asian Language Information Processing 12/2012; 11(4). DOI:10.1145/2382593.2382598
  • Yugo Murawaki, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • Denny Cahyadi, Fabien Cromieres, Sadao Kurohashi
    Proceedings of the 10th Workshop on Asian Language Resources; 12/2012
  • Toshiaki Nakazawa, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • Jun Harashima, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • ABSTRACT: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
    The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA; 01/2011
  • Fabien Cromierès, Sadao Kurohashi
    Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL; 01/2011
  • ABSTRACT: In this paper, we present a Web spam detection algorithm that relies on link analysis. The method consists of three steps: (1) decomposition of web graphs into densely connected subgraphs and calculation of the features for each subgraph, (2) use of SVM classifiers to identify subgraphs composed of Web spam, and (3) propagation of predictions over web graphs by a biased PageRank algorithm to expand the scope of identification. We performed experiments on a public benchmark. An empirical study of the core structure of web graphs suggests that highly ranked non-spam hosts can be identified by viewing the coreness of the web graph elements.
    Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011, Campus Scientifique de la Doua, Lyon, France, August 22-27, 2011; 01/2011
  • ABSTRACT: In this paper, we investigate the effectiveness of the system design of a Web information analysis for open-domain decision support. In order to make decisions, it is required to collect and compare information from various viewpoints. When making decisions based on Web information, however, it is difficult to obtain diverse information from a variety of sources by using current search engines. Based on this observation, we design a system for supporting open-domain decision making, which analyzes Web information. Among the major design decisions are to focus on two elements, i.e. identifying the source of information and the extraction of informative content, and to organize the two elements so that the user can quickly grasp who is saying what on the Web. The assumption behind such decisions is that information organized in such a way would facilitate proper judgments in the user's decision making process. We conduct a user evaluation to verify the effectiveness of our approach. The results confirm that our system is superior to current search engines for grasping organized information from senders with different stances, and that it supports the process of decision making by (i) uncovering biases, (ii) showing various opinions from multiple viewpoints, and (iii) revealing information sources.
  • ABSTRACT: In this paper, we present a series of enhancements to the English-Japanese version of a multilingual Linguistics Based Machine Translation system, for the translation of complex sentences, modality and complex verbal structures. The system uses a classical transfer-based architecture and dedicated lexical databases. Relying on linguistic data acquired on large corpora or compiled from the web, corrections have been made to constituent reordering, lexical selection and verb conjugation. Even if the system is not, so far, as efficient as state-of-the-art English-Japanese MT systems, the results show clear progress and underline the interest of using syntactic information in MT.

Publication Stats

1k Citations
6.89 Total Impact Points

Institutions

  • 1992–2013
    • Kyoto University
      • Graduate School of Informatics
      Kyoto, Kyōto, Japan
  • 2011
    • University of Geneva
      Genève, Geneva, Switzerland
  • 2008–2011
    • National Institute of Information and Communications Technology
      • Information Analysis Laboratory
      Edo, Tōkyō, Japan
    • Yamagata University
      • Department of Informatics
      Yamagata, Yamagata, Japan
  • 2001–2006
    • The University of Tokyo
      • Graduate School of Information Science and Technology
      Tokyo, Tokyo-to, Japan
  • 2003
    • Bunkyo University
      Edo, Tōkyō, Japan