Sadao Kurohashi

Kyoto University, Kyoto, Kyōto, Japan

Publications (160); 6.89 Total impact

  • Mo Shen, Daisuke Kawahara, Sadao Kurohashi
    ABSTRACT: In pursuing machine understanding of human language, highly accurate syntactic analysis is a crucial step. In this work, we focus on dependency grammar, which models syntax by encoding transparent predicate-argument structures. Recent advances in dependency parsing have shown that employing higher-order subtree structures in graph-based parsers can substantially improve the parsing accuracy. However, the inefficiency of this approach increases with the order of the subtrees. This work explores a new reranking approach for dependency parsing that can utilize complex subtree representations by applying efficient subtree selection methods. We demonstrate the effectiveness of the approach in experiments conducted on the Penn Treebank and the Chinese Treebank. Our system achieves the best performance among known supervised systems evaluated on these datasets, improving the baseline accuracy from 91.88% to 93.42% for English, and from 87.39% to 89.25% for Chinese.
    IEEE/ACM Transactions on Audio, Speech, and Language Processing. 07/2014; 22(7):1208-1218.
  • Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); 06/2014
  • ABSTRACT: This article proposes a new distortion model for phrase-based statistical machine translation. In decoding, a distortion model estimates the source word position to be translated next (subsequent position; SP) given the last translated source word position (current position; CP). We propose a distortion model that can simultaneously consider the word at the CP, the word at an SP candidate, the context of the CP and an SP candidate, relative word order among the SP candidates, and the words between the CP and an SP candidate. These considered elements are called rich context. Our model considers rich context by discriminating label sequences that specify spans from the CP to each SP candidate. It enables our model to learn the effect of relative word order among SP candidates as well as to learn the effect of distances from the training data. In contrast to the learning strategy of existing methods, our learning strategy is that the model learns preference relations among SP candidates in each sentence of the training data. This learning strategy enables consideration of all of the rich context simultaneously. In our experiments, our model had higher BLEU and RIBES scores for Japanese-English, Chinese-English, and German-English translation compared to the lexical reordering models.
    ACM Transactions on Asian Language Information Processing 02/2014; 13(1):2:1-2:21.
  • Journal of Natural Language Processing. 01/2014; 21(2):213-247.
  • ABSTRACT: The Chinese and Japanese languages share Chinese characters. Since the Chinese characters in Japanese originated from ancient China, many common Chinese characters exist between these two languages. Since Chinese characters contain significant semantic information and common Chinese characters share the same meaning in the two languages, they can be quite useful in Chinese-Japanese machine translation (MT). We therefore propose a method for creating a Chinese character mapping table for Japanese, traditional Chinese, and simplified Chinese, with the aim of constructing a complete resource of common Chinese characters. Furthermore, we point out two main problems in Chinese word segmentation for Chinese-Japanese MT, namely, unknown words and word segmentation granularity, and propose an approach exploiting common Chinese characters to solve these problems. We also propose a statistical method for detecting semantically equivalent Chinese characters other than the common ones and a method for exploiting shared Chinese characters in phrase alignment. Results of the experiments carried out on a state-of-the-art phrase-based statistical MT system and an example-based MT system show that our proposed approaches can improve MT performance significantly, thereby verifying the effectiveness of shared Chinese characters for Chinese-Japanese MT.
    ACM Transactions on Asian Language Information Processing 10/2013; 12(4).
  • ACL2013; 08/2013
  • Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
    Proceedings of the Sixth Workshop on Building and Using Comparable Corpora; 08/2013
  • Tomohide Shibata, Sadao Kurohashi
    ABSTRACT: This article proposes a predicate-argument structure based Textual Entailment Recognition system exploiting wide-coverage lexical knowledge. Different from conventional machine learning approaches where several features obtained from linguistic analysis and resources are utilized, our proposed method regards a predicate-argument structure as a basic unit, and performs the matching/alignment between a text and hypothesis. In matching between predicate-arguments, wide-coverage relations between words/phrases such as synonym and is-a are utilized, which are automatically acquired from a dictionary, Web corpus, and Wikipedia.
    ACM Transactions on Asian Language Information Processing 12/2012; 11(4).
  • Denny Cahyadi, Fabien Cromieres, Sadao Kurohashi
    Proceedings of the 10th Workshop on Asian Language Resources; 12/2012
  • Jun Harashima, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • Jun Harashima, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • Toshiaki Nakazawa, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • Yugo Murawaki, Sadao Kurohashi
    Proceedings of COLING 2012; 12/2012
  • Yugo Murawaki, Sadao Kurohashi
    ABSTRACT: A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.
    Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL; 01/2011
  • Fabien Cromierès, Sadao Kurohashi
    Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL; 01/2011
  • ABSTRACT: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
    The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA; 01/2011
  • ABSTRACT: In this paper, we present a series of enhancements to the English-Japanese version of a multilingual Linguistics Based Machine Translation system, for the translation of complex sentences, modality and complex verbal structures. The system is using a classical transfer-based architecture and dedicated lexical databases. Relying on linguistic data acquired on large corpora or compiled from the web, corrections have been done in constituent re-ordering, lexical selection and verb conjugation. Even if the system is not, so far, as efficient as state-of-the-art English-Japanese MT systems, the results show a clear progress and underline the interest of using syntactic information in MT.
  • ABSTRACT: In this paper, we present a Web spam detection algorithm that relies on link analysis. The method consists of three steps: (1) decomposition of web graphs in densely connected sub graphs and calculation of the features for each sub graph, (2) use of SVM classifiers to identify sub graphs composed of Web spam, and (3) propagation of predictions over web graphs by a biased Page Rank algorithm to expand the scope of identification. We performed experiments on a public benchmark. An empirical study of the core structure of web graphs suggests that highly ranked non-spam hosts can be identified by viewing the coreness of the web graph elements.
    Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011, Campus Scientifique de la Doua, Lyon, France, August 22-27, 2011; 01/2011
  • ABSTRACT: In this paper, we investigate the effectiveness of the system design of a Web information analysis for open-domain decision support. In order to make decisions, it is required to collect and compare information from various viewpoints. In case of making decisions based on Web information, however, it is difficult to obtain diverse information from a variety of sources by using current search engines. Based on this observation, we design a system for supporting open-domain decision making, which analyzes Web information. Among the major design decisions are to focus on two elements, i.e. identifying the source of information and the extraction of informative content, and to organize the two elements so that the user can quickly grasp who is saying what on the Web. The assumption behind such decisions is that information organized in such a way would facilitate proper judgments in the user's decision making process. We conduct a user evaluation to verify the effectiveness of our approach. The results confirm that our system is superior to current search engines for grasping organized information from the different stances of senders and supports the process of decision making, by (i) uncovering biases, (ii) showing various opinions from multiple viewpoints, (iii) revealing information sources.
  • ABSTRACT: A vast amount of information and knowledge has been accumulated and circulated on the Web. They provide people with options regarding their daily lives and are starting to have a strong influence on governmental policies and business management. A crucial problem is that information on the Web is not necessarily credible. This paper describes an information analysis system called WISDOM, which assists users in assessing the credibility of information on the Web. WISDOM organizes information on a given topic through the following three types of analyses: (1) extracting and contrasting opinions and important statements around the points related to the topic, (2) identifying and classifying the information sender of each page; and (3) analyzing the appearance of each page, for example, page design and writing style. Our preliminary evaluation indicates the effectiveness of WISDOM and its advantage over Google from the viewpoint of the ability to grasp the differences among information senders and opinions.
    2010 4th International Universal Communication Symposium (IUCS); 11/2010

Publication Stats

1k Citations
6.89 Total Impact Points


  • 1992–2013
    • Kyoto University
      • Graduate School of Informatics
      Kyoto, Kyōto, Japan
  • 2011
    • University of Geneva
      Genève, Geneva, Switzerland
  • 2008–2011
    • National Institute of Information and Communications Technology
      Edo, Tōkyō, Japan
    • Yamagata University
      • Department of Informatics
      Yamagata, Yamagata, Japan
  • 2007
    • University of Tsukuba
      Tsukuba, Ibaraki, Japan
    • University of Liverpool
      Liverpool, England, United Kingdom
  • 2001–2006
    • The University of Tokyo
      • Graduate School of Information Science and Technology
      Tokyo, Tokyo-to, Japan