Sadao Kurohashi

Kyoto University, Kyoto, Japan

Publications (175)

  • Chenhui Chu · Toshiaki Nakazawa · Sadao Kurohashi

    No preview · Article · Dec 2015
  • Isao Goto · Masao Utiyama · Eiichiro Sumita · Sadao Kurohashi

    Preview · Article · Jun 2015
  • Chenhui Chu · Toshiaki Nakazawa · Sadao Kurohashi
    ABSTRACT: Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract parallel sentences from them for SMT. Parallel sentence extraction relies highly on bilingual lexicons that are also very scarce. We propose an unsupervised bilingual lexicon extraction based parallel sentence extraction system that first extracts bilingual lexicons from comparable corpora and then extracts parallel sentences using the lexicons. Our bilingual lexicon extraction method is based on a combination of topic model and context based methods in an iterative process. The proposed method does not rely on any prior knowledge, and the performance can be improved iteratively. The parallel sentence extraction method uses a binary classifier for parallel sentence identification. The extracted bilingual lexicons are used for the classifier to improve the performance of parallel sentence extraction. Experiments conducted with the Wikipedia data indicate that the proposed bilingual lexicon extraction method greatly outperforms existing methods, and the extracted bilingual lexicons significantly improve the performance of parallel sentence extraction for SMT.
    Preview · Article · Jan 2015
  • Mo Shen · Daisuke Kawahara · Sadao Kurohashi
    ABSTRACT: In pursuing machine understanding of human language, highly accurate syntactic analysis is a crucial step. In this work, we focus on dependency grammar, which models syntax by encoding transparent predicate-argument structures. Recent advances in dependency parsing have shown that employing higher-order subtree structures in graph-based parsers can substantially improve the parsing accuracy. However, the inefficiency of this approach increases with the order of the subtrees. This work explores a new reranking approach for dependency parsing that can utilize complex subtree representations by applying efficient subtree selection methods. We demonstrate the effectiveness of the approach in experiments conducted on the Penn Treebank and the Chinese Treebank. Our system achieves the best performance among known supervised systems evaluated on these datasets, improving the baseline accuracy from 91.88% to 93.42% for English, and from 87.39% to 89.25% for Chinese.
    No preview · Article · Jul 2014 · IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • Mo Shen · Hongxiao Liu · Daisuke Kawahara · Sadao Kurohashi
    ABSTRACT: The focus of recent studies on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words to characters. However, existing methods have not yet fully exploited the potential of Chinese characters. In this paper, we investigate the usefulness of character-level part-of-speech in the task of Chinese morphological analysis. We propose the first tagset designed for the task of character-level POS tagging, and a method that performs character-level POS tagging jointly with word segmentation and word-level POS tagging. Through experiments, we demonstrate that introducing character-level POS information significantly improves the performance of a baseline morphological analyzer.
    No preview · Conference Paper · Jun 2014
  •
    ABSTRACT: This article proposes a new distortion model for phrase-based statistical machine translation. In decoding, a distortion model estimates the source word position to be translated next (subsequent position; SP) given the last translated source word position (current position; CP). We propose a distortion model that can simultaneously consider the word at the CP, the word at an SP candidate, the context of the CP and an SP candidate, relative word order among the SP candidates, and the words between the CP and an SP candidate. These considered elements are called rich context. Our model considers rich context by discriminating label sequences that specify spans from the CP to each SP candidate. This enables our model to learn the effect of relative word order among SP candidates, as well as the effect of distances, from the training data. In contrast to existing methods, our learning strategy is that the model learns preference relations among SP candidates in each sentence of the training data, which enables consideration of all of the rich context simultaneously. In our experiments, our model achieved higher BLEU and RIBES scores for Japanese-English, Chinese-English, and German-English translation compared to the lexical reordering models.
    Preview · Article · Feb 2014 · ACM Transactions on Asian Language Information Processing
  • Gongye Jin · Daisuke Kawahara · Sadao Kurohashi
    ABSTRACT: Many knowledge acquisition tasks are tightly dependent on fundamental analysis technologies, such as part of speech (POS) tagging and parsing. Dependency parsing, in particular, has been widely employed for the acquisition of knowledge related to predicate-argument structures. For such tasks, the dependency parsing performance can determine quality of acquired knowledge, regardless of target languages. Therefore, reducing dependency parsing errors and selecting high quality dependencies is of primary importance. In this study, we present a language-independent approach for automatically selecting high quality dependencies from automatic parses. By considering several aspects that affect the accuracy of dependency parsing, we created a set of features for supervised classification of reliable dependencies. Experimental results on seven languages show that our approach can effectively select high quality dependencies from dependency parses.
    Preview · Article · Jan 2014
  • Masatsugu Hangyo · Daisuke Kawahara · Sadao Kurohashi
    ABSTRACT: In Japanese, zero references often occur, and many of them are categorized as zero exophora, in which the referent is not mentioned in the document. However, previous studies have focused only on zero endophora, in which the referent explicitly appears. We present a zero reference resolution model that considers zero exophora and the author/reader of a document. To deal with zero exophora, our model adds pseudo entities corresponding to zero exophora to the candidate referents of zero pronouns. In addition, we automatically detect mentions that refer to the author and reader of a document by using lexico-syntactic patterns, and represent their particular behavior in a discourse as a feature vector of a machine learning model. The experimental results demonstrate the effectiveness of our model for not only zero exophora but also zero endophora.
    No preview · Article · Jan 2014
  • Ryohei Sasano · Daisuke Kawahara · Sadao Kurohashi · Manabu Okumura
    ABSTRACT: We propose a method for automatically acquiring knowledge about case alternations between the passive/causative and active voices. Our method leverages large lexical case frames obtained from a large Web corpus, together with several alternation patterns. We then apply the acquired knowledge to a case alternation task and show its usefulness.
    Preview · Article · Jan 2014
  • Jun Harashima · Sadao Kurohashi
    ABSTRACT: Most relevance feedback methods re-rank search results using only the information of surface words in texts. We present a method that uses not only surface words but also latent words inferred from texts. We infer the latent word distribution in each document in the search results using latent Dirichlet allocation (LDA). When feedback is given, we also infer the latent word distribution in the feedback using LDA. We calculate the similarities between the user feedback and each document in the search results using both the surface and latent word distributions, and re-rank the search results on the basis of these similarities. Evaluation results show that when user feedback consisting of two documents (3,589 words) is given, the proposed method improves the initial search results by 27.6% in precision at 10 (P@10). The results also show that the proposed method performs well even when only a small amount of user feedback is available: for example, an improvement of 5.3% in P@10 was achieved when the user feedback constituted only 57 words.
    No preview · Article · Jan 2014
  • Masatsugu Hangyo · Daisuke Kawahara · Sadao Kurohashi

    No preview · Article · Jan 2014
  • Ryohei Sasano · Sadao Kurohashi · Manabu Okumura
    ABSTRACT: This paper presents a simple but effective approach to unknown word processing in Japanese morphological analysis, which handles 1) unknown words that are derived from words in a pre-defined lexicon and 2) unknown onomatopoeias. Our approach leverages derivation rules and onomatopoeia patterns, and correctly recognizes certain types of unknown words. Experiments revealed that our approach recognized about 4,500 unknown words in 100,000 Web sentences with only roughly 80 harmful side effects and a 6% loss in speed.
    Preview · Article · Jan 2014
  •
    ABSTRACT: The Chinese and Japanese languages share Chinese characters. Since the Chinese characters in Japanese originated from ancient China, many common Chinese characters exist between these two languages. Because Chinese characters carry significant semantic information and common Chinese characters share the same meaning in the two languages, they can be quite useful in Chinese-Japanese machine translation (MT). We therefore propose a method for creating a Chinese character mapping table for Japanese, traditional Chinese, and simplified Chinese, with the aim of constructing a complete resource of common Chinese characters. Furthermore, we point out two main problems in Chinese word segmentation for Chinese-Japanese MT, namely, unknown words and word segmentation granularity, and propose an approach exploiting common Chinese characters to solve these problems. We also propose a statistical method for detecting semantically equivalent Chinese characters other than the common ones, and a method for exploiting shared Chinese characters in phrase alignment. Results of experiments carried out on a state-of-the-art phrase-based statistical MT system and an example-based MT system show that our proposed approaches can improve MT performance significantly, thereby verifying the effectiveness of shared Chinese characters for Chinese-Japanese MT.
    No preview · Article · Oct 2013 · ACM Transactions on Asian Language Information Processing
  •
    ABSTRACT: This paper proposes new distortion models for phrase-based SMT. In decoding, a distortion model estimates the source word position to be translated next (NP) given the last translated source word position (CP). We propose a distortion model that can consider the word at the CP, a word at an NP candidate, and the context of the CP and the NP candidate simultaneously. Moreover, we propose a further improved model that considers richer context by discriminating label sequences that specify spans from the CP to NP candidates. It enables our model to learn the effect of relative word order among NP candidates as well as to learn the effect of distances from the training data. In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.
    No preview · Conference Paper · Aug 2013
  • Chenhui Chu · Toshiaki Nakazawa · Sadao Kurohashi

    No preview · Conference Paper · Aug 2013

  • No preview · Article · Jan 2013
  • Yugo Murawaki · Sadao Kurohashi
    ABSTRACT: In this paper, we propose the use of spans in addition to edges in noun compound analysis. A span is a sequence of words that can represent a noun compound. Compared with edges, spans have good properties in terms of semi-supervised parsing. They can be reliably extracted from a huge amount of unannotated text. In addition, while the combinations of edges such as sibling and grandparent interactions are, in general, difficult to handle in parsing, it is quite easy to utilize spans with arbitrary width. We show that spans can be incorporated straightforwardly into the standard chart-based parsing algorithm. We create a semi-supervised discriminative parser that combines edge and span features. Experiments show that span features improve accuracy and that further gain is obtained when they are combined with edge features.
    No preview · Conference Paper · Dec 2012
  • Tomohide Shibata · Sadao Kurohashi
    ABSTRACT: This article proposes a predicate-argument structure based textual entailment recognition system that exploits wide-coverage lexical knowledge. Unlike conventional machine learning approaches that utilize features obtained from linguistic analysis and resources, our method regards a predicate-argument structure as the basic unit and performs matching/alignment between a text and a hypothesis. In matching predicate-argument structures, wide-coverage relations between words/phrases, such as synonymy and is-a, are utilized; these relations are automatically acquired from a dictionary, a Web corpus, and Wikipedia.
    No preview · Article · Dec 2012 · ACM Transactions on Asian Language Information Processing
  • Jun Harashima · Sadao Kurohashi

    No preview · Conference Paper · Dec 2012
  • Jun Harashima · Sadao Kurohashi
    ABSTRACT: Sentence compression is important in a wide range of applications in natural language processing. Previous approaches to Japanese sentence compression can be divided into two groups. Word-based methods extract a subset of words from a sentence to shorten it, while bunsetsu-based methods extract a subset of bunsetsu (where a bunsetsu is a text unit that consists of content words and the following function words). Bunsetsu-based methods generally perform better than word-based methods. However, they cannot drop unimportant words from each bunsetsu, because they must follow constraints under which each bunsetsu is treated as a unit. In this paper, we propose a novel compression method to overcome this disadvantage. Our method relaxes the constraints using Lagrangian relaxation and shortens each bunsetsu if it contains unimportant words. Experimental results show that our method effectively compresses a sentence while preserving its important information and grammaticality.
    No preview · Conference Paper · Dec 2012
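The re-ranking step described in the relevance feedback abstract above (Harashima and Kurohashi, Jan 2014) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the surface and latent (LDA topic) word distributions have already been inferred elsewhere, and the function names, the fixed interpolation `weight`, and the use of cosine similarity as the combining metric are assumptions made for the sketch.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length distribution vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rerank(docs, feedback, weight=0.5):
    """Re-rank search results by similarity to user feedback.

    docs: list of (doc_id, surface_dist, latent_dist) tuples, where the
          distributions are assumed to be precomputed (the latent one by LDA).
    feedback: (surface_dist, latent_dist) inferred from the feedback text.
    weight: hypothetical interpolation between surface and latent similarity.
    Returns doc_ids sorted from most to least similar.
    """
    fb_surface, fb_latent = feedback
    scored = []
    for doc_id, surface, latent in docs:
        score = (weight * cosine(surface, fb_surface)
                 + (1 - weight) * cosine(latent, fb_latent))
        scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

A document whose surface words differ from the feedback can still rise in the ranking when its latent topic distribution matches, which is the motivation for combining the two similarity terms.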

Publication Stats

1k Citations
13.45 Total Impact Points

Institutions

  • 1992-2014
    • Kyoto University
      • Graduate School of Informatics
      Kyoto, Japan
  • 2011
    • University of Geneva
      Geneva, Switzerland
  • 2008-2010
    • National Institute of Information and Communications Technology
      • Information Analysis Laboratory
      Tokyo, Japan
  • 2001-2006
    • The University of Tokyo
      • Graduate School of Information Science and Technology
      Tokyo, Japan
  • 2003
    • Bunkyo University
      Tokyo, Japan