Sadao Kurohashi
Kyoto University | Kyodai · Graduate School of Informatics

PhD

About

318 Publications
16,330 Reads
3,686 Citations
Since 2017: 103 Research Items, 1,433 Citations
[Chart: citations per year, 2017-2023]

Publications (318)
Article
Natural language processing for medical applications (medical NLP) requires high-quality annotated corpora. In this study, we designed a versatile annotation scheme for clinical-medical text and a set of associated guidelines, which address two common subtasks used in medical NLP: named entity recognition (NER) and relation extraction (RE). The ann...
Preprint
Solving math word problems is a task that analyzes the relations between quantities and requires an accurate understanding of contextual natural language information. Recent studies show that current models rely on shallow heuristics to predict solutions and can easily be misled by small textual perturbations. To address this problem, we propose a Te...
Preprint
Relation extraction (RE) has achieved remarkable progress with the help of pre-trained language models. However, existing RE models are usually incapable of handling two situations: implicit expressions and long-tail relation types, caused by language complexity and data sparsity. In this paper, we introduce a simple enhancement of RE using $k$ nea...
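A minimal sketch of the k-nearest-neighbour idea this abstract refers to, assuming a generic setup in which a trained RE classifier's probabilities are interpolated with a label distribution retrieved from a datastore of training-example embeddings; the function names, interpolation weight, and datastore layout are illustrative assumptions, not the paper's actual method.

import numpy as np

def knn_relation_probs(query_emb, model_probs, store_embs, store_labels,
                       num_relations, k=8, temperature=1.0, lam=0.5):
    """Blend a RE model's softmax with a kNN label distribution (hypothetical helper)."""
    # Distances from the query embedding to every stored training example.
    dists = np.linalg.norm(store_embs - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]

    # Softmax over negative distances gives the neighbour weights.
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()

    # Accumulate each neighbour's weight onto its relation label.
    knn_probs = np.zeros(num_relations)
    for w, idx in zip(weights, nearest):
        knn_probs[store_labels[idx]] += w

    # lam controls how much the retrieved neighbours influence the final prediction.
    return lam * knn_probs + (1.0 - lam) * model_probs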
Article
Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental Chinese analysis tasks for Chinese language processing, which are also crucial for various downstream tasks such as machine translation and information extraction. To achieve high accuracy for these tasks, treebanks that contain sentences manually annotated...
Preprint
Previous studies have introduced a weakly-supervised paradigm for solving math word problems requiring only the answer value annotation. While these methods search for correct value equation candidates as pseudo labels, they search among a narrow sub-space of the enormous equation space. To address this problem, we propose a novel search algorithm...
Preprint
To solve Math Word Problems, human students leverage diverse reasoning logic that reaches different possible equation solutions. However, the mainstream sequence-to-sequence approach of automatic solvers aims to decode a fixed solution equation supervised by human annotation. In this paper, we propose a controlled equation generation solver by leve...
Article
Full-text available
Volitionality and subject animacy are fundamental and closely related properties of an event. Their classification is challenging because it requires contextual text understanding and a huge amount of labeled data. This paper proposes a novel method that jointly learns volitionality and subject animacy at a low cost, heuristically labeling events i...
Article
Full-text available
Recent research shows that Transformer-based language models (LMs) store considerable factual knowledge from the unstructured text datasets on which they are pre-trained. The existence and amount of such knowledge have been investigated by probing pre-trained Transformers to answer questions without accessing any external context or knowledge (also...
Article
In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on m...
Article
Full-text available
End-to-end speech translation (ST) is the task of directly translating source language speech to target language text. It has the potential to generate better translation than those obtained by simply combining automatic speech recognition (ASR) with machine translation (MT). We propose cross-lingual transfer learning for end-to-end ST, where the m...
Preprint
Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, help significantly improve cross-lingual downstream tasks. However, multiple training procedures, the use of a large amount of data, or inefficient model architectures result in heavy computation to train a new model according to our preferred languages an...
Preprint
Contrastive pre-training on distant supervision has shown remarkable effectiveness for improving supervised relation extraction tasks. However, the existing methods ignore the intrinsic noise of distant supervision during the pre-training stage. In this paper, we propose a weighted contrastive learning method by leveraging the supervised data to es...
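As a rough illustration of the weighted-contrastive idea, a generic weighted InfoNCE-style loss in numpy; how the per-pair weights are estimated from supervised data is the paper's contribution and is not reproduced here.

import numpy as np

def weighted_contrastive_loss(anchor, positives, negatives, weights, tau=0.1):
    """anchor: (d,), positives: (p, d), negatives: (n, d), weights: (p,)."""
    def sim(a, b):  # cosine similarity of one vector against a batch
        return (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b, axis=-1))

    pos_logits = sim(anchor, positives) / tau          # (p,)
    neg_logits = sim(anchor, negatives) / tau          # (n,)
    denom = np.exp(pos_logits).sum() + np.exp(neg_logits).sum()

    # Each positive contributes -log p(pos | anchor), scaled by its weight,
    # so noisier distant-supervision pairs can be down-weighted.
    losses = -np.log(np.exp(pos_logits) / denom)
    return float((weights * losses).sum() / weights.sum())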
Preprint
Word alignment has proven to benefit many-to-many neural machine translation (NMT). However, high-quality ground-truth bilingual dictionaries were used for pre-editing in previous methods, which are unavailable for most language pairs. Meanwhile, the contrastive objective can implicitly utilize automatically learned word alignment, which has not be...
Article
Flashcard systems are effective tools for learning words but have their limitations in teaching word usage. To overcome this problem, we suggest that a flashcard system shows a new example sentence on each repetition. This extension requires high-quality example sentences, automatically extracted from a huge corpus. To do this, we use a Determinant...
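A minimal sketch of selecting diverse, high-quality example sentences with a Determinantal Point Process: a textbook quality-times-similarity kernel with greedy MAP selection. This is a generic illustration of the approach named above, not the system's actual implementation.

import numpy as np

def greedy_dpp_select(quality, features, k):
    """Greedily pick k sentences maximizing the determinant of the DPP kernel submatrix.

    quality  : (n,) array of per-sentence quality scores
    features : (n, d) array of L2-normalized sentence embeddings
    """
    # Kernel L_ij = q_i * sim(i, j) * q_j  (quality-weighted similarity).
    sim = features @ features.T
    L = quality[:, None] * sim * quality[None, :]

    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected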
Preprint
Full-text available
Low-resource speech recognition has long suffered from insufficient training data. While neighbouring languages are often used as auxiliary training data, it is difficult for a model to induce similar units (characters, subwords, etc.) across the languages. In this paper, we assume that similar units in neighbouring languages share similar term fre...
Preprint
In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on m...
Preprint
Existing multimodal machine translation (MMT) datasets consist of images and video captions or general subtitles, which rarely contain linguistic ambiguity, making visual information less effective for generating appropriate translations. We introduce VISA, a new dataset that consists of 40k Japanese-English parallel sentence pairs and corresponding...
Article
Automatically solving math word problems is a critical task in the field of natural language processing. Recent models have reached their performance bottleneck and require more high-quality data for training. We propose a novel data augmentation method that reverses the mathematical logic of math word problems to produce new high-quality math prob...
Preprint
Meta learning with auxiliary languages has demonstrated promising improvements for cross-lingual natural language processing. However, previous studies sample the meta-training and meta-testing data from the same language, which limits the ability of the model for cross-lingual transfer. In this paper, we propose XLA-MAML, which performs direct cro...
Preprint
Full-text available
We present an open-access natural language processing toolkit for Japanese medical information extraction. We first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with the practical annotation scenarios by separately annotating two d...
Preprint
Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks. However, further increases and modifications based on such...
Preprint
With advances in neural language models, the focus of linguistic steganography has shifted from edit-based approaches to generation-based ones. While the latter's payload capacity is impressive, generating genuine-looking texts remains challenging. In this paper, we revisit edit-based linguistic steganography, with the idea that a masked language m...
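A minimal sketch of the edit-based idea sketched above: at selected word positions, a masked language model would propose ranked substitution candidates, and the rank of the candidate actually written encodes secret bits. The candidate lists below are placeholders standing in for real MLM predictions, and the bit-to-candidate mapping is a simplified assumption rather than the paper's exact scheme.

def encode_bits(tokens, positions, candidates_per_position, bitstring, bits_per_slot=2):
    """Replace tokens at `positions` so their candidate ranks spell out `bitstring`."""
    stego = list(tokens)
    cursor = 0
    for pos, candidates in zip(positions, candidates_per_position):
        chunk = bitstring[cursor:cursor + bits_per_slot]
        if not chunk:
            break
        index = int(chunk.ljust(bits_per_slot, "0"), 2)  # bits -> candidate rank
        stego[pos] = candidates[index]
        cursor += bits_per_slot
    return stego

# Example: each position's candidates would normally be an MLM's top-4 fill-ins.
tokens = "the movie was quite good".split()
candidates = [["good", "great", "fine", "nice"]]
print(" ".join(encode_bits(tokens, [4], candidates, "10")))  # -> "the movie was quite fine"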
Preprint
Full-text available
Medical information is valuable information obtained from humans regarding the phenotype of diseases. Omics data is informative to understand diseases at biomolecular level. We aimed to detect patient stratification patterns in a data-driven manner and identify candidate drug targets by investigating biomolecules that are linked to phenotype-level...
Article
Full-text available
Intelligent dialogue systems are expected to be a new interface between humans and machines. An ideal intelligent dialogue system should estimate the user’s internal states and incorporate the estimation results into its response appropriately. In this paper, we focus on the movie recommendation dialogues and propose a dialogue system that consider...
Article
Full-text available
Modeling the relations between text spans in a document is a crucial yet challenging problem for extractive summarization. Various kinds of relations exist among text spans of different granularity, such as discourse relations between elementary discourse units and coreference relations between phrase mentions. In this paper, we utilize heterogeneo...
Article
With the help of the detailed annotated question answering dataset HotpotQA, recent question answering models are trained to justify their predicted answers with supporting facts from context documents. Some related works train the same model to find supporting facts and answers jointly without having specialized models for each task. The others tr...
Article
Full-text available
Correcting typographical errors (typos) is important to mitigate errors in downstream natural language processing tasks. Although a large number of typo–correction pairs are required to develop typo correction systems, no such dataset is available for the Japanese language. Previous studies on building French and English typo datasets have exploite...
Article
Full-text available
In machine translation of spoken language, it is known that phenomena specific to spoken language have a negative impact on translation accuracy. Therefore, in this study, as a preprocessing step for Japanese-English translation in our university lecture translation system, we improve the translation accuracy by automatically converting spoken-styl...
Article
Although discourse parsing is fundamental to natural language processing, limited research has been conducted on corpus-based discourse parsing in Japanese. Herein, we construct a Japanese corpus annotated with discourse units, discourse connectives, and discourse relations. We propose four strategies of easily and rapidly developing a corpus: (1)...
Preprint
Full-text available
Intelligent dialogue systems are expected to be a new interface between humans and machines. Such an intelligent dialogue system should estimate the user's internal state (UIS) in dialogues and change its response appropriately according to the estimation result. In this paper, we model the UIS in dialogues, taking movie recommendation dialogues as ex...
Preprint
Automatically solving math word problems is a critical task in the field of natural language processing. Recent models have reached their performance bottleneck and require more high-quality data for training. Inspired by human double-checking mechanism, we propose a reverse operation based data augmentation method that makes use of mathematical lo...
Preprint
Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These introduce...
Preprint
The global pandemic of COVID-19 has made the public pay close attention to related news covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 situation differs greatly among countries (e.g., in policies and the development of the epidemic), and thus citizens would be interested in news in foreign...
Preprint
Full-text available
Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boostin...
Article
Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity — introducing attitudes via framing, presupposing truth, and casting doubt — remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed...
Article
An NLP tool is practical when it is fast in addition to having high accuracy. We describe the architecture and the methods used to achieve a 250× analysis speed improvement in the Juman++ morphological analyzer, together with slight accuracy improvements. This information should be useful for implementors of high-performance NLP and machine-learning b...
Preprint
Full-text available
Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks in low-resource settings. However, large monolingual corpora might not always be available for the languages of interest (LOI). To this end, we propose to exploit monolingual corpora of other languages to complement the sca...
Preprint
Full-text available
Lecture translation is a case of spoken language translation, and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language-independent framework for parallel corpus mining, which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach det...
Preprint
In this paper, we propose a two-layered multi-task attention based neural network that performs sentiment analysis through emotion analysis. The proposed approach is based on Bidirectional Long Short-Term Memory and uses Distributional Thesaurus as a source of external knowledge to improve the sentiment and emotion prediction. The proposed system h...
Preprint
Full-text available
Recognizing affective events that trigger positive or negative sentiment has a wide range of natural language processing applications but remains a challenging problem mainly because the polarity of an event is not necessarily predictable from its constituent words. In this paper, we propose to propagate affective polarity using discourse relations...
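A minimal sketch of propagating affective polarity over discourse relations, as an assumption about the general mechanism described above: events linked by a cause-like relation share polarity, while concession- or contrast-like links flip it. The relation names and the propagation rule are illustrative, not the paper's.

def propagate_polarity(seed_polarity, edges, iterations=5):
    """seed_polarity: {event: +1/-1}; edges: list of (src, dst, discourse_relation)."""
    polarity = dict(seed_polarity)
    flip = {"concession": -1, "contrast": -1}  # polarity-reversing relations
    for _ in range(iterations):
        for src, dst, rel in edges:
            if src in polarity and dst not in polarity:
                polarity[dst] = polarity[src] * flip.get(rel, 1)
    return polarity

# Example: "miss the bus" (negative seed) causes "be late"; "take a taxi"
# stands in concession to it, so it receives positive polarity.
seeds = {"miss the bus": -1}
edges = [("miss the bus", "be late", "cause"),
         ("be late", "apologize", "cause"),
         ("miss the bus", "take a taxi", "concession")]
print(propagate_polarity(seeds, edges))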
Conference Paper
Frequently Asked Question (FAQ) retrieval is an important task where the objective is to retrieve an appropriate Question-Answer (QA) pair from a database based on a user's query. We propose a FAQ retrieval system that considers the similarity between a user's query and a question as well as the relevance between the query and an answer. Although a...
Article
Full-text available
To communicate with humans in a human-like manner, systems need to understand behavior and psychological states in situations of human-machine interactions, such as in the cases of autonomous driving and nursing robots. We focus on driving situations as they are part of our daily lives and concern safety. To develop such systems, a corpus annotat...
Preprint
Full-text available
Frequently Asked Question (FAQ) retrieval is an important task where the objective is to retrieve the appropriate Question-Answer (QA) pair from a database based on the user's query. In this study, we propose a FAQ retrieval system that considers the similarity between a user's query and a question computed by a traditional unsupervised information...
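A minimal sketch of the FAQ retrieval idea described in the two entries above: combine an unsupervised query-question similarity with a query-answer relevance score. The TF-IDF similarity, the weighted sum, and the relevance_fn stub are generic stand-ins, not the system's actual components.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_faq(query, questions, answers, relevance_fn, alpha=0.5):
    """Rank QA pairs by a mix of query-question similarity and query-answer relevance."""
    vectorizer = TfidfVectorizer().fit(questions + [query])
    q_vec = vectorizer.transform([query])
    question_vecs = vectorizer.transform(questions)
    sim = cosine_similarity(q_vec, question_vecs)[0]   # query vs. question (unsupervised)
    rel = [relevance_fn(query, a) for a in answers]    # query vs. answer (stubbed here)
    scores = [alpha * s + (1 - alpha) * r for s, r in zip(sim, rel)]
    return max(range(len(questions)), key=lambda i: scores[i])

In a full system, relevance_fn would be a supervised matching model scoring how well an answer addresses the query.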
Article
Recently, dependency parsers with neural networks have outperformed existing parsers. When these parsing models are applied to Chinese sentences, they are used in a pipeline model with word segmentation and POS tagging models. In such cases, parsing models do not work well because of word segmentation and POS tagging errors. This can be solved by j...
Article
Various past, present, and future events are described in text. To understand such text, correct interpretation of temporal information is essential. For this purpose, many corpora associating events with temporal information have been constructed. Although these corpora focus on expressions with strong temporality, many expressions have weak tempo...
Chapter
This chapter describes several approaches of using comparable corpora beyond the area of MT for under-resourced languages, which is the primary focus of the ACCURAT project. Section 7.1, which is based on Rapp and Zock (Automatic dictionary expansion using non-parallel corpora. In: A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data...
Preprint
Full-text available
We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory for sentiment analysis. The two-layered attention network takes advantage of external knowledge bases to improve sentiment prediction. It uses Knowledge Graph Embeddings generated using WordNet. We build our model by combining the two...
Article
This paper presents a method for improving semantic role labeling (SRL) using a large amount of automatically acquired knowledge. We acquire two varieties of knowledge, which we call surface case frames and deep case frames. Although the surface case frames are compiled from syntactic parses and can be used as rich syntactic knowledge, they have li...