
Satoshi Sekine
- Doctor of Engineering
- Group Leader at RIKEN
About
158 Publications
25,666 Reads
6,948 Citations
Introduction
- Natural Language Processing
- Knowledge Base Construction
- Structuring knowledge in Wikipedia
Current institution: RIKEN
Publications (158)
Named entity recognition (NER), which detects named entities in text and classifies them as PERSON or LOCATION, is a fundamental technique in natural language processing. Recently, NER systems have been demanded for classification into fine-grained classes. Generally, training data are required to construct an NER system. However, manual labeling i...
Wikipedia can be edited by anyone and thus contains sentences of varying quality. Therefore, Wikipedia includes some poor-quality edits, which are often marked up by other editors. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting in this process is very important, but a large and comprehensive...
Asking and answering questions are inseparable parts of human social life. The primary purposes of asking questions are to gain knowledge or request help which has been the subject of question-answering studies. However, questions can also reflect negative intentions and include implicit offenses, such as highlighting one's lack of knowledge or bol...
Named-entity recognition (NER) technologies have been developed for the general knowledge domain, and the extension of NER to specific domains has now become a topic of active research. In this paper, we provide a survey of recent studies on domain-specific NER technologies for the Japanese language. We focus on aspects of different applications su...
The proliferation of social media and online communication platforms has made social interactions more accessible, leading to a significant expansion of research into language use with a particular focus on toxic behavior and hate speech. Few studies, however, have focused on the tacit information that may imply a negative intention and the perspec...
Wikipedia is an exhaustive resource that contains too much information for any one human to completely absorb. Computers, on the other hand, are able to trawl through information at a rapid pace. However, as Wikipedia is written in such a way that it is clear for people to read, it is not possible for a machine to easily utilise and manipulate this...
Background. While the use of citations for assessing research impact is well-studied, there is little work that investigates the content introduced into the citing documents through citations and the linguistic expressions used to represent the cited content. Objectives. This study analysed the types of content introduced into citing documents using...
Wikipedia is a huge opportunity for machine learning, being the largest semi-structured base of knowledge available. Because of this, countless works examine its contents, and focus on structuring it in order to make it usable in learning tasks, for example by classifying it into an ontology. Beyond its textual contents, Wikipedia also displays a t...
The NTCIR-14 QA Lab-PoliInfo aims to achieve real-world complex question-answering (QA) technologies using Japanese political information, such as local assembly minutes and newsletters. QA Lab-PoliInfo has three tasks, namely, segmentation, summarization and classification. We describe the dataset used, formal run results, and comparison between h...
Wikipedia is a great source of general world knowledge which can guide NLP models to better understand their motivation to make predictions. We aim to create a large set of structured knowledge, usable for NLP models, from Wikipedia. The first step we take to create such a structured knowledge source is fine-grained classification of Wikipedia articles....
Many text generation tasks naturally contain two steps: content selection and surface realization. Current neural encoder-decoder models conflate both steps into a black-box architecture. As a result, the content to be described in the text cannot be explicitly controlled. This paper tackles this problem by decoupling content selection from the dec...
Monotonicity reasoning is one of the important reasoning skills for any intelligent natural language inference (NLI) model in that it requires the ability to capture the interaction between lexical and syntactic structures. Since no test set has been developed for monotonicity reasoning with wide coverage, it is still unclear whether neural models...
Large crowdsourced datasets are widely used for training and evaluating neural models on natural language inference (NLI). Despite these efforts, neural models have a hard time capturing logical inferences, including those licensed by phrase replacements, so-called monotonicity reasoning. Since no large dataset has been developed for monotonicity r...
Fine-Grained Named Entity Recognition (FG-NER) is critical for many NLP applications. While classical named entity recognition (NER) has attracted a substantial amount of research, FG-NER is still an open research domain. The current state-of-the-art (SOTA) model for FG-NER relies heavily on manual efforts for building a dictionary and designing ha...
We propose a FAQ search method with automatically generated questions by a question generator created from community Q&As. In our method, a search model is trained with automatically generated questions and their corresponding FAQs. We conducted experiments on a Japanese Q&A dataset created from a user support service on Twitter. The proposed metho...
A challenge in creating a dataset for machine reading comprehension (MRC) is to collect questions that require a sophisticated understanding of language to answer beyond using superficial cues. In this work, we investigate what makes questions easier across 12 recent MRC datasets with three question styles (answer extraction, description, and multi...
This paper addresses the task of assigning labels of fine-grained named entity (NE) types to Wikipedia articles. Information of NE types are useful when extracting knowledge of NEs from natural language text. It is common to apply an approach based on supervised machine learning to named entity classification. However, in a setting of classifying i...
This paper reports error analysis results on the product attribute value extraction task. We built the system that extracted attribute values from product descriptions by simply matching the descriptions and entries in an attribute value dictionary. The dictionary is automatically constructed by parsing semi-structured data such as tables and itemi...
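The matching step itself is simple to illustrate. Below is a minimal sketch in Python, with a small hand-written dictionary standing in for the one the paper builds automatically from tables and itemizations; the attribute names and values here are purely illustrative.

```python
# Toy attribute-value extractor by dictionary matching.
# The real dictionary is constructed automatically from
# semi-structured product data; this one is hand-written.
ATTRIBUTE_DICT = {
    "color": ["red", "blue", "black"],
    "material": ["cotton", "leather"],
}

def extract_attributes(description: str) -> dict:
    """Return every dictionary value found in the description text."""
    text = description.lower()
    found = {}
    for attribute, values in ATTRIBUTE_DICT.items():
        hits = [v for v in values if v in text]
        if hits:
            found[attribute] = hits
    return found

print(extract_attributes("A red jacket made of genuine leather."))
# -> {'color': ['red'], 'material': ['leather']}
```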
On an e-commerce site, product blurbs (short promotional statements) and user reviews give us a lot of information about products. While a blurb should be appealing to encourage more users to click on a product link, sometimes sellers may miss or misunderstand which aspects of the product are important to their users. We therefore propose a novel t...
A natural language processing apparatus includes a result acquisition unit that acquires a plurality of analysis results indicating parts of speech of morphemes contained in one or more common sentences from a plurality of types of morphological analyzers, a pattern acquisition unit that detects a common segmentation point in the plurality of analy...
One exemplary aspect comprises a computer system comprising: (a) a preprocessing unit that extracts text from a webpage to produce at least a first set of candidate keywords, applies language processing to produce at least a second set of candidate keywords, and combines said first and second sets of candidate keywords into a first candidate pool;...
The present disclosure is directed to a computer system and method performed by a selectively programmed data processor for providing data to a Web page such that items are presented to the user in a way that imitates a real world shopping experience. Various aspects of the disclosed technology also relate to systems and methods for calculating pro...
Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach, integrating a source LM and/or back-transliteration and an English LM. The experiments on Ja...
This paper presents an information extraction method for political problems in minutes of local councils, which record councilors' utterances impartially. We focus on lexical heads of noun phrases in order to extract political problems. In this paper, the lexical heads are limited to nouns. Our method is divided into two steps as follows: First step,...
Transliteration has been usually recognized by spelling-based supervised models. However, a single model cannot deal with mixture of words with different origins, such as "get" in "piaget" and "target". Li et al. (2007) propose a class transliteration method, which explicitly models the source language origins and switches them to address this issu...
At present, online shopping is typically a search-oriented activity where a user gains access to products which best match their query. Instead, we propose a surf-oriented online shopping paradigm, which links associated products allowing users to "wander around" the online store and enjoy browsing a variety of items. As an initial step in creating...
Bootstrapping has been used as a very efficient method to extract a group of items similar to a given set of seeds. However, the bootstrapping method intrinsically has several parameters whose optimal values differ from task to task, and from target to target. In this paper, first, we will demonstrate that this is really the case and serious proble...
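The loop that such experiments tune is easy to sketch. The toy version below alternates between pattern discovery and instance extraction over whitespace-tokenized sentences; the seed set, frequency-based pattern scoring, and iteration count are illustrative assumptions, not the parameter settings the paper investigates.

```python
from collections import Counter

def bootstrap(seeds, sentences, iterations=3, top_k=5):
    """Grow a seed set by alternating pattern discovery and
    instance extraction (a toy sketch; real systems score
    patterns and candidate instances far more carefully)."""
    items = set(seeds)
    for _ in range(iterations):
        # 1. Collect contexts (here: the following word) of known items.
        contexts = Counter()
        for sent in sentences:
            tokens = sent.split()
            for i, tok in enumerate(tokens[:-1]):
                if tok in items:
                    contexts[tokens[i + 1]] += 1
        # 2. Keep the most frequent contexts as extraction patterns.
        patterns = {c for c, _ in contexts.most_common(top_k)}
        # 3. Extract new items that occur with those patterns.
        for sent in sentences:
            tokens = sent.split()
            for i in range(len(tokens) - 1):
                if tokens[i + 1] in patterns:
                    items.add(tokens[i])
    return items
```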
We present a simple semi-supervised relation extraction system with large-scale word clustering. We focus on systematically exploring the effectiveness of different cluster-based features. We also propose several statistical methods for selecting clusters at an appropriate level of granularity. When training on different sizes of data, our semi-sup...
Transliteration, a rich source of proper noun spelling variations, is usually recognized by phonetic- or spelling-based models. However, a single model cannot deal with different words from different language origins, e.g., "get" in "piaget" and "target." Li et al. (2007) propose a method which explicitly models and classifies the source language o...
The third WePS (Web People Search) Evaluation campaign took place in 2009-2010 and attracted the participation of 13 research groups from Europe, Asia and North America. Given the top web search results for a person name, two tasks were addressed: a clustering task, which consists of grouping together web pages referring to the same person, and an...
While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types...
In this paper, we will describe an idea and its implementation for an ngram search engine for very large sets of ngrams. The engine supports queries with an arbitrary number of wildcards. It takes a fraction of a second for a search, and can provide the fillers of the wildcards. We implemented the system using two datasets. One is the 1 billion 5...
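The abstract does not describe the index structure, so the sketch below only illustrates the query semantics: each 5-gram is indexed by its (position, word) pairs, and a wildcard query intersects the posting lists of its non-wildcard positions, returning the wildcard fillers. The class name and data layout are assumptions.

```python
from collections import defaultdict

class NgramIndex:
    """Toy wildcard search over 5-grams: intersect posting
    lists keyed by (position, word) for non-wildcard slots."""
    def __init__(self, ngrams):
        self.ngrams = ngrams  # list of (5-tuple of words, count)
        self.postings = defaultdict(set)
        for idx, (words, _) in enumerate(ngrams):
            for pos, w in enumerate(words):
                self.postings[(pos, w)].add(idx)

    def search(self, query):
        """query: five items, each a word or '*' wildcard.
        Returns (wildcard fillers, count) for every match."""
        hits = None
        for pos, w in enumerate(query):
            if w == "*":
                continue
            ids = self.postings.get((pos, w), set())
            hits = ids if hits is None else hits & ids
        hits = range(len(self.ngrams)) if hits is None else hits
        return [
            ([self.ngrams[i][0][p] for p, q in enumerate(query) if q == "*"],
             self.ngrams[i][1])
            for i in hits
        ]

index = NgramIndex([(("the", "cat", "sat", "on", "mat"), 12)])
print(index.search(("the", "*", "sat", "on", "*")))  # [(['cat', 'mat'], 12)]
```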
The second WePS (Web People Search) Evaluation campaign took place in 2008-2009 with the participation of 19 research groups from Europe, Asia and North America. Given the output of a Web Search Engine for a (usually ambiguous) person name as query, two tasks were addressed: a clustering task, which consists of grouping together web pages referri...
In this paper, we describe the Web People Search 2 attribute extraction task (WePS2-AE). It was conducted in September-December 2008 along with the WePS2 clustering task. Six groups participated in the AE task. We will describe the motivation, task definition, evaluation set up, participating systems, and evaluation results. We will discuss the pro...
This paper proposes a method to extract causal knowledge (cause and effect relations) using clue phrases and syntactic patterns from Japanese newspaper articles concerning economic trends. For example, a sentence fragment “World economy recession due to the subprime loan crisis ...” contains causal knowledge in which “World economy recession” is an...
This paper presents the motivation, resources and results for the first Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. Also, we will describe a survey and proposal for a new task, "attribute extraction", which is planned for inclusion in the second evaluation, planned for autumn, 2008.
Named Entities (NE) are regarded as an important type of semantic knowledge in many natural language processing (NLP) applications. Originally, a limited number of NE categories were proposed. In MUC, it was 7 categories - people, organization, location, time, date, money and percentage expressions. However, it was noticed that such a limited numbe...
This paper proposes a new method of sentiment analysis utilizing inter-sentence structures, especially for coping with reversal phenomena of word polarity such as quotations of others' opinions on an opposite side. We model these phenomena using Hidden Conditional Random Fields (HCRFs) with three kinds of features: transition features, polarit...
In this paper, we will describe a search tool for a huge set of ngrams. The tool supports queries with an arbitrary number of wildcards. It takes a fraction of a second for a search, and can provide the fillers of the wildcards. The system runs on a single Linux PC with reasonable size memory (less than 4GB) and disk space (less than 400GB). This...
In this paper we introduce a system that collects English-Japanese translation document pairs from the Web that are relevant to subject keywords specified by the user. The system, QRselect, is specifically designed to meet the needs of online volunteer translators who, in the process of translation, want to refer to a small and specific set of tran...
Named Entities provide critical information for many NLP applications. Named Entity recognition and classification (NERC) in text is recognized as one of the important sub-tasks of Information Extraction (IE). The seven papers in this volume cover various interesting and informative aspects of NERC research. Nadeau & Sekine provide an extensive su...
We present a method for acquiring ontological knowledge using search query logs. We first use query logs to identify important contexts associated with terms belonging to a semantic category; we then use these contexts to harvest new words belonging to this category. Our evaluation on selected categories indicates that the method works very well to...
In this paper, we will describe ODIE, the On-Demand Information Extraction system. Given a user's query, the system will produce tables of the salient information about the topic in structured form. It produces the tables in less than one minute without any knowledge engineering by hand, i.e. pattern creation or paraphrase knowledge creation, w...
This paper presents the task definition, resources, participation, and comparative results for the Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. This task consists of clustering a set of documents that mention an ambiguous person name according to the actual entities referred to using that name.
We are trying to extend the boundary of Information Extraction (IE) systems. Existing IE systems require a lot of time and human effort to tune for a new scenario. Preemptive Information Extraction is an attempt to automatically create all feasible IE systems in advance without human intervention. We propose a technique called Unrestricted Relati...
This paper describes a system which identifies discourse relations between two successive sentences in Japanese. On top of the lexical information previously proposed, we used phrasal pattern information. Adding phrasal information improves the system's accuracy by 12 points, from 53% to 65%.
At present, adapting an Information Extraction system to new topics is an expensive and slow process, requiring some knowledge engineering for each new topic. We propose a new paradigm of Information Extraction which operates 'on demand' in response to a user's query. On-demand Information Extraction (ODIE) aims to completely eliminate the cu...
This paper describes an automatic dictionary construction method for Named Entity Recognition (NER) on specific domains such as restaurant guides. NER is the first step toward Information Extraction (IE), and we believe that such a dictionary construction method for NER is crucial for developing IE systems for a wide range of domains in the World...
This paper describes a system which solves language tests for second grade students (7 years old). In Japan, there are materials for students to measure understanding of what they studied, just like the SAT for high school students in the US. We use textbooks for the students as the target material of this study. Questions in the materials are classified...
This paper describes two methods for detecting word segments and their morphological information in a Japanese spontaneous speech corpus, and describes how to tag a large spontaneous speech corpus accurately by using the two methods. The first method is used to detect any type of word segments. The second method is used when there are several defin...
We participated in two tasks (Task 2 and Task 5) at the DUC-2004 formal run and evaluated the performance of our summarization system. Our system, based on sentence extraction, also uses a module to estimate similarity between sentences.
The tagging of Named Entities, the names of particular things or classes, is regarded as an important component technology for many NLP applications. The first Named Entity set had 7 types, organization, location, person, date, time, money and percent expressions. Later, in the IREX project artifact was added and ACE added two, GPE and facility, to...
In this paper, we discuss the performance of crosslingual information extraction systems employing an automatic pattern acquisition module. This module, which creates extraction patterns starting from a user's narrative task description, allows rapid customization to new extraction tasks. We compare two approaches: (1) acquiring patterns in the sou...
Discovering the significant relations embedded in documents would be very useful not only for information retrieval but also for question answering and summarization. Prior methods for relation discovery, however, needed large annotated corpora which cost a great deal of time and effort. We propose an unsupervised method for relation discovery from...
In this paper we describe a way to discover Named Entities by using the distribution of words in news articles. Named Entity recognition is an important task for today's natural language applications, but it still suffers from data sparseness. We used an observation that a Named Entity often appears synchronously in several news articles, wherea...
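One way to render this distributional observation in code is to flag words that appear in many distinct sources on the same day but on few days overall. The data layout and thresholds below are assumptions made for illustration.

```python
from collections import defaultdict

def synchronous_candidates(articles, min_sources=3, max_days=2):
    """articles: iterable of (date, source, set_of_words).
    A word seen in many sources on one day, but on few days
    overall, is flagged as a Named Entity candidate."""
    sources_by_word_day = defaultdict(set)
    days_by_word = defaultdict(set)
    for date, source, words in articles:
        for w in words:
            sources_by_word_day[(w, date)].add(source)
            days_by_word[w].add(date)
    return {
        w
        for (w, date), sources in sources_by_word_day.items()
        if len(sources) >= min_sources and len(days_by_word[w]) <= max_days
    }
```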
...to realize. Under such circumstances, we believe, it is important to observe how humans are doing the same task, and look around for different strategies.
We describe a method to automatically extract hyponyms from Japanese newspapers. First, we discover patterns which can extract hyponyms of a noun, such as "A nado-no B (B such as A)", then we apply the patterns to the newspaper corpus to extract instances. The procedure works best to extract hyponyms of concrete things in the middle of the word hie...
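The pattern-application step is simple to sketch. The version below uses the English gloss "B such as A" of the Japanese pattern "A nado-no B"; the actual system matched Japanese newspaper text with morphological analysis, which this toy regex omits.

```python
import re

# English analogue of the pattern "A nado-no B" ("B such as A").
PATTERN = re.compile(r"(\w+) such as (\w+)")

def extract_hyponyms(text):
    """Return (hypernym, hyponym) pairs for each pattern match."""
    return PATTERN.findall(text)

print(extract_hyponyms("He bought fruit such as apples at the market."))
# -> [('fruit', 'apples')]
```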
This paper describes Japanese-English-Chinese aligned parallel treebank corpora of newspaper articles. They have been constructed by translating each sentence in the Penn Treebank and the Kyoto University text corpus into a corresponding natural sentence in a target language. Each sentence is translated so as to reflect its contextual informatio...
This paper presents a method to construct Japanese KATAKANA variant list from large corpus. Our method is useful for information retrieval, information extraction, question answering, and so on, because KATAKANA words tend to be used as "loan words" and the transliteration causes several variations of spelling. Our method consists of three steps. A...
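The grouping idea can be sketched by normalizing each katakana word and bucketing words that normalize identically. The hand-written rules below (dropping long-vowel marks, unifying a few small-kana spellings) are illustrative stand-ins for the corpus-driven steps in the paper.

```python
from collections import defaultdict

def normalize(word):
    """Collapse common katakana spelling variations (toy rules)."""
    word = word.replace("ー", "")    # drop long-vowel mark
    word = word.replace("ヴ", "ブ")  # vu -> bu
    word = word.replace("ィ", "イ").replace("ェ", "エ")
    return word

def variant_groups(words):
    groups = defaultdict(list)
    for w in words:
        groups[normalize(w)].append(w)
    return [g for g in groups.values() if len(g) > 1]

print(variant_groups(["コンピュータ", "コンピューター", "ウイルス", "ウィルス"]))
# -> [['コンピュータ', 'コンピューター'], ['ウイルス', 'ウィルス']]
```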
We participated in three multi-document summarization tasks at the DUC-2003 formal run and evaluated the performance of our summarization system. Our summarization system based on sentence extraction also incorporated a module to estimate similarity between sentences for multi-document summarization. The similarity information was used for selec...
We developed a cross-lingual, question-answering (CLQA) system for Hindi and English. It accepts questions in English, finds candidate answers in Hindi newspapers, and translates the answer candidates into English along with the context surrounding each answer. The system was developed as part of the surprise language exercise (SLE) within the TIDE...
We are trying to find paraphrases from Japanese news articles which can be used for Information Extraction. We focused on the fact that a single event can be reported in more than one article in different ways. However, certain kinds of noun phrases such as names, dates and numbers behave as "anchors" which are unlikely to change across articles. O...
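The anchor idea can be sketched as follows, assuming capitalized words and digit strings as toy anchors: sentences from two articles on the same event are paired when they share enough anchors, and the differing remainder becomes a paraphrase candidate.

```python
import re

ANCHOR = re.compile(r"[A-Z][a-z]+|\d+")  # toy anchors: names, numbers

def paraphrase_pairs(article_a, article_b, min_shared=2):
    """Pair sentences (one from each article) sharing enough anchors."""
    pairs = []
    for s1 in article_a:
        anchors1 = set(ANCHOR.findall(s1))
        for s2 in article_b:
            shared = anchors1 & set(ANCHOR.findall(s2))
            if len(shared) >= min_shared and s1 != s2:
                pairs.append((s1, s2, shared))
    return pairs
```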
We report evaluation results for our summarization system and analyze the resulting summarization data for three different types of corpora. To develop a robust summarization system, we have created a system based on sentence extraction and applied it to summarize Japanese and English newspaper articles, obtained some of the top results at two...
We participated in both the single-document and multi-document summarization tasks at the TSC 2002. We have incorporated two modules into our earlier summarization system, which is based on a sentence extraction technique, so that we could apply the system to the multi-document summarization task. One is a module to categorize document sets and the...
Introduction: The difficulties in current NLP applications are seldom due to the lack of appropriate frameworks for encoding our linguistic or extra-linguistic knowledge, but rather to the fact that we do not know in advance what actual instances of knowledge should be, even though we know in advance what types of knowledge are required. It normall...
A deterministic finite state transducer is a fast device for analyzing strings. It takes O(n) time to analyze a string of length n. In this paper, an application of this technique to Japanese dependency analysis will be described. We achieved the speed at a small cost in accuracy. It takes about 0.17 milliseconds to analyze one sentence (averag...
We will report on one of the two tasks in the IREX (Information Retrieval and Extraction Exercise) project, an evaluation-based project for Information Retrieval and Information Extraction in Japanese (Sekine and Isahara, 2000) (IREX Committee, 1999). The project started in 1998 and concluded in September 1999 with many participants and collabor...
Statistical language models play a major role in current speech recognition systems. Most of these models have focused on relatively local interactions between words. Recently, however, there have been several attempts to incorporate other knowledge sources, in particular longer-range word dependencies, in order to improve speech recognizers...
Several approaches have been described for the automatic unsupervised acquisition of patterns for information extraction. Each approach is based on a particular model for the patterns to be acquired, such as a predicate-argument structure or a dependency chain. The effect of these alternative models has not been previously studied. In this paper, w...
Introduction: Standardizing ontologies is a challenging task. Ontologies have been created based on different backgrounds, different purposes and different people. However, standardizing them is useful not only for applications, such as Machine Translation and Information Retrieval, but also to improve the ontologies themselves. During the process of st...
In this paper, we propose a corpus-based automatic method to collect synonymous expressions using multiple newspaper articles. The basic idea is to automatically collect synonymous sentences such as the ones shown in Figure 1. Different newspapers from the same day might contain articles which report on the same event; we try to extract similar expressi...
We describe a method for generating sentences from "keywords" or "headwords". This method consists of two main parts, candidate-text construction and evaluation. The construction part generates text sentences in the form of dependency trees by using complementary information to replace information that is missing because of a "knowledge gap" and ot...
This paper describes a project tagging a spontaneous speech corpus with morphological information such as word segmentation and parts-of-speech. We use a morphological analysis system based on a maximum entropy model, which is independent of the domain of corpora. In this paper we show the tagging accuracy achieved by using the model and discuss pro...