About
35
Publications
9,179
Reads
366
Citations
Introduction
My research focuses on sequence labeling, namely word segmentation, machine transliteration, factoid question-answering, and dialogue intent detection, via minimally supervised sub-word learning. I’m currently working on data augmentation for a noise-tolerant classifier and its applications in personalization using federated learning.
Additional affiliations
August 2002 - August 2012
Education
September 2004 - July 2012
Publications (35)
Numerous studies have analyzed the influences of word segmentation (WS) performance on information retrieval (IR) for Mandarin Chinese and have demonstrated a non-monotonic relationship between WS accuracy and IR effectiveness. The usefulness of the compound words that have been a focus of the IR literature is not reflected by common WS evaluation...
This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS) and its variation of left-right co-existed feature (LRAVS),...
Since a Chinese syllable can correspond to many characters (homophones), the syllable-to-character conversion task is quite challenging for Chinese phonetic input methods (CPIM). There are usually two stages in a CPIM: 1. segment the syllable sequence into syllable words, and 2. select the most likely character words for each syllable word. A CPIM...
We propose a hybrid architecture for the NTCIR-5 CLQA C-C (Cross Language Question Answering from Chinese to Chinese) Task. Our system, the Academia Sinica Question-Answering System (ASQA), outputs exact answers to six types of factoid question: personal names, location names, organization names, artifacts, times, and numbers. The architecture of A...
This work proposes a novel metric, Maximally Amortized Cost (MAC), for cost evaluations of error correction of predictive Chinese input methods (IMs). With a series of real-time simulation, user correction behaviors are analyzed by estimating generalized backward compatibility of adaptive Chinese IMs. Comparisons between three IMs by using MAC wi...
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technolog...
PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaborativel...
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks. It has been hypothesized that this is a consequence of implicit multitask learning in language model training. Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale...
On word segmentation problems, machine learning architecture engineering often draws attention. The problem representation itself, however, has remained almost static as either word lattice ranking or character sequence tagging, for at least two decades. The latter often shows stronger predictive power than the former for out-of-vocabulary (OOV) i...
The paper describes DGLab question answering system and automatic evaluation method at NTCIR-13 QA Lab-3 for Japanese university entrance exam on world history essay. Submissions of subtasks include extraction, summarization, and evaluation method, in Phase-2 and Research Run for both Japanese and English. The proposed system follows the organizer-...
This work proposes a novel metric, Maximally Amortized Cost (MAC), for cost evaluations of error correction of predictive Chinese input methods (IMs). With a series of real-time simulation, user correction behaviors are analyzed by estimating generalized backward compatibility of adaptive Chinese IMs. Comparisons between three IMs by using MAC wit...
This work presents an English-to-Chinese (E2C) machine transliteration system based on two-stage conditional random fields (CRF) models with accessor variety (AV) as an additional feature to approximate local context of the source language. Experiment results show that two-stage CRF method outperforms the one-stage opponent since the former costs l...
This work proposed a unified view of several unsupervised feature selection based on frequent strings that improve conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS), term-contributed frequency (TCF), and term-contributed boundary (TCB),...
This work describes a grapheme-based approach of English-to-Chinese (E2C) transliteration, which includes many-to-many alignment models and conditional random fields using accessor variety (AV) as an additional feature based on source graphemes. Experimental results indicate that the AV of a given English segment can generally improve effectiveness...
This work represents several unsupervised feature selections based on frequent strings that help improve conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based N-gram (CNG), Accessor Variety based string (AVS), and Term Contributed Frequency (TCF) with a specific manner of boundary overlapp...
This paper proposes a novel feature for conditional random field (CRF) model in Chinese word segmentation system. The system uses a conditional random field as machine learning model with one simple feature called term contributed boundaries (TCB) in addition to the "BIEO" character-based label scheme. TCB can be extracted from unlabeled corpora au...
This paper presents a Chinese word segmentation system submitted to the closed training evaluations of CIPS-SIGHAN-2010 bakeoff. The system uses a conditional random field model with one simple feature called term contributed boundaries (TCB) in addition to the "BI" character-based tagging approach. TCB can be extracted from unlabeled corpora au...
Question Answering (QA) research has been conducted in many languages. Nearly all the top performing systems use heavy methods that require sophisticated techniques, such as parsers or logic provers. However, such techniques are usually unavailable or unaffordable for under-resourced languages or in resource-limited situations. In this article, we...
In this paper, we propose an automated evaluation metric for text entry. We also consider possible improvements to existing text entry evaluation metrics, such as the minimum string distance error rate, keystrokes per character, cost per correction, and a unified approach proposed by MacKenzie, so they can accommodate the special characteristics of...
Intelligent Input Methods (IM) are essential for making text entries in many East Asian scripts, but their application to other languages has not been fully explored. This paper discusses how such tools can contribute to the development of computer processing of other oriental languages. We propose a design philosophy that regards IM as a text serv...
This paper describes the experiments we are currently conducting at Academia Sinica for extending a proposed international standard for lexical framework (Lexical Markup Framework) and shows the relevance of this work for the Digital Archives Program. Although this framework is very rich and powerful, it originally has been developed for European la...
Many grammar checkers in rule-based approach do not handle errors that come from various usages, for example, the usages of prepositions. To study the behavior of prepositions, we introduce the language model into a grammar-checking task. A language model is trained from a large training corpus, which contains many short phrases. It can be used for...
Input method (IM) is a sine qua non for text entry of many Asian languages, but its potential applications on other languages remain under-explored. This paper proposes a philosophy of input method design by seeing it as a nonintrusive plug-in text service framework. Such design allows new functionalities of text processing to be attached onto a ru...
This paper has been withdrawn by the author, because it was merged into cs.HC/0508041
Question Answering (QA) is becoming an increasingly important research area in natural language processing. Since 1999, many international question answering contests have been held at conferences and workshops, such as TREC, CLEF, and NTCIR. Thus far, eleven languages – Bulgarian, Dutch, English, Finnish, French, German, Indonesian, Italian, Japan...
For NTCIR-6 CLQA, we improved our question answering system ASQA (Academia Sinica Question Answering System), which participated in NTCIR-5 CLQA, so that it could deal with the Chinese-Chinese (C-C) subtask and the English-Chinese (E-C) subtask. There are three innovations in the improved system: (a) to handle the E-C subtask, we have built an Engl...
We propose a hybrid architecture for answering Chinese factoid questions from news documents. Our proposed system, the Academia Sinica Question-Answering System (ASQA), outputs exact answers for factoid questions of six types, i.e., person name, location name, organization name, artifact, time, and number. The architecture of ASQA comprises four ma...
Passage retrieval plays an important role in a Chinese factoid Question Answering (QA) system. Query term selection is the process of choosing keywords from a given question to make the most use of information retrieval engines. Query terms selected by humans are analyzed to measure the difficulty and for evaluating machine generated results. Three...
Questions
Question (1)
My motivation is to somehow (blindly) learn (negative) patterns from plain-text corpora.
For a bag of words {this, is, a, book}, once a corpus tells us for sure that there is no usage of "book is this a", and so forth for the other orderings, one may hopefully find, by negation, some hidden rules that promote the bag-of-words model to something similar to LDA.
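To make the idea concrete, here is a minimal sketch of collecting such negative patterns. It enumerates every ordering of a small bag of words and keeps those that never appear as a contiguous sequence in a corpus. The function names and the three-sentence toy corpus are hypothetical, and the sketch assumes whitespace tokenization. Note that absence from a finite corpus is only weak evidence that an ordering is never used, so "for sure" would in practice need a large corpus or a statistical threshold.

```python
from itertools import permutations


def attested_ngrams(corpus_sentences, n):
    """Collect every contiguous word n-gram that occurs in the corpus."""
    seen = set()
    for sentence in corpus_sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return seen


def unattested_orderings(bag, corpus_sentences):
    """Return the orderings of `bag` that never occur contiguously in the corpus."""
    seen = attested_ngrams(corpus_sentences, len(bag))
    return sorted(p for p in permutations(sorted(bag)) if p not in seen)


# Hypothetical toy corpus; a real study would use a large plain-text corpus.
corpus = ["this is a book", "is this a book", "a book is this"]
bag = {"this", "is", "a", "book"}

negatives = unattested_orderings(bag, corpus)
for ordering in negatives[:3]:
    print(" ".join(ordering))
```

The number of orderings grows factorially with the size of the bag, so this brute-force enumeration is only practical for short bags; for longer ones, checking unattested bigrams or trigrams would be a cheaper approximation.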