Shu-Kai Hsieh

National Taiwan University | NTU · Graduate Institute of Linguistics

Ph.D. in Linguistics

About

106
Publications
20,238
Reads
495
Citations
Additional affiliations
August 2011 - October 2015
National Taiwan University
Position
  • Professor (Assistant)
September 2008 - June 2011
National Taiwan Normal University
Position
  • Professor (Assistant)
February 2006 - August 2007
Academia Sinica
Position
  • PostDoc Position
Education
January 2000 - February 2006
University of Tuebingen
Field of study
  • Computerlinguistik

Publications (106)
Article
Full-text available
The present study aimed to investigate the neural mechanism underlying semantic processing in Mandarin Chinese adult learners, focusing on the learners who were Indo-European language speakers with advanced levels of proficiency in Mandarin Chinese. We used functional magnetic resonance imaging technique and a semantic judgment task to test 24 Mand...
Article
As cultural conflicts are intensifying locally and internationally in the aftermath of the COVID-19 pandemic, fine-tuned investigation of culture/religion, especially that of the marginalized populations, holds the potential to reduce disparity and suffering in the global village. This study used 3 textual analysis programs: Topic Modeling, C-LIWC, and SS...
Preprint
Full-text available
A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive training for any variation. Text-driven approach, e.g....
Chapter
Full-text available
Human societies change over time. When new ideas, new tools, new institutions, new knowledge, and new ways of life arise, they bring forth new words – neologisms. This chapter begins with an overview of research on the major waves of neologisms in the history of Chinese, followed by a review of recent work on the conventionalization measures of neo...
Chapter
Language variations and changes have been widely investigated since they are encapsulated phenomena involving many linguistic factors. The notion of conventionalization, which is regarded as the diachronic process subject to normal constraints on language change, can refer to the new coding of conceptual categories in the synchronic sense, or the...
Article
Recurrent word sequences, referred to as “lexical bundles”, may be structurally incomplete, but they serve important communicative functions. Despite the essential roles of lexical bundles in discourse, many methodological issues have been raised in the process of identifying lexical bundles, which is generally frequency-based. The present study id...
Article
This paper proposes an innovative approach to link basic lexicon (e.g. Swadesh list) to upper ontology as the foundation of OntoLex interface to address the challenge of building language resources for endangered languages in the linked data paradigm. A linked data approach to language resources requires existing, and preferably sizable, language r...
Article
Full-text available
This paper investigates the most frequent lexical bundle (LB) ka li kong (to-you-say) (KLK), in an 18.5-hour Taiwanese Southern Min conversation corpus. The analysis focuses on the discourse-pragmatic functions of KLK, the role it plays in the speaker’s management of information in talk-in-interaction, and the collocations that are employed. The re...
Chapter
This paper aims to explore a special type of idiomatic expressions of even length called Quadrasyllabic Idiomatic Expressions (QIEs) in Chinese, and explain their variations with reference to semantic and structural constraints on the elements imposed by the construction of QIEs on the one hand, and its interplay with individual semantic elements i...
Article
In Generative Lexicon Theory (GLT) (Pustejovsky 1995), co-composition is one of the generative devices proposed to explain the cases of verbal polysemous behavior where more than one function application is allowed. The English baking verbs were used as examples to illustrate how their arguments co-specify the verb with qualia unification. Some s...
Book
Full-text available
This monograph is a translation of two seminal works on corpus-based studies of Mandarin Chinese words and parts of speech. The original books were published as two pioneering technical reports by the Chinese Knowledge and Information Processing group (CKIP) at Academia Sinica in 1993 and 1996, respectively. Since then, the standard and PoS tagset propo...
Chapter
Full-text available
One of the biggest challenges in digital humanities is to extract, formulate, and represent knowledge from textual databases in a way that is relevant and accessible for other scientific disciplines. This challenge is critical for historical texts as these are often one of the few tangible sources for us to extract our knowledge from our past. Howev...
Article
Full-text available
In this paper, we present a proposed system designed for sentiment detection for micro-blog data in Chinese. Our system surprisingly benefits from the lack of word boundaries in the Chinese writing system and shifts the focus directly to larger and more relevant chunks. We use an unsupervised Chinese word segmentation system and binomial test to extract...
Conference Paper
Full-text available
Most corpus-based lexical studies require considerable efforts in manually annotating grammatical relations in order to find the collocations of the target word in corpus data. In this paper, we claim that the current techniques of natural language processing can facilitate lexical research by automating the annotation of these relations. Besides,...
Article
Full-text available
This article describes an approach to constructing a language resource through automatically sketching grammatical relations of words in an untagged corpus based on dependency parses. Compared to the handcrafted, rule-based Word Sketch Engine (Kilgarriff et al. 2004), this approach provides more details about the different syntagmatic usages of eac...
Article
Full-text available
Dictionaries are sociocultural objects that can be used as underlying structures for modeling in the cognitive sciences. We first show that lexical networks built from dictionaries, despite surface disagreement at the level of links, share a common topological structure. Assuming that...
Conference Paper
This paper aims to investigate degree modification in Mandarin through the case of creative degree modifier 各種 [gezhong] (all kinds of; very). We provide a theoretical analysis following the Generative Lexicon Theory and show that 各種 [gezhong] not only selects gradable adjectival predicates but also restricts the possible combinations as well as inte...
Chapter
Lexical semantics was deemed peripheral in formal linguistics’ early pursuit of a rule-based account of language since lexicon is viewed as the repository of idiosyncrasies and meaning is considered fuzzy and difficult to delineate. The papers collected in Levin and Pinker’s (1992) Lexical and Conceptual Semantics, however, reestablished the centra...
Article
While much attention has been paid to the complement coercion operation in English (e.g., began a book), the same phenomenon in Chinese is still under-researched. Our study examines twenty coercing verbs in Chinese, creating a coercion profile for each verb and conducting a cluster analysis based on the coercion profiles. The results suggest that s...
Conference Paper
Full-text available
We propose a language resource built by automatically sketching grammatical relations of words based on dependency parses from untagged texts. Compared to Sketch Engine (Kilgarriff, Rychly, Smrz, & Tugwell, 2004), the advantage of word sketches based on parsed corpora is that they provide more details about the different usage of each word such as various types...
Chapter
We discuss the development of a multilingual lexicon linked to the Suggested Upper Merged Ontology (SUMO) formal ontology. The ontology as well as the lexicon have been expressed in Web Ontology Language (OWL), as well as their original formats, for use on the semantic web and in linked data. We describe the Open Multilingual Wordnet (OMW), a multi...
Conference Paper
What determines the “basicness” of words still remains a challenging question in creating basic lexicons and basic wordlists. Since frequency and dispersion seem to be the most dominant criteria, the question arises whether contextual factors also help to define the concept of “basicness.” From the perspective of the distributional model, meaning...
Article
Full-text available
Semantic relations of different types have played an important role in wordnet, and have been widely recognized in various fields. In recent years, with the growing interest in constructing semantic networks in support of intelligent systems, automatic semantic relation discovery has become an urgent task. This paper aims to extract semantic relati...
Article
In Buddhist Digital Archives, there are three core elements — lexicon, content and catalog that represent the knowledge of Buddhist Scriptures. However, the close relationship among these three core elements has not been explicitly and systematically highlighted. This paper aims to propose a framework for the integration of cross-language Buddhist...
Conference Paper
Full-text available
The present study proposes an innovative way of expanding the lexical repository of Chinese Wordnet (CWN). Fine-grained as the senses and sense facets of its entries are, the current status of CWN fails to include such high-frequency words as àiqíng ('love') and such high-familiarity words as guānxīn ('to care') due to the lack of manpower. In vie...
Chapter
This chapter aims at creating a common standard for Asian language resources that is compatible with an international standard, the Lexical Markup Framework (LMF). In particular, it focuses on the following issues: lexical specification and data categories (DC) relevant for building multilingual lexical resources for some Asian languages, including Chines...
Conference Paper
Language change is a ubiquitous and inevitable phenomenon in daily usages, represented by both novel interpretations and usages of old words, as well as through the development of entirely new words called neologisms. This study aims to give a theoretical account of the prefix 微 [wéi] in Mandarin, which recently has extended its meanings by combining w...
Article
Full-text available
This study adopts a corpus-based computational linguistic approach to measure individual differences (IDs) in visual word recognition. Word recognition has been a cardinal issue in the field of psycholinguistics. Previous studies examined the IDs by resorting to test-based or questionnaire-based measures. Those measures, however, confined the resea...
Conference Paper
Full-text available
This study examines how different dimensions of corpus frequency data may affect the outcome of statistical modeling of lexical items. The corpus used in our analysis is an elderly speaker corpus in its early development, and the target words are temporal expressions, which might reveal how the speech produced by the elderly is organized. We conduc...
Article
Full-text available
KYOTO is an Asian-European project developing a community platform for modeling knowledge and finding facts across languages and cultures. The platform operates as a Wiki system that multilingual and multi-cultural communities can use to agree on the meaning of terms in specific domains. The Wiki is fed with terms that are automatically extracted f...
Conference Paper
Full-text available
This presentation introduces a Python module (PyCWN) for accessing and processing Chinese lexical resources. In particular, our focus is put on the Chinese Wordnet (CWN) that has been developed and released by the CWN group at Academia Sinica. PyCWN provides access to Chinese Wordnet (sense and relation data) under the Python environment. The prese...
Article
Full-text available
This document describes the preliminary release of the integrated Kyoto system for specific domain WSD. The system uses concept miners (Tybots) to extract domain-related terms and produces a domain-related thesaurus, followed by knowledge-based WSD based on wordnet graphs (UKB). The resulting system can be applied to any language with a lexica...
Article
Full-text available
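The design philosophy of Chinese WordNet (CWN) is to balance, within a complete knowledge system, the precise representation of word senses and sense relations with language technology applications. The distinction of Chinese word senses and the precise representation of the relations among senses must be grounded in linguistic theory, particularly lexical semantics. The discovery and verification of sense content and sense relations, in turn, must derive from actual corpus data. Our method combines analysis with corpus data. Beyond verification and exemplification, this combination mainly takes the form of sense tagging carried out in parallel on large amounts of corpus data, which feeds back into verification. Building a complete and robust knowledge system means attending to both the formal integrity of the ontology and the complete knowledge internal to the human language system.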
Conference Paper
Full-text available
The aim of this study is to use the word-space model to measure the semantic loads of single verbs, profile verbal lexicon acquisition, and explore the semantic information on Chinese resultative verb compounds (RVCs). A distributional model based on Academia Sinica Balanced Corpus (ASBC) with Latent Semantic Analysis (LSA) is built to investigate...
Article
Full-text available
In this paper we describe the data that will be used to compare the semantic structures that emerge from synonymy in French and in Mandarin. We aim at studying these semantic structures at both a global, lexicographic level, using lexicons, synonymy and translation dictionaries and at a more localised, experimental level, using data collected in...
Article
Full-text available
In analyzing the formation of a given compound, both its internal syntactic structure and semantic relations need to be considered. The Generative Lexicon Theory (GL Theory) provides us with an explanatory model of compounds that captures the qualia modification relations in the semantic composition within a compound, which can be applied to natura...
Article
Full-text available
In this paper we define a lexical metrology in graphs of verbal synonymy to compute the flexsemic score of speakers from their verbal productions in action denomination tasks. This flexsemic score is used to automatically categorize young children versus young adults. We show that this score is effective in French and in Mandarin.
Conference Paper
Full-text available
Modeling of semantic space is a new and challenging research topic both in cognitive science and linguistics. Existing approaches can be classified into two different types according to how the calculation is done: either a word-by-word co-occurrence matrix or a word-by-context matrix (Riordan 2007). In this paper, we argue that the existing popul...
Article
Full-text available
This paper reports a prototype multilingual query expansion system relying on LMF-compliant lexical resources. The system is one of the deliverables of a three-year project aiming at establishing an international standard for language resources which is applicable to Asian languages. Our important contributions to ISO 24613, standard Lexical Markup F...
Conference Paper
Full-text available
Lexical Markup Framework (LMF, ISO-24613) is the ISO standard which provides a common standardized framework for the construction of natural language processing lexicons. LMF facilitates data exchange among computational linguistic resources, and also promises a convenient uniformity for future application. This study describes the design and imple...
Conference Paper
Full-text available
Wiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. It requires however to be processed before being used efficiently as an NLP resource. After describing the relevant aspects of Wiktionary for our purposes, we focus on its structural properties. Then, we describe how we...
Conference Paper
Full-text available
This study proposes an approach to extract domain-specific words, and to distinguish the word senses with the aim of extending current WordNet architecture for domain applications. The domain-specific lexicon is compiled with a Wordnet-LMF format in compliance with ISO 24613 for the internationally collaborative KYOTO project. The findings and resul...
Article
Full-text available
Wiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. It requires however to be processed before being used efficiently as an NLP resource. After describing the relevant aspects of Wiktionary for our purposes, we focus on its structural properties. Then, we describe how we extracted s...
Article
Full-text available
In this paper we present an application fostering the integration and interoperability of computational lexicons, focusing on the particular case of mutual linking and cross-lingual enrichment of two wordnets, the ItalWordNet and Sinica BOW lexicons. This is intended as a case-study investigating the needs and requirements of semi-automatic integra...
Article
Full-text available
Although some traditional readability formulas have shown high predictive validity in the r = 0.8 range and above (Chall & Dale, 1995), they are generally not based on genuine linguistic processing factors, but on statistical correlations (Crossley et al., 2008). Improvement of readability assessment should focus on finding variables that truly rep...
Article
Lexical semantic relations have played an important role in the recent developments of Natural Language Processing and Computational Lexical Resources as well. This paper reviews the notion of lexical semantic relations in the WordNet-like lexical resources, and proposes a formal modeling of lexical semantic relations using the extended Formal Co...
Conference Paper
Full-text available
Wiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. It requires however to be processed before being used efficiently as an NLP resource. After describing the relevant aspects of Wiktionary for our purposes, we focus on its structural properties. Then, we describe how we extracte...