About
74
Publications
11,134
Reads
223
Citations
Publications (74)
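Corpora are generally classified by the region, period, language, genre, and register of the collected language material; by annotation level, such as phonetics, morphology, syntax, semantics, and discourse; or by the specific research question they address. A more fundamental but less discussed question is whether a corpus is built and used to serve scholars' linguistic research or to serve computational modeling of language. From the perspective of corpus research and application, this paper divides corpus research into three levels: full-text retrieval, quantitative linguistics, and language intelligence. It focuses on the journals, conferences, and published resources related to the latter two, to help researchers determine their research paradigm, build and use suitably annotated corpora, and achieve interdisciplinary integration and application.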
A large number of classic texts handed down from ancient times (the pre-Qin and Han dynasties) to the present day are valuable and urgently need processing and mining. Word segmentation and part-of-speech (POS) tagging, as lexical analysis, form the basic work of ancient Chinese information processing.
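0. Introduction. Corpora are an important basic resource for linguistic research. Throughout the history of corpus development, advances in computer technology have driven corpus construction and research ever deeper. The rise of corpora benefited from progress in computer technology: the conversion of paper texts into electronic texts greatly facilitated the storage and computation of language. Linguistic research takes language material as its object. Before electronic corpora appeared, card-based excerpting and counting was already one of the basic methods of linguistic research, and it can be seen as the prototype of modern corpus methods. The emergence of large-scale electronic corpora opened up a broader space for linguistic research. As research needs expanded, corpus research became more fine-grained and diverse, and the types of corpora grew richer. To date, corpora have gone through three stages of development. In the 1960s, the typical first-generation electronic corpus was the BROWN corpus, which, apart from annotating the metadata of the raw material, such as author, time of writing,...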
Abstract Meaning Representation (AMR) has emerged as a prominent area of research in sentence-level semantic parsing within the field of natural language processing in recent years. Substantial progress has been made in various NLP subtasks through the application of AMR. This paper presents the third Chinese Abstract Meaning Representation Parsing Evaluation, he...
Kinship is an important issue in history studies. The kinship database is the key resource to analyze the structure, succession, and evolution of families. However, one kinship could be expressed by different words, and one kinship word may be vague and ambiguous in natural languages, especially in pre-modern Chinese. As in the well-known China Bio...
Most natural language processing (NLP) tasks operationalize an input sentence as a sequence with token-level embeddings and features, despite its clausal structure. Taking abstract meaning representation (AMR) parsing as an example, recent parsers are empowered by transformers and pre-trained language models, but long-distance dependencies (LDDs) i...
Most natural language processing (NLP) tasks suffer performance degradation when encountering long complex sentences, such as semantic parsing, syntactic parsing, machine translation, and text summarization. Previous works address the issue with an intuition of decomposing complex sentences and linking simple ones, such as RST-style discourse parsi...
Graph neural networks (GNNs) have achieved remarkable success in structured prediction, owing to the GNNs’ powerful ability in learning expressive graph representations. However, most of these works learn graph representations based on a static graph constructed by an existing parser, suffering from two drawbacks: (1) the static graph might be erro...
A Joint Model of Automatic Sentence Segmentation and Punctuation for Ancient Classical Texts Based on Deep Learning
[Purpose/Significance] From the perspective of digital humanities, character-based full-text retrieval is not sufficient for linguistic, historical and cultural research on Chinese classical books. Word segmentation has been applied to construct the ancient knowledgebase of the classics. However, the texts with word segmentation are very few,...
A Preliminary Study on the Stereoscopic Distance Reading of Classical Poetry with Reference to Multi-source Data: Take Emperor Qianlong's More than Forty Thousand Imperial Poems as an Example
Compound sentences, which contain at least two independent clauses, account for a large proportion of natural language, especially in Chinese. Therefore, the discrimination and relation recognition of compound sentences are essential to text understanding. Based on the presence of connective words, compound sentences can be categorized as...
The traditional studies of Chinese dialects focus on phonetics, phonology and vocabulary, while the studies of grammar such as function words still need to be well conducted. Function words are important means of expressing grammatical meaning in a language, and they are the key features for distinguishing different dialects. In addition, the curre...
Featured Application
Semantic dependency parsing could be applied in many downstream tasks of natural language processing, including named entity recognition, information extraction, machine translation, sentiment analysis, question generation, question answering, etc.
Abstract
Higher-order information brings significant accuracy gains in semantic...
Metaphor is very common in natural language, which conveys deep meaning beyond the literal meaning. Metaphor computation is still a challenge in NLP tasks. In this paper, we introduce the conventional metaphor theories, and the computational models and approaches, as well as resource constructions. Meanwhile, we find that metaphor theories fail to...
Higher-order features bring significant accuracy gains in semantic dependency parsing. However, modeling higher-order features with exact inference is NP-hard. Graph neural networks (GNNs) have been demonstrated to be an effective tool for solving NP-hard problems with approximate inference in many graph learning tasks. Inspired by the success of G...
In Mandarin Chinese, when the noun head appears in the context, a quantity noun phrase can be reduced to a quantity phrase with the noun head omitted. This phrase structure is called elliptical quantity noun phrase. The automatic recovery of elliptical quantity noun phrase is crucial in syntactic parsing, semantic representation and other downstrea...
Abstract meaning representations (AMRs) represent sentence semantics as rooted labeled directed acyclic graphs. Though there is a strong correlation between the AMR graph of a sentence and its corresponding dependency tree, recent neural network AMR parsers neglect the exploitation of dependency structure information. In this paper, we explore a nove...
Chinese anaphora resolution technology has been widely used in many natural language processing tasks, such as machine translation, information extraction and automatic text summarization. In this paper, we first introduce the resources for anaphora resolution, and then present the existing works on Chinese noun phrase resolution based on machine l...
In this paper, a survey is done to introduce the named entity recognition task in Chinese medical text and its practical significance. First, the existing datasets for the named entity recognition task of Chinese medical text are presented, then the survey is given on the algorithms for this task, mainly from the perspectives on matching and sequen...
Noun phrases reflect people’s understanding of the world entities and play an important role in people’s language system, conceptual system and application system.
With the Chinese “的(de)” structure, attributive noun phrases of the combined type can accommodate more words and syntactic structures, resulting in rich levels and complex semantic struc...
The Shiji (史記, Records of the Grand Historian) is of great value for Chinese history before 90 BCE. Many online databases provide character-based search of the Shiji. We go beyond simple search by creating a word-based open-access database of the Basic Annals (本纪) of the Shiji that allows the exploration of relationships between persons and the rela...
Pre-Qin ancient Chinese (PQAC) plays an important role in the history of Chinese development. In previous works, most research is focused on sense explanation, while few can show the general vocabulary and the conceptual system characteristic of Pre-Qin Dynasty. In this paper, we construct a preliminary wordnet for Pre-Qin ancient Chinese (named PQ...
The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging and named entity recognition. To avoid the error accumulation in the pipeline processing, this paper proposes a joint approach to sentence segmentation and lexical analysis. The BiLSTM-CRF neural network model...
Semantic role labeling (SRL) is a fundamental task in Chinese language processing, but there are three major problems about the construction of SRL corpora. First, disagreements occurred in previous studies over the definition and number of semantic roles. Second, it is hard for static predicate frames to cover dynamic predicate usages. Third, it i...
Imagery is one of the core elements in understanding and appreciating ancient poetry. The lack of imagery data leads to subjective research in traditional imagery theory. Some quantitative studies have recently been proposed, but such studies lack annotated corpora. This paper reports the construction of a richly annotated imagery corpus compiled...
Bin Li, Yuan Wen, Li Song, [...], Nianwen Xue
Abstract Meaning Representation (AMR) is a meaning representation framework in which the meaning of a full sentence is represented as a single-rooted, acyclic, directed graph. In this article, we describe an ongoing project to build a Chinese AMR (CAMR) corpus, which currently includes 10,149 sentences from the news-group and weblog portion of the Chinese T...
What is language? Is there a quick introduction to linguistics? What do linguists do? Can a robot talk? Why is it so hard for Chinese people to learn English? How does a baby learn a language? Why is there language? How does it evolve over time? Why does English dominate our world? Why does Chinese have so many characters?...
Yuan Wen, 宋丽, Taizhong Wu, [...], Weiguang Qu
The non-projective structure of a dependency tree refers to the phenomenon that the word nodes in the dependency tree are out of order relative to the word sequence of the original sentence. It not only has a great influence on the semantic parser but is also of great value to linguistic theory. The non-projective structures are found in the dependency tre...
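To make the notion above concrete: non-projectivity is commonly detected by looking for crossing dependency arcs. The following is a minimal sketch of that standard check, not the method of the paper. The `heads` encoding (one head index per word, with 0 standing for an artificial root placed before the sentence) and both function names are assumptions made for this example.

```python
def crossing_arcs(heads):
    """Return pairs of dependency arcs that cross each other.

    heads[i] is the head position of word i+1; 0 denotes an artificial root
    placed before the first word (an assumption of this sketch).
    """
    # Store every arc as (left, right) over word positions 0..n.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    crossings = []
    for i, (l1, r1) in enumerate(arcs):
        for (l2, r2) in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                crossings.append(((l1, r1), (l2, r2)))
    return crossings


def is_projective(heads):
    """A tree is projective when no two of its arcs cross."""
    return not crossing_arcs(heads)


# Word 2 is the root; words 1 and 3 attach to it: no crossing arcs.
print(is_projective([2, 0, 2]))
# Arc 1-3 crosses arc 2-4, so this tree is non-projective.
print(is_projective([3, 4, 0, 3]))
```

The pairwise comparison is quadratic in sentence length, which is usually acceptable for corpus statistics on single sentences.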
Bin Li, WEN Yuan, SONG Li, [...], Nianwen Xue
As a new sentence-level meaning representation, Abstract Meaning Representation (AMR) uses a rooted acyclic directed graph to represent the meaning of a sentence. A large AMR bank has been constructed for English, but the concepts of an AMR graph are not aligned to the words in a sentence, which artificially increases the difficulty in manual annot...
The Chinese historical classic Zuo Zhuan is of great value for studying the history of 722-468 BC. The persons in the text and the places they have been to are typical topics in the study of historical persons and events. However, traditional full-text retrieval is not sufficient for such studies, because either a person or a place us...
The loanwords from Chinese in Japanese are always the object of linguistics studies. This paper annotates the words in the Japan Etymology Dictionary, to obtain the data of the earliest year, the part of speech, the proportion of Chinese characters and the similarity with Chinese words. The results show that, most of the loanwords emerge in Nara, H...
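This book summarizes the author's recent research and practice on the cognitive properties of words. Centering on the computer representation and computation of word meaning, it distinguishes encyclopedic knowledge, linguistic knowledge, and common sense, and uses cognitive properties and affective frames to characterize meaning that leans toward subjective emotion. The author collected and proofread 230,000 Chinese "word-property" pairs with probability information from the Internet, developed a visual retrieval system, analyzed the similarities and differences between English and Chinese cognitive properties, and applied cognitive properties to word similarity computation and to the explanation and verification of the adverb-noun construction and prototype theory, obtaining a series of new findings and insights.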
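[Objective] To verify how segmentation consistency and corpus category affect the efficiency of CRFs word segmentation for Middle Chinese, and on this basis to further improve segmentation efficiency and reduce the workload of manual proofreading. [Methods] Taking Middle Chinese historical books, Buddhist scriptures, and novels as examples, and addressing the automatic word segmentation of Middle Chinese, we optimize the segmentation principles and combine the CRFs model with a dictionary to eliminate the inconsistencies that tend to occur in manual segmentation of Middle Chinese. We also introduce two features, character classification and dictionary information, into CRFs segmentation, and select the most suitable segmentation template for each feature through comparative experiments. [Results] The overall F-score of segmentation exceeds 99% in the closed test and reaches 89%-95% in the comprehensive open test. [Limitations] The study of segmentation inconsistency mainly targets two-character words, so recognition of words of three or more characters (multi-character words) is slightly weaker. [Conclusions] Provided that segmentation consistency is effectively improved, the character classification and dictionary tag features can effectively improve the accuracy of CRFs segmentation of Middle Chinese. In addition, the [...] proposed in this paper...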
The Princeton WordNet® (PWN) is a widely used lexical knowledge database for semantic information processing. There are now many wordnets under creation for languages worldwide. In this paper, we endeavor to construct a wordnet for Pre-Qin ancient Chinese (PQAC), called PQAC WordNet (PQAC-WN), to process the semantic information of PQAC. In previou...
The “ADV+N” construction is a special phenomenon in Mandarin Chinese. It is widely accepted that nouns with descriptive semantic features are more likely to be used in the construction. However, the notion of descriptive semantic features is used loosely in previous works. How many semantic features a noun has and how strong the relationship be...
The evolution of the Chinese vocabulary is an indispensable part of research on the history of the Chinese language, and is the basis for clarifying the origins of contemporary Chinese vocabulary. For lack of a high-quality, large-scale diachronic corpus, the overall evolutionary process of the Chinese vocabulary is hard to demonst...
Cultural Revolution vocabulary is one object of sociolinguistic research. This paper constructs the Cultural Revolution Corpus based on automatic word segmentation and part-of-speech tagging. Combining qualitative and quantitative analysis, we analyze the top 1,000 words with the highest TF-IDF values and the highest word frequencies during the Cultural...
People often use similes of the pattern "as adjective as noun" to express their feelings on web media. The adjective in the pattern is generally the salient property and strong impression of the noun entity in the speaker's mind. By querying the simile templates from search engines, we construct a large database of "noun-adjective" items in English an...
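The "as adjective as noun" template described above can be illustrated with a small pattern matcher. This is only a sketch under stated assumptions: it scans text that is already in memory rather than querying a search engine as the abstract describes, and the regular expression, the `SIMILE` name, and the `extract_pairs` function are hypothetical choices for this example.

```python
import re
from collections import Counter

# Assumed English template: "as ADJ as (a|an|the)? NOUN".
# It does no part-of-speech checking, so idioms such as
# "as soon as possible" would also match and need filtering.
SIMILE = re.compile(
    r"\bas\s+([a-z]+)\s+as\s+(?:an?\s+|the\s+)?([a-z]+)\b",
    re.IGNORECASE,
)


def extract_pairs(texts):
    """Count (noun, adjective) pairs found by the simile template."""
    counts = Counter()
    for text in texts:
        for adj, noun in SIMILE.findall(text):
            counts[(noun.lower(), adj.lower())] += 1
    return counts


sample = [
    "He is as brave as a lion.",
    "She was as busy as a bee, and as brave as a lion.",
]
for (noun, adj), n in extract_pairs(sample).most_common():
    print(noun, adj, n)
```

Counting pairs rather than listing them makes it easy to rank the most salient properties of each noun by frequency.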
Pre-Qin Chinese plays a key role in the history of Chinese. However, for lack of an annotated corpus, the overall picture of Pre-Qin Chinese vocabulary is still not clear. This paper introduces a corpus of 25 Pre-Qin classical texts with manual word segmentation and part-of-speech tagging. Then, the character and word frequencies are ca...
Liu Liu, Bin Li, Lijun Bu, [...], Xiaohe Chen
Words' property of times is an important type of additional meaning which represents the spirit of times. People get the information of times from words by their own experience, but automatic recognition by computers is still difficult. This paper proposes a method of automatic recognition of the property of times based on large-scale corpus, which...
Measuring word similarity is a fundamental issue in NLP, and the measuring procedure usually relies on dictionaries or corpora. However, when figurative or metaphorical usages are considered, the situation becomes more complicated. Therefore, based on the Chinese-English bilingual lexical cognitive property knowledgebase we constructed, we design s...
This document describes three systems for calculating the semantic similarity between two Chinese words. One is based on machine-readable dictionaries (MRDs), and the others use both MRDs and a corpus. The systems were evaluated on SemEval-2012 Task 4: Evaluating Chinese Word Similarity.
Cognitive properties of words are very useful in figurative language understanding, language acquisition and translation. To overcome the subjectivity and low efficiency in manual construction of such database, we propose a web-based method for automatic collection and analysis of cognitive properties. The method employs simile templates to query t...
Every language has its own culture background, thus it is difficult to translate or retrieve figurative expressions across languages. Based on the metaphoric cognition and feature analysis theory, we collect data from the web to construct the Chinese-English bilingual lexical cognitive property knowledgebase linked to "HowNet". By comparing the dif...
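Semantic Analysis and Computation of Verb-Object Collocations (《动宾搭配的语义分析和计算》) integrates cognitive linguistic theory, corpus linguistics, language computing theory, and machine learning techniques to study the selectional restrictions between verbs and objects. Based on large-scale data, it systematically annotates and statistically analyzes the diversity and the differences in strength of the selectional restrictions that different verbs place on their objects. From the perspectives of subjectivity and cognitive event frames, it reveals the deep constraints on the diversity of verb-object collocations and describes the orientation of commendatory and derogatory word senses. Using selectional restrictions, it carries out automatic comprehension experiments on the metonymic tenors of verb-object collocations and verifies the validity of metaphor theory, and its experiments on the automatic recognition of simile sentences achieve good results.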
In this paper we argue for a word-sense based formalization for collocation, and propose a seed-based approach for collocation extraction for specific purposes. The approach uses the RFR_SUM model to iteratively classify polysemous word senses in the corpus. The collocation strength is also obtained by RFR. To capture the syntactic relation inside coll...
Selectional Preferences (SPs) in verb-object (VO) constructions have been widely used in NLP applications, such as WSD, metaphor comprehension, etc. To estimate the number of verbs that have strong SPs, 38,119 VO types of 1,462 verbs are extracted from "Modern Chinese Cihai" and tagged in the HowNet sense inventory with an automatic tagging algorithm. The sta...
Tao Xu, Weiguang Qu, Xuri Tang, [...], Hui Li
This paper proposes a novel approach for word similarity computation based on word sense vectors. The word sense vector is built using HIT-IR Tongyici Cilin (extended) for concept generalization and is further modified by the use of relative and absolute frequency filters. Experiments show that the approach not only overcomes the problem of similar...
This paper proposes an integrated approach for personal name recognition (PNR) in Chinese by utilizing both statistical language models and categorized linguistic knowledge. Various formulas are proposed for calculating personal name credibility and context credibility for different types of personal names. Experiments are conducted on large-scale co...
In this paper, we introduce our work on SemEval-2012 task 5: Chinese Semantic Dependency Parsing. Our system is based on MSTParser and two effective methods are proposed: splitting sentence by punctuations and extracting last character of word as lemma. The experiments show that, with a combination of the two proposed methods, our system can impr...
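The two preprocessing ideas named in this abstract, splitting a sentence at punctuation and taking the last character of a word as its lemma, can be sketched as follows. This is an illustration only and not the system's actual code: the punctuation set `PUNCT` and the function names are assumptions, and how the resulting segments are fed to MSTParser is not covered.

```python
# Assumed set of clause-boundary punctuation marks; the set actually
# used in the system is not given in the abstract.
PUNCT = set("，。；：？！,;:?!")


def split_by_punctuation(tokens):
    """Split a token list into segments, each ending at a punctuation token."""
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in PUNCT:
            segments.append(current)
            current = []
    if current:
        # Keep a trailing segment that has no closing punctuation.
        segments.append(current)
    return segments


def last_char_lemma(word):
    """Use the final character of a word as a coarse lemma feature."""
    return word[-1] if word else word


tokens = ["我", "喜欢", "苹果", "，", "他", "喜欢", "香蕉", "。"]
for segment in split_by_punctuation(tokens):
    print([(w, last_char_lemma(w)) for w in segment])
```

In Chinese compounds the final character often carries the head meaning, which is one plausible reason a last-character feature could help a parser generalize across rare words.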