
Yang Xu - University of Toronto
Publications: 98
Reads: 15,828
Citations: 1,136
Publications (98)
A key function of the lexicon is to express novel concepts as they emerge over time through a process known as lexicalization. The most common lexicalization strategies are the reuse and combination of existing words, but they have typically been studied separately in the areas of word meaning extension and word formation. Here, we offer an informa...
Morality is central to social well-being and cognition, and moral lexicon is a key device for human communication of moral concepts and experiences. How was the moral lexicon formed? We explore this open question and hypothesize that words evolved to take on abstract moral meanings from concrete and grounded experiences. We test this hypothesis by...
A defining property of human language is the creative use of words to express multiple meanings through word meaning extension. Such lexical creativity is manifested at different timescales, ranging from language development in children to the evolution of word meanings over history. We explored whether different manifestations of lexical creativit...
Moral norms vary across cultures. A recent line of work suggests that English large language models contain human-like moral biases, but these studies typically do not examine moral variation in a diverse cultural setting. We investigate the extent to which monolingual English language models contain knowledge about moral norms in different countri...
Slang is a common type of informal language, but its flexible nature and paucity of data resources present challenges for existing natural language systems. We take an initial step toward machine generation of slang by developing a framework that models the speaker’s word choice in slang context. Our framework encodes novel slang meaning by relatin...
Humans can make moral inferences from multiple sources of input. In contrast, automated moral inference in artificial intelligence typically relies on language models with textual input. However, morality is conveyed through modalities beyond language. We present a computational framework that supports moral inference from natural images, demonstra...
The lexicon is an evolving symbolic system that expresses an unbounded set of emerging meanings with a limited vocabulary. As a result, words often extend to new meanings. Decades of research have suggested that word meaning extension is non-arbitrary, and recent work formalizes this process as cognitive models of semantic chaining whereby emerging...
Automated moral inference is an emerging topic of critical importance in artificial intelligence. The contemporary approach typically relies on language models to infer moral relevance or moral properties of a concept. This approach demands complex parameterization and costly computation, and it tends to disconnect with existing psychological accou...
A key function of the lexicon is to express novel concepts as they emerge over time through a process known as lexicalization. The most common lexicalization strategies are the reuse and combination of existing words, but they have typically been studied separately in the areas of word meaning extension and word formation. Here we offer an informat...
Theorists have argued that morality builds on several core modular foundations. When do different moral foundations emerge in life? Prior work has explored the conceptual development of different aspects of morality in childhood. Here, we offer an alternative approach to investigate the developmental emergence of moral foundations through the lexic...
Categorization is ubiquitous in human cognition and society, and shapes how we perceive and understand the world. Because categories reflect the needs and perspectives of their creators, no category system is entirely objective, and inbuilt biases can have harmful social consequences. Here we propose methods for measuring biases in hierarchical sys...
Humans often make creative use of words to express novel senses. A long-standing effort in natural language processing has been focusing on word sense disambiguation (WSD), but little has been explored about how the sense inventory of a word may be extended toward novel meanings. We present a paradigm of word sense extension (WSE) that enables word...
Semantic change is attested commonly in the historical development of lexicons across the world's languages. Extensive research has sought to characterize regularity in semantic change, but existing studies have typically relied on manual approaches or the analysis of a restricted set of languages. We present a large-scale computational analysis to...
Scientific progress, or scientific change, has been an important topic in the philosophy and history of science. Previous work has developed quantitative approaches to characterize the progression of science in different fields, but how individual scientists make progress through their careers is not well understood at a comprehensive scale. We cha...
Humans can flexibly extend word usages across different grammatical classes, a phenomenon known as word class conversion. Noun-to-verb conversion, or denominal verb (e.g., to Google a cheap flight), is one of the most prevalent forms of word class conversion. However, existing natural language processing systems are impoverished in interpreting and...
The meaning of a slang term can vary in different communities. However, slang semantic variation is not well understood and under-explored in the natural language processing of slang. One existing view argues that slang semantic variation is driven by culture-dependent communicative needs. An alternative view focuses on slang's social functions sug...
Languages vary considerably in syntactic structure. About 40% of the world's languages have subject-verb-object order, and about 40% have subject-object-verb order. Extensive work has sought to explain this word order variation across languages. However, the existing approaches are not able to explain coherently the frequency distribution and evolu...
Gender associations have been a long‐standing research topic in psychological and social sciences. Although it is known that children learn aspects of gender associations at a young age, it is not well understood how they might emerge through the course of development. We investigate whether gender associations, such as the association of dresses w...
Gender associations have been a long-standing research topic in psychological and social sciences. Although it is known that children learn aspects of gender association at a young age, it is not well understood how they might emerge through the course of development. We investigate whether gender associations, such as the association of dresses wi...
Slang is a predominant form of informal language making flexible and extended use of words that is notoriously hard for natural language processing systems to interpret. Existing approaches to slang interpretation tend to rely on context but ignore semantic extensions common in slang word usage. We propose a semantically informed slang interpretati...
In lexicalist linguistic theories, argument structure is assumed to be predictable from the meaning of verbs. As a result, the verb is the primary determinant of the meaning of a clause. In contrast, construction grammarians propose that argument structure is encoded in constructions (or form-meaning pairs) that are distinct from verbs. Decades of...
Contextualized word embeddings have demonstrated state-of-the-art performance in various natural language processing tasks including those that concern historical semantic change. However, language models such as BERT were trained primarily on contemporary corpus data. To investigate whether training on historical corpus data improves diachronic sem...
Significance
Grammatical marking of features such as number, tense, and evidentiality varies widely across languages. Despite this variation, we show that grammatical markers support efficient information transfer from speakers to listeners. We apply a formal model of communication to data from dozens of languages and find that grammatical marking...
Humans possess the unique ability to communicate emotions through language. Although concepts like anger or awe are abstract, there is a shared consensus about what these English emotion words mean. This consensus may give the impression that their meaning is static, but we propose this is not the case. We cannot travel back to earlier periods to s...
Natural language relies on a finite lexicon to express an unbounded set of emerging ideas. One result of this tension is the formation of new compositions, such that existing linguistic units can be combined with emerging items into novel expressions. We develop a framework that exploits the cognitive mechanisms of chaining and multimodal knowledge...
Morality plays an important role in social well-being, but people's moral perception is not stable and changes over time. Recent advances in natural language processing have shown that text is an effective medium for informing moral change, but no attempt has been made to quantify the origins of these changes. We present a novel unsupervised framew...
As the numbers of submissions to conferences grow quickly, the task of assessing the quality of academic papers automatically, convincingly, and with high accuracy attracts increasing attention. We argue that studying interpretable dimensions of these submissions could lead to scalable solutions. We extract a collection of writing features, and con...
The use of euphemisms is a known driver of language change. It has been proposed that women use euphemisms more than men. Although there have been several studies investigating gender differences in language, the claim about euphemism usage has not been tested comprehensively through time. If women do use euphemisms more, this could mean that women...
Transformer language models have shown remarkable ability in detecting when a word is anomalous in context, but likelihood scores offer no information about the cause of the anomaly. In this work, we use Gaussian models for density estimation at intermediate layers of three language models (BERT, RoBERTa, and XLNet), and evaluate our method on BLiM...
Functionalist accounts of language suggest that forms are paired with meanings in ways that support efficient communication. Previous work on grammatical marking suggests that word forms have lengths that enable efficient production, and work on the semantic typology of the lexicon suggests that word meanings represent efficient partitions of seman...
Slang is a common type of informal language, but its flexible nature and paucity of data resources present challenges for existing natural language systems. We take an initial step toward machine generation of slang by developing a framework that models the speaker's word choice in slang context. Our framework encodes novel slang meaning by relatin...
Overextension—the phenomenon that children extend known words to describe referents outside their vocabulary—is a hallmark of lexical innovation in early childhood. Overextension is a subject of extensive inquiry in linguistics and developmental psychology, but there exists no coherent formal account of this phenomenon. We develop a general computa...
Semantic shifts can reflect changes in beliefs across hundreds of years, but it is less clear whether trends in fast-changing communities across a short time can be detected. We propose semantic coordinates analysis, a method based on semantic shifts, that reveals changes in language within publications of a field (we use AI as an example) across a sh...
We present a methodological framework for inferring symmetry of verb predicates in natural language. Empirical work on predicate symmetry has taken two main approaches. The feature-based approach focuses on linguistic features pertaining to symmetry. The context-based approach denies the existence of absolute symmetry but instead argues that such i...
Word class flexibility refers to the phenomenon whereby a single word form is used across different grammatical categories. Extensive work in linguistic typology has sought to characterize word class flexibility across languages, but quantifying this phenomenon accurately and at scale has been fraught with difficulties. We propose a principled meth...
We explore how linguistic categories extend over time as novel items are assigned to existing categories. As a case study we consider how Chinese numeral classifiers were extended to emerging nouns over the past half century. Numeral classifiers are common in East and Southeast Asian languages, and are prominent in the cognitive linguistics literat...
Developing moral awareness in intelligent systems has shifted from a topic of philosophical inquiry to a critical and practical issue in artificial intelligence over the past decades. However, automated inference of everyday moral situations remains an under-explored problem. We present a text-based approach that predicts people's intuitive judgmen...
Languages differ qualitatively in their numeral systems. At one extreme, some languages have a small set of number terms, which denote approximate or inexact numerosities; at the other extreme, many languages have forms for exact numerosities over a very large range, through a recursively defined counting system. Why do numeral systems vary as they...
Lexical semantic typology has identified important cross-linguistic generalizations about the variation and commonalities in polysemy patterns, that is, how languages package up meanings into words. Recent computational research has enabled investigation of lexical semantics at a much larger scale, but little work has explored lexical typology across seman...
Chinese dynastic histories form a large continuous linguistic space of approximately 2000 years, from the 3rd century BCE to the 18th century CE. The histories are documented in Classical (Literary) Chinese in a corpus of over 20 million characters, suitable for the computational analysis of historical lexicon and semantic change. However, there is...
In natural language, multiple meanings often share a single word form, a phenomenon known as colexification. Some sets of meanings are more frequently colexified across languages than others, but the source of this variation is not well understood. We propose that cross-linguistic variation in colexification frequency is non-arbitrary and reflects...
We present a text-based framework for investigating moral sentiment change of the public via longitudinal corpora. Our framework is based on the premise that language use can inform people's moral perception toward right or wrong, and we build our methodology by exploring moral biases learned from diachronic word embeddings. We demonstrate how a pa...
One way that languages are able to communicate a potentially infinite set of ideas through a finite lexicon is by compressing emerging meanings into words, such that over time, individual words come to express multiple, related senses of meaning. We propose that overarching communicative and cognitive pressures have created systematic directionalit...
Human language relies on a finite lexicon to express a potentially infinite set of ideas. A key result of this tension is that words acquire novel senses over time. However, the cognitive processes that underlie the historical emergence of new word senses are poorly understood. Here, we present a computational framework that formalizes competing views o...
Significance
How do words develop new senses? Unlike changes in sound or grammar where there are rich formal characterizations, semantic change is poorly understood. Changes in meaning are often considered intractable, with sparse attempts at formalizing and evaluating the principles against historical data at scale. We present a data-enriched form...
Crosslinguistic research on domains including kinship, color, folk biology, number, and spatial relations has documented the different ways in which languages carve up the world into named categories. Although word meanings vary widely across languages, unrelated languages often have words with similar or identical meanings, and many logically poss...
Previous research has proposed an adaptive cue combination view of the development of human spatial reorientation (Newcombe & Huttenlocher, 2006), whereby information from multiple sources is combined in a weighted fashion in localizing a target, as opposed to being modular and encapsulated (Hermer & Spelke, 1996). However, no prior work has formal...
Humans are experts at face individuation. Although previous work has identified a network of face-sensitive regions and some of the temporal signatures of face processing, as yet, we do not have a clear understanding of how such face-sensitive regions support learning at different time points. To study the joint spatio-temporal neural basis of face...
The Sapir‐Whorf hypothesis holds that human thought is shaped by language, leading speakers of different languages to think differently. This hypothesis has sparked both enthusiasm and controversy, but despite its prominence it has only occasionally been addressed in computational terms. Recent developments support a view of the Sapir‐Whorf hypothe...
What forces have shaped the evolution of the lexicon? Languages evolve under the pressure of having to communicate an unbounded set of ideas using a finite set of linguistic structures. This suggests why the transmission of ideas should be compressed such that one word will develop multiple senses. Previous theory also suggests how a word might dev...
The Sapir-Whorf hypothesis holds that our thoughts are shaped by our native language, and that speakers of different languages therefore think differently. This hypothesis is controversial in part because it appears to deny the possibility of a universal groundwork for human cognition, and in part because some findings taken to support it have not...
Semantic categories in the world’s languages often reflect a historical process of chaining: A name for one referent is extended to a conceptually related referent, and from there on to other referents, producing a chain of exemplars that all bear the same name. The beginning and end points of such a chain might in principle be rather dissimilar. T...
Semantic categories in the world's languages often reflect a historical process of chaining: A name for one idea is extended to a conceptually related idea, and from there on to other ideas, producing a chain of concepts that all bear the same name. The beginning and end points of such a chain might in principle be conceptually rather dissimilar. T...
Humans are remarkably proficient at categorizing visually-similar objects. To better understand the cortical basis of this categorization process, we used magnetoencephalography (MEG) to record neural activity while participants learned, with feedback, to discriminate two highly-similar, novel visual categories. We hypothesized that although prefront...
Humans are remarkably proficient at categorizing visually-similar objects. To better understand the cortical basis of this categorization process, we used magnetoencephalography (MEG) to record neural activity while participants learned–with feedback–to discriminate two highly-similar, novel visual categories. We hypothesized that although prefront...
Identifying brain regions with high differential response under multiple experimental conditions is a fundamental goal of functional imaging. In many studies, regions of interest (ROIs) are not determined a priori but are instead discovered from the data, a process that requires care because of the great potential for false discovery. An additional...
Magnetoencephalography (MEG) enables a noninvasive interface with the brain that is potentially capable of providing movement-related information similar to that obtained using more invasive neural recording techniques. Previous studies have shown that movement direction can be decoded from multichannel MEG signals recorded in humans performing wri...
We present a cluster-based decoding algorithm for discovering regions of interest (ROIs) from EEG/MEG data in source space (or optimal cluster of sources) and predicting multiple conditions in a single experimental trial. Our algorithm automatically identifies contiguous brain regions that yield maximum mean test statistics from hypothesis tests ov...
Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained. We present an R/Bioconductor port of a fast novel algorithm for Bayesian agglomerative hierarchical clustering an...
Figure 4. Gene clustering dendrogram for the NASC data.
Figure 3. Condition clustering dendrogram for the NASC data.
GO annotations for BHC clusters. Statistically significantly over-represented GO annotations for BHC clusters (Bonferroni-corrected p-value < 0.05)
GO annotations for agglomerative hierarchical clustering. Statistically significantly over-represented GO annotations for clusters manually identified from agglomerative hierarchical clustering (Bonferroni-corrected p-value < 0.05)
LeafDisparity values for the NASC experiments. The BHC clustering dendrogram is compared to a standard hierarchical method using uncentred correlation coefficients and complete linkage.
Figure 2. Gene clustering dendrogram of a subset of the Ideker et al. data, showing leaf harmony values
Table 1. Speed trial of the BHC algorithm. Trials were based on the NASC data (880 genes, 31 features), clustering over genes. In each case, the data were duplicated or a subset of genes taken as appropriate to get the required number of genes and features. All trials were run on a single 2 GHz CPU core on a MacBook Pro laptop.
Table 2. Data discretisation for NASC experiment clustering
Table 3. Data discretisation for NASC gene clustering
BHC cluster membership.
The Dirichlet process mixture (DPM) is a widely used model for clustering and for general nonparametric Bayesian density estimation. Unfortunately, like in many statistical models, exact inference in a DPM is intractable, and approximate methods are needed to perform efficient inference. While most attention in the literature has been placed on...