Khuyagbaatar Batsuren

National University of Mongolia | NUM · Department of Computer and Information Technology

Doctor of Philosophy

About

25
Publications
4,117
Reads
138
Citations
Citations since 2016
22 Research Items
134 Citations

Publications (25)
Conference Paper
Full-text available
Large-scale morphological databases provide essential input to a wide range of NLP applications. Inflectional data is of particular importance for morphologically rich (agglutinative and highly inflecting) languages, and derivations can be used, e.g. to infer the semantics of out-of-vocabulary words. Extending the scope of state-of-the-art multilin...
Preprint
Full-text available
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data i...
Preprint
Full-text available
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian,...
Preprint
Full-text available
This paper describes a method to enrich lexical resources with content relating to linguistic diversity, based on knowledge from the field of lexical typology. We capture the phenomenon of diversity through the notions of lexical gap and language-specific word and use a systematic method to infer gaps semi-automatically on a large scale. As a first...
Preprint
Full-text available
Metonymy is regarded as a universally shared cognitive phenomenon; as such, humans are taken to effortlessly produce and comprehend metonymic senses. However, experimental studies on metonymy have been focused on Western societies, and the linguistic data backing up claims of universality has not been large enough to provide conclusive evidence. We...
Preprint
Full-text available
The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what 'good generalisation' entails and how it should be evaluated is not well understood, nor are there any common standards to evaluate it. In this paper, we aim to lay the groundwork to improve both of these issues. We present a taxonomy for...
Preprint
Full-text available
In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that - especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations - deeper results analysis should become the de-facto stand...
Conference Paper
Full-text available
Traditional bilingual dictionaries, once pivotal translation tools, have been superseded on the Web by multilingual lexical databases that interconnect the lexicons of hundreds of languages, built for both human and computational uses. A close look at the structure of such databases reveals, however, a form of linguistic bias, namely an inbuilt pre...
Preprint
Full-text available
The Universal Knowledge Core (UKC) is a large multilingual lexical database with a focus on language diversity and covering over a thousand languages. The aim of the database, as well as its tools and data catalogue, is to make the somewhat abstract notion of diversity visually understandable for humans and formally exploitable by machines. The UKC...
Article
Full-text available
We present CogNet, a large-scale, automatically-built database of sense-tagged cognates—words of common origin and meaning across languages. CogNet is continuously evolving: its current version contains over 8 million cognate pairs over 338 languages and 35 writing systems, with new releases already in preparation. The paper presents the algorithm...
Chapter
Full-text available
Lexical similarity data, quantifying the “proximity” of languages based on the similarity of their lexicons, has been increasingly used to estimate the cross-lingual reusability of language resources, for tasks such as bilingual lexicon induction or cross-lingual transfer. Existing similarity data, however, originates from the field of comparative...
Preprint
Full-text available
As the role of algorithmic systems and processes increases in society, so does the risk of bias, which can result in discrimination against individuals and social groups. Research on algorithmic bias has exploded in recent years, highlighting both the problems of bias, and the potential solutions, in terms of algorithmic transparency (AT). Transpar...
Preprint
Full-text available
Mitigating bias in algorithmic systems is a critical issue drawing attention across communities within the information and computer sciences. Given the complexity of the problem and the involvement of multiple stakeholders, including developers, end-users and third-parties, there is a need to understand the landscape of the sources of bias, and the...
Conference Paper
Full-text available
This paper presents the Mongolian Wordnet (MOW) and a general methodology for constructing it from various sources, e.g. lexical resources and expert translations. As of today, the MOW contains 23,665 synsets, 26,875 words, 2,979 glosses, and 213 examples. The manual evaluation of the resource estimated its quality at 96.4%.
Conference Paper
Full-text available
This paper introduces CogNet, a new, large-scale lexical database that provides cognates—words of common origin and meaning—across languages. The database currently contains 3.1 million cognate pairs across 338 languages using 35 writing systems. The paper also describes the automated method by which cognates were computed from publicly available w...
Conference Paper
Full-text available
We present a large scale multilingual lexical resource, the Universal Knowledge Core (UKC), which is organized like a Wordnet with, however, a major design difference. In the UKC the meaning of words is represented not only with synsets but also using language independent concepts which cluster together the synsets which, in different languages, co...
Conference Paper
Full-text available
The main goal of this paper is to describe a general approach to the problem of understanding linguistic phenomena, as they appear in lexical semantics, through the analysis of large scale resources, while exploiting these results to improve the quality of the resources themselves. The main contributions are: the approach itself, a formal quantitat...
Conference Paper
Full-text available
Our goal is the construction, maintenance and evolution of a large-scale multilingual lexico-semantic resource, called UKC (for Universal Knowledge Core). Unlike previous approaches, where similar resources were built by experts, often on the basis of a corpus of reference documents, the UKC is constructed via crowdsourcing. We see languag...
Article
Full-text available
Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We...
Article
Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biomedical text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. In this study, we take a ste...

Projects (2)
Project
The goal is to study, model, and adapt to diversity as it appears in language and knowledge, as the basis for developing systems that live in an open world.
Project
Develop all the languages of the world, with a focus on minority languages, and use them to develop culture- and diversity-aware knowledge.