Semantic Annotation for Ukrainian: Categorization Scheme, Principles, and Tools

Vasyl Starko[0000-0002-2530-2107]
Ukrainian Catholic University, 2a Kozelnytska Street,
Lviv, 79026, Ukraine
v.starko@ucu.edu.ua
Abstract. A semantic tagset for the semantic annotation of Ukrainian-language
texts is presented, and the use of the taxonomic approach is substantiated. The
categorization scheme implemented in the tagset takes into account the
cognitive-linguistic perspective on categorization, specifically the basic level of
categorization. Semantic tags are to be assigned to lemmas in the existing Large
Electronic Dictionary of Ukrainian (VESUM) yielding a semantic lexicon that
will be used by the TagText tagger (both tools developed by the r2u team) to
add semantic annotation to the GRAC corpus. Used in conjunction with POS
tags, semantic tags will serve as a powerful tool for the linguistic exploration of
corpus data and for solving NLP tasks involving Ukrainian.
Keywords: semantic annotation, semantic lexicon, semantic tagset,
categorization, corpus, r2u, GRAC, VESUM, TagText.
1 Introduction
Semantic annotation is an important type of annotation of natural-language texts.
There are several different approaches to annotating texts with semantic labels: based
on WordNet [14], FrameNet [4], hierarchical classification [12], and taxonomic
classification [6], [7].
The ideographic, or hierarchical, approach to semantic tagging proceeds in a top-
down fashion, from the most general notions to the most specific terms. While this
system has its merits, it is not without its flaws. There is no single universal hierarchical
classification scheme, as is evidenced by discrepancies in the thesauri of different
languages. The top layers of any such system are quite abstract and beyond the
intuitive understanding of most users, while some categories appear to be not quite
coherent, for example A10+ Open; Finding; Showing [12]. These systems involve a
fine-grained semantic classification of vocabulary with numerous semantic features
that transcend POS boundaries. It has been argued [7] that such a purely semantic
approach is not well-suited for corpora for reasons of cumbersomeness, excessive
ambiguity, and counterintuitive groupings. For example, the feature ‘motion’ would
be assigned not only to verbs and deverbal nouns (to run, running) but also to
adjectives (quick), object names (feet, wheels), and so on. Some of these words, such
as road and smoke, would be highly surprising to an average corpus user.
Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Furthermore, hierarchical classification is dichotomous: a word may be assigned
to one category only, while taxonomic classification allows for flexible attribution of
several semantic features to a word. Taxonomic classification aligns well with
cognitive linguistic research showing that speakers perceive and classify words
relying on integral gestalts (a simultaneous, complex totality of their various features)
rather than on discrete features [7].
For the Ukrainian corpus [1] developed by Nataliia Darchuk and her colleagues, a
hybrid approach has been proposed involving a combination of taxonomic semantic
classification for non-scientific texts and the construction of ontologies for various
scientific domains to be applied to scientific texts [2]. However, the full specification
of the semantic tagset has yet to be made publicly available by this group. The web
interface to their corpus does not allow semantic searches so far, and semantically
tagged words can only be seen in frequency wordlists generated on their website.
In the absence of a large-scale resource, such as WordNet, for Ukrainian, which
would allow the attribution of specific sense to words in context, and given the
advantages of the taxonomic system, it is reasonable to pursue the latter route. Thus,
we propose a lexicon-based approach to semantic annotation for Ukrainian using
taxonomic semantic tags. The lexicon will be publicly available together with a
tagger, enabling researchers to semantically tag any Ukrainian-language texts.
While the proposed semantic annotation for Ukrainian has a predominantly
practical focus, its scientific value lies in the domain of semantic description and
formalization. It is, essentially, the first step in the direction of a formalized semantic
description of Ukrainian.
2 Purpose and Principles of Semantic Annotation
The present paper is focused on developing a semantic tagset, while the overarching
goal of the broader endeavor is to build a semantic lexicon for Ukrainian and add a
semantic annotation layer to the General Regionally Annotated Corpus of Ukrainian
(GRAC) [5]. With semantic tags in place, the user will be able to apply them
separately or in combination with POS tags in constructing search queries.
Researchers will be able to investigate, among other things, the linguistic behavior of
semantically motivated classes of words: their combinatorial properties, patterns of
government, usage and variation patterns, etc. An analysis of semantic phenomena,
such as meaning shifts, polysemy, and word sense disambiguation, will have a firmer
foundation. Semantic tags will also be useful for applications outside of corpus
linguistics, for example, named entity recognition and information extraction.
GRAC has a wide audience, from school students to foreign students of Ukrainian
to the NLP community and to seasoned linguists. Thus, the semantic annotation
scheme for Ukrainian must be accessible and transparent to users with varying
degrees of linguistic expertise. Moreover, it needs to be as unambiguous as possible.
With proper explanation, each semantic tag should leave no doubt in the user’s mind
as to its content. This will require taking into account the so-called linguistic, or
naïve, worldview possessed by the native speakers of the language.
The approach implemented in the Russian National Corpus [6], [7] served as a
point of departure for the system of semantic annotation presented here, although there
are significant differences in a number of aspects. Both systems employ the faceted,
rather than hierarchical, approach to classification, allowing for one word to be placed
in several taxonomic classes (assigned different semantic tags) at the same time,
rather than focusing narrowly on those that are relevant for genus-species relations.
Taxonomy-based semantic annotation must meet three requirements: it must be
generally understandable, linguistically meaningful [6], and cognitively motivated.
The names of semantic tags need to be transparent and intuitive. Alphanumeric
designations (such as those used in the USAS project, e.g. S1.2.1, which means
‘approachability and friendliness’) necessitate intimate knowledge of the
classification scheme and, at least initially, frequent lookup. In contrast, the names of
semantic tags for Ukrainian will be abbreviations of common English words (see
examples below).
The linguistic relevance of semantic classes means that words belonging to the
same semantic class should exhibit commonality of linguistic behavior by virtue of
such membership. Semantics is viewed as having surface manifestations and, more
precisely, as a motivating factor for surface lexical behavior. This view is widely
accepted in cognitive linguistics and has been convincingly advocated, among others,
by the prominent semantics scholar Anna Wierzbicka [13]. In selecting semantic
features and establishing the overall classification scheme for such multifaceted
feature attribution, it is important to consider what we know about human natural-
language categorization and its principles [3], [9].
As far as semantic features labeled by tags are concerned, they should be
independent and basic, form sufficiently large classes, generate a minimum of noise,
and yield optimal results when searching for constructions (using complex corpus
search queries) [7: 226]. Furthermore, the set of semantic tags needs to be constructed
in such a way as to yield a linguistically relevant classification for the entire
Ukrainian vocabulary and reflect, as much as possible, the semantic groupings
involved in regular patterns of lexico-grammatical interactions in Ukrainian texts. In
other words, semantic classes should be maximally homogeneous in terms of their
linguistic behavior, while at the same time remaining sufficiently broad for an
effective semantic description. A fine balance will need to be found between these
competing requirements.
It has been found that it is easier for corpus users to formulate queries based on
basic categories, i.e., basic groupings of words in a certain part of speech. For
example, the basic categories for nouns are humans, animals, plants, tools, food, etc.
Such classes are a reflection of the so-called “basic level” of human natural language
categorization. This level is populated by categories in which perceptual, behavioral,
and abstract features converge, and it plays an especially prominent role in human
categorization [11]. Basic semantic classes are not elementary and could theoretically
be further decomposed, but this would result in the loss of their privileged status [7].
For example, the English verb to melt and its Ukrainian equivalent танути denote a
change of state, which is a basic semantic class. Decomposing their meaning into
‘begin to be different’ would assign them to very broad low-level classes (inception,
being, and difference), which they would share with many other words (to start,
existence, different, etc.), but would miss the basic semantic class.
While moving one level up (to the superordinate level) or down (to the subordinate
level) from the basic level is quite effortless and is regularly carried out by the
speakers of the language, high levels of abstraction or, on the contrary, of detail are
predominantly in the realm of science. Therefore, in developing the classification
scheme for semantic annotation it is important to stay on, or close to, the basic level
of categorization and strive to keep semantic classes psychologically real. This is
highly relevant if semantic tags are to be used by a wide audience with uneven levels
of linguistic competence and different backgrounds. The flexible taxonomic approach
proposed here is compliant with the requirement of psychological and cognitive
reality: indeed, numerous words ordinarily combine features of several classes, for
example, the verb break means an action and a change of state, and it would be
suboptimal to force it into just one of these categories.
Different word classes (verbs, adjectives, and adverbs, as well as concrete nouns
and abstract nouns) have different sets of semantic tags. Nevertheless, several features
are used consistently across POS boundaries to mark lexical items with similar
meaning, for example, the tag smell applies equally to the verb пахнути ‘to smell
(good)’, the noun запах ‘smell’, and the adjective пахучий ‘fragrant’.
There are two independent parameters for semantic annotation expressing part-whole
(or element-set) relations (mereology) and geometrical or spatial properties of
objects (topology, e.g., container and surface). The list of topological tags may grow
if further studies of usage patterns point to other relevant topological features.
Taxonomic classification is not necessarily flat, and in our case, it is complemented
with elements of shallow hierarchical classification. For example, tools are
subdivided into instruments, devices, means of transportation, weapons, etc., and
qualities are divided into physical and abstract, while physical qualities are further
subdivided into form, sound, color, light, taste, smell, temperature, and weight.
Semantic tags are assigned to a list of lemmas, thereby creating a semantic lexicon.
In cases of ambiguity, i.e., when a lemma may have more than one set of semantic
tags due to being used in multiple senses, all such sets are listed in the semantic
lexicon, leaving the problem of semantic disambiguation for later stages. The
semantic description of a word in a semantic lexicon is constructed so as to achieve
certain explanatory and predictive power regarding various properties of this lexeme,
such as its collocation patterns, valency, derivational potential, etc.
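The way an ambiguous lemma carries several alternative tag sets can be pictured with a minimal Python sketch. The data structure, the names `LEXICON`, `tag_sets`, and `is_ambiguous`, and the sample entries are illustrative assumptions rather than the actual format of the lexicon; the tags themselves come from the tagset described in this paper.

```python
from typing import Dict, FrozenSet, List

TagSet = FrozenSet[str]
SemanticLexicon = Dict[str, List[TagSet]]

# Hypothetical entries for illustration; the real lexicon may record them differently.
LEXICON: SemanticLexicon = {
    # ручка has at least two senses: 'pen' (an instrument) and 'handle' (a part of one)
    "ручка": [frozenset({"tool:instr"}), frozenset({"part:tool:instr"})],
    # танути 'to melt' denotes a change of state
    "танути": [frozenset({"change_state"})],
}


def tag_sets(lemma: str, lexicon: SemanticLexicon = LEXICON) -> List[TagSet]:
    """Return every tag set listed for a lemma (an empty list if the lemma is unknown)."""
    return lexicon.get(lemma, [])


def is_ambiguous(lemma: str, lexicon: SemanticLexicon = LEXICON) -> bool:
    """A lemma is ambiguous when more than one tag set is listed for it."""
    return len(tag_sets(lemma, lexicon)) > 1
```

Keeping all tag sets side by side, without ranking them, mirrors the decision to leave disambiguation for later stages.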
To sum up, faceted classification involves semantically cohesive categories that are
flexibly combined as needed to best describe a given concept. This kind of
classification stands closest to the naïve worldview of the users and their
extralinguistic experience and is concordant with the principles of natural language
categorization. The categorization scheme and the choice of semantic tags need to
meet the criteria dictated by this approach and orientation.
3 Tools for Semantic Annotation
Each individual sense of a word singled out in explanatory dictionaries can potentially
require a distinct set of semantic features. Even though lexicographic descriptions and
semantic annotation differ in their respective objectives and the choice of semantic
features, explanatory dictionaries are the most useful source of semantic information
for the purposes of multifaceted semantic annotation. Other sources include
dictionaries of specialized vocabulary (scientific and technical terms, slang, etc.),
Wikipedia as a source of communal knowledge, and context in case of words lacking
a lexicographic description.
The semantic annotation of GRAC is a natural complement to its morphological
annotation. Morphological annotation enables the user to explore the corpus in a fairly
complex manner, constructing search queries with the use of the Corpus Query
Language (CQL) and employing a variety of grammar tags (POS and grammatical
features such as number, gender, case, tense, etc.) [5]. For example, one may search
for verbs in the past tense followed by singular nouns in the dative case within the
span of three words.
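The logic of such a query can be sketched in Python over a list of already tagged tokens. This is not CQL and not the GRAC interface: the token representation (a dictionary with the keys `pos`, `tense`, `number`, and `case`) and the function name are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]


def past_verb_then_dative_noun(tokens: List[Token], span: int = 3) -> List[Tuple[int, int]]:
    """Find (verb index, noun index) pairs: a past-tense verb followed,
    within `span` tokens, by a singular noun in the dative case."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.get("pos") == "verb" and tok.get("tense") == "past":
            # Look at the next `span` tokens after the verb
            for j in range(i + 1, min(i + 1 + span, len(tokens))):
                nxt = tokens[j]
                if (nxt.get("pos") == "noun"
                        and nxt.get("number") == "sing"
                        and nxt.get("case") == "dat"):
                    hits.append((i, j))
    return hits


# Він дав книжку братові 'He gave the book to (his) brother'
sentence = [
    {"form": "Він", "pos": "pron"},
    {"form": "дав", "pos": "verb", "tense": "past"},
    {"form": "книжку", "pos": "noun", "number": "sing", "case": "acc"},
    {"form": "братові", "pos": "noun", "number": "sing", "case": "dat"},
]
```

With the sample sentence, the verb дав and the dative noun братові are two tokens apart, so the pair is found with the default span but not with `span=1`.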
Morphological annotation in GRAC is based on two key components developed by
the r2u group: 1) VESUM, a large POS dictionary for Ukrainian currently containing
over 400,000 lemmas from which over 6 million wordforms are generated [8];
2) TagText, a POS tagger based on VESUM and also employing dynamic tagging for
complex out-of-vocabulary words, dates, numbers, and punctuation symbols [10]. For
a synthetic language, such as Ukrainian, it is convenient to add the semantic layer when
morphological (POS) annotation is already in place. This approach has several
advantages. First, it makes use of lemmatization, which means that semantic tags can
be added to lemmas instead of each individual wordform. Second, the TagText tagger
will be modified to include semantic tags and tag texts for GRAC in one go. Third,
semantic tags can be added incrementally and, for certain groups of words,
automatically to the same dictionary (VESUM) which is already used by the tagger.
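The first advantage, tagging lemmas rather than wordforms, can be sketched as follows. The two small dictionaries are hypothetical stand-ins for VESUM and the semantic lexicon, and `annotate` is an illustrative name; the sketch does not reproduce the TagText implementation and ignores wordforms that map to more than one lemma.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical stand-in for the wordform-to-lemma information in VESUM
WORDFORM_TO_LEMMA: Dict[str, str] = {
    "запах": "запах",
    "запаху": "запах",
    "запахом": "запах",
    "пахне": "пахнути",
}

# Hypothetical stand-in for the semantic lexicon: tags are stored once per lemma
LEMMA_TAGS: Dict[str, Set[str]] = {
    "запах": {"smell"},
    "пахнути": {"smell"},
}


def annotate(tokens: List[str]) -> List[Tuple[str, Optional[str], Set[str]]]:
    """Attach to each token its lemma and the semantic tags stored for that lemma."""
    result = []
    for token in tokens:
        lemma = WORDFORM_TO_LEMMA.get(token.lower())
        tags = LEMMA_TAGS.get(lemma, set()) if lemma is not None else set()
        result.append((token, lemma, tags))
    return result
```

Every inflected form of запах receives the tag smell from a single lexicon entry, which is what makes lemma-level tagging economical for a language with rich inflection.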
The semantic lexicon will be developed iteratively, enabling refinements of the
semantic tagset, semantic classes, and features assigned to individual words. We
envisage progression through the following stages:
1) developing the initial semantic tagset;
2) assigning semantic tags to the top 3,000 most frequent lemmas;
3) modifying the tagset if necessary;
4) semantic tagging of GRAC;
5) expanding the semantic lexicon, modifying the tagset, and tagging subsequent
versions of GRAC.
4 Semantic Tagset
As mentioned above, each part of speech has its own set of semantic tags. Nouns are
divided into the following large groups: conc (concrete nouns), abst (abstract nouns),
and prop (proper names). Concrete and abstract nouns are treated separately as shown
below. (For reasons of space, examples with glosses are provided selectively). Tags
are shown in boldface, followed by a description and examples.
The taxonomic classification of concrete nouns is as follows: hum (human beings),
hum:group (human groups, including groups of people based on ethnicity or place of
origin or residence), hum:kin (kinship terms), supernat (supernatural beings),
animal (animals), plant (plants), mushr (mushrooms), stuff (substances and
materials), loc (locations and spaces), build (buildings and constructions), vehicle
(vehicles), furnit (furniture), dish (plates, dishes, cups, and kitchen utensils), cloth
(clothes and footwear), food (food and drinks), org (organizations), event (events),
work (works, such as works of art and texts), and tool (tools and appliances in
general). Tools are further subdivided into tool:instr (tools, implements, e.g., ручка
‘pen’), tool:device (devices), tool:weapon (weapons), and tool:music (musical
instruments).
The tags hum and supernat can be quickly assigned based on a special mark (<) in
the VESUM declension codes, which corresponds to these two categories. Likewise,
animals are marked with <> and receive the tag animal. Thus, some 20,000 animate
nouns in the lexicon can be efficiently supplied with semantic tags this way.
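A minimal sketch of this mark-based assignment is given below. Only the two marks themselves come from the description above; the shape of the declension-code strings in the usage example and the function name `candidate_tags` are assumptions. Because the mark < covers both humans and supernatural beings, the sketch returns both tags as candidates, and the choice between them would still have to be made separately.

```python
from typing import Set


def candidate_tags(declension_code: str) -> Set[str]:
    """Derive candidate semantic tags from the animacy mark in a declension code."""
    # Check the animal mark first, because "<" is a prefix of "<>"
    if "<>" in declension_code:
        return {"animal"}
    if "<" in declension_code:
        return {"hum", "supernat"}
    return set()


# Hypothetical code strings, shown only to exercise the three branches
examples = {
    "/n20.<": candidate_tags("/n20.<"),
    "/n20.<>": candidate_tags("/n20.<>"),
    "/n20": candidate_tags("/n20"),
}
```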
Mereology for concrete nouns is represented by the overarching label part (e.g.,
середина ‘middle part’), which is broken down into the following lower-level
categories: part:hum (human body parts), part:animal (animal body parts),
part:plant (parts of plants), part:build (parts of buildings and constructions),
part:tool (parts of tools and appliances in general), part:tool:instr (parts of tools and
implements, e.g., ручка ‘handle’), part:tool:device (parts of devices, e.g., кнопка
‘button’), part:tool:weapon (parts of weapons), part:tool:music (parts of musical
instruments, e.g., дека ‘sounding board’), part:tool:furnit (parts of furniture, e.g.,
ніжка ‘leg’), part:tool:dish (parts of dishes, e.g., носик ‘spout’), part:tool:cloth (parts of
clothes and footwear, e.g., рукав ‘sleeve’). Four further tags describe mereology for concrete
nouns: quant (quanta, i.e., particles and portions of substance, e.g., крапля ‘drop’),
set (sets, e.g., букет ‘bouquet’), collect (collective nouns, e.g., меблі ‘furniture’,
селянство ‘peasantry’), and higher_class (classes at the superordinate level,
immediately above the basic level, e.g., рослина ‘plant’, засіб ‘tool’).
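Colon-delimited tags such as part:tool:instr lend themselves to simple prefix matching, so that a query for a broader class also retrieves its subclasses. The Python sketch below illustrates one possible way to do this; the function names are hypothetical, and nothing here is claimed about how the tagger or the corpus interface actually handles tag hierarchies.

```python
from typing import List


def matches(tag: str, query: str) -> bool:
    """True if `tag` is the queried class itself or one of its subclasses."""
    return tag == query or tag.startswith(query + ":")


def ancestors(tag: str) -> List[str]:
    """List the tag together with all of its broader classes, broadest first."""
    parts = tag.split(":")
    return [":".join(parts[:i]) for i in range(1, len(parts) + 1)]
```

For instance, `matches("part:tool:instr", "part")` holds, whereas `matches("part:tool:instr", "tool")` does not, which keeps parts of tools apart from the tools themselves.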
Concrete nouns have two topology tags, viz., container (containers, e.g., зала
‘hall’) and surface (surfaces, e.g., майдан ‘square’), and two evaluation tags, viz.,
posit (positive, e.g., герой ‘hero’) and negat (negative, e.g., вайло ‘sluggard’).
The taxonomy of abstract nouns is as follows: move (movement, motion, e.g., біг
‘running’, переставляння ‘rearrangement’), move:body (movement of a body part or
change of position, e.g., нахил ‘bend’), placing (placement of an object, e.g.,
завантаження ‘loading’, розташовування ‘placement; ordering’), impact (physical
impact, e.g., удар ‘strike’, зішкрябування ‘scraping off’), creat (creation of a
physical object, e.g., складання ‘putting together’, розроблення ‘development’),
destr (destruction, e.g., руйнація ‘ruination’, злам ‘breaking’), change_state (change
of state or quality, e.g., розігрів ‘warming up’, спрощення ‘simplification’), exist
(existence, e.g., буття ‘being’, відсутність ‘absence’), appear (beginning of
existence, e.g., постання ‘emergence’, народження ‘birth’), disappear (end of
existence, e.g., смерть ‘death’, скасування ‘cancellation’), loc (location or position,
e.g., урочище ‘natural landmark’, поза ‘pose’), loc:body (special body position, e.g.,
сидіння ‘sitting’), contact (contact and support, e.g., доторк ‘touch’, опертя
‘support’), poss (possession, e.g., придбання ‘acquisition’, втрата ‘loss’), ment
(mental domain, e.g., думка ‘thought’), percept (perception, e.g., погляд ‘look,
view’), psych (psychological domain, e.g., збудженість ‘excitement’), psych:emot
(emotion, e.g., смуток ‘sorrow’), psych:vol (volition, e.g., бажання ‘desire’),
speech (speech acts, e.g., обговорення ‘discussion’), physio (physiology, e.g., втома
‘fatigue’), weather (weather phenomena, e.g., дощ ‘rain’), sound (sound, e.g.,
дзенькіт ‘ring(ing)’), color (color, e.g., блакить ‘blue color, azure’), light (light,
e.g., промінь ‘ray’), taste (taste, e.g., кислинка ‘sour taste’), smell (smell, e.g.,
пахощі ‘fragrance’), tempr (temperature, e.g., прохолода ‘cool(ness)’), weight (weight, e.g.,
тягар ‘burden’), time (time, e.g., прийдешнє ‘future’), time:period (period, e.g.,
строк ‘period’), time:moment (moment in time, e.g., реченець ‘deadline’),
time:week (day of the week), time:month (month), time:age (age, e.g., молодість
‘youth’), quality:phys (physical quality, e.g., твердість ‘hardness’), quality:abst
(abstract quality, e.g., непередбачуваність ‘unpredictability’), quality:abst:hum
(abstract quality of a person, e.g., щедрість ‘generosity’), behave (human behavior
and acts, e.g., халатність ‘negligence’, сварка ‘quarrel’), interact (interaction and
relationships, e.g., допомога ‘help’, дружба ‘friendship’), event (event, e.g., збори
‘meeting’, фестиваль ‘festival’), disease (disease, e.g., грип ‘influenza’), game
(game, e.g., шашки ‘checkers’), sport (sports, e.g., гімнастика ‘calisthenics’),
param (parameter, e.g., довжина ‘length’), and unit (unit of measurement, e.g.,
секунда ‘second’).
Abstract nouns have three tags for mereology: part (part, e.g., початок
‘beginning’), quant (quanta, e.g., раз ‘(one) time’, момент ‘moment’), and set (sets,
e.g., система ‘system’). Two tags refer to evaluation: posit (positive, e.g., насолода
‘delight’) and negat (negative, e.g., вульгарність ‘vulgarity’).
Proper names receive four tags that are already implemented in the current system
for POS tagging: fname (first name, e.g., Тарас ‘Taras’), pname (patronymic, e.g.,
Григорович ‘Hryhorovych’), lname (last name, e.g., Шевченко ‘Shevchenko’), and
geo (geographical names, e.g., Дніпро ‘Dnieper’). Thus, a significant portion of the
lemmas (over 53,000 proper names) in VESUM already have all the tags that are
required for semantic annotation.
Adjectives are divided into nine taxonomic classes: size (size, e.g., широкий
‘wide’), dist (distance, e.g., сусідній ‘neighboring’), quantit (quantity, e.g.,
нечисленний ‘not numerous’), orient (orientation, direction, e.g., правий ‘right’,
прибережний ‘coastal’, зворотний ‘reverse’), time (time, e.g., майбутній ‘future’),
dur (duration, e.g., короткочасний ‘brief, short-lived’), age (age, e.g., старий
‘old’), speed (speed, e.g., меткий ‘quick, nimble’), and quality. Quality is not an
independent feature; rather, it is qualified as either physical or abstract and supplied
with a specific semantic tag: quality:phys (physical quality, e.g., липучий ‘sticky’),
quality:phys:form (form, e.g., вигнутий ‘bent’), quality:phys:sound (sound, e.g.,
дзвінкий ‘resounding’), quality:phys:color (color, e.g., зелений ‘green’),
quality:phys:light (light, e.g., яскравий ‘bright’), quality:phys:taste (taste, e.g.,
гіркий ‘bitter’), quality:phys:smell (smell, e.g., духмяний ‘fragrant’),
quality:phys:tempr (temperature, e.g., гарячий ‘hot’), quality:phys:weight (weight,
e.g., легкий ‘light’), quality:abst (abstract quality, e.g., непередбачуваний
‘unpredictable’), quality:abst:hum (abstract quality of a person, e.g., хитрий
‘cunning’), quality:abst:ment (abstract quality in the mental domain, e.g.,
зрозумілий ‘understandable’).
Adjectives have three additional tags (max, min, and absol ‘absolute’) that are
combined with the following tags: size, dist, quantit, dur, age, and speed. For
example, size:max (large size, e.g., довгий ‘long’), size:min (small size, e.g.,
низький ‘low’), and size:absol (absolute size, e.g., триметровий ‘three-meter
long’). Two evaluation tags are also applied: posit (positive, e.g., щасливий ‘happy’)
and negat (negative, e.g., нечесний ‘dishonest’).
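The combination of scalar classes with the three degree tags can be enumerated mechanically, as in the following Python sketch. The list and function names are illustrative assumptions; the quantity class is written quantit, as in the list of adjective classes above.

```python
from itertools import product
from typing import List

# Scalar adjective classes and the degree tags they combine with
SCALAR_TAGS = ["size", "dist", "quantit", "dur", "age", "speed"]
DEGREE_TAGS = ["max", "min", "absol"]


def combined_tags() -> List[str]:
    """Enumerate every scalar:degree combination, e.g. size:max."""
    return [f"{scale}:{degree}" for scale, degree in product(SCALAR_TAGS, DEGREE_TAGS)]


def is_valid_scalar_tag(tag: str) -> bool:
    """Accept a bare scalar tag or one of its combinations with a degree tag."""
    return tag in SCALAR_TAGS or tag in combined_tags()
```

Six scalar classes and three degree tags give 18 combined tags, so a check of this kind could catch a mistyped tag such as size:maximum during lexicon editing.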
Adverbs are divided into 12 large categories: place (location, e.g., тут ‘here’),
orient (orientation, direction, e.g., вниз ‘down’, праворуч ‘to/on the right’), dist
(distance, e.g., недалеко ‘not far’, високо ‘high’), quantit (quantity, e.g., трішки ‘a
little’, тричі ‘three times’), time (time, e.g., відтепер ‘from now on’), dur
(duration, e.g., недовго ‘not long’), speed (speed, e.g., поволі ‘slowly’), manner
(manner, e.g., по-українськи ‘in Ukrainian’, навприсядки ‘in a squatting position’),
degree (degree, e.g., достатньо ‘sufficiently’), cause (cause, e.g., спересердя ‘in anger’),
goal (purpose, e.g., навмисне ‘intentionally’), and quality. Many qualities have
essentially the same meaning whether expressed by adverbs or by adjectives. Thus,
adverbs receive the same tags as adjectives to describe qualities: quality:phys
(physical quality, e.g., рівно ‘smoothly’), quality:phys:form (form, e.g., криво
‘obliquely’), quality:phys:sound (sound, e.g., дзвінко ‘resoundingly’),
quality:phys:color (color, e.g., барвисто ‘colorfully’), quality:phys:light (light, e.g.,
яскраво ‘brightly’), quality:phys:taste (taste, e.g., гірко ‘bitterly’),
quality:phys:smell (smell, e.g., духмяно ‘fragrantly’), quality:phys:tempr (temperature,
e.g., холодно ‘coldly’), quality:phys:weight (weight, e.g., легко ‘lightly’), quality:abst
(abstract quality, e.g., непередбачувано ‘unpredictably’), quality:abst:hum (abstract
quality of a person, e.g., хитро ‘cunningly’), and quality:abst:ment (abstract quality in
the mental domain, e.g., зрозуміло ‘understandably’).
The categorization scheme for adverbs includes two additional tags (max and min)
that are combined with the tags dist, dur, speed, and quantit. For example, dist:max
(large distance, e.g., далеко ‘far away’) and dist:min (short distance, e.g., поблизу
‘near’). Similar to adjectives and nouns, adverbs have two evaluation tags: posit
(positive, e.g., щасливо ‘happily’) and negat (negative, e.g., нечесно ‘dishonestly’).
Verbs have a ramified semantic classification with few hierarchical elements: move
(movement, e.g., іти ‘to walk’, штовхати ‘to push’), move:body (movement of a
body part or change of position, e.g., нахилятися ‘to bend’), placing (placement of
an object, e.g., завантажити ‘to load’, розташувати ‘to place, to arrange’),
impact (physical impact, e.g., ударяти ‘to strike’, зішкрябувати ‘to scrape off’),
creat (creation of a physical object, e.g., складати ‘to put together’, розробляти ‘to
develop’), destr (destruction, e.g., руйнувати ‘to destroy’, зламати ‘to break’),
change_state (change of state or quality, e.g., розігрівати ‘to warm up’, спростити
‘to simplify’), exist (existence, e.g., бути ‘to be’), appear (beginning of existence,
e.g., народитися ‘to be born’), disappear (end of existence, e.g., померти ‘to die’,
скасувати ‘to cancel’), loc (location or position, e.g., покласти ‘to put’), loc:body
(special body position, e.g., сидіти ‘to sit’), contact (contact and support, e.g.,
торкатися ‘to touch’, спиратися ‘to rest on’), poss (possession, e.g., придбати ‘to
acquire’, втратити ‘to lose’), ment (mental domain, e.g., думати ‘to think’),
percept (perception, e.g., дивитися ‘to look’, побачити ‘to see’), psych
(psychological domain, e.g., турбуватися ‘to be concerned’), psych:emot (emotion,
e.g., засмучуватися ‘to be sad’), psych:vol (volition, e.g., бажати ‘to desire’),
speech (speech acts, e.g., обговорювати ‘to discuss’), behave (human behavior and
acts, e.g., дражнитися ‘to tease’), physio (physiology, e.g., втомлюватися ‘to
become tired’), weather (weather phenomenon, e.g., дощити ‘to rain’), sound
(sound, e.g., дзенькнути ‘to tinkle’), color (color, e.g., біліти ‘to become white’),
light (light, e.g., засвітитися ‘to light up’), taste (taste, e.g., гірчити ‘to taste
bitter’), smell (smell, e.g., духмяніти ‘to smell pleasant’), caus (causative verbs, e.g.,
скласти ‘to put smth together’), and noncaus (non-causative verbs, e.g.,
повертатися ‘to return’).
While semantic tagsets are developed separately for each part of speech, it is
important to uniformly tag similar semantic content. For example, physical qualities
are consistently assigned the same tags for nouns and adjectives: sound, color, light,
etc. Abstract nouns and verbs also share a number of semantic tags: move, placing,
impact, etc. This way, a search query can be flexibly formulated to either zero in on a
specific part of speech with a given tag or retrieve multiple parts of speech referring
to the same physical quality.
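The two query modes can be sketched in Python as a filter over tagged tokens. The `Token` structure, the sample data, and the `search` function are illustrative assumptions and do not represent the GRAC query interface; the example words are those used in this paper for the tags smell and sound.

```python
from typing import List, NamedTuple, Optional, Set


class Token(NamedTuple):
    lemma: str
    pos: str
    sem: Set[str]


# Hypothetical tagged tokens for illustration
CORPUS = [
    Token("пахнути", "verb", {"smell"}),
    Token("запах", "noun", {"smell"}),
    Token("пахучий", "adj", {"smell"}),
    Token("дзенькіт", "noun", {"sound"}),
]


def search(tokens: List[Token], sem_tag: str, pos: Optional[str] = None) -> List[str]:
    """Return lemmas carrying `sem_tag`, optionally restricted to one part of speech."""
    return [
        t.lemma
        for t in tokens
        if sem_tag in t.sem and (pos is None or t.pos == pos)
    ]
```

Calling `search(CORPUS, "smell")` retrieves the verb, the noun, and the adjective together, while `search(CORPUS, "smell", pos="noun")` narrows the result to запах.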
5 Conclusion
The problem of semantic annotation for Ukrainian, specifically for the GRAC corpus,
can be resolved using the taxonomic approach and relying on insights from human
natural language categorization. The semantic tagset proposed here can be applied to
create a semantic lexicon by assigning semantic tags to individual lemmas in
VESUM, a POS dictionary for Ukrainian. TagText, the POS tagger for Ukrainian by
the r2u group, can then be modified to carry out semantic tagging of texts.
Complex queries enabled by a combination of morphological (POS) and semantic
tagging can be a powerful tool for corpus studies, linguistic research, and a variety of
NLP applications. In particular, it has the potential to enhance the search functionality
of the GRAC corpus and open up opportunities for the study of semantic classes in
Ukrainian.
6 References
1. Corpus of the Ukrainian Language, available at
http://www.mova.info/corpus.aspx
2. Darchuk, N.P.: Mozhlyvosti semantychnoyi rozmitky korpusu ukrainskoyi movy
(KUM) [Possibilities of the Semantic Markup of the Corpus of the Ukrainian
Language (KUM)]. In: Naukovyi chasopys Natsionalnoho pedahohichnoho
universytetu im. M.P. Drahomanova. Seriya 9: Suchasni tendentsiyi rozvytku
mov, Vypusk 15, pp. 18-28. (2017) (in Ukrainian)
3. Evans, V., Green, M.: Cognitive Linguistics. An Introduction. Edinburgh. (2006)
4. FrameNet, available at http://framenet.icsi.berkeley.edu
5. Shvedova, M., von Waldenfels, R., Yarygin, S., Kruk, M., Rysin, A., Starko, V.,
Woźniak, M.: GRAC: General Regionally Annotated Corpus of Ukrainian. Kyiv,
Oslo, Jena. (2017-2020), available at uacorpus.org.
6. Kustova G.I., Lyashevskaya O.N., Paducheva E.V., Rakhilina E.V.:
Semanticheskaya razmetka leksiki v natsionalnom korpuse russkogo jazyka:
printsipy, problemy, perspektivy [Semantic Markup of Vocabulary in the Russian
National Corpus: Principles, Problems and Perspectives]. In: Nationalnyi korpus
russkogo yazyka: 2003-2005 [Russian National Corpus: 2003-2005], Moskva, pp.
155-174. (2005) (in Russian)
7. Rakhilina E.V., Kustova G.I., Lyashevskaya O.N., Reznikova T.I., Shemanaeva
O. Ju.: Zadachi i printsipy semanticheskoy razmetki leksiki v NKRJa [The
Objectives and Principles of Semantic Markup of Vocabulary in the Russian
National Corpus]. In: Nationalnyi korpus russkogo yazyka. Novyie rezultaty i
perspektivy [Russian National Corpus. New Results and Perspectives]. Sankt-
Peterburg, pp. 215-239. (2009) (in Russian)
8. Starko, V.: Kompiuterni linhvistychni proekty hurtu r2u: stan i zastosuvannia
[Computational Linguistic Projects of the r2u Group: Progress and Applications].
In: Ukrainska mova, No. 3, pp. 86-100. (2017) (in Ukrainian)
http://nbuv.gov.ua/UJRN/Ukrm_2017_3_9
9. Starko, V.: Paradyhma kohnityvnoyi linhvistyky j problema katehoryzatsiyi
[Paradigm of Cognitive Linguistics and the Problem of Categorization]. In:
Naukovi zapysky [Natsionalnoho universytetu “Ostrozka akademiya”] Seriya:
Filolohichna, Vyp. 48, pp. 113-116. (2014) (in Ukrainian)
http://nbuv.gov.ua/UJRN/Nznuoaf_2014_48_37.
10. Starko, V.: Rozviazannia kompiuternolinhvistychnykh zavdan zasobamy hurtu
r2u [Solving Computational Linguistic Problems with Tools Developed by the r2u
Group]. In: U poshukakh harmoniyi movy, Kyiv, pp. 367-373. (2020) (in
Ukrainian) https://r2u.org.ua/data/other/Klymenko_2020/Klymenko_2020.pdf
11. Taylor, J.R.: Linguistic Categorization. 2nd Ed., Oxford. (1995)
12. UCREL Semantic Analysis System, available at http://ucrel.lancs.ac.uk/usas
13. Wierzbicka, A.: Lexicography and Conceptual Analysis. Tucson. (1995)
14. WordNet, available at http://wordnet.princeton.edu