64 “Lexicography and Language Documentation”
PROCEEDINGS OF ASIALEX 2021
ADAPTING WORD SKETCHES FOR SPECIALIZED KNOWLEDGE EXTRACTION
Antonio San Martín, Catherine Trekker
University of Quebec in Trois-Rivières, Canada
antonio.san.martin.pizarro@uqtr.ca; catherine.trekker-seguin@uqtr.ca
Abstract
Word sketches (WSs) in Sketch Engine have become a basic tool in terminology work. They bring out
patterns of term behavior that would be too time-consuming to identify manually. Most default English
WS columns in Sketch Engine extract words with a frequent syntactic relationship with the search word
in a corpus (e.g., the nouns usually functioning as the subject of a given verb). The usefulness of the
default syntactic WSs for collocational analysis is evident, but their contribution to specialized knowledge
extraction is less straightforward.
This paper presents a work in progress consisting of adapting the default WSs for specialized knowledge
extraction. In previous work, we developed the contextonymic WS and semantic WSs, which specifically
target specialized knowledge extraction. This paper explores two changes to the default WSs. The first change
enables WSs to extract nouns functioning as subject and object in the same sentence (e.g., fertilizer>yield:
fertilizer increases yield; fertilizer improves yield), which usually corresponds to an agent-patient relation.
The other change concerns the extraction of the adjectives that modify a noun. This involves modifications
to two of the existing WS columns that extract adjectives and the addition of a new type.
We evaluated the precision of these adaptations in a specialized corpus of English texts on Agronomy.
Additionally, we compared their output with terminological definitions of a set of terms to assess their
usefulness for specialized knowledge extraction. The results indicate that WS columns of nouns functioning
as subject and object in the same sentence are sufficiently accurate and potentially useful for specialized
knowledge extraction. However, the results for the adjectival WS columns are inconclusive.
Keywords: word sketches, corpus analysis, specialized knowledge extraction, Sketch Engine
1 Introduction
One of the most useful features of Sketch Engine (https://www.sketchengine.eu/) (Kilgarriff et al., 2014)
is the generation of word sketches (WSs), which have become a basic tool in terminology work. A WS is
a one-page summary of a search word’s most common usage patterns in a given corpus. It lists the words
that are syntactically related to the search word in the corpus and includes a link to the corresponding
concordances. Some examples of WS columns are the verbs having the search word as subject or object,
the modiers of the search word, or the words that the search word modies (Figure 1).
Figure 1. Default WS columns of maple in the enTenTen18 corpus
Since these word behavior patterns are too time-consuming to identify manually, WSs significantly facilitate
collocational analysis. However, corpus analysis is useful for extracting not only linguistic information
but also conceptual knowledge. This is especially true when it comes to elaborating terminological
definitions (in which the conceptual content that terms convey is described) or building conceptual networks
(in which concepts are interconnected through conceptual relations). For these tasks, the usefulness of the
WSs that Sketch Engine generates by default is less straightforward.
In previous work, we proposed new types of WS specifically developed for specialized knowledge
extraction: (i) contextonymic WS (see 1.2.1); (ii) semantic WSs (see 1.2.2). This paper presents a work in
progress consisting of adapting the default English WS for the same purpose. More specifically, we explore
the creation of two WS columns that extract the relation between the subject and the object of the same
sentence and the grouping of different columns that extract the adjectives that qualify a noun. This adapted
version of the default WS would eventually become part of a single WS that specifically targets specialized
knowledge extraction along with the contextonymic WS and the semantic WS.
The rest of the article is organized as follows. In the remainder of this section, we explain how WSs
are generated and describe our previous work on creating WSs for specialized knowledge extraction.
Section 2 focuses on how the new WS columns were developed and evaluated. Section 3 presents the
evaluation results. Finally, in Section 4, we analyze the results and draw some conclusions.
1.1 Word sketch generation
Sketch Engine generates WSs by matching patterns expressed as rules in the Corpus Query Language (CQL)
(Jakubíček et al., 2010). A CQL rule is composed of tokens in the form of attributes (part-of-speech
tag, lemma, word form, etc.) and values combined with regular expressions. For instance, the rule
“[tag="J.*"]{2} [lemma="technology"]” captures all the instances of technology preceded by two
adjectives. Figure 2 reproduces some matching concordances in our specialized English corpus on
Agronomy (see section 2 for details on the corpus).
Figure 2. Concordances illustrating the rule “[tag="J.*"]{2} [lemma="technology"]”
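To make the logic of this rule concrete, the following minimal Python sketch emulates it over a list of tagged tokens. It is an illustration only, not how Sketch Engine evaluates CQL: the Token type, the function name, and the Penn-style tags in the sample are assumptions introduced here, and the sketch assumes that a CQL value must match the whole attribute.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """A tagged token (hypothetical structure for this illustration)."""
    word: str
    lemma: str
    tag: str  # assumed Penn-style tag, e.g. "JJ" for adjectives


def match_two_adjectives_before(tokens: List[Token],
                                lemma: str = "technology") -> List[List[str]]:
    """Emulate [tag="J.*"]{2} [lemma="technology"] on a token list."""
    adjective = re.compile(r"J.*")
    hits = []
    for i in range(len(tokens) - 2):
        first, second, head = tokens[i:i + 3]
        # Both preceding tokens must carry an adjective tag (whole-value match),
        # and the third token must have the requested lemma.
        if (adjective.fullmatch(first.tag)
                and adjective.fullmatch(second.tag)
                and head.lemma == lemma):
            hits.append([first.word, second.word, head.word])
    return hits


# Invented sample sentence, not taken from the Agronomy corpus.
sample = [
    Token("new", "new", "JJ"),
    Token("agricultural", "agricultural", "JJ"),
    Token("technologies", "technology", "NNS"),
    Token("reduce", "reduce", "VVP"),
    Token("costs", "cost", "NNS"),
]

print(match_two_adjectives_before(sample))
```

Because the lemma attribute is matched rather than the word form, the plural technologies is captured as well.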
For WS generation, the rules intended to capture the same syntactic relation are grouped into a gramrel (for
“grammatical relation”). For instance, to identify the relation between verbs and their objects, the gramrel
“objects of X/verbs with X as object” (included in the default sketch grammar) is composed of three rules
(Figure 3).
The set of gramrels that produce a WS constitutes a sketch grammar (Figure 4). For example, the default
English sketch grammar contains 40 rules organized into 25 gramrels. The number of WS columns can be
greater than the number of gramrels because a dual gramrel produces two columns (e.g., “objects of X/
verbs with X as object”, which results in one column for the verbs, and another for the nouns).
Since sketch grammars are text files containing CQL rules grouped in gramrels, it is possible to modify
or expand them by integrating new rules or adapting or deleting existing ones. Users can compile their
own corpora with the sketch grammar of their choice in Sketch Engine. This allows the creation of sketch
grammars adapted to different corpus needs.
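The relation between rules, gramrels, and WS columns described above can be pictured with a small data model. This is a sketch under stated assumptions, not the actual sketch grammar file format: the Gramrel class, its fields, the placeholder rule strings, and the unary gramrel in the example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Gramrel:
    """A named group of CQL rules (hypothetical model, not the file format)."""
    name: str
    rules: List[str] = field(default_factory=list)
    dual: bool = False  # a dual gramrel yields one column per side of "/"

    def columns(self) -> List[str]:
        if self.dual:
            return [part.strip() for part in self.name.split("/")]
        return [self.name]


# Placeholder rule strings stand in for real CQL rules.
grammar = [
    Gramrel("objects of X/verbs with X as object",
            rules=["<rule 1>", "<rule 2>", "<rule 3>"], dual=True),
    Gramrel("hypothetical unary relation of X", rules=["<rule 4>"]),
]

n_rules = sum(len(g.rules) for g in grammar)
n_gramrels = len(grammar)
n_columns = sum(len(g.columns()) for g in grammar)
print(n_rules, n_gramrels, n_columns)
```

In this toy grammar, four rules are organized into two gramrels, which produce three WS columns because one of the gramrels is dual.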
1.2 Sketch grammars for specialized knowledge extraction
The default sketch grammar in Sketch Engine is mainly based on syntactic co-occurrence. In other words,
it lists words that appear in the same context as the search word and which maintain a syntactic relationship
with it (Evert, 2009, p. 1222). This type of co-occurrence is of great importance for collocational analysis,
and WSs were designed with this end in mind (Kilgarriff & Tugwell, 2001). The usefulness of syntactic co-
occurrence for specialized knowledge extraction is less straightforward because the relevance of syntactic
relations for conceptual analysis varies. For instance, the WS listing the modifiers of a noun may include,
among others, adjectives indicating entrenched hyponyms (e.g., “urban farmer”, “rural farmer”), possible
features (e.g., “risk-averse farmer”, “successful farmer”), or adjectives that are not of interest for
conceptual analysis (e.g., “other farmer”, “same farmer”). Syntactic co-occurrence can therefore be exploited
for specialized knowledge extraction, but the default English sketch grammar needs to be specially adapted
for that purpose.
Figure 3. “Objects of X/verbs with X as object” gramrel, its resulting WS columns and concordances from the enTenTen18 corpus
Syntactic co-occurrence lies halfway along a continuum, with surface co-occurrence (the least constraining
kind) at one end and semantic co-occurrence (the most constraining kind) at the other end. Surface
co-occurrence, on which the contextonymic sketch grammar (see 1.2.1) is based, occurs when two words
appear in the same context without the need for any syntactic or semantic relationship (Evert, 2009, p.
1215). For instance, in “Because glyphosate is systemic, excess residue levels can persist…”, glyphosate
and residue would be surface co-occurrents (or contextonyms) (as well as glyphosate and because, is,
systemic, etc.), even if they do not establish a direct syntactic or semantic relation. As for semantic
co-occurrence, two words are said to co-occur if a semantic relationship is established between them in a
given context (e.g., hyponymy, meronymy, cause, etc.). For instance, in “Glyphosate is the only herbicide
that kills…”, glyphosate and herbicide are semantic co-occurrents because there is a hyponymic relation
between them in that context.
Figure 4. Example of the structure of a sketch grammar
The boundary between these three types of co-occurrence is fuzzy. Surface co-occurrence is generally
based on a window of tokens. In contrast, syntactic and semantic co-occurrences are detected by patterns.
Although the proposed adaptation of the default WSs in this paper is mostly based on syntactic
co-occurrence, some semantic components are also introduced. Before describing the proposed adaptations to
the default grammar, we briefly present the contextonymic WS and the semantic WSs since the proposed
modifications complement both.
1.2.1 Contextonymic WS
Extracting the contextonyms of a term can help to determine its semantic features (San Martín, in press).
The contextonymic WS was developed for extracting specialized knowledge for definition writing (San
Martín, 2016). Contextonym extraction can be based on various parameters (window span, exclusion of
certain parts of speech, etc.). The current version of the contextonymic sketch grammar contains one gramrel
that defines the contextonym of a word as any verb, noun, or adjective before or after the search word with
zero to 44 words between them beyond sentence or paragraph limits. It also excludes certain very common
lemmas (e.g., be, have, etc.) that do not convey significant semantic features of the search word.
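The window-based definition above can be approximated in a few lines of Python. This is a minimal sketch, assuming tokens are already lemmatized and tagged with Penn-style tags; the function name, the tag prefixes, and the two-item stop list are illustrative assumptions, and the actual gramrel is written in CQL rather than Python.

```python
from collections import Counter
from typing import List, Tuple

CONTENT_TAGS = ("N", "V", "J")   # assumed tag prefixes: nouns, verbs, adjectives
STOP_LEMMAS = {"be", "have"}     # the source lists these as examples only


def contextonyms(tokens: List[Tuple[str, str]], search_lemma: str,
                 max_gap: int = 44) -> Counter:
    """Count content lemmas with at most max_gap tokens between them and the search word."""
    counts = Counter()
    for i, (lemma, _tag) in enumerate(tokens):
        if lemma != search_lemma:
            continue
        # max_gap intervening tokens means a positional distance of max_gap + 1.
        lo = max(0, i - max_gap - 1)
        hi = min(len(tokens), i + max_gap + 2)
        for j in range(lo, hi):
            if j == i:
                continue
            other, tag = tokens[j]
            if tag.startswith(CONTENT_TAGS) and other not in STOP_LEMMAS:
                counts[other] += 1
    return counts


# (lemma, tag) pairs for the example sentence quoted in section 1.2.
tokens = [
    ("because", "IN"), ("glyphosate", "NN"), ("be", "VBZ"),
    ("systemic", "JJ"), ("excess", "JJ"), ("residue", "NN"),
    ("level", "NNS"), ("can", "MD"), ("persist", "VB"),
]

print(contextonyms(tokens, "glyphosate"))
print(contextonyms(tokens, "glyphosate", max_gap=1))
```

The window ignores sentence boundaries, as in the description above, so narrowing max_gap is the only way to restrict the context in this sketch.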
The contextonymic sketch grammar is useful for specialized knowledge extraction because it provides
terms that are closely related to the search word, which are sometimes not captured by other WSs. For
example, by consulting the contextonyms of fungicide in our corpus, it is possible to deduce that important
semantic features of the term are that fungicide application allows the control of certain diseases in crops,
but some pathogens can develop resistance to fungicides. Figure 5 reproduces concordance lines that
illustrate the relation between fungicide and its first five contextonyms.
Figure 5. Contextonymic WS and concordances of fungicide in our Agronomy corpus
With the contextonymic sketch grammar, it is usually necessary to consult the corresponding concordances
to discover how each contextonym relates to the search word. This disadvantage is compensated for by the
fact that this WS yields valuable results even in smaller corpora.
1.2.2 Semantic sketch grammar
Semantic co-occurrence is based on knowledge patterns, which are lexico-syntactic patterns that match
contexts in which a specific semantic relation is conveyed (Meyer, 2001, p. 281). An example of a
knowledge pattern is “X and other Y” (e.g., “manure and other fertilizers”), which encodes a hyponymic
relation (manure is a type of fertilizer), or “X contains Y” (e.g., “fertilizers contain urea”), which encodes
a meronymic relation (urea is a part of fertilizer).
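As a rough illustration of how a knowledge pattern works, the sketch below encodes “X and other Y” as a plain regular expression over raw text. It is a deliberately simplified stand-in: the ESSG rules are written in CQL over tagged corpora, whereas this example matches single surface words only, and the function name and sample sentence are invented for the illustration.

```python
import re
from typing import List, Tuple

# "X and other Y": X is read as a hyponym of Y (single-word terms only).
AND_OTHER = re.compile(r"\b(?P<hyponym>\w+) and other (?P<hypernym>\w+)\b")


def hyponymy_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by the pattern."""
    return [(m.group("hyponym"), m.group("hypernym"))
            for m in AND_OTHER.finditer(text)]


print(hyponymy_pairs("Farmers apply manure and other fertilizers in spring."))
```

Without lemmatization or part-of-speech constraints, a pattern like this would also match non-terminological strings, which is one reason the real rules rely on tagged tokens.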
The EcoLexicon Semantic Sketch Grammar (ESSG) (http://ecolexicon.ugr.es/essg/) (León-Araúz et al.,
2016; León-Araúz & San Martín, 2018) encodes knowledge patterns that capture hyponymy, meronymy,
cause, function, and location relations in English. There is a French version that at the moment only
includes hyponymy (San Martín et al., 2020). Figure 6 shows an example of each of the columns of the
ESSG in English extracted from the EcoLexicon corpus (León-Araúz et al., 2018). This corpus is available
to any Sketch Engine user and comes compiled with the ESSG.
The ESSG has the advantage of clearly identifying the semantic relationship linking the terms, but it
returns fewer results than other types of WS and requires large corpora to yield useful results.
2 Method
This adaptation of the default English sketch grammar currently envisages the following: (i) creation
of new gramrels; (ii) splitting and merging of gramrels; (iii) modification of gramrels; (iv) suppression of
gramrels; and (v) a combination of these strategies. This paper focuses on the creation and evaluation of a
new gramrel called “X is the proto-agent of…/X is the proto-patient of…” as well as the splitting, merging,
and modification of the gramrels “modifiers of X” and “adjective predicates of X”.
For this purpose, we applied a modified version of the methodology of creating knowledge-pattern-based
sketch grammars (San Martín et al., 2020, p. 5954). This methodology is based primarily on the
iterative renement and evaluation of CQL rules. The corpus (7,249,297 words) that we used consisted of
specialized texts on Agronomy from the following sources:
- 36.7 %: theoretical and practical documents on Agronomy published by the Food and
Agriculture Organization (FAO) and various national and regional governments in English-
speaking countries.
- 30.1 %: specialized monographs and encyclopedias on Agronomy.
- 22.6 %: scientic articles from the International Journal of Agronomy.
- 10.6 %: articles from Wikipedia, manually veried to belong to the eld of Agronomy.
In the early stages of gramrel development, the emphasis is on the evaluation of individual rules. Evaluation
is performed by querying the rule in a corpus compiled in Sketch Engine and ascertaining whether the rule
extracts expected results without generating noise. This type of precision is evaluated in small samples
(usually 100 random lines) to verify that the modifications in the rules produce the expected result. Since
evaluating recall would slow down the process, the number of total matches extracted by the rule is taken
as a proxy for recall (i.e., the greater the number of matches, the greater the recall).
Figure 6. Sample of semantic WSs extracted with the ESSG from the EcoLexicon English corpus
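The rule-level evaluation loop described above amounts to two simple measurements, sketched here in Python. The function names are hypothetical, and the is_correct callback stands in for the manual judgement of each concordance line, which cannot be automated.

```python
import random
from typing import Callable, List


def sample_precision(lines: List[str], is_correct: Callable[[str], bool],
                     sample_size: int = 100, seed: int = 0) -> float:
    """Share of correct lines in a random sample of a rule's matches."""
    if not lines:
        return 0.0
    rng = random.Random(seed)
    # Judge every line when the rule returns fewer matches than the sample size.
    sample = lines if len(lines) <= sample_size else rng.sample(lines, sample_size)
    correct = sum(1 for line in sample if is_correct(line))
    return correct / len(sample)


def recall_proxy(lines: List[str]) -> int:
    """Total number of matches, used in place of true recall."""
    return len(lines)


# Toy data: the labels stand in for a human annotator's decisions.
matches = ["ok", "ok", "ok", "noise"]
print(sample_precision(matches, lambda line: line == "ok"))
print(recall_proxy(matches))
```

The proxy only rewards rules for matching more lines, so it is meaningful only when read alongside the sampled precision.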
For sketch grammars, the precision of rules is less important than recall. Since users access results in
the form of WS (lists of results ordered by frequency or association score), exceptions, errors, and other
noisy results tend to be relegated to the bottom of WS lists. More frequent and significant results tend to
appear at the top. For this reason, during the development of gramrels, it is also important to periodically
test the resulting WS, even though this is more time-consuming because the sketch grammar needs to be
previously compiled. Accordingly, this research study evaluated our two adaptations of the default sketch
grammar on the basis of the results in WS form (see section 2.3 for the evaluation methodology).
The gramrels can be downloaded at <https://uqtr.ca/knowledge-sketch-grammar/>. Instructions on how to
use them in Sketch Engine are also available at that address.
2.1 The proto-agent-patient gramrel
The organization of specialized domains is based on events in which the interaction between different
types of agents and patients plays a predominant role (Faber, 2015, p. 23). However, it is not currently
possible to extract the agent-patient relation in Sketch Engine in a user-friendly way. For this reason, we
developed a new gramrel that extracts the relation between the nouns functioning as subject and object
in the same sentence (e.g., farmer and crop in Figure 7). This syntactic relation is useful for specialized
knowledge extraction because the subject usually accomplishes an action that affects the object in some
way. In terms of semantic roles, the former is normally characterized as the agent, but depending on the
verb, it can also be an experiencer or an instrument, among other roles. The object can typically be labeled
as the patient, but also as the theme or the recipient, among other roles. Dowty (1991) groups these semantic
roles into two macroroles: proto-agent and proto-patient. Consequently, this gramrel is called “X is the
proto-agent of…/X is the proto-patient of…” (proto-agent-patient gramrel).
The rst step consisted in creating a basic version of the gramrel by combining the two default gramrels
“objects of “X”/verbs with “X” as object” (object gramrel) and “subjects of “X”/verbs with “X” as subject”
(subject gramrel). New rules are thus created by combination instead of
Figure 7. Concordances with farmer as subject and crop as object in our Agronomy
corpus
grouping the corresponding rules in a single gramrel. The active-voice rules are combined in a new rule,
whereas the passive-voice rules are merged into another rule (Figure 8).
71
“Lexicography and Language Documentation”
PROCEEDINGS OF ASIALEX 2021
Figure 8. Combination of subject and object rules into the proto-agent-patient
rules
This basic version was used as a benchmark in the evaluation. Since it only returns 80,814 matches in our
corpus, it was enriched and refined to increase recall. Some of the changes that allowed us to increase recall
without compromising precision included the following:
- Optional modal verbs (will, can, must, etc.): “…any ammonium-containing fertilizer will
ultimately decrease soil pH…”.
- Additional optional auxiliary verbs: “…wind erosion is causing significant soil loss…”.
- Possibility of certain subordinate structures (is capable of…, have the advantage/
ability/… of/to, seems/appears/… to…, is used/designed/intended to…, etc.): “…parasitic
nematodes are capable of causing plant diseases...”, “…cover crops are used to improve the soil
structure...”.
We also created a new rule that captures the subject-object relation that could not be derived from the
subject and object gramrels: PROTO-PATIENT that PROTO-AGENT affects (and variants). This rule
matches concordances such as “…nitrogen that the crop roots can take up and use” or “… management
practices that farmers adopt focus on herbicides”.
Once these changes were applied, the resulting rules were evaluated and three limitations were identified:
phrasal verbs, multiple nouns in subject or object position, and verbs that do not convey a proto-agent-
patient relation or invert the order (since the subject is a proto-patient, and the object, a proto-agent).
Regarding phrasal verbs, the basic version does not allow the presence of a preposition between the verb
and the object. Instead, Sketch Engine’s default sketch grammar displays phrasal verbs in specific columns
(Figure 9). Since not including them in the proto-agent-patient gramrel reduces recall, we allowed up
to two optional prepositions between the verb and the object to retrieve concordances such as “…when
the crop takes up most of the nitrogen…” or “microbes break down organic matter”. Even though this
occasionally generates noise (e.g., “…goods travel from manufacturers to distributors…”), preliminary
tests showed that the increase in recall compensated for it.
Figure 9. Phrasal verbs WS columns of plant in the enTenTen18 corpus
A second problem was that WS results can only be single words. The subject and object gramrels only
capture the last noun in noun compounds (e.g., “…nitrogen applications disturb the soil”). In the case of
noun compounds linked by a preposition (e.g., “Rotation of crops increases the production of biomass...”)
and enumerations (e.g., “Cattle digestion, fertilizers and animal wastes cause emissions...”), only the
noun closest to the verb is captured. These limitations are justified in that they protect the precision of
the rules. However, they limit recall because the noun compound head is not always detected (e.g., “The
pollution of surface and groundwater that produces serious health problems…”). Likewise, some nouns do
not occupy the subject or object head position but semantically could act as proto-agent or proto-patient.
For instance, in “…nitrogen applications disturb the soil”, the head of the subject is applications, which is
therefore the direct proto-agent. However, nitrogen is indirectly a proto-agent as well.
Therefore, we modied the rules to capture any noun (whether head or modier) in a nominal compound or
an enumeration. This included both noun compounds without a preposition (e.g., “hydrocarbon pesticide
residue”) or linked by a preposition of (e.g., “xation of nitrogen”). No other prepositions were included at
this point because preliminary evaluations showed that they were an important source of noise. Additionally,
we enabled the rule to capture all the nouns in enumerations either in subject or object position (e.g., “These
fungi infect many cereals, grasses and other plants...”). We also included enumerations with hyponymic
formulas (e.g., “…organisms such as fungi and nematodes can damage…”).
The third limitation concerns the fact that the subject-object relation does not always correspond to the
proto-agent-patient relation. This was addressed by filtering certain verbs. A first group of verbs (invalidating
verbs) are those that do not convey the proto-agent-patient relation, for example, to be. This group also
includes verbs that convey certain relations already captured with the semantic WS, such as hyponymy
(e.g., include) or meronymy (e.g., have). The second group includes those that invert the common argument
order (inverting verbs). Table 1 includes both lists of verbs. While invalidating verbs were excluded from
all the rules, the rules were duplicated for inverting verbs. In one set of rules, the inverting verbs were
excluded, and in the others, the position of the proto-agent and proto-patient were interchanged.
Table 1. Invalidating and inverting verbs
Invalidating verbs: accord, arise, be, become, belong, come, compete, compose, comprise, consist, contain,
define, exist, feature, follow, gain, happen, have, include, lack, lose, match, name, need, originate,
range, receive, refer, regard, relate, remain, require, stay, stem, survive, vary
Inverting verbs: depend (on), rest (on), lie (on), lie (in), result (from), suffer (from), rely (on/upon)
The enriched version of the gramrel returned 374,525 matches from our corpus, more than four times as many
as the basic version (80,814). It is composed of the following six rules (note that the verb affect
represents any verb with the above-mentioned exceptions):
- PROTO-AGENT affects PROTO-PATIENT (and variants)
- PROTO-PATIENT is affected by PROTO-AGENT (and variants)
- PROTO-PATIENT that PROTO-AGENT affects (and variants)
- PROTO-AGENT affects PROTO-PATIENT (and variants) (inverting verbs)
- PROTO-PATIENT is affected by PROTO-AGENT (and variants) (inverting verbs)
- PROTO-PATIENT that PROTO-AGENT affects (and variants) (inverting verbs)
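The effect of the two verb lists on role assignment can be summarized in a short Python sketch. It assumes that a subject-verb-object triple of lemmas has already been identified, which in the actual gramrel is done by the CQL rules themselves; the function name and the example triples are invented for illustration, and the prepositions attached to the inverting verbs in Table 1 are ignored for simplicity.

```python
from typing import Optional, Tuple

# Verb lemmas copied from Table 1.
INVALIDATING = {
    "accord", "arise", "be", "become", "belong", "come", "compete", "compose",
    "comprise", "consist", "contain", "define", "exist", "feature", "follow",
    "gain", "happen", "have", "include", "lack", "lose", "match", "name",
    "need", "originate", "range", "receive", "refer", "regard", "relate",
    "remain", "require", "stay", "stem", "survive", "vary",
}
INVERTING = {"depend", "rest", "lie", "result", "suffer", "rely"}


def proto_roles(subject: str, verb: str, obj: str) -> Optional[Tuple[str, str]]:
    """Return (proto_agent, proto_patient), or None if the verb is filtered out."""
    if verb in INVALIDATING:
        return None                # no proto-agent-patient relation conveyed
    if verb in INVERTING:
        return (obj, subject)      # the subject is the proto-patient
    return (subject, obj)          # default: the subject is the proto-agent


print(proto_roles("fertilizer", "increase", "yield"))
print(proto_roles("fungicide", "be", "chemical"))
print(proto_roles("yield", "depend", "rainfall"))
```

Keeping the inverting verbs in a separate branch mirrors the duplication of the rules described above: the same match is kept, but the two nouns are assigned to the opposite columns.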
2.2 Adjectival gramrels
Two WS columns in the default sketch grammar extract the adjectives that modify a given noun. The
“modifiers of X/nouns modified by X” gramrel (modifiers gramrel) retrieves the adjectives and nouns that
appear before a noun (e.g., “perennial crop”, “corn crop”). The “adjective predicates of X/subjects of be
X” gramrel (predicative adjectives gramrel) extracts an adjective placed after a noun, even when the two are
separated by to be (e.g., “crops resistant to herbicides”, “crops are tolerant”).
These columns have potentially useful features for building definitions or conceptual networks, namely
the ability to assign characteristics to a given concept. To adapt them to the extraction of specialized
knowledge, the modifiers gramrel was divided in two: one gramrel for adjective modifiers and another
for noun modifiers. This allowed us to explore whether a single column for all the adjectives that modify a
noun was useful. In another gramrel, all the nouns modifying another noun (either preceding the modified
noun or postposed with a preposition, e.g., “wheat production” and “production of wheat”) could also
be grouped. This paper only addresses adjectives from the point of view of a noun search word. In other
words, although the adjectives gramrel is dual (“adjectives of X/nouns modified by X”), this paper only
focuses on the column that lists adjectives (“adjectives of X”).
This new adjectives gramrel is composed of three gramrels that will be tested separately: the attributive
adjectives gramrel, predicative adjectives gramrel, and the hyponymic adjectives gramrel.
The attributive adjectives gramrel extracts the adjectives that precede the noun modified (e.g., “synthetic
fungicide”, “foliar fungicide”). This gramrel originates from the split of the modifiers gramrel. It is
composed of a single rule, which was changed to exclude the following adjectives that are not useful for
specialized knowledge extraction: most, least, many, other, more, less, such, able, unable, due, capable,
incapable, various, several, few, same, different. These adjectives were also excluded from the other
adjectival gramrels. This gramrel produced 514,578 matches in our corpus.
The predicative adjectives gramrel is composed of a single rule in the default grammar. To create an
enriched version, we divided it into two separate rules. The first rule captures the adjective placed directly
after the noun (e.g., “…keep the soil dry…”) and was modified so as to capture two adjectives (e.g.,
“…fisheries more productive and sustainable…”). The second rule extracts the adjective placed after the
noun and the verb to be (e.g., “the soil is dry”). We increased its recall by adding more predicative verbs
(i.e., appear, look, seem, become, remain, get, turn). Other modifications include optional auxiliary verbs
(e.g., “droughts have become more prolonged”), modal verbs (e.g., “soybeans will remain yellow”), two
adjectives (e.g., “soil is acid or alkaline”), and noun enumerations (e.g., “leaves and small stems become
more brittle”). The basic version of the gramrel produced 34,127 matches in our corpus. The enriched
version obtained 46,358 matches, which is a 35.84% increase.
Finally, the hyponymic adjectives gramrel captures the adjectives that qualify the hyponym of the search
word. It is based on hyponymic knowledge patterns. It follows the logic that, in a hyponymic
structure, the adjective modifying the hypernym potentially expresses a characteristic of the hyponym. For
example, in “...the use of interventional measures such as fungicides...”, it can be deduced that fungicide is
a type of interventional measure. Therefore, interventional also applies to fungicide.
The starting point of this gramrel was the hyponymic rules in the ESSG (see 1.2.2). We only retained those
rules that returned at least 1,000 results in our corpus. We then excluded the rules that were excessively
noisy, although they might be included in the future if they can be refined to yield satisfactory results. The
hyponymic adjectives gramrel produced 27,420 matches in our corpus and contains the following rules:
1. adjective HYPERNYM such as/including/especially/like/includes HYPONYM (e.g.,
“…agricultural inputs such as herbicides”)
2. HYPONYM and/or other adjective HYPERNYM (e.g., “…antibiotics or other effective
antimicrobials”)
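The first of these rules can be illustrated with a minimal Python sketch over (word, tag) pairs. This is not the CQL rule itself: the function name, the Penn-style tags, and the restriction to single-word hypernyms and hyponyms are simplifying assumptions. The list of excluded adjectives is the one given for the attributive adjectives gramrel.

```python
from typing import List, Tuple

# Hyponymy markers from rule 1; "such as" spans two tokens.
MARKERS = [("such", "as"), ("including",), ("especially",), ("like",), ("includes",)]
EXCLUDED = {"most", "least", "many", "other", "more", "less", "such", "able",
            "unable", "due", "capable", "incapable", "various", "several",
            "few", "same", "different"}


def hyponymic_adjectives(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (hyponym, adjective) pairs for: adjective HYPERNYM marker HYPONYM."""
    words = [word.lower() for word, _tag in tokens]
    results = []
    for i in range(len(tokens) - 1):
        adjective, adj_tag = tokens[i]
        _hypernym, hyper_tag = tokens[i + 1]
        if not (adj_tag.startswith("J") and hyper_tag.startswith("N")):
            continue
        if adjective.lower() in EXCLUDED:
            continue
        for marker in MARKERS:
            start = i + 2
            end = start + len(marker)
            # The marker must be followed directly by a noun (the hyponym).
            if (tuple(words[start:end]) == marker and end < len(tokens)
                    and tokens[end][1].startswith("N")):
                results.append((tokens[end][0], adjective))
    return results


# Tagged version of the example given for rule 1 (tags are assumed).
example = [("agricultural", "JJ"), ("inputs", "NNS"), ("such", "JJ"),
           ("as", "IN"), ("herbicides", "NNS")]
print(hyponymic_adjectives(example))
```

The pair returned attaches the adjective of the hypernym (agricultural) to the hyponym (herbicides), which is the inference this gramrel relies on.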
2.3 Evaluation methods
These adaptations were evaluated in two stages. The first stage evaluated the WS columns in terms
of precision, whereas the second stage evaluated the usefulness of the gramrels to extract specialized
knowledge.
In both stages, the evaluation was performed using one high-frequency term (crop with 28,457 occurrences in
the Agronomy corpus) and two medium-frequency terms (fungicide with 1,161 occurrences, and nematode
with 866 occurrences) as search words. Additionally, for the second stage, we extracted definitions of
these terms from specialized glossaries and multidomain terminology databases (only if the definition was
labeled as belonging to Agronomy or its subdomains). In total, 30 definitions of crop, 20 of nematode,
and 26 of fungicide were recovered. In both stages, only the five most frequent results per WS column
were considered. To evaluate the precision (i.e., the percentage of correct results), we assessed whether
the results were correct by accessing the corresponding concordance lines. In the case of the proto-agent-
patient gramrel, a concordance line was considered correct (i.e., a true positive) if a proto-agent-patient
relationship could be deduced directly or indirectly from the concordance.
For attributive and predicative adjectives, a concordance line was considered correct if the adjective
qualified the captured noun in the concordance. In the case of hyponymic adjectives, it was considered
correct if the noun inherited the characteristic expressed by the adjective.
We also calculated validity, according to which a result in a WS column is valid when at least one of its
associated concordances is a true positive (San Martín et al., 2020, p. 5961). For example, the relation
“fungicide is the proto-patient of industry” has four associated concordances (Figure 10). Since only
concordances 3 and 4 are correct, the precision of this result is 50%. However, because there is at least
one correct concordance, its validity is 100%.
Figure 10. Concordances associated to the relation “fungicide is the proto-patient of industry”
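The difference between precision and validity can be made explicit with a small Python sketch. The function names are hypothetical, and each list of booleans stands for the manual judgements of the concordances attached to one WS result. The first entry reproduces the fungicide and industry example above; the second entry is an invented result added only to show a validity below 100%.

```python
from typing import Dict, List


def result_precision(judgements: List[bool]) -> float:
    """Share of a result's concordances judged correct."""
    return sum(judgements) / len(judgements)


def is_valid(judgements: List[bool]) -> bool:
    """A result is valid if at least one concordance is a true positive."""
    return any(judgements)


def column_validity(column: Dict[str, List[bool]]) -> float:
    """Share of valid results in a WS column."""
    return sum(is_valid(j) for j in column.values()) / len(column)


# Concordances 1 and 2 are incorrect, 3 and 4 are correct (see the example above).
industry = [False, False, True, True]
print(result_precision(industry))   # 0.5
print(is_valid(industry))           # True

# "hypothetical_result" is invented for illustration and has no correct concordance.
column = {"industry": industry, "hypothetical_result": [False, False]}
print(column_validity(column))      # 0.5
```

Validity is therefore the more lenient measure: a result with a single correct concordance counts as fully valid regardless of how many incorrect concordances accompany it.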
The second evaluation stage explored the usefulness of the adaptations for specialized knowledge extraction
by comparing the gramrels’ results with the definitions of the search terms. In some cases, they were also
compared with the contextonymic and semantic WS columns.
3 Results
3.1 Evaluation of the proto-agent-patient gramrel
3.1.1 First stage
The table with the complete results of the precision and validity analysis is in Appendix 1. Figure 11
summarizes the results in terms of precision and validity, and Figure 12 reproduces the resulting WS
columns. On average, the enriched version has a precision of 71.17% compared to 66.01% for the basic
version. The enriched version was found to perform better for fungicide and crop, whereas the basic version
performed better for nematode.
In terms of validity, the results of the enriched version are also higher than those of the basic version.
On average, the enriched version obtained 93.33% validity, while the basic version obtained 83.33%. The
enriched version performed better for nematode and fungicide than the basic version. Both versions were
100% valid for crop.
The analysis of the incorrect concordances allowed us to identify different types of errors, mostly
attributable to the limitations of WSs. Such errors include problems with sentence segmentation (e.g.,
“Destroy or control weeds and soil pests Incorporate crop residues...”) and POS-tagging (e.g., “This water
blistering disorder crops up from time to time...”). This type of error was present in the same proportion in
the basic and enriched versions.
Figure 11. Precision and validity of the proto-agent-patient gramrel
Another type of error caused by a WS limitation concerned noun compounds. As previously explained,
to overcome this problem, the enriched version retrieves all the nouns in nominal compounds. Although
this facilitates the retrieval of many correct results, it is also a source of noise (e.g., “Plant pathologists are
investigating methods of nematode control”). However, these preliminary results seem to indicate that the
increase in recall compensates for the noise, especially since it is preferable to prioritize recall rather than
precision.
Finally, there were errors generated by invalidating and inverting verbs. Even though the enriched version
accounts for some of these verbs, new ones appeared in the concordances (e.g., “Crops may tolerate
greater amounts of blowing soil...”). All these verbs will be analyzed before they are included in the rules.
Figure 12. Resulting WS columns from the proto-agent-patient gramrel. The basic version has a green dot;
the enriched version, a blue dot.
3.1.2 Second stage
Since the enriched version returned more precise results, the second stage was performed on this one. First,
we compared the proto-agents and proto-patients in the extracted definitions of the analysis terms (Table 2)
with the gramrel results (Table 3). Terms present in both the definitions and the WS are in bold.
Table 2. Proto-agents and proto-patients in the definitions. The number of occurrences in the definitions is in parentheses

| Term | Relation | Terms in the definitions |
|---|---|---|
| fungicide | is the proto-agent of… | fungus (24), growth (6), disease (3), plant (3), mold (2), mildew (2), control (1), yeast (1), pathogen (1), crop (1), product (1), soil (1), development (1) |
| fungicide | is the proto-patient of… | insect (1), ant (1) |
| crop | is the proto-agent of... | - |
| crop | is the proto-patient of... | livestock (1), labor (1), farmer (1), people (1) |
| nematode | is the proto-agent of... | plant (11), root (5), animal (5), crop (2), vine (2), tissue (2), agriculture (1), pest (1), slug (1), leatherjacket (1), loss (1), human (1), yield (1), swelling (1), growth (1), disease (1), bird (1), mammal (1), insect (1), juice (1), structure (1), damage (1) |
| nematode | is the proto-patient of... | plant (1), control (1), brassica (1), chemical (1), nematicide (1), water (1), contamination (1) |
Table 3. Top ve results in the proto-agent-patient WS with the number of correct concordances in
parentheses
fungicide is the proto-agent of… disease(10), yield(7), incidence(6), grain(6), plant(4)
is the proto-patient of… grower(5), industry(2), leaf(3), application(2), seed(0)
crop is the proto-agent of... soil(92), water(38), yield(37), weed(35),
production(26)
is the proto-patient of... farmer(122), soil(40), system(37), water(25),
fertilizer(23)
nematode is the proto-agent of... disease(10), plant(10), root(6), crop(4), effector(4)
is the proto-patient of... crop(3), cultivar(3), soil(2), nematicide(1), plant(0)
Of the 28 valid WS results, only 8 (i.e., 28.57%) also appear in the definitions. This low percentage was
to be expected because definitions select the conceptual information considered most relevant, whereas
corpora tend to contain much more information. It can also be observed that the use of the
proto-agent-patient relation in the definitions is variable: fungicide and nematode are most frequently
defined as proto-agents, while in the definitions of crop, both macroroles are rare.
Some terms that are frequent in the definitions are missing from the WS results. The most prominent case is
fungus as a proto-patient of fungicide. In the column “fungicide is the proto-agent of...”, fungus is in 85th
position with only one associated concordance (“...fungicides kill fungi...”). The contextonymic WS of
fungicide was thus consulted to determine whether the low ranking of fungus was a case of silence (i.e., that
the proto-agent-patient gramrel missed relevant concordances).
Fungus is only the 72nd contextonym of fungicide, with 96 concordances. This indicates that, although the
relationship between fungicide and fungus is relevant to define fungicide, specialized texts do not often
mention this characteristic. Additionally, we observed that only 12 out of the 96 concordances directly
or indirectly convey a proto-agent-proto-patient relationship. However, in most cases, it is not reflected
in a subject-object relation (e.g., “…strobilurin fungicides are very active against many plant pathogenic
fungi…”). Among the ones in which there is indeed a subject-object relation, most are cases of anaphora
(e.g., “…fungicides penetrate into plant tissue, where they kill or inhibit a fungus…”) and uncommon
use of punctuation (e.g., “…fungicides (kill fungi)...”). Both cases are difficult to account for in CQL rules
without generating excessive noise.
Finally, regarding the complementarity of this relationship with the contextonymic WS, the first
contextonymic results of the three terms show matches with the proto-agent-patient gramrel. It follows
that the proto-agent-patient columns can facilitate the discovery of the relationship between the search
word and many of its contextonyms. As shown in Table 4, for 9 of the 15 most frequent contextonyms
of fungicide, nematode, and crop, the first five proto-agent-patient results help to determine the semantic
relationship between the two terms.
Table 4. Comparison of contextonyms (number of occurrences in parentheses) and the results from the proto-agent-patient gramrel

| Search word | Contextonym | The search word is its… |
|---|---|---|
| fungicide | disease (907) | proto-agent |
| fungicide | crop (581) | proto-agent/proto-patient |
| fungicide | application (462) | - |
| fungicide | resistance (421) | - |
| fungicide | control (406) | - |
| nematode | plant (776) | proto-agent |
| nematode | soil (600) | proto-patient |
| nematode | crop (398) | proto-agent |
| nematode | root (347) | proto-agent |
| nematode | population (261) | - |
| crop | soil (15,672) | proto-agent |
| crop | plant (9,063) | - |
| crop | production (8,707) | proto-agent |
| crop | use (verb) (7,944) | - |
| crop | yield (7,465) | proto-agent |
3.2 Evaluation of the adjectives gramrel
3.2.1 First stage
The table with the complete results of the precision and validity analysis is in Appendix 2. Figure 13
summarizes the precision and validity data. Figure 14 reproduces the WS columns.
The attributive adjectives gramrel has very high precision (99.93% on average) and perfect validity. The
only errors that the gramrel produced are due to problems with corpus segmentation.
As for predicative adjectives, the enriched version performed slightly better than the basic version in both
precision (72.53% vs. 68.47%) and validity (80% vs. 75.94%). It is not surprising that crop, the most
frequent term, obtained the best results. In fact, in the enriched version, its validity is 100%. Despite
the small sample, the precision and validity are sufficiently high to conclude that this gramrel performs
satisfactorily. Moreover, the results indicate that the enriched version, even though it retrieves 35.84% more
results, maintains or even surpasses the level of precision of the basic version.
Figure 13. Precision and validity of the adjectival gramrels
The frequent errors observed in the extraction rules of predicative adjectives are similar to
those of the proto-agent-patient gramrel. Both the basic and enriched versions retrieve some incorrect
results stemming from problems with POS tagging and corpus segmentation. The enriched version also
returns some incorrect matches due to noun compounds: “...combinations of these fungicides are effective
in the management…”.
Finally, the results of the hyponymic gramrel are notably inferior to those of the other gramrels both in
precision and validity. While the precision is 42.2%, the validity is only slightly higher at 53.33%. However,
the results for crop are 79.93% precise and 100% valid. Since crop is the most frequent term, this is consistent
with the fact that knowledge-pattern-based gramrels (such as this gramrel) need larger corpora to yield
satisfactory results.
This gramrel shares the same errors as the other ones, but there are also two that are unique to it. The first
error occurs when the head of the hypernym is a collective noun. For example, in “there is a wide variety
of parasites including trematodes, cestodes, nematodes…”, wide qualifies variety, and not nematode’s
hypernym (parasite). Therefore, nematode cannot be said to inherit that attribute. This error can easily be
solved by filtering out collective nouns.
The other error occurs when the head of the hypernym and the head of the hyponym are the same lemma.
For example, “…plant-feeding nematodes such as lesion nematodes”. In this case, plant-feeding only
applies to one particular type of nematode (i.e., lesion nematodes). This can also be easily filtered to
improve the performance of this gramrel. However, the attributive adjectives gramrel already captures the
adjective preceding the hypernym (i.e., “plant-feeding nematodes”).
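The two filters described above can be sketched as a post-processing step over candidate matches. The Python below is only an illustration: the text does not say how the filters would be implemented (presumably inside the sketch grammar rules themselves), and the `HyponymicMatch` structure, the `COLLECTIVE_NOUNS` list, and the third candidate in the usage example are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical, deliberately short list; a usable one would have to be compiled.
COLLECTIVE_NOUNS = {"variety", "range", "group", "number", "series"}


@dataclass(frozen=True)
class HyponymicMatch:
    adjective: str      # adjective captured by the hyponymic gramrel
    hypernym_head: str  # lemma of the noun that the adjective modifies
    hyponym_head: str   # lemma of the head of the hyponym (the search word)


def keep(match: HyponymicMatch) -> bool:
    """Discard the two error types that are specific to the hyponymic gramrel."""
    if match.hypernym_head in COLLECTIVE_NOUNS:
        # The adjective qualifies the collective noun, not the real hypernym.
        return False
    if match.hypernym_head == match.hyponym_head:
        # Same lemma on both sides: the adjective only applies to a subtype.
        return False
    return True


def filter_matches(matches: Iterable[HyponymicMatch]) -> List[HyponymicMatch]:
    """Return only the candidate matches that pass both filters."""
    return [m for m in matches if keep(m)]


if __name__ == "__main__":
    candidates = [
        # "a wide variety of parasites including ... nematodes"
        HyponymicMatch("wide", "variety", "nematode"),
        # "plant-feeding nematodes such as lesion nematodes"
        HyponymicMatch("plant-feeding", "nematode", "nematode"),
        # Invented candidate, included only to show a match that is kept.
        HyponymicMatch("soilborne", "pathogen", "nematode"),
    ]
    for match in filter_matches(candidates):
        print(match.adjective, "->", match.hyponym_head)
```

A lemma-based stoplist is the simplest option; it would also discard legitimate hypernyms that happen to be on the list, so its contents would need to be checked against the corpus.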
Figure 14. Resulting WS columns from the adjectival gramrels. The basic version of the predicative
adjectives has a green dot; the enriched version, a blue dot.
3.2.2 Second stage
In this stage, we left out the basic version of the predicative adjectives gramrel and evaluated the attributive
and hyponymic adjectives gramrels along with the enriched predicative adjectives gramrel. First, we
compared the gramrel results (Table 6) with the adjectives extracted from the definitions of the terms under
analysis (Table 5). In both tables, the terms present in both the definitions and WSs are in bold.
Table 5. Adjectives in the definitions. The number of occurrences in the definitions is in parentheses

| Term | Adjectives in the definitions |
|---|---|
| fungicide | chemical (6), physical (2), toxic (1) |
| nematode | microscopic (7), parasitic (6), cylindrical (3), small (3), unsegmented (3), slender (2), elongated (2), worm-like (2), nonsegmented (2), numerous (1), multicellular (1), living (1), beneficial (1), harmful (1), former (1), biological (1), phytoparasitic (1), abundant (1), long (1), legless (1), worm-shaped (1), aquatic (1), round (1), colorless (1), threadlike (1) |
| crop | cultivated (9), grown (8), agricultural (2), used (2), growing (2), young (2), managed (2), horticultural (1), animal (1), total (1), yearly (1), wild (1) |
Table 6. Top ve results in the adjectival WSs with the number of correct concordances in
parentheses
attributive predicative hyponymic
fungicide new (27) available (6) effective (1)
synthetic (20) important (2) interventional (1)
foliar (16) effective (1) powdery (0)
systemic (15) economical (0) noncross-resistant (0)
modern (8) miniscule (0) downy (0)
nematode parasitic (56) present (2) soilborne (5)
reniform 38 migratory (1) important (0)
plant-parasitic (17) microscopic (1) wide (0)
endoparasitic (7) sedentary (1) fungal (0)
pathogenic (6) predatory (0) plant-feeding (0)
crop important (267) resistant (17) agricultural (28)
annual (200) susceptible (15) cultural (15)
major 183) sensitive (14) organic (13)
perennial (178) vulnerable (12) agronomic (6)
main (147) important (10) biological (5)
The adjectives extracted from the definitions are only the ones that describe the term. For instance, in the
definition of fungicide as “chemical compound used to control fungi”, the adjective chemical describes
fungicide and is therefore included in the list.
These adjectives have a low level of correspondence (3 out of 36, i.e., 8.33%) with the ones extracted with
the gramrels. No specific gramrel has more matches than the others, since each one has only one match.
This may be due to the fact that corpora tend to contain more information than definitions. In addition,
the adjectives that qualify a noun do not necessarily express a defining characteristic of that concept. It
may be a characteristic of a subtype or of an instance of the concept. For example, in “…between two
periods, the crop is vulnerable to weeds…”, the predicative adjective refers to an instance of crop and is
also time-restricted. Therefore, it cannot be deduced that vulnerability to weeds is a characteristic of crops,
only a possible feature. Another example is “Modern fungicides can be applied to crops…”. In this case,
the attributive adjective modern selects a subset of all existing fungicides. Being modern is also a possible
characteristic of fungicides, though not a defining one.
As for the complementarity of adjectival gramrels and contextonyms, it should be noted that the initial
positions of the contextonymic WS are usually occupied by nouns. In the case of crop, it is necessary to go
to the 19th result to find the first adjective (high). With nematode, the first adjective is parasitic, in 24th
position. As for fungicide, the first adjective (new) is in 18th position. As for the semantic gramrels, none of
them extract adjectives. For these reasons, an adjectival gramrel does potentially provide complementary
information.
4 Analysis and Discussion
The results of the evaluation of the proto-agent-patient gramrel are promising. Despite the small sample,
the enriched version was found to achieve over 70% precision and over 90% validity. The gramrel in
its current state is thus functional. The evaluation also shows ways to improve the rules. For example,
increasing the list of invalidating and inverting verbs will lead to greater precision. A more thorough
evaluation of the impact of retrieving all nouns that make up nominal compounds in object and subject
position will also allow us to adjust the rules accordingly.
Comparison with the denitions did not yield a high level of correspondence. It was especially relevant
that some frequent proto-agents and proto-patients in the denitions were absent from the gramrel results.
This was the case of fungus in relation to fungicide. The analysis of the concordances in which fungus is
a contextonym of fungicide reveals that the addition of patterns other than subject-object to the gramrel
could improve the gramrel.
Regarding adjectival gramrels, preliminary results did not reveal whether it would be useful to merge them
into a single gramrel. One argument against merging is that the attributive gramrel returned almost seven
times as many results as the other two combined, which means that the others would have very little weight
in a merged gramrel. However, as yet, there is not sufficient evidence to affirm that they are more useful
separately than merged, especially since having too many WS columns could lead to information overload.
As for the usefulness of adjectival WSs for specialized knowledge extraction, this preliminary evaluation
was inconclusive. None of the gramrels have a higher correspondence with the adjectives extracted from the
definitions. However, there is also the question of whether definitions are a good benchmark for evaluating
the capacity of WSs to assist in specialized knowledge extraction. New forms of evaluation should thus
also be explored.
While there were some very useful adjectives extracted by the gramrels, others were less so because the
adjectives that qualify a noun do not always express a defining characteristic. Thus, our results suggest
that the adaptation of this gramrel is worth further study. Thanks to the high precision of the attributive
and predicative adjectives gramrels, they are a good research tool for that purpose. As for hyponymic
adjectives, the results indicate that the rules need to be refined to achieve greater precision before the
gramrel can be considered functional.
It is also worth pointing out that we have not evaluated the inverse columns (i.e., using an adjective as a
search word to obtain the list of nouns it qualifies). It is probable that these WS columns are useful for
studying adjectives from a conceptual point of view.
In future work, we will continue to develop these gramrels and evaluate them with a larger sample. For
this purpose, corpora from different specialized domains and different types of search terms will be used.
Finally, in parallel, we will further adapt the English default sketch grammar to create a sketch grammar for
specialized knowledge extraction that would also include contextonymic and semantic gramrels.
5 References
Dowty, D. (1991). Thematic Proto-Roles and Argument Selection. Language, 67(3), 547.
Evert, S. (2009). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An
International Handbook (Vol. 2, pp. 1212–1248). De Gruyter.
Faber, P. (2015). Frames as a Framework for Terminology. In H. J. Kockaert & F. Steurs (Eds.), Handbook
of Terminology (Vol. 1, pp. 14–33). John Benjamins.
Jakubíček, M., Kilgarriff, A., McCarthy, D., & Rychlý, P. (2010). Fast Syntactic Searching in Very Large
Corpora for Many Languages. Proceedings of the 24th Pacific Asia Conference on Language,
Information and Computation (pp. 741–747). Waseda University.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V.
(2014). The Sketch Engine: Ten Years on. Lexicography, 1(1), 7–36.
Kilgarriff, A., & Tugwell, D. (2001). Word sketch: Extraction and display of significant collocations for
lexicography. Proceedings of ACL Workshop on Collocation: Computational Extraction, Analysis
and Exploitation, 32–38.
León-Araúz, P., & San Martín, A. (2018). The EcoLexicon Semantic Sketch Grammar: from Knowledge
Patterns to Word Sketches. Proceedings of the LREC 2018 Workshop “Globalex 2018 – Lexicography
& WordNets” (pp. 94–99). Globalex.
León-Araúz, P., San Martín, A., & Faber, P. (2016). Pattern-based Word Sketches for the Extraction of
Semantic Relations. Proceedings of the 5th International Workshop on Computational Terminology
(pp. 73–82).
León-Araúz, P., San Martín, A., & Reimerink, A. (2018). The EcoLexicon English Corpus as an open
corpus in Sketch Engine. Proceedings of the 18th EURALEX International Congress (pp. 893–901).
Euralex.
Meyer, I. (2001). Extracting knowledge-rich contexts for terminography - A conceptual and methodological
framework. In D. Bourigault, M.-C. L’homme & C. Jacquemin, (Eds.), Recent Advances in
Computational Terminology (pp. 279–302). John Benjamins.
San Martín, A. (2016). La representación de la variación contextual mediante definiciones terminológicas
flexibles. [Doctoral dissertation, University of Granada].
San Martín, A., Trekker, C., & León-Araúz, P. (2020). Extraction of Hyponymic Relations in French
with Knowledge-Pattern-Based Word Sketches. Proceedings of the 12th Language Resources and
Evaluation Conference (pp. 5955–5963). ELRA.
San Martín, A. (in press). A Flexible Approach to Terminological Definitions: Representing Thematic
Variation. International Journal of Lexicography.
Acknowledgements
We thank Natasha Herger, research assistant, for her contribution to the data analysis. This research
was carried out as part of project Développement d’une méthodologie d’élaboration de définitions
terminologiques : analyse de corpus et variation contextuelle (2020-NP-267503) funded by Quebec’s
Society and Culture Research Fund (FQRSC).
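Appendix 1. Results of the proto-agent-patient gramrel evaluation

BASIC VERSION

| Term | Proto-agent | Proto-patient | Correct/total | Prec. (%) | Most common verb | Column prec. (%) | Column val. (%) | Term prec. (%) | Term val. (%) |
|---|---|---|---|---|---|---|---|---|---|
| nematode | nematode | root | 4/4 | 100 | invade | 100 | 100 | 80 | 80 |
| | | soil | 3/3 | 100 | infest | | | | |
| | | loss | 3/3 | 100 | cause | | | | |
| | | disease | 3/3 | 100 | cause | | | | |
| | | damage | 3/3 | 100 | cause | | | | |
| | LM135 | nematode | 1/1 | 100 | | 60 | 60 | | |
| | solarization | | 1/1 | 100 | | | | | |
| | estimate | | 1/1 | 100 | | | | | |
| | loss | | 0/1 | 0 | | | | | |
| | pest | | 0/1 | 0 | | | | | |
| fungicide | fungicide | grower | 0/3 | 0 | | 66.67 | 80 | 56.19 | 70 |
| | | incidence | 3/3 | 100 | reduce | | | | |
| | | control | 3/3 | 100 | provide | | | | |
| | | growth | 3/3 | 100 | | | | | |
| | | effect | 1/3 | 33.33 | | | | | |
| | seed | fungicide | 2/7 | 28.57 | receive | 45.71 | 60 | | |
| | grower | | 3/3 | 100 | use | | | | |
| | leaf | | 2/2 | 100 | absorb | | | | |
| | crop | | 0/2 | 0 | | | | | |
| | shasho | | 0/1 | 0 | | | | | |
| crop | crop | nutrient | 6/20 | 30 | remove | 44.15 | 100 | 61.84 | 100 |
| | | legume | 1/16 | 6.25 | | | | | |
| | | water | 11/14 | 78.57 | use | | | | |
| | | yield | 9/14 | 64.29 | produce | | | | |
| | | nitrogen | 5/12 | 41.67 | x | | | | |
| | farmer | crop | 40/47 | 85.11 | grow | 79.52 | 100 | | |
| | farm | | 9/9 | 100 | grow/produce | | | | |
| | rotation | | 7/8 | 87.5 | involve | | | | |
| | soil | | 6/8 | 75 | affect | | | | |
| | system | | 4/8 | 50 | use | | | | |

ENRICHED VERSION

| Term | Proto-agent | Proto-patient | Correct/total | Prec. (%) | Most common verb | Column prec. (%) | Column val. (%) | Term prec. (%) | Term val. (%) |
|---|---|---|---|---|---|---|---|---|---|
| nematode | nematode | disease | 10/12 | 83.33 | cause | 87.99 | 100 | 71.50 | 90 |
| | | plant | 10/11 | 90.91 | cause | | | | |
| | | root | 6/7 | 85.71 | damage/invade | | | | |
| | | crop | 4/5 | 80 | cause | | | | |
| | | effector | 4/4 | 100 | secrete | | | | |
| | crop | nematode | 3/4 | 75 | suppress | 55 | 80 | | |
| | plant | | 0/4 | 0 | | | | | |
| | soil | | 2/4 | 50 | affect | | | | |
| | cultivar | | 3/3 | 100 | support | | | | |
| | nematicide | | 1/2 | 50 | | | | | |
| fungicide | fungicide | disease | 10/11 | 90.91 | control | 89.02 | 100 | 76.17 | 90 |
| | | yield | 7/8 | 87.5 | increase | | | | |
| | | incidence | 6/6 | 100 | reduce | | | | |
| | | grain | 6/6 | 100 | treat/affect | | | | |
| | | plant | 4/6 | 66.67 | | | | | |
| | seed | fungicide | 0/6 | 0 | | 63.33 | 80 | | |
| | grower | | 5/5 | 100 | use/inquire | | | | |
| | industry | | 2/4 | 50 | | | | | |
| | leaf | | 3/3 | 100 | absorb | | | | |
| | application | | 2/3 | 66.67 | | | | | |
| crop | crop | soil | 92/125 | 73.60 | improve | 66.68 | 100 | 65.86 | 100 |
| | | yield | 37/67 | 55.22 | produce/impact | | | | |
| | | water | 38/57 | 66.67 | use | | | | |
| | | weed | 35/48 | 72.92 | suppress | | | | |
| | | production | 26/40 | 65 | improve/increase | | | | |
| | farmer | crop | 122/133 | 91.73 | grow | 65.04 | 100 | | |
| | soil | | 40/65 | 61.54 | affect | | | | |
| | system | | 37/58 | 63.79 | use | | | | |
| | fertilizer | | 23/46 | 50 | affect | | | | |
| | water | | 25/43 | 58.14 | affect | | | | |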
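Appendix 2. Results of the adjectives gramrel evaluation

| Term | Gramrel | Adjective | Correct/total | Prec. (%) | Gramrel prec. (%) | Gramrel val. (%) |
|---|---|---|---|---|---|---|
| nematode | attributive adjectives | parasitic | 56/56 | 100 | 100 | 100 |
| | | reniform | 38/38 | 100 | | |
| | | plant-parasitic | 17/17 | 100 | | |
| | | endoparasitic | 7/7 | 100 | | |
| | | pathogenic | 6/6 | 100 | | |
| | predicative adjectives (basic version) | present | 2/2 | 100 | 80 | 80 |
| | | migratory | 1/1 | 100 | | |
| | | predatory | 0/1 | 0 | | |
| | | microscopic | 1/1 | 100 | | |
| | | abundant | 1/1 | 100 | | |
| | predicative adjectives (enriched version) | present | 2/2 | 100 | 80 | 80 |
| | | migratory | 1/1 | 100 | | |
| | | predatory | 0/1 | 0 | | |
| | | microscopic | 1/1 | 100 | | |
| | | sedentary | 1/1 | 100 | | |
| | hyponymic adjectives | soilborne | 5/6 | 83.33 | 16.67 | 20 |
| | | important | 0/3 | 0 | | |
| | | wide | 0/2 | 0 | | |
| | | fungal | 0/2 | 0 | | |
| | | plant-feeding | 0/1 | 0 | | |
| fungicide | attributive adjectives | new | 27/27 | 100 | 100 | 100 |
| | | synthetic | 20/20 | 100 | | |
| | | foliar | 16/16 | 100 | | |
| | | systemic | 15/15 | 100 | | |
| | | modern | 8/8 | 100 | | |
| | predicative adjectives (basic version) | available | 6/11 | 54.55 | 37.58 | 60 |
| | | effective | 1/3 | 33.33 | | |
| | | important | 2/2 | 100 | | |
| | | economical | 0/2 | 0 | | |
| | | miniscule | 0/1 | 0 | | |
| | predicative adjectives (enriched version) | available | 6/11 | 54.55 | 40.91 | 60 |
| | | important | 2/2 | 100 | | |
| | | effective | 1/2 | 50 | | |
| | | economical | 0/2 | 0 | | |
| | | miniscule | 0/1 | 0 | | |
| | hyponymic adjectives | effective | 1/2 | 50 | 30 | 40 |
| | | powdery | 0/1 | 0 | | |
| | | noncross-resistant | 0/1 | 0 | | |
| | | interventional | 1/1 | 100 | | |
| | | downy | 0/1 | 0 | | |
| crop | attributive adjectives | important | 267/267 | 100 | 99.79 | 100 |
| | | annual | 200/201 | 99.5 | | |
| | | major | 183/183 | 100 | | |
| | | perennial | 178/179 | 99.44 | | |
| | | main | 147/147 | 100 | | |
| | predicative adjectives (basic version) | susceptible | 14/16 | 87.5 | 87.83 | 100 |
| | | resistant | 15/15 | 100 | | |
| | | sensitive | 14/14 | 100 | | |
| | | vulnerable | 11/12 | 91.67 | | |
| | | important | 6/10 | 60 | | |
| | predicative adjectives (enriched version) | susceptible | 15/18 | 83.33 | 96.67 | 100 |
| | | resistant | 17/17 | 100 | | |
| | | sensitive | 14/14 | 100 | | |
| | | vulnerable | 12/12 | 100 | | |
| | | important | 10/10 | 100 | | |
| | hyponymic adjectives | agricultural | 28/29 | 96.55 | 79.93 | 100 |
| | | organic | 13/20 | 65 | | |
| | | cultural | 15/15 | 100 | | |
| | | agronomic | 6/9 | 66.67 | | |
| | | biological | 5/7 | 71.43 | | |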
Article
Increasingly, computer technologies in linguistics offer their advanced tools to process, store and select language data, which has triggered the fast development of the actual branch of linguistic studies – corpus linguistics. Through the use of large-scale empirical data and advanced computer technologies to reach objective insights into language function, linguistic corpora have quickly become invaluable resources. Data obtained through corpus analysis facilitate the drawing of qualitatively new conclusions about language and highlight research directions that previously received little attention. Despite substantial linguistic work on Dmytro Dontsov’s writings, there still exist a number of questions not studied so far. Among these are compiling a linguistic language corpus as well as a concordance of this prominent thinker, public figure, publicist utilizing modern methods to calculate and fix lexemes in order to identify their attributive collocation. Any available means of programming, facilitating processing language material of a big mass, have been realized to be an undoubtedly productive way to properly perform conceptual analysis and to serve as an additional tool for studying. The purpose of the study is to determine the peculiarities of the unique features of Dmytro Dontsov’s maxims in the aspect of corpus linguistics and lexicography on the basis of the corpus space that forms the linguistic vision of the world and is a source of creating lexicographical pieces of work (concordances, a writing language, a writer language, etc.). We emphasize that the creation of Dmytro Dontsov’s writing concordance has become the subject of corpus linguistics study for the first time. The subject of our study is the writings of Dmytro Dontsov, their lexicographical parametrization which provides all possible words with their description (phonetic, word-creating, grammatical), along with quantitative indices, i.e. 
it is the result of learning many linguistic disciplines, unified by the dictionary. The methodology of study is the combination of general theoretical methods (analysis, generalization, explanation) with the applied methods of linguistics. Analysis of the studied conception is grounded in “The Spirit of Our Antiquity” text corpus compiled by using the Sketch Engine program. On the basis of the analysis, it is found out that an electronic corpus provides the opportunity to accelerate language study and increase its effectiveness, probability and checkability significantly. The article reveals heuristic potential, practical effectiveness of the corpus and application of concordant technologies in conceptual studies. It was discovered that, the construction of the full concordance of the writings of Dmytro Dontsov will enable showing the picture of the world on the basis of learning the author’s lexical wealth and reproducing his understanding of the political situation. The created concordance is a stage of forming lexicographical works about Dontsov, which provides understanding of stages, methods, principles and peculiarities of compiling Dmytro Dontsov’s writing language dictionary. The multifaceted study of D. Dontsov’s writings is believed to be of importance. The concordance serves as a new material, important for politicians, journalists, teachers, students, for it is an entrance in a system of new words, filling with new ideas, and possibility to perceive the world through the nation-state aspirations of a great thinker. Realization of his project will help to create a political-intellectual product of great importance to offer bright prospects for future linguistic study, whose tasks are supposed to use linguistic material.
Conference Paper
Full-text available
Word sketches are a powerful function of Sketch Engine that automatically summarizes the most common usage patterns of a search word in a corpus. While they have proven to be a valuable tool for collocational analysis in both general and specialized language, their potential for the extraction of terminological knowledge is yet to be fully realized. To address this, we introduce a novel semantic sketch grammar designed to extract the agent-patient relation, an important yet understudied relation. This paper presents the various stages of developing the rules that compose this sketch grammar as well as the evaluation of their precision. The errors identified during the evaluation process are also analyzed to guide future improvements. The sketch grammar is available online so that any user can apply it to their own corpora in Sketch Engine.
Article
Terminological conceptual analysis can be applied to purposes beyond terminology work. This article presents a Frame-based Terminology approach adapted to analyse concepts and inform the content of entries in the Humanitarian Encyclopedia. It proposes a method for conceptual analysis by systematising the extraction of knowledge rich contexts (KRCs) around corpus querying tasks through semantic sketch grammars (SSGs) and macros with knowledge patterns (KPs). KRCs are curated manually, modelled into conceptual propositions, and combined with corpus metadata into unified datasets. The method was tested on epidemic and coronavirus and their results are presented. This study provides a preliminary model to operationalise the study of conceptual variation. It also identifies the areas of terminological conceptual analysis with the potential to be informed by other research methods towards creating a standalone methodology.
Conference Paper
Full-text available
Hyponymy is the cornerstone of taxonomies and concept hierarchies. However, the extraction of hypernym-hyponym pairs from a corpus can be time-consuming, and reconstructing the hierarchical network of a domain is often an extremely complex process. This paper presents the development and evaluation of the French EcoLexicon Semantic Sketch Grammar (ESSG-fr), a French hyponymic sketch grammar for Sketch Engine based on knowledge patterns. It offers a user-friendly way of extracting hyponymic pairs in the form of word sketches in any user-owned corpus. The ESSG-fr contains three times more hyponymic patterns than its English counterpart and has been tested in a multidisciplinary corpus. It is thus expected to be domain-independent. Moreover, the following methodological innovations have been included in its development: (1) use of English hyponymic patterns in a parallel corpus to find new French patterns; (2) automatic inclusion of the results of the Sketch Engine thesaurus to find new variants of the patterns. As for its evaluation, the ESSG-fr returns 70% valid hyperonyms and hyponyms, measured on 180 extracted pairs of terms in three different domains.
Conference Paper
Full-text available
The EcoLexicon English Corpus (EEC) is a 23.1-million-word corpus of contemporary environmental texts. It was compiled by the LexiCon research group for the development of EcoLexicon (Faber, León-Araúz & Reimerink 2016; San Martín et al. 2017), a terminological knowledge base on the environment. It is available as an open corpus in the well-known corpus query system Sketch Engine (Kilgarriff et al. 2014), which means that any user, even without a subscription, can freely access and query the corpus. In this paper, the EEC is introduced by describing how it was built and compiled and how it can be queried and exploited, based both on the functionalities provided by Sketch Engine and on the parameters in which the texts in the EEC are classified.
Conference Paper
Many projects have applied knowledge patterns (KPs) to the retrieval of specialized information. Yet terminologists still rely on manual analysis of concordance lines to extract semantic information, since there are no user-friendly publicly available applications enabling them to find knowledge-rich contexts (KRCs). To fill this void, we have created the KP-based EcoLexicon Semantic Sketch Grammar (ESSG) in the well-known corpus query system Sketch Engine. For the first time, the ESSG is now publicly available in Sketch Engine to query the EcoLexicon English Corpus. Additionally, reusing the ESSG in any English corpus uploaded by the user enables Sketch Engine to extract KRCs codifying generic-specific, part-whole, location, cause and function relations, because most of the KPs are domain-independent. The information is displayed in the form of summary lists (word sketches) containing the pairs of terms linked by a given semantic relation. This paper describes the process of building a KP-based sketch grammar with special focus on the last stage, namely, evaluation for refinement purposes. We conducted an initial shallow precision and recall evaluation of the 64 English sketch grammar rules created so far for hyponymy, meronymy and causality. Precision was measured based on a random sample of concordances extracted from each word sketch type. Recall was assessed based on a random sample of concordances where known term pairs are found. The results are necessary for the improvement and refinement of the ESSG. The noise of false positives helped to further specify the rules, whereas the silence of false negatives allowed us to find useful new patterns.
Conference Paper
Despite advances in computer technology, terminologists still tend to rely on manual work to extract all the semantic information that they need for the description of specialized concepts. In this paper we propose the creation of new word sketches in Sketch Engine for the extraction of semantic relations. Following a pattern-based approach, new sketch grammars are developed in order to extract some of the most common semantic relations used in the field of terminology: generic-specific, part-whole, location, cause and function.
Thesis
Definitions are one of the most important components of any quality terminological resource and a privileged means of representing knowledge, since they offer a direct natural-language explanation of the content of a concept. The adequacy of the definitions largely determines the overall usefulness of the resource for the user. The motivation for this study stems from the observation that terminological definitions often fail to meet users' needs. In this doctoral thesis, we apply premises of cognitive linguistics (Lakoff 1987; Langacker 1987; Croft and Cruse 2004; Evans and Green 2006, inter alia) to the terminological definition and present a proposal called the flexible terminological definition. It consists of a system of definitions of the same concept, made up of a general definition (in our case, one that encompasses the entire environmental domain) together with additional definitions that describe the concept specifically from the standpoint of the different subdomains in which it is relevant. Within cognitive linguistics, our proposal is rooted mainly in Frame-Based Terminology (Faber et al. 2006, 2009; León Araúz 2009; Faber 2012, 2014), as well as in theories of grounded cognition (Barsalou 1993, 1999, 2003), frame semantics (Fillmore 1976, 1977, 1982; Fillmore and Atkins 1992), prototype theory (Rosch 1975; Rosch 1978; Rosch and Mervis 1975; Rosch et al. 1976) and the theory theory (Murphy and Medin 1985; Murphy 1993, 2000).
Since cognitive linguistics shows that context is a determining factor in the construction of the meaning of any lexical unit, terminological units included, we assume that the terminological definition can and should reflect the effects of context, even though the definition has traditionally been understood as the expression of meaning stripped of contextual effects. The main objective of this doctoral thesis is to analyze the effects of contextual variation on specialized environmental concepts with a view to representing them in the terminological definition. In particular, we focus on contextual variation based on thematic constraints, that is, on how the different knowledge areas that make up the vast environmental domain conceptualize the same concepts differently and how this can be reflected in the definition. To achieve these objectives, an empirical study was carried out consisting of the analysis of a set of contextually variable concepts and the drafting of the flexible definition of two of them, each of which presented different contextual characteristics. As a result of the first part of our empirical study, we divided our notion of domain-dependent contextual variation into three different phenomena (inspired by Cruse [2011]): modulation, perspectivization and subconceptualization. All concepts undergo modulation, some are also perspectivized and, finally, a small number of concepts undergo subconceptualization. In the second part, we applied these notions to the terminological definition and showed how to build flexible definitions, from knowledge extraction to the writing of the definition itself.
This doctoral thesis contributes to improving the quality of terminological definitions because our approach provides the user with a definition adapted to the domain of their choice, thus multiplying the likelihood that the definition offers the information they need. Moreover, flexible terminological definitions provide a knowledge representation that resembles the human conceptual system more closely than traditional definitions do. A flexible definition thus not only provides more relevant information, but does so in a way that facilitates and potentially improves knowledge acquisition.
Chapter
Terminology work involves the collection, analysis and distribution of terms. This is essential for a wide range of activities, such as technical writing and communication, knowledge acquisition, specialized translation, knowledge resource development and information retrieval. However, these activities cannot be performed randomly, but should be based on a systematic set of theoretical principles that reflect the cognitive and linguistic nature of terms as access points to larger knowledge configurations. “Frame-Based Terminology” (FBT) is a cognitive approach to terminology that is based on frame-like representations in the form of conceptual templates underlying the knowledge encoded in specialized texts (Faber 2011, 21; 2012; Faber et al. 2007, 42). FBT frames can be regarded as situated knowledge structures and are linguistically reflected in the lexical relations codified in terminographic definitions. These frames are the context in which FBT specifies the semantic, syntactic and pragmatic behaviour of specialised language units. They are based on the following set of micro-theories: (1) a semantic micro-theory; (2) a syntactic micro-theory and (3) a pragmatic micro-theory. Each micro-theory is related to the information encoded in term entries, the relations between specialised knowledge units and the concepts that they designate. Keywords: Terminology theory; Cognitive semantics; Concept modelling; Frames
Article
This paper introduces the Word Sketch: a collocation-based resource of proven value for English lexicography. Issues involving the automatic extraction and presentation of salient collocations are discussed. It is further shown how the combination of significant patterns may lead to even greater precision in the identification of collocations.
Article
To formulate definitions that meet user needs, terminologists and specialised lexicographers must know how to effectively select information. However, most definition writing guidelines are based on the specification of necessary and sufficient characteristics, which has serious drawbacks because it downplays the role of context (understood as any factor affecting how a term is interpreted) in specialised meaning construction. This paper focuses on thematic variation, an important type of contextual variation, and its representation in terminological definitions. To this end, this paper presents a corpus-based approach to writing definitions that takes into account thematic variation in the selection of information.
Article
As a novel attack on the perennially vexing questions of the theoretical status of thematic roles and the inventory of possible roles, this paper defends a strategy of basing accounts of roles on more unified domains of linguistic data than have been used in the past to motivate roles, addressing in particular the problem of ARGUMENT SELECTION (principles determining which roles are associated with which grammatical relations). It is concluded that the best theory for describing this domain is not a traditional system of discrete roles (Agent, Patient, Source, etc.) but a theory in which the only roles are two cluster-concepts called PROTO-AGENT and PROTO-PATIENT, each characterized by a set of verbal entailments: an argument of a verb may bear either of the two proto-roles (or both) to varying degrees, according to the number of entailments of each kind the verb gives it. Both fine-grained and coarse-grained classes of verbal arguments (corresponding to traditional thematic roles and other classes as well) follow automatically, as do desired 'role hierarchies'. By examining occurrences of the 'same' verb with different argument configurations—e.g. two forms of psych predicates and object-oblique alternations as in the familiar spray/load class—it can also be argued that proto-roles act as defaults in the learning of lexical meanings. Are proto-role categories manifested elsewhere in language or as cognitive categories? If so, they might be a means of making grammar acquisition easier for the child, they might explain certain other typological and acquisitional observations, and they may lead to an account of contrasts between unaccusative and unergative intransitive verbs that does not rely on deriving unaccusatives from underlying direct objects.