64 “Lexicography and Language Documentation”
PROCEEDINGS OF ASIALEX 2021
ADAPTING WORD SKETCHES FOR SPECIALIZED KNOWLEDGE EXTRACTION
Antonio San Martín, Catherine Trekker
University of Quebec in Trois-Rivières, Canada
antonio.san.martin.pizarro@uqtr.ca; catherine.trekker-seguin@uqtr.ca
Abstract
Word sketches (WSs) in Sketch Engine have become a basic tool in terminology work. They bring out
patterns of term behavior that would be too time-consuming to identify manually. Most default English
WS columns in Sketch Engine extract words with a frequent syntactic relationship with the search word
in a corpus (e.g., the nouns usually functioning as the subject of a given verb). The usefulness of the
default syntactic WSs for collocational analysis is evident, but their contribution to specialized knowledge
extraction is less straightforward.
This paper presents a work in progress consisting of adapting the default WSs for specialized knowledge
extraction. In previous work, we developed the contextonymic WS and semantic WSs, which specifically
target specialized knowledge extraction. This paper explores two changes to the default WSs. The first change
enables WSs to extract nouns functioning as subject and object in the same sentence (e.g., fertilizer>yield:
fertilizer increases yield; fertilizer improves yield), which usually corresponds to an agent-patient relation.
The other change concerns the extraction of the adjectives that modify a noun. This involves modifications
to two of the existing WS columns that extract adjectives and the addition of a new type.
We evaluated the precision of these adaptations in a specialized corpus of English texts on Agronomy.
Additionally, we compared their output with terminological definitions of a set of terms to assess their
usefulness for specialized knowledge extraction. The results indicate that WS columns of nouns functioning
as subject and object in the same sentence are sufficiently accurate and potentially useful for specialized
knowledge extraction. However, the results for the adjectival WS columns are inconclusive.
Keywords: word sketches, corpus analysis, specialized knowledge extraction, Sketch Engine
1 Introduction
One of the most useful features of Sketch Engine (https://www.sketchengine.eu/) (Kilgarriff et al., 2014)
is the generation of word sketches (WSs), which have become a basic tool in terminology work. A WS is
a one-page summary of a search word’s most common usage patterns in a given corpus. It lists the words
that are syntactically related to the search word in the corpus and includes a link to the corresponding
concordances. Some examples of WS columns are the verbs having the search word as subject or object,
the modifiers of the search word, or the words that the search word modifies (Figure 1).
Figure 1. Default WS columns of maple in the enTenTen18 corpus
Since these word behavior patterns are too time-consuming to identify manually, WSs significantly facilitate
collocational analysis. However, corpus analysis is useful for extracting not only linguistic information
but also conceptual knowledge. This is especially true when it comes to elaborating terminological
definitions (in which the conceptual content that terms convey is described) or building conceptual networks
(in which concepts are interconnected through conceptual relations). For these tasks, the usefulness of the
WSs that Sketch Engine generates by default is less straightforward.
In previous work, we proposed new types of WS specifically developed for specialized knowledge
extraction: (i) the contextonymic WS (see 1.2.1) and (ii) semantic WSs (see 1.2.2). This paper presents a work in
progress consisting of adapting the default English WS for the same purpose. More specifically, we explore
the creation of two WS columns that extract the relation between the subject and the object of the same
sentence, as well as the grouping of different columns that extract the adjectives that qualify a noun. This adapted
version of the default WS would eventually become part of a single WS that specifically targets specialized
knowledge extraction along with the contextonymic WS and the semantic WS.
The rest of the article is organized as follows. In the remainder of this section, we explain how WSs
are generated and describe our previous work on creating WSs for specialized knowledge extraction.
Section 2 focuses on how the new WS columns were developed and evaluated. Section 3 presents the
evaluation results. Finally, in Section 4, we analyze the results and draw some conclusions.
1.1 Word sketch generation
To generate WSs, Sketch Engine matches patterns in the form of rules expressed in Corpus Query Language (CQL)
(Jakubíček et al., 2010). A CQL rule is composed of tokens in the form of attributes (part-of-speech
tag, lemma, word form, etc.) and values combined with regular expressions. For instance, the rule
“[tag="J.*"]{2} [lemma="technology"]” captures all the instances of technology preceded by two
adjectives. Figure 2 reproduces some matching concordances in our specialized English corpus on
Agronomy (see section 2 for details on the corpus).
Figure 2. Concordances illustrating the rule “[tag="J.*"]{2} [lemma="technology"]”
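To make the matching logic of this rule concrete, the following is a minimal Python sketch that mimics it over a list of POS-tagged tokens. It is only an illustration, not how Sketch Engine evaluates CQL internally. The Token type, the function name, the Penn-Treebank-style tags (JJ, NNS, etc.), and the example sentence are all assumptions introduced here.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str  # assumption: Penn-Treebank-style tags such as JJ, NN, VBP


def match_two_adjectives_then_lemma(tokens, lemma="technology"):
    """Return the start index of each match of [tag="J.*"]{2} [lemma=<lemma>]."""
    adjective = re.compile(r"J.*")
    hits = []
    for i in range(len(tokens) - 2):
        first, second, head = tokens[i], tokens[i + 1], tokens[i + 2]
        # The regular expression must match the whole tag value.
        if (adjective.fullmatch(first.tag)
                and adjective.fullmatch(second.tag)
                and head.lemma == lemma):
            hits.append(i)
    return hits


# Invented example sentence (not taken from the Agronomy corpus).
sentence = [
    Token("Farmers", "farmer", "NNS"),
    Token("adopt", "adopt", "VBP"),
    Token("new", "new", "JJ"),
    Token("agricultural", "agricultural", "JJ"),
    Token("technologies", "technology", "NNS"),
]
hits = match_two_adjectives_then_lemma(sentence)
print(hits)
```

Because the condition on the last token is stated on the lemma, the plural form "technologies" is matched as well.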
For WS generation, the rules intended to capture the same syntactic relation are grouped into a gramrel (for
“grammatical relation”). For instance, to identify the relation between verbs and their objects, the gramrel
“objects of X/verbs with X as object” (included in the default sketch grammar) is composed of three rules
(Figure 3).
The set of gramrels that produce a WS constitutes a sketch grammar (Figure 4). For example, the default
English sketch grammar contains 40 rules organized into 25 gramrels. The number of WS columns can be
greater than the number of gramrels because a dual gramrel produces two columns (e.g., “objects of X/
verbs with X as object”, which results in one column for the verbs, and another for the nouns).
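The relation between rules, gramrels, and WS columns can be pictured with a small data structure. The Python sketch below is a hedged illustration only: the Gramrel class and the summarize function are hypothetical names, the rule strings are placeholders rather than real CQL, and apart from the three rules of the object gramrel mentioned above, the rule counts are invented.

```python
from dataclasses import dataclass


@dataclass
class Gramrel:
    name: str
    rules: list          # placeholder strings standing in for CQL rules
    dual: bool = False   # a dual gramrel yields two WS columns

    def column_count(self):
        return 2 if self.dual else 1


def summarize(grammar):
    """Count gramrels, rules, and resulting WS columns in a sketch grammar."""
    return {
        "gramrels": len(grammar),
        "rules": sum(len(g.rules) for g in grammar),
        "columns": sum(g.column_count() for g in grammar),
    }


# Toy grammar: only the three rules of the object gramrel come from the text;
# the other entries are invented for illustration.
toy_grammar = [
    Gramrel("objects of X/verbs with X as object",
            ["rule 1", "rule 2", "rule 3"], dual=True),
    Gramrel("hypothetical dual gramrel", ["rule 1"], dual=True),
    Gramrel("hypothetical unary gramrel", ["rule 1", "rule 2"]),
]
summary = summarize(toy_grammar)
print(summary)
```

In this toy grammar, three gramrels yield five columns, which shows why the column count can exceed the gramrel count.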
Since sketch grammars are text files containing CQL rules grouped in gramrels, it is possible to modify
or expand them by integrating new rules or adapting or deleting existing ones. Users can compile their
own corpora with the sketch grammar of their choice in Sketch Engine. This allows the creation of sketch
grammars adapted to different corpus needs.
1.2 Sketch grammars for specialized knowledge extraction
The default sketch grammar in Sketch Engine is mainly based on syntactic co-occurrence. In other words,
it lists words that appear in the same context as the search word and which maintain a syntactic relationship
with it (Evert, 2009, p. 1222). This type of co-occurrence is of great importance for collocational analysis,
and WSs were designed with this end in mind (Kilgarriff & Tugwell, 2001). The usefulness of syntactic co-
occurrence for specialized knowledge extraction is less straightforward because the relevance of syntactic
relations for conceptual analysis varies. For instance, the WS listing the modifiers of a noun may include,
among others, adjectives indicating entrenched hyponyms (e.g., “urban farmer”, “rural farmer”) or possible
features (e.g., “risk-averse farmer”, “successful farmer”), as well as adjectives that are of no interest for
conceptual analysis (e.g., “other farmer”, “same farmer”). Syntactic co-occurrence can therefore be exploited
for specialized knowledge extraction, but the default English sketch grammar needs to be specially adapted
for that purpose.
Figure 3. “Objects of X/verbs with X as object” gramrel, its resulting WS columns and concordances from the enTenTen18 corpus
Syntactic co-occurrence lies halfway along a continuum, with surface co-occurrence (the least constraining
kind) at one end and semantic co-occurrence (the most constraining kind) at the other end. Surface
co-occurrence, on which the contextonymic sketch grammar (see 1.2.1) is based, occurs when two words
appear in the same context without the need for any syntactic or semantic relationship (Evert, 2009, p.
1215). For instance, in “Because glyphosate is systemic, excess residue levels can persist…”, glyphosate
and residue would be surface co-occurrents (or contextonyms) (as well as glyphosate and because, is,
systemic, etc.), even if they do not establish a direct syntactic or semantic relation. As for semantic
co-occurrence, two words are said to co-occur if a semantic relationship is established between them in a
given context (e.g., hyponymy, meronymy, cause, etc.). For instance, in “Glyphosate is the only herbicide
that kills…”, glyphosate and herbicide are semantic co-occurrents because there is a hyponymic relation
between them in that context.
Figure 4. Example of the structure of a sketch grammar
The boundary between these three types of co-occurrence is fuzzy. Surface co-occurrence is generally
based on a window of tokens. In contrast, syntactic and semantic co-occurrences are detected by patterns.
Although the proposed adaptation of the default WSs in this paper is mostly based on syntactic co-
occurrence, some semantic components are also introduced. Before describing the proposed adaptations to
the default grammar, we briefly present the contextonymic WS and the semantic WSs since the proposed
modifications complement both.
1.2.1 Contextonymic WS
Extracting the contextonyms of a term can help to determine its semantic features (San Martín, in press).
The contextonymic WS was developed for extracting specialized knowledge for definition writing (San
Martín, 2016). Contextonym extraction can be based on various parameters (window span, exclusion of
certain parts of speech, etc.). The current version of the contextonymic sketch grammar contains one gramrel
that defines the contextonym of a word as any verb, noun, or adjective before or after the search word with
zero to 44 words between them beyond sentence or paragraph limits. It also excludes certain very common
lemmas (e.g., be, have, etc.) that do not convey significant semantic features of the search word.
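The window-based definition above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the CQL gramrel itself: tokens are given as (lemma, tag) pairs with Penn-Treebank-style tags, the excluded lemmas are limited to the two examples named in the text, and the function name and tagging of the example sentence are hypothetical.

```python
from collections import Counter

CONTENT_TAG_PREFIXES = ("N", "V", "J")  # assumption: nouns, verbs, adjectives
EXCLUDED_LEMMAS = {"be", "have"}        # only the two examples given in the text


def contextonyms(tokens, search_lemma, max_gap=44):
    """Count content lemmas with at most max_gap tokens between them and the search word."""
    counts = Counter()
    positions = [i for i, (lemma, _) in enumerate(tokens) if lemma == search_lemma]
    for pos in positions:
        lo = max(0, pos - max_gap - 1)
        hi = min(len(tokens), pos + max_gap + 2)
        for j in range(lo, hi):
            if j == pos:
                continue
            lemma, tag = tokens[j]
            if tag.startswith(CONTENT_TAG_PREFIXES) and lemma not in EXCLUDED_LEMMAS:
                counts[lemma] += 1
    return counts


# "Because glyphosate is systemic, excess residue levels can persist"
# (the tagging is an assumption made for this illustration).
sentence = [
    ("because", "IN"), ("glyphosate", "NN"), ("be", "VBZ"), ("systemic", "JJ"),
    ("excess", "JJ"), ("residue", "NN"), ("level", "NNS"), ("can", "MD"),
    ("persist", "VB"),
]
wide = contextonyms(sentence, "glyphosate")
narrow = contextonyms(sentence, "glyphosate", max_gap=2)
print(wide)
print(narrow)
```

Narrowing the window to two intervening tokens keeps only systemic and excess, which illustrates how the window span parameter controls the output.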
The contextonymic sketch grammar is useful for specialized knowledge extraction because it provides
terms that are closely related to the search word, which are sometimes not captured by other WSs. For
example, by consulting the contextonyms of fungicide in our corpus, it is possible to deduce that important
semantic features of the term are that fungicide application allows the control of certain diseases in crops,
but some pathogens can develop resistance to fungicides. Figure 5 reproduces concordance lines that illustrate
the relation between fungicide and its first five contextonyms.
Figure 5. Contextonymic WS and concordances of fungicide in our Agronomy corpus
With the contextonymic sketch grammar, it is usually necessary to consult the corresponding concordances
to discover how each contextonym relates to the search word. This disadvantage is compensated for by the
fact that this WS yields valuable results even in smaller corpora.
1.2.2 Semantic sketch grammar
Semantic co-occurrence is based on knowledge patterns, which are lexico-syntactic patterns that match
contexts in which a specific semantic relation is conveyed (Meyer, 2001, p. 281). Examples of
knowledge patterns are “X and other Y” (e.g., “manure and other fertilizers”), which encodes a hyponymic
relation (manure is a type of fertilizer), and “X contains Y” (e.g., “fertilizers contain urea”), which encodes
a meronymic relation (urea is a part of fertilizer).
The EcoLexicon Semantic Sketch Grammar (ESSG) (http://ecolexicon.ugr.es/essg/) (León-Araúz et al.,
2016; León-Araúz & San Martín, 2018) encodes knowledge patterns that capture hyponymy, meronymy,
cause, function, and location relations in English. There is a French version that at the moment only
includes hyponymy (San Martín et al., 2020). Figure 6 shows an example of each of the columns of the
ESSG in English extracted from the EcoLexicon corpus (León-Araúz et al., 2018). This corpus is available
to any Sketch Engine user and comes compiled with the ESSG.
The ESSG has the advantage of clearly identifying the semantic relationship linking the terms, but it
returns fewer results than other types of WS and requires large corpora to yield useful results.
2 Method
This adaptation of the default English sketch grammar currently envisages the following: (i) creation
of new gramrels; (ii) splitting and merging of gramrels; (iii) modification of gramrels; (iv) suppression of
gramrels; and (v) a combination of these strategies. This paper focuses on the creation and evaluation of a
new gramrel called “X is the proto-agent of…/X is the proto-patient of…” as well as the splitting, merging,
and modification of the gramrels “modifiers of X” and “adjective predicates of X”.
For this purpose, we applied a modified version of the methodology for creating knowledge-pattern-based
sketch grammars (San Martín et al., 2020, p. 5954). This methodology is based primarily on the
iterative refinement and evaluation of CQL rules. The corpus (7,249,297 words) that we used consisted of
specialized texts on Agronomy from the following sources:
- 36.7%: theoretical and practical documents on Agronomy published by the Food and
Agriculture Organization (FAO) and various national and regional governments in English-speaking
countries.
- 30.1%: specialized monographs and encyclopedias on Agronomy.
- 22.6%: scientific articles from the International Journal of Agronomy.
- 10.6%: articles from Wikipedia, manually verified to belong to the field of Agronomy.
Figure 6. Sample of semantic WSs extracted with the ESSG from the EcoLexicon English corpus
In the early stages of gramrel development, the emphasis is on the evaluation of individual rules. Evaluation
is performed by querying the rule in a corpus compiled in Sketch Engine and ascertaining whether the rule
extracts the expected results without generating noise. This type of precision is evaluated in small samples
(usually 100 random lines) to verify that the modifications in the rules produce the expected result. Since
evaluating recall would slow down the process, the total number of matches extracted by the rule is taken as
a proxy for recall (i.e., the greater the number of matches, the greater the recall).
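The sampling step described above can be sketched as follows. This is a minimal illustration under assumptions: the concordance lines and the manual judgments are invented, the function names are hypothetical, and a fixed random seed is used only so that the sample is reproducible.

```python
import random


def sample_lines(lines, k=100, seed=0):
    """Draw up to k concordance lines at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))


def precision(judgments):
    """Share of sampled lines judged correct (True) by the evaluator."""
    return sum(judgments) / len(judgments) if judgments else 0.0


# Invented stand-ins for the concordance lines matched by a rule.
lines = [f"concordance line {i}" for i in range(500)]
sample = sample_lines(lines)

# Invented manual judgments for the 100 sampled lines: 87 correct, 13 noisy.
judgments = [True] * 87 + [False] * 13

print(len(sample), precision(judgments))
print("recall proxy (total matches):", len(lines))
```

The total number of matches (here the length of the invented list) plays the role of the recall proxy, so two versions of a rule can be compared on both figures.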
For sketch grammars, the precision of rules is less important than recall. Since users access results in
the form of WS (lists of results ordered by frequency or association score), exceptions, errors, and other
noisy results tend to be relegated to the bottom of WS lists. More frequent and significant results tend to
appear at the top. For this reason, during the development of gramrels, it is also important to periodically
test the resulting WS, even though this is more time-consuming because the sketch grammar needs to be
previously compiled. Accordingly, this research study evaluated our two adaptations of the default sketch
grammar on the basis of the results in WS form (see section 2.3 for the evaluation methodology).
The gramrels can be downloaded at <https://uqtr.ca/knowledge-sketch-grammar/>. Instructions on how to
use them in Sketch Engine are also available at that address.
2.1 The proto-agent-patient gramrel
The organization of specialized domains is based on events in which the interaction between different
types of agents and patients plays a predominant role (Faber, 2015, p. 23). However, it is not currently
possible to extract the agent-patient relation in Sketch Engine in a user-friendly way. For this reason, we
developed a new gramrel that extracts the relation between the nouns functioning as subject and object
in the same sentence (e.g., farmer and crop in Figure 7). This syntactic relation is useful for specialized
knowledge extraction because the subject usually accomplishes an action that affects the object in some
way. In terms of semantic roles, the former is normally characterized as the agent, but depending on the
verb, it can also be an experiencer or an instrument, among other roles. The object can typically be labeled
as the patient, but also as the theme or the recipient, among other roles. Dowty (1991) groups these semantic
roles into two macroroles: proto-agent and proto-patient. Consequently, this gramrel is called “X is the
proto-agent of…/X is the proto-patient of…” (proto-agent-patient gramrel).
Figure 7. Concordances with farmer as subject and crop as object in our Agronomy corpus
The first step consisted in creating a basic version of the gramrel by combining the two default gramrels
“objects of X/verbs with X as object” (object gramrel) and “subjects of X/verbs with X as subject”
(subject gramrel). New rules are thus created by combination instead of grouping the corresponding rules
in a single gramrel. The active-voice rules are combined in a new rule, whereas the passive-voice rules are
merged into another rule (Figure 8).
Figure 8. Combination of subject and object rules into the proto-agent-patient rules
This basic version was used as a benchmark in the evaluation. Since it only returned 80,814 matches in our
corpus, it was enriched and refined to increase recall. Some of the changes that allowed us to increase recall
without compromising precision included the following:
- Optional modal verbs (will, can, must, etc.): “…any ammonium-containing fertilizer will
ultimately decrease soil pH…”.
- Additional optional auxiliary verbs: “…wind erosion is causing significant soil loss…”.
- Possibility of certain subordinate structures (is capable of…, have the advantage/ability/… of/to,
seems/appears/… to…, is used/designed/intended to…, etc.): “…parasitic nematodes are capable of
causing plant diseases...”, “…cover crops are used to improve the soil structure...”.
We also created a new rule that captures the subject-object relation that could not be derived from the
subject and object gramrels: PROTO-PATIENT that PROTO-AGENT affects (and variants). This rule
matches concordances such as “…nitrogen that the crop roots can take up and use” or “… management
practices that farmers adopt focus on herbicides”.
Once these changes were applied, the resulting rules were evaluated and three limitations were identified:
phrasal verbs, multiple nouns in subject or object position, and verbs that either do not convey a
proto-agent-patient relation or invert the order (i.e., the subject is the proto-patient and the object is the
proto-agent).
Regarding phrasal verbs, the basic version does not allow the presence of a preposition between the verb
and the object. Instead, Sketch Engine’s default sketch grammar displays phrasal verbs in specific columns
(Figure 9). Since not including them in the proto-agent-patient gramrel reduces recall, we allowed up
to two optional prepositions between the verb and the object to retrieve concordances such as “…when
the crop takes up most of the nitrogen…” or “…microbes break down organic matter”. Even though this
occasionally generates noise (e.g., “…goods travel from manufacturers to distributors…”), preliminary
tests showed that the increase in recall compensated for it.
Figure 9. Phrasal verbs WS columns of plant in the enTenTen18 corpus
A second problem was that WS results can only be single words. The subject and object gramrels only
capture the last noun in noun compounds (e.g., “…nitrogen applications disturb the soil”). In the case of
noun compounds linked by a preposition (e.g., “Rotation of crops increases the production of biomass...”)
and enumerations (e.g., “Cattle digestion, fertilizers and animal wastes cause emissions...”), only the
noun closest to the verb is captured. These limitations are justified in that they protect the precision of
the rules. However, they limit recall because the noun compound head is not always detected (e.g., “The
pollution of surface and groundwater that produces serious health problems…”). Likewise, some nouns do
not occupy the subject or object head position but semantically could act as proto-agent or proto-patient.
For instance, in “…nitrogen applications disturb the soil”, the head of the subject is applications and is,
therefore, the direct proto-agent. However, nitrogen is indirectly a proto-agent as well.
Therefore, we modified the rules to capture any noun (whether head or modifier) in a nominal compound or
an enumeration. This included both noun compounds without a preposition (e.g., “hydrocarbon pesticide
residue”) and those linked by the preposition of (e.g., “fixation of nitrogen”). No other prepositions were
included at this point because preliminary evaluations showed that they were an important source of noise.
Additionally, we enabled the rule to capture all the nouns in enumerations either in subject or object
position (e.g., “These fungi infect many cereals, grasses and other plants...”). We also included
enumerations with hyponymic formulas (e.g., “…organisms such as fungi and nematodes can damage…”).
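The effect of capturing every noun in compounds and enumerations can be pictured as pairing each subject-side noun with each object-side noun. The short Python sketch below illustrates this under assumptions: the lemmatized noun lists are supplied by hand, and the function name is hypothetical. In the actual gramrel, this behavior results from the CQL rules, not from a separate pairing step.

```python
from itertools import product


def proto_pairs(subject_nouns, object_nouns):
    """Pair every noun on the subject side with every noun on the object side."""
    return list(product(subject_nouns, object_nouns))


# "...nitrogen applications disturb the soil"
compound_pairs = proto_pairs(["nitrogen", "application"], ["soil"])

# "Cattle digestion, fertilizers and animal wastes cause emissions..."
enumeration_pairs = proto_pairs(
    ["cattle", "digestion", "fertilizer", "animal", "waste"], ["emission"]
)

print(compound_pairs)
print(len(enumeration_pairs))
```

The second example also hints at the cost of the change: pairs such as animal and emission are retrieved alongside the more informative fertilizer and emission.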
The third limitation concerns the fact that the subject-object relation does not always correspond to the
proto-agent-patient relation. This was addressed by filtering certain verbs. A first group of verbs (invalidating
verbs) are those that do not convey the proto-agent-patient relation, for example, to be. This group also
includes verbs that convey certain relations already captured with the semantic WS, such as hyponymy
(e.g., include) or meronymy (e.g., have). The second group includes those that invert the common argument
order (inverting verbs). Table 1 includes both lists of verbs. While invalidating verbs were excluded from
all the rules, the rules were duplicated for inverting verbs. In one set of rules, the inverting verbs were
excluded, and in the other, the positions of the proto-agent and proto-patient were interchanged.
Table 1. Invalidating and inverting verbs
Invalidating verbs: accord, arise, be, become, belong, come, compete, compose, comprise, consist, contain,
define, exist, feature, follow, gain, happen, have, include, lack, lose, match, name, need, originate, range,
receive, refer, regard, relate, remain, require, stay, stem, survive, vary
Inverting verbs: depend (on), rest (on), lie (on), lie (in), result (from), suffer (from), rely (on/upon)
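The verb filtering can be illustrated with a small Python function that assigns proto-roles to an already extracted subject-verb-object triple. This is a sketch under assumptions: in the gramrel itself the filtering is built into the CQL rules (by excluding verbs and duplicating rules), whereas here it is applied afterwards to lemmatized triples. The function name and the example triples are hypothetical. The verb lists are those of Table 1.

```python
INVALIDATING_VERBS = {
    "accord", "arise", "be", "become", "belong", "come", "compete", "compose",
    "comprise", "consist", "contain", "define", "exist", "feature", "follow",
    "gain", "happen", "have", "include", "lack", "lose", "match", "name",
    "need", "originate", "range", "receive", "refer", "regard", "relate",
    "remain", "require", "stay", "stem", "survive", "vary",
}

# Inverting verbs with the prepositions listed in Table 1.
INVERTING_VERBS = {
    "depend": {"on"},
    "rest": {"on"},
    "lie": {"on", "in"},
    "result": {"from"},
    "suffer": {"from"},
    "rely": {"on", "upon"},
}


def proto_roles(subject, verb, obj, preposition=None):
    """Return (proto_agent, proto_patient), or None if the verb is invalidating."""
    if verb in INVALIDATING_VERBS:
        return None
    if preposition in INVERTING_VERBS.get(verb, ()):
        return (obj, subject)  # inverted argument order
    return (subject, obj)


# Invented example triples.
print(proto_roles("fertilizer", "increase", "yield"))
print(proto_roles("crop", "be", "plant"))
print(proto_roles("yield", "depend", "rainfall", preposition="on"))
```

A verb from the inverting list that is not followed by one of its listed prepositions is treated like any other verb in this sketch, which is a simplifying assumption.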
The enriched version of the gramrel returned 374,525 matches from our corpus, more than four times as many
as the basic version (80,814). It is composed of the following six rules (note that the verb affect represents
any verb with the above-mentioned exceptions):
- PROTO-AGENT affects PROTO-PATIENT (and variants)
- PROTO-PATIENT is affected by PROTO-AGENT (and variants)
- PROTO-PATIENT that PROTO-AGENT affects (and variants)
- PROTO-AGENT affects PROTO-PATIENT (and variants) (inverting verbs)
- PROTO-PATIENT is affected by PROTO-AGENT (and variants) (inverting verbs)
- PROTO-PATIENT that PROTO-AGENT affects (and variants) (inverting verbs)
2.2 Adjectival gramrels
Two WS columns in the default sketch grammar extract the adjectives that modify a given noun. The
“modifiers of X/nouns modified by X” gramrel (modifiers gramrel) retrieves the adjectives and nouns that
appear before a noun (e.g., “perennial crop”, “corn crop”). The “adjective predicates of X/subjects of be
X” gramrel (predicative adjectives gramrel) extracts an adjective placed after a noun even when separated
from it by to be (e.g., “crops resistant to herbicides”, “crops are tolerant”).
These columns have potentially useful features for building definitions or conceptual networks, namely
the ability to assign characteristics to a given concept. To adapt them to the extraction of specialized
knowledge, the modifiers gramrel was divided in two: one gramrel for adjective modifiers and another
for noun modifiers. This allowed us to explore whether a single column for all the adjectives that modify a
noun was useful. In another gramrel, all the nouns modifying another noun (either preceding the modified
noun or postposed with a preposition, e.g., “wheat production” and “production of wheat”) could also
be grouped. This paper only addresses adjectives from the point of view of a noun search word. In other
words, although the adjectives gramrel is dual (“adjectives of X/nouns modified by X”), we focus only on
the column that lists adjectives (“adjectives of X”).
This new adjectives gramrel is composed of three gramrels that will be tested separately: the attributive
adjectives gramrel, the predicative adjectives gramrel, and the hyponymic adjectives gramrel.
The attributive adjectives gramrel extracts the adjectives that precede the noun modified (e.g., “synthetic
fungicide”, “foliar fungicide”). This gramrel originates from the split of the modifiers gramrel. It is
composed of a single rule, which was changed to exclude the following adjectives that are not useful for
specialized knowledge extraction: most, least, many, other, more, less, such, able, unable, due, capable,
incapable, various, several, few, same, different. These adjectives were also excluded from the other
adjectival gramrels. This gramrel produced 514,578 matches in our corpus.
The predicative adjectives gramrel is composed of a single rule in the default grammar. To create an
enriched version, we divided it into two separate rules. The first rule captures the adjective placed directly
after the noun (e.g., “…keep the soil dry…”) and was modified so that it can also capture two adjectives (e.g.,
“…fisheries more productive and sustainable…”). The second rule extracts the adjective placed after the
noun and the verb to be (e.g., “the soil is dry”). We increased its recall by adding more predicative verbs
(i.e., appear, look, seem, become, remain, get, turn). Other modifications include optional auxiliary verbs
(e.g., “droughts have become more prolonged”), modal verbs (e.g., “soybeans will remain yellow”), two
adjectives (e.g., “soil is acid or alkaline”), and noun enumerations (e.g., “leaves and small stems become
more brittle”). The basic version of the gramrel produced 34,127 matches in our corpus. The enriched
version obtained 46,358 matches, which is a 35.84% increase.
Finally, the hyponymic adjectives gramrel captures the adjectives that qualify the hypernym of the search
word. It is based on hyponymic knowledge patterns. It follows the logic that, in a hyponymic
structure, the adjective modifying the hypernym potentially expresses a characteristic of the hyponym. For
example, in “...the use of interventional measures such as fungicides...”, it can be deduced that fungicide is
a type of interventional measure. Therefore, interventional also applies to fungicide.
The starting point of this gramrel was the hyponymic rules in the ESSG (see 1.2.2). We only retained those
rules that returned at least 1,000 results in our corpus. We then excluded the rules that were excessively
noisy, although they might be included in the future if they can be refined to yield satisfactory results. The
hyponymic adjectives gramrel produced 27,420 matches in our corpus and contains the following rules:
1. adjective HYPERNYM such as/including/especially/like/includes HYPONYM (e.g.,
“…agricultural inputs such as herbicides”)
2. HYPONYM and/or other adjective HYPERNYM (e.g., “…antibiotics or other effective
antimicrobials”)
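The two patterns listed above can be approximated in Python over POS-tagged tokens. This is a rough sketch under several assumptions: tokens are (word, lemma, tag) triples with Penn-Treebank-style tags, the patterns are matched only in their simplest form (one adjective, single-noun hypernym and hyponym), and all function and variable names are hypothetical. The actual gramrel is written in CQL and derived from the ESSG rules, which are more elaborate.

```python
MARKERS = [("such", "as"), ("including",), ("especially",), ("like",), ("includes",)]


def is_adj(tag):
    return tag.startswith("J")


def is_noun(tag):
    return tag.startswith("N")


def hyponymic_adjectives(tokens):
    """Return (hyponym lemma, adjective lemma) pairs for the two patterns."""
    words = [word.lower() for word, _, _ in tokens]
    pairs = []
    for i in range(len(tokens) - 1):
        # Pattern 1: adjective HYPERNYM <marker> HYPONYM
        if is_adj(tokens[i][2]) and is_noun(tokens[i + 1][2]):
            for marker in MARKERS:
                start = i + 2
                end = start + len(marker)
                if (tuple(words[start:end]) == marker
                        and end < len(tokens)
                        and is_noun(tokens[end][2])):
                    pairs.append((tokens[end][1], tokens[i][1]))
        # Pattern 2: HYPONYM and/or other adjective HYPERNYM
        if (is_noun(tokens[i][2]) and i + 4 < len(tokens)
                and words[i + 1] in {"and", "or"}
                and words[i + 2] == "other"
                and is_adj(tokens[i + 3][2])
                and is_noun(tokens[i + 4][2])):
            pairs.append((tokens[i][1], tokens[i + 3][1]))
    return pairs


# "...agricultural inputs such as herbicides" (tagging assumed)
example_1 = [
    ("agricultural", "agricultural", "JJ"), ("inputs", "input", "NNS"),
    ("such", "such", "JJ"), ("as", "as", "IN"), ("herbicides", "herbicide", "NNS"),
]
# "...antibiotics or other effective antimicrobials" (tagging assumed)
example_2 = [
    ("antibiotics", "antibiotic", "NNS"), ("or", "or", "CC"),
    ("other", "other", "JJ"), ("effective", "effective", "JJ"),
    ("antimicrobials", "antimicrobial", "NNS"),
]
print(hyponymic_adjectives(example_1))
print(hyponymic_adjectives(example_2))
```

In both examples the adjective that modifies the hypernym is attached to the hyponym, which is the inheritance logic the gramrel relies on.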
2.3 Evaluation methods
These adaptations were evaluated in two stages. The first stage evaluated the WS columns in terms
of precision, whereas the second stage evaluated the usefulness of the gramrels to extract specialized
knowledge.
In both stages, the evaluation was performed using one high-frequency term (crop with 28,457 occurrences in
the Agronomy corpus) and two medium-frequency terms (fungicide with 1,161 occurrences, and nematode
with 866 occurrences) as search words. Additionally, for the second stage, we extracted definitions of
these terms from specialized glossaries and multidomain terminology databases (only if the definition was
labeled as belonging to Agronomy or its subdomains). In total, 30 definitions of crop, 20 of nematode,
and 26 of fungicide were recovered. In both stages, only the five most frequent results per WS column
were considered. To evaluate the precision (i.e., the percentage of correct results), we assessed whether
the results were correct by accessing the corresponding concordance lines. In the case of the proto-agent-
patient gramrel, a concordance line was considered correct (i.e., a true positive) if a proto-agent-patient
relationship could be deduced directly or indirectly from the concordance.
For attributive and predicative adjectives, a concordance line was considered correct if the adjective
qualified the captured noun in the concordance. In the case of hyponymic adjectives, it was considered
correct if the noun inherited the characteristic expressed by the adjective.
We also calculated validity, according to which a result in a WS column is valid when at least one of its
associated concordances is a true positive (San Martín et al., 2020, p. 5961). For example, the relation
“fungicide is the proto-patient of industry” has four associated concordances (Figure 10). Since only
concordances 3 and 4 are correct, the precision of this result is 50%. However, because there is at least one
correct concordance, validity is 100%.
Figure 10. Concordances associated with the relation “fungicide is the proto-patient of industry”
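The difference between precision and validity can be made explicit with a short Python sketch. The per-result definitions follow the text above. The column-level aggregation (the share of valid results in a column) and the second, invented result are assumptions added here for illustration, as are the function names.

```python
def result_precision(judgments):
    """Share of a result's concordances that are true positives."""
    return sum(judgments) / len(judgments)


def result_is_valid(judgments):
    """A result is valid if at least one of its concordances is a true positive."""
    return any(judgments)


def column_validity(column):
    """Assumed aggregation: share of valid results in a WS column."""
    return sum(result_is_valid(j) for j in column.values()) / len(column)


# "fungicide is the proto-patient of industry": four concordances,
# of which only the third and fourth are correct (see Figure 10).
industry = [False, False, True, True]

print(result_precision(industry))
print(result_is_valid(industry))

# Invented column with one additional result that has no correct concordance.
column = {"industry": industry, "hypothetical_result": [False, False]}
print(column_validity(column))
```

A result can therefore be fully valid while only half of its concordances are correct, which is why the two measures are reported separately.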
The second evaluation stage explored the usefulness of the adaptations for specialized knowledge extraction
by comparing the gramrel results with the definitions of the search terms. In some cases, they were also
compared with the contextonymic and semantic WS columns.
3 Results
3.1 Evaluation of the proto-agent-patient gramrel
3.1.1 First stage
The table with the complete results of the precision and validity analysis is in Appendix 1. Figure 11
summarizes the results in terms of precision and validity, and Figure 12 reproduces the resulting WS
columns. On average, the enriched version has a precision of 71.17% compared to 66.01% for the basic
version. The enriched version was found to perform better for fungicide and crop, whereas the basic version
performed better for nematode.
In terms of validity, the results of the enriched version are also slightly higher than the basic version. On
average, the enriched one obtained 93.33% validity, while the basic one obtained 83.33%. The enriched
version performed better for nematode and fungicide than the basic version. Both versions were 100%
valid for crop.
The analysis of the incorrect concordances allowed us to identify different types of errors, mostly
attributable to the limitations of WSs. Such errors include problems with sentence segmentation (e.g.,
“Destroy or control weeds and soil pests Incorporate crop residues...”) and POS-tagging (e.g., “This water
blistering disorder crops up from time to time...”). This type of error was present in the same proportion in
the basic and enriched versions.
Figure 11. Precision and validity of the proto-agent-patient gramrel
Another type of error caused by a WS limitation concerned noun compounds. As previously explained,
to overcome this problem, the enriched version retrieves all the nouns in nominal compounds. Although
this facilitates the retrieval of many correct results, it is also a source of noise (e.g., “Plant pathologists are
investigating methods of nematode control”). However, these preliminary results seem to indicate that the
increase in recall compensates for the noise, especially since it is preferable to prioritize recall rather than
precision.
Finally, there were errors generated by invalidating and inverting verbs. Even though the enriched version
accounts for some of these verbs, new ones appeared in the concordances (e.g., “Crops may tolerate
greater amounts of blowing soil...”). All these verbs will be analyzed before they are added to the rules.
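How such verb lists could act on a subject-verb-object match can be sketched in Python as follows. The sketch assumes that an invalidating verb discards the match and that an inverting verb swaps the two roles. The verb lists and the function name are hypothetical and do not reproduce the rules of the sketch grammar, which are written in CQL.

```python
from typing import Optional, Tuple

# Hypothetical verb lists, for illustration only.
INVALIDATING_VERBS = {"be", "have", "need"}
INVERTING_VERBS = {"receive", "undergo"}

def assign_roles(subject: str, verb: str, obj: str) -> Optional[Tuple[str, str]]:
    """Map a subject-verb-object match to (proto-agent, proto-patient).

    Returns None when the verb is assumed not to convey the relation.
    """
    if verb in INVALIDATING_VERBS:
        return None            # match discarded
    if verb in INVERTING_VERBS:
        return (obj, subject)  # the subject is the affected entity
    return (subject, obj)      # default: subject acts on object

print(assign_roles("fungicide", "reduce", "incidence"))
print(assign_roles("seed", "receive", "fungicide"))
print(assign_roles("crop", "need", "water"))
# A verb missing from both lists falls through to the default mapping,
# which is how a new verb can produce an incorrect result.
print(assign_roles("crop", "tolerate", "soil"))
```

In this sketch, a verb that appears in neither list, such as tolerate in the last call, is treated as a regular agentive verb, so the pair is returned with crop as proto-agent.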
Figure 12. Resulting WS columns from the proto-agent-patient gramrel. The basic version has a green dot; the enriched version, a blue dot
3.1.2 Second stage
Since the enriched version returned more precise results, the second stage was performed on this one. First,
we compared the proto-agents and proto-patients in the extracted definitions of the analysis terms (Table 2)
with the gramrel results (Table 3). Terms present in both the definitions and the WS are in bold.
Table 2. Proto-agents and proto-patients in the definitions. The number of occurrences in the definitions is in parentheses
fungicide  is the proto-agent of…    fungus (24), growth (6), disease (3), plant (3), mold (2), mildew (2), control (1), yeast (1), pathogen (1), crop (1), product (1), soil (1), development (1)
           is the proto-patient of…  insect (1), ant (1)
crop       is the proto-agent of…    -
           is the proto-patient of…  livestock (1), labor (1), farmer (1), people (1)
nematode   is the proto-agent of…    plant (11), root (5), animal (5), crop (2), vine (2), tissue (2), agriculture (1), pest (1), slug (1), leatherjacket (1), loss (1), human (1), yield (1), swelling (1), growth (1), disease (1), bird (1), mammal (1), insect (1), juice (1), structure (1), damage (1)
           is the proto-patient of…  plant (1), control (1), brassica (1), chemical (1), nematicide (1), water (1), contamination (1)
Table 3. Top five results in the proto-agent-patient WS with the number of correct concordances in parentheses
fungicide  is the proto-agent of…    disease (10), yield (7), incidence (6), grain (6), plant (4)
           is the proto-patient of…  grower (5), industry (2), leaf (3), application (2), seed (0)
crop       is the proto-agent of…    soil (92), water (38), yield (37), weed (35), production (26)
           is the proto-patient of…  farmer (122), soil (40), system (37), water (25), fertilizer (23)
nematode   is the proto-agent of…    disease (10), plant (10), root (6), crop (4), effector (4)
           is the proto-patient of…  crop (3), cultivar (3), soil (2), nematicide (1), plant (0)
Of the 28 valid WS results, only 8 (i.e., 28.57%) also appear in the definitions. This low percentage was
to be expected because definitions select the conceptual information considered most relevant, whereas corpora
tend to contain much more information. It can also be observed that the use of the proto-agent-patient
relation in the definitions is variable: fungicide and nematode are most frequently defined as proto-agents,
while in the definitions of crop, both macroroles are rare.
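The overlap figure can be checked against Tables 2 and 3 with a short Python sketch. It assumes that a WS result counts as valid when it has at least one correct concordance and that a match requires the same word to fill the same role for the same term. The data are copied from the two tables; the variable names are illustrative.

```python
# Words found in the definitions (Table 2), keyed by (term, role of the term).
definitions = {
    ("fungicide", "agent"): {"fungus", "growth", "disease", "plant", "mold",
                             "mildew", "control", "yeast", "pathogen", "crop",
                             "product", "soil", "development"},
    ("fungicide", "patient"): {"insect", "ant"},
    ("crop", "agent"): set(),
    ("crop", "patient"): {"livestock", "labor", "farmer", "people"},
    ("nematode", "agent"): {"plant", "root", "animal", "crop", "vine", "tissue",
                            "agriculture", "pest", "slug", "leatherjacket",
                            "loss", "human", "yield", "swelling", "growth",
                            "disease", "bird", "mammal", "insect", "juice",
                            "structure", "damage"},
    ("nematode", "patient"): {"plant", "control", "brassica", "chemical",
                              "nematicide", "water", "contamination"},
}

# Top five WS results with their number of correct concordances (Table 3).
ws_results = {
    ("fungicide", "agent"): {"disease": 10, "yield": 7, "incidence": 6,
                             "grain": 6, "plant": 4},
    ("fungicide", "patient"): {"grower": 5, "industry": 2, "leaf": 3,
                               "application": 2, "seed": 0},
    ("crop", "agent"): {"soil": 92, "water": 38, "yield": 37, "weed": 35,
                        "production": 26},
    ("crop", "patient"): {"farmer": 122, "soil": 40, "system": 37,
                          "water": 25, "fertilizer": 23},
    ("nematode", "agent"): {"disease": 10, "plant": 10, "root": 6, "crop": 4,
                            "effector": 4},
    ("nematode", "patient"): {"crop": 3, "cultivar": 3, "soil": 2,
                              "nematicide": 1, "plant": 0},
}

valid = 0
shared = []
for key, collocates in ws_results.items():
    for word, correct in collocates.items():
        if correct == 0:
            continue  # no correct concordance: the result is not valid
        valid += 1
        if word in definitions[key]:
            shared.append((key[0], key[1], word))

print(f"{len(shared)} of {valid} valid results "
      f"({100 * len(shared) / valid:.2f}%) also appear in the definitions")
```

With the data of Tables 2 and 3, this yields 8 shared results out of 28 valid ones (28.57%).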
Some frequent proto-patients in the definitions are missing from the WS results. The most prominent case is
fungus as a proto-patient of fungicide. In the column “fungicide is the proto-agent of...”, fungus is in 85th
position with only one associated concordance (“...fungicides kill fungi...”). The contextonymic WS of
fungicide was thus consulted to determine whether the low rank of fungus was a case of silence (i.e., that
the proto-agent-patient gramrel missed relevant concordances).
Fungus is only the 72nd contextonym of fungicide with 96 concordances. This indicates that, although the
relationship between fungicide and fungus is relevant to define fungicide, specialized texts do not often
mention this characteristic. Additionally, we observed that only 12 out of the 96 concordances directly
or indirectly convey a proto-agent-proto-patient relationship. In most of these cases, however, it is not reflected
in a subject-object relation (e.g., “…strobilurin fungicides are very active against many plant pathogenic
fungi…”). Among the ones in which there is indeed a subject-object relation, most are cases of anaphora
(e.g., “…fungicides penetrate into plant tissue, where they kill or inhibit a fungus…”) and uncommon
use of punctuation (e.g., “…fungicides (kill fungi)...”). Both cases are difficult to account for in CQL rules
without generating excessive noise.
Finally, regarding the complementarity of this relationship with the contextonymic WS, the first
contextonymic results of the three terms show matches with the proto-agent-patient gramrel. It follows
that the proto-agent-patient columns can facilitate the discovery of the relationship between the search
word and many of its contextonyms. As shown in Table 4, for 9 of the 15 most frequent contextonyms
of fungicide, nematode, and crop, the first five proto-agent-patient results help to determine the semantic
relationship between the two terms.
Table 4. Comparison of contextonyms (number of occurrences in parentheses) and the results from the proto-agent-patient gramrel
search word  contextonym         search word is its…
fungicide    disease (907)       proto-agent
             crop (581)          proto-agent/proto-patient
             application (462)   -
             resistance (421)    -
             control (406)       -
nematode     plant (776)         proto-agent
             soil (600)          proto-patient
             crop (398)          proto-agent
             root (347)          proto-agent
             population (261)    -
crop         soil (15,672)       proto-agent
             plant (9,063)       -
             production (8,707)  proto-agent
             use (verb) (7,944)  -
             yield (7,465)       proto-agent
3.2 Evaluation of the adjectives gramrel
3.2.1 First stage
The table with the complete results of the precision and validity analysis is in Appendix 2. Figure 13
summarizes the precision and validity data. Figure 14 reproduces the WS columns.
The attributive adjectives gramrel has very high precision (99.93% on average) and perfect
validity. The only errors that the gramrel produced are due to problems with corpus segmentation.
As for predicative adjectives, the enriched version performed slightly better than the basic version in both
precision (72.53% vs. 68.47%) and validity (80% vs. 75.94%). It is not surprising that crop, the most
frequent term, obtained the best results. In fact, in the enriched version, the validity is 100%. Despite
the small sample, the precision and validity are sufficiently high to conclude that this gramrel performs
satisfactorily. Moreover, the results indicate that the enriched version, despite retrieving 35.84% more results,
maintains or even surpasses the level of precision of the basic version.
Figure 13. Precision and validity of the adjectival gramrels
The frequent errors observed in the extraction rules for predicative adjectives are similar to
those of the proto-agent-patient gramrel. Both the basic and enriched versions retrieve some incorrect
results stemming from problems with POS tagging and corpus segmentation. The enriched version also
returns some incorrect matches due to noun compounds: “...combinations of these fungicides are effective
in the management…”.
Finally, the results of the hyponymic gramrel are notably inferior to the other gramrels both in precision
and validity. While the precision is 42.2%, the validity is only slightly higher at 53.33%. However, the
results for crop are 79.93% precise and 100% valid. Since crop is the most frequent term, this is consistent
with the fact that knowledge-pattern-based gramrels (such as this gramrel) need larger corpora to yield
satisfactory results.
This gramrel shares the same errors as the other ones, but there are also two that are unique to it. The first
error occurs when the head of the hypernym is a collective noun. For example, in “there is a wide variety
of parasites including trematodes, cestodes, nematodes…”, wide qualifies variety, and not nematode’s
hypernym (parasite). Therefore, nematode cannot be said to inherit that attribute. This error can easily be
solved by filtering out collective nouns.
The other error occurs when the head of the hypernym and the head of the hyponym are the same lemma.
For example, “…plant-feeding nematodes such as lesion nematodes”. In this case, plant-feeding only
applies to one particular type of nematode (i.e., lesion nematodes). This can also be easily filtered to
improve the performance of this gramrel. However, the attributive adjectives gramrel already captures the
adjective preceding the hypernym (i.e., “plant-feeding nematodes”).
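The two filters just described can be illustrated with a minimal Python sketch applied to extracted matches. It is a post-processing analogue, not the CQL rules themselves. The record layout, the list of collective nouns and the third example match are assumptions made for illustration.

```python
from typing import NamedTuple

# Hypothetical list; a real filter would need a fuller inventory.
COLLECTIVE_NOUNS = {"variety", "range", "group", "number"}

class HyponymicMatch(NamedTuple):
    adjective: str      # adjective preceding the hypernym phrase
    hypernym_head: str  # lemma of the head of the hypernym phrase
    hyponym_head: str   # lemma of the head of the hyponym (search term)

def keep(match: HyponymicMatch) -> bool:
    """Return True if the hyponym can plausibly inherit the adjective."""
    if match.hypernym_head in COLLECTIVE_NOUNS:
        return False  # "wide variety of parasites": wide qualifies variety
    if match.hypernym_head == match.hyponym_head:
        return False  # "plant-feeding nematodes such as lesion nematodes"
    return True

matches = [
    HyponymicMatch("wide", "variety", "nematode"),
    HyponymicMatch("plant-feeding", "nematode", "nematode"),
    # Invented example of a match that should be retained.
    HyponymicMatch("soilborne", "pathogen", "nematode"),
]

kept = [m.adjective for m in matches if keep(m)]
print(kept)
```

Only the third match passes both filters in this sketch, since the first has a collective noun as the head of the hypernym and the second has the same lemma as hypernym and hyponym.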
Figure 14. Resulting WS columns from the adjectival gramrels. The basic version of the predicative adjectives has a green dot; the enriched version, a blue dot.
3.2.2 Second stage
In this stage, we left out the basic version of the predicative adjectives gramrel and evaluated the attributive
and hyponymic adjectives gramrels along with the enriched predicative adjectives gramrel. First, we
compared the gramrel results (Table 6) with the adjectives extracted from the definitions of the terms under
analysis (Table 5). In both tables, the terms present in both the definitions and WSs are in bold.
Table 5. Adjectives in the definitions. The number of occurrences in the definitions is in parentheses
fungicide  chemical (6), physical (2), toxic (1)
nematode   microscopic (7), parasitic (6), cylindrical (3), small (3), unsegmented (3), slender (2), elongated (2), worm-like (2), nonsegmented (2), numerous (1), multicellular (1), living (1), beneficial (1), harmful (1), former (1), biological (1), phytoparasitic (1), abundant (1), long (1), legless (1), worm-shaped (1), aquatic (1), round (1), colorless (1), threadlike (1)
crop       cultivated (9), grown (8), agricultural (2), used (2), growing (2), young (2), managed (2), horticultural (1), animal (1), total (1), yearly (1), wild (1)
Table 6. Top five results in the adjectival WSs with the number of correct concordances in parentheses
           attributive           predicative       hyponymic
fungicide  new (27)              available (6)     effective (1)
           synthetic (20)        important (2)     interventional (1)
           foliar (16)           effective (1)     powdery (0)
           systemic (15)         economical (0)    noncross-resistant (0)
           modern (8)            miniscule (0)     downy (0)
nematode   parasitic (56)        present (2)       soilborne (5)
           reniform (38)         migratory (1)     important (0)
           plant-parasitic (17)  microscopic (1)   wide (0)
           endoparasitic (7)     sedentary (1)     fungal (0)
           pathogenic (6)        predatory (0)     plant-feeding (0)
crop       important (267)       resistant (17)    agricultural (28)
           annual (200)          susceptible (15)  cultural (15)
           major (183)           sensitive (14)    organic (13)
           perennial (178)       vulnerable (12)   agronomic (6)
           main (147)            important (10)    biological (5)
The adjectives extracted from the definitions are only the ones that describe the term. For instance, in the
definition of fungicide as “chemical compound used to control fungi”, the adjective chemical describes
fungicide and is therefore included in the list.
These adjectives have a low level of correspondence (3 out of 36, i.e., 8.33%) with the ones extracted with
the gramrels. No specific gramrel stands out either, since each one has only one match.
This may be due to the fact that corpora tend to contain more information than definitions. In addition,
the adjectives that qualify a noun do not necessarily express a defining characteristic of that concept. They
may express a characteristic of a subtype or of an instance of the concept. For example, in “…between two
periods, the crop is vulnerable to weeds…”, the predicative adjective refers to an instance of crop and is
also time-restricted. Therefore, it cannot be deduced that vulnerability to weeds is a characteristic of crops,
only a possible feature. Another example is “Modern fungicides can be applied to crops…”. In this case,
the attributive adjective modern selects a subset of all existing fungicides. Being modern is also a possible
characteristic of fungicides though not a defining one.
As for the complementarity of adjectival gramrels and contextonyms, it should be noted that the initial
positions of the contextonymic WS are usually occupied by nouns. In the case of crop, it is necessary to go
to the 19th result to find the first adjective (high). With nematode, the first adjective is parasitic, in 24th
position. As for fungicide, the first adjective (new) is in 18th position. As for the semantic gramrels, none of
them extracts adjectives. For these reasons, an adjectival gramrel does potentially provide complementary
information.
4 Analysis and Discussion
The results of the evaluation of the proto-agent-patient gramrel are promising. Despite the small sample,
the enriched version was found to achieve over 70% precision and over 90% validity. The gramrel in
its current state is thus functional. The evaluation also shows ways to improve the rules. For example,
increasing the list of invalidating and inverting verbs will lead to greater precision. A more thorough
evaluation of the impact of retrieving all nouns that make up nominal compounds in object and subject
position will also allow us to adjust the rules accordingly.
Comparison with the definitions did not yield a high level of correspondence. It is especially noteworthy
that some frequent proto-agents and proto-patients in the definitions were absent from the gramrel results.
This was the case of fungus in relation to fungicide. The analysis of the concordances in which fungus is
a contextonym of fungicide reveals that the addition of patterns other than subject-object to the gramrel
could improve it.
Regarding adjectival gramrels, preliminary results did not reveal whether it would be useful to merge them
into a single gramrel. One argument against merging is that the attributive gramrel returned almost seven
times as many results as the other two combined, which means that the others would have very little weight
in a merged gramrel. However, as yet, there is not sufficient evidence to affirm that they are more useful
separately than merged, especially since having too many WS columns could lead to information overload.
As for the usefulness of adjectival WSs for specialized knowledge extraction, this preliminary evaluation
was inconclusive. No gramrel has a higher correspondence than the others with the adjectives extracted from the
definitions. However, there is also the question of whether definitions are a good benchmark for evaluating
the capacity of WSs to assist in specialized knowledge extraction. New forms of evaluation should thus
also be explored.
While there were some very useful adjectives extracted by the gramrels, others were less so because the
adjectives that qualify a noun do not always express a defining characteristic. Thus, our results suggest
that the adaptation of this gramrel is worth further study. Thanks to the high precision of the attributive
and predicative adjectives gramrels, they are a good research tool for that purpose. As for hyponymic
adjectives, the results indicate that the rules need to be refined to achieve greater precision before the
gramrel can be considered functional.
It is also worth pointing out that we have not evaluated the inverse columns (i.e., using an adjective as a
search word to obtain the list of nouns it qualifies). It is probable that these WS columns are useful for
studying adjectives from a conceptual point of view.
In future work, we will continue to develop these gramrels and evaluate them with a larger sample. For
this purpose, corpora from different specialized domains and different types of search terms will be used.
Finally, in parallel, we will further adapt the English default sketch grammar to create a sketch grammar for
specialized knowledge extraction that would also include contextonymic and semantic gramrels.
Acknowledgements
We thank Natasha Herger, research assistant, for her contribution to the data analysis. This research
was carried out as part of the project Développement d’une méthodologie d’élaboration de définitions
terminologiques : analyse de corpus et variation contextuelle (2020-NP-267503) funded by Quebec’s
Society and Culture Research Fund (FQRSC).
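Appendix 1. Results of the proto-agent-patient gramrel evaluation

BASIC VERSION

nematode (precision: 80%; validity: 80%)

nematode is the proto-agent of… (precision: 100%; validity: 100%)
Proto-patient  Correct/total  Prec. (%)  Most common verb
root           4/4            100        invade
soil           3/3            100        infest
loss           3/3            100        cause
disease        3/3            100        cause
damage         3/3            100        cause

nematode is the proto-patient of… (precision: 60%; validity: 60%)
Proto-agent    Correct/total  Prec. (%)  Most common verb
LM135          1/1            100
solarization   1/1            100
estimate       1/1            100
loss           0/1            0
pest           0/1            0

fungicide (precision: 56.19%; validity: 70%)

fungicide is the proto-agent of… (precision: 66.67%; validity: 80%)
Proto-patient  Correct/total  Prec. (%)  Most common verb
grower         0/3            0
incidence      3/3            100        reduce
control        3/3            100        provide
growth         3/3            100
effect         1/3            33.33

fungicide is the proto-patient of… (precision: 45.71%; validity: 60%)
Proto-agent    Correct/total  Prec. (%)  Most common verb
seed           2/7            28.57      receive
grower         3/3            100        use
leaf           2/2            100        absorb
crop           0/2            0
shasho         0/1            0

crop (precision: 61.84%; validity: 100%)

crop is the proto-agent of… (precision: 44.15%; validity: 100%)
Proto-patient  Correct/total  Prec. (%)  Most common verb
nutrient       6/20           30         remove
legume         1/16           6.25
water          11/14          78.57      use
yield          9/14           64.29      produce
nitrogen       5/12           41.67      x

crop is the proto-patient of… (precision: 79.52%; validity: 100%)
Proto-agent    Correct/total  Prec. (%)  Most common verb
farmer         40/47          85.11      grow
farm           9/9            100        grow|produce
rotation       7/8            87.5       involve
soil           6/8            75         affect
system         4/8            50         use

ENRICHED VERSION

nematode (precision: 71.50%; validity: 90%)

nematode is the proto-agent of… (precision: 87.99%; validity: 100%)
Proto-patient  Correct/total  Prec. (%)  Most common verb
disease        10/12          83.33      cause
plant          10/11          90.91      cause
root           6/7            85.71      damage|invade
crop           4/5            80         cause
effector       4/4            100        secrete

nematode is the proto-patient of… (precision: 55%; validity: 80%)
Proto-agent    Correct/total  Prec. (%)  Most common verb
crop           3/4            75         suppress
plant          0/4            0
soil           2/4            50         affect
cultivar       3/3            100        support
nematicide     1/2            50

fungicide (precision: 76.17%; validity: 90%)

fungicide is the proto-agent of… (precision: 89.02%; validity: 100%)
Proto-patient  Correct/total  Prec. (%)  Most common verb
disease        10/11          90.91      control
yield          7/8            87.5       increase
incidence      6/6            100        reduce
grain          6/6            100        treat|affect
plant          4/6            66.67

fungicide is the proto-patient of… (precision: 63.33%; validity: 80%)
Proto-agent    Correct/total  Prec. (%)  Most common verb
seed           0/6            0
grower         5/5            100        use|inquire
industry       2/4            50
leaf           3/3            100        absorb
application    2/3            66.67

crop (precision: 65.86%; validity: 100%)

crop is the proto-agent of… (precision: 66.68%; validity: 100%)
Proto-patient  Correct/total  Prec. (%)  Most common verb
soil           92/125         73.60      improve
yield          37/67          55.22      produce|impact
water          38/57          66.67      use
weed           35/48          72.92      suppress
production     26/40          65         improve|increase

crop is the proto-patient of… (precision: 65.04%; validity: 100%)
Proto-agent    Correct/total  Prec. (%)  Most common verb
farmer         122/133        91.73      grow
soil           40/65          61.54      affect
system         37/58          63.79      use
fertilizer     23/46          50         affect
water          25/43          58.14      affect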
Appendix 2. Results of the adjectives gramrel evaluation

nematode

attributive adjectives (precision: 100%; validity: 100%)
Adjective           Correct/total  Prec. (%)
parasitic           56/56          100
reniform            38/38          100
plant-parasitic     17/17          100
endoparasitic       7/7            100
pathogenic          6/6            100

predicative adjectives, basic version (precision: 80%; validity: 80%)
Adjective           Correct/total  Prec. (%)
present             2/2            100
migratory           1/1            100
predatory           0/1            0
microscopic         1/1            100
abundant            1/1            100

predicative adjectives, enriched version (precision: 80%; validity: 80%)
Adjective           Correct/total  Prec. (%)
present             2/2            100
migratory           1/1            100
predatory           0/1            0
microscopic         1/1            100
sedentary           1/1            100

hyponymic adjectives (precision: 16.67%; validity: 20%)
Adjective           Correct/total  Prec. (%)
soilborne           5/6            83.33
important           0/3            0
wide                0/2            0
fungal              0/2            0
plant-feeding       0/1            0

fungicide

attributive adjectives (precision: 100%; validity: 100%)
Adjective           Correct/total  Prec. (%)
new                 27/27          100
synthetic           20/20          100
foliar              16/16          100
systemic            15/15          100
modern              8/8            100

predicative adjectives, basic version (precision: 37.58%; validity: 60%)
Adjective           Correct/total  Prec. (%)
available           6/11           54.55
effective           1/3            33.33
important           2/2            100
economical          0/2            0
miniscule           0/1            0

predicative adjectives, enriched version (precision: 40.91%; validity: 60%)
Adjective           Correct/total  Prec. (%)
available           6/11           54.55
important           2/2            100
effective           1/2            50
economical          0/2            0
miniscule           0/1            0

hyponymic adjectives (precision: 30%; validity: 40%)
Adjective           Correct/total  Prec. (%)
effective           1/2            50
powdery             0/1            0
noncross-resistant  0/1            0
interventional      1/1            100
downy               0/1            0

crop

attributive adjectives (precision: 99.79%; validity: 100%)
Adjective           Correct/total  Prec. (%)
important           267/267        100
annual              200/201        99.5
major               183/183        100
perennial           178/179        99.44
main                147/147        100

predicative adjectives, basic version (precision: 87.83%; validity: 100%)
Adjective           Correct/total  Prec. (%)
susceptible         14/16          87.5
resistant           15/15          100
sensitive           14/14          100
vulnerable          11/12          91.67
important           6/10           60

predicative adjectives, enriched version (precision: 96.67%; validity: 100%)
Adjective           Correct/total  Prec. (%)
susceptible         15/18          83.33
resistant           17/17          100
sensitive           14/14          100
vulnerable          12/12          100
important           10/10          100

hyponymic adjectives (precision: 79.93%; validity: 100%)
Adjective           Correct/total  Prec. (%)
agricultural        28/29          96.55
organic             13/20          65
cultural            15/15          100
agronomic           6/9            66.67
biological          5/7            71.43