Collaboration between the Natural Sciences and Computational
Linguistics: A Discussion of Issues
Anne E Thessen1,2, Ruth E Duerr1, Jenette Preciado3, Chris J Jenkins3, Martha Palmer3
1The Ronin Institute for Independent Scholarship, New Jersey, USA
2The Data Detektiv, Massachusetts, USA
3University of Colorado Boulder, CO 80309 USA
Cite: Thessen, A.E., Duerr, R.E., Preciado, J., Jenkins, C.J. and Palmer, M. 2018. Collaboration between the Natural Sciences and Computational Linguistics: A Discussion of Issues. Occasional Report 28 December 2018, INSTAAR, University of Colorado, Boulder, USA, 26pp. DOI: 10.13140/RG.2.2.19353.26728
Abstract
Natural Language Processing (NLP) is an important field of study dedicated to improving
automated reading and understanding of human text by machines through the development of
specialized algorithms. These algorithms need a large corpus of annotated text in order to learn
the semantics and syntax of human language, which is often specific and nuanced according to
the context. Because of this, many different types of corpora can be required to achieve good
performance in different domains. This article discusses an interdisciplinary collaboration
between natural scientists and computational linguists to develop three annotated corpora for
the purpose of training algorithms for automated ontology creation. This paper describes the
annotation methods used by each of the three domains in the ClearEarth project and discusses
the problems that arose in collaborating across domains and disciplines. Then, solutions and
guidelines for similar future projects are proposed.
Keywords
interdisciplinary research, natural language processing, machine learning, ontology, annotation
Introduction
Natural Language Processing (NLP) is a collection of computational methods for the automated
reading and understanding of human text by machines. NLP was first applied in the field of machine translation (see Cambria and White 2014). Later, it was applied to text mining, wherein documents, such as newspaper articles, were read by computers for the purpose of
wherein documents, such as newspaper articles, were read by computers for the purpose of
extracting machine-readable data. These efforts were largely successful, and the algorithms
have since been adapted for use in biomedicine, where they were used to find relationships
between diseases, drugs, and other molecules at a scale much faster than any human could
read and extract information.
The adaptation of NLP algorithms to biomedicine was not easy and required new training
material, a corpus of human-annotated biomedical text (Krallinger & Valencia 2005; Kim et al.
2008). Any efforts to replicate this adaptation to other disciplines would require the development
of a well-annotated corpus using text from that discipline; however, unlike previous work, we
aim to use the results of the NLP to generate and enhance domain-specific ontologies. This
paper is a description of the development of that corpus.
Background
Natural Language Processing
The development of automatic speech and language processing began with the advent of the
computer in the 1940s and early 1950s. Researchers immediately attempted machine translation,
most notably Russian to English machine translation at Georgetown University (Dostert 1955).
The first techniques were primarily rule-based, falling into the symbolic processing paradigm that characterized NLP development until the mid-1980s, and encompassing parsing algorithms as well as formal semantics. This was in contrast to research in speech processing, which had always relied on statistical techniques. In the late 1980s the stochastic methods from IBM's successful speech processing models (Bahl et al. 1983) finally began to permeate text processing, triggering a statistical, or machine learning, revolution in the 1990s. This was fueled by
large amounts of newly available text and speech data in electronic form which was easily
accessible through Penn’s Linguistic Data Consortium (Marcus et al. 1993; Hajič 1998; Palmer
et al. 2005; Pustejovsky et al. 2003). Reliable techniques for consistent linguistic annotation on
a large scale were also developed at this point, primarily at Penn (Marcus et al. 1993) and in
Prague (Hajič 1998), fostering the creation of the necessary training data for supervised
machine learning. The availability of the data as well as new high-performance computing
systems drew the interest of the ML community, and the application of support vector machines
(Boser et al. 1992; Cortes & Vapnik 1995), maximum entropy techniques, logistic regression
(Berger et al. 1996), and graphical Bayesian models (Pearl 1988) to NLP tasks resulted in
fruitful collaborations and the robust, broad-coverage systems for shallow semantic processing
that are in use today. Today’s advances are based on the return of both basic and deeper
variants of traditional neural networks for core NLP tasks (Manning 2015), which have risen
dramatically in popularity in the last few years.
The effectiveness of NLP processing with Machine Learning (ML) methods depends critically on
having large corpora from which the methods learn patterns. Many of the available tools for
performing NLP tasks were trained on corpora written for a general audience, such as
newspaper articles, and do not perform well on scientific or technical text, but retraining these
algorithms on a domain-specific corpus greatly improves their performance on these texts
(Lease & Charniak 2005; Pakhomov et al. 2006; Rimell & Clark 2009; Pyysalo et al. 2006). The
various texts, at different levels, for various audiences have been termed ‘genres’.
The ClearEarth Project
ClearTK is an NLP toolkit that supports the use of several internal and external machine
learning components and corpora (Bethard et al. 2014, http://cleartk.github.io/cleartk/). The
ClearTK package has been successfully used in biomedical NLP applications after training on
the linguistically annotated MiPACQ, SHARP and THYME corpora (Albright et al. 2013; Styler et
al. 2014). ClearTK, which is compatible with UIMA (IBM’s Unstructured Information
Management Architecture), is an essential element of the Apache cTAKES (clinical Text
Analysis Knowledge Extraction System; ctakes.apache.org) project. This open source project is
managed by Boston Children’s Hospital, and has a solid multi-institutional and international
developer and user base, including national projects such as i2b2 (https://www.i2b2.org/),
eMERGE (McCarty et al. 2011), PGRN (http://www.pgrn.org/), and PCORI
(https://www.pcori.org/). It is used on a daily basis to automatically process clinical notes and
extract relevant information by dozens of medical institutions.
ClearEarth is a collaborative project that brings together computational linguistics and domain
scientists to port ClearTK to the fields of geology, cryology, and ecology. The end goal is to
enable use of advanced industry and research software within the geo-, bio- and cryospheric
sciences for downstream operations such as data discovery, assessment and analysis. In
addition, ClearEarth uses the NLP results to generate domain-specific ontologies and other
semantic resources.
This paper discusses two main types of issues that were encountered during the process of
corpus development: collaborative and annotation. Collaborative issues were challenges that
arose in working across disciplines and domains, but more particularly between the observation-
heavy natural sciences and the technologically-oriented computational linguistics. Annotation
issues were challenges that arose in trying to apply linguistic rules developed to annotate
general audience text to scientific subject matter.
The Annotation Process
Significance of Annotation
The success of NLP relies heavily on a well-annotated corpus that can be used to train the
algorithms to interpret new, unannotated text. The process of annotating text for training
involves several humans using annotation software to identify words and phrases (and
sometimes whole documents) as belonging to a specific category. The Machine Learning (ML)
portion of the process takes these annotations and develops rules for itself to follow when
looking at new text. The annotation step is very important and very expensive. It can take years
to develop annotation guidelines and annotate enough material for proper training. Different
people must be able to annotate the same text similarly, which is referred to as having good
inter-annotator agreement (ITA). Having well-thought-out guidelines, well-written text, and
properly trained annotators is very important for achieving necessary agreement.
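To make the agreement metric concrete, here is a minimal sketch of how agreement can be quantified: Cohen's kappa computed over token-level labels with scikit-learn. The sentence, label set, and annotator decisions are invented for illustration; real projects typically compare annotated spans and may use task-specific agreement measures.

```python
# A minimal sketch of measuring inter-annotator agreement on token-level
# labels with Cohen's kappa. Tokens and labels are invented for
# illustration; real evaluations compare spans, not just tokens.
from sklearn.metrics import cohen_kappa_score

tokens = ["The", "cheetah", "ate", "the", "gazelle"]

# Two annotators' token-level decisions (O = not annotated).
annotator_a = ["O", "BIOTIC_ENTITY", "EVENTUALITY", "O", "BIOTIC_ENTITY"]
annotator_b = ["O", "BIOTIC_ENTITY", "O", "O", "BIOTIC_ENTITY"]

# Kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```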
The specific annotation tasks depend on the final NLP goal. For example, if the goal is to
identify whole documents as being about a specific topic, the annotations will be different than if
the goal is to extract specific information from within text. Thus, annotated corpora need to be
appropriate for the NLP task and are not easily repurposed. While significant advances have
been made and several annotated corpora are available for NLP text mining of newspaper
articles and biomedicine, significant annotation tasks remain for repurposing these advances to
other domains.
To be successful, we need new annotated corpora in each of our domains of interest because
terminology usage varies dramatically between domains and even between text written for
different audiences. Different domains may use different terms for the exact same thing or
conversely may use the same term to mean vastly different things. Even within a domain, terms
can have ambiguous meaning or change over time (e.g., in ecology, Hodges 2008; Isasi-Catalá
2011; in cryology, https://globalcryospherewatch.org/reference/glossary.php where the same
term can have as many as 12 definitions; while the World Meteorological Organization’s latest
update of the WMO Sea-ice Nomenclature in 2014 significantly altered the relationships
between and definitions of many terms).
The annotation step is very important because machine learning algorithms require consistent
inputs to produce reasonable outputs. This requires good agreement between annotations of a
text by different annotators, which in turn requires good rules that can be followed by annotators
who often are not trained in the discipline that is the subject of the text. Text annotation is aided
by software tools and a set of guidelines for annotators. The annotation tool used by ClearEarth,
Anafora, was originally developed under NIH funding for the cTAKES related THYME project
(Chen & Styler 2013). Anafora is open source (https://github.com/weitechen/anafora) and was
demonstrated to a large audience at the 2013 NAACL (North American Chapter of the
Association for Computational Linguistics) conference (Chen & Styler 2013). The annotation
guidelines are available for community download at the THYME website (Styler et al. 2014). The
clinical notes with their annotations are available from hNLP via a Data Use agreement.
In this project, annotation guidelines were developed in collaboration between one of our three
domain experts and a linguist. For some, this project represented their first collaboration with
the other discipline and was an exercise in learning a new research culture. This paper will
describe the annotation methods used by each of the three domains in the ClearEarth project
and discuss the problems that arose in collaborating across disciplines and domains and
repurposing technology. Then, solutions and guidelines for other projects will be presented.
Annotation Methods
The goal of the ClearEarth project was to repurpose algorithms that had been producing state-
of-the-art natural language processing results for biomedical text for use in other scientific
domains: geology, cryology, and ecology. Because text is so domain specific, we needed an
annotated corpus for each domain to train the algorithms. Extensive descriptions of the
annotation guidelines used in this project for each domain are available, including examples of
difficult cases (https://github.com/ClearEarthProject/AnnotationGuidelines). Documents were
chosen by domain experts based on availability (e.g., licensing and digitization) and tone (e.g.,
introductory textbook style). The text available for annotation is different in each domain:
geology texts were pulled from Wikipedia articles, magazine and journal articles, and teaching
materials; ecology texts were obtained from Wikipedia, Encyclopedia of Life, and National
Geographic blogs; cryology text was pulled from the National Snow & Ice Data Center blogs
(National Snow and Ice Data Center 2016b; National Snow and Ice Data Center 2016a) and
journal articles were collected from an open access journal, The Cryosphere. Each domain
designed its own annotation schema and guidelines, though some attempt at consistency in
annotation decisions across domains was made.
As a starting point all three domains adopted the linguistic annotation guidelines for
Treebanking (syntactic annotation), PropBanking (semantic role labeling), Nominal Entity
Tagging (NE, typing of common terminology) and Richer Event Descriptions (RED, temporal
and causal relations between events (https://github.com/timjogorman/RicherEventDescription)).
The PropBank and RED guidelines were developed at Penn and the University of Colorado,
Boulder. The Treebank and NE guidelines are available through the Linguistic Data Consortium.
The Treebank and PropBank guidelines applied quite readily with little modification. The NE
guidelines are very dependent on the terms being annotated and how they have been
represented in an ontology, and these were very challenging to apply. In particular, since there
were not pre-existing, generally accepted, extensive ontologies for ecology, geology and the
cryosphere as a whole, successful annotation first required clarification of the ontological
relations. We knew texts would be information dense and challenging for our annotators: some
of whom had the benefit of linguistics training but were not familiar with the domains and some
of whom were domain experts but had no linguistics training. There were many differences
between the syntax and semantics of newspaper articles (for which the NE guidelines were
developed) and scientific articles as well as major differences between the scientific domains
that needed to be accounted for in the annotator rules. This was also especially true for the
RED guidelines, which did not port well, as discussed in the “Identifying Eventualities” section
below. In addition, our need to use the outputs for ontology generation was an important factor
in development of the annotation guidelines.
(Note: In the following, text examples are in quotes, semantic labels are in short capitals and
species names are italicized.)
In Ecology
The final set of annotation guidelines for ecology was the result of one year of iterative collaboration between biologists and linguists. The guidelines resulted in robust final inter-annotator agreement of 75% or higher and correctly represented the domain knowledge. The
guidelines were influenced by existing bio-ontologies (see OBOFoundry and BioPortal) and their
communities of development because any ontologies resulting from this project would have to
work within the context of these existing ontologies.
According to the ecology annotation guidelines, all concrete things are entities. Entities were
annotated as being either ABIOTIC MATERIALS, BIOTIC ENTITIES, or AGGREGATE MIXTURES of the
two. Some examples of entities include “organism” and “ecosystem”. Any state, event, or
process was annotated as an EVENTUALITY. These are things that can be placed on a timeline.
They can have a beginning and an end (e.g., “ate” in “the cheetah ate the gazelle”) or be an ongoing process (e.g., “photosynthesis” in “plants undergo photosynthesis”).
Any property or trait of an entity was annotated as a QUALITY. This included things like “color”
and “size” or more abstract concepts like “trophic mode”. Any term that referred to a unit of time
was annotated as TIME. Examples include “summer”, “Miocene”, and “20 Sept”. Terms
describing specific places that could be found on a map, such as “Colorado” or “Sonoran
Desert” were annotated as LOCALITY. Generic terms like “tropical rainforest” were considered
environments and annotated as an AGGREGATE ENTITY. Numerical measurements were
annotated as VALUES and their units were annotated as UNITS.
Properties were assigned to entities and used to link related entities. The property HAS QUALITY
would be used to link entities and qualities. Eventualities could be linked to entities using HAS
EVENTUALITY. A quality could be linked to its value using HAS VALUE and a value could be linked
to its unit using HAS UNIT. For example, in “The net primary productivity of the ecosystem is 2 gC/m2/day”, the term “ecosystem” would be annotated as an AGGREGATE ENTITY and the term “net primary productivity” as a QUALITY. The VALUE would be “2” and the UNIT would be “gC/m2/day”. Using the properties, net primary productivity HAS VALUE 2 and 2 HAS UNIT gC/m2/day. The term “net primary productivity” could be assigned to ecosystem using the QUALITY OF property. Additional properties, such as SYNONYM, SUBTYPE, SUPERTYPE, PART OF,
and HAS PART were also part of the annotation schema.
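As a minimal sketch (not the project's actual Anafora format), the annotations for the example sentence above can be represented as typed spans plus property triples:

```python
# A hedged illustration of the ecology schema over the sentence
# "The net primary productivity of the ecosystem is 2 gC/m2/day."
# Labels follow the schema described in the text; the data structures
# themselves are invented for this sketch.
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    text: str
    label: str  # e.g. AGGREGATE_ENTITY, QUALITY, VALUE, UNIT

ecosystem = Span("ecosystem", "AGGREGATE_ENTITY")
npp = Span("net primary productivity", "QUALITY")
value = Span("2", "VALUE")
unit = Span("gC/m2/day", "UNIT")

# Property links as subject-property-object triples.
links = [
    (npp, "QUALITY_OF", ecosystem),
    (npp, "HAS_VALUE", value),
    (value, "HAS_UNIT", unit),
]

for subject, prop, obj in links:
    print(f"{subject.text} --{prop}--> {obj.text}")
```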
Annotators made several passes over the training documents, which focused on food web
interactions. On the first pass, the annotators identified entities and eventualities. On the second
pass, properties were added. These two types of annotations were separated to reduce
confusion and improve inter-annotator agreement.
For Cryology
The cryology schema initially focused on sea ice types and characteristics since that is one area
where existing semantic resources could be used to validate our processes. Sea ice types and
characteristics are often treated as entities, appearing on the sea ice charts used operationally
by ocean going vessels (see for example Partington et al. 2003). For example, we have an
annotation category called ICE WITH DEVELOPMENT which essentially distinguishes between sea
ice entities based on their stage of development which roughly correlates to the age and
thickness of the ice. “First-year ice” is sea ice of no more than one winter’s growth that has not yet survived a summer’s melt and is relatively thin, whereas “multi-year ice” has survived at least two summers’ melts and is typically thick. Both are ICE WITH DEVELOPMENT terms. We recognize nine types of ice
characteristics in our schema: ICE WITH DEVELOPMENT, ICE WITH FORM, ICE WITH CONCENTRATION,
ICE WITH SOURCE, ICE WITH ARRANGEMENT, ICE WITH ATTACHMENT, OPENINGS IN ICE, ICE WITH
SURFACE FEATURES, and ICE WITH MELTING STAGE. In addition, we have generic entities that
describe sea ice and its context: ICE, WATER, SNOW, AIR, SEASON, DIRECTION, ATTACHMENT,
AREA, LOCATION, TIME, QUALITY, VALUE, and UNITS. Any event, process, or state that affects sea
ice was annotated as an EVENTUALITY.
Cryology annotation benefited from having a narrow schema derived from the WMO Sea Ice
Nomenclature (World Meteorological Organization 2014) and a set of sea ice ontologies
developed as a part of the NSF-funded Semantic Sea Ice Interoperability Initiative (SSIII) (Duerr
et al. 2015) which meant that terms appeared in a consistent, standardized way. We could also
provide annotators with a set list of approximately 150 ice entities to capture. This narrow
schema helped to offset the difficulty of familiarizing annotators with the sea ice terms and
concepts. The first round of annotation for cryology focused on a small set of sea ice entities
and their properties that served as an entry point into the field for annotators. In the second
round of annotation, the annotators did well at identifying sea ice entities, but tended to disagree
on what eventualities to capture and how long the span ought to be. While the ice entities are
standardized in the cryology field, the terms used to describe phenomena and processes
(EVENTUALITIES) are less so and often come from different disciplines (e.g., atmospheric science
or geology). We eventually concluded that attempting to annotate entities and processes from
other disciplines was out of scope, even though cryospheric texts frequently are multi-
disciplinary due to the nature of cryospheric regions. The implication for many cryospheric texts is that the same text could potentially be annotated several times, once for each discipline represented in the text. A few small experiments with multiple annotation using the ecology and cryology schemas were attempted, though at this time the results are unclear.
Property annotations were used to label traits that distinguish one type of ice from another in an
effort to make it easier for a machine to correctly infer ontology classes. For example, using the
traits of “first-year ice” described above, the property category HAS QUALITY was used to
connect “thin” to “first-year ice”. We also have VALUE and UNIT entities that allow us to capture
measurements, such as “two meters”, and properties that allow us to assign them to their
respective entities. The property HAS EVENTUALITY allowed us to capture the relationship between events, processes, and states and their effects on ice entities, and vice versa. The
SYNONYM property allowed us to keep track of equivalent terms and phrases. SUBTYPE and
SUPERTYPE properties allowed us to order entities according to their hierarchies of types and
stages of development or deterioration. Like the ecology annotators, cryology annotators made
several passes over the training documents, first identifying entities and eventualities and
second, adding properties.
For Geology
The geology schema focuses on the core components of earthquake events and their
underlying geological processes. Those core components include WHERE the earthquake occurred with regard to mappable localities (e.g., the city of San Francisco), environmental context (e.g., the San Andreas Fault or along fault lines) and, sometimes, the precise latitude and longitude; WHEN an event or process occurred or is occurring; and the defining QUALITIES of the event or process (e.g., its intensity, magnitude, depth, distance, force, recurrence intervals, etc.). This leaves us
with a schema comprised of the following annotation categories for entities and attributes:
GEOPOLITICAL LOCATION, ENVIRONMENTAL CONTEXT, LATITUDE LONGITUDE LOCATION, TIME,
QUALITY, DIRECTION, UNIT, and VALUE. We reserved the annotation category EVENTUALITIES for
events, processes, and geological states that pertained to key themes in earthquake texts such
as strain or stress build-up, strain release, movement, formation, and deformation. We found
that arranging our thinking around themes instead of a list of predefined terms like “rupture” or “plate tectonics” helped annotators view arguably generic eventualities like “break” and “jostle” as relevant to the core concept of movement, and thus as important and worth annotating.
Property annotations allow us to draw connections between the annotated entities and
eventualities in the text. These connections give us a clearer map of the processes, events,
states, and features that comprise an earthquake event. Take this sentence for example, “The
epicenter of the earthquake was estimated to be 10 miles off the coast of California.” Our
property schema allows us to make linkages showing that “earthquake” has an ENVIRONMENTAL
CONTEXT “epicenter,” which in turn has its own ENVIRONMENTAL CONTEXT “coast” and GEOPOLITICAL
LOCATION “California.” The VALUE “10” has a UNIT of “miles.” Miles, in turn, has a DIRECTION of
“off” and “off” has an ENVIRONMENTAL CONTEXT of “coast.” If this particular sentence had listed a
time when the earthquake occurred, we would have annotated it as having a property of time.
The property schema also makes a point of identifying SUPERTYPES, SUBTYPES, and SYNONYMS
when they occur.
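Viewed as data, the property links for this sentence form a small directed graph. The sketch below uses the edge names from the schema above, but the representation itself is invented for illustration:

```python
# A hedged sketch of the property links for "The epicenter of the
# earthquake was estimated to be 10 miles off the coast of California."
# as subject-property-object edges. Not the project's storage format.
edges = [
    ("earthquake", "HAS_ENVIRONMENTAL_CONTEXT", "epicenter"),
    ("epicenter", "HAS_ENVIRONMENTAL_CONTEXT", "coast"),
    ("epicenter", "HAS_GEOPOLITICAL_LOCATION", "California"),
    ("10", "HAS_UNIT", "miles"),
    ("miles", "HAS_DIRECTION", "off"),
    ("off", "HAS_ENVIRONMENTAL_CONTEXT", "coast"),
]

def neighbors(node):
    """Follow property links outward from one annotated span."""
    return [(prop, obj) for subj, prop, obj in edges if subj == node]

print(neighbors("epicenter"))
# [('HAS_ENVIRONMENTAL_CONTEXT', 'coast'), ('HAS_GEOPOLITICAL_LOCATION', 'California')]
```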
Most of our geology corpus comprises teaching materials, textbooks, and glossaries. These resources are concerned with explaining the underlying geological processes that cause
earthquakes. These geological explanations are more detailed and nuanced than what one
might find in a blog or article breaking the news of an earthquake, which would likely place the
article’s focus on the earthquake’s impact on people, buildings, and infrastructure. We have paid particular attention to how to illustrate causal relationships in the earthquake texts. Our
solution is to extend the HAS EVENTUALITY property to allow for eventualities to be properties of
other eventualities. This allows us to show that tectonic activity causes earthquakes in a
sentence like this one: “Earthquakes are the direct result of tectonic activity”.
Annotation Results
Each domain had over 100,000 words of text from at least three different sources available for
annotation. The project goal was to annotate at least 1,000 instances of each entity, eventuality,
and property type for each domain. Ecology had four annotators, two from biology and two from
linguistics. Geology had two annotators, one from geology and one from linguistics. Cryology
had three annotators, all linguists.
Patterns emerged in the ecology annotations, which had two annotators from each discipline. At
the beginning of the annotation process, highest agreement was between annotators from the
same discipline. An example of higher agreement between annotators from the same discipline
is demonstrated by annotations of the following text: “There is a large transitional difference
between many terrestrial and aquatic systems as C:P and C:N ratios are much higher in
terrestrial systems while N:P ratios are equal between the two systems.” The ecology
annotators recognized “C:N ratios” as a QUALITY because they knew that Carbon to Nitrogen
ratios are an important measurement of an ecosystem. The linguist annotators did not know this
and annotated “C”, “N”, and “ratios” as three separate ABIOTIC MATERIALS. The linguist
annotators agreed with each other. The ecology annotators agreed with each other. Over time,
these patterns disappeared and inter-annotator agreement was more about the skill of the
individual annotator. It took three iterations of the guidelines for them to stabilize and for the
annotators to reach this point.
The annotation process was a major bottleneck in progress toward our goal. Human annotation
is very time consuming and fragile. As we refined our annotation guidelines, which could only be
tested through actual annotation, text had to be re-annotated multiple times. Ecology and
geology guidelines were revised five times while cryology guidelines were revised three times.
Many hours of work were lost in the process because changes in annotation procedure would
render previous annotations invalid. This experience makes the release of our guidelines
invaluable to future projects.
Development of an automated alternative to human annotation would be a major advance; but
even a partly automated solution would be helpful. In this project, it was possible to pre-
annotate the cryospheric texts using the lists of MWEs and terminology in the well-developed
cryospheric glossaries. This pre-annotation did appear to improve inter-annotator agreement,
but at the time of this publication there are not enough data to make definitive statements.
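A minimal sketch of such glossary-driven pre-annotation is a greedy longest-match lookup over a term list. The tiny glossary below is invented for illustration; the project's pre-annotation drew on the much larger WMO-derived glossaries:

```python
# Greedy longest-match pre-annotation against a (toy) glossary of
# multi-word terms. Term list and labels are illustrative only.
GLOSSARY = {
    ("sea", "ice"): "ICE",
    ("first-year", "ice"): "ICE_WITH_DEVELOPMENT",
    ("multi-year", "ice"): "ICE_WITH_DEVELOPMENT",
}
MAX_LEN = max(len(term) for term in GLOSSARY)

def pre_annotate(tokens):
    """Return (start, end, label) spans for glossary terms, longest match first."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GLOSSARY.get(tuple(t.lower() for t in tokens[i:i + n]))
            if label:
                spans.append((i, i + n, label))
                i += n
                break
        else:  # no glossary term starts here
            i += 1
    return spans

print(pre_annotate("Thin first-year ice drifts over open water".split()))
# [(1, 3, 'ICE_WITH_DEVELOPMENT')]
```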
Annotation Issues
Multi-Word Expressions
Historically, linguists have either annotated the “minimum span” or the “maximum span” when
annotating text (Bies et al. 2016). A minimum span is one in which only the headword of a
phrase is annotated and its pre-modifiers are ignored. We thought this approach would be
beneficial because it provides a helpful constraint for annotators, which limits the guesswork
involved and thus increases inter-annotator agreement. This strategy works very well in text that
is written concisely, using simple vocabulary, but not for scientific writing. We realized early on
that limiting our annotation to minimum spans set us up to model relationships between entities
and eventualities and their properties inaccurately. Take, for example, this ecology sentence:
“Oviparous animals lay eggs.” Minimum spans would only allow us to annotate “animals” and
“eggs” as BIOTIC ENTITIES and “lay” as an EVENTUALITY.[1] This might work fine for named-entity
annotation, but it is not helpful for ontology generation because the properties applied would say
BIOTIC ENTITY “animals” HAS EVENTUALITY “lay”. While it is true that many animals lay eggs, it is
not the case that all animals lay eggs. We needed to be able to make it clear to the machine
that it is an inherent quality of oviparous animals that they lay eggs. This is a relatively simple
example with few modifiers compared to other sentences in scientific text.
[1] Eventualities can be any part of speech and can be annotated wherever they occur. For example, “freezing” is an eventuality, but so is “frozen”.
In science writing, several descriptive words are often used to name a concept, such as “thin
Arctic first-year drift ice”. In this more complicated example, the minimum span would be “ice”,
but only annotating ice as the entity would miss crucial meaning. The author is not writing about
ice, but a specific type of ice, “first-year ice”, that is thin, drifting and located in the Arctic. An
annotator using “minimum span” annotation, would miss this meaning and annotate only “ice”.
Conversely, if “maximum span” annotation were used, the entities annotated would consist of
the entire string of modifiers associated with a term. In our example above, this would mean that
an annotator would annotate the entire phrase, “thin Arctic first-year drift ice”, as a single entity,
which merges several concepts into a single term. The correct annotation lies somewhere in the
middle, but where?
Our solution was to allow the annotation of strictly defined multi-word expressions (MWEs), which
are phrases that use multiple words to describe a single concept. We have various methods for
identifying MWEs. In general, we ask annotators to only capture minimum spans except for
instances in which the information would be inaccurately represented. In these cases and in all
three domains, we provide them with resources to check to see if a phrase qualifies as a MWE.
In ecology, annotators have access to a glossary of terms, multiword expressions, and
synonyms along with their definitions. In geology, annotators have access to a glossary built for
this project with terms and definitions borrowed from reputable glossaries in the earth sciences
field. Included in this glossary are common variations of descriptive terms. For example, the
entry for “p waves” lists its common premodifiers: p, primary, longitudinal, irrotational, push,
pressure, dilatational, compressional, push-push, etc. Annotators know that if they see any of
these premodifiers before the headword “wave” that it should be annotated as a MWE. Cryology
offers annotators a set of guidelines, which contain diagrams that outline the various types of
sea ice and their hierarchical phases or relationships based on the WMO Sea Ice Nomenclature
(World Meteorological Organization 2014). These diagrams guide the identification of a MWE.
For example, in the sentence “The response of the albedo of bare sea ice and snow-covered
sea ice to the addition of black carbon is calculated”, the phrase “bare sea ice” would not be annotated as a MWE; it is just sea ice without snow on top. You wouldn’t call snow-covered sea ice a form of sea ice, for the same reason you wouldn’t call snow-covered granite a type of rock.
The phrase “bare sea ice” would not appear in a glossary while “sea ice” would; therefore, “sea
ice” is annotated as a MWE.
These glossary resources are limited and cannot capture all the variation we see in science
texts. There is a seemingly infinite number of headword and premodifier combinations, few of
which are actually standardized multiword expressions in the field. For example, in ecology, we
cannot efficiently predict all the various combinations of premodifiers with the noun “organism”.
This means that it is neither efficient nor accurate to call each one a multiword expression and
list it in a glossary. On the other hand, we cannot ignore premodifiers without changing meaning.
To solve this problem, we ask annotators to label these premodifiers as qualities. Then
annotators create a relationship link between a premodifier and its headword called a conjoined
phrase. Any relevant properties are assigned to the conjoined phrase instead of only to the
headword. This allows us to accurately represent that “oviparous animals” HAS EVENTUALITY
“lay”. But then, how can we tell if a phrase should be annotated as a conjoined phrase or a
MWE? We can demonstrate the difference using “oviparous animals”, a conjoined phrase, and
“egg-laying animals”, a MWE. The difference between “egg-laying animal” and “oviparous
animal” is that “oviparous” is a domain-specific term that needs to be captured as an ontology
class, unlike “egg-laying”, which is a very basic definition of oviparous. That is not to say that no MWEs need to be captured as ontology classes. Terms like “ambush predator” and “brood parasitism” are MWEs that should represent ontology classes because there are no single terms
that represent these ideas.
One seemingly obvious solution is to use a property, such as QUALITY OF, to connect the head
word and its modifier; however, this is not a good idea. Adding a property to connect a head
word and a modifier is redundant because the basic grammar of the sentence (which NLP
algorithms can understand) already connects the two.
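This point can be illustrated with an off-the-shelf dependency parser, which already links the premodifier to its headword. The sketch assumes spaCy and its small English model are installed; it is not part of the project's pipeline:

```python
# Showing that a syntactic parse already connects a premodifier to its
# headword, which is why an extra QUALITY_OF link would be redundant.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Oviparous animals lay eggs.")

for token in doc:
    if token.dep_ == "amod":  # adjectival modifier relation
        print(f"{token.text} modifies {token.head.text}")
# Expected (modulo model version): Oviparous modifies animals
```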
At this stage of the project, conjoined phrases have been used in annotating ecology texts. It is
as yet unclear whether geology or cryology texts require this added level of complexity in
annotation. The efficacy of conjoined phrases in NLP and ML will be better understood after the
final results of the project are examined.
The Curse of Expert Knowledge
To begin the work, the domain experts performed some of the annotations while constructing
the schema to pass on to the annotation teams. The domain experts initially provided highly
detailed and complex annotations, capturing a broad spectrum of entities, events, and
relationships. Properties and relations were assigned that the experts knew were correct, but
were not explicitly addressed in the text. This resulted in very detailed annotations, but poor
inter-annotator agreement because annotators without the special knowledge were not able to
pick up on all the nuances. It also made it harder to create a viable annotation schema and guidelines that would be accessible to annotators outside the domain of the text.
In order to solve this problem we created four rules:
1. Property annotations should only reflect the explicit meaning of the sentence.
2. Properties should only be used to link annotations within the same sentence or paragraph, depending on the annotation guidelines for that domain.
3. Properties should only be assigned from left to right.
4. Only actors would be linked to an eventuality.
Two minor changes were made to these rules to accommodate the geology domain: a) exceptions were granted for Rule 3 in cases in which it would cause an incorrect annotation, and b) in Rule 4, links to eventualities were not restricted to actors. The
first rule was difficult for the domain experts. Instead of annotating everything they knew, they
had to stop and think about what the text was communicating directly. Sometimes granular
information in the text had to be ignored for the sake of inter-annotator agreement. This meant
that less information could be gathered from each text, so more text was needed, but less
domain knowledge was required and inter-annotator agreement was greatly improved.
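Rules 2 and 3 are mechanical enough to be machine-checked. Below is a minimal, hypothetical validator over spans with character offsets and sentence ids; it sketches the idea rather than reproducing the project's tooling:

```python
# A toy checker for Rule 2 (links stay within a sentence) and Rule 3
# (links run left to right). Offsets and sentence ids are illustrative.
from dataclasses import dataclass

@dataclass
class Span:
    start: int  # character offset of the span
    end: int
    sentence_id: int

def check_link(source: Span, target: Span):
    errors = []
    if source.sentence_id != target.sentence_id:
        errors.append("Rule 2: link crosses a sentence boundary")
    if source.start > target.start:
        errors.append("Rule 3: link must run left to right")
    return errors

# "catfish" linked to "dorsal fin" within one sentence: passes both rules.
print(check_link(Span(4, 11, 0), Span(24, 34, 0)))  # []
```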
A potential solution is to not involve domain experts in the annotation step; however, this is not
as viable a solution as it first appears. We found that domain knowledge was necessary for
correct interpretation of the training texts and thus creating the annotation rules, such as
knowing when to use SUBTYPE vs. PART OF. For example, knowing that bacteria are PART OF the
microbial loop requires knowledge of what the microbial loop is. Correctly annotating actors,
products, and subprocesses of processes like photosynthesis can be difficult if you do not know
what photosynthesis is. Even annotation of common terms can be problematic without domain
knowledge. For example, in the cryosphere, “old ice” is sea ice which has survived “at least one summer’s melt” and can be further subdivided into “residual first-year ice”, “second-year ice” and “multiyear ice” depending on 1) the number of summers’ melts that it has survived and 2) whether or not it is in a “new cycle of growth”. A domain expert would recognize “summer’s
melt” and “new cycle of growth” as being tied to a season, making it a kind of time expression
that they would annotate as a TIME entity. A non-domain expert would likely miss that
connection. Units themselves can be complicated to annotate correctly without domain
knowledge, such as “% per decade”, which would likely not be recognized as a unit by a non-
expert.
Alternatively, one could involve only domain experts in the annotation step. This also is not a
particularly viable solution. Linguistics training is just as important to the development and
implementation of the guidelines as domain knowledge. Once the guidelines are developed,
annotation is not the best use of a domain expert’s skill set. Domain experts’ time tends to be
expensive and they are less likely to have the temperament for text annotation. A good mix of
domain expertise and linguistics expertise is needed for the successful development and
implementation of annotation guidelines.
Hyperinformative Proper Nouns
An important annotation rule of thumb from the linguistics discipline is to avoid doubly
annotating a term. This was a problem for some proper nouns, such as the 1906 San Francisco
Earthquake. The entire phrase refers to a specific earthquake, which is an eventuality, but the
name also gives the time, 1906, and the location, San Francisco. To solve this problem we
decided to give preference to location and time when annotating proper nouns. Thus, “1906”
would be annotated as TIME, “San Francisco” would be annotated as a LOCATION, and
“earthquake” would be annotated as an EVENTUALITY. If the proper noun does not include a
location or a time, such as “The Good Friday Earthquake”, then the whole name is annotated as
a MWE.
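A hedged sketch of this preference rule as code: pull a year out as TIME and any gazetteer match out as LOCATION, falling back to a single MWE otherwise. The gazetteer and regex are invented for illustration; in the project this decision was made by human annotators:

```python
# Decomposing a hyperinformative proper noun into TIME, LOCATION, and
# EVENTUALITY spans, with a whole-name MWE as the fallback.
import re

LOCATIONS = {"San Francisco", "Anchorage"}  # toy gazetteer

def decompose(name):
    spans = []
    year = re.search(r"\b1\d{3}\b|\b20\d{2}\b", name)
    if year:
        spans.append((year.group(), "TIME"))
    for loc in LOCATIONS:
        if loc in name:
            spans.append((loc, "LOCATION"))
    if not spans:  # no time or location in the name
        return [(name, "MWE")]
    rest = name
    for text, _ in spans:
        rest = rest.replace(text, "").strip()
    spans.append((rest, "EVENTUALITY"))
    return spans

print(decompose("1906 San Francisco Earthquake"))
# [('1906', 'TIME'), ('San Francisco', 'LOCATION'), ('Earthquake', 'EVENTUALITY')]
print(decompose("The Good Friday Earthquake"))
# [('The Good Friday Earthquake', 'MWE')]
```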
Double annotation is a problem largely because of inter-annotator agreement. If text can be
annotated multiple times, then a door is opened to a potentially endless array of annotations
because nearly everything can be annotated multiple ways. If double annotation is allowed,
inter-annotator agreement can be very low. A second problem arises from the standard machine
learning approaches used to identify and classify entities. Such approaches work well for spans
with unique annotations over the tokens that comprise them. Although it is possible to train
systems to identify nested annotations, performance on such annotations is of much lower
accuracy.
Complicated Properties, Qualities, Traits, and Values
The terms PROPERTY, QUALITY, TRAIT, and VALUE were often used interchangeably between the
three scientific domains to describe a characteristic of a thing. Examples include thickness of
ice, the color of a feather, or the duration of an earthquake. Modeling a statement about a
characteristic of a thing is a very basic triplet: ice type - thickness - 34 cm. However, community
agreement on how best to model this knowledge between and even within domains has not
been reached. Further discussion of domain-specific knowledge modeling is below in the
“Collaboration Issues” section.
The annotation of qualities in ecology text was challenging. Initially characters were annotated
as trait value pairs. For example “The [catfish] has a [round] [dorsal fin shape].” “Catfish” is
annotated as a BIOTIC ENTITY, “round” is a VALUE and “dorsal fin shape” is the TRAIT. Properties
were used to connect the three: “catfish” HAS TRAIT “dorsal fin shape” and “round” VALUE FOR
“dorsal fin shape”. However, this method worked for only a very limited number of statements. Most of the time only a value was given and the trait was understood: “The [catfish] has a [round] [dorsal fin].” In this example “dorsal fin” is an entity. The “shape” trait is not mentioned.
This was confusing for annotators and led to high disagreement.
We solved this problem by annotating all traits and values as qualities. The previous example
sentence “The [catfish] has a [round] [dorsal fin].” would now be annotated with “round” as a
quality. The properties would state: “catfish” HAS PART “dorsal fin” and “round” QUALITY OF
“dorsal fin”. If the trait was explicit then they would both be annotated as qualities with one as a
SUBTYPE of the other. We agreed that QUALITY and TRAIT would be synonymous across
domains. VALUE would now be used for numerical values only. All mentions of characteristics
would be annotated as a QUALITY as described above in the “Annotation Methods” section.
Qualifiers and Relative Values
Sometimes information is communicated in relation to something else. For example, “TWTTs
from the 2009 IceBridge survey were found to contain consistently shorter radar-wave delays
than the 2009 McGrath ground-based survey despite being collected only 2 weeks earlier, with
a mean equivalent ice thickness approximately 10 m lower and therefore a significant outlier
relative to the other surveys.” The phrase “10 m lower” is not an ICE THICKNESS. It is a change in
ice thickness between two surveys whose values are not reported. Another example: “Tigers,
Panthera tigris, are the largest members of the cat family, Felidae”. This sentence is
communicating the size of Panthera tigris without giving any precise measurements. How can
we capture that information in the annotations? Terms like “maximum”, “minimum”, “third lowest”,
“rapid” are all used to communicate important information about an entity or process but can
only be precisely defined in relation to other things which may or may not be explicitly
mentioned.
No solutions were developed within this project. In ecology, this was because most assertions
made using qualifiers and relative values, such as the examples given above, were not relevant
for the types of ontologies we were seeking to make. In cryology, these assertions could be
important, but the project did not have the resources to resolve this problem, which is a larger
issue in NLP research (Palmer & Xue 2010; Lassiter 2015; Lassiter & Goodman 2017; Kennedy
2007; Bos & Nissim 2006). Thus, qualifiers and relative values were not annotated.
Negation
Often, in the text, negation was used to describe a condition that was not true rather than stating
what was true. For example, in the sentence “The great white shark does not form schools”, the
negation is a very important part of describing the social behaviors of the great white shark.
Negation is an ongoing area of NLP research that has been discussed at length elsewhere
(Morante & Sporleder 2012; Palmer & Xue 2010; Wu et al. 2014; Blanco & Sarabi 2016;
Morante et al. 2008; Blanco & Moldovan 2011; Zou et al. 2014). This project handles negation
by annotating all entities and eventualities, but not the negated properties. Using the shark
example above, “great white shark” and “schools” would be annotated as BIOTIC ENTITIES while
“form” would be annotated as an EVENTUALITY. Since sharks do not form schools, the property
HAS EVENTUALITY would not be used to link “great white shark” with “form”. This is not a solution
to negation in NLP, strictly speaking, but it captures what is needed to train an algorithm for our
purposes.
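This convention can be sketched with a dependency parse that suppresses the HAS EVENTUALITY link when the verb is negated. The sketch assumes spaCy's small English model is installed and illustrates the convention only; it is not the project's code:

```python
# Annotate the subject and the verb, but withhold the HAS_EVENTUALITY
# link when the verb carries a negation dependent.
import spacy

nlp = spacy.load("en_core_web_sm")

def eventuality_links(sentence):
    doc = nlp(sentence)
    links = []
    for token in doc:
        if token.pos_ == "VERB":
            negated = any(child.dep_ == "neg" for child in token.children)
            for child in token.children:
                if child.dep_ == "nsubj" and not negated:
                    links.append((child.text, "HAS_EVENTUALITY", token.text))
    return links

print(eventuality_links("The great white shark does not form schools."))  # []
print(eventuality_links("The cheetah ate the gazelle."))
# [('cheetah', 'HAS_EVENTUALITY', 'ate')]
```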
In our geology corpora, negation often is used to dispel misinformation about what causes or
happens during earthquakes or it is used to differentiate the properties of one seismic region or
process from another region or process. This sentence is an example of the latter: “In Eastern
and Northern Canada, earthquakes are not related to volcanic processes.” For this sentence we
would annotate the locations as well as the mention of earthquake and volcanic processes, but
we would not link volcanic processes as having (i.e. causing) the eventuality “earthquake” as a
property.
Our cryology texts do not have as many instances of negation as our biology and geology texts.
This leaves little need to use negation to define the entities, events, processes, and states.
When we do see negation used, it functions as a way to mark the differences in the current state of the cryosphere from its state in previous years: “Coastal polynyas are not unusual at this time of year, but the polynyas we are currently seeing appear larger and more numerous than
usual.”
Another study of information extraction from biology texts found that negation led to only 7% of
all errors (for example, incorrectly linking great white sharks with schooling behavior),
suggesting that, at least in biological texts, while negation errors are important, they are not the
largest source of information extraction error (Thessen & Parr 2014).
Identifying Eventualities
The most difficult area in which to get good inter-annotator agreement was the identification of eventualities. In geology, there is enormous variation in the terms used to describe
earthquakes (earthquake, temblor, shock, tremor, earthquake activity, seismic activity, shaking,
ground shaking, break, rupture, etc.) and geological eventualities involved in or affected by this
activity. When we first started this project, we wanted to only capture eventualities that were
considered standard terms in the geological field. However, this led to disagreement between
annotators: one would annotate “activity” as an EVENTUALITY while the other would consider that
to be too abstract and generic. We quickly noticed that the same issue was cropping up in our
ecology and cryology annotations. We solved this dilemma by changing our thinking about eventualities so that we no longer categorized them as generic versus specific. Under the original model, a term was considered specific only if it was standardized and recognized in its field. That was a poor model for the kind of corpora we were working with, especially considering that most of our
ecology and geology corpora presented simplified explanations of key biological and geological
concepts for their grade-school audiences. We find explanations like “plants capture the sun’s
energy and turn it into food” as often as we find the term “photosynthesis.” Similarly in geology,
a tectonic plate may undergo subduction or it might “fall, be consumed, or sink.” We needed to
create a method that allowed all of our annotators to recognize “capture” and “photosynthesis”
as being eventualities that describe plant processes. The method also had to be applicable to
the other two fields so that we would not miss out on “sink” and “subduction” in geology, or
“increase” or “ice growth” in cryology.
Our solution was simple: Annotate an eventuality if it coincides with the focus of this project and
is an essential process or theme within the context of the domain. The annotation focus of this
project is to capture the key processes, concepts, and entities in the field while ignoring mentions
in the text that refer to human interactions with the data. For example, while we would annotate
a Bengal tiger’s body weight, we would not annotate the mention of the scientist doing the
measuring in the sentence, “The zoologist weighed the Bengal tiger at 325 kg.” This approach is
followed in geology and cryology too. In geology, we will often see mentions of the cost of
earthquakes in regards to loss of life and damage to human structures, which we have decided
not to annotate. In cryology, references to people are usually in reference to scientists’
observations and measurements of the state of the cryosphere. Again, we do annotate these
observations and measurements, but do not annotate the act of observing and measuring as
eventualities because the latter do not fall within the scope of the ontologies we aim to create in this project.
To help the annotators consistently identify eventualities, we created themes for each domain
reflecting the types of ontologies we intended to produce. For example, we are not creating
ontologies to model knowledge about the practice of science, so mentions of models,
equipment, or theories were not annotated. The essential cryology themes include: heating and
cooling; ice loss and ice growth; amounts of precipitation and movement of ice. For geology
texts we narrowed the themes to those that were relevant to earthquakes: formation,
deformation, heating, cooling, strain and stress, strain release, and movement. The essential
themes for ecology included processes, states, or events related to survival of organisms (e.g.
predation, reproduction, habitat loss, etc.) and the entities involved in these processes (e.g.
organisms, substances, environments, etc.). Only eventualities that fit within the boundaries of
these themes were annotated and this restriction improved agreement between annotators.
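A minimal sketch of such a theme filter: a candidate eventuality is kept only if it maps onto one of the domain's themes. The keyword-to-theme mapping below is invented for illustration; in the project this judgement was made by annotators following the guidelines:

```python
# A toy theme filter for candidate cryology eventualities. Real
# annotation decisions were made by humans against written guidelines.
THEME_LEXICON = {
    "melt": "ice loss", "thaw": "ice loss",
    "freeze": "ice growth", "accumulate": "ice growth",
    "warm": "heating", "cool": "cooling",
    "drift": "movement of ice",
}

def keep_eventuality(term):
    """Return (keep?, theme) for a candidate eventuality term."""
    theme = THEME_LEXICON.get(term.lower())
    return (theme is not None, theme)

print(keep_eventuality("melt"))     # (True, 'ice loss')
print(keep_eventuality("observe"))  # (False, None): out of project scope
```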
Text Quality
There is an abundance of texts available in digital format for each domain across the entire
spectrum of potential audiences. There are no universal rules for NLP text selection, only that
texts are chosen to fit the task at hand (Palmer & Xue 2010). The ideal texts for this project
used and defined domain-specific terms, such as a textbook intended for an educated novice.
These texts would use a term and then define it. The simpler texts written for children and the
general public were often written using an artistic prose that described processes of interest, but
used ambiguous terms that were difficult to annotate. The more complicated texts were
scientific journal publications that used many useful terms, but assumed the reader had all
necessary background knowledge and thus did not define the terms.
To address this problem, we used domain glossaries for algorithm training in addition to the
annotated articles. Sometimes articles that were poorly written or intended for children had to be
abandoned in the annotation process to improve inter-annotator agreement. We invested time in
careful selection of text that used and defined relevant terms. This saved significant resources
and resulted in better annotations.
Collaboration Issues
The team for this project comprised computational linguists, NLP experts, geologists, biologists, and cryologists. It was a remarkably interdisciplinary project.
Communication
As is true for every interdisciplinary collaboration, communication was often difficult (Edwards et
al. 2011), at least initially. Each domain and discipline used its own specialized vocabulary that the others were not familiar with. Some terms, like “event”, “feature”, and “process”, were used by all
domains, but had different definitions (see Appendix A and Donovan et al. 2015). Several terms
were used by one domain or discipline, but unknown to others. Communication issues can have
a disproportionate impact on a project by increasing the time necessary to work through
technical issues.
When differences in term usage were suspected, the group would take time to identify the
difference and listen to each other’s understanding of the term. No “official” definition was made
for the project, but the discussion helped the team understand how the term was used by
others. When a team member used a term that others were not familiar with, the meeting was
stopped and definitions were requested. Specifically allowing such interruptions is important at
the beginning of a project. This practice was not immediately arrived at by our group.
Communication regarding how to use the computational linguists’ annotation software, Anafora,
was critical and central to the project. Anafora was relatively easy to use and had
documentation, so the annotators were able to learn it quickly. When there were problems with
the software, we were able to get help from the developers who were at the same institution as
one of the collaborators. Anafora is open source, so even if we could not get the developers’
attention, we would eventually be able to fix issues ourselves, albeit with more effort. The
importance of having annotation software that is easy to use and well maintained cannot be overstated.
Remote Participation
The distributed multi-institutional and multi-city nature of the team added another layer of
difficulty to collaboration. These problems were partially solved with periodic face-to-face and
remote video meetings. Collaborators would hold some of their more complicated questions
until those meetings. Communication in between was primarily through group email and the
GitHub issue tracker. Project code and documents were shared using a Google Drive folder and
a GitHub repository. Not all team members were familiar with the use of Git, so some could not take full advantage of its version control and collaborative editing features.
Team Management
To begin, not all team members fully understood the goals of those outside their immediate
circle and discipline. Meetings sometimes had to be interrupted for explanations from another domain, and expectations had to be adjusted. The project was designed to have a
project manager at the postdoctoral level, but the draw of lucrative commercial positions in the
same field of research made it very difficult to hire. As a result, the project did not gain a project
manager until its last year and that person remained for less than the full term. Clearly, the
project would have benefited from an earlier appointment to focus on deliverables and
dependencies across the various domains. The project, with so many moving parts, seems to
have required more project management than usual.
A critical appointment was the manager of the annotation teams, who acted as a go-between for the annotators and the domain specialists and also synthesized the annotation guidelines. The effectiveness of this person was essential, since the role involved coordinating with the annotators, the domain specialists, the computational linguists, and the developers of the annotation software.
Knowledge Modeling
The way in which each discipline thought about how knowledge should be arranged (modeled)
was often different and had to be reconciled. For example, premodifiers like dead, living, and
decaying were viewed as processes (and thus eventualities) by the linguists while the domain
annotators viewed them as qualities. Another example is the annotation of abiotic vs. biotic
entities. Linguists thought of abiotic vs. biotic as dead vs. alive, whereas the domain annotators viewed it as inorganic vs. organic. Much effort had to be expended early in the project to arrive
at a joint position on such issues – before the actual annotation work began. Unfortunately, the
differences persisted as work proceeded from low to high levels of granularity. Generally the solution was not to be overly specific or restrictive on terms, as in the QUALITY vs. TRAIT vs. VALUE annotation issue discussed above.
In biology, especially in bio-ontologies, a QUALITY is a defined class derived from the Basic
Formal Ontology (BFO http://purl.obolibrary.org/obo/BFO_0000019). In the Phenotype Quality
Ontology, the qualities are modeled as subclass hierarchies. For example, “color” is a QUALITY
(as in BFO) and all of the actual colors (blue, red, etc.) are its subclasses. Data about
characteristics are modeled using EQ (Entity Quality) syntax that pairs ENTITIES, like “dorsal fin”,
with a QUALITY, like “rounded” (Mabee et al. 2007). Another standard in biology is Darwin Core, which models biodiversity information with data tables linked using a star schema (Wieczorek et
al. 2012). In this model, characteristics are modeled as trait-value pairs, where the TRAIT might
be “fin shape” and the VALUE might be “rounded”. The trait-value pairs are linked to an
OCCURRENCE which is linked to a TAXON. In Darwin Core, an OCCURRENCE can be an
observation or a specimen.
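The contrast between the two modeling styles can be sketched as data. The field names below are simplified illustrations of EQ syntax and of a Darwin Core-style record, not exact term URIs from either standard:

```python
# EQ syntax pairs an entity with a quality; a Darwin Core-style record
# keeps trait-value pairs on an occurrence linked to a taxon. Both
# structures here are simplified sketches of the standards described.
eq_statement = {
    "entity": "dorsal fin",  # E: the bearer of the quality
    "quality": "rounded",    # Q: e.g. from a quality ontology such as PATO
}

darwin_core_record = {
    "occurrenceID": "occ-001",  # hypothetical identifier
    "taxon": "Ictalurus punctatus",
    "measurements": [
        {"trait": "fin shape", "value": "rounded"},
    ],
}

print(eq_statement)
print(darwin_core_record)
```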
With the development of the Semantic Web for Earth and Environmental Terminology (SWEET)
(Raskin & Pan 2005), semantics and ontologies have been a part of earth science informatics
writ large. Focused primarily on terminology to help users find relevant data sets, SWEET has
nine top-level concepts: representation, process, phenomena, matter, realm, human activities,
property, state, and relation. Of these, PROPERTY comes closest to BFO’s concept of QUALITY,
including having a large subclass hierarchy. However, properties focus primarily on things that
can be measured or observed, utilizing conceptualizations from mathematics and physics.
SWEET uses the Global Change Master Directory’s (GCMD) science keywords, which are used
worldwide by the members of the Committee on Earth Observation Satellites (CEOS) and the
International Directory Network (IDN). For example, “Temperature”, a subclass of
“ThermodynamicQuantity”, is expected to have a VALUE and UNITS and is a measure of the
energy property “Heat”, while the term “Color” is characterized as an “OrdinalProperty” with
subclasses of “Luster”, “Pigment”, and “Streak”.
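As an illustration, the subclass structure just described can be sketched as parent-to-child
pairs; the class names follow the text above, but their exact placement within SWEET's full
hierarchy is simplified here:

    # Illustrative fragment of a SWEET-style property hierarchy (simplified).
    subclass_of = {
        "Property": ["ThermodynamicQuantity", "OrdinalProperty"],
        "ThermodynamicQuantity": ["Temperature"],
        "OrdinalProperty": ["Color"],
        "Color": ["Luster", "Pigment", "Streak"],
    }

    def ancestors(cls, hierarchy):
        """Walk upward through the subclass relation toward the top-level concept."""
        parents = [p for p, kids in hierarchy.items() if cls in kids]
        return parents + [a for p in parents for a in ancestors(p, hierarchy)]

    print(ancestors("Luster", subclass_of))  # ['Color', 'OrdinalProperty', 'Property']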
As it is primarily a mid-level ontology, different disciplines within the earth sciences have
extended SWEET to provide more depth in their domains. A list of earth science domain-specific
and application-specific ontologies has been compiled (Whitehead 2017); however, the only
cryospheric ontologies on that list are the sea ice ontologies developed as part of the SSIII
project mentioned earlier (Duerr et al. 2015). More recently, a variety of cryospheric terms have
been added as extensions to the Environment Ontology (EnvO), which represents knowledge
about “environments, environmental processes, ecosystems, habitats, and related entities” and
is interoperable with other ontologies in the Open Biomedical Ontologies (OBO) Foundry
(Buttigieg et al. 2016).
Given the vastly differing heritages of OBO and SWEET, it is not surprising that the two
foundational ontologies are not currently interoperable. However, just this year the SWEET
ontology was open-sourced, and the Earth Science Information Partners (ESIP) semantic
community is actively bringing it up to date with modern ontology engineering practices.
Working with the OBO/EnvO community, ESIP’s long-term goal is to harmonize SWEET and
EnvO to facilitate interoperability.
It can be argued that these precise, technical knowledge models, the ontologies, are not
exactly what is needed for processing natural, human-authored texts. Such texts were not
written with the ontologies at hand; rather, dictionaries, thesauri, or popular usage form their
basis, particularly for some genres. It has to be allowed that many written texts will deviate
from the highly technical, precise definitions in the ontologies, and those deviations could
degrade NLP results. The project therefore judged that broader, more thematic solutions to
semantic and knowledge-modeling questions were more workable for the NLP, including for
achieving inter-annotator agreement. The decision to annotate all characteristics of an entity
as a QUALITY is one example; this usage is widespread in biology and partially in use in
cryology and geology. The adoption of EVENTUALITIES is another. In this way, the annotation
guidelines formed during ClearEarth present an alternative knowledge model for the terms in
the texts, one oriented toward successful automated parsing and analysis of texts.
Future Work
This project identified three areas of NLP research that need further investment to improve the
mining of scientific text. First is the proper annotation of time entities that do not explicitly
mention time; the entity type and properties that should be assigned in these cases are not
clear. These entities are usually associated with seasons, such as “winter’s growth”, and can
be very important in domains that involve earth and life processes. Second is the handling of
relative measurements, such as “the tiger is the largest member of the cat family”. This is a
very difficult problem that the NLP community is not addressing, likely because the exact
interpretation of relative terms can be domain-specific, making a single solution difficult. Third
is negation, which has received substantial research attention but is still not solved.
Another major issue, possibly a grand challenge of NLP, is the time- and labor-intensive nature
of high-quality annotation. Any technique that reduces the effort needed to achieve significant
performance improvements will make this technology much more portable to new domains.
This project had a pre-annotation step in which a “white list” of terms gathered from
domain-specific glossaries was automatically annotated in the text. Even this very simple
pre-annotation step reduced annotation effort and made it easier for annotators to decide
whether a document was worth annotating, for example by looking at the number of
pre-annotated terms. Very rarely was a pre-annotation deleted by an annotator. More
sophisticated automated pre-annotation methods, such as active learning, which deliberately
selects the most informative instances for annotation, should be targeted for further
development in order to reduce the annotation workload and the amount of training data
needed.
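For illustration, white-list pre-annotation of this kind amounts to little more than dictionary
matching; the following sketch (Python, with an invented glossary) is one plausible
implementation, not the project's actual code:

    import re

    # A hypothetical white list drawn from a domain glossary.
    white_list = ["frazil ice", "pancake ice", "sea ice", "ice", "floe"]

    def pre_annotate(text, terms):
        """Return (start, end, term) spans for each white-list term found in text.

        Longer terms are matched first, and overlapping shorter matches are
        discarded, so that "pancake ice" wins over a bare "ice".
        """
        spans, taken = [], set()
        for term in sorted(terms, key=len, reverse=True):
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                if not taken & set(range(m.start(), m.end())):
                    spans.append((m.start(), m.end(), term))
                    taken.update(range(m.start(), m.end()))
        return sorted(spans)

    print(pre_annotate("Frazil ice thickens into pancake ice near the floe edge.", white_list))

Counting the spans returned for a document gives annotators the quick worth-annotating signal
described above.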
These annotations will be used to train NLP algorithms to automatically generate ontologies for
the Earth and Life Sciences. The existing sea ice ontology will be compared with the sea ice
ontology developed via ontology induction as a measure of the method’s success. The ecology
ontology will be placed on GitHub, where it will be vetted by the existing bio-ontology
community. Portions of the ontology that are accepted will be transferred to the new ecocore
ontology, which was developed during the ClearEarth hack-a-thon and is now awaiting approval
by the OBO Foundry.
Conclusions
Natural Language Processing and Machine Learning methods are promising technologies for
the extraction of data from natural science text. Taking the first step in applying these
technologies, creating an annotated corpus, requires collaboration between scientists and
linguists, two groups that do not often work together. This collaboration surfaced several
annotation and collaboration issues that had to be addressed. Our solutions have been
documented and are available for use and modification, so that subsequent projects can avoid
the inefficiency of reworking the concepts and workflows that we encountered.
Continued interdisciplinary collaboration is necessary to build the data infrastructures that
support work on large-scale problems in natural science. Collaborative, interdisciplinary
annotation projects should consider the following recommendations:
1. Any domain that uses its own specialized vocabulary and writing style will need its own
annotated corpus and annotation guidelines. Sometimes a domain will have more than
one specialized vocabulary and writing style depending on the audience. Each will need
its own annotation guidelines and corpus.
2. Projects should budget significant resources for annotation. At least a year is needed for
development of guidelines for annotation of text.
3. Development of the corpus and annotation guidelines will be an iterative process for all
new domains.
4. Take time at the beginning of the project to discuss goals and expectations. Include a
strong project manager who can focus on deliverables and dependencies.
5. Wait to annotate text in bulk until the guidelines and schema have been tested in
small groups and inter-annotator agreement is good (a sketch of one common agreement
measure follows this list).
6. Several annotators will be needed. This project used eight among the three domains.
7. Face-to-face project meetings are invaluable for good communication, especially in
cases where multiple disciplines are involved.
8. Sometimes granularity of the knowledge modeling will have to relax in order to get good
inter-annotator agreement. Only annotate as much as you need for the project goals.
9. Differences in term usage between the domains should be expected, identified, and
worked through.
10. Projects greatly benefit from someone to manage the annotators and act as a go-
between for the annotators and the domain scientists.
11. A good annotation tool is a necessity. The tool must have a user-friendly interface and
have readily available tech support for when things break.
12. At the beginning of the project the group should agree on and document tools and
methods for managing group communications, such as document version control.
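Recommendation 5 presupposes a way to quantify inter-annotator agreement. One common
chance-corrected measure is Cohen's kappa; the following minimal sketch (Python, with
invented label sequences) shows the computation for two annotators labeling the same tokens:

    from collections import Counter

    def cohens_kappa(a, b):
        """Cohen's kappa for two annotators' equal-length label sequences."""
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        ca, cb = Counter(a), Counter(b)
        expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
        return (observed - expected) / (1 - expected)

    # Invented example: two annotators labeling the same ten tokens.
    ann1 = ["ENTITY", "QUALITY", "O", "O", "ENTITY", "O", "QUALITY", "O", "ENTITY", "O"]
    ann2 = ["ENTITY", "QUALITY", "O", "ENTITY", "ENTITY", "O", "O", "O", "ENTITY", "O"]
    print(round(cohens_kappa(ann1, ann2), 3))  # 0.672; values near 1 indicate strong agreement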
Appendix A
Glossary of Important Cross-Disciplinary Terms
Head Word: The word in a phrase that determines the phrase’s type and central meaning. This
is a linguistic term that is not used in ecology, cryology, or geology.
Modifier: A word in a phrase that describes or qualifies the head word. This is a linguistic term
that is not used in ecology, cryology, or geology.
Semantics: This is a linguistic term that refers to the meaning of words and sentences. The
distinction between semantics and syntax is important. Two very different words or
sentences can have the same meaning and two very similar words or sentences can have
very different meanings.
Syntax: This is a linguistic term that refers to the set of rules that govern sentence formation.
This can manifest itself as “writing style” and can be quite different in different types of text.
For example, “He is sitting” and “Sitting, he is” are two sentences with the same meaning
(See semantics), but very different syntax.
Event, Process, and Eventuality: In the natural science domains, a process is something that
happens over time, with a beginning and an end. Our domain specialists viewed process as a
generic term for things that happen, such as photosynthesis or melting. Event was viewed as
much more specific; for example, an earthquake is a process, while the 1906 San Francisco
Earthquake is an event. From a linguistics perspective, events and processes are all
eventualities, which are defined as any occurrence or outcome.
Environment, Habitat, and Locality: These three terms are often used interchangeably and
ambiguously across and within disciplines, but they have specific meanings in existing
bio-ontologies. An environment is defined by environmental characteristics; a desert, a forest,
and a gut are all examples of environments. A habitat is defined by the conditions favorable to
an organism of interest, such as zebra habitat or amoeba habitat, and can include multiple
environments. A locality is a very specific description of a place and can usually be found on a
map; for example, “Bariloche, 25 km NNE via Ruta Nacional 40” is a locality.
Agent: Actor in an event who initiates and carries out the event intentionally or consciously, and
who exists independently of the event. This is a linguistic term that is not used in ecology,
cryology, or geology.
Patient: In linguistics, an undergoer in an event that is usually structurally changed, for
instance by experiencing a change of state, location, or condition; a patient is often acted upon
by an agent, is causally involved or directly affected by other participants, and exists
independently of the event. In science, “patient” specifically refers to a person receiving
medical care, but linguistics does not have this restriction: any entity that receives an action is
a patient.
Theme: Undergoer that is central to an event or state that does not have control over the way
the event occurs, is not structurally changed by the event, and/or is characterized as being
in a certain position or condition throughout the state. This is a linguistic term that is not
used in ecology, cryology, or geology.
Entity: In linguistics, this term refers to any thing, eventuality, or abstract concept. In the
sciences, it more often refers to physical objects.
Feature, Property, Quality, and Trait: The sciences treat all of these terms as referring to a
characteristic of a physical object, such as a morphological feature, and tend to use them
interchangeably. In linguistics, a feature is a characteristic of a word that defines its class or
category.
Attribute: In linguistics, an attribute is a modifier. In the sciences, an attribute is the same as a
feature, property, quality, or trait.
Agentive: In linguistics, something is agentive if it is performing or causing the action in the
sentence. This is a linguistic term that is not used in ecology, cryology, or geology.
Multiword Expression (Multiword Phrase): Two or more words that have a different meaning
together than they do separately. This is a linguistic term that is not used in ecology,
cryology, or geology.
Schema: In linguistics, this term refers to the list of entity and property types available for
annotation. In the sciences, this word refers broadly to any a priori data structure and
vocabulary, such as a database structure or an XML document definition.
Genre: A type or level of text addressed to a particular audience or level of communication
within a domain, such as a magazine, professional journal, textbook, definition, newspaper, or
instruction manual. This is a linguistic term that is not used in ecology, cryology, or geology.
References
Albright, D. et al., 2013. Towards comprehensive syntactic and semantic annotations of the
clinical narrative. Journal of the American Medical Informatics Association, 20(5), pp.922–930.
Bahl, L.R., Jelinek, F. & Mercer, R.L., 1983. A maximum likelihood approach to continuous
speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
PAMI-5(2), pp.179–190.
Berger, A.L., Della Pietra, V.J. & Della Pietra, S.A., 1996. A maximum entropy approach to
natural language processing. Computational Linguistics, 22(1), pp.39–71.
Bethard, S., Ogren, P. & Becker, L., 2014. ClearTK 2.0: Design Patterns for Machine Learning
in UIMA. In LREC. pp. 3289–3293.
Bies, A. et al., 2016. A comparison of event representations in DEFT. In Proceedings of the 4th
Workshop on Events: Definition, Detection, Coreference, and Representation. San Diego,
pp. 27–36.
Blanco, E. & Moldovan, D., 2011. Semantic representation of negation using focus detection. In
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.
Portland, OR, USA: Association for Computational Linguistics, pp. 581–589.
Blanco, E. & Sarabi, Z., 2016. Automatic generation and scoring of positive interpretations from
negated statements. In Proceedings of NAACL-HLT 2016. San Diego, pp. 1431–1441.
Bos, J. & Nissim, M., 2006. An empirical approach to the interpretation of superlatives. In
Proceedings of the 2006 Conference on Empirical Methods in Natural Language
Processing. Sydney: Association for Computational Linguistics, pp. 9–17.
Boser, B.E., Guyon, I.M. & Vapnik, V.N., 1992. A training algorithm for optimal margin
classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning
Theory. Pittsburgh: ACM, pp. 144–152.
Buttigieg, P.L. et al., 2016. The environment ontology in 2016: bridging domains with increased
scope, semantic density, and interoperation. Journal of Biomedical Semantics, 7(1), p.57.
Cambria, E. & White, B., 2014. Jumping NLP Curves: A Review of Natural Language
Processing Research [Review Article]. IEEE Computational Intelligence Magazine, 9(2),
pp.48–57.
Chen, W. & Styler, W., 2013. Anafora: A Web-based General Purpose Annotation Tool. In HLT-
NAACL. pp. 14–19.
Cortes, C. & Vapnik, V., 1995. Support-vector networks. Machine Learning, 20(3), pp.273–297.
Donovan, S.M., O’Rourke, M. & Looney, C., 2015. Your Hypothesis or Mine? Terminological
and Conceptual Variation Across Disciplines. SAGE Open, 5(2).
Dostert, L.E., 1955. The Georgetown-I.B.M. Experiment. In Machine Translation of Languages.
New York: John Wiley & Sons, Ltd, pp. 124–135.
Duerr, R.E. et al., 2015. Formalizing the semantics of sea ice. Earth Science Informatics, 8(1),
pp.51–62.
Edwards, P.N. et al., 2011. Science friction: Data, metadata, and collaboration. Social Studies of
Science, 41(5), pp.667–690.
Hajič, J., 1998. Building a syntactically annotated corpus: The Prague Dependency Treebank. In
Issues of Valency and Meaning. Prague: Karolinum, pp. 106–132.
Hodges, K.E., 2008. Defining the problem: terminology and progress in ecology. Frontiers in
Ecology and the Environment, 6(1), pp.35–42.
Isasi-Catalá, E., 2011. Indicator, umbrellas, flagships and keystone species concepts: Use and
abuse in conservation ecology. Interciencia, 36(1), pp.31–38.
Kennedy, C., 2007. Vagueness and grammar: the semantics of relative and absolute gradable
adjectives. Linguistics and Philosophy, 30(1), pp.1–45.
Kim, J.-D., Ohta, T. & Tsujii, J., 2008. Corpus annotation for mining biomedical events from
literature. BMC Bioinformatics, 9(1), p.10.
Krallinger, M. & Valencia, A., 2005. Applications of Text Mining in Molecular Biology, from Name
Recognition to Protein Interaction Maps. In Data Analysis and Visualization in Genomics
and Proteomics. Chichester, UK: John Wiley & Sons, Ltd, pp. 41–59.
Lassiter, D., 2015. Adjectival modification and gradation. In S. Lappin & C. Fox, eds. Handbook
of Contemporary Semantic Theory. Chichester, UK: John Wiley & Sons, Inc., pp. 143–167.
Lassiter, D. & Goodman, N.D., 2017. Adjectival vagueness in a Bayesian model of
interpretation. Synthese, 194(10), pp.3801–3836.
Lease, M. & Charniak, E., 2005. Parsing biomedical literature. In R. Dale et al., eds. IJCNLP
2005. Jeju Island, Korea: Springer, pp. 58–69.
Mabee, P.M. et al., 2007. Phenotype ontologies: the bridge between genomics and evolution.
Trends in ecology & evolution, 22(7), pp.345–350.
Manning, C.D., 2015. Computational Linguistics and Deep Learning. Computational Linguistics,
41(4), pp.701–707.
Marcus, M.P., Santorini, B. & Marcinkiewicz, M.A., 1993. Building a large annotated corpus of
English: The Penn Treebank. Computational Linguistics, 19(2), pp.313–330.
McCarty, C.A. et al., 2011. The eMERGE Network: A consortium of biorepositories linked to
electronic medical records data for conducting genomic studies. BMC Medical Genomics,
4, p.13.
Morante, R., Liekens, A. & Daelemans, W., 2008. Learning the scope of negation in biomedical
texts. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language
Processing. pp. 715–724.
Morante, R. & Sporleder, C., 2012. Modality and Negation: An Introduction to the Special Issue.
Computational Linguistics, 38(2), pp.223–260.
National Snow and Ice Data Center, 2016a. All About Sea Ice. Available at:
http://nsidc.org/cryosphere/seaice/index.html [Accessed February 1, 2016].
National Snow and Ice Data Center, 2016b. Arctic Sea Ice News & Analysis. Available at:
https://nsidc.org/arcticseaicenews [Accessed February 1, 2016].
Pakhomov, S., Coden, A. & Chute, C., 2006. Developing a corpus of clinical notes manually
annotated for part-of-speech. International Journal of Medical Informatics, 75(6), pp.418–
429.
Palmer, M., Gildea, D. & Kingsbury, P., 2005. The proposition bank: An annotated corpus of
semantic roles. Computational Linguistics, 31(1), pp.71–106.
Palmer, M. & Xue, N., 2010. Corpus Annotation. In S. Lappin, A. Clark, & C. Fox, eds.
Handbook of Computational Linguistics and Natural Language. Blackwell Press, pp. 238–
270.
Partington, K. et al., 2003. Late twentieth century Northern Hemisphere sea-ice record from
U.S. National Ice Center ice charts. Journal of Geophysical Research, 108(C11), p.3343.
Pearl, J., 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference,
San Mateo, CA: Morgan Kaufmann.
Pustejovsky, J. et al., 2003. The TIMEBANK corpus. In Corpus Linguistics. pp. 647–656.
Pyysalo, S. et al., 2006. Lexical adaptation of link grammar to the biomedical sublanguage: a
comparative evaluation of three approaches. BMC Bioinformatics, 7(Suppl 3), p.S2.
Raskin, R.G. & Pan, M.J., 2005. Knowledge representation in the semantic web for Earth and
environmental terminology (SWEET). Computers & Geosciences, 31(9), pp.1119–1125.
Rimell, L. & Clark, S., 2009. Porting a lexicalized-grammar parser to the biomedical domain.
Journal of Biomedical Informatics, 42(5), pp.852–865.
Styler, W.I. et al., 2014. Temporal annotation in the clinical domain. Transactions of the
Association for Computational Linguistics, 2, pp.143–154.
Thessen, A.E. & Parr, C.S., 2014. Knowledge extraction and semantic annotation of text from
the encyclopedia of life. PLoS ONE, 9(3).
Whitehead, B., 2017. Geoscience-semantics: First release--stripped down version (Version
v0.1) [Data set].
Wieczorek, J. et al., 2012. Darwin Core: An evolving community-developed biodiversity data
standard. PLoS One, 7(1), p.e29715.
World Meteorological Organization, 2014. WMO Sea-Ice Nomenclature, Secretariat of the World
Meteorological Organization.
Wu, S. et al., 2014. Negation’s not solved: Generalizability versus optimizability in clinical
natural language processing. PLoS ONE, 9(11), p.e112774.
Zou, B., Zhou, G. & Zhu, Q., 2014. Negation focus identification with contextual discourse
information. In ACL. pp. 522–530.