Report on Results of a Hackathon to Progress with the Training Resources for Natural Language Processing (NLP) in Ecology
Sponsored by the National Science Foundation, USA
Results and Discussions from the ClearEarth bioHackathon
Held at University of Colorado, Boulder 14-19th August 2017
Anne E. Thessen1, Chris Jenkins2, Ramona Lynn Walls3, Shibu Antony4, Riza Batista-Navarro6,
Pier Luigi Buttigieg10, John Deck12, Rob Guralnick, Marie Angélique Laporte9, Jishnu Nair4,
Michael Regan2,8, Katja Schulz11, Brian J. Stucky7, William Ulate5, Jennifer Verdolin13, Claus
Weiland14, Riyaz Bhat2, Ruth Duerr1, Jim Martin2, Skatje Myers2, Martha Palmer2, Jenette Preciado
1. Ronin Institute for Independent Scholarship; 2. University of Colorado Boulder; 3. CyVerse, University of Arizona; 4. VIT University, Vellore; 5. Center for Biodiversity Informatics, IT Division, Missouri Botanical Garden; 6. University of Manchester; 7. Florida Museum of Natural History, University of Florida; 8. University of New Mexico; 9. Bioversity International, Montpellier, France; 10. Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar- und Meeresforschung; 11. Smithsonian National Museum of Natural History; 12. Berkeley Natural History Museums, UC Berkeley; 13. University of Redlands, CA, USA; 14. Senckenberg Biodiversity & Climate Research Centre (SBIK-F)
In August 2017, INSTAAR hosted a hack-a-thon in Boulder, Colorado, USA for the development
of semantic tools for biodiversity applications. Sixteen participants worked in parallel to develop
annotation tools, ontologies, and text mining tools. This report describes each of the projects
developed during the hack-a-thon, discusses their synthesis, and outlines a plan for moving forward.
27 December 2018
Suggested citation: Thessen, A.E., Jenkins, C.J. and 20 others, 2018. Extending the Ecological
Ontologies Using Machine Learning and Natural Language Processing: Results and
Discussions from the ClearEarth bioHackathon, University of Colorado, Boulder 14-19th August
2017. Occasional Report, INSTAAR, University of Colorado, Boulder USA.
Table of Contents
Preface
Project Overview
Hack-a-thon Project Descriptions
Gathering Training Data for Ecology
Creating a Cryospheric Open-Access Text Corpus
Creating the ECOCORE Ontology
Auto Annotation of biological text using IBM Watson's Knowledge Studio
Ontologies for insect life history and a method for automatic, guided extraction of ontology terms from domain text
Harvesting 40 million sentences of earthquakes text
Mapping Text-mined Environment Mentions to ENVO
T3. A Toolset for processing Text and building Taxonomies
Synthesis of the Meeting Results
Future Activity Proposals
Proposal - Add ecology and cryosphere terms to WordNet
Paper - Describe Ecocore
Proposal - Workshop Using ECOCORE for domain ecologists
Bibliography
Preface
The fields of semantics, machine learning, and natural language processing are extremely active at the moment, and much is expected of them. This is especially true of their application to scientific literature and scientific information. The hope is that information extraction from past, present, and new publications will bring measured data and factual assertions to our fingertips in large quantities and with high accuracy.
But there is some way to go yet before that happens. Software tools are making rapid evolutionary progress, but the hard slog of building corpora of sufficient volume and accuracy to 'teach' that software is proving quite a challenge.
In this project we attempt to leverage the generously funded biomedical semantic technologies to advance the more modestly funded natural sciences, particularly the earth sciences, cryosciences, and ecological sciences: earth, ice, and biology. This should allow us to use well-tried methods and adopt practicable standards, getting machine learning on scientific texts operating as soon as possible.
We drew together a group of energetic, leading experts in the natural sciences and computational linguistics for a 4-day event to work together and explore potential new ways of scaling the mountain-sized problem of building the corpora, and of meeting the required volumes and accuracy. Necessarily, the efforts had to be focused on what participants judged to be the most promising salients along the wide front of this work. The precise directions they took were diverse, but held together as an exploration of the problem.
This report offers the results and findings of this effort to the community; please read with interest, cite the authors' work, and contact them about future collaborations.
Chris Jenkins and Anne Thessen
Chairs of the bioHackathon
(*; ).
Project Overview
ClearEarth was a project funded by the National Science Foundation to apply ClearTK, an NLP
system used in biomedicine, to other scientific domains. The goals of this hack-a-thon included:
1. Use one or all of the tools and services developed by ClearEarth in another project
2. Develop a new resource that would help ClearEarth achieve its project goals
3. Improve the text annotation process
4. Assess the efficacy of the ClearEarth tools for other, similar projects
Sixteen participants from all over the world attended the week-long hack-a-thon and contributed
to seven projects related to developing ontologies, text mining, and text annotation.
Hack-a-thon Project Descriptions
Below are brief descriptions of the work completed during the hack-a-thon.
Gathering Training Data for Ecology
Authors: Anne Thessen, Katja Schulz, Jennifer Verdolin, Riyaz Bhat
The purpose of this work group was to generate training data for the ClearEarth algorithm in
addition to the annotated text corpus.
● Glossaries. We scraped 16 glossaries, some manually and some using code. All of the terms were merged and deduplicated (3,266 terms) and can be used as keywords to find relevant documents.
● Text. We gathered 400,000 sentences from EOL and 100,000 from PeerJ. The EOL sentences came from a data dump; the PeerJ text was scraped using custom code.
● Is_a relations. We manually created a list of 420 is_a relations.
Next Steps:
● Give the keywords to Michael Regan so he can run his tool over Wikipedia, BHL, PLoS, and BioMed Central to gather relevant documents.
● Find sentences that contain pairs of words from the glossaries. Right now I'm looking in the PeerJ text that I harvested; I could also look through the corpora that Michael is using and that Ruth is producing.
● Actually induce some ontologies.
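The pair-search step above can be sketched as a simple co-occurrence filter. The glossary, tokenisation, and sentences below are toy stand-ins, not the project's actual data or code:

```python
# Sketch: find sentences that contain at least two distinct glossary terms.
# The glossary below is a toy stand-in for the 3,266 deduplicated terms.
glossary = {"predator", "prey", "habitat", "biome"}

def cooccurrence_sentences(sentences, terms, min_hits=2):
    """Return sentences mentioning at least `min_hits` distinct glossary terms."""
    hits = []
    for sentence in sentences:
        tokens = {t.strip(".,;()").lower() for t in sentence.split()}
        if len(tokens & terms) >= min_hits:
            hits.append(sentence)
    return hits

sentences = [
    "The predator followed its prey across the habitat.",
    "Rainfall varies widely between seasons.",
    "Each biome supports a distinct set of species.",
]
print(cooccurrence_sentences(sentences, glossary))
# → ["The predator followed its prey across the habitat."]
```

A real run would load the full deduplicated term list and stream the harvested PeerJ sentences through the same filter.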
Creating a Cryospheric Open-Access Text Corpus
Authors: Ruth Duerr
● Compiled a glossary of ~2,200 cryospheric terms, many with multiple definitions, to be hosted as soon as the next version goes live later this month.
● Examined text extracted from open-access PDFs from The Cryosphere and determined that the XML versions should be used instead.
● ~400,000 sentences had been accumulated previously, of a total of one million sentences needed.
● Began accumulating and cleaning up open-access PDFs and HTML articles from the journal Polar Regions; so far another 150,000 sentences, though another 150 or so documents still need to be cleaned up.
Next Steps:
● Complete cleaning up the remaining journal articles.
● Create metadata for each corpus of cryospheric text and deposit it in the NSF Arctic Data Center.
Creating the ECOCORE Ontology
Authors: Pier Luigi Buttigieg, Anne Thessen, Marie La Porte, Jennifer Verdolin
The goal of this work group was to start an ontology of core ecological concepts to help guide
development of the NLP algorithm. When ClearEarth is able to induce ontologies, these will be
placed in the ecocore GitHub repository for human curation before being added to the ontology.
This ontology aims to provide core semantics for ecological entities, such as ecological
functions (for predators, prey, etc), food webs, and ecological interactions. Through ECOCORE,
we look forward to creating a semantic rallying point for the ecological community, a need
expressed repeatedly over the past few years at workshops focused on ecological,
environmental, and population-based semantics. We're working closely with the Environment
Ontology (ENVO), Population and Community Ontology (PCO), the Ontology of Biological
Attributes (OBA), and the Neuro Behavior Ontology (NBO) to build a robust and interoperating suite of ontologies.
● Pier initialized the ontology and added classes from other bio-ontologies.
● Other ontologies (PCO, ENVO, NBO) were improved to make ECOCORE work; this motivated changes in PCO and ENVO, which led to a release of PCO with an updated production workflow.
● Added terms from the skeletal ontologies that Anne Thessen made.
● Added terms from a domain specialist (Jennifer Verdolin).
Next Steps:
● Jennifer Verdolin (behavioral ecologist) is learning how to submit issues in GitHub to continue improving the ontology.
● We need to publish something describing/announcing ECOCORE.
● Official release (ecocore now has an OBO Foundry PURL).
● Merging the auto-generated ontologies.
● Linking with SDGIO to place biodiversity and ecology in the framework of the Sustainable Development Goals.
Auto Annotation of biological text using IBM Watson's Knowledge Studio
Authors: Shibu Antony & Jishnu Nair
IBM Watson's Natural Language Understanding (NLU) is a versatile tool that can be adapted to users' needs. It makes use of Watson Knowledge Studio to develop custom models for natural language processing. The goal of this work group was to develop an annotation tool that would greatly decrease the time spent by humans creating training corpora.
Figure: Pre-annotation workflow in IBM Watson Knowledge Studio.

Watson Knowledge Studio uses human-annotated text to develop a machine learning model specific to the task at hand, and this annotated text serves as training data for generating the model. We achieved this by uploading 20 adjudicated human-annotated texts from ClearEarth. All of the named entities and relation types have to be predefined in the Entity Types page of Knowledge Studio (see Figure). Knowledge Studio divides these documents into three sets for training and testing (see Figure). Once the training and testing data are complete, the training and evaluation process can be started from the Annotator Component page, where the machine learning model is created and trained (see Figure). The resulting model can be exported to Watson Explorer for use. Results were promising for such a small training data set (see Figure).

Next Steps:
● Explore scaling up.
● Create a more efficient model for auto-annotation of text by increasing the volume of the training set data.
● Incorporate the model into Watson Explorer and perform the auto-annotation.
Ontologies for insect life history and a method for automatic,
guided extraction of ontology terms from domain text
Authors: Brian Stucky, Katja Schulz, Jennifer Verdolin, John Deck.
We started by acquiring the text of relevant entomology textbooks and encyclopedias, with the plan to run them through the subsumption inferencing tool. This proved not to work well: the inferred relations were mostly not useful, likely due to a mismatch between the domain text and the model's training set. So, I (Brian) switched to investigating methods for identifying key phrases in domain text as a way to generate a candidate set of terms to "seed" human ontology engineering. I first tested naive keyword extraction (the RAKE algorithm), but the signal-to-noise ratio was too poor to be useful. I then implemented a method that uses human-curated lists of key phrases, extracted from book indices, to identify the most relevant key phrases in text passages. The code for this work is available online.
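A minimal sketch of this index-driven phrase-matching approach follows; the phrase list and passage are invented illustrations, not drawn from the actual textbook indices:

```python
# Sketch: use a human-curated list of key phrases (e.g. from a textbook index)
# to pull the most relevant phrases out of a text passage.
index_phrases = {"larval stage", "diapause", "instar", "pupation"}

def match_phrases(passage, phrases):
    """Return the curated phrases that occur in the passage, sorted."""
    text = passage.lower()
    return sorted(p for p in phrases if p in text)

passage = "After the final instar, diapause may delay pupation until spring."
print(match_phrases(passage, index_phrases))
# → ["diapause", "instar", "pupation"]
```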
Harvesting 40 million sentences of earthquakes text
Authors: Michael Regan, Chris Jenkins
GitHub: n/a
Objective: Implement an unsupervised learner to automatically identify and extract documents of a specified domain from a given collection. The test domains for our initial investigation are earthquake- and ecology-related topics, though we keep in mind that our techniques and tools should be extensible to other domains and multiple corpora, using simple Python scripts with a minimum of dependencies. Ideally, the command-line tools can be transitioned easily into a web-based platform.
Presently, our primary objective was to identify articles of interest to our research group, namely
all articles with the predominant topic being earthquakes or ecology. This is not a trivial task, as
articles that mention an earthquake or an ecological issue are not necessarily primarily about
earthquakes or ecology. To illustrate this, take the Wikipedia article devoted to the Greek philosopher Anaximander, which includes a short anecdote about the philosopher predicting an earthquake and thus saving a town's inhabitants. The article includes three mentions of the token 'earthquake'; however, we would probably not want to classify it as a predominantly earthquake-related article. To go beyond frequency counts of individual tokens within each article, our first method develops a probabilistic approach to topic modeling based on Gibbs sampling and expectation propagation, as described below.
Methods: The primary approach tested in the implementation is Latent Dirichlet Allocation, a
Bayesian model which first approximates the distribution of topics over an entire collection of
documents and then determines the most likely topic for individual documents within the
collection (Blei, Ng & Jordan, 2003). For the present experiments, an open-source tool available on GitHub was used for LDA.1 Briefly, given a collection of documents, LDA determines a set of topics within that collection, with each topic represented as a set of keywords (here, n=8). The hyperparameters of the algorithm include:
- Number of topics to be identified (depending on the number of articles in each batch)
- Number of keywords to be extracted for each topic
- Number of iterations (maximizing the log-likelihood of the data)
- Number of documents in each batch (larger batches decrease time efficiency)
- Input size for each article (here, the first 1500 characters of the article served as input)

1 Another machine learning library for topic modeling (and other NLP applications) is MALLET.
Similarity function: The LDA topic model generates a list of topics, each consisting of a given number of words, and then assigns a topic to each article. To determine how closely the assigned topic relates to the target domains we would like to identify (e.g., earthquakes), word embeddings (high-dimensional vector representations) are hypothesized to generalize better than direct keyword searches. Our approach has three parts:
1. Select a small set of ‘core’ terms for each target domain. Each term should be
unambiguous and frequent. We experimented with lists of about ten words in size plus
plural forms. The terms for each target domain searched for here were:
a. Earthquake: {earthquake, quake, epicenter, epicentre, mantle, tectonics,
seismic, tectonic, aftershock, lithosphere},
b. Ecology: {ecology, ecosystem, autotroph, heterotroph, environment, habitat,
algae, protist, nutrient, gene, biome, species};
2. Calculate the vector representation for each of these terms using GloVe pre-trained word
embeddings (Pennington, Socher & Manning, 2014), specifically 100-dimensional word
embeddings trained on 6B tokens; word embeddings are also determined for each term
of the topic models generated by the LDA algorithm;
3. Each set of word embeddings for each topic generated by LDA is compared to the word embeddings of the target domains to determine similarity. Here, pairwise cosine distance (1 minus cosine similarity, so a perfect match scores 0) is calculated, and the two lowest scores are averaged. If the mean score < 0.45 (an arbitrarily chosen threshold value, a hyperparameter to be tuned), all of the articles matched to this topic by LDA are assigned to the respective target domain. In a later step, these mean similarity scores will allow further filtering of the data.
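The scoring in part 3 can be sketched as follows, with invented 3-dimensional vectors standing in for the 100-dimensional GloVe embeddings (the words and vector values are toy examples):

```python
import math

# Sketch of the similarity step: average the two lowest pairwise cosine
# distances between topic words and core domain terms, then threshold.
embeddings = {
    "earthquake": [0.9, 0.1, 0.0],
    "seismic":    [0.8, 0.2, 0.1],
    "tremor":     [0.85, 0.15, 0.05],
    "banana":     [0.0, 0.9, 0.4],
}

def cosine_distance(u, v):
    """1 - cosine similarity, so a perfect match scores 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def topic_matches_domain(topic_words, core_terms, threshold=0.45):
    """Accept the topic if the mean of the two lowest distances is below threshold."""
    dists = sorted(
        cosine_distance(embeddings[t], embeddings[c])
        for t in topic_words for c in core_terms
    )
    return sum(dists[:2]) / 2 < threshold

print(topic_matches_domain(["tremor", "banana"], ["earthquake", "seismic"]))
# → True
```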
Data: Present experiments were conducted using a Wikimedia data dump, a complete copy of all the English-language wikis as of July 2017. A few preliminary corpus statistics outline the size and content of this dataset:
# articles: 5,452,301
Estimated # tokens (instances of individual words): 566m
Estimated # types (unique word forms; to be verified): 37m
Estimated type/token ratio: 0.0653
As our primary objective was to identify articles related to earthquakes and ecology, additional
statistics are included as an upper limit on how many articles might be related to the two
domains currently under study:
# of mentions of the tokens "earthquake" / "earthquakes": 31,783
# of mentions of the token "ecology": 6,911
An assumption behind this upper limit is that all articles in the target domain mention the word 'earthquake' or 'ecology', as the case may be, at least once. This may or may not be a reasonable assumption (see Next Steps for further discussion).
Avg. time needed per article: 0.1 secs (quite slow)
Estimated time to complete LDA analysis of the entire Wikimedia dump (using a single-processor laptop): 6.3 days (ouch)
# of articles identified as earthquake-related (from a subset of the corpus): 1213
# of articles identified as ecology-related (from a subset of the corpus): 1372
Next steps:
Next step (1). One means of evaluating the goodness of the Bayesian inference model implemented above is to compare it with a frequentist estimator. Reducing each article to only those
tokens from the core vocabulary sets will allow us to determine the distribution of those terms for
each article, providing a means to estimate to what extent this distribution is predictive of its
core topic. The results of the frequentist estimator can then be used to estimate the accuracy of
the LDA model using the standard metrics of precision, recall, and F1 measure.
Frequency counts of all 'core' terms for both domains in the present Wiki dataset:

Earthquake domain:
aftershock 390; aftershocks 586; earthquake 25,606; earthquakes 6,177; epicenter 1,611; epicentre 598; lithosphere 704; mantle 6,915; quake 1,347; quakes 325; seismic 6,391; tectonic 4,080; tectonics 807

Ecology domain:
algae 5,741; autotroph 13; autotrophs 73; biome 611; biomes 324; ecology 6,911; ecosystem 7,240; ecosystems 4,742; environment 62,662; environments 14,518; gene 43,953; genes 22,775; habitat 52,021; habitats 20,304; heterotroph 23; heterotrophs 53; nutrient 4,541; nutrients 6,325; protist 167; species 536,087
We would again like to identify articles that are primarily about a single topic throughout the
article. How can we exclude articles that use keywords related to the domain in only one
section? A simple approach is to divide each article into two and run the keyword filter on both
halves. If the size of the set of keywords in both halves is greater than some threshold, the
article may be considered to be domain-specific.
For the present study, statistics were collected for different thresholds examining different sections of the article. In each case, the threshold represents the number of non-repeating tokens in that part of the article, i.e. three mentions of the token 'species' count only once, and both halves of the article must exceed the threshold. Statistics for three cases are given below:
i.) Search the entire article, split into two halves; threshold of more than two keywords per half
ii.) Search the entire article, split into two halves; threshold of more than one keyword per half
iii.) Search the first 1500 characters of the article, split into two halves; threshold of at least one keyword per half
Running time needed for each search: ~45 mins to process the entire Wikipedia dataset
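The half-split filter can be sketched in a few lines; the core term set and article text below are toy stand-ins:

```python
# Sketch of the half-split filter: an article counts as domain-specific only if
# BOTH halves mention more than `threshold` distinct core terms.
core_terms = {"earthquake", "quake", "seismic", "aftershock", "epicenter"}

def is_domain_specific(text, terms, threshold=1):
    """Split the text in half and require each half to exceed the threshold."""
    words = [w.strip(".,").lower() for w in text.split()]
    half = len(words) // 2
    first, second = set(words[:half]), set(words[half:])
    return len(first & terms) > threshold and len(second & terms) > threshold

article = (
    "The earthquake struck at dawn and the quake toppled buildings. "
    "Seismic sensors recorded every aftershock near the epicenter for weeks."
)
print(is_domain_specific(article, core_terms))
# → True
```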
i.) Search entire article; Threshold: >2 keywords in each half
# of earthquake-related articles identified (entire corpus): 351
# of tokens in the earthquake-related articles: 646,722
Approximate # of sentences (earthquake-related): 32,000
# of ecology-related articles identified (entire corpus): 1813
# of tokens in the ecology-related articles: 5,800,078
Approximate # of sentences (ecology-related): 290,000
ii.) Search entire article; Threshold: >1 keyword in each half
# of earthquake-related articles identified (entire corpus): 945
# of tokens in the earthquake-related articles: 1,502,816
Approximate # of sentences (earthquake-related): 75,000
# of ecology-related articles identified (entire corpus): 8249
# of tokens in the ecology-related articles: 17,056,870
Approximate # of sentences (ecology-related): 850,000
iii.) Search first 1500 characters of article; Threshold: >0 keywords in each half
# of earthquake-related articles identified (entire corpus): 2098
# of tokens in the earthquake-related articles: 1,322,487
Approximate # of sentences (earthquake-related): 66,000
# of ecology-related articles identified (entire corpus): 76,880
# of tokens in the ecology-related articles: 21,235,208
Approximate # of sentences (ecology-related): 1,060,000
The results of (iii.) suggest that many more short articles could be identified by lowering the threshold and looking at only the first section of the article. Merging the results of searches (ii.) and (iii.) leads to datasets composed of both short and longer articles:
Final datasets: Merging results of (ii.) and (iii.)
# of earthquake-related articles identified: 2274
# of tokens in earthquake-related articles: 2,093,962
Approximate # of sentences (earthquake-related): 105,000
# of ecology-related articles identified: 81,031
# of tokens in ecology-related articles: 32,062,719
Approximate # of sentences (ecology-related): 1,600,000
Avg # tokens per earthquake article: 920.83
Avg # tokens per ecology article: 395.68
Next step (2). Split articles into individual sentences. How can these be further classified into
specific sub-topics of the earthquake and ecology domains? Can these sentences be used to
train a machine learning algorithm to harvest additional sentences?
Next step (3). The two approaches implemented here both rely on single-token features, i.e. words like 'earthquake' and 'biome'. Phrases, including 'active fault', 'body-wave magnitude', 'continental shelf', etc., have not been considered. This is a significant limitation of the above approaches, especially if a more fine-grained classification of the articles is desired. A pre-processing step to identify phrases would be feasible, an NLP task known as text chunking. Open-source implementations of text chunking are readily available and could be tuned to identify domain-specific phrases extracted from an ontology or glossary of terms.
Alternatively, a computationally inexpensive means of identifying phrases would be to perform
frequency counts of bigrams and trigrams in each article. This would allow a distribution of likely
domain-specific phrases to be determined, providing a means of tuning the extraction process
to be able to classify articles by sub-domain as well.
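The inexpensive bigram-counting alternative is a few lines with the standard library; the sample sentence is illustrative only:

```python
from collections import Counter

# Sketch: count bigrams so that frequent ones can be proposed as candidate
# domain-specific phrases.
def bigram_counts(tokens):
    """Map each adjacent token pair to its frequency."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the continental shelf meets the continental shelf break".split()
counts = bigram_counts(tokens)
print(counts[("continental", "shelf")])
# → 2
```

The same Counter approach extends directly to trigrams with `zip(tokens, tokens[1:], tokens[2:])`.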
Mapping Text-mined Environment Mentions to ENVO
Authors: Riza Batista-Navarro, Marie Angelique LaPorte, Michael Regan, William Ulate, Claus Weiland
Our project (an overview of which is depicted in the figure below) is focused on the following
two tasks: (1) automatically extracting habitat mentions from text, and (2) assigning the
extracted mentions with identifiers from the Environment Ontology (ENVO). We then
demonstrate how we applied our results to two use cases: (1) proposing and submitting new
terms to ENVO, and (2) curating habitat information in the World Flora Online (WFO) platform.
Figure (1.): Workflow for the text-mining, term recommendation, and data annotation pipeline.
a.) Extract habitat mentions from WFO descriptions using CRFs. b.) Obtain a relevance score between each habitat mention and corresponding terms in ENVO. This determines a threshold for branching into two use cases: c.) Propose new ENVO terms by identifying semantically related ENVO concepts for the habitat mentions. d.) Annotate habitat data in World Flora Online with ENVO terms (WWF Biome subset) that match WFO descriptions above the threshold.
a.) Extracting habitat mentions
We cast the extraction of habitat mentions from text as a named entity recognition (NER) task.
Specifically, we took a machine learning-based approach employing the conditional random fields (CRF) algorithm, which has been shown to obtain state-of-the-art performance on problems where the goal is to assign labels to sequences of items (e.g., tokens of a sentence). Using NERsuite, a C/C++ implementation of CRFs, we trained an NER model on a corpus in which habitat mentions have been manually annotated. This corpus consists of 523 documents (i.e., 268 pages from the Biodiversity Heritage Library (BHL), 155 journal article abstracts and 100 reports from grey literature), all pertaining to the Dipterocarpaceae family of forest trees, which are important for their timber value as well as their contributions to wildlife habitat, climatic balance, and the regulation of water release.
In training the CRF model, various types of features were employed, including character n-grams, unigrams and bigrams of tokens, their lemmatised forms, and part-of-speech and chunk
tags. Furthermore, we made use of matches between tokens in the text and terms in two
relevant dictionaries: the International Union for Conservation of Nature (IUCN) Habitats
Classification Scheme and the World Wide Fund Habitat Types.
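The dictionary-match features can be illustrated with a minimal BIO-style tagger; the `habitat_dict` below is a toy stand-in for the IUCN and WWF dictionaries, and NERsuite's actual feature format differs:

```python
# Sketch of the dictionary-match feature: mark tokens whose span matches a term
# in a habitat dictionary (toy stand-in for the IUCN / WWF dictionaries).
habitat_dict = {"tropical forest", "mangrove", "peat swamp"}

def dictionary_features(tokens, dictionary):
    """Tag each token B/I if it starts/continues a dictionary phrase, else O."""
    tags = ["O"] * len(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            if " ".join(tokens[i:j]).lower() in dictionary:
                tags[i] = "B-DICT"
                for k in range(i + 1, j):
                    tags[k] = "I-DICT"
    return tags

tokens = "Dipterocarps dominate the tropical forest canopy".split()
print(dictionary_features(tokens, habitat_dict))
# → ['O', 'O', 'O', 'B-DICT', 'I-DICT', 'O']
```

In a real CRF setup, these tags would be emitted as one feature column alongside the token, lemma, and part-of-speech columns.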
We applied the trained model to two corpora: (1) a set of 22,316 textual narratives from IUCN,
retrieved using their Red List application programming interfaces (APIs) and (2) a set of 16,529
species descriptions from World Flora Online (WFO). This resulted in the automatic extraction of
6,873 habitat mentions.
b.) Assigning ENVO identifiers
For mapping the habitat mentions extracted in the previous step to ENVO, we compared them with ontology terms using an approach based on the following methods: (1) querying the RDF-formatted version of ENVO using the SPARQL processor ARQ to create a subset of ENVO terms containing habitat-related terms, (2) WordNet-driven lemmatization using the NLTK Lemmatizer, and (3) fuzzy string matching using the fuzzyset Python package, which computes the Levenshtein distance based on a set of n-gram permutations. This provided a relevance score in the range [0,1], which we used as a metric to map terms to corresponding concepts in ENVO.
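The fuzzy-matching step can be approximated with a plain Levenshtein distance normalised into a 0-1 relevance score; the fuzzyset package's n-gram-based scoring differs in detail, so this is only an illustrative stand-in:

```python
# Simplified stand-in for the fuzzy matching step: a plain Levenshtein edit
# distance, normalised so that 1.0 means an exact match.
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def relevance(mention, term):
    """Normalise edit distance into a 0-1 score (1 = identical strings)."""
    return 1.0 - levenshtein(mention, term) / max(len(mention), len(term))

print(relevance("tropical forests", "tropical forest"))
# → 0.9375
```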
c.) Proposing new ENVO terms
Subsequently, we checked relevance score and document frequency to identify a set of terms with sufficient correspondence with ENVO concepts as well as adequate occurrence in the IUCN and WFO corpora (thresholds chosen during the hackathon: score < 0.84 and a minimum document frequency). The result is a list of 727 habitat mentions that can potentially be added to ENVO. To verify the mapping and extraction pipeline, a second dataset, the Coastal and Marine Ecological Classification Standard, was processed. This provided 505 marine habitat terms.
In order to streamline the incorporation of new terms into ENVO, we developed a means of grouping terms based on their semantic relatedness. We implemented a simple pipeline to process the text and identify clusters. Our pipeline has two steps:
1) For each newly identified term, discover a set of related ENVO concepts, and
2) Cluster the set of newly identified terms based on the intersection of ENVO concepts
each term mapped to in step #1.
Notes on each of these two steps are included here.
Notes on Step 1:
The 6,873 extracted habitat mentions are nearly all multi-word noun phrases of various syntactic forms. For example, many phrases take the form [Modifier - Noun Phrase], with the [Noun Phrase] possibly of the form [Noun Noun]. Examples of habitat mentions with their syntactic forms are given here:
Example habitat mentions and their syntactic forms:
- mountain ridge: [Noun Phrase] = [Noun Noun]
- low land marshes: [(Noun Phrase) Noun] = [(Mod Noun) Noun]
- dry limestone flats: [(Noun Phrase) Noun] (if limestone has the property 'dry'), or [Mod (Noun Phrase)] (if the flats are 'dry')
- snowy flats: [Mod Noun]
- tropical humid forests: [Mod Mod Noun] (note: typically a property such as 'tropical' is found nearer the head noun)
An automatically generated semantic representation of these multi-word phrases can be computed using pre-trained word embeddings. For these mappings into semantic space, we use Stanford GloVe embeddings (Pennington, Socher & Manning, 2014), specifically 300-dimensional vectors trained on 840B tokens of text. We experimented with different sizes of the available word-embedding vocabulary, from 200K tokens to 1.5M. To balance the need to minimize out-of-vocabulary (OOV) words against processing speed, limiting the number of embeddings to 1.2M returned sufficiently satisfactory results.
Pre-processing the habitat mention phrases included removing punctuation and stopwords
(words such as ‘the’ and ‘of’ which add little or no semantic content). Word embeddings for the
phrases were then composed by adding and averaging the word embeddings for each of the
single words in the phrase. Approximately 1% of the tokens were found to be OOV and thus not
included in the embedding for the phrase representation. These phrase embeddings were then compared to embeddings for all of the ENVO terms associated with each concept, yielding the set of representations nearest to one another in the multi-dimensional vector space.
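The stopword removal and embedding averaging can be sketched as follows; the 3-dimensional vectors are invented stand-ins for the 300-dimensional GloVe embeddings:

```python
# Sketch of the phrase-embedding step: strip stopwords, skip OOV words, then
# average the remaining word vectors dimension by dimension.
stopwords = {"the", "of", "a"}
vectors = {
    "old":    [0.2, 0.4, 0.0],
    "growth": [0.4, 0.0, 0.2],
    "forest": [0.6, 0.2, 0.4],
}

def phrase_embedding(phrase, vectors, stopwords):
    """Average the word vectors of the non-stopword, in-vocabulary tokens."""
    words = [w for w in phrase.lower().split() if w not in stopwords]
    known = [vectors[w] for w in words if w in vectors]  # skip OOV words
    return [sum(dim) / len(known) for dim in zip(*known)]

print(phrase_embedding("the old growth forest", vectors, stopwords))
```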
On average, 14.14 concepts were mapped to each habitat mention. These mappings, with similarity scores, are included in a document on the group GitHub page.
Notes on Step 2:
To cluster these new habitat mentions in terms of semantic relatedness, we made use of the set of mappings to ENVO concepts generated in Step 1. Specifically, pairwise set intersections of the mappings for each habitat mention were generated. For the present case, a threshold of six (6) shared concepts was taken as evidence of a high degree of semantic similarity. With this threshold, 1,271 groups are formed, with an average of 26.81 members per cluster. Increasing the threshold decreases the number of members per group while increasing the degree of semantic relatedness among the terms.
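Step 2 can be sketched with plain set operations; the mention names, concept IDs, and threshold of 3 below are toy values (the actual work used a threshold of six shared concepts over real ENVO IDs):

```python
# Sketch of the clustering step: pair up habitat mentions whose sets of mapped
# ENVO concepts share at least `threshold` members.
mappings = {
    "old growth forest": {"ENVO:1", "ENVO:2", "ENVO:3", "ENVO:4"},
    "primary forest":    {"ENVO:1", "ENVO:2", "ENVO:3", "ENVO:5"},
    "salt marsh":        {"ENVO:8", "ENVO:9"},
}

def related_pairs(mappings, threshold=3):
    """Return mention pairs whose concept sets intersect in >= threshold items."""
    names = sorted(mappings)
    return [
        (a, b)
        for i, a in enumerate(names) for b in names[i + 1:]
        if len(mappings[a] & mappings[b]) >= threshold
    ]

print(related_pairs(mappings))
# → [("old growth forest", "primary forest")]
```

The resulting pairs are exactly the edges of the graph described below, so the same structure feeds both the clustering and the visualization.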
The set of intersections may be visualized as a graph, with each vertex representing a habitat mention and each edge representing shared membership in an intersection. A limited section of that graph is reproduced below in Figure 2. The complete set of intersections can be found on the group GitHub page (Group_8/Data/habitat_mention_intersections.json).
Figure (2.)
A small section of the graph produced in Step 2 as a means of identifying possible clusters of newly extracted terms.
Each new term is a vertex in the graph, while edges represent membership in an intersection of shared semantic
relations of ENVO concepts. The section of the graph shown here demonstrates relations between new items related
to forests. Items that appear in multiple intersections as determined by the number of edges incident to an individual
vertex (e.g., ‘old growth forest’) may be better candidates for new ENVO terms than those that are isolated (e.g.,
‘natural forest’ and ‘overall forest’).
d.) Curating World Flora Online
The World Flora Online (WFO) is an information discovery portal that brings together authoritative information on all known species of plants, including bryophytes, ferns and their allies, gymnosperms, and flowering plants. The WFO was created in response to GSPC 2020 Target 1 (an online flora of all known plants), and aims primarily to be a resource for conservationists, especially those working towards GSPC Targets and those involved in other activities of the Convention on Biological Diversity (CBD), CITES, Ramsar, etc.
WFO will include information on all plant names, their classification and history of publication,
descriptions, distributions, conservation status and images. Biodiversity data from published
floristic, monographic and other sources will be included within WFO and presented using a
single, consensus classification. This classification of accepted names and their synonyms (the
Taxonomic Backbone) is managed and updated by a network of taxonomic experts to reflect
current global knowledge of plant species and their relationships (Wyse Jackson & Miller 2015).
Given the lack of providers of specific habitat information, we proposed to mine the descriptions available in WFO to automatically determine plant habitats from the available text.
We constructed a mapping between ENVO terms from the WWF Biome subset (Olson et al. 2001) and the habitats currently accepted by the WFO Portal. Unfortunately, the two classifications do not match perfectly, as the terms recognized by the WFO Portal are heterogeneous values. Nevertheless, we recognized that many of the habitat mentions matched to ENVO terms could help an expert determine a habitat manually.
For this use case, we decided to retain only those text-mined habitat mentions that were linked to ENVO with high confidence, i.e., those assigned identifiers with similarity scores of 0.84 and above.
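The confidence filter can be sketched with a fuzzy string matcher. Here `difflib` stands in for whatever matcher the scripts actually used, and the ENVO identifiers and labels are illustrative.

```python
# Sketch: link text-mined habitat mentions to ENVO labels by fuzzy string
# similarity, keeping only matches scoring 0.84 or above. Labels are
# illustrative; difflib is a stand-in for the real matcher.
from difflib import SequenceMatcher

envo_labels = {
    "ENVO:00002030": "aquatic biome",
    "ENVO:01000174": "forest biome",
    "ENVO:00000111": "grassland area",
}

def best_match(mention, labels, cutoff=0.84):
    """Return (id, label, score) of the best match above the cutoff, or None."""
    scored = [
        (eid, label, SequenceMatcher(None, mention.lower(), label.lower()).ratio())
        for eid, label in labels.items()
    ]
    eid, label, score = max(scored, key=lambda t: t[2])
    return (eid, label, score) if score >= cutoff else None

print(best_match("forest biomes", envo_labels))   # high-confidence match
print(best_match("rocky outcrop", envo_labels))   # below the cutoff -> None
```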
We are contributing the following resources as part of the hackathon: (1) the IUCN and WFO corpora containing, respectively, 22,316 and 16,529 textual descriptions in which habitat mentions have been automatically annotated; (2) source code for the various tools developed, including the Java-based framework for training and applying CRF models using NERsuite and the scripts for fuzzy string matching and determining semantic relatedness; and (3) lists containing 727 terrestrial and 505 marine habitat terms for potential addition to ENVO.
All of the above resources will be made available on the GitHub site:
Future Directions
We seek to design a mechanism for systematically proposing new terms for incorporation into
ontologies such as ENVO. This can be facilitated by the use of GitHub's APIs to
programmatically submit term requests as issues.
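A term-request submission could be sketched against GitHub's REST API as below. The repository name, the issue template, and the token handling are all assumptions for illustration; a real submission would follow the target ontology's own new-term-request conventions.

```python
# Sketch: open a new-term-request issue on an ontology tracker via the
# GitHub REST API (POST /repos/{owner}/{repo}/issues). Repository name,
# template, and token handling are illustrative assumptions.
import json
import urllib.request

def build_term_request(label, definition, parent):
    """Assemble a new-term request in the style of an ontology issue tracker."""
    body = (f"**Preferred label:** {label}\n"
            f"**Definition:** {definition}\n"
            f"**Parent class:** {parent}\n")
    return {"title": f"NTR: {label}", "body": body, "labels": ["new term request"]}

def submit(payload, repo="EnvironmentOntology/envo", token=None):
    """POST the issue to GitHub; with no token, just print the payload (dry run)."""
    if token is None:
        print(json.dumps(payload, indent=2))
        return None
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"token {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST")
    return urllib.request.urlopen(req)

# Dry run: show the payload that would be submitted.
submit(build_term_request(
    "old growth forest",
    "A forest that has attained great age without significant disturbance.",
    "forest biome"))
```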
To extend what text mining can automatically generate, we will look into a finer-grained association between the ENVO terms extracted from the text and the corresponding WWF biome (the values currently used in WFO). To achieve this, we plan to extract habitat attributes and their values (e.g., measurements of height and elevation) from corpora and to add these as data properties of habitat terms. Furthermore, we can recognise names of geographic locations in order to enrich ENVO with specific instances of habitats. For these efforts, we shall be collaborating and working closely with the curators of ENVO.
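The planned attribute extraction can be sketched with a simple pattern over floristic descriptions. The regular expression and the example sentence are illustrative, not the planned implementation.

```python
# Sketch: pull measurement phrases (elevation, altitude, height) out of a
# floristic description so they can be attached to habitat terms as data
# properties. Pattern and example text are illustrative.
import re

MEASURE = re.compile(
    r"(?P<attr>elevation|altitude|height)s?\s+(?:of|up to|from)?\s*"
    r"(?P<low>\d+)(?:\s*[-–]\s*(?P<high>\d+))?\s*(?P<unit>m|cm|km)\b",
    re.IGNORECASE)

text = "Occurs in montane forest at elevations of 1200-2400 m on steep slopes."

for m in MEASURE.finditer(text):
    print(m.group("attr").lower(), m.group("low"), m.group("high"), m.group("unit"))
```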
T3. A Toolset for processing Text and building Taxonomies
Authors: John Deck, Brian Stucky
We proposed “A Toolset for processing Text and building Taxonomies (T3)”: a Python package and/or virtual machine image whose purpose is to aid in processing raw text (stored as PDF, HTML, or plain text) to quickly extract keywords, discover relationships, and build taxonomies (represented in the Web Ontology Language, OWL). The word "taxonomy" is used here broadly to mean any classification of related words or phrases denoting a type of subsumption relationship or hierarchy of terms. While a myriad of tools exist for creating text from images, annotating text, cleaning up text, and building taxonomies or ontologies, the processing steps are often disjoint and difficult to use. The goal is to streamline this process in a single accessible package, thereby accelerating the pace of work and lowering barriers to entry.
Code and examples are at
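The taxonomy-building step can be illustrated with a deliberately simple rule: treat a multi-word term as a subclass of its head term and emit OWL in Turtle. This is a sketch of the idea only; the inference rule and the `ex:` IRIs are simplifications, not the actual T3 design.

```python
# Sketch of the T3 idea: infer a simple subsumption hierarchy from noun
# phrases (a multi-word term becomes a subclass of its longest known
# suffix term) and serialize it as OWL in Turtle. Rule and IRIs are
# illustrative simplifications.
terms = ["forest", "old growth forest", "secondary forest", "marsh", "salt marsh"]

def iri(term):
    return "ex:" + term.replace(" ", "_")

def infer_parents(terms):
    """Pair each multi-word term with the longest known suffix term."""
    pairs = []
    for t in terms:
        words = t.split()
        for i in range(1, len(words)):
            suffix = " ".join(words[i:])
            if suffix in terms:
                pairs.append((t, suffix))
                break
    return pairs

lines = ["@prefix ex: <http://example.org/t3#> .",
         "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ."]
for child, parent in infer_parents(terms):
    lines.append(f"{iri(child)} rdfs:subClassOf {iri(parent)} .")
print("\n".join(lines))
```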
Synthesis of the Meeting Results
After one week, all of the work groups had made some degree of progress. Some of the important take-home messages included:
Manual annotation is powerful but costly. One way to remedy this may be investment in pre-annotation tools (such as the IBM Watson tool described earlier).
It may be much more efficient to build ontologies by starting with software that identifies candidate terms and having experts determine how those terms are related, rather than having experts spend time annotating text to train a subsumption-inferencing tool.
Some very diverse approaches can be imagined for tackling the problem. How could their products be merged?
Future-Activity Proposals
Add ecology and cryosphere terms to WordNet
Anne Thessen, Martha Palmer, Christelle Wauthier, Ruth Duerr
We propose adding the ecology terms in ECOCORE to WordNet. This will improve NLP algorithms that use WordNet, which is nearly all of them (if not all of them). Cryospheric terms from ENVO, SWEET, etc. could also be added to WordNet.
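A first step would be a gap analysis: which ECOCORE terms are absent from WordNet? The sketch below stubs the WordNet lemma set; a real run would query NLTK's WordNet interface instead, and all term lists here are illustrative.

```python
# Sketch of the proposed gap analysis: find ECOCORE terms missing from
# WordNet. The WordNet lemma set is a stub; a real run would query NLTK's
# WordNet corpus. Term lists are illustrative.
wordnet_lemmas = {"ecosystem", "predator", "habitat", "food_web"}  # stub

ecocore_terms = ["ecosystem", "detritivore", "food_web", "primary producer"]

def missing_from_wordnet(terms, lemmas):
    """Return terms (underscored, as WordNet stores multiwords) absent from the lexicon."""
    return [t for t in terms if t.replace(" ", "_") not in lemmas]

print(missing_from_wordnet(ecocore_terms, wordnet_lemmas))
```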
Paper: Describe ECOCORE
Pier Luigi Buttigieg, Anne Thessen, Jennifer Verdolin, Marie Laporte
This will work best if someone is using ECOCORE to give a context for discussing it.
Workshop: Using ECOCORE for domain ecologists
Anne Thessen and Jennifer Verdolin
What are some ontologies that have had lots of community participation and uptake?
What value did the community see in that ontology that inspired them to use it?
Can we find a similar motivation for users of ECOCORE?
Can we transform an ecology data repo?
Can we find examples of the ontology being used to improve research outcomes?
A case study
References
David Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993-1022.
Olson, D. M., Dinerstein, E., Wikramanayake, E. D., Burgess, N. D., Powell, G. V. N., Underwood, E. C., D'Amico, J. A., Itoua, I., Strand, H. E., Morrison, J. C., Loucks, C. J., Allnutt, T. F., Ricketts, T. H., Kura, Y., Lamoreux, J. F., Wettengel, W. W., Hedao, P., Kassem, K. R. 2001. Terrestrial ecoregions of the world: a new map of life on Earth. Bioscience 51(11):933-938.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.
Wyse Jackson, P. & J. S. Miller. 2015. Developing a World Flora Online: A 2020 challenge to the world's botanists from the international community. Rodriguésia 66(4): 939-946.