Anne Thessen
Ronin Institute
Ph.D. Biological Oceanography
About
98 Publications
40,855 Reads
3,026 Citations
Introduction
I am a biologist who has become involved in numerous "big data" projects.
Additional affiliations
June 2012 - June 2013
May 2013 - October 2013
April 2008 - July 2012
Education
September 2002 - December 2007
September 1997 - December 2001
Publications
Genome-Wide Association Studies (GWAS) are widely used to infer the genetic basis of traits in organisms, yet selecting appropriate thresholds for analysis remains a significant challenge. In this study, we developed the Sequential SNP Prioritization Algorithm (SSPA) to elucidate the genetic underpinnings of two key phenotypes in Sorghum bicolor: m...
The exposome refers to all of the internal and external life-long exposures that an individual experiences. These exposures, either acute or chronic, are associated with changes in metabolism that will positively or negatively influence the health and well-being of individuals. Nutrients and other dietary compounds modulate similar biochemical proc...
Over the last couple of decades, there has been a rapid growth in the number and scope of agricultural genetics, genomics and breeding databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources (https://www.agbiodata.org/databases) covering model or crop plant and animal GGB data,...
Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium...
Over the last several decades, there has been rapid growth in the number and scope of agricultural genetics, genomics and breeding (GGB) databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources covering model or crop plant and animal GGB data, ontologies, pathways, genetic variat...
Motivation:
Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of knowledge graphs is lacking.
Results:
Here we present KG-Hub, a platform that enables standardized cons...
Introduction
Climate change is already affecting ecosystems around the world and forcing us to adapt to meet societal needs. The speed with which climate change is progressing necessitates a massive scaling up of the number of species with understood genotype-environment-phenotype (G×E×P) dynamics in order to increase ecosystem and agriculture resi...
Existing phenotype ontologies were originally developed to represent phenotypes that manifest as a character state in relation to a wild-type or other reference. However, these do not include the phenotypic trait or attribute categories required for the annotation of genome-wide association studies (GWAS), Quantitative Trait Loci (QTL) mappings or...
Background:
Evaluating the impact of environmental exposures on organism health is a key goal of modern biomedicine and is critically important in an age of greater pollution and chemicals in our environment. Environmental health utilizes many different research methods and generates a variety of data types. However, to date, no comprehensive data...
Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of knowledge graphs is lacking. Here we present KG-Hub, a platform that enables standardized construction, exchange, and...
Within clinical, biomedical, and translational science, an increasing number of projects are adopting graphs for knowledge representation. Graph‐based data models elucidate the interconnectedness among core biomedical concepts, enable data structures to be easily updated, and support intuitive queries, visualizations, and inference algorithms. Howe...
Toxicological evaluation of chemicals using early-life stage zebrafish (Danio rerio) involves the observation and recording of altered phenotypes. Substantial variability has been observed among researchers in phenotypes reported from similar studies, as well as a lack of consistent data annotation, indicating a need for both terminological and dat...
Despite progress in the development of standards for describing and exchanging scientific information, the lack of easy-to-use standards for mapping between different representations of the same or similar objects in different databases poses a major impediment to data integration and interoperability. Mappings often lack the metadata needed to be...
Decades of reductionist approaches in biology have achieved spectacular progress, but the proliferation of subdisciplines, each with its own technical and social practices regarding data, impedes the growth of the multidisciplinary and interdisciplinary approaches now needed to address pressing societal challenges. Data integration is key to a rein...
People are one of the best known and most stable entities in the biodiversity knowledge graph. The wealth of public information associated with people and the ability to identify them uniquely open up the possibility to make more use of these data in biodiversity science. Person data are almost always associated with entities such as specimens, mol...
The rapidly decreasing cost of gene sequencing has resulted in a deluge of genomic data from across the tree of life; however, outside a few model organism databases, genomic data are limited in their scientific impact because they are not accompanied by computable phenomic data. The majority of phenomic data are contained in countless small, heter...
Research collections are an important tool for understanding the Earth, its systems, and human interaction. Despite the importance of collections, many are not maintained or curated as thoroughly as we would like. Part of the reason for this is the lack of professional reward for collection, curation, or maintenance. To address this gap in attribut...
In biology and biomedicine, relating phenotypic outcomes with genetic variation and environmental factors remains a challenge: patient phenotypes may not match known diseases, candidate variants may be in genes that haven't been characterized, research organisms may not recapitulate human or veterinary diseases, environmental factors affecting dise...
Annotation of Texts - Preparation of Resources for NLP in the Earth Sciences
To develop new semantic software tools and resources for the earth science fields of geology, biology, and cryology, specifically earthquakes, ecology, and sea ice, and to achieve this with high efficiency and effectiveness by porting resources, tools, and methods from the biomedical field.
Explanation of the ClearEarth project
Annotation Methods for Creation of Training Data for Natural Language Processing in the Earth Sciences
Logical definitions, in particular those following the Entity-Quality approach, are increasingly used to drive automated classification of phenotypes and integrate phenotypes across species semantically. Over the years, the lack of consistent and widespread use of common standards resulted in conceptually equivalent or similar phenotypes with logic...
Background: When phenotypic characters are described in the literature, they may be constrained or clarified with additional information such as the location or degree of expression; these terms are called “modifiers”. With effort underway to convert narrative character descriptions to computable data, ontologies for such modifiers are needed. Such...
Biodiversity information is made available through numerous databases that each have their own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked...
The institutions of science are in a state of flux. Declining public funding for basic science, the increasingly corporatized administration of universities, increasing “adjunctification” of the professoriate and poor academic career prospects for postdoctoral scientists indicate a significant mismatch between the reality of the market economy and...
The cTAKES package (using the ClearTK Natural Language Processing toolkit; Bethard et al. 2014, http://cleartk.github.io/cleartk/) has been successfully used to automatically read clinical notes in the medical field (Albright et al. 2013, Styler et al. 2014). It is used on a daily basis to automatically process clinical notes and extract relevant inf...
Biodiversity informatics, the application of informatics techniques to biodiversity data, is rooted in physical objects and nomenclatural codes. Through two user stories, one from wildlife conservation and another from agriculture, we demonstrate the importance and process of biodiversity informatics. We discuss the importance and integration of ta...
Natural Language Processing (NLP) is an important field of study dedicated to improving automated reading and understanding of human text by machines through the development of specialized algorithms. These algorithms need a large corpus of annotated text in order to learn the semantics and syntax of human language, which is often specific and nuan...
The size of biodiversity data sets, and the size of people’s questions around them, are outgrowing the capabilities of desktop applications, single computers, and single developers. Numerous articles in the corporate sector (Delgado 2016) have been written on how much time professionals spend manipulating and formatting large data sets compared to...
Report on Results of a Hackathon to Progress with the Training Resources for Natural Language Processing (NLP) in Ecology
Biodegradation is an important process for hydrocarbon weathering that influences its fate and transport, yet little is known about in situ biodegradation rates of specific hydrocarbon compounds in the deep ocean. Using data collected in the Gulf of Mexico below 700 m during and after the Deepwater Horizon oil spill, we calculated first-order degra...
This project is funded by NSF-Award ACI 1443085. ClearEarth aims to bring semantic technologies from the biomedical field into the earth-surface earth, ice and life sciences. The products will be applied to operations such as query and reasoning.
Scientists have unprecedented access to a wide variety of high-quality datasets. These datasets, which are often independently curated, commonly use unstructured spreadsheets to store their data. Standardized annotations are essential to perform synthesis studies across investigators, but are often not used in practice. Therefore, accurately combin...
Cancer and ecology datasets.
(ZIP)
Background
The natural sciences, such as ecology and earth science, study complex interactions between biotic and abiotic systems in order to understand and make predictions. Machine-learning-based methods have an advantage over traditional statistical methods in studying these systems because the former do not impose unrealistic assumptions (such...
The need for a names-based cyber-infrastructure for digital biology is based on the argument that scientific names serve as a standardized metadata system that has been used consistently and near universally for 250 years. As we move towards data-centric biology, name-strings can be called on to discover, index, manage, and analyze accessible digit...
Today's low cost digital data provides unprecedented opportunities for scientific discovery from synthesis studies. For example, the medical field is revolutionizing patient care by creating personalized treatment plans based upon mining electronic medical records, imaging, and genomics data. Standardized annotations are essential to subsequent ana...
Understanding the interplay between environmental conditions and phenotypes is a fundamental goal of biology. Unfortunately, data that include observations on phenotype and environment are highly heterogeneous and thus difficult to find and integrate. One approach that is likely to improve the status quo involves the use of ontologies to standardiz...
Process studies and coupled-model validation efforts in geosciences often require integration of multiple data types across time and space. For example, improved prediction of hydrocarbon fate and transport is an important societal need which fundamentally relies upon synthesis of oceanography and hydrocarbon chemistry. Yet, there are no publicly...
Holistic understanding of estuarine and coastal environments across interacting domains with high-dimensional complexity can profitably be approached through data-centric synthesis studies. Synthesis has been defined as “the inferential process whereby new models are developed from analysis of multiple data sets to explain observed patterns across...
The difficult job market for PhD scientists has forced many from more traditional academic paths to look for opportunities in industry positions. This workshop will include talks from entrepreneurs and others describing their journey from academia to industry and general advice from scientists entering the private work force.
A better understanding of oil droplet formation, degradation, and dispersal in deep waters is needed to enhance prediction of the fate and transport of subsurface oil spills. This research evaluates the influence of initial droplet size and rates of biodegradation on the subsurface transport of oil droplets, specifically those from the Deepwater Ho...
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bot...
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bot...
Background. Mexico has the world’s fifth largest population of amphibians and is the country with the second highest number of threatened amphibian species. About 10% of Mexican amphibians lack enough data to be assigned to a risk category by the IUCN, so in this paper we want to test a statistical tool that, in the absence of specific demographic d...
Background: Mexico is the fourth richest country in amphibians and has the second highest number of threatened amphibian species, and this number could be higher as many species are too poorly known to be accurately assigned to a risk category. The absence of a risk status or an unknown population trend can slow or halt conservation...
The role that ontologies play or can play in designing and employing semantic technologies has been widely acknowledged by the Semantic Web and Linked Data communities. But the level of collaboration between these communities and the Applied Ontology community has been much less than expected. Also, ontologies and ontological techniques appear to be...
+++ UPDATED VERSION PUBLISHED IN APPLIED ONTOLOGY VOL.9, ISSUE 2, 2014+++
This version 1.0.0 (2014.04.29-10:45) of the OntologySummit2014_Communique was adopted by the community at the Ontology Summit 2014 Symposium (Arlington, Virginia, USA). It summarizes the activity of 4+ months of discussions of the Ontology Community (IAOA) and its collabora...
Numerous digitization and ontological initiatives have focused on translating biological knowledge from narrative text to machine-readable formats. In this paper, we describe two workflows for knowledge extraction and semantic annotation of text data objects featured in an online biodiversity aggregator, the Encyclopedia of Life. One workflow tags...
Synthesis science requires significant investment in data discovery, access, and integration, which can be difficult when the data have not been published or deposited. The modeling efforts of the Gulf Integrated Spill Research Consortium (GISR) required an integrated database of oceanographic and hydrocarbon field measurements collected from the G...
Among the key services that institutional data management infrastructures must provide are provenance and lineage tracking and the ability to associate data with contextual information needed for understanding and use. These functionalities are critical for addressing a number of key issues faced by data collectors and users, including trust in dat...
Synergy between science and informatics is required to develop a more robust understanding of the earth as a system of systems. Interaction of these systems is recorded in both geological and biological data, yet the capability to integrate across disciplines is hampered by diverse social and technological approaches to research and communication....
Data sharing has become an important issue in modern biodiversity research to address large scale questions. Despite the steadily growing scientific demand, data are not easily accessed. Why is this the case? This study explores the reasons for the reluctance to share data on the one hand and the motivations for sharing on the other by summarising...
Taxonomists have been tasked with cataloguing and quantifying the Earth's biodiversity. Their progress is measured in code-compliant species descriptions that include text, images, type material and molecular sequences. It is from this material that other researchers are to identify individuals of the same species in future observations. It has bee...
Names of species of Gymnodinium and their synonym groups.
(DOCX)
Names of Gymnodinium no longer associated with the genus [309]–[337]. The current name and/or the reason for rejecting the name is given. A name is listed as not code compliant if it is used without the existence of an original description. A name is listed as erroneous if it is an incorrect combination of genus name and species epithet.
(DOCX)
Names associated with extinct species of Gymnodinium [304]–[308].
(DOCX)
List of species of Gymnodinium following removal of oncers that do not meet the selection criteria used here.
(DOCX)