
Using Wikidata and Metaphactory to Underpin an Integrated Flora of Canada

Authors:
  • Atlantic Canada Conservation Data Centre

Abstract

We are using Wikidata and Metaphactory to build an Integrated Flora of Canada (IFC). IFC will be integrated in two senses: First, it will draw on multiple existing floras (e.g. Flora of North America, Flora of Manitoba, etc.) for content. Second, it will be a portal to related resources such as annotations, specimens, literature, and sequence data.

Background

We had success using Semantic MediaWiki (SMW) as the platform for an online representation of the Flora of North America (FNA). We used CharaParser (Cui 2012) to extract plant structures (e.g. “stem”), characters (e.g. “external texture”), and character values (e.g. “glabrous”) from the semi-structured FNA treatments. We then loaded this data into SMW, which allows us to query for taxa based on their character traits, and enables a broad range of exploratory analysis, both for hypothesis generation and to provide support for or against specific scientific hypotheses.

Migrating to Wikidata/Wikibase

We decided to explore a migration from SMW to Wikibase for three main reasons: simplified workflow; triple level provenance; and sustainability.

Simplified workflow: Our workflow for our FNA-based portal includes Natural Language Processing (NLP) of coarse-grained XML to get the fine-grained XML, transforming this XML for input into SMW, and a custom SMW skin for displaying the data. We consider the coarse-grained XML to be canonical. When it changes (because we find an error, or we improve our NLP), we have to re-run the transformation and re-load the data, which is time-consuming. Ideally, our presentation would be based on API calls to the data itself, eliminating the need to transform and re-load after every change.

Provenance: Wikidata's provenance model supports having multiple, conflicting assertions for the same character trait, which is something that inevitably happens when floristic data is integrated.

Sustainability: Wikidata has strong support from the Wikimedia Foundation, while SMW is increasingly seen as a legacy system.

Wikibase vs. Wikidata

Wikidata, however, is not a suitable home for the Integrated Flora of Canada. It is built upon a relatively small number of community curated properties, while we have ~4500 properties for the Asteraceae family alone. The model we want to pursue is to use Wikidata for a small group of core properties (e.g. accepted name, parent taxon, etc.), and to use our own instance of Wikibase for the much larger number of specialized morphological properties (e.g. adaxial leaf colour, leaf external texture, etc.). Essentially, we will be running our own Wikidata, over which we would exercise full control. Miller (2018) describes deploying this curation model in another domain.

Metaphactory

Metaphactory is a suite of middleware and front-end interfaces for authoring, managing, and querying knowledge graphs, including mechanisms for faceted search and geospatial visualizations. It is also the software (together with Blazegraph) behind the Wikidata Query Service. Metaphactory provides us with a SPARQL endpoint; a templating mechanism that allows each taxonomic treatment to be rendered via a collection of SPARQL queries; reasoning capabilities (via an underlying graph database) that permit the organization of over 42,000 morphological properties; and a variety of search and discovery tools. There are a number of ways in which Wikidata and Metaphactory can work together, and we are still exploring questions such as: Will provenance be managed via named graphs, or via the Wikidata snak model? How will data flow between the two platforms? We will report on our findings to date, and invite collaboration with related Wikimedia-based projects.
Using Wikidata and Metaphactory to Underpin an Integrated Flora of Canada
Jocelyn Pender, Joel Sachs, Beatriz Lujan-Toro, James Macklin, Peter Haase, Robin Malik
Photo by Luke Barnard on Unsplash
An Integrated Flora of Canada
Image credit Levin, G., and Macklin, J.
Online, integrated, query on character traits, etc.
Community space for collaboration and discussion
Treatments
Built using templates
Semantic MediaWiki Wikidata Metaphactory
Parsed treatments
dev.floranorthamerica.org
Search semantically
Problems with Semantic MediaWiki
Provenance
Sustainability
Scalability
We’re exploring other options...
What We Like About Wikidata
Provenance built in
Growing community support, reminiscent of the early days of SMW, when everyone was building extensions
Potential to plug-in to other Wikidata biodiversity projects
The Wikidata Provenance Model
A Wikidata statement has two parts: a claim, and a list of zero or more references that support that claim.
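A minimal Python sketch of this claim-plus-references shape, for illustration only. The class names, the placeholder taxon "Taxon A", and the sources "Flora X" and "Flora Y" are assumptions, not Wikibase's actual data model or API. The point is that two conflicting values can coexist, each carrying its own references.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Claim:
    subject: str  # hypothetical taxon identifier
    prop: str     # property, e.g. a morphological character
    value: str


@dataclass
class Statement:
    claim: Claim
    # Zero or more references supporting the claim.
    references: List[str] = field(default_factory=list)


def statements_about(statements, subject, prop):
    """Return every statement made about one subject/property pair."""
    return [s for s in statements
            if s.claim.subject == subject and s.claim.prop == prop]


# Placeholder data (hypothetical): two floras disagree on the same trait.
statements = [
    Statement(Claim("Taxon A", "leaf external texture", "glabrous"), ["Flora X"]),
    Statement(Claim("Taxon A", "leaf external texture", "pubescent"), ["Flora Y"]),
    Statement(Claim("Taxon A", "parent taxon", "Genus A")),  # no references
]

conflicting = statements_about(statements, "Taxon A", "leaf external texture")
values = sorted(s.claim.value for s in conflicting)
```

Both values are kept side by side, and each can be traced back to the source that asserted it.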
Treatments
Built using templates via handlebars.js
SPARQL queries that pull in data on page load
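As a rough illustration of the kind of query a treatment page might run on load, the Python sketch below fills a taxon IRI into a generic SPARQL pattern. The IRI, the function name, and the query shape are hypothetical; this is neither Metaphactory's templating syntax nor the project's actual query.

```python
from string import Template

# Generic pattern: every property/value pair asserted about one taxon.
TREATMENT_QUERY = Template("""\
SELECT ?property ?value WHERE {
  <$taxon> ?property ?value .
}
ORDER BY ?property""")

# Characters that are not allowed inside a SPARQL IRI reference.
FORBIDDEN = set('<>"{}|\\^` ')


def treatment_query(taxon_iri: str) -> str:
    """Build the query for one taxon, rejecting obviously unsafe IRIs."""
    if not taxon_iri or FORBIDDEN & set(taxon_iri):
        raise ValueError(f"not a safe IRI: {taxon_iri!r}")
    return TREATMENT_QUERY.substitute(taxon=taxon_iri)


# Hypothetical taxon IRI, used only for this example.
query = treatment_query("http://example.org/taxon/A")
print(query)
```

Because the page holds only the query, a change to the underlying data shows up on the next load without any transform-and-reload step.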
Knowledge Graph User Interface
Ability to explore data in the graph visually
Structured Semantic Search
Customizable user interface for graph exploration and faceted filtering
SPARQL Query Interface
Ability to query knowledge base
Option to return results with labels
Our Successes
http://floranorthamerica.org/property_work/color_armature/arm_col_data/visualize.html
Our Successes
https://dakadabra.shinyapps.io/shiny_arules/
Considerations
Cost of Metaphactory:
https://amzn.to/31ABppu
Lack of maturity of Wikidata (e.g., ease of use)
Community collaboration
Images via M. Krötzsch, P. Haase, Planemad in Wikidata or Wikimedia Commons
Collaborate With Us!
Contribute use cases: http://bit.ly/2oRgPnu
Use our data: dev.floranorthamerica.org
End
Provenance
Scrap notes
A Tale of Three Software
1. Semantic MediaWiki
2. Metaphactory
3. Wikidata
Images via M. Krötzsch, P. Haase, Planemad in Wikidata or Wikimedia Commons
Requirements for software
Treatment pages are editable
Provenance is emphasized
Embedded within the global biodiversity knowledge graph
Cost effective and community supported
Semantic MediaWiki
Flora of North America is our basis for the Integrated Flora of Canada
Semantic MediaWiki (SMW) is a MediaWiki extension
Progress to date:
Parsing of Flora of North America volumes published to date (~50% of anticipated corpus)
Completion of a SMW instance
Our Experience
Pros
Cognitive tractability (wiki framework)
Revision history
Import and export (pages, RDF)
Query features
Extension ecosystem: Visual editor, skin extensions
Cons
Limited support for provenance
Practical limitations
Limits on the size of query results
Promising features fail on large datasets
Documentation needs improvement
SMW increasingly seen as a legacy system
Does Semantic MediaWiki Meet Our Requirements?
Treatment pages are editable
Provenance is emphasized
Embedded within the global biodiversity knowledge graph
Sort of
Cost effective and community supported
We’re exploring other options...
Provenance via Named Graphs
There are multiple ways to represent provenance in RDF; reification, quads, and named graphs are the most common.
Named graphs are the best.
Each data source is stored in its own named graph
Easy syntax, intuitive semantics, vastly simplifies reification
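The "one named graph per data source" idea can be sketched in plain Python, with a dictionary standing in for a quad store. The graph names, the taxon, and the trait values are placeholders (assumptions), and a real deployment would use a triple store rather than this toy structure.

```python
from collections import defaultdict

# Named graph (one per data source) -> set of (subject, predicate, object).
graphs = defaultdict(set)


def add_triple(graph, subject, predicate, obj):
    """Assert a triple inside the named graph of its source."""
    graphs[graph].add((subject, predicate, obj))


def values_by_source(subject, predicate):
    """Map each source graph to the values it asserts for subject/predicate."""
    result = {}
    for graph, triples in graphs.items():
        found = sorted(o for s, p, o in triples
                       if s == subject and p == predicate)
        if found:
            result[graph] = found
    return result


# Hypothetical data: two sources disagree about the same character.
add_triple("graph:flora_x", "Taxon A", "leaf external texture", "glabrous")
add_triple("graph:flora_y", "Taxon A", "leaf external texture", "pubescent")
add_triple("graph:flora_y", "Taxon A", "stem colour", "green")

by_source = values_by_source("Taxon A", "leaf external texture")
```

Provenance falls out of where a triple is stored, so no per-triple reification nodes are needed.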
Using named graphs to manage inference
We are experimenting with different property hierarchies:
Abaxial surface coloration subPropertyOf Color property
Corolla coloration subPropertyOf Flower coloration subPropertyOf Color property
Bristle length subPropertyOf Bristle property subPropertyOf Armature property
Each mechanism for generating a hierarchy (e.g. based on part_of relationships, based on syntax analysis, based on synonymy) gets its own named graph
Using named graphs to manage inference (cont'd)
We are experimenting with different approaches to inference based on those hierarchies.
SPARQL property paths
Use CWM to forward chain, and place the resulting triples into a named graph.
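The forward-chaining option can be sketched in plain Python. This is a stand-in for what a rule engine such as CWM would do, not CWM itself: it applies only the subPropertyOf rule, the taxon and value are placeholders, and the graph names are assumptions. The hierarchy is the corolla example from the previous slide.

```python
def transitive_closure(pairs):
    """Return all (child, ancestor) pairs implied by (child, parent) pairs."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new


def forward_chain(asserted, sub_property_of):
    """Derive (s, super_property, o) for each asserted (s, property, o)."""
    closure = transitive_closure(sub_property_of)
    inferred = {(s, parent, o)
                for (s, p, o) in asserted
                for (child, parent) in closure
                if p == child}
    return inferred - set(asserted)


# Hierarchy taken from the slide text.
hierarchy = {
    ("Corolla coloration", "Flower coloration"),
    ("Flower coloration", "Color property"),
}

# Hypothetical asserted data, kept in its own named graph.
graphs = {
    "graph:asserted": {("Taxon A", "Corolla coloration", "yellow")},
}

# Derived triples go into a separate named graph, so they can be
# dropped and regenerated when the hierarchy changes.
graphs["graph:inferred"] = forward_chain(graphs["graph:asserted"], hierarchy)
```

Keeping the inferred triples apart from the asserted ones makes it possible to compare the output of different hierarchy-generation mechanisms side by side.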
Metaphactory
A platform for knowledge graph management
Use Neptune or Blazegraph as a backend
Built by the company Metaphacts (Peter Haase, Robin Malik, et al.)
Our Experience
Pros:
Professional support
SPARQL query interface
Built-in visualizations
Workbench for data mining
Enables users to do their own analyses
Documentation
Cons:
No inferencing support with Neptune
Neptune = backend offering with Metaphactory on AWS Marketplace
Does Metaphactory Meet Our Requirements?
Treatment pages are editable
Provenance is emphasized
Embedded within the global biodiversity knowledge graph
Cost effective and community supported Sort of
Pricing
https://amzn.to/31ABppu
Marketplace software is maturing...
Wikidata
Provides infobox support to Wikipedia & enables sharing of data among languages
Integration within the semantic web (SPARQL endpoint, RDF & APIs)
Wikibase is the software behind Wikidata
Open source software
Wikibase
Wikibase repository and Wikibase client
Store and embed structured data into a client wiki
SPARQL query interface component can be installed
Query federation is possible
Sustainability
Wikidata curators
Community support going forward
Our Experience
Pros
Useful extensions
Wikidata Integrator
SPARQL query interface
Existing user base (Wikidata)
Cons
Less mature
The property creation interface is difficult to use
To be explored...
Does Wikibase Meet Our Requirements?
Treatment pages are editable To be determined
Provenance is emphasized
Embedded within the global biodiversity knowledge graph
Cost effective and community supported
We’re very interested...
Workflow
Still a problem
How to edit and push changes, synchronize data edits...
Article
An essential component in describing, delimiting, and understanding the evolutionary context of a taxon is characterizing the habitats in which the taxon is found. We report on a simple habitat ontology that we have developed, and on our ongoing experience using volunteers to annotate legacy habitat descriptions with terms from the ontology. Our botanical informatics group is building the Canadian Flora Commons, a knowledge platform to aggregate, integrate and facilitate collaboration on information about Canadian plants. Species pages in the Commons are seeded with structured data extracted from authoritative sources such as the Flora of North America (FNA), Flora of British Columbia, etc. In previous TDWG talks (e.g., Sachs et al. 2019), we described our workflow for extracting and structuring morphological data. To understand why habitat descriptions are different and pose a unique set of challenges, consider the following (from Plectocephalus rothrockii in FNA): “Damp soil near streams, roadsides, open pine-oak woodlands and forests”. Here, the single field “habitat” is used to capture environmental conditions, canopy coverage, and taxonomic associations. We also find it often used for geology, climate, etc. Information in the habitat field is often detailed, but it is presented in free text with little editorial guidance, and comparison between treatments within a given flora and among floras is challenging. Environment ontologies that could aid in the standardization of habitat descriptors exist, notably ENVO (ENVironment Ontology; Buttigieg et al. 2016). However, ENVO’s goals have been primarily focused on describing the biomes, environmental features and environmental materials of molecular datasets, resulting in an ontology that thus far does not serve our needs. To our knowledge, no habitat ontology exists that supports species-level use cases (but see the habitat classification scheme developed by the IUCN). 
To address this, we developed a small and simple habitat ontology by examining over 3000 habitat descriptions across multiple families, and asked “what is the author trying to tell us?”. In our taxonomic treatment authoring tool, being developed as part of another project, we will use this ontology to replace or supplement the single “habitat” field with multiple habitat dimensions (“soil type”, “canopy coverage”, etc.), some with controlled vocabularies (e.g. {open, closed, partial} for canopy coverage). We are also “translating” legacy habitat descriptions into instance data for the ontology. This is a time-consuming process and has the potential to be dependent on interpretations made by the translator. The crowdsourcing experiment described below is aimed at addressing the first issue and quantifying the second. With our centre's support, we recruited a team of volunteers (6–8 at any given time), and taught them how to annotate habitat descriptions with WebProtégé (Horridge et al. 2014). We divided volunteers into two groups, with each group working with the same dataset, so that we could compare results. While a purpose-built habitat ontology offers advantages over existing environment ontologies and a consensus was reached on habitat class definitions (e.g., moisture, elevation, canopy coverage), we discovered that it is difficult to achieve consensus on the application of habitat classes. Between the two groups, shared annotations represented 57% of the total annotations added to terms and phrases and unique annotations represented 43%. This aligns with previous efforts to build a controlled vocabulary for FNA treatments, where differences between term categorizations represented 49% of the effort (Endara et al. 2017). Amongst classes in our ontology, unique annotations varied between 11% and 76% (see Fig. 1). 
Our talk will describe our findings, discuss the subjectivity of habitat classes and other difficulties we’ve encountered while building our ontology, and demonstrate the power of a habitat-driven search interface. This interface will live alongside parsed morphological descriptions (see dev.floranorthamerica.org). We invite collaboration towards increasing the robustness and applicability of the ontology.
Article
Biodiversity information organization is looking beyond the traditional document-level metadata approach and has started to look into factual content in textual documents to support more intelligent and semantic-based access. This article reports the development and evaluation of CharaParser, a software application for semantic annotation of morphological descriptions. CharaParser annotates semistructured morphological descriptions in such a detailed manner that all stated morphological characters of an organ are marked up in Extensible Markup Language format. Using an unsupervised machine learning algorithm and a general purpose syntactic parser as its key annotation tools, CharaParser requires minimal additional knowledge engineering work and seems to perform well across different description collections and/or taxon groups. The system has been formally evaluated on over 1,000 sentences randomly selected from Volume 19 of Flora of North America and Part H of Treatise on Invertebrate Paleontology. CharaParser reaches and exceeds 90% in sentence-wise recall and precision, exceeding other similar systems reported in the literature. It also significantly outperforms a heuristic rule-based system we developed earlier. Early evidence that enriching the lexicon of a syntactic parser with domain terms alone may be sufficient to adapt the parser for the biodiversity domain is also observed and may have significant implications.
• Miller M (2018) Wikibase for Research Infrastructure - Part 1. https://medium.com/@thisismattmiller/wikibase-for-research-infrastructure-part-1-d3f640dfad34. Accessed on: 2019-4-04.