Geoinformatics: Toward an integrative view of Earth as a system

doi:10.1130/2013.2500(19)
Geological Society of America Special Papers 2013;500; 591-604
A. Krishna Sinha, Anne E. Thessen and Calvin G. Barnes
A. Krishna Sinha*
Department of Geosciences, Virginia Tech, Blacksburg, Virginia, USA
Anne E. Thessen*
Center for Library and Informatics, Marine Biological Laboratory, Woods Hole, Massachusetts, USA
Calvin G. Barnes*
Department of Geosciences, Texas Tech University, Lubbock, Texas, USA
ABSTRACT
Synergy between science and informatics is required to develop a more robust
understanding of the Earth as a system of systems. Interaction of Earth systems is
recorded in both geological and biological data, yet the capability to integrate across
disciplines is hampered by diverse social and technological approaches to research
and communication. Ontology-based informatics provides the ability to share, access,
and discover data across disciplines. This ability will lead to data integration and new
models that enable evaluation of past, present, and future changes associated with
Earth systems. Significant challenges that must be met in order to promote such an
understanding encompass social and technical considerations, such as professional
credit for data sharing, development of data registration services for ready access to
heterogeneous and distributed data, and development of new approaches for
evaluating trust and security in a web environment. Integration of data from different
scientific disciplines will require development and management of new Earth system
ontologies. If done properly, this development will not only enable but engage the next
generation workforce.
*Sinha—pitlab@vt.edu; Thessen—annethessen@gmail.com; Barnes—cal.barnes@ttu.edu.
Sinha, A.K., Thessen, A.E., and Barnes, C.G., 2013, Geoinformatics: Toward an integrative view of Earth as a system, in Bickford, M.E., ed., The Web of Geologi-
cal Sciences: Advances, Impacts, and Interactions: Geological Society of America Special Paper 500, p. 591–604, doi:10.1130/2013.2500(19). For permission to
copy, contact editing@geosociety.org. © 2013 The Geological Society of America. All rights reserved.
INTRODUCTION
Data gathered by scientists over the centuries have led to a
deeper understanding of the physical, chemical, and biological
processes that shaped the Earth as we know it today. These
data, often collected by individuals, are kept in notebooks or in
personal computers, and collectively provide the largest, most
heterogeneous and most distributed database known to man.
These data are distributed around the world and are recorded in
different languages, often using descriptive terms that are not
globally accepted or widely known. Modern remotely sensed data,
such as those measured by satellites or underwater sensors, have
added another dimension to the global data inventory by provid-
ing very large volumes of homogeneous data whose management
requires dedicated data centers. This bimodal data environment
(few sources with large amounts of data and many sources with
small amounts of data) poses a daunting challenge to informatics
specialists because of its scale, distribution, and heterogeneity.
Nevertheless, access to and discovery of these resources are required
to enable a better integrative view of the Earth as a system.
It is now well established that geosphere, hydrosphere, atmo-
sphere, biosphere, and anthrosphere compose the Earth system
(Fig. 1). Interactions of these “spheres” over time have produced
the modern-day physical, chemical, and biological environment.
We emphasize that four of the systems (geosphere, hydrosphere,
atmosphere, biosphere) can be considered to be naturally occur-
ring, because their development and interactions span billions of
years of Earth history that predate human activity and the
anthrosphere. However, the interdependence of man-made “events” and
the constructed environment with those that occur naturally
suggests that all five systems are currently interactive.
Within each system, multiple sub-systems interact over time to influence the larger system as a whole. For example, within the geosphere system, the diversity of compositions of igneous rocks is related both to tectonic settings within the “plate tectonic sub-system” and to the composition of source regions within an “Earth realm sub-system,” e.g., the crust or mantle that produced these rocks. Other examples of interactive systems include those between microorganisms within the biosphere system and oxygenation of the atmosphere, which in turn led to a dramatic increase in the number of mineral species (Hazen and Ferry, 2010). Interactions between systems have societal significance in many ways, such as climate change, formation of ore deposits, and sea level fluctuations. Such system-level interactions were identified by Bretherton (chair of the Earth System Sciences Committee, NASA Advisory Council, 1988) as being responsible for global change, with emphasis on the contribution of human activities (anthrosphere system) to such a change. This report led Congress to codify the Global Change Research Act of 1990 to “assist the Nation and the world to understand, assess, predict, and respond to human-induced and natural processes of global change.” The Bretherton diagram (contained in the NASA report) shows system-wide, modern-day interactions between recognized environments and demonstrates that, in order to discover new knowledge associated with these interactions, scientists will need to develop informatics-oriented technologies that enable a more robust understanding of the Earth as a system. However, because global-scale Earth system science requires a deep understanding of the physical, chemical, and biological interactions that determine the current and future state of Earth, data from smaller-scale parts of the Earth system, such as origins of ore deposits, breakup of supercontinents, and extinction of species, are required to understand the interrelationships between the four naturally occurring but separate systems (Fig. 1).
Figure 1. Physical environments of the Earth have been traditionally represented by three systems: hydrosphere, geosphere, and atmosphere. Biosphere as a system emerged in Earth history when life originated. Recent anthropogenic changes in all four of these environments are represented as a new anthrosphere system. The interactions between these systems over time have shaped the Earth as we know it today. Data associated with each system are gathered by individuals and automated sensor technologies, and generate two data environments that require innovative informatics solutions for discovery and integration.
Numerous agencies and government panels, such as the National Science Foundation’s EarthCube initiative (http://earthcube.ning.com/), the European Commission report Riding the Wave (http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf), and NASA’s Earth System Science program (http://eospso.gsfc.nasa.gov/ess20/agenda.php), emphasize the critical need to develop an informatics-based infrastructure to advance research capabilities for integrative science and to communicate with the public. The ten points identified in Riding the Wave (p. 22 and 23) form a very useful reference for the organization of this paper, and a slightly reorganized version is given in Table 1.
Although not all challenges listed in Table 1 are addressed
in depth in this paper, we show that some of the fundamental
concerns about data heterogeneity, complexity, volatility, vol-
umes, and resources are common to all those seeking a data-to-
knowledge infrastructure, as are tools and services that render and
represent data and data products. An infrastructure that enables
discovery and integration must address major technological and
cultural challenges associated with access, sharing, and discov-
ery of data and tools. This combination is necessary to identify
the content of the database, so that meaningful integration across
various types of data can be undertaken.
We add one more concern to these challenges: communication and collaboration between geoscientists and cyberinfrastructure developers. We see this as the most difficult challenge, because it requires intense interaction and cooperation between two communities with few shared experiences. At a recent meeting organized by the National Science Foundation (http://earthcube.ning.com/), geoscientists and cyberinfrastructure developers were asked if there was sufficient communication between them. Their response rating was the lowest of all responses. We suggest that cyberinfrastructure developers are trained to respond primarily to business challenges (Hepp, 2008), such as inventory control, process execution, and use of controlled vocabulary for enterprise-level computing. They find it difficult to share their expertise in the more unstructured environment typical of science communities. We suggest that through scientific use-cases, all levels of real world complexity can be shared with infrastructure developers to help them recognize and focus on the challenges inherent in transforming data to knowledge.
In this paper we highlight the cyberinfrastructure needs of
individual scientists as well as those who work with large struc-
tured data generated by sensors and available through dedicated
data centers. The challenge to integrate beyond data silos through
Geoinformatics-based techniques was voiced by Jacobs (2012)
when he noted that “Although outputs from these systems—e.g.
UNIDATA (meteorology), IRIS (seismology), and OOI (ocean-
ography)—are of great value to the communities they serve, the
outcome with respect to understanding and predicting the Earth
as a single complex system remains to be fully realized.”
Similarly, the need to solve large-scale problems such as cli-
mate change and food production will require biological cyber-
infrastructure (Stein, 2008; National Academy of Science, 2009;
Hey et al., 2009; Thessen and Patterson, 2011). There are many
challenges to achieving the level of data sharing and manage-
ment necessary to bring about this transformation, but progress
has been made in development of incentives (citable data sets;
figshare.com), standards (MIBBI), and vocabularies (Hymenoptera
Anatomy Ontology and Systems Biology Markup Language).
Care must now be taken to avoid a “biology data silo” in
which data from other disciplines, such as earth science, cannot
be integrated with biological data.
As an example, we recognize that organisms are known to directly affect their physical surroundings (Wright et al., 2002) and vice versa (Hart and Finelli, 1999). Therefore, fields of study that specialize in the intersection between biosphere and geosphere, such as oceanography, pedology, and paleontology, enable an integrative, temporal view of the physical and chemical environment of life. Similar intersections include impacts of climate change on species migration and habitats, co-occurrence of species and geological phenomena, and interactions between biological and geological processes, such as nuclear waste remediation through engineered bacterial organisms. All of these examples require shared data resources.
Data management in the biological sciences can be divided
into two spheres: biomedical and environmental. The biomedical
branch is far more advanced in terms of standards, vocabular-
ies, and informatics tools owing to the monetization by indus-
tries, such as pharmaceuticals, and the prevalence of molecular
techniques that generate large data sets in need of advanced ana-
lytical tools. The well-used term bioinformatics is often used to
refer solely to informatics applied to molecular biology. With the
advent of metagenomic sequencing, environmental biology is
now generating data sets that require similar informatics tools.
In addition, large-scale questions, such as the effect of climate
change on species, are pushing environmental biology further
into the realm of big data. Numerous biological databases and ontologies exist (Lambrix et al., 2007; also table 3 in Thessen and Patterson, 2011), and life science is one of the most widely represented disciplines on the semantic web (Bizer et al., 2011).
TABLE 1. SCIENTIFIC E-INFRASTRUCTURE: SOME CHALLENGES TO OVERCOME
Data publication and access: How can data producers be rewarded for publishing data? How can we know who has deposited what data and who is re-using them, or who has the right to access data which are restricted in some way? How do we deal with the various “filters” that different disciplines use when choosing and describing data? What about differences in these attitudes within disciplines, or from one time to another?
Collection: How can we make sure that data are collected together with the information necessary to reuse them?
Diversity: How do we overcome the problems of diversity (heterogeneity of data, but also of backgrounds and data-sharing cultures in the scientific community)? How do we deal with the diversity of data repositories and access rules, within or between disciplines, and within or across national borders?
Interoperability: How can we implement interoperability within disciplines and move to an overarching multidisciplinary way of understanding and using data? How can we find unfamiliar but relevant data resources beyond simple keyword searches, but involving a deeper probing into the data? How can automated tools find the information needed to tackle unfamiliar data?
Trust: How can we make informed judgments about whether certain data are authentic and can be trusted? How can we judge which repositories we can trust? How can appropriate access and use of resources be granted or controlled?
Security: How can we guarantee data integrity? How can we avoid data poisoning by individuals or groups intending to bias them in their interest? How can we react in the case of security breaches to limit their impact?
New social paradigms: How can we learn from the wisdom of crowds about what and whom to trust, while avoiding being misled by concerted campaigns of deceit?
Education and training: How can the citizen make these benefits available for sensible investigations, and how can they be safeguarded from fakes? How can scientific e-infrastructure foster and increase popular interest and trust in science? How can we foster the training of more data scientists and data librarians, as important professions in their own right?
Usability: How can we move to a situation where non-specialists can overcome the high barriers to their being able to start sensible work on unfamiliar data, perhaps using intelligent automated tools for an initial investigation?
Preservation and sustainability: How can we be sure that the important information we collect will be usable and understandable in the future; in particular how can we fund our information resources in the long term? How can we share the costs and efforts required for sustainability? How can we decide what to preserve?
Commercial exploitation: How can the infrastructure benefit from commercial developments in data management? How can the revenue-generating expertise of the commercial world be brought into play for the long-term sustainability of these resources?
Note: Adapted from European Commission report Riding the Wave (http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf).
Geoinformatics and bioinformatics (including biodiversity
informatics) have common goals that support an integrative view
of the Earth as a system, but they are currently being developed as
separate endeavors. The historical development of geoinformat-
ics is given in Sinha et al. (2010), and a similar summary for bio-
informatics activities is given in Thessen and Patterson (2011).
Because separate development of these two informatics initiatives
is unlikely to achieve these common goals, we suggest that a new
umbrella initiative called Earth System Informatics
be supported by agencies and governments to coordinate
development of a new infrastructure that includes both geo- and
bioinformatics. Such an approach would encourage international
partnerships, reduce duplication in developing technologies, pro-
mote Earth system science, and support a new generation work-
force that can solve societal challenges.
This paper guides the reader through the major issues that need resolution to achieve useful levels of data sharing. The first section, “Data Publication, Access, Collection, and Diversity,” discusses the practical difficulties of data sharing, including lack of incentives and data heterogeneity, and proposes a solution using promising new semantic web technology. The second section, “Interoperability: From Vocabulary to Data Level Ontologies,” proposes a path from existing vocabularies to creation of the data-oriented ontologies needed to make semantic technologies functional. The third section, “Trust and Security: A New Social Paradigm,” discusses the practical realities of trust on the internet as they apply to scientific data and proposes a solution using existing social networking tools. The fourth section, “Preservation and Sustainability,” outlines current preservation strategies for scientific data and discusses the role of libraries in the data preservation infrastructure. The final section, “Education and Training,” describes the potential impact a fully integrated data system could have on science education and the need to educate current and future researchers about good data practices that reinforce data sharing.
DATA PUBLICATION, ACCESS, COLLECTION,
AND DIVERSITY
The National Science Board (2005) identified the critical need to manage a spectrum of data collections available electronically in all forms and formats (e.g., text, numbers, and images) toward supporting the creation of a new generation of researchers and educators. The report recommended a clear policy-oriented pathway for management of heterogeneous resources through activities such as a financial strategy for support of data collections, community proxy functions, sunsetting of data collections, and data management plans for research support. These clearly identified recommendations support the goal of geoscientists to engage in innovative solutions to emerging science-based societal challenges, which are strongly influenced by on-demand access to data and computational facilities. Over the last several decades, significant progress has been made in the use of technologies that enable access and discovery of large databases hosted at well-recognized data centers, such as those of the U.S. Geological Survey (e.g., EROS Data Center, http://eros.usgs.gov/), NASA (e.g., Atmospheric Science Data Center, ASDC, at NASA Langley Research Center, http://eosweb.larc.nasa.gov/), NOAA (e.g., National Geophysical Data Center, NGDC, http://www.ngdc.noaa.gov/), UCAR (e.g., Data for Atmospheric Research, http://rda.ucar.edu/), and others. Although availability of these sensor-based data is crucial for the research goals of geoscientists, a vast amount of data generated by individual scientists is not accessible; even its existence may not be known. This is primarily the
result of data residing on the personal computers of individual
scientists or collaborative groups. This difficulty of discovery is a
characteristic of the “long tail” of science (Fig. 2). The term long
tail refers to a published plot of 2007 National Science Foun-
dation awards organized by size (Heidorn, 2008). The term is
used to describe the large number of small data sets and to contrast them with the small number of large data sets. Based on an assessment of National Science Foundation grants awarded in 2007 (80% of funding was for grants of $1M or less; these grants constituted 98% of awards), it was estimated that the large majority of data were held in such an environment (Heidorn, 2008). The assessment also recognized that large projects that generated high volumes of homogeneous data (usually gathered by remote sensing instruments) were well planned, curated with approved International Organization for Standardization (ISO) metadata, and thus highly visible and readily discovered by scientists worldwide, e.g., the National Science Foundation–supported Ocean Observatories Initiative (www.oceanobservatories.org/). In contrast, data generated by individuals were considered to be “dark,” i.e., poorly curated, and thus less visible to other scientists. These data are also more heterogeneous and are not appropriately indexed with standard metadata for discovery, becoming nearly invisible to most scientists and other potential users. As a result, data generated by individuals are more likely to remain under-utilized and are eventually lost. Such an environment, when coupled with holdings of heterogeneous data within libraries requiring indexing and support services, makes access to dark data even more challenging (Palmer et al., 2007; Gold, 2007).
Figure 2. Recognized data environments provide an incentive to develop technologies that are able to promote sharing and access to multiple classes of data.
Nevertheless, emerging scientific challenges cannot be adequately
addressed without access to the types of historical
observations, or legacy data, that are typically found in the
“long tail.” Therefore, making these data available on demand
must be one of the highest priorities for any enterprise seeking to
develop a cyberinfrastructure capable of promoting new ways to
examine the Earth system through time.
The need to merge these two fundamentally different data environments (sensor data and “long tail” data) exemplifies the need for development of information management methods that can access and discover data of varying types. Typically, sensor data are used to measure, evaluate, and model present-day Earth processes, but are unable to capture past processes and events. Development and modeling of more complex, temporal views of the Earth require a spectrum of data types (a large number of instruments, techniques, and computational services), resulting in both syntactic and semantic heterogeneity associated with both data and services. We also emphasize that access to data generated by thousands of individuals in many countries requires the community of scientists to change its culture so that data and other resources are shared openly and freely (Foster et al., 2012). Both challenges are readily addressed through the use of a pathway defined by the goal of the Geoinformatics Division: Data to Knowledge (Fig. 3).
Data sharing practices that lead to discovery by scientists are rooted in the research culture; therefore, changes based solely on technology are unlikely to be successful. However, data discovery is possible if individual scientists are able and willing to share data. Although scientists are aware of this need, they do not participate readily because of difficulties associated with understanding the linkages between the data lifecycle and the research lifecycle, i.e., without sharing of data, research cannot be completed (http://www.jisc.ac.uk/whatwedo/campaigns/res3/jischelp.aspx).
Figure 3. Data to Knowledge pathway infrastructure is enabled by informatics-based technologies but also requires changes in the research culture of scientists. Shared resources are required for the success of this transformative pathway, and multiple data types representing measurements and observations about the Earth are necessary prior to discovery and integration. Data repositories (data pools) must provide access on demand to enable users to organize data products for research, education, or policy decisions.
When asked why data are not shared, scientists cite concerns about future publications, control and misuse of data (Tenopir et al., 2011), and the technical challenges related to publishing their data. However, it is possible to change such a culture through an informatics protocol that enables data sharing (Killeen, 2012; http://www.nsf.gov/pubs/2012/nsf12058/nsf12058.jsp). In this approach, an objective assessment of credit is made via data citations, where the number of times the data are accessed (access counts = reuse of data) yields a quantitative data impact factor (Altman and King, 2007; Crosas, 2011; Mons et al., 2011) similar to the impact factors commonly used to rank or evaluate journal publications (http://thomsonreuters.com/content/press_room/science/686112; http://thomsonreuters.com/products_services/science/free/essays/impact_factor/). Data sharing can also be encouraged through availability of value-added services, such as analytical tools like BLAST in GenBank (http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&PAGETYPE=BLASTHome) and community annotation like FilteredPush (http://etaxonomy.org/mw/FilteredPush). In fact, such services and community activities could be transformative in promoting data sharing and would support re-analysis of data to create or verify models and results. This level of informatics-enabled community interaction would facilitate different interpretations of existing data and minimize the need to collect new data with a significant cost/benefit ratio (Tenopir et al., 2011).
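The idea of crediting data through access counts can be illustrated with a small sketch. It is not the method of any of the cited authors; it simply assumes, by analogy with the two-year journal impact factor, that a data impact factor is the number of accesses in a given year to data sets published in the two preceding years, divided by the number of those data sets. The record fields, the function name, and all numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """A published data set with yearly access counts (hypothetical record)."""
    identifier: str
    year_published: int
    accesses_by_year: dict = field(default_factory=dict)


def data_impact_factor(data_sets, year):
    """Accesses in `year` to data sets published in the two preceding years,
    divided by the number of those data sets (an assumed analogy to the
    two-year journal impact factor, not an established standard)."""
    window = [d for d in data_sets if year - 2 <= d.year_published <= year - 1]
    if not window:
        return 0.0
    accesses = sum(d.accesses_by_year.get(year, 0) for d in window)
    return accesses / len(window)


# Illustrative, made-up numbers.
collection = [
    DataSet("zircon-ages", 2010, {2011: 4, 2012: 10}),
    DataSet("plankton-counts", 2011, {2012: 6}),
    DataSet("old-field-notes", 2005, {2012: 50}),  # outside the two-year window
]
print(data_impact_factor(collection, 2012))  # → 8.0
```

A fixed two-year window is only one possible choice; for long-tail legacy data, whose value may emerge decades after collection, a longer or unbounded window might be more appropriate.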
During the past decade, semantic technologies, coupled with the Semantic Web (Berners-Lee et al., 2001) and associated standards developed and ratified by the World Wide Web Consortium (http://www.w3.org/2001/sw/), emerged as a prime approach for improving data interoperability, integration, and reuse (Doan and Halevy, 2005; Fox et al., 2008). Controlled vocabularies, or their richly structured form, i.e., ontologies, that use a formal language can capture and represent agreements or shared interpretation within a community. We adopt the definition of ontology as a formal set of knowledge terms, including vocabulary, semantic interconnections, and rules of inference and logic for some particular topic (Gruber, 1993; Noy, 2004). Ontologies can then be used to associate meaning with data through vocabulary-based annotations. Vocabularies and ontologies, along with metadata (Berkley et al., 2009), enable semantic approaches to searching, browsing, integration, and analysis. In contrast, current syntactic, keyword-based information retrieval relies on individuals to define and use search terms that may not access relevant long-tail data. Recently, semantic approaches and a variety of semantic (and Semantic Web) technologies have seen broad applicability (Cardoso et al., 2008; Baker and Cheung, 2007; Noy, 2004; Sheth, 2011) and are exemplified in Google’s semantic search (Duncan, 2012).
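The contrast between keyword-based and vocabulary-based retrieval can be sketched in a few lines. This is a toy illustration under stated assumptions, not an implementation of any ontology cited here: the term hierarchy, the data set names, and the function names are all made up, and a real system would use a formal ontology language rather than a Python dictionary.

```python
# Toy controlled vocabulary: each term maps to its narrower terms (illustrative).
NARROWER = {
    "igneous rock": ["granite", "basalt"],
    "granite": ["leucogranite"],
}

# Data sets annotated with vocabulary terms (hypothetical long-tail records).
ANNOTATIONS = {
    "notebook-17": {"leucogranite"},
    "cruise-log-3": {"basalt"},
    "survey-map-9": {"sandstone"},
}


def expand(term):
    """Return the term plus all narrower terms, following the hierarchy."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(NARROWER.get(current, []))
    return found


def keyword_search(term):
    """Match only data sets annotated with exactly this term."""
    return sorted(d for d, tags in ANNOTATIONS.items() if term in tags)


def semantic_search(term):
    """Match data sets annotated with the term or any narrower term."""
    terms = expand(term)
    return sorted(d for d, tags in ANNOTATIONS.items() if tags & terms)


print(keyword_search("igneous rock"))   # → []
print(semantic_search("igneous rock"))  # → ['cruise-log-3', 'notebook-17']
```

The keyword search misses both relevant records because neither was annotated with the literal query string, whereas the vocabulary-aware search finds them through the broader/narrower relations.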
One can envision a first necessary and transformative capability
for data access and content discovery by providing individual
scientists the ability to publish and share their data through
a metadata registration (Berkley et al., 2009) service that uses
community-accepted, controlled vocabularies. Examples of
these vocabularies include those recommended by the American
Geosciences Institute (AGI) and NASA’s Global Change Mas-
ter Directory (GCMD). Although GCMD and the associated
SWEET (Semantic Web for Earth and Environmental Termi-
nology) ontology (http://sweet.jpl.nasa.gov/ontology/) are well
suited to index NASA’s mission-oriented data, they do not yet
adequately cover the spectrum of data types generated by the
broader geoscience community. Therefore, it can be argued that
a geoscience infrastructure requires the use of ontologies to sup-
port best practices for publishing, sharing, and discovering data
via the Semantic Web (Gahegan et al., 2009).
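A metadata registration service of the kind envisioned above might, at its simplest, check submitted keywords against a controlled vocabulary before accepting a record. The following is a minimal sketch under that assumption; the vocabulary terms, record fields, and function names are hypothetical and do not reproduce the AGI or GCMD vocabularies or any existing service.

```python
# Hypothetical controlled vocabulary; a real service would load terms from a
# community-accepted source rather than hard-coding them.
VOCABULARY = {"geochemistry", "geochronology", "seismology"}

REGISTRY = []  # in-memory stand-in for a registration service


def register(title, owner, keywords):
    """Register a metadata record; reject keywords outside the vocabulary."""
    normalized = {k.strip().lower() for k in keywords}
    unknown = normalized - VOCABULARY
    if unknown:
        raise ValueError(f"terms not in vocabulary: {sorted(unknown)}")
    record = {"title": title, "owner": owner, "keywords": sorted(normalized)}
    REGISTRY.append(record)
    return record


def discover(keyword):
    """Return titles of registered records annotated with the keyword."""
    return [r["title"] for r in REGISTRY if keyword in r["keywords"]]


register("U-Pb zircon ages", "A. Scientist", ["Geochronology"])
print(discover("geochronology"))  # → ['U-Pb zircon ages']
```

Rejecting unknown terms outright is a deliberately strict design choice for the sketch; a service could instead suggest the closest vocabulary term or queue the new term for community review.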
The Linked Open Data (LOD) initiative (http://linkeddata
.org/) has emerged from Semantic Web technologies and pro-
vides a new paradigm for publishing, querying, and reasoning
from structured and unstructured data on the Web. Currently
LOD covers a broad range of domains such as Life Sciences,
Geography, Government, Media, and Education. Furthermore,
the Open Geospatial Consortium (OGC), especially the GeoSemantics
Domain Working Group (DWG) and GeoSPARQL Standards
Working Group (SWG), is already working toward using
LOD (http://www.opengeospatial.org/blog/1673/) and provides a
platform for continued growth of the LOD initiative.
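The idea of publishing and querying linked data can be sketched in a few lines. The example below is only an illustration under stated assumptions: the `ex:` identifiers, the `TRIPLES` list, and the `match` helper are hypothetical, and a production system would use RDF with a query language such as SPARQL rather than this pure-Python stand-in. It shows how data expressed as subject, predicate, object statements can be queried by pattern.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers below are made up for illustration.
TRIPLES = [
    ("ex:sample42", "rdf:type", "ex:RockSample"),
    ("ex:sample42", "ex:collectedBy", "ex:researcher7"),
    ("ex:sample42", "ex:hasLithology", "ex:Basalt"),
    ("ex:sample43", "rdf:type", "ex:RockSample"),
    ("ex:sample43", "ex:hasLithology", "ex:Granite"),
]


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Which samples are basalt?
basalt_samples = [t[0] for t in match(p="ex:hasLithology", o="ex:Basalt")]
```

Because every statement has the same three-part shape, statements published by different individuals can be pooled and queried together without agreeing on a common table layout first.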
In contrast to web-based publication of data by individu-
als, large homogeneous data resources stored in data centers
have well established policies for data curation and sharing
(Fig. 4). Multiple data centers are well suited to data manage-
ment through existing data grid technologies (Foster et al., 2002;
Moore, 2006). Such capabilities can integrate current and future
technologies through infrastructure independence (Moore,
2008) and ensure access to large-scale computational facilities
(e.g., http://www.eu-egee.org/). However, as data centers proliferate
owing to large volumes of data generated by new sensor technologies,
the data centers will face challenges associated with
discovery of content, as many different vocabularies will need to
be adopted to enable the application of new metadata standards.
It is noteworthy that an increase in the number of data centers
is usually accompanied by significant costs, which recently led
the U.S. Government to develop a digital government strategy
to consolidate data centers (see CIO Council; http://www.cio.gov/).
A balanced informatics approach that manages an individual
scientist’s data in a Linked Open Data environment, coupled
with access to data centers, is necessary for integration
of the long tail with sensor data.
INTEROPERABILITY: FROM VOCABULARY TO
DATA LEVEL ONTOLOGIES
In the new e-Science paradigm, geoscientists have moved
toward using the Web as a medium to exchange and discover
vast amounts of data (Reitsma et al., 2009). The current prac-
tice is dominated by establishing methods to access data, with
little emphasis on capturing meaning of data that would facili-
tate interoperability and integration. Some common current
methods for integration include schema integration as well
as the use of mediated schemas that provide a uniform query
interface for multiple resources (Halevy et al., 2006). The use
of peer data management (Aberer, 2003; Langille and Eisen,
2010) can allow participating data sources to retrieve data
directly from each other, and is likely to extend data integration
to the Internet scale. However, such query capabilities
require syntactic and semantic mapping across resources to be
effective. In our view, ontologies are a prerequisite for effec-
tive semantic integration.
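A minimal sketch of a mediated schema may help make this concrete. It assumes two hypothetical geochemistry sources that store the same measurement under different column names and units; the source names, columns, and the `MAPPINGS` table are illustrative assumptions and do not describe any system cited above. The column-name mapping stands for the syntactic mapping, and the unit conversion factor stands for a small part of the semantic mapping.

```python
# Hypothetical sources: same kind of data, different column names and units.
SOURCE_A = [{"sample": "A-1", "SiO2_wt_pct": 50.72}]
SOURCE_B = [{"id": "B-7", "silica_ppm": 612000.0}]

# Mediated schema: one shared view (sample id, SiO2 in wt%).
# Each mapping names the local columns and a factor that converts to wt%.
MAPPINGS = {
    "source_a": {"rows": SOURCE_A, "id": "sample", "sio2": ("SiO2_wt_pct", 1.0)},
    "source_b": {"rows": SOURCE_B, "id": "id", "sio2": ("silica_ppm", 1e-4)},
}


def query_sio2(min_wt_pct):
    """Return (source, sample_id, sio2_wt_pct) for samples at or above a threshold."""
    results = []
    for name, mapping in MAPPINGS.items():
        column, factor = mapping["sio2"]
        for row in mapping["rows"]:
            value = row[column] * factor  # convert to the shared unit
            if value >= min_wt_pct:
                results.append((name, row[mapping["id"]], round(value, 2)))
    return results
```

A caller asks one question in the shared vocabulary, for example `query_sio2(55.0)`, and never needs to know how each source names or scales its columns. Writing and maintaining such mappings by hand for every pair of resources is the cost that ontologies are intended to reduce.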
Multiple classes of ontologic frameworks have been suggested
for discovery and integration of data: Object (e.g., materials),
process (e.g., chemical reactions), and service (e.g., simulation
models or geochemical filters) (Sinha et al., 2006a; Malik et
al., 2007a). Objects represent our understanding of the state of the
system when the data were acquired, whereas processes capture
the physical and chemical forcings on objects that may lead to
changes in state and condition over time. Service provides tools
(e.g., simulation models and analysis algorithms) to assess mul-
tiple hypotheses, including inference or prediction. Object ontol-
ogy characterizes the semantics of the data. It maps the metadata
in the databases to specific concepts essential for data search
and integration. The service ontology maps instances of services
to conceptual tasks, thereby permitting semantic searches and
automatic linkages to types of data. The process ontology cap-
tures the broad domain knowledge, including information such
as understanding of the data set, relationships among the differ-
ent variables, normal ranges of the variables, and known causal
relationships (e.g., Reitsma and Albrecht, 2005; Sinha et al.,
2006a; Barnes, 2006). These three classes of ontologies are thus
required to enable automated discovery, analysis, utilization, and
understanding of data through both induction and deduction. We
suggest that development of object ontologies is the first prerequisite
for semantic interoperability (Sinha et al., 2006b; Rezgui
et al., 2007, 2008).
Figure 4. Schematic representation of contrasting mechanisms for sharing and discovery of data that exist in two different environments.
Object ontologies can be represented at four levels of
abstraction: upper level (Semy et al., 2004), mid-level (Raskin
and Pan, 2005), foundation or data level, and discipline level
(e.g., Earth science; Fig. 5), as recently summarized by Obrst
(2010). Upper-level ontologies, e.g., SUO (Phytila, 2002; Niles
and Pease, 2001) and the Descriptive Ontology for Linguistic
and Cognitive Engineering (DOLCE) (Masolo et al., 2002),
are domain independent and provide universal concepts appli-
cable to multiple domains, whereas mid-level ontologies, e.g.,
SWEET (Semantic Web for Earth and Environmental Terminol-
ogy, sweet.jpl.nasa.gov/ontology/), constitute a concept space
that organizes knowledge of Earth system science across its
multiple, overlapping subdisciplines. Foundation-level ontolo-
gies capture relationships between conceptual organizations of
data types, including their measurements, whereas domain-level
ontologies are primarily vocabulary term specific, and can be
used for efficient, reliable, and accurate discovery of databases
(Sinha et al., 2006a, 2006b). In particular, the SWEET ontology
contains formal definitions for terms used in Earth and Space
sciences and encodes a structure that recognizes the spatial dis-
tribution of Earth environments (Earth realm) and the interfaces
between different realms (Raskin and Pan, 2005; Raskin, 2006).
Thus SWEET provides an extensible mid-level terminology that
can be readily utilized by both foundation-level and domain-specific
ontologies (Malik et al., 2010).
Figure 5. Conceptual organization of object ontologies (UML diagram) at various levels of granularity is necessary for transformation of data to
knowledge. Both SUO (http://suo.ieee.org/SUO/Ontology-refs.html) and SWEET (http://sweet.jpl.nasa.gov) ontologies can be used to provide
connectivity to existing and future ontologies related to all science disciplines. Such high level UML diagrams show that Materials have proper-
ties, age, structure, and location, whereas Services include all analytical tools including human observations used for gathering data associated
with any object. Although domain-specific ontologies are primarily based on vocabularies, the adoption of foundation ontologies would enable
semantic integration across disciplines.
Foundation ontologies are applicable to all sciences, and
can be viewed as a representation of formal declarative specifications
of all objects, phenomena, and their interrelationships.
We emphasize that the concept of Matter (labeled as Material in
Fig. 5), including all thermodynamic states of matter, is the most
fundamental of all ontologies. Clearly, without matter there
can be no semantic concept of location, time, and structure,
or physical properties of matter and instruments that measure
these properties. These foundation ontologies may then be used
to capture discipline-specific terms such as those for minerals,
rocks, geologic time scale, geologic structures, and phenom-
ena. This approach also readily accepts geoscience terms being
developed through GeoSciML (http://www.geosciml.org/), a
markup language designed to promote syntactic integration of
heterogeneous resources (Boisvert et al., 2003; Simons et al.,
2006; Malik et al., 2010).
Discovery and access to databases and other resources is
key to application of informatics technologies that enable users
to find data and services. This discovery requires that data and
services utilize metadata, including the terms that describe the
content of the data. These terms are derived from community-
supported vocabularies, such as those advocated by the American
Geosciences Institute thesaurus or the NASA supported Global
Change Master Directory. A quick survey of these dictionaries
(referred to as high-level ontologies, i.e., taxonomies) shows that
many tens of thousands of terms are available for annotating data
and services. To let a geoscientist use such term-based discovery,
data can be annotated at multiple levels of concepts.
For example, a data provider can
use terms such as igneous rocks > geochemistry > isotopes to
annotate their data, which can then be discovered by others who
may use any of the three terms to search for the data. With the
motivation to enable discovery, the use of vocabulary to tag data
has become the focus of the informatics community. In order to
map multiple ontologies to each other, the technology commu-
nity has undertaken the development of sophisticated software
engines that interlink established vocabularies to data (Obrst
and Cassidy, 2011), but the challenge of matching terms from
multiple vocabularies remains formidable. It is easy to recognize
that when one attempts to search for data from another discipline
without familiarity with the terms used by that community, data
discovery is unlikely to be a simple task.
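The multilevel annotation described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory `REGISTRY` and made-up data set identifiers; only the term path igneous rocks > geochemistry > isotopes comes from the text. Each data set carries a path of terms from broad to narrow, and a search on any term in the path finds it.

```python
# Hypothetical registry: each data set is annotated with a path of terms,
# from broad to narrow, drawn from a controlled vocabulary.
REGISTRY = {
    "dataset_001": ["igneous rocks", "geochemistry", "isotopes"],
    "dataset_002": ["igneous rocks", "petrology"],
    "dataset_003": ["sedimentary rocks", "geochemistry"],
}


def discover(term):
    """Return the sorted ids of data sets annotated with the term at any level."""
    wanted = term.strip().lower()
    return sorted(ds for ds, path in REGISTRY.items() if wanted in path)
```

Here `discover("isotopes")`, `discover("geochemistry")`, and `discover("igneous rocks")` all return `dataset_001`. A searcher who uses a synonym from another community's vocabulary gets an empty result, which is the term-matching problem noted above.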
The challenges of discovery and access in a single discipline
(vocabulary, data, etc.) pale when compared to the same chal-
lenges in cross-disciplinary or multidisciplinary research. As one
example, we cite research on the relationships between bedrock
type and floral diversity and productivity. Input for such studies
will include remotely sensed spectroscopic data, LIDAR data,
climate data, and local and regional geologic maps. However,
basic inputs will also be derived from detailed, land-based sur-
veys of species diversity and productivity, geochemical data for
bedrock and soil compositions and mineral assemblages, stud-
ies of local nutrient input and cycling, etc. The challenge then
becomes: Is it possible to develop vocabularies (ontologies) that
permit researchers to gather and link these diverse data sets,
with diverse vocabularies, in such a way as to extract knowledge
about physical, chemical, and biological interactions? Through
such linkages, data from truly disparate research groups can be
combined and analyzed to make societally important decisions
about, for example, the effects of climate change and land use on
terrestrial plant productivity at a variety of scales.
In contrast to the vocabulary-only-based approach, e.g.,
marine metadata interoperability (Rueda et al., 2009), some
researchers have supported the use of terms that are directly
related to the data themselves, enabling “smart searches” (Lin
and Ludäscher, 2003; Fox et al., 2008; Sinha et al., 2010; Malik
et al., 2010). For example, geochemical data for a rock contain
abundances of elements, and if a data provider registers each
column in a database to the concept of that element (contained
in a formal element ontology with globally accepted terms and
definitions), then both syntactic and semantic heterogeneity are
resolved, and a query can return the data of interest (Fig. 5; from
Sinha, 2011).
1. Keyword-based registration: Discovery of data resources
(e.g., gravity, geologic maps, etc.) requires registration
through the use of high-level index terms. For instance,
the popular AGI index terms (American Geosciences Institute
GeoRef Thesaurus; http://www.agiweb.org/news/spot_nov8_thesaurus.html)
can be used. If necessary, other index terms, such as those
provided by AGU (American Geophysical Union;
http://www.agu.org/pubs/authors/manuscript_tools/journals/index_terms/)
and NASA’s Global Change Master Directory (GCMD;
http://gcmd.nasa.gov/), can be used and eventually be
cross-indexed to each other.
2. Class-level registration: Discovering the semantic content
of databases, for example, heterogeneous data sets
that include images or Excel files, requires registration at
a class-level ontology, such as rock geochemistry, gravity
database, etc.
3. Item-detail–level registration: Item-detail–level registra-
tion consists of associating a column in a database with a
specific concept or attribute of an ontology that is based
on the foundation ontology of objects described earlier. This
approach allows a data resource to be queried using con-
cepts instead of actual values. This mode of registration
is most suitable for data sets built on top of relational
databases. However, item-detail level registration can be
extended to cover Excel spreadsheets and maps in ESRI
Shapefile format by internally mapping such data sets to
PostgreSQL tables. Ontological data registration at item
detail level uses the concepts of Subject, Object, Value,
and Unit. Figure 6 shows the relationship between these
concepts and how it is possible to map columns of data
sets to these concepts. In an example utilizing geochemi-
cal data, Rock represents the Subject that contains the
element compound SiO2 as one of its Objects. The Object
SiO2 has a Value of 50.72 and is measured in wt% unit.
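Item-detail–level registration can be sketched as a small mapping from table columns to ontology concepts. The example below is an illustration only: the `ColumnRegistration` class, the column names, the sample identifiers, and the MgO values are hypothetical, while the SiO2 value of 50.72 wt% follows the example in the text. Each row is treated as one Subject (a rock sample), and each registered column supplies an Object, its Value, and its Unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRegistration:
    column: str  # column name as it appears in the provider's table
    obj: str     # ontology concept (Object) that the column measures
    unit: str    # unit of the values in that column


# Hypothetical table of rock analyses; each row is one Subject (a rock sample).
ROWS = [
    {"sample_id": "R-1", "si_ox": 50.72, "mg_ox": 7.10},
    {"sample_id": "R-2", "si_ox": 71.30, "mg_ox": 0.45},
]

# Registration is done once by the data provider.
REGISTRATIONS = [
    ColumnRegistration(column="si_ox", obj="SiO2", unit="wt%"),
    ColumnRegistration(column="mg_ox", obj="MgO", unit="wt%"),
]


def query_by_concept(concept):
    """Return (Subject, Object, Value, Unit) tuples for one ontology concept."""
    results = []
    for reg in REGISTRATIONS:
        if reg.obj == concept:
            for row in ROWS:
                results.append((row["sample_id"], reg.obj, row[reg.column], reg.unit))
    return results
```

A call such as `query_by_concept("SiO2")` uses the concept rather than the provider's column name `si_ox`, so the same query would work against any other table whose columns have been registered to the same concept.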
TRUST AND SECURITY: A NEW SOCIAL PARADIGM
In a web-dominated world, where data can be discovered on
demand, significant concerns exist with regard to reliability of the
discovered resource as well as the trustworthiness of the resource
provider. This is an enormously complex topic, as both social and
technological challenges have to be addressed (Golbeck et al.,
2003). In general, two different mechanisms could be utilized to
initiate a community dialogue on meeting the challenges. One
would be the use of the web address of the data provider, such as
one from a government agency (.gov), university (.edu), or organization
(.org); the user of the data can then consider the type of
source to be a filter for reliability and trust associated with the data.
Additional filters that use authentication mechanisms and digital
signatures would add to users’ confidence in the quality of the
data. In contrast, techniques that have been developed using social
media as a template may have a significant role in trusting content
of the source through an assessment based on individual feedback
(Gil and Ratnakar, 2002), or use of group assertions for determin-
ing membership within a group (Levien and Aiken, 1998). More
recently, Golbeck et al. (2003) suggested use of ontology-based
trust implementation through the creation of a web of acquain-
tances (based on email addresses), where users can indicate the
level of trust for people they know. This is called a “Web of Trust.”
Based on a numerical scale of 1 = distrust absolutely to 9 = trust
absolutely, users can generate a schema that attaches a numerical
score to an individual researcher. In contrast to the social aspects of
the Web of Trust approach (Golbeck et al., 2003), another avenue
of research emphasizes the need to model the trustworthiness of
the content based on the source of the content (Gil and Artz, 2006).
This approach provides information on the “trust of individual
users based on an actual context of use of the source as well as their
expertise on the topic as they go through the analysis.” We support
active research in this field, using the geoscience community as
a resource to develop mechanisms to enhance security, reliability,
and trustworthiness of data and data providers.
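Both mechanisms discussed above can be illustrated with a short sketch. The code is hypothetical throughout: the names in `RATINGS` and the functions `source_type` and `inferred_trust` are assumptions made for illustration, and the weighted average used to combine ratings is one simple rule chosen here, not a claim about the exact algorithm of Golbeck et al. (2003). Only the 1 to 9 scale and the use of the web address as a filter come from the text.

```python
from urllib.parse import urlparse


def source_type(url):
    """Classify a data provider by the top-level domain of its web address."""
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1]
    kinds = {"gov": "government", "edu": "university", "org": "organization"}
    return kinds.get(suffix, "other")


# Direct ratings on the scale 1 (distrust absolutely) to 9 (trust absolutely).
# Keys are raters; values map an acquaintance to the rating given (made-up data).
RATINGS = {
    "alice": {"bob": 8, "carol": 4},
    "bob": {"dave": 9},
    "carol": {"dave": 3},
}


def inferred_trust(source, target):
    """Estimate how much `source` might trust `target`.

    A direct rating is returned if one exists. Otherwise the ratings that the
    source's acquaintances give the target are averaged, each weighted by the
    source's own rating of that acquaintance (one hop only, by assumption).
    Returns None when no path of length one or two exists.
    """
    direct = RATINGS.get(source, {})
    if target in direct:
        return float(direct[target])
    weighted, total = 0.0, 0.0
    for neighbor, weight in direct.items():
        rating = RATINGS.get(neighbor, {}).get(target)
        if rating is not None:
            weighted += weight * rating
            total += weight
    return weighted / total if total else None
```

In this toy network, alice has not rated dave, so her estimate leans toward the opinion of bob, whom she trusts more than carol. A real implementation would also need to handle longer paths, cycles, and the context in which a rating was given.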
PRESERVATION AND SUSTAINABILITY
Preservation of new and legacy data and data products
is necessary for future reuse and analysis (National Science
Board, 2005). It promotes significant efficiency in research and
education, as well as benefits in cost reduction as new data need
not be gathered for the same purpose. However, preservation of
data requires large storage capacity and expertise in archiving,
preservation, and distribution that are likely to increase in cost
over time. As we have stated earlier in this paper, two fundamentally
different types of data require management:
those generated by the long tail of science and those generated
by sensor technologies. The latter reside in well designed and
curated data centers, and are likely to be maintained on site,
with the only significant challenge lying in evolving technologies
that may make the earlier data unusable or inaccessible. In
contrast, the long tail of science has no such equivalent pres-
ervation strategy. The LOD technology enables sharing and
discovering of data, while the data providers are able to retain
access to their personal computers. But who will manage the
data after the original data providers have retired? One pos-
sible solution is for libraries at academic institutions to develop
repositories that can be curated by local data librarians. This
approach would require data providers to be compliant with
institutional repository requirements, and librarians to have
data management expertise. Both will require significant work
to achieve, since neither group has a history of working with
institutional data repositories. We suggest that such a system
can remain as a distributed network with resources being allo-
cated through internal mechanisms. However, we envision that
with rising costs associated with preservation of legacy data,
decisions will have to be made on which data should be kept
for the long term, as was explicitly recognized by the European
Commission report Riding the Wave. Moreover, there should be
an expectation of long-term financial commitments from participating
libraries and their administrations.
Figure 6. Schematic representation of ontology-enhanced registration can include use of concepts of subject, object, value, and unit. Subject can be considered as any feature such as rocks, location, etc., while objects are entities (from an ontology) such as SiO2 or gravity, which have associated values and units.
EDUCATION AND TRAINING
Providing a strong science education for all students is key to
ensuring an informed citizenry. We endorse the idea that relevant
science learning experiences designed to model methodologies
used by scientists provide the best context for understanding
the nature of science. In classrooms where students participate
in demonstrations and investigations, there is an increased level
of student engagement. Also, investigative strategies can encour-
age collaboration among students. For optimal learning, students
need opportunities to create mental models that connect their
learning experiences to science concepts (National Research
Council, 2000). Learning cycles that allow students to explore a
concept in depth will support students’ ability to make sense of
their observations.
We argue that informatics-based science education can ele-
vate science achievement by allowing students to relate observa-
tions to real-time visualization of their own data, of data gener-
ated by students in similar grades, and eventually of data from
the global scientific community. The key concept behind the use
of informatics-based teaching and learning lies in providing stu-
dents the technological infrastructure that engages their curiosity
and addresses high-profile earth-system–level challenges such
as the growth of continents through time, climate change in the
geologic record, and evolution of life. Cyberinfrastructure can
enable innovative ways to discover databases, including maps,
through the use of ontologies and query-based systems that facil-
itate semantic integration of Earth science data. For example, in
providing students the opportunity to more readily understand
how continents have grown through Earth’s history, it is impor-
tant that students recognize and understand the geologic histories
of the continents, their relationships to plate tectonics (Condie,
1997), and the dynamics of the Earth as a system. We suggest
that scientific data and observations made by students, when intellectually
linked to larger scientific challenges, will motivate students to
be engaged in inquiry-based investigations of Earth as a system.
If the long tail is to be fully integrated, then a crucial part of
moving into the future will be training new scientists in best prac-
tices of data management and integration. An effective approach
might include (1) providing data management and best practices
courses for science students, and (2) providing workshops and
seminars for practicing scientists to improve their data practices.
Educating students at the college level can take the form of an
entire course dedicated to data management and sharing best
practices, but a more effective strategy might be to integrate good
data practices into existing lectures and labs. This change would
necessitate educational workshops for science professors to learn
how to best integrate good data practices into their courses, but
it would be more effective in convincing students that good data
practices are an essential part of research practice, not an after-
thought. Changing habits of practicing researchers is different
from educating students, so workshops designed to promote
good data practices among professionals will require emphasis
on incentives and value-added services. Educating professors
about the importance of data sharing and good data practices
will not only change their own research practice but will make
them more likely to applaud the good data practices of their col-
leagues and condemn bad practices, thereby further reinforcing
good data habits. Unfortunately, most researchers only respond
to requirements such as the NSF Data Management Plan, which
has caused an upsurge in scientists needing to educate themselves
about good data practices.
We also encourage the library community, agencies, and
institutions to participate in monitoring and implementing best
practices as they relate to preservation, storage, archiving, and
curation. As with education of the professoriate, this activity will
require significant effort, because most librarians do not have
data management expertise, and many of the strategies for pres-
ervation of traditional library materials (books and specimens)
are not applicable to data. Currently, only a few library science
and information science graduate programs offer meaningful
data curation curricula.
SUMMARY
The impact of informatics on science has been enormous,
but it can be even more transformative as people and data come
together in support of a more secure future for Earth and its
environments. In this paper we have identified challenges and
presented solutions that can enable communities of scientists
to support education and research toward a better assessment of
Earth as a system. We envision a series of research and training
activities (Fig. 7) that provide a pathway toward meeting such
a goal. We emphasize that the culture of sharing data and tools
is the most fundamental problem facing the science community,
as only partial mechanisms (such as mandatory requirements
imposed by funding agencies) are in place as incentives. Volun-
teering to share data is likely to be more effective than mandating
such a task, particularly when scientists and other data providers
recognize the benefits of data on demand. Although research has
shown that motivation to share data can be encouraged through
a data citation index, its adoption has not occurred because uni-
versities and agencies still utilize the journal citation index as a
metric for performance. We also support the idea of individuals
determining the content of the web by publishing applications
(similar to the apps of the telecommunication industry) that are
readily found through annotations based on ontologies. With the
availability of both data and applications, the discovery of data
and service resources (data pools) could be enhanced through the
use of ontologies at many levels of granularity. The creation of
simple metadata standards that work with high-level terms in all
science disciplines would enable users to browse an index library
to find terms for data discovery, even in unfamiliar disciplines.
Current methods are so time consuming that it is not efficient for
a non-specialist to utilize metadata as a browsing agent, making
utilization of informatics impossible. This limitation is particu-
larly important when non-specialists attempt to utilize informat-
ics to address policy considerations.
The discovery of databases is only a step in the data-to-
knowledge pathway, because newly discovered data with newly
invented formats, acronyms, and units will make integration and
further analysis impossible. Therefore, we suggest that data-level
ontologies are the most efficient technique to resolve syntactic
and semantic heterogeneity, leading to integration (data fusion).
Semantically enabled search and integration engines could read-
ily access data and services registered to ontologies for further
analysis and modeling. The challenges of moving beyond the
“what, where, and when” of data to an understanding of “how
and why” will require the availability of process ontologies. This
enabling step would lead all scientists to work with the fundamental
scientific method of multiple hypotheses (Chamberlin,
1890) as we seek to understand Earth as a system of systems.
ACKNOWLEDGMENTS
We are very appreciative of advice and support from colleagues
in geological and biological sciences, as well as those in infor-
mation management. We are especially pleased to acknowl-
edge the Geoinformatics Division of the Geological Society of
America for promoting the use and application of informatics in
both research and education. The senior author acknowledges
support of National Science Foundation informatics oriented
awards EAR 1238438 and EAR 022558.
REFERENCES CITED
Aberer, K., ed., 2003, Special Issue on Peer to Peer Data Management: SIG-
MOD Record, v. 32, p. 18.
Altman, M., and King, G., 2007, A proposed standard for the scholarly citation
of quantitative data: D-Lib Magazine, v. 13, p. 1082, doi:10.1045/march2007-altman.
Baker, C.J.O., and Cheung, K.-H., 2007, Semantic Web: Revolutionizing
Knowledge Discovery in the Life Sciences: New York, Springer Science,
446 p.
Barnes, C., 2006, From object to process ontology, in U.S. Geological Survey
Scientific Investigations Report 2006-5201, p. 40–41.
Berkley, C., Bowers, S., Jones, M., Madins, J., and Schildauer, M., 2009, Improv-
ing data discovery in metadata repositories through semantic search, in
International Conference on Complex, Intelligent and Software Intensive
Systems Publication, p. 1152–1159, doi:10.1109/CISIS.2009.122.
Berners-Lee, T., Hendler, J., and Lassila, O., 2001, The semantic web: Scientific
American, v. 284, p. 34–43, doi:10.1038/scientificamerican0501-34.
Figure 7. Schematic diagram showing stages of development required in achieving an integrative view of the Earth as a
system of systems. The many steps involved in transitioning from acquiring data to knowledge discovery are enabled by
informatics-based solutions along the path of transforming data to knowledge. Many of the steps are self explanatory;
others are explained in the text.
Bizer, C., Jentzsch, A., and Cyganiak, R., 2011, State of the LOD Cloud ver.
0.3, http://www4.wiwiss.fu-berlin.de/lodcloud/state/ (last accessed 21
September 2012).
Boisvert, E., Johnson, B.R., Schweitzer, P.N., and Anctil, M., 2003, XML
Encoding of the North American Data Model: U.S. Geological Survey
Open-File Report 03-471, http://pubs.usgs.gov/of/2003/of03-471/boisvert/index.html
(last accessed 18 September 2012).
Cardoso, J., Hepp, M., and Lytras, M., 2008, The Semantic Web: Real-World
Applications from Industry, v. 6 of Semantic Web and Beyond: New York,
Springer, 308 p.
Chamberlin, T.C., 1890, The method of multiple working hypotheses: Science
(reprinted in Science in 1965), v. 148, p. 754–759.
Condie, K.C., 1997, Plate Tectonics and Crustal Evolution: Oxford,
Butterworth-Heinemann, 282 p.
Crosas, M., 2011, The Dataverse Network: An Open-Source Application for Sharing,
Discovering and Preserving Data: D-Lib Magazine, v. 17,
http://www.dlib.org/dlib/january11/crosas/01crosas.html (last accessed 18 September 2012).
Doan, A., and Halevy, A.Y., 2005, Semantic integration research in the database
community: A brief survey: AI Magazine, v. 26, p. 83–94.
Duncan, G., 2012, Inside Knowledge Graph: Google’s deep-diving semantic search,
http://www.digitaltrends.com/mobile/inside-knowledge-graph-googles-deep-diving-semantic-search/
(last accessed 18 September 2012).
Earth System Science Committee, 1988, Earth System Science: A Program for
Global Change: Washington, D.C., NASA, 207 p.
Foster, I., Katz, D., Malik, T., and Fox, P., 2012, Wagging the long tail of earth
science: Why we need an earth science data web, and how to build it,
http://semanticommunity.info/@api/deki/files/13867/=079_Foster.pdf
(last accessed 19 September 2012).
Foster, I., Kesselman, C., Nick, J.M., and Tuecke, S., 2002, Grid services for
distributed system integration: Computer, v. 35, p. 37–46, doi:10.1109/MC.2002.1009167.
Fox, P., McGuinness, D., Raskin, R., and Sinha, A.K., 2008, Integrating interdisciplinary
science data with semantic mediation,
http://esto.nasa.gov/conferences/estc2008/papers/fox_a2p3.pdf (last accessed 19 September 2012).
Gahegan, M., Luo, J., Weaver, S., Pike, W., and Banchuen, T., 2009, Connecting
GEON: Making sense of the myriad resources, researchers and concepts
that comprise a geoscience cyberinfrastructure: Computers & Geosci-
ences, v. 35, p. 836–854, doi:10.1016/j.cageo.2008.09.006.
Gil, Y., and Artz, D., 2006, Towards Content Trust of Web Resources, in
WWW’06, Proceedings of the International Conference on World Wide
Web, 15th: New York, Association for Computing Machinery, p. 565–574.
Gil, Y., and Ratnakar, V., 2002, Trusting information sources one citizen at a
time, in Proceedings of the First International Semantic Web Conference
(ISWC), Sardinia, Italy, June 9–12, p. 162–176.
Golbeck, J.A., Parsia, B., and Hendler, J., 2003, Trust networks on the seman-
tic web, in Klusch, M., Omicini, A., Ossowski, S., and Laamanen, H.,
eds., Proceedings of Cooperative Intelligence: Heidelberg, Springer, CIA
2003, LNCS, v. 2782, p. 238–249.
Gold, A., 2007, Cyberinfrastructure, data, and libraries, pt. 1, in A Cyberinfra-
structure Primer for Librarians: D-Lib Magazine, v. 13.
Gruber, T.R., 1993, A translation approach to portable ontologies: Knowledge
Acquisition, v. 5, p. 199–220, doi:10.1006/knac.1993.1008.
Halevy, A., Rajaraman, A., and Ordille, J., 2006, Data integration: The teenage
years, in Dayal, U., Whang, K., Lomet, D., Alonso, G., Lohman, G., Ker-
sten, M., Cha, S.K., and Kim, Y., eds., Proceedings of the International
Conference on Very Large Databases, 32nd: Seoul, Korea, p. 9–16.
Hart, D.D., and Finelli, C.M., 1999, Physical-biological coupling in streams:
The pervasive effects of flow on benthic organisms: Annual Review
of Ecology and Systematics, v. 30, p. 363–395,
doi:10.1146/annurev.ecolsys.30.1.363.
Hazen, R.M., and Ferry, J.M., 2010, Mineral evolution: Mineralogy in the
fourth dimension: Elements, v. 6, p. 9–12, doi:10.2113/gselements.6.1.9.
Heidorn, P.B., 2008, Shedding light on the dark data in the long tail of science:
Library Trends, v. 57, p. 280–299, doi:10.1353/lib.0.0036.
Hepp, M., 2008, Ontologies: State of the art, business potential, and grand chal-
lenges, in Hepp, M., and Sure, Y., Ontology Management: Semantic Web,
Semantic Web Services, and Business Applications: New York, Springer
Science, 291 p.
Hey, T., Tansley, S., and Tolle, K., 2009, The Fourth Paradigm, Microsoft
Research: Redmond, Washington, 251 p.
Jacobs, C., 2012, EarthCube: Developing a Framework to Create and Manage
Knowledge in the Geosciences, Earth Observation, Technology,
http://www.earthzine.org/2012/02/01/earthcube-developing-a-framework-to-create-and-manage-knowledge-in-the-geosciences/
(last accessed 18 September 2012).
Killeen, T., 2012, Data Citation in the Geosciences,
http://www.nsf.gov/pubs/2012/nsf12058/nsf12058.jsp (last accessed 18 September 2012).
Lambrix, P., Tan, H., Jakoniene, V., and Stromback, L., 2007, Biological
ontologies, in Baker, C.J.O., and Cheung, K.-H., eds., Semantic Web:
Revolutionizing Knowledge Discovery in the Life Sciences: New York, Springer
Science, p. 85–100.
Langille, M.G.I., and Eisen, J.A., 2010, BioTorrents: A file sharing
service for scientific data: PLoS ONE, v. 5, e10071,
doi:10.1371/journal.pone.0010071.
Levien, R., and Aiken, A., 1998, Attack resistant trust metrics for public key
certification, in Proceedings of the USENIX Security Symposium, 7th:
San Antonio, Texas, January 26–29: The Advanced Computing Systems
Association, p. 229–242.
Lin, K., and Ludäscher, B., 2003, A system for semantic integration of
geologic maps via ontologies, in Ashish, N., and Goble, C., eds., Semantic
Web Technologies for Searching and Retrieving Scientific Data (SCIS):
Aachen, Germany, ISWC 2003 Workshop, v. 83.
Malik, Z., Rezgui, A., and Sinha, A.K., 2007, Ontologic integration of
geoscience data on the semantic web: U.S. Geological Survey Scientific
Investigations Report 2007-5199, p. 41–43.
Malik, Z., Rezgui, A., Medjahed, B., Ouzzani, M., and Sinha, A.K., 2010,
Semantic integration in geosciences: International Journal of Semantic
Computing, v. 4, p. 1–30, doi:10.1142/S1793351X10001036.
Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A., and Schneider,
L., 2002, The Wonderweb Library of Foundational Ontologies and the
DOLCE Ontology: Italy, Technical Report D17, Laboratorio di Ontologia
Applicata, p. 1–36.
Mons, B., Haagen, H., Chichester, C., Hoen, P., Dunnen, J., Ommen, G.,
Mulligen, E., Singh, B., Hooft, R., Roos, M., Hammond, M., Kiesel, B.,
Giardine, B., Velterop, J., Groth, P., and Schultes, E., 2011, The value of data:
Nature Genetics, v. 43, p. 281–283, doi:10.1038/ng0411-281.
Moore, R.W., 2006, Building Preservation Environments with Data Grid
Technology: American Archivist, v. 69, p. 139–158.
Moore, R.W., 2008, Towards a theory of digital preservation: International
Journal of Digital Curation, v. 3, p. 63–75, doi:10.2218/ijdc.v3i1.42.
National Academy of Science, 2009, Committee on a New Biology for the 21st
Century: Ensuring the United States Leads the Coming Biology
Revolution: Washington, D.C., National Academy Press, 98 p.
National Research Council, 2000, How People Learn: Brain, Mind, Experience,
and School: Washington, D.C., Commission on Behavioral and Social
Sciences and Education, National Academy Press, 357 p.
National Science Board, 2005, Long lived digital data collections: Enabling
research and education in the 21st century: National Science Board
(NSB-05-40, revised 23 May 2005). Retrieved 16 July 2010, from
http://www.nsf.gov/pubs/2005/nsb0540/ (last accessed 18 September 2012).
Niles, I., and Pease, A., 2001, Towards a standard upper ontology, in Welty,
C., and Smith, B., eds., Proceedings of the International Conference
on Formal Ontology in Information Systems (FOIS-2001), 2nd, Ogunquit,
Maine, 17–19 October: Association for Computing Machinery,
v. 2001, p. 2–9.
Noy, N.F., 2004, Semantic integration: A survey of ontology-based approaches:
SIGMOD Record, v. 33, p. 65–70, doi:10.1145/1041410.1041421.
Obrst, L., 2010, Ontological Architectures, in Poli, R., Seibt, J., Healy, M., and
Kameas, A., eds., Theory and Applications of Ontology: Computer
Applications: Springer Science + Business Media, New York, p. 27–66.
Obrst, L., and Cassidy, P., 2011, The need for ontologies: Bridging the barriers
of terminology and data structures, in Sinha, A.K., Arctur, D., Jackson, I.,
and Gundersen, L., eds., Societal Challenges and Geoinformatics:
Geological Society of America Special Paper 482, p. 99–123.
Palmer, C.L., Cragin, M.H., Heidorn, P.B., and Smith, L.C., 2007, Data
curation for the long tail of science: The case of environmental sciences: Paper
presented at the Third International Digital Curation Conference,
Washington, D.C.
Phytila, C., 2002, An analysis of the SUMO and Description in Unified
Modeling Language, http://ontology.teknowledge.com/Phytila/Phytila-SUMO.html
(unpublished).
Raskin, R.G., 2006, Development of ontologies for Earth system science, in
Sinha, A.K., ed., Geoinformatics: Data to Knowledge: Geological Society
of America Special Paper 397, p. 195–200.
Raskin, R.G., and Pan, M.J., 2005, Knowledge representation in the semantic
web for earth and environmental terminology (SWEET): Computers &
Geosciences, v. 31, p. 1119–1125, doi:10.1016/j.cageo.2004.12.004.
Reitsma, F., and Albrecht, J., 2005, Modeling with the semantic web in the
geosciences: IEEE Intelligent Systems, v. 20, p. 86–88,
doi:10.1109/MIS.2005.32.
Reitsma, F., Laxton, J., Ballard, S., Kuhn, W., and Abdelmoty, A., 2009,
Semantics, ontologies and e-Science for the geosciences: Computers &
Geosciences, v. 35, p. 706–709, doi:10.1016/j.cageo.2008.03.014.
Rezgui, A., Malik, Z., and Sinha, A.K., 2007, DIA Engine: Semantic discovery,
integration, and analysis of Earth science data: U.S. Geological Survey
Scientific Investigations Report 2007-5199, p. 15–18.
Rezgui, A., Malik, Z., and Sinha, A.K., 2008, Semantically enabled registration
and integration engines (SEDRE and DIA) for the Earth sciences: U.S.
Geological Survey Scientific Investigations Report 2008-5172, p. 47–52.
Rueda, C., Bermudez, L., and Fredericks, J., 2009, The MMI Ontology Registry and
Repository: A Portal for Marine Metadata Interoperability, in OCEANS
2009, MTS/IEEE Biloxi—Marine Technology for Our Future: Global and
Local Challenges: Oceanic Engineering Society, p. 1–6.
Semy, S., Pulvermacher, M., and Obrst, L., 2004, Towards the use of an upper
ontology for U.S. Government and military domains: An evaluation, The
MITRE Corporation (04–0603), http://handle.dtic.mil/100.2/ADA459575
Sheth, A., 2011, Semantics scales up: Beyond search, in Web 3.0: IEEE Internet
Computing, v. 15, p. 3–6, doi:10.1109/MIC.2011.157.
Simons, B., Boisvert, E., Brodaric, B., Cox, S., Duffy, T., Johnson, B.,
Laxton, J., and Richard, S., 2006, GeoSciML: Enabling the exchange of
geological map data, in Proceedings, Australian Earth Sciences Convention
(AESC), Melbourne, 4 p.
Sinha, A.K., 2011, Infusing semantics into the knowledge discovery process for
the new e-geoscience paradigm, in Sinha, A.K., Arctur, D., Jackson, I.,
and Gundersen, L., eds., Societal Challenges and Geoinformatics:
Geological Society of America Special Paper 482, p. 165–181.
Sinha, A.K., Zendel, A., Brodaric, B., Barnes, C., and Najdi, J., 2006a, Schema
to ontology for igneous rocks, in Sinha, A.K., ed., Geoinformatics:
Geological Society of America Special Paper 397, p. 169–182.
Sinha, A.K., Lin, K., Raskin, R., and Barnes, C., 2006b, Cyberinfrastructure for
the Geosciences—Ontology based discovery and integration: U.S.
Geological Survey Scientific Investigations Report 2006-5201, p. 1–2.
Sinha, A.K., Malik, Z., Rezgui, A., Zimmerman, H., Barnes, C.G., Thomas,
W.A., Jackson, I., Gundersen, L.C., Heiken, G., Raskin, R., Fox, P.,
McGuinness, D.L., and Seber, D., 2010, Geoinformatics: Transforming
Data to Knowledge for Geosciences: GSA Today, v. 20, no. 12, p. 4–10,
doi:10.1130/GSATG85A.1.
Stein, L., 2008, Towards a cyberinfrastructure for the biological sciences:
Progress, visions and challenges: Nature Reviews Genetics, v. 9, p. 678–688.
Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A., Wu, L., Read, E., Manoff,
M., and Frame, M., 2011, Data sharing by scientists: Practices and
perceptions: PLoS ONE, v. 6, e21101, doi:10.1371/journal.pone.0021101.
Thessen, A.E., and Patterson, D.J., 2011, Data Issues in the Life Sciences:
Data Conservancy, Marine Biological Laboratory, Woods Hole,
Massachusetts, 55 p.
Wright, J.P., Jones, C.G., and Flecker, A.S., 2002, An ecosystem engineer,
the beaver, increases species richness at the landscape scale: Oecologia,
v. 132, p. 96–101, doi:10.1007/s00442-002-0929-1.
MANUSCRIPT ACCEPTED BY THE SOCIETY 21 JANUARY 2013
... Data from these projects, large and small, have in the past been poorly curated and thus less visible to other scientists, largely not publicly available online, and hence named "Dark Data" (Heidorn [5]). But as Sinha et al. [6] emphasize, without access to the types of historical observations or legacy data that make up the "dark data" in the "long tail" of science, emerging scientific challenges will not be addressable. "...making these data available on demand must be one of the highest priorities for any enterprise seeking to develop a cyberinfrastructure capable of promoting new ways to examine the earth system through time" (Sinha et al. [6]). ...
... But as Sinha et al. [6] emphasize, without access to the types of historical observations or legacy data that make up the "dark data" in the "long tail" of science, emerging scientific challenges will not be addressable. "...making these data available on demand must be one of the highest priorities for any enterprise seeking to develop a cyberinfrastructure capable of promoting new ways to examine the earth system through time" (Sinha et al. [6]). One international project designed to rescue historical oceanographic data was the IOC/IODE GODAR project, which focused mainly on physical data (Conkright et al. [7]; Caldwell [8]). ...
Article
Data generated as a result of publicly funded research in the USA and other countries are now required to be available in public data repositories. However, many scientific data over the past 50+ years were collected at a time when the technology for curation, storage, and dissemination was primitive or non-existent, and consequently many of these datasets are not publicly available. These so-called “dark data” sets are essential to understanding how the ocean has changed chemically and biologically in response to the documented shifts in temperature and salinity (aka climate change). An effort is underway to bring to light dark data about zooplankton collected in the 1970s and 1980s as part of the cold-core and warm-core rings multidisciplinary programs and other related projects. Zooplankton biomass and euphausiid species abundance from 306 tows, together with related environmental data including many depth-specific tows taken on 34 research cruises in the Northwest Atlantic, are online and accessible from the Biological and Chemical Oceanography Data Management Office (BCO-DMO).
... Informatics specialists like to contrast it with the smaller number of large, more accessible data sets (e.g. Sinha et al., 2013). The name 'long tail' derives from graphs drawn of the size of data sets against their number: there are relatively few large datasets and a lot of smaller ones. ...
... While the Earth is a whole by itself, the studies of the Earth are not. Studying the Earth as a system requires knowledge across disciplines, access to vast amount of data, and the capability to analyze those data (Loudon 2011; Sinha et al. 2013). ...
Conference Paper
Geoinformatics is now facing the tremendous changes and opportunities initiated by the Semantic Web. The Semantic Web is an extension to the World Wide Web, aiming at transforming the current Web from a Web of Documents to a Web of Data. In such an open space underpinned by abundant data sources and computational facilities, geoinformatics and geomathematics are in a transition from the integration stage to the intelligent stage. Semantic Web technologies have already been successfully applied in data management and integration, and further work can be done on data analysis, in which a data-driven abductive approach should be highlighted.
Article
The Institute of Marine Sciences (ISMAR) and the Institute of Polar Sciences (ISP) of the Italian National Research Council (CNR) have gathered a substantial amount of heterogeneous geodata through the years in the Adriatic Sea, with different methodologies and for multiple scopes regarding geological, oceanographic, biological, and anthropogenic aspects and their interactions. To overcome challenges in dataset heterogeneity and fragmentation, a Marine Spatial Data Infrastructure (MSDI) has been set up, aiming at integrating and preserving geodata, fostering their reuse (e.g. the generation of scenarios for geological past and future developments by the application of numerical models), and ensuring a good degree of FAIRness (FAIR: Findable, Accessible, Interoperable, and Reusable). The MSDI consists of a Spatial Relational Database Management System (RDBMS) based on specific data models designed following in part the INSPIRE Directive data specifications, a WebGIS, a metadata catalogue, and a cloud system. This paper shows the potential of this MSDI and discusses the main implementation steps, the elements that make up the infrastructure, the level of FAIRness reached, the main elements promoting FAIRness, and the gaps to be covered. Compliance with the FAIR principles represents a fundamental step toward developing interoperability with European and international marine data management infrastructures for handling and exchanging multidisciplinary data.
Conference Paper
SUMMARY The CGI data model working group have established an initial geology data model and XML-based exchange language to accommodate geological map data, referred to as GeoSciML. The language is based on prior work carried out at North American, European and Australian geological survey and research organisations. Unified Modelling Language (UML) has been used as a design aid for capturing the geological concepts and their properties. The UML model has then been converted to the GML-conformant GeoSciML. The design of GeoSciML meets the short-term goal of accommodating the geoscience information presented on geological maps, as well as being fully extensible to include the full range of geological concepts covered by the geosciences. To demonstrate the ability of GeoSciML to deliver data via web feature services, a small subset has been selected as a testbed. This testbed will deliver lithostratigraphic units, boreholes, faults, contacts and compound materials from different national geological surveys.
Article
Extracting knowledge from the rock record stored in databases is one of the primary goals of the information-oriented geoscientist. This activity requires well-designed organizational structures to facilitate queries, and ultimately cyber-aided geological research. Such structures need to encompass information about geologic objects and the processes that affect or produce the objects. Therefore, our goal is to create a prototype of a computer-based knowledge environment that specifically reflects the reasoning used by a geoscientist, with the recognition that his/her primary interest lies in understanding processes that affect the rock record. In order to start development of such capabilities, we have utilized an organization of attributes and their definitions to construct a database schema for field-based igneous rocks, and show that its conversion into a knowledge base requires the application of both object and process ontologies.
Article
An integrative view of Earth as a system, based on multidisciplinary data, has become one of the most compelling reasons for research and education in the geosciences. It is now necessary to establish a modern infrastructure that can support the transformation of data to knowledge. Such an information infrastructure for geosciences is contained within the emerging science of geoinformatics, which seeks to promote the utilization and integration of complex, multidisciplinary data in seeking solutions to geoscience-based societal challenges.
Article
This chapter describes the need for complex semantic models, i.e., ontologies of real-world categories, referents, and instances, to go beyond the barriers of terminology and data structures. Terms and local data structures are often tolerated in information technology because these are simpler, provide structures that humans can seemingly interpret easily and easily use for their databases and computer programs, and are locally salient. In particular, we focus on both the need for ontologies for data integration of databases, and the need for foundational ontologies to help address the issue of semantic interoperability. In other words, how do you span disparate domain ontologies, which themselves represent the real-world semantics of possibly multiple databases? We look at both sociocultural and geospatial domains and provide rationale for using foundational and domain ontologies for complex applications. We also describe the use of the Common Semantic Model (COSMO) ontology here, which is based on lexical-conceptual primitives originating in natural language, but we also allow room for alternative choices of foundational ontologies. The emphasis throughout this chapter is on database issues and the use of ontologies specifically for semantic data integration and system interoperability.
Article
The Dataverse Network is an open-source application for publishing, referencing, extracting and analyzing research data. The main goal of the Dataverse Network is to solve the problems of data sharing through building technologies that enable institutions to reduce the burden for researchers and data publishers, and incentivize them to share their data. By installing Dataverse Network software, an institution is able to host multiple individual virtual archives, called "dataverses", for scholars, research groups, or journals, providing a data publication framework that supports author recognition, persistent citation, data discovery and preservation. Dataverses require no hardware or software costs, nor maintenance or backups by the data owner, but still enable all web visibility and credit to devolve to the data owner.
Article
By adopting, adapting, and applying semantic web and software-as-a-service technologies, we can make the use of geoscience data as easy and convenient as consumption of online media. Consider Alice, a geoscientist, who wants to investigate the role of sea surface temperatures (SSTs) on anomalous atmospheric circulations and associated precipitation in the tropics. She hypothesizes that nonlinear dynamics can help her model transport processes propagated long distances through the atmosphere or ocean, and asks a graduate student to obtain daily weather, land-cover, and other environmental data products that may be used to validate her hypothesis. Like the vast majority of NSF-funded researchers (see Table 1), Alice works with limited resources. Indeed, her laboratory comprises just herself, a couple of graduate students, an undergraduate, and a technician. In the absence of suitable expertise and infrastructure, the apparently simple task that she assigns to her graduate student becomes an information discovery and management nightmare. Data are either not available or are of poor quality. Downloading and transforming datasets takes weeks. Alice then faces new challenges. Will these new data enrich her compute-intensive model, or simply propagate errors? Or should they seek other, higher-resolution datasets? What software can she use to help answer these questions? We cannot blame Alice if she ultimately abandons this promising avenue of research. Contrast Alice's experience that evening at home, as she seeks to relax with a movie. She enters a few keywords in a Web browser. The Web integrates distributed sources and discovers deep information to present a wide range of choices; it can even keep her updated of new information, if she subscribes to an alert mechanism. In effect, the Web helps her transform a vast amount of information into knowledge and actionable intelligence, and to pick and choose what is useful from what is not. 
And once she identifies a suitable movie, it streams reliably to her chosen playback device. She can also, if she so desires, share her experience easily with her friends via email and social networking tools.
Article
Preservation environments for digital records are successful when they can separate the digital record from any dependence on the original creating infrastructure. Data grid technology, which supports the management of records that are located on multiple storage systems, provides the software needed for infrastructure independence. This article provides a description of how data grid technology can be used to support preservation processes and of existing preservation environments that are based upon data grids.
At the conclusion of its first phase, the InterPARES project (International Research on Permanent Authentic Records in Electronic Systems) issued a number of requirements for authenticity of digital records and methods of selection and preservation. The conceptual foundation of these products is exemplified in an Intellectual Framework for Policy Development. Among the key principles expressed in the framework are the following:
• It is not possible to preserve a digital record as a stored physical object, but only the ability to reproduce it; and
• The preservation of authentic electronic records is a continuous process that begins with the process of records creation and whose purpose is to transmit authentic records across time and space.
In light of the above, it is essential to identify clearly the digital entity that needs to be preserved when we talk about an "authentic electronic record." The InterPARES definition of an ...
The results presented here were supported by the InterPARES project, NSF NPACI ACI-9619020 (NARA supplement), the Persistent Archives Testbed (NHPRC grant number 2004-008), the NSF NSDL/UCAR Subaward S02-36645, the DOE SciDAC/SDM DE-FC02-01ER25486 and DOE Particle Physics data grid, the NSF National Virtual Observatory, the NSF Grid Physics Network, and the NASA Information Power Grid.