Metadata for Social Science Datasets

Abstract

Data are valuable but finding the right data is often difficult. This chapter reviews current approaches to metadata about numeric data and considers approaches that may facilitate the identification of relevant data. In addition, the chapter reviews how metadata support repositories, portals, and services. There are many emerging metadata standards, but they are applied unevenly so that there is no comprehensive approach. There has been greater emphasis on structural issues than on semantic descriptions.
4 METADATA FOR SOCIAL SCIENCE DATASETS
Robert B. Allen
Introduction
Data Elements and Datasets
Metadata Schemas and Catalogues
Linked Data
Richer Semantics
Data Repositories and Collections of Datasets
Repository Services
Infrastructure
Conclusion
Acknowledgements
References
Notes
In: Rich Search and Discovery for Research Datasets: Building the Next Generation of Scholarly Infrastructure,
Edited by J.I. Lane, I. Mulvany, and P. Nathan, Sage Publishing, 2020.
License: CC BY-NC-SA 3.0
Introduction
Evidence-based policy needs relevant data (Commission on Evidence-Based Policymaking, 2017; Lane, 2016). Such data are often difficult to find and use. The FAIR guidelines recommend that data should be findable, accessible, interoperable, and reusable.1 Broad and consistent metadata can support these needs. Metadata and other knowledge structures could also supplement and ultimately even replace text.
This chapter surveys the state of the art of metadata for numeric datasets, focusing on metadata for administrative and social science records. Administrative records describe the state of the world as collected by organizations or agencies; they include governmental, hospital, educational, and business records. By comparison, social science data are generally collected for the purpose of developing or applying theory.
We start by considering data and datasets, and then the basic principles of metadata and their application to datasets. Modern metadata is often implemented with Resource Description Framework (RDF) linked data. Next, we introduce ontologies and other semantic approaches. We then move on to applications which use metadata. We examine repositories that hold and distribute collections of datasets. We then describe services and techniques associated with repositories. We conclude by briefly describing the computing infrastructure for repositories.
Data Elements and Datasets
While data may be incorporated in text, image or video, here we focus on numeric observations recorded and maintained in machine-readable form. Individual observations are rarely used in isolation; rather, they are typically collected into datasets.
A dataset is defined in the W3C-DCAT (W3C Data Catalog Vocabulary)2 as 'a collection of data, published or curated by a single agent'3 such as a statistical agency. There are many different types of datasets; they differ in their structure, their source, and their use. A given data element may appear in many different datasets and may be numerically combined with other data to form derived data elements which then appear in still other datasets. In some cases, datasets are single vectors of data; in other cases, they comprise all the data associated with one study or with a group of related datasets. Reference datasets are generally collected and archived because they are of enduring value and can be used for answering many different types of questions. Other datasets, such as an individual's medical records, are associated with a relatively narrow set of applications.
There is wide variability in the organization and contents of datasets, as well as in the extent to which datasets are validated and curated. With frameworks such as the SDMX (Statistical Data and Metadata eXchange) Guidelines for the Design of Data Structure Definitions,4 concise structured descriptions can be developed of how data elements are combined to form datasets.
Metadata Schemas and Catalogues
Many datasets are available; the DataCite repository alone contains over 5 million datasets. Metadata can support users in finding datasets and in knowing what is in them. Metadata are short descriptors which refer to a digital object. However, there is tremendous variability in the types of metadata and how they are applied. One categorization of metadata identifies structural (or technical), administrative, and descriptive metadata (Riley, 2017). Structural metadata includes the organization of the files. Administrative metadata describes the permissions, rights, preservation, and usage relating to the data.5 Descriptive metadata covers the contents.
A metadata element describes an attribute of a digital object. The simplest metadata (e.g. a digital object identifier, DOI, or an ORCiD6) identifies the digital object or its creator.7 Metadata elements are generally part of a schema or frame. DCAT8 is a schema standard for datasets that is used by many repositories such as data.gov. Other structured frameworks for datasets include the DataCite9 metadata schema and the Inter-university Consortium for Political and Social Research Data Documentation Initiative (ICPSR DDI; see below). ISO 19115-1:2014 establishes a schema for describing geographic information and services.10
The schema specifications provide a flexible framework. For instance, DCAT allows the inclusion of metadata elements drawn from domain schemas and ontologies. Some of these domain schemas are widely used resources which DCAT refers to as 'assets'. Figure 4.1 shows a fragment of properties (i.e. metadata elements) from an implementation of the Schema.org11 dataset schema to describe gross domestic product (GDP).

Figure 4.1 Fragment of GDP properties described by the Schema.org dataset schema12
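Since the figure reproduces Schema.org markup, a small sketch may help make it concrete. The following JSON-LD fragment, written here in Python, uses genuine Schema.org Dataset properties, but the values, agency name, and URL are hypothetical rather than taken from the figure:

```python
import json

# A sketch of Schema.org Dataset markup (JSON-LD), illustrating the kind of
# properties shown in Figure 4.1. All values here are hypothetical.
gdp_dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Gross Domestic Product (GDP)",
    "description": "Quarterly GDP estimates.",
    "creator": {"@type": "Organization", "name": "Example Statistical Agency"},
    "temporalCoverage": "1947-01-01/2019-12-31",
    "spatialCoverage": "United States",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/gdp.csv",
    },
}

print(json.dumps(gdp_dataset, indent=2))
```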
Metadata terms for an application are often assembled into namespaces drawn from different metadata schemas. Metadata application profiles13 provide constraints on the types of entities that can be included in the metadata for a given application. Moreover, application profiles can be used to validate metadata against the standards. For instance, the DCAT application profile for data portals in Europe (DCAT-AP) supports the integration of data drawn from repositories in different jurisdictions in the EU.14
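An application profile's constraints can be checked mechanically. The following sketch uses the rdflib and pyshacl Python libraries to validate a toy catalogue record against a single SHACL constraint (every dcat:Dataset must have a dct:title); DCAT-AP itself defines far richer constraints, and the example namespace is hypothetical:

```python
from rdflib import Graph
from pyshacl import validate  # third-party: pip install pyshacl

# A minimal application-profile-style constraint: every dcat:Dataset must
# carry at least one dct:title.
shapes = Graph().parse(data="""
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

ex:DatasetShape a sh:NodeShape ;
    sh:targetClass dcat:Dataset ;
    sh:property [ sh:path dct:title ; sh:minCount 1 ] .
""", format="turtle")

data = Graph().parse(data="""
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

ex:gdp a dcat:Dataset .   # no dct:title, so validation should fail
""", format="turtle")

conforms, _, report = validate(data, shacl_graph=shapes)
print(conforms)  # False
print(report)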
A collection of dataset schemas,15 such as those for all the datasets in a repository, forms a catalogue. For data streams, there needs to be continuity but also the ability to update the records. In some cases, there may be relatively infrequent periodic updates; these could be given version numbers rather than entirely new DOIs.16 However, collections of highly dynamic data streams present challenges; most of the data stay the same but some of the data and/or metadata (e.g. the number of records) change.
Linked Data
RDF extends XML by requiring triples which assert a relationship (property) between two
identifiers: ‘identifier – property – identifier’. RDF Schema (RDFS) extends RDF by sup-
porting subclass relationships. A graph is formed by linking triples.
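For concreteness, the following minimal Python sketch (using the rdflib library) builds such a graph; the identifiers are hypothetical examples:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Each call adds one triple: identifier - property - identifier (or literal).
g.add((EX.gdp, RDF.type, EX.EconomicIndicator))
g.add((EX.gdp, RDFS.label, Literal("Gross domestic product")))

# RDFS supplies the subclass relationship on top of plain RDF.
g.add((EX.EconomicIndicator, RDFS.subClassOf, EX.Indicator))

print(g.serialize(format="turtle"))
```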
Hierarchical classification systems are another knowledge structure with a long history. Indeed, Schema.org is based around a hierarchical ontology. Simple classification relationships are handled by the Simple Knowledge Organization System (SKOS), which represents the hierarchical structure of traditional thesauri with RDFS. Collections of data organized by SKOS are often described as 'linked data'.
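A fragment of a SKOS thesaurus can be sketched in the same way; the concepts below are hypothetical and not drawn from any of the thesauri mentioned here:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/thesaurus/")
g = Graph()

# Two concepts and a broader/narrower link, as in a traditional thesaurus.
for concept in (EX.labourMarket, EX.unemployment):
    g.add((concept, RDF.type, SKOS.Concept))

g.add((EX.labourMarket, SKOS.prefLabel, Literal("Labour market", lang="en")))
g.add((EX.unemployment, SKOS.prefLabel, Literal("Unemployment", lang="en")))
g.add((EX.unemployment, SKOS.broader, EX.labourMarket))

print(g.serialize(format="turtle"))
```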
Depending on the rigour with which they are developed, these collections can support limited logical inference. Many administrative and social-science-related thesauri, such as EDGAR and those of the World Bank and the OECD, have now been implemented with SKOS. A knowledge base is, primarily, a SKOS graph that links real-world entities. For example, Wikidata17 is an effort to develop a knowledge base based on structured data from Wikimedia projects, and VIVO18 is a knowledge graph of scholarship.
But there are also many stand-alone classification schemes. The Extended Knowledge Organization System (XKOS)19 was developed to allow classification systems to be incorporated into a SKOS framework.
Richer Semantics
Ontologies provide a coherent set of relationships between entities which cover a given domain. Well-constructed ontologies can support logical inference. Some vocabularies, such as Dublin Core, which is implemented in RDF, are said to have an ontology, but these ontologies are limited because relationships among the terms are not specified. FOAF (Friend of a Friend) provides a somewhat richer ontology which includes attributes associated with people. Still more extensive ontologies often use OWL (Web Ontology Language), which can support stronger logical inference than RDFS.
One way to coordinate across terms is an upper ontology. Upper ontologies provide top-down structures for the types of entities allowed in domain and application ontologies. One of the best-known upper ontologies is the Basic Formal Ontology (BFO; Arp et al., 2015), which takes a realist, Aristotelian approach. At the top level, BFO distinguishes between continuants (endurants) and occurrents (perdurants) and also between universals and particulars (instances). Many biomedical ontologies based on BFO are collected in the Open Biomedical Ontology (OBO) Foundry.20
There are fewer rich ontologies dealing with social science content than with natural science. Social ontology, that is, developing rigorous definitions for social terms, is often a challenge. It is difficult to define precisely what a family, a crime, or money is. In most cases, an operational or approximate definition may suffice when formal definitions are difficult. However, those operational definitions often do not interoperate well across studies.
Data Repositories and Collections of Datasets
A data repository holds datasets and related digital objects. Ideally, it contains a stable collection selected according to a collection policy. It is organized by metadata and knowledge structures. It provides access to the datasets and typically supports search.
Figure 4.2 ICPSR DDI metadata elements: Version; Study Title; Alternate Title; PIs & Affiliation; Funding Agencies; Summary; Subject Terms; Geographic Coverage Areas; Geographic Representation; Study Time Periods and Time Frames; Collection Notes; Study Purpose; Study Design; Description of Variables; Sampling (Sampling Procedure, Sampling Unit, Sampling Notes); Oversampled Group; Time Method; Data Source Type; Mode of Collection; Weight; Response Rates; Scales; Analysis Unit; Unit of Observation; Smallest Geographic Unit; Data Format; Restrictions; Version History
The Inter-university Consortium for Political and
Social Research (ICPSR)
The ICPSR21 is a major repository of public-use social science and administrative datasets derived mostly from questionnaires and surveys. We discuss it in depth here because the ICPSR DDI22 (e.g. Vardigan et al., 2008) is especially well crafted.23 The DDI codebook saves the exact wording of all the questions, and the ICPSR provides an index of all variable names. DDI-Lifecycle is an extension that describes the broader context in which the survey was administered as well as details about the preservation of the file. DDI uses XKOS to provide linked data. Figure 4.2 shows the ICPSR DDI metadata schema.
The ICPSR metadata elements incorporate aspects of the implementation and design of research studies. However, many of the ICPSR metadata elements are not independent; potentially, they could be interlinked with terms such as organizations, locations, individuals, and research designs from other knowledge bases. Moreover, they could be linked with higher-level workflows and mechanisms.
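To make the codebook idea concrete, the following sketch parses a hand-written fragment in the style of a DDI Codebook, in which the literal question wording is stored alongside each variable. The element names follow DDI Codebook conventions, but the fragment is simplified and unnamespaced, and the variable itself is hypothetical:

```python
import xml.etree.ElementTree as ET

# A hand-written fragment in the spirit of a DDI Codebook: a variable with
# its label and the literal question wording. A real codebook is namespaced
# and far larger.
ddi_fragment = """
<codeBook>
  <dataDscr>
    <var name="EMPSTAT">
      <labl>Employment status</labl>
      <qstn><qstnLit>Last week, were you employed for pay?</qstnLit></qstn>
    </var>
  </dataDscr>
</codeBook>
"""

root = ET.fromstring(ddi_fragment)
for var in root.iter("var"):
    print(var.get("name"), "-", var.findtext("qstn/qstnLit"))
```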
Additional Examples of Repositories
Statistical data collection is a core function of government. Such collections often emphasize social data on, for example, employment, criminal justice, and public health. They also include related indicators such as agricultural and industrial output and housing. Most countries have national statistical agencies and archives, such as Statistics New Zealand and the Korean Social Science Data Archive. European datasets are maintained in the Consortium of European Social Science Data Archives24 and the European Social Survey.25 Australia has a broad data management initiative, the Australian National Data Service.26 Many US federal governmental datasets are collected at data.gov. In addition, there are many other social survey repositories,27 and many US states and cities have online statistics sites of varying levels of sophistication.
There are also many non-governmental and intergovernmental agencies, such as the OECD, the World Bank, and the United Nations, that manage datasets. Similarly, there are very large datasets from medical research, such as clinical trials, and from clinical practice, including electronic health records.
Many datasets are produced, curated, and used in the natural sciences, such as astronomy and the geosciences. Some of these datasets have highly automated data collection, elaborate archives, and established curation methods. Many repositories contain multiple datasets for which access is supported with portals or data cubes. For instance, massive amounts of geophysical data and related text documents are collected in the EarthCube28 portal. The science.gov portal is maintained by the US Office of Science and Technology Policy. NASA supports approximately 25 different data portals; each satellite in the Earth Observation System may provide hundreds of streams of data,29 with much common metadata. Likewise, there are massive genomics and proteomics datasets which are accessible via portals such as UniProt30 and the Protein Data Bank,31 along with suites of tools for exploring them.
Repository Registries
There are a lot of different repositories, so it is useful to have a registry with a standard
schema structure for describing them. The Registry of Research Data Repositories,32 which
is operated by DataCite, links to more than 2000 repositories, each of which holds many
datasets. Each of those repositories is described by the re3data.org schema (Rücknagel
et al., 2015).
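The registry can also be read programmatically. A minimal sketch, assuming re3data's public REST listing endpoint (the endpoint path and the element names in the XML response are taken to be as documented, and may change):

```python
import requests
import xml.etree.ElementTree as ET

# Fetch the registry list from re3data's public API; endpoint assumed from
# its documentation, and the response schema may differ over time.
resp = requests.get("https://www.re3data.org/api/v1/repositories", timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)
# Print identifiers and names for the first few registered repositories.
for repo in list(root.iter("repository"))[:5]:
    print(repo.findtext("id"), repo.findtext("name"))
```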
Ecosystems of Texts and Datasets
Datasets are often associated with text reports, whether the reports describe the development of the datasets or their use. Ultimately, we would like to be able to move seamlessly from datasets to texts and other related materials. However, as demonstrated by several of the papers in this volume, it is often difficult to extract details about datasets from legacy publications.
Text associated with a dataset may be used to support searching for it. Indeed, Google Dataset Search generates its index from pages marked up with Schema.org JSON-LD (JavaScript Object Notation for Linked Data) or microdata.
Going forward, great value can be achieved by persuading editors and authors to clearly cite and deposit datasets. In some cases, a separate data editor may be appointed. The Dryad Digital Repository33 captures datasets from scholarly publications; it requires the deposit of data associated with scholarly papers accepted for publication. Such datasets are most often used to validate the conclusions of a research publication, but they may also be used more broadly.
Research datasets may be given DOIs34 and cited in much the same way that research
reports are cited. Formal citations can support tracing the origins of data used in analyses
and help to acknowledge the work of the creators of the datasets.
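Dataset DOIs can be resolved to their registered metadata through the DataCite REST API. A minimal sketch (the DOI shown is a placeholder, not a real dataset):

```python
import requests

# Resolve a dataset DOI to its registered metadata via the DataCite REST API.
# The DOI below is hypothetical, so a real run needs an actual dataset DOI.
doi = "10.5061/dryad.example"
resp = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=30)

if resp.ok:
    attrs = resp.json()["data"]["attributes"]
    print(attrs["titles"], attrs["creators"], attrs["publicationYear"])
else:
    print("lookup failed:", resp.status_code)
```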
Information Institutions and Organizations
The Open Archival Information System (OAIS) provides a reference model for the management of archives (Lee, 2010). A key part of the model is the inclusion of preservation planning and the requirement for stable administration over time. These attributes are part of all information institutions. Libraries, archives, and museums have formal collection management strategies, metrics, and policies.
In addition to traditional information institutions, there are now many other players. CrossRef35 and DataCite are DOI registration agencies. CrossRef is a portal to metadata for scholarly articles, while DataCite provides metadata for digital objects associated with research. Schema.org's primary mission is to provide a structure that improves indexing by search engine companies. Still other organizations, such as Health Level Seven International36 and the Kyoto Encyclopedia of Genes and Genomes,37 manage controlled vocabularies and frameworks. These organizations are increasingly adopting best practices similar to those of traditional information organizations.
Repository Services
Administrative Metadata and Related Services
Administrative metadata is one of the three broad categories of metadata; it describes the permissions, rights, preservation, and usage of the data. While the focus of a traditional library is to support access and the focus of an archive is to ensure stability and quality, digital repositories must increasingly address both access and preservation.
Preservation and Trusted Datasets
Although data storage prices are declining dramatically, the cost of maintaining a trusted
repository remains substantial, and we cannot save everything. These challenges are
familiar from traditional archives; selection policies typical in archives could help in
controlling the many poorly documented datasets in some repositories. Yet, prioritiza-
tion of what to select is difficult (Whyte and Wilson, 2010).38
Lost data is often irreplaceable. Even if the data is not entirely lost, users need confidence that the validity of stored data has not been compromised. Indeed, some data may become the target of malicious attacks. Trust is a result of both technology and organizational procedures. Technology may include hash-based encoding of data. CLOCKSS (Controlled Lots of Copies Keep Stuff Safe)39 is a distributed hash system for web-based scholarly literature. Blockchains provide hashed records of transactions and can be applied to data records.
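Fixity checking with cryptographic hashes is straightforward; a sketch (the file name is hypothetical):

```python
import hashlib

# Compute a fixity checksum for a dataset file. Storing and periodically
# re-checking such hashes lets a repository detect silent corruption or
# tampering.
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("survey_wave_3.csv"))  # hypothetical file
```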
The OAIS framework has been incorporated into the ICPSR DDI-Lifecycle model. The integrated Rule-Oriented Data System (iRODS)40 is a policy-based archival management system41 developed for large data stores; it implements a service-oriented architecture to support best practices established by archivists. Further, audits, such as those based on the Digital Repository Audit Method Based on Risk Assessment,42 may be conducted to assess how well repositories implement trustworthy procedures.
Preservation and provenance metadata schemes such as PREMIS43 and PROV-O44 are
state-based ontologies that include entities such as actors, events and digital objects.
They record the history of transitions (e.g. changes in format) for digital objects.
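A provenance record of this kind can be sketched as triples in the style of PROV-O; the entities below (a format migration performed by an archivist) are hypothetical:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

g = Graph()
# One transition in a dataset's history: version 2 was generated by a
# format-migration activity and derived from version 1.
g.add((EX.dataset_v2, RDF.type, PROV.Entity))
g.add((EX.migration, RDF.type, PROV.Activity))
g.add((EX.dataset_v2, PROV.wasGeneratedBy, EX.migration))
g.add((EX.dataset_v2, PROV.wasDerivedFrom, EX.dataset_v1))
g.add((EX.migration, PROV.wasAssociatedWith, EX.archivist))

print(g.serialize(format="turtle"))
```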
Rights Metadata
For some data, there are many advantages to open publication. The rights for such data can be specified with a Creative Commons licence. For other data, there can be strong justifications for limiting access, such as privacy and economic factors.
For example, although survey results are generally aggregated across individuals, individual-level data is sometimes very useful. Some repositories of survey data include microdata, that is, the responses that individuals gave to survey questions.45 However, analysis of such microdata raises privacy concerns and needs to be carefully managed; access should be limited to qualified researchers. Repositories of individual health records raise similar privacy concerns.
Usage Statistics
The number of visits and downloads for a dataset can give later users an indication of its likely value. Such usage data are helpful for the managers and funders of repositories in evaluating their service. Citations are indicators of how a dataset is being used and of its relationship to other work.
Analysis Platforms and Decision Support Systems
There is an increasingly rich set of analytic tools. Some of the earliest tools were statistical packages such as SPSS, R, SAS, and Stata. These were gradually enhanced with data visualization and other analytic software. The current generation of tools, such as Jupyter46 notebooks, RSpace, and electronic lab notebooks (ELNs), integrate workflows, raw data, data analysis, and annotations into one environment.
Virtual research environments are typically organized by research communities to coordinate datasets with search and analytic tools. For instance, the Virtual Astronomy Observatory uses Jupyter to provide users with a robust research environment, and WissKI,47 which is based on Drupal, is a platform for coordinating digital humanities datasets. Decision support systems are generally focused on finding optimal solutions in a parameter space. They often draw on data warehouses, though recently they have begun to incorporate feeds of unstructured data (e.g. web searches).
Most repositories support search on metadata terms. In addition, some repositories
have developed their own powerful data exploration tools such as ICPSR Colectica48 for
DDI and the GSS Data Explorer.49 The Amundsen data discovery and metadata engine50
uses metadata elements to provide a table explorer. Potentially, interactive visualization
tools such as TableLens (Rao and Card, 1994) could also be employed.
Metadata Development, Standardization, and Management
Metadata, whether for texts or datasets, needs to be complete, consistent, standardized, machine processable, and timely (Park, 2009). Metadata registries provide clear definitions and promote standardization (ISO/IEC 11179). For instance, the Marine Metadata Interoperability Ontology Registry and Repository51 records usage of different metadata terms. A registry may interoperate with editing tools for developers (Gonçalves et al., 2019); these tools may suggest candidate metadata terms. One of the keys to the development of good metadata is the involvement of a community that cares about the results.
Data Cubes, Data Warehouses, and Data Exchanges
An organization such as a large business often has many different databases. The data in these databases will likely have different formats and definitions but can be organized in a multidimensional cube. Some of the cube's cells may be well populated with data that appears across many of the databases, but there will also be sparsely populated regions and cells. Online analytical processing (OLAP) users can generate different views of the data by drilling down, rolling up, and slicing and dicing across cells. To facilitate retrieval, there can be a rich pre-coordinated index for common queries; other queries can be implemented with slower methods such as hashing or B-trees.
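These operations map naturally onto ordinary data-frame tools. A toy sketch in Python with pandas, using hypothetical figures:

```python
import pandas as pd

# A tiny cube with dimensions (year, region) and one measure (employment).
df = pd.DataFrame({
    "year":   [2018, 2018, 2019, 2019],
    "region": ["North", "South", "North", "South"],
    "employment": [120, 95, 125, 97],
})

# 'Slice' one region, then 'roll up' by summing over regions per year.
north = df[df["region"] == "North"]
rollup = df.pivot_table(values="employment", index="year", aggfunc="sum")
print(north, rollup, sep="\n\n")
```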
While many organizations now have integrated enterprise data management systems, data cubes are still useful for warehousing data and for exchanging it across organizations. For instance, the W3C Data Cube52 standard is applied in inter-organizational projects such as EarthCube.53 SDMX54 enables data exchange among statistical agencies in the EU.
Production Workflows, Research Workflows and Research Objects
Entities change over time, yet many knowledge representation frameworks do not model change. To represent change, models need to represent transitions, processes, and other sequential activities. Such modelling is closer to state machines, Petri nets, process ontologies, the Unified Modeling Language (UML), or even programming languages than to traditional knowledge representation.
One way to document a research project is by saving the files developed during the study (Borycz and Carroll, 2018). Data files (e.g. Excel files) are just one type of artefact from a research programme; other research objects include workflows. Workflows are a natural fit for describing research methods and analyses (Austin et al., 2017). The Taverna55 workflow tool has been used for the MyExperiment56 project, which provides a framework for capturing and posting Taverna and other types of research workflows and incorporates simple ontologies such as FOAF. Workflows can also be used to specify and document statistical analyses; several of the analysis platforms support them. Sequential activities in the management of repositories are also often tracked with workflows. For instance, the Generic Statistical Information Model (GSIM)57 specifies workflows for the production of datasets by statistical agencies.
Semantic Modelling and Direct Representation
Semantic models attempt to represent entities. They could support unified descriptions of functionality, of transitions of complex continuants, and of sequential activities (Allen, 2018). Changes in semantic models are a form of qualitative simulation. While traditional knowledge representation is usually implemented with ontologies, models which allow transitions are more like programming languages. Although semantic modelling might be implemented with process ontologies, we have focused on the use of an object-oriented programming language which supports threads, allowing parallel concurrent event streams and, potentially, the development of a 'unified temporal map'. Such semantic simulations may be useful for modelling historical events. For instance, a community described in a newspaper may be cast into a 'community model'. These models go beyond social ontology to model social mechanisms (Ylikoski, 2017).
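As a very rough illustration of this style of modelling (a toy sketch, not the author's actual system), the Python fragment below runs two hypothetical entities as concurrent threads and merges their state transitions into a single timestamped history:

```python
import threading
import time

# A toy 'community model': each entity runs its own event stream in a thread,
# and its state transitions are merged into one timestamped history, a very
# small stand-in for the 'unified temporal map' idea. All names hypothetical.
history = []
lock = threading.Lock()

class Entity(threading.Thread):
    def __init__(self, name, transitions):
        super().__init__()
        self.name = name
        self.transitions = transitions

    def run(self):
        for state in self.transitions:
            with lock:
                history.append((time.monotonic(), self.name, state))
            time.sleep(0.01)  # stand-in for simulated time passing

entities = [Entity("factory", ["built", "operating", "closed"]),
            Entity("town", ["founded", "growing", "declining"])]
for e in entities:
    e.start()
for e in entities:
    e.join()

for timestamp, name, state in sorted(history):
    print(f"{timestamp:.3f}  {name}: {state}")
```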
In addition, Allen (2015, 2018) has proposed rich semantic modelling of entire research
reports and datasets. Structured evidence and argumentation about claims might then
be applied for the evaluation of the models. Ultimately, such ‘direct representation’ may
replace text as the primary representation for research and scholarship.
Infrastructure
Repository Servers
Semantic representations may be implemented with triplestores. Triplestores facilitate logical inference, but retrieval may be more efficient with relational databases. Many metadata catalogues are implemented with relational databases; thus, they use SQL and are often characterized by UML class diagrams. Information models (e.g. the National Information Exchange Model58) which could be used for metadata registries may be implemented as data dictionaries.
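To illustrate the contrast, the same catalogue question that SQL would express as SELECT title FROM datasets can be posed to a triplestore as SPARQL; a minimal rdflib sketch with hypothetical identifiers:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.gdp, RDF.type, DCAT.Dataset))
g.add((EX.gdp, DCT.title, Literal("Gross Domestic Product")))

# List the titles of all catalogued datasets.
q = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?title WHERE { ?d a dcat:Dataset ; dct:title ?title . }
"""
for row in g.query(q):
    print(row.title)
```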
Some repositories are federated with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH),59 which allows the 'harvesting' of metadata from separate repositories. OAI-PMH is increasingly used as an API to allow external users to query and interact with the federated set of metadata.
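The protocol itself is a small set of HTTP requests. A minimal harvesting sketch (the verbs and parameters are standard OAI-PMH; the base URL is a placeholder for a real repository endpoint):

```python
import requests
import xml.etree.ElementTree as ET

# Harvest Dublin Core records via OAI-PMH.
BASE = "https://repository.example.org/oai"  # hypothetical endpoint
DC = "{http://purl.org/dc/elements/1.1/}"

resp = requests.get(BASE, params={"verb": "ListRecords",
                                  "metadataPrefix": "oai_dc"}, timeout=30)
resp.raise_for_status()

# Print the dc:title of each harvested record.
root = ET.fromstring(resp.content)
for title in root.iter(DC + "title"):
    print(title.text)
```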
Cloud Computing
We are well into the era of cloud computing (Foster and Gannon, 2017), which allows flexible allocation of computing, networking, and storage resources and facilitates software as a service. The compatibility of the versions of the software packages needed for data management is often a challenge. Containers, such as those from Docker, allow compatible versions of software to be assembled and run on a virtual computer. A cloud-based virtual machine can hold datasets, workflows, and the programs used to analyse the data, which can constitute a complete digital preservation package.
Highly networked data centres facilitate the Internet of Things, which generates massive and dynamic data. Increasingly, cloud computing supports edge computing and append-only stores which can capture streaming data. These technologies will provide the foundation for smart cities and have implications for the kinds of questions we may ask about social behaviour.
Conclusion
Many datasets, especially legacy datasets, are difficult to find and access. Some of the biggest issues for the retrieval of datasets concern information organization, which helps to provide context. Metadata supports the discovery of, and access to, datasets.
More attention to metadata would also further support evidence-based policy. We need richer, more systematic, and more interoperable metadata standards. We need to improve the metadata associated with existing datasets. And we need to aggressively
upgrade the application of high-quality metadata and knowledge organization systems
to datasets as they are created.
Acknowledgements
Julia Lane and members of NYU’s Center for Urban Science and Progress provided useful
advice and comments.
References
Allen, R. B. (2015) Repositories with direct representation. Preprint, arXiv:1512.09070.
Allen, R. B. (2018) Issues for using semantic modeling to represent mechanisms. Preprint,
arXiv:1812.11431.
Allen, R. B. and Kim, Y. H. (2017/2018) Semantic modeling with foundries. Preprint,
arXiv:1801.00725.
Arp, R., Smith, B. and Spear, A.D. (2015) Building Ontologies with Basic Formal Ontology.
Cambridge, MA: MIT Press.
Austin, C. C., Bloom, T., Dallmeier-Tiessen, S., Khodiyar, V. K., Murphy, F., Nurnberger, A.,
et al. (2017). Key components of data publishing: Using current best practices to
develop a reference model for data publishing. International Journal on Digital Libraries,
18(2), 77–92. doi: 10.1007/s00799-016-0178-2
Borycz, J. and Carroll, B. (2018) Managing digital research objects in an expanding
science ecosystem: 2017 conference summary. Data Science Journal, 17. doi: 10.5334/dsj-2018-016
Commission on Evidence-Based Policymaking (2017) The Promise of Evidence-Based
Policymaking. https://www.cep.gov/cep-final-report.html
Foster, I. and Gannon, D. B. (2017) Cloud Computing for Science and Engineering.
Cambridge, MA: MIT Press.
Gonçalves, R. S., O’Connor, M. J., Martínez-Romero, M., Egyedi, A. L., Willrett, D.,
Graybeal, J. and Musen, M. A. (2019) The CEDAR workbench: An ontology-assisted
environment for authoring metadata that describe scientific experiments. Preprint,
arXiv:1905.06480
InterPARES2 Project (2008) A framework of principles for the development of policies,
strategies and standards for the long-term preservation of digital records.
Lane, J. (2016) Big data for public policy: The quadruple helix. Journal of Policy Analysis and Management, 35(3). doi: 10.1002/pam.21921
Lee, C.A. (2010) Open Archival Information System (OAIS) reference model. In M. J. Bates
and M. N. Maack (eds), Encyclopedia of Library and Information Sciences (3rd edition).
Boca Raton, FL: CRC Press.
Park, J.-R. (2009) Metadata quality in digital repositories: A survey of the
current state of the art. Cataloging & Classification Quarterly, 47, 213–228. doi:
10.1080/01639370902737240
Rao, R. and Card, S. K. (1994) The Table Lens: Merging graphical and symbolic
representations in an interactive focus+context visualization for tabular information.
In CHI ’94: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
New York: ACM, pp. 318–322. doi: 10.1145/191666.191776
Riley, J. (2017) Understanding Metadata: What Is Metadata, and What Is It For?: A Primer.
Bethesda, MD: NISO Press.
METADATA FOR SOCIAL SCIENCE DATASETS
Rücknagel, J., Vierkant, P., Ulrich, R., Kloska, G., Schnepf, E., Fichtmüller, D. et al. (2015)
Metadata schema for the description of research data repositories: version 3.0. doi:
10.2312/re3.008
Vardigan, M., Heus, P. and Thomas, W. (2008) Data documentation initiative: Toward
a standard for the social sciences. International Journal of Digital Curation, 3(1). doi:
10.2218/ijdc.v3i1.45
Whyte, A. and Wilson, A. (2010) How to Appraise and Select Research Data for Curation.
Edinburgh: Digital Curation Centre.
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A.,
et al. (2016) The FAIR Guiding Principles for scientific data management and
stewardship, Scientific Data, 3, 160018. doi: 10.1038/sdata.2016.18
Ylikoski, P. (2017) Social mechanisms. In S. Glennan and P. Illari (eds), The Routledge
Handbook of Mechanisms and Mechanical Philosophy. London: Routledge.
Notes
1 The FAIR guidelines have been extended from scholarly texts to datasets (Wilkinson
et al., 2016).
2 https://www.w3.org/TR/vocab-dcat/
3 The definition is from the W3C DCAT specification; see note 2.
4 https://sdmx.org/
5 On the categories of metadata, see Riley (2017).
6 https://orcid.org/
7 https://www.doi.org/
8 https://www.w3.org/TR/vocab-dcat/
9 https://schema.datacite.org/
10 ISO 19115-1:2014, Geographic Information – Metadata – Part 1: Fundamentals; see https://www.iso.org/
11 https://schema.org/
12 https://schema.org/Dataset
13 https://www.w3.org/TR/shacl/
14 https://joinup.ec.europa.eu/release/dcat-ap/11
15 This differs from library or archival collections, which are usually thematically related,
and for which the selection of items for inclusion is defined by an express collection
policy.
16 The challenges of metadata for data streams are related to the cataloguing of different
editions of a work and of serials in a text-based library.
17 https://wikidata.org/
18 https://duraspace.org/vivo/about/
19 https://ddialliance.org/Specification/RDF/XKOS
20 http://www.obofoundry.org/
21 https://www.icpsr.umich.edu/
22 http://ddialliance.org
23 DDI is also used for datasets from other organizations such as the National Opinion
Research Center (NORC).
24 https://www.cessda.eu/
25 https://www.europeansocialsurvey.org/data/
26 https://www.ands.org.au/
27 There are additional collections at http://data.census.gov, http://gss.norc.org, http://electionstudies.org, http://psidonline.isr.umich.edu, and http://www.nlsinfo.org
28 https://www.earthcube.org/
29 https://pds.nasa.gov/
30 https://www.uniprot.org/
31 http://www.rcsb.org/
32 re3data.org
33 https://datadryad.org/
34 https://datacite.org/
35 https://www.crossref.org/
36 https://www.hl7.org/
37 https://www.genome.jp/kegg/
38 See also, for example, http://www.dcc.ac.uk/digital-curation/planning-preservation
39 https://clockss.org/
40 http://irods.org
41 The policies are based on the International Research on Permanent Authentic Records
in Electronic Systems (InterPARES) standard; see https://interparestrust.org/
42 http://www.dcc.ac.uk/sites/default/files/DRAMBORA_Interactive_Manual%5B1%5D.pdf; see also http://www.dcc.ac.uk/resources/repository-audit-and-assessment/drambora
43 http://www.loc.gov/standards/premis/ontology/
44 https://www.w3.org/TR/prov-o/
45 The term microdata is used in two distinct ways. In the context of HTML, it is associ-
ated with embedding Schema.org codes into web pages similar to micro-formats. In
the context of survey data, it refers to individual-level data.
46 https://jupyter.org/
47 http://wiss-ki.eu
48 https://www.colectica.com/
49 https://gssdataexplorer.norc.org/
50 https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
51 https://mmisw.org/
52 https://www.w3.org/TR/vocab-data-cube/
53 https://www.earthcube.org/info/about
54 http://sdmx.org/
55 https://taverna.incubator.apache.org/
56 https://www.myexperiment.org/about
57 https://statswiki.unece.org/display/gsim/Generic+Statistical+Information+Model. GSIM is coordinated with the Common Statistical Production Architecture; see https://unstats.un.org/unsd/nationalaccount/workshops/2015/gabon/BD/CSPA-ENG.pdf
58 https://www.niem.gov/about-niem
59 https://www.openarchives.org/pmh/