Int. J. , Vol. x, No. x, xxxx 1
Copyright © 200x Inderscience Enterprises Ltd.
Towards Linked Provincial Open Data
in Canada
Enayat Rajabi
Shannon School of Business
Cape Breton University
Sydney, NS, Canada
Email: enayat_rajabi@cbu.ca
Abstract: Governments are publishing enormous amounts of open data on the
Web every day in an effort to increase transparency and reusability. Linking data
from multiple sources on the Web enables the performance of advanced data
analytics, which can lead to the development of valuable services and data
products. However, Canada’s provincial government open data portals are
isolated from one another, and remain unlinked to other resources on the Web.
In this paper, we first expose the statistical datasets in Canadian provincial open
data portals as Linked Data, and then integrate them using the RDF Cube
vocabulary, thereby making different open data portals available through a single
search endpoint. We leverage Semantic Web technologies to publish open data
sets taken from two provincial portals (Nova Scotia and Alberta) as RDF (the
Linked Data format), and to connect them to one another. The success of our
approach illustrates its high potential for linking open data sets across Canada,
which will in turn enable greater data accessibility and improved search results.
Keywords: Open data; RDF cube; Linked Data; semantic web
Biographical notes: Enayat Rajabi is an Assistant Professor of Business Analytics
with the Shannon School of Business at Cape Breton University in Canada, as well
as an adjunct professor of Information Management with Dalhousie University’s
School of Information Management. Dr. Rajabi received his Ph.D. in Information
and Knowledge Engineering from the University of Alcala, Spain. His work has
focused on Semantic Web and Linked Data since 2010, and he has published his
related research in several JCR journals.
1. Introduction
The volume and variety of open governmental data is growing exponentially as new
data sources become available on the Web. In particular, the extraordinary volume
of statistical data published by governments around the world is a significant
resource that enables not only increased public transparency and accountability, but
also greater innovation (Jetzek et al., 2019; van Ooijen et al., 2019). For example,
most of the datasets published on the European Commission’s open data portal¹ are
statistical in nature. This portal was designed to allow European institutions and
other bodies access to an array of open datasets in the hopes that they will apply
these data in new and innovative ways, thereby unlocking their economic potential.
Open statistical data are published by governments with the aim of such data being
used, analyzed, and commercialized by a wide range of actors, from professional
statisticians to the lay public. However, publishing and sharing data openly is only
the first step in successfully reusing data. Indeed, as Wallis et al. (2013) inquire, “if
we share data, will anyone use them?” Dawes et al. (2016) have similarly
identified the promotion of data re-use as a key challenge, especially considering
the many purposes it can serve. A survey conducted by the European Commission
(2017) identified five main benefits of re-using open statistical data: innovation,
reduced costs, data harmonization, enhanced business models, and increased
company reliability. For many of the companies interviewed in this study, open data
is a core component of their operations, and it is one of the key resources that
enabled them to start their business.
The Linked Data approach has been widely used by different governments
to increase data re-usability, as it allows data from different disciplines to be
interconnected (Zaveri et al., 2013; Gür et al., 2012; Oren et al., 2008; Caracciolo
et al., 2012). Linked statistical datasets can be applied in various domains for
different purposes. For instance, the integration of statistical data from multiple
sources can enable the performance of advanced data analytics, which can in turn
lead to the production of valuable services and data products (Kalampokis,
Karamanou, et al., 2019). One example of such an application can be seen in the
work of Kämpgen and Harth (2011), who built a data warehouse by leveraging
Linked Data in order to analyze related statistical datasets.
Although open data in Canada is published and shared through an open data
portal,² the re-use of this data is still in its infancy, with little research having
been devoted to this area. Searching for and connecting relevant datasets in
provincial open data portals is costly, as most statistical datasets are published in
spreadsheet formats. Furthermore, processing the data across all of the datasets in
the different open data portals can be a tedious and time-consuming task due to the
lack of uniformity in the structures and vocabularies that are used to express it (Dong
et al., 2017). While some of the provincial data portals have developed an API to
access data, they do not follow a unified standard. This is a significant shortcoming,
as a unified approach to searching multiple linked provincial datasets would enable
users to analyze and compare topically-related datasets with greater ease. In this
paper, we attempt to employ a unified structure to integrate statistical open data from
multiple Canadian sources in order to promote the domestic and national utilization
of this statistical data. Put differently, this research seeks to answer the key question:
“can we integrate the isolated data islands (provincial open datasets) and provide a
search platform to query the unified and linked statistical datasets?”
¹ https://data.europa.eu/euodp/en/data
² http://open.canada.ca
To answer this question, we expose the statistical datasets in the provincial
open data portals as Linked Data, and integrate them using the RDF Cube vocabulary.¹
We leveraged this World Wide Web Consortium (W3C) standard vocabulary to expose the current
statistical and multi-dimensional data in the Resource Description Framework (RDF) format.
The RDF Cube vocabulary is designed to publish multi-dimensional data in a way that
links all related data. This integration makes different open data portals available
through a single search endpoint, and consists of four phases:
a) Data selection: a common statistical dataset is selected from two or more
open data portals;
b) Vocabulary preparation: a new vocabulary is defined or an existing
vocabulary is reused;
c) Defining a data structure: identifying a structure that will allow the
selected datasets to be integrated; and
d) Data Storage: storing the exposed data in a triple store to run the queries.
The statistical datasets in the provincial open data portals are currently available in
raw data formats (e.g., CSV or XLS), and are converted to RDF when connected
to other datasets in another portal. Thus, RDF links connect common statistical
datasets to each other, as well as to the other datasets in the different provincial data
portals. The process of converting a statistical dataset to RDF is described in
Section 2. Furthermore, datasets can also be connected to external sources and
ontologies. Details relating to publishing the data and interlinking provincial
data with other vocabularies are presented in Section 3. As a proof of concept, the
proposed approach was implemented using statistical datasets taken from two
Canadian provincial data portals (Nova Scotia and Alberta). The selected datasets,
which were relevant to the same subject, were manually downloaded from the data
portal websites and transformed to RDF after defining a data structure for both.
Section 4 details the process that was used to link the two statistical datasets. Finally,
Section 5 provides a discussion of the lessons learned as a result of this research.
¹ https://www.w3.org/TR/vocab-data-cube/
2. Background
2.1. RDF Cube
Statistical data consist of measures (e.g., birth rate) and dimensions describing the
measures (e.g., province, country, and year). Statistical data are structured as data
cubes and can be modelled as RDF graphs using the RDF Cube vocabulary, which
is the global standard for publishing multi-dimensional data in RDF. As such, the
RDF Cube vocabulary can be used to promote the utilization of statistics, both
domestically and internationally. The RDF Cube vocabulary has been widely
accepted by the Semantic Web community, and it is used to represent a large
proportion of existing linked statistical datasets on the Web (Martin et al., 2015).
One of the reasons for the RDF Cube vocabulary’s popularity is that it allows
publishers to integrate and slice across their datasets. Data publishers can also
leverage the RDF Cube to publish the information model along with the raw data
using common terms for the dimensions and units in their datasets. This vocabulary
is compatible with the Statistical Data and Metadata eXchange XML format
(SDMX) (van Ooijen et al., 2019), which is defined by an initiative established in
2001 to support the exchange of statistical data.
The main element in an open statistical dataset is the observation, which
consists of one or more dimensions, one or more measures, and optional attributes.
The RDF Data Cube Vocabulary uses one subject per observation, with the
dimensions, measures, and attributes being attached to the subject of the observation
in the dataset. An observation is connected to its dataset using an outgoing link (qb:dataSet).
Figure 1 depicts an observation from an open dataset (a provincial dataset in
Canada), which shows the quantity of an incident (124) in a given area (NS) for a
particular time period (2015).
Figure 1: An example of an observation.
We applied the same concept in order to present statistical data records, with
the RDF Cube serving as the building block for integrating the two datasets used in
our approach.
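To make the shape of an observation concrete, the following minimal Python sketch models the example from Figure 1 (a quantity of 124 in area NS for 2015) as one subject with its dimensions and measure attached, and prints the result as N-Triples lines. It uses only the standard library rather than an RDF toolkit; the base namespace http://www.example.org/, the dataset name "incidents", and the dimension and measure property paths are assumptions made for illustration, not the identifiers used in this work.

```python
from dataclasses import dataclass

QB = "http://purl.org/linked-data/cube#"          # RDF Data Cube namespace
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
XSD_INT = "<http://www.w3.org/2001/XMLSchema#integer>"
EX = "http://www.example.org/"                    # hypothetical base namespace


@dataclass(frozen=True)
class Observation:
    obs_id: str
    dataset: str        # hypothetical dataset identifier
    dimensions: dict    # dimension name -> value
    measures: dict      # measure name -> integer value

    def triples(self):
        """Yield (subject, predicate, object) strings in N-Triples syntax."""
        subject = f"<{EX}{self.obs_id}>"
        yield (subject, RDF_TYPE, f"<{QB}Observation>")
        # The outgoing link from the observation to its dataset.
        yield (subject, f"<{QB}dataSet>", f"<{EX}{self.dataset}>")
        for name, value in self.dimensions.items():
            yield (subject, f"<{EX}dimension/{name}>", f'"{value}"')
        for name, value in self.measures.items():
            yield (subject, f"<{EX}measure/{name}>", f'"{value}"^^{XSD_INT}')


obs = Observation(
    obs_id="obs-01",
    dataset="incidents",
    dimensions={"refArea": "NS", "refPeriod": "2015"},
    measures={"quantity": 124},
)
lines = [" ".join(triple) + " ." for triple in obs.triples()]
print("\n".join(lines))
```

Every line shares the same subject, which mirrors the "one subject per observation" design described above.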
2.2. Related Works
Kalampokis, Zeginis, et al. (2019) investigated and proposed approaches
to dealing with the modeling challenges associated with creating a linked
statistical open dataset. The present research adopts some of their proposed
guidelines—for example, those related to the naming of dimensions, measures,
and dataset structure—in creating a provincial dataset structure using RDF Cube.
Additionally, Asano et al. (2014) proposed a software template based on the
RDF Cube standard, which could be used to manipulate statistical data (browse
and edit) in RDF; however, this tool was unavailable for use in this research. In
terms of data integration using Linked Data, Zaveri et al. (2013) translated over
50 statistical health-related datasets into a Linked Data format using RDF Data
Cube Vocabulary, which were then interlinked in order to lower the barrier for
data re-use and integration. Similarly, in their work for the Japanese Statistical
Center, Matsuda et al. (2018) were able to publish statistical and governmental
data with around 300 million triples by first publishing statistical datasets as
RDF and linking their vocabularies to existing vocabularies on the Web.
Several other studies, including Bukhari and Baker (2013), Salas et al.
(2012) and Máchová et al. (2018), have focused on reusing open data in national
open data portals. For example, Salas et al. (2012) proposed using an open data
framework to make open data portals more discoverable and intelligible for
potential data reuse purposes in the agriculture domain. However, their research
utilized a tagging and annotating approach and did not consider linking different
open datasets in their framework. Alternatively, Máchová et al. (2018) proposed
using a usability evaluation approach to identify deficiencies in the usability of
several individual data portals, including Canada’s open data portal. Dong et al. (2017)
also examined this issue within a Canadian context, presenting an overview of
the status and issues associated with the open data provided by seven Canadian
cities.
The present research attempts to integrate the different Canadian statistical
open data portals, which, to the best of our knowledge, has yet to be studied. To
this end, we present an approach that enables us to publish statistical open
datasets, which are currently available in provincial open data portals, and to link
them to each other based on common vocabularies on the Web.
3. Materials and Methods
This section details the context of our study and the methodology we will employ
to publish and link provincial open data portals in Canada.
3.1 Provincial Open Datasets in Canada
We performed an exploratory analysis by gathering metadata from existing
provincial open data portals in Canada. Table 1 shows the results of this analysis,
including each portal’s number of open datasets and its current web address. Our
findings revealed that 11 provinces and territories had published approximately
11,771 datasets in different domains ranging from “Business and Economy” to
“Nature and Environment.” Notably, most of the open data portals used different
standards to categorize their datasets, with some not using any categories at all. In
some cases, it was necessary to explore the entire website to find the published open
datasets, which was both tedious and time-consuming. As Table 1 shows, British
Columbia and Alberta published more open data than the other provinces. Many of
these portals presented their data using different formats, including CSV, JSON, and
Excel. Although a few of these data portals allowed users to export data in RDF
format, they did not follow Linked Data vocabulary standards such as the RDF
Cube vocabulary, and they did not link their datasets to those of other provinces.
Table 1: Provincial open data portals in Canada based on province/territory.
Province/Territory           Number of datasets
British Columbia             2,939
Alberta                      2,777
Ontario                      2,656
Yukon                        1,177
Manitoba                     789
Nova Scotia                  575
Saskatchewan                 354
Prince Edward Island         202
New Brunswick                121
Northwest Territories        100
Newfoundland & Labrador      81
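As a quick consistency check on Table 1, the per-portal counts can be summed and ranked in a few lines of Python. The figures are copied from the table; the variable names are illustrative only.

```python
# Dataset counts per province/territory, copied from Table 1.
DATASETS = {
    "British Columbia": 2939,
    "Alberta": 2777,
    "Ontario": 2656,
    "Yukon": 1177,
    "Manitoba": 789,
    "Nova Scotia": 575,
    "Saskatchewan": 354,
    "Prince Edward Island": 202,
    "New Brunswick": 121,
    "Northwest Territories": 100,
    "Newfoundland & Labrador": 81,
}

total = sum(DATASETS.values())
# Portals ordered from most to fewest datasets; keep the two largest.
top_two = sorted(DATASETS, key=DATASETS.get, reverse=True)[:2]
print(len(DATASETS), total, top_two)
```

The sum over the 11 portals reproduces the total of 11,771 datasets reported in the text, with British Columbia and Alberta as the two largest publishers.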
3.2 Methodology
We propose a three-layer architecture to integrate all of the provincial open data
portals in Canada, wherein open datasets are exposed as RDF and linked to the
other datasets based on a common vocabulary and ontology. As depicted in
Figure 2, different statistical open datasets are extracted from different data
portals and converted to an RDF format based on a common predefined
vocabulary.
Figure 2: Architecture of the linked provincial datasets.
Since most of the open data portals do not provide an API to retrieve data, it
will be necessary to manually download the statistical datasets directly from
those websites. Based on the subject, a vocabulary is used to transform a
statistical dataset from its raw format (which can be CSV or XLS) to the RDF
format. In this process, data can be also linked to external sources or
vocabularies. For example, if there is a disease in a dataset, it can be linked to
the disease ontology on the Web. Once the integration process has been
completed, the integrated data is stored in a data store (known as the “triple
store”), and a query service is provided as a single search point to retrieve query
results (see Figure 2).
To map the statistical data to a graph database, we leveraged the RDF Data
Cube Vocabulary discussed in Section 2. The core concept of the Data Cube
Vocabulary is an observation class (qb:Observation), which is used to mark every
statistical observation as being part of a Data Cube. Every observation must
follow a specific structure that is defined using the class,
qb:DataStructureDefinition (DSD), and referenced by a dataset resource (DS) of
type qb:DataSet. Since every observation should refer to one specific dataset
(which again refers to the corresponding DSD), the structure of the observation
is fully specified. DSD components are defined as a set of dimensions
(qb:DimensionProperty), attributes (qb:AttributeProperty), and measures
(qb:MeasureProperty) to encode the semantics of the observations. These
component properties are also used to link the corresponding elements of
dimensions, measure values, and units with the respective observational
resource. Furthermore, it is possible to define groups and slices of observations,
as well as hierarchical orders of dimension, using respective concepts.
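The role of the DSD as a contract that every observation must satisfy can be illustrated with a small, library-free Python sketch. The component names mirror those used for the "cause of death" dataset later in the paper; treating the rank measure as optional, and representing an observation as a plain dictionary, are assumptions made only for this example.

```python
# A simplified stand-in for a qb:DataStructureDefinition.
DSD = {
    "dimensions": {"causeOfDeath", "refPeriod"},
    "measures": {"quantity"},
    "optional_measures": {"rank"},      # assumption: not every portal supplies a rank
    "dataset_attributes": {"refArea"},  # attached at the dataset level, not per observation
}


def missing_components(observation, dsd=DSD):
    """Return the required dimensions and measures absent from an observation."""
    required = dsd["dimensions"] | dsd["measures"]
    return sorted(required - set(observation))


complete = {"causeOfDeath": "Neoplasms", "refPeriod": 2015, "quantity": 2699}
partial = {"causeOfDeath": "Neoplasms"}

print(missing_components(complete))
print(missing_components(partial))
```

An empty result means the observation carries every component the structure requires; anything listed would have to be supplied before the record could be published as part of the cube.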
Based on what we have described in this section, the integration of open
statistical datasets will be achieved via the following steps (see Figure 3):
a) Data selection: In this phase, a statistical dataset common to two or
more provincial open data portals is selected. Since the goal of
this integration is to provide a single-point search mechanism, it is
especially beneficial to identify datasets that are common to several
portals. Regardless, any dataset in an open data portal can be imported
into the data store.
b) Defining the dataset and data structure (DSD): The structure of an
open dataset should be defined during the integration process. Datasets
with the same subject but different structures can be unified using the Data
Cube standard.
c) Vocabulary preparation: The terms needed to express the target
data as RDF are identified. When a standard vocabulary exists, we use it; when a
standard vocabulary does not exist, we define a new one.
d) Conversion of observations: After defining the structure of each
observation, we convert them to the RDF format. Once each dataset
has been converted, it is imported into a triple store for further analysis
and querying.
Figure 3: Open statistical dataset integration process.
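The integration steps can be sketched end to end in plain Python. The example below reads two small CSV extracts with different column layouts, maps each onto one shared structure, and collects the results in a single list that stands in for the triple store. The cause, year, and quantity values are taken from Table 2; the CSV column names and the per-province column mappings are hypothetical, since the actual portal exports may be laid out differently.

```python
import csv
import io

# Hypothetical CSV extracts: values from Table 2, column names assumed.
NS_CSV = "Cause,Year,Deaths\nNeoplasms,2015,2699\n"
AB_CSV = "Calendar Year,Cause of Death,Total Deaths\n2016,Malignant neoplasms of colon,477\n"

# Per-portal mapping from source column to the shared component name.
COLUMN_MAPS = {
    "NS": {"Cause": "causeOfDeath", "Year": "refPeriod", "Deaths": "quantity"},
    "AB": {
        "Cause of Death": "causeOfDeath",
        "Calendar Year": "refPeriod",
        "Total Deaths": "quantity",
    },
}


def to_observations(csv_text, province):
    """Convert one portal's CSV text into observations with a common structure."""
    mapping = COLUMN_MAPS[province]
    observations = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        obs = {target: row[source] for source, target in mapping.items()}
        obs["refPeriod"] = int(obs["refPeriod"])
        obs["quantity"] = int(obs["quantity"])
        obs["refArea"] = province
        obs["id"] = f"obs-{index:02d}"
        observations.append(obs)
    return observations


# A plain list stands in for the triple store in this sketch.
store = to_observations(NS_CSV, "NS") + to_observations(AB_CSV, "AB")
for obs in store:
    print(obs)
```

Once both extracts share the same keys, a single query over `store` can address records from either portal, which is the purpose of the unified structure.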
4. Implementation and Discussion
To implement the proposed approach, we present a scenario wherein one
statistical dataset common to two provincial open data platforms is selected. In
this scenario, we selected the “cause of death” dataset from the Nova Scotia and
Alberta data portals. Table 2 shows a set of records for this dataset in each
province.
Table 2: Cause of death dataset in Nova Scotia and Alberta.
Cause of death                                           Year   Quantity   Dataset area
Neoplasms                                                2015   2,699      Nova Scotia
Diseases of the Circulatory System                       2015   2,317      Nova Scotia
Mental and Behavioral Disorders                          2014   591        Nova Scotia
Diseases of the Nervous System                           2014   477        Nova Scotia
Malignant neoplasms of colon                             2016   477        Alberta
Mental and behavioral disorders due to use of alcohol    2015   193        Alberta
Alzheimer's disease                                      2014   299        Alberta
All other diseases of nervous system                     2014   253        Alberta
As shown in Table 2, the datasets for both provinces consist of two
dimensions (year and cause of death) and one measure (quantity, i.e., the number of deaths). For example,
the first row of Table 2 shows 2,699 deaths from neoplasms in Nova Scotia
in 2015. It is possible that each open data portal contains other datasets with
additional fields (e.g., ranking or gender). We downloaded the datasets in
CSV format directly from the provincial open data portals, but the Nova Scotia
dataset is also accessible in RDF format via their Socrata API.¹ However, the
RDF does not follow the W3C statistical standard for publishing data in the
proper format. Therefore, as noted above, we unified the two datasets by
transforming the data into RDF based on the RDF Cube vocabulary standard.
Prior to this transformation, we conducted a data cleaning step wherein we
assigned additional vocabularies to each dataset.
Since many of the causes of death in both datasets were related to diseases,
we linked each disease to the disease ontology,² which is an open-source
ontology organized around inheritable diseases, environmental factors leading
to disease, and the infectious origins of diseases. To find the corresponding
ontological term for each disease in the downloaded dataset, we used the disease
ontology lookup service. This type of vocabulary linking can be used both to
categorize the diseases in each dataset, and to connect the common terms in both
datasets using a common URI (Uniform Resource Identifier) (see Table 3). The
information for each ontological term was retrieved by manually searching for
the cause of death in the disease ontology and assigning the relevant information to
each disease in the open dataset. For example, the disease ontology website lists
campylobacteriosis as a kind of gastrointestinal disease; as such, we linked its
URI to the corresponding record in the dataset.
¹ https://dev.socrata.com/
² http://disease-ontology.org/
Table 3: Connecting diseases to the disease ontology.
Disease ontology URI                              Disease ontology term
http://purl.obolibrary.org/obo/HP_0002664         Neoplasm
http://purl.obolibrary.org/obo/UBERON_0015228     Circulatory Organ
http://purl.obolibrary.org/obo/DOID_10652         Alzheimer's disease
http://purl.obolibrary.org/obo/UBERON_0001016     Nervous system
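The outcome of this manual linking step can be captured as a simple lookup table, sketched below in Python. The terms and URIs are those listed in Table 3; the pairing of each dataset label with a term is an assumption made for illustration, since the paper describes the matching as a manual search rather than giving the full mapping.

```python
# Ontology terms and URIs copied from Table 3.
TERM_URIS = {
    "Neoplasm": "http://purl.obolibrary.org/obo/HP_0002664",
    "Circulatory Organ": "http://purl.obolibrary.org/obo/UBERON_0015228",
    "Alzheimer's disease": "http://purl.obolibrary.org/obo/DOID_10652",
    "Nervous system": "http://purl.obolibrary.org/obo/UBERON_0001016",
}

# Hypothetical result of the manual matching: dataset label -> ontology term.
LABEL_TO_TERM = {
    "neoplasms": "Neoplasm",
    "diseases of the circulatory system": "Circulatory Organ",
    "alzheimer's disease": "Alzheimer's disease",
    "diseases of the nervous system": "Nervous system",
}


def link_cause(label):
    """Return the ontology URI for a cause-of-death label, or None if unmatched."""
    term = LABEL_TO_TERM.get(label.strip().lower())
    return TERM_URIS.get(term)


for label in ("Neoplasms", "Alzheimer's disease", "Malignant neoplasms of colon"):
    print(label, "->", link_cause(label))
```

Labels that return `None` are the ones that would still need a manual review against the ontology.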
After linking the diseases to their corresponding disease ontologies, we
exposed all of the necessary items of an observation (measure, dimension, and
attributes) using the RDF Cube vocabulary. Beyond reusing existing
vocabularies (e.g., Dublin Core) to present the attributes of an observation, we
defined a new vocabulary with the aim of making it compatible with other
vocabularies. We also used certain aspects of the Statistical Data and Metadata
eXchange (SDMX) vocabulary, which is associated with statistical regions.
URIs were defined based on unique identifiers for each item in the dataset using
the following naming convention:
Base_URI/province_code/dataset_name/observation_id
This URI is then used to find an observation in the entire system. The base
URI is a web address, which is a common element in all datasets and
observations. Each province is assigned a code (e.g., NS for Nova Scotia), while
the observation identifier (observation_id) serves as a unique identifier for each
individual observation in a dataset. For example, the following URI could be
used as an identifier for a record from Table 2:
http://www.example.org/NS/Cause_of_Death/obs-01
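A small helper makes the naming convention explicit. The sketch below is illustrative only: the function name, the choice to replace spaces with underscores, and the percent-encoding of each path segment are assumptions rather than details of the implementation used in this work.

```python
from urllib.parse import quote

BASE_URI = "http://www.example.org"  # the example base URI used in the text


def observation_uri(province_code, dataset_name, observation_id, base=BASE_URI):
    """Build Base_URI/province_code/dataset_name/observation_id."""
    segments = [province_code, dataset_name.replace(" ", "_"), observation_id]
    # Percent-encode each segment so unusual characters cannot break the path.
    encoded = [quote(segment, safe="") for segment in segments]
    return "/".join([base.rstrip("/")] + encoded)


uri = observation_uri("NS", "Cause of Death", "obs-01")
print(uri)
```

For the inputs shown, the helper reproduces the example identifier given above.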
To define a structure for the “cause of death” dataset, we create a
“DataStructure” scheme, which is shown below. Each structure scheme contains
the following attributes: Data Cube dimensions, measures, description, and area
of observation, in this case, the province.
od:causeOfDeath-structure a qb:DataStructureDefinition ;
    rdfs:comment "cause of death structure"@en ;
    # Dimensions and measures attached to each observation
    qb:component
        [ qb:dimension od:causeOfDeath ] ,
        [ qb:dimension sdmx-dimension:refPeriod ] ,
        [ qb:measure od-measure:quantity ] ,
        [ qb:measure od-measure:rank ] ;
    # The area (province) is attached once, at the dataset level
    qb:component
        [ qb:attribute sdmx-dimension:refArea ;
          qb:componentRequired "true"^^xsd:boolean ;
          qb:componentAttachment qb:DataSet ] .
After designing the dataset structure, the RDF Cube vocabulary is used to
define the metadata for each statistical dataset. The following dataset attributes are
considered in the metadata: title, unique address of dataset (URI), information
about the data publisher, area of dataset (e.g., Nova Scotia), published date,
dataset subject, and the web address from which the dataset is derived. The
following example shows a dataset definition in RDF Cube vocabulary derived
from the province of Alberta's "cause of death" dataset:
od:dataset-causeOfDeath-AB a qb:DataSet ;
    qb:structure od:causeOfDeath-structure ;
    dct:creator "Alberta Open Government" ;
    dct:title "Causes of death in Alberta"@en ;
    dct:issued "2016-04-11"^^xsd:date ;
    dct:publisher "Open Government Alberta"@en ;
    dct:subject sdmx-subject:1.4 , od:CauseOfDeath ;
    prov:wasDerivedFrom <https://open.alberta.ca/> ;
    sdmx-dimension:refArea od:Alberta .
Note that the subject created for the above dataset (od:CauseOfDeath) can be
used in different areas and with other data portals. We also linked the dataset to
the SDMX vocabulary. With the dataset obtained, each observation and its
related information are also described in the RDF format. The following snippet
shows a single observation in the Nova Scotia dataset, which shows there were 1,330
deaths from “acute myocardial infarction” in 2001. As can also be seen, this
observation is linked to the “cardiovascular system disease” category of the
disease ontology by the following URL:
http://purl.obolibrary.org/obo/DOID_1287.
od:obs-ab-1 a qb:Observation ;
    qb:dataSet od:dataset-causeOfDeath-AB ;
    od:category "cardiovascular system disease" ;
    od:causeOfDeath "Acute myocardial infarction" ;
    sdmx-dimension:refPeriod 2001 ;
    od-measure:quantity 1330 ;
    skos:broader <http://purl.obolibrary.org/obo/DOID_1287> .
This approach can be easily extended to the other provincial open data portals
as well as other types of datasets, such as censuses. Using the above-mentioned
components (i.e., the dataset and its observations), we generated an RDF dataset
that included all of the observations downloaded from both provincial data
portals. We wrote a Python program to convert each provincial open dataset to
RDF using Python RDFlib.¹ This library was used for three purposes: 1) to create
the main structure of the datasets and their corresponding observations; 2) to
create the structure of each dataset; and 3) to generate the observations. We then
created a semantic graph that included all of the observations from both
provincial datasets and loaded it into a data store (triple store) for further analysis
and querying. Specifically, we used Apache Jena Fuseki² as the triple store and
wrote two SPARQL queries to compare the two datasets. The scripts of the
SPARQL queries are provided in the Appendix. To construct the queries, we
designed the questions such that the two isolated data portals could be seen as a
single endpoint. In the first query, we asked: “What was the most prevalent cause
of death in both provinces in 2015?” As Figure 4 illustrates, “chlamydia” in
Nova Scotia and “chronic heart diseases” in Alberta accounted for the most
deaths in the datasets.
Figure 4. Query results for the first query
¹ https://rdflib.readthedocs.io/en/stable/
² https://jena.apache.org/documentation/fuseki2/
In the second query, we asked: “What were the common causes of death in both
provinces in 2014?” According to the results, “gastrointestinal system disease” was
a common disease category in both datasets (see Figure 5), with 189 deaths due to
“alcoholic liver disease” in Alberta and 609 deaths due to “clostridium difficile”
in Nova Scotia.
To verify our results, we explored the CSV files that had been downloaded
from the open data portals and checked the number of cases manually for these
diseases. All of the numbers matched up.
Figure 5: Results for the second query.
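The logic behind the two questions can be illustrated without a triple store. The Python sketch below is not the SPARQL used in this work (those scripts are in the Appendix); it applies the same two ideas, a per-province maximum and a cross-province intersection of categories, to the eight sample rows of Table 2 only. Because the sample is a small excerpt, its answers are not expected to match Figures 4 and 5, which were computed over the full datasets. The category assignments in `CATEGORY_OF` are assumptions for illustration.

```python
from collections import defaultdict

# Sample rows copied from Table 2: (cause, year, quantity, province).
ROWS = [
    ("Neoplasms", 2015, 2699, "Nova Scotia"),
    ("Diseases of the Circulatory System", 2015, 2317, "Nova Scotia"),
    ("Mental and Behavioral Disorders", 2014, 591, "Nova Scotia"),
    ("Diseases of the Nervous System", 2014, 477, "Nova Scotia"),
    ("Malignant neoplasms of colon", 2016, 477, "Alberta"),
    ("Mental and behavioral disorders due to use of alcohol", 2015, 193, "Alberta"),
    ("Alzheimer's disease", 2014, 299, "Alberta"),
    ("All other diseases of nervous system", 2014, 253, "Alberta"),
]

# Hypothetical category assignments (only two causes are categorized here).
CATEGORY_OF = {
    "Diseases of the Nervous System": "nervous system",
    "All other diseases of nervous system": "nervous system",
}


def top_cause_per_province(rows, year):
    """For each province, return the (cause, quantity) with the most deaths in a year."""
    best = {}
    for cause, row_year, quantity, province in rows:
        if row_year != year:
            continue
        if province not in best or quantity > best[province][1]:
            best[province] = (cause, quantity)
    return best


def common_categories(rows, year, category_of, provinces=("Nova Scotia", "Alberta")):
    """Return the categories that occur in every listed province in a given year."""
    by_province = defaultdict(set)
    for cause, row_year, _quantity, province in rows:
        if row_year == year and cause in category_of:
            by_province[province].add(category_of[cause])
    return set.intersection(*(by_province.get(p, set()) for p in provinces))


print(top_cause_per_province(ROWS, 2015))
print(common_categories(ROWS, 2014, CATEGORY_OF))
```

Both functions read from one combined list of rows, which is the in-memory analogue of querying the two portals through a single endpoint.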
5. Conclusion
In this paper, we proposed a semantic web approach to integrating open statistical
data using the RDF Cube vocabulary. As a proof of concept, we implemented our
approach using two provincial data portals in Canada. Although we were able to
successfully transform one statistical dataset common to both provinces to the
Linked Data format, we encountered a few problems, such as cumbersome
conversion and a lack of common categories and vocabularies in the designed
process.
One of the key lessons we learned from exposing open data as RDF was in relation
to the integration phase, namely, that the observation descriptions did not follow a
universal standard. Many open data portals use free text to describe the
observations in statistical datasets without assigning them a vocabulary. This lack
of a universal standard creates the need to manually review the datasets to
unify the concepts and observations. Another lesson learned relates to the
structure of the datasets. Specifically, datasets with the same subject, but in different
data portals, may have different dimensions. For example, one dataset may contain
the quantity of observations, while the other may contain quantity and ranking.
Having different dimensions and measures makes the integration process
complicated, as some dimensions will not be used in the final integrated search
mechanism.
Linking observation items to the external vocabularies (e.g., the disease ontology)
was another challenge that was encountered in the interlinking phase. However,
this issue might be addressed by developing a software program to retrieve the
matched item from the external vocabulary.
We will continue to work towards linking more datasets from diverse open
provincial data portals. To this end, designing an ontology that is capable of
covering all of the open statistical data will be a key next step in this research
programme.
Acknowledgments
The work presented in this paper was partly funded by an NSERC (Natural Sciences
and Engineering Research Council) Discovery Grant (RGPIN-2020-05869):
Semantic Web Analysis over the Nova Scotia Open Data.
References
Asano, Y., Iwayama, M., Takeda, H., Koide, S., Kato, F., & Kobayashi, I.
(2014). A Template for Handling Statistical Data in RDF. Second
International Workshop on Semantic Statistics (SemStats2014).
Bukhari, S. A. C., & Baker, C. (2013). The Canadian health census as Linked
Open Data: Towards policy making in public health.
Caracciolo, C., Stellato, A., Rajbahndari, S., Morshed, A., Johannsen, G.,
Jaques, Y., & Keizer, J. (2012). Thesaurus maintenance,
alignment, and publication as linked data: The AGROVOC use case.
International Journal of Metadata, Semantics and Ontologies, 7(1), 65–75.
https://doi.org/10.1504/IJMSO.2012.048511
Dawes, S. S., Vidiasova, L., & Parkhimovich, O. (2016). Planning and
designing open government data programs: An ecosystem approach.
Government Information Quarterly, 33(1), 15–27.
https://doi.org/10.1016/j.giq.2016.01.003
Dong, H., Singh, G., Attri, A., & El Saddik, A. (2017). Open data-set of seven
Canadian Cities. IEEE Access, 5, 529–543.
https://doi.org/10.1109/ACCESS.2016.2645658
European Commission. (2017). Re-using open data.
Gür, N., Díaz, L., & Kauppinen, T. (2012). GI Systems for Public Health with
an Ontology Based Approach.
Jetzek, T., Avital, M., & Bjorn-Andersen, N. (2019). The Sustainable Value of
Open Government Data. Journal of the Association for Information
Systems, 702–734. https://doi.org/10.17705/1jais.00549
Kalampokis, E., Karamanou, A., & Tarabanis, K. (2019). Interoperability
Conflicts in Linked Open Statistical Data. Information, 10(8), 249.
https://doi.org/10.3390/info10080249
Kalampokis, E., Zeginis, D., & Tarabanis, K. (2019). On modeling linked open
statistical data. Journal of Web Semantics, 55, 56–68.
https://doi.org/10.1016/j.websem.2018.11.002
Kämpgen, B., & Harth, A. (2011). Transforming statistical linked data for use
in OLAP systems. ACM International Conference Proceeding Series,
33–40. https://doi.org/10.1145/2063518.2063523
Máchová, R., Hub, M., & Lnenicka, M. (2018). Usability evaluation of open
data portals: Evaluating data discoverability, accessibility, and
reusability from a stakeholders’ perspective. Aslib Journal of
Information Management, 70(3), 252–268.
https://doi.org/10.1108/AJIM-02-2018-0026
Martin, M., Abicht, K., Stadler, C., Auer, S., Ngomo, A. C. N., & Soru, T.
(2015). CubeViz—Exploration and Visualization of Statistical Linked
Data. WWW 2015 Companion - Proceedings of the 24th International
Conference on World Wide Web, 219–222.
https://doi.org/10.1145/2740908.2742848
Matsuda, J., Mizutani, A., Asano, Y., Yamamoto, D., Takeda, H., Ohmukai, I.,
Kato, F., Koide, S., Harada, H., & Nishimura, S. (2018). Publication of
statistical linked open data in Japan. Lecture Notes in Computer Science
(Including Subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics), 11341 LNCS, 307–319.
https://doi.org/10.1007/978-3-030-04284-4_21
Towards Linked Open Provincial Data in Canada
Oren, E., Delbru, R., Catasta, M., Cyganiak, R., Stenzhorn, H., & Tummarello,
G. (2008). Sindice.com: A document-oriented lookup index for open
linked data. International Journal of Metadata, Semantics and
Ontologies, 3(1), 37–52. https://doi.org/10.1504/IJMSO.2008.021204
Salas, P. E. R., Martin, M., Mota, F. M. Da, Auer, S., Breitman, K., &
Casanova, M. A. (2012). Publishing statistical data on the web.
Proceedings - IEEE 6th International Conference on Semantic
Computing, ICSC 2012, 285–292.
https://doi.org/10.1109/ICSC.2012.16
van Ooijen, C., Ubaldi, B., & Welby, B. (2019). A data-driven public sector.
33. https://doi.org/10.1787/09ab162c-en
Wallis, J. C., Rolando, E., & Borgman, C. L. (2013). If We Share Data, Will
Anyone Use Them? Data Sharing and Reuse in the Long Tail of
Science and Technology. PLoS ONE, 8(7), e67332.
https://doi.org/10.1371/journal.pone.0067332
Zaveri, A., Lehmann, J., Auer, S., Hassan, M. M., Sherif, M. A., & Martin, M.
(2013). Publishing and interlinking the global health observatory
dataset. Semantic Web, 4(3), 315–322. https://doi.org/10.3233/SW-
130102