ARTICLE
Schema on read modeling approach as a basis of big data
analytics integration in EIS
Slađana Janković, Snežana Mladenović, Dušan Mladenović, Slavko Vesković and Draženko Glavić
Faculty of Transport and Traffic Engineering, University of Belgrade, Belgrade, Serbia
ABSTRACT
Big Data analysis is the process that can help organizations to make
better business decisions. Organizations use data warehouses and busi-
ness intelligence systems, i.e. enterprise information systems (EISs), to
support and improve their decision-making processes. Since the ultimate
goal of using EISs and Big Data analytics is the same, a logical task is to
enable these systems to work together. In this paper we propose a
framework of cooperation of these systems, based on the schema on
read modeling approach and data virtualization. The goal of data virtua-
lization process is to hide technical details related to data storage from
applications and to display heterogeneous data sources as one inte-
grated data source. We have tested the proposed model in a case
study in the transportation domain. The study has shown that the
proposed integration model responds flexibly and efficiently to the
requirements related to adding new data sources, new data models
and new data storage technologies.
ARTICLE HISTORY
Received 14 September 2017; Accepted 2 April 2018

KEYWORDS
Big data analytics; data virtualization; schema on read; data warehouse; business intelligence system
Introduction
A large number of new approaches and technological solutions in data modeling, storage,
processing and analysis, grouped together under the common term Big Data, have the task of
keeping under control the massive inflow of data and placing it in the service of organizations and
individuals. The initial successful initiatives in the application of Big Data technologies soon gave
rise to a problem known as Big Data integration. Big Data integration means any software
integration involving the data characterized as Big Data, i.e. the data with at least one of the
following features: volume, variety, velocity and veracity. According to Arputhamary and Arockiam
(2015), there are two categories of Big Data integration, namely integration of several Big Data
sources in Big Data environments and integration of the results of Big Data analysis with structured
corporate data. This research is focused on addressing the second, above-mentioned category of
the Big Data integration problem.
An Enterprise Information System (EIS) is an integrated information system with the basic task of
providing the management with the necessary information. This research addresses two major
challenges encountered by modern EISs in the sphere of data management in order to be qualified
as "integrated" as per the above definition. Improving the business operations of organizations
nearly always involves the introduction of new sources of corporate data. If new data sets fall into
the category of Big Data, they require the application of Big Data storage, processing and analysis
methods. To use new corporate Big Data sets in a business context, they have to be integrated with
the existing corporate data sets, after which the integrated data should be subjected to Big Data
analysis. The integration of the existing and new corporate data sets to create the subject of the
future Big Data analysis is the first challenge to which this research will try to respond. The second
challenge and the subject of this research is the integration of the results of Big Data analysis with
EIS. This task has to be solved regardless of whether corporate or external data are the subject of
Big Data analysis. External data, such as social media and web data, are increasingly used as the
subject of Big Data analyses in order to examine user satisfaction, habits and needs etc.
Zdravković and Panetto (2017) highlighted that current challenges in EIS development are
related to the growing need for flexibility caused by cooperation with other EISs. The EIS environment
has become very dynamic and variable, not only in terms of collaboration with other EISs, but also
in terms of the availability of data sources. This research aims to offer a solution that would efficiently
meet the following three key requirements: frequent appearance of new Big Data sources (either
corporate or external), application of new data processing, analysis and visualization methods, and
integration of structured (i.e. relational) and semi- and non-structured data sources. To solve the
above problems, the schema alignment method of data integration has been selected. The
traditional schema alignment method of data integration has been adapted to Big Data sources
and methods of Big Data analysis by being based on the schema on read data modeling approach
and data virtualization concepts. Schema on read means that the schema is created only when the
data are read. Because structure is applied to the data only at read time, unstructured data can be
stored as they arrive. Since it is not necessary to define the schema before storing the data, it is
easier to bring in new data sources on the fly. Data virtualization is any approach to data
management that allows an application to retrieve and manipulate data without requiring technical
details about the data, such as how they are formatted at the source or where they are physically located.
The research also provides a technological framework for the implementation of the proposed
integration model. It includes the following three technological environments: NoSQL databases,
data virtualization servers and data integration tools.
The second section of this paper presents a review of the relevant literature. In the third section, we
propose and describe our Big Data analytics integration approach, based on the "data integration
on demand" approach and the "schema on demand" modeling approach. In order to evaluate it,
we have implemented the proposed approach in a case study in the transportation
domain. We have carried out a custom analysis of road traffic data on a Big Data platform and
integrated it with an SQL Server database, a Business Intelligence (BI) tool and a traffic geo-application,
according to the proposed integration approach. Finally, we present our conclusions
about the possibilities and constraints of our integration approach.
Literature review
As pointed out in the introduction of the paper, this research does not deal with the integration of
different Big Data sources on Big Data platforms but with the integration of the results of Big Data
analysis with structured corporate data. For this reason, the literature review includes the data
integration approaches and solutions that can be applied to Big Data sources as well as the existing
EIS architectures.
For decades, there have been two main approaches to data integration, namely batch data
integration and real-time data integration. Both approaches have secured a place for themselves in
Big Data integration processes as well. From the data analytics perspective, Big Data systems
support the following classes of applications: batch-oriented processing, stream processing, OLTP
(Online Transaction Processing) and interactive ad-hoc queries and analysis (Ribeiro, Silva, and da
Silva 2015). The batch data integration approach is used in batch-oriented processing applications,
whereas the real-time data integration approach is used in stream processing, OLTP and interactive
ad-hoc queries and analysis applications. An overview of the most important approaches and
solutions in the field of Big Data integration with EISs, both in the batch as well as the real-time
mode, will be given in the text below.
Batch data integration for big data
When data exchange between two systems is performed through periodic big file transfers on a
daily, weekly or monthly basis, we call this batch data integration. In the era of the Internet of
Things (IoT) and social media, i.e. the era of Big Data, this interval between two successive file
transfers can be much shorter and measured in hours or even minutes. The transferred files include
records with an unchangeable structure, which is adapted to the requirements of the system that
receives them. This approach to integration is known as a "tightly coupled" approach, because it
implies that systems are compatible in terms of file and data format and that the format can only
be changed if both systems simultaneously implement specic changes (Reeve 2013). The standard
batch data integration process includes the following operations: extract, transform, and load (ETL).
Today, there are a large number of commercial and open-source ETL tools (Alooma 2018). The main
purpose of these tools is to facilitate and streamline the warehousing, archiving and conversion of
data.
Big Data are most frequently raw data, which are "dirty" and incomplete, and it is therefore
necessary to perform the operations of extraction, cleaning and data quality processing (Macura
2014; Chen and Zhang 2014) in order to work with them. In the Big Data context, ETL tools are
used to extract, clean and transform raw data from Big Data platforms and NoSQL databases into a
relational or another required form, as well as to load the results of Big Data analytics into
Enterprise Data Warehouses (EDWs) (Florea, Diaconita, and Bologa 2015). This task can only be
performed by ETL tools capable of creating interfaces both to traditional data
sources (relational databases, flat files, XML files, etc.) and to Big Data platforms (Hortonworks
Data Platform, Cloudera Enterprise, SAP HANA Platform, etc.) and NoSQL databases (MongoDB,
Cassandra, HBase, Neo4j, etc.). Such commercial ETL tools include Informatica, Oracle Data
Integrator, Alooma, SAS ETL and Altova MapForce. The major open-source tools of this type include
Apache NiFi, Talend and Pentaho Data Integration.
Transformation as an operation can vary, ranging from an extremely simple operation to an
inexecutable operation, and it may require the use of additional data collections. In the simplest
case, it consists of the simple mapping of source fields to target fields, but most frequently it also
includes operations such as aggregation, normalization and calculation. Some ETL tools, such as
Altova MapForce, include an interactive debugger to assist with the data mapping design.
Apache Hadoop is an open-source distributed software platform for storing and processing
data. Central to the scalability of Apache Hadoop is the distributed processing framework known as
MapReduce (Sridhar and Dharmaji 2013). According to the research done by Russom (2013), the
main reason to integrate Hadoop into Business Intelligence or Enterprise Data Warehouse is the
expectation from Hadoop to enable Big Data analytics. The basic advantage of Hadoop is the
possibility to use advanced non-OLAP (Online Analytic Processing) analytic methods, such as data
mining, statistical analysis and complex SQL. However, in addition to the fact that it can be used as
an analytical sandbox, Apache Hadoop includes many components useful for ETL. For example,
Apache Sqoop is a tool for transferring data between Hadoop and relational databases. When data
are located in the Hadoop File System, they can be efficiently subjected to the ETL tasks of
cleansing, normalizing, aligning, and aggregating for an EDW by employing the massive scalability
of MapReduce (Intel Corporation 2013). In this way, the Apache Hadoop platform represents a
powerful ETL tool enabling the integration of the results of Big Data analysis of structured and non-
structured data in an EDW.
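To make this pattern concrete, the following hedged HiveQL sketch (all table, column and path names are hypothetical and not taken from this paper) exposes raw delimited files already stored in HDFS and then cleanses and aggregates them into a staging table that an ETL tool could load into an EDW:

-- Expose raw delimited files in HDFS without moving them (schema on read).
CREATE EXTERNAL TABLE IF NOT EXISTS raw_measurements (
  measured_at STRING,
  device_id   STRING,
  value       DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/measurements';

-- Cleanse and aggregate the raw records into a shape ready for the EDW.
CREATE TABLE IF NOT EXISTS edw_staging_daily AS
SELECT
  to_date(measured_at) AS measurement_date,
  device_id,
  COUNT(*)             AS record_count,
  AVG(value)           AS avg_value
FROM raw_measurements
WHERE device_id IS NOT NULL          -- drop invalid records
  AND value BETWEEN 0 AND 10000      -- simple data quality filter
GROUP BY to_date(measured_at), device_id;

A transfer tool such as Apache Sqoop could then export the staging table from HDFS into the relational EDW.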
Research (Wang et al. 2016) has shown that the most important Big Data technologies that
support batch data integration include the following: MapReduce, Hadoop (HDFS, Hive, HBase),
Flume, Scribe, Dryad, Apache Mahout, Jaspersoft BI Suite, Pentaho, Skytree Server, Cascading,
Spark, Tableau, Karmasphere, Pig and Sqoop.
Real-time data integration for big data
In many cases of data integration, the batch mode is unacceptable so that real-time or near real-
time data integration has to be performed instead. Real-time data integration involves the transfer
of much smaller quantities of data in one interaction, in the form known as a "message" (Gokhe
2016). The quantity of data transferred in this way is limited and each interaction means ensuring
security on all levels, the same as in batch data integration. Consequently, when it comes to larger
quantities of data, real-time data movement is slower than batch data movement. The traditional
"point-to-point" interaction model means that there are direct "tightly coupled" interfaces between
every two systems which have to share data. The data from each data source have to be
transformed as per the requirements of each target data format. If the number of systems which
should be connected by interfaces is n, the number of interfaces is n × (n − 1)/2. The most
significant design pattern for architecting real-time data integration solutions
is the "hub-and-spoke" design for data interactions (Reeve 2013). The point of this interaction model
is that data from all sources are transformed into a common, shared format, from which they are
transformed into the target format. The number of interfaces for the connection of n systems is n in
this case. From the technological point of view, the central segment of the real-time data integra-
tion solution is the implementation of an enterprise service bus (ESB). An enterprise service bus is
an application used to coordinate the movement of data messages across different servers that
may be running different technologies.
XML (eXtensible Markup Language) has been a de facto standard for the exchange of
information in the past two decades and, consequently, it also plays a major role in the field
of data integration. XML files are a typical example of semi-structured data (Gandomi and
Haider 2015). Modern data integration software enables the transformation of data from XML
files into other types of data warehouses (Big Data included) and vice versa. Other self-
documenting data interchange formats that are popular include JSON (JavaScript Object
Notation).
Hadoop offers excellent performance in the processing of massive data sets, but query execu-
tion on the Hadoop platform (e.g. Hive queries) is measured in minutes and hours. This constitutes
a great challenge in the integration of Hadoop into a real-time analytics environment. Intel and SAP
have joined forces to tackle this challenge (Intel Corporation 2014). The Intel® Distribution for
Apache Hadoop (IDH) is highly optimized for performance on Intel® architecture. Intel and SAP
have enabled the generation of queries that will be efficiently executed on both platforms, SAP
HANA as well as IDH.
Research (Wang et al. 2016) has shown that the most important Big Data technologies that
support stream processing and real-time integration include the following: Kafka, Flume, Kestrel,
Storm, SQLstream, Splunk, SAP Hana and Spark Streaming.
Schema alignment in big data integration
The main task of data integration, regardless of whether it is traditional or Big Data integration,
batch or real-time data integration, is to download the required data from their current warehouse,
to change their format in order to be compatible with the destination warehouse and to place
them at the target location (Loshin 2013). It is the challenges which data integration has to address
that have changed. The three main steps in data integration include schema alignment, record
linkage and data fusion. Schema alignment should respond to the challenge of semantic ambiguity,
enabling the identification of attributes that share the same meaning as well as those that do not.
Record linkage should find out which records refer to the same entity and which do not. Data
fusion should enable the identification of accurate data in an integrated data set in cases when
different sources offer conflicting values.
Dong and Srivastava (2015, 35) underline that "schema alignment is one of the major
bottlenecks in building a data integration system." They believe that in the Big Data context,
where the number of data sources is permanently on the rise and where source schemas are
expected to change all the time, no up-to-date schema mappings are possible. In contrast, Gal
(2011) speaks of the important role schema matching plays in the data integration life cycle. He
believes that the Big Data challenges of variety and veracity can be dealt with by using schema
matching, while the challenges of volume and velocity can be dealt with by using entity
resolution (record linkage).
Big data analytics integration framework
This section of the paper presents the framework for the integration of Big Data sources with
structured data sources, which still form the backbone of EISs. In the previous section, we have
seen that both the batch and the real-time data integration approaches have their advantages
and disadvantages and, consequently, our goal has been to propose a model capable of supporting
both integration methods.
In view of the fact that EISs are based on structured data (data warehouses, predefined
business analytics and reports, etc.), we believe that variety and veracity constitute the key
challenges in the integration of Big Data analysis and EISs. The integration framework we
propose is therefore based on the upgrade of the model of application of the schema
alignment (schema matching) method of data integration. The upgrade is expected to be the
result of the application of the schema on read modeling approach and data virtualization
concepts. In the text below, the two approaches will first be briefly outlined, and then the
reasons why they have been selected will be explained.
Schema on read modeling approach in big data integration process
Schema on write is a standard modeling approach, where we create a database schema and a
database for a specific purpose, and then we enter data into the database. This means that the data
must be adequately prepared for the developed schema. The schema on read approach involves
storing raw data, and then, when we need it for a specific purpose, we create a schema while
reading the data from the data store (Figure 1). Unlike schema on write, which requires you to expend
time before loading the data, schema on read involves very little delay and you generally store the
data at a raw level.

Figure 1. Schema on read modeling approach.

In data-intensive computation problems, the data are the driver, not analytical humans
or machines. When the schema on read modeling approach is used, these very large data sets can
be used multiple times in different ways, for various types of analysis. However, we believe that the
schema on read modeling approach has great potential not only in the field of Big Data analysis
but also in the field of Big Data integration.
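As a hedged sketch of this reuse (the raw_measurements table and its columns are hypothetical and assumed to be an external Hive table defined over raw files, as in the earlier ETL sketch), two different read-time schemas can be derived from the same raw store for two different analyses, without reorganizing the underlying data:

-- Read-time schema 1: hourly volumes per device.
CREATE VIEW IF NOT EXISTS hourly_volume AS
SELECT device_id,
       substr(measured_at, 1, 13) AS hour_bucket,
       COUNT(*)                   AS events
FROM raw_measurements
GROUP BY device_id, substr(measured_at, 1, 13);

-- Read-time schema 2: daily extremes per device, for a different analysis.
CREATE VIEW IF NOT EXISTS daily_extremes AS
SELECT device_id,
       to_date(measured_at) AS measurement_date,
       MIN(value)           AS min_value,
       MAX(value)           AS max_value
FROM raw_measurements
GROUP BY device_id, to_date(measured_at);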
According to EMC Education Services (2015), the main phases of the data analytics life cycle
include data discovery, data preparation, model planning, model building, communicate results
and operationalize. However, in our experience, the Big Data integration process, too, has to include
almost all of the above phases, as shown in Figure 2. Consequently, we shall discuss the roles the
schema on read modeling approach plays in all of the mentioned activities, as the phases of the Big Data
integration process:
Phase discovery: at this stage, the schema on read modeling approach plays an important
role in familiarizing the team with the data and in the selection of appropriate data preparation
methods.
Phase data preparation: given that the possibilities of data transformation with ETL tools are
nevertheless limited, the data in Big Data source systems have to be organized and formatted
so that they can be transformed with ETL tools into the format required by the EIS. The data in
Big Data source systems can be prepared for ETL operations through adequate modeling.
Data modeling when necessary, at the point of reading, is precisely what the schema on read
modeling approach makes possible. In this way, ETL operations are realized more effectively
using the schema on read modeling approach.
Phase model planning: the schema on read modeling approach allows a deeper exploration
of data and recognition of the relationships between individual variables.
Phase model building: at this stage, the schema on read modeling approach has the
most significant role, because it allows flexible creation, testing and changing of the
models. In the data integration process, the phases of "model planning" and "model building"
can occur several times. They will definitely occur during the ETL operations and, if there
is a data virtualization level, they will also occur during the creation of virtual tables.
Due to the above roles that the schema on read modeling approach can play in the Big Data
integration process, we believe that this modeling approach is imperative for efficient Big
Data integration.
Figure 2. Schema on read modeling approach in Big Data integration lifecycle.
Data virtualization integration approach for big data analytics
Big Data analytics is characterized by a permanent appearance of new data sources and new
requirements regarding analytical models and methods, so we have tried to adopt an
integration approach likely to ensure a satisfactory degree of flexibility. We have recognized the
data virtualization concept as a suitable basis for flexible "on-demand" integration and multiple use
of the same data, without copying.
As van der Lans (2012, 9) points out, "Data virtualization is the technology that offers data
consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored
in a heterogeneous set of data stores." Basically, when data virtualization is applied, a middle
layer is provided that hides from an application most of the technical aspects of where and when data are
stored. Besides that, all data sources are shown as one integrated data source. Data
virtualization can be implemented in various ways, including the following:
a data virtualization server, an Enterprise Service Bus (ESB) architecture, placing the data warehouse in
the cloud, a virtual in-memory database and object-relational mappers.
We have concluded that all of the above phases of Big Data integration, which include data
discovery, data preparation, model planning, model building, communicate results and
operationalize, can be performed on data virtualization servers. This is not the case with other data
virtualization implementations. Consequently, our approach to data virtualization
implies the use of data virtualization servers. The main parts of a data virtualization server
include source tables, mappings and virtual tables. Mappings represent the way to transform
data from source tables into virtual tables. What makes virtualization servers powerful tools is
the fact that source tables are not restricted to relational tables; instead, different data
sources, such as data generated by websites, the result of a web service call, an HTML page, a
spreadsheet or a sequential file, can be used. Users can access virtual tables by using different
APIs (Application Programming Interfaces), such as the JDBC/SQL interface, MDX
(MultiDimensional eXpressions) and the SOAP-based interface. This means that the same tables
can be seen differently by different users.
According to van der Lans (2012), a data virtualization server consists of a design module
and a runtime module. When data consumers access the virtualization layer, they use the
runtime module of a data virtualization server. The design module is an environment which
data analysts and data model designers use to create concept definitions, data models, and
specifications for transformation, cleansing and integration. Some data virtualization servers
enable the creation of unbound virtual tables. That means that it is possible to create data
models using them, and to join them with the real data source afterwards. The runtime
module of a data virtualization server represents a virtual sandbox for data scientists and
enables managed self-service reporting for business analysts.
At a time when new data sources appear on a daily basis, in order to ensure the understanding
and integrity of data, it is very important to manage metadata. Metadata must be a link between
the existing and new data sources. As Zdravković et al. (2015, 5) point out, "the capability to
interoperate will be considered as the capability to semantically interoperate." It is very important
that data virtualization servers allow the entering and using of data models, glossaries and
taxonomies.
The data virtualization integration approach can help in two ways in data integration processes
enabling Big Data analytics. Firstly, data virtualization can help in the phases of data discovery and
data preparation according to the requirements of different analytical models. Big Data analyses
can include only external data or only internal historical data stored in an EDW, but they often
require the integration of external and corporate data. Considering that we are talking about
analyzing a huge amount of external data coming at a high speed, it makes no sense to consider
the physical integration of data based on their copying into a single central data warehouse.
Instead of that, Big Data analysis is performed on Big Data platforms and in NoSQL databases with
appropriate storage and processing performances. In that case, the required corporate data can be
provided at the data virtualization layer, according to the requirements of a specific Big Data
analysis, and can then be exported to a Big Data platform, such as Hadoop. If data virtualization is
conducted via a virtualization server, the required data are provided by means of virtual tables. This
means that no local copy of the selected data is made; instead, the data can be exported to
different warehouses, in the form defined by a given virtual table. Data virtualization servers have
built-in functions for the export of data to different warehouses, Big Data platforms included.
Secondly, the data virtualization integration approach can help in the phase of integration
of the results of Big Data analytics and EIS. After becoming familiar with the available data
sources, the operations of model planning and model building can be performed on a data
virtualization server, similarly as in any database management system. Data models are
designed by creating unbound virtual tables. Regrettably, at this point, not all data virtua-
lization servers have this option. Once a virtual table is created, it can be linked with some
external or internal data source. The design of virtual tables depends on the form of analysis
results which should be integrated and the data model into which they should be inte-
grated. We propose that the designing of virtual tables be based on the application of the
schema alignment method and the available data virtualization concepts, such as nested
virtual tables. Nested virtual tables are virtual tables created on top of other virtual tables.
The schema alignment method and the way it is applied on a data virtualization server will
be explained in detail in the next section.
Schema alignment based on schema on read and data virtualization
Schema alignment is used when one domain includes several different source schemas, which
describe it in different ways. The results of schema alignment include the following:
a mediated schema, which provides a uniform view over heterogeneous data sources, cover-
ing the most important domain aspects;
attribute matching, which matches attributes in all source schemas with the corresponding
attributes in a mediated schema;
schema mapping between each source schema and a mediated schema, specifying the
semantic ties between the data described by source schemas and the data described by a
mediated schema.
There are two classes of schema mappings: Global-as-View (GAV) and Local-as-View (LAV). GAV
defines a mediated schema as a set of views over source schemas. LAV expressions describe source
schemas as views over a mediated schema. We shall first define GAV and LAV schema mappings
and then, by using these two formalisms, we shall give an example to show how the application of
schema alignment method of data integration can be upgraded through the application of data
virtualization concepts and the schema on read modeling approach. To demonstrate this, we have
selected an example from the case study conducted to verify the proposed model. The case study
is described in detail in the next section of the paper.
This is followed by the definitions of GAV and LAV schema mappings according to Doan, Halevy,
and Ives (2012).
Definition 1 (GAV Schema Mappings). Let G be a mediated schema, and let S = {S1, ..., Sn} be the
schemata of n data sources. A Global-as-View schema mapping M is a set of expressions of the
form Gi(X) ⊇ Q(S), where Gi is a relation in G and appears in at most one expression in M, and
Q(S) is a query over the relations in S.

Definition 2 (LAV Schema Mappings). Let G be a mediated schema, and let S = {S1, ..., Sn} be the
schemata of n data sources. A Local-as-View schema mapping M is a set of expressions of the form
Si(X) ⊆ Qi(G), where Qi is a query over the mediated schema G, and Si is a source relation that
appears in at most one expression in M.
The example: The backbone of the EIS architecture consists of an enterprise data warehouse
(EDW), a data virtualization server and a business intelligence (BI) tool. This particular EIS is used by
a road maintenance organization. We shall extract the relations modeling the road network,
EDW.Road and EDW.RoadSection, from the EDW schema. The problem at hand is to integrate new data
sources, the Big Data analysis results and new reports to be created in the BI tool with the existing
EIS. There are two new data sources: one stores the road traffic data, the other stores the data on the
automatic traffic counters monitoring traffic. The new reports should enable the visualization of Big
Data analysis results over the integrated data. The traffic data are stored in TXT files. In view of the fact
that TXT files are semi-structured and that they contain a large amount of data that is constantly
growing, they are warehoused on the Big Data platform HDFS (Hadoop Distributed File System).
The following three tasks have been identified:
Data on traffic counters, which are small in volume and do not change often, should be
integrated with the EIS on the data warehouse level.
Data on traffic flow volume and structure, which will be the result of Big Data analysis, should
be integrated with the EIS on the corporate data model level.
The new reports should be integrated with the EIS on the corporate data model level.
What we are interested in are the Road and RoadSection relations, which belong to the EDW:
EDW.Road(RoadID, RoadName, RoadCategory),
EDW.RoadSection(SectionID, SectionName, RoadID, SectionLength).
The first task will be solved by adding a new relation to the EDW system and by linking it to the
RoadSection relation. The new relation is Counter:
EDW.Counter(Location, Longitude, Latitude, SectionID, Type).
The second task requires a far more complex solution. The integration of a new data source
with EIS on the corporate data model level will be performed through the successive multiple
application of the schema alignment method. The results of the application of this method will
be implemented on a data virtualization server, by creating virtual tables and nested virtual
tables. We have adopted a top-down approach to this problem. This means that we rst
analyze the end goal to be achieved through integration. The end goal is a data schema as
required by new reports. Since this data schema should be a common, uniform view over the
EDW and the Big Data source, it will be designed as a mediated schema by using GAV schema
mappings. Its relations will be nested virtual tables (NVT_Counter and NVT_AADT), created as
views over virtual tables (VT_Road, VT_Section, VT_Counter, VT_Traffic). The virtual tables
VT_Road, VT_Section, VT_Counter and VT_Traffic will be created as unbound virtual tables.
Their role is very important. At this point, they will enable the application of GAV schema
mappings and the creation of a virtual mediated schema. The following expressions describe
the above GAV schema mappings:
Mediate.NVT_Counter(Location, Longitude, Latitude, RoadName, SectionName) ⊇
VT_Road(RoadID, RoadName),
VT_Section(SectionID, SectionName, RoadID),
VT_Counter(Location, Longitude, Latitude, SectionID).

Mediate.NVT_AADT(Location, Year, AADT, AADT_D1, AADT_D2) ⊇
VT_Traffic(Location, Year, AADT, AADT_D1, AADT_D2, AADT_A0, AADT_A1, AADT_A2, AADT_B1,
AADT_B2, AADT_B3, AADT_B4, AADT_B5, AADT_C1, AADT_C2, AADT_X).

The AADT field represents Annual Average Daily Traffic, while AADT_D1 and AADT_D2 represent
AADT by vehicle movement direction. The other fields represent AADT by vehicle categories.
In the next phase, by using LAV schema mappings, the unbound virtual tables VT_Road,
VT_Section and VT_Counter are linked with the corresponding EDW relations. The EDW schema
represents a mediated schema in this case. The following expressions describe the above LAV
schema mappings:
VT_Road(RoadID, RoadName) ⊆ EDW.Road(RoadID, RoadName, RoadCategory)
VT_Section(SectionID, SectionName, RoadID) ⊆ EDW.RoadSection(SectionID, SectionName, RoadID, SectionLength)
VT_Counter(Location, Longitude, Latitude, SectionID) ⊆ EDW.Counter(Location, Longitude, Latitude, SectionID, Type)
Using LAV schema mappings, source schemas are created for the Big Data source (BD) based on
VT_Traffic. The virtual table schema VT_Traffic represents a mediated schema in this case. The
following expressions describe the above LAV schema mappings:
BD.AADT(Location, Year, AADT) ⊆ VT_Traffic(Location, Year, AADT, AADT_D1, AADT_D2,
AADT_A0, AADT_A1, AADT_A2, AADT_B1, AADT_B2, AADT_B3, AADT_B4, AADT_B5, AADT_C1,
AADT_C2, AADT_X)

BD.AADTByDirections(Location, Year, AADT_D1, AADT_D2) ⊆ VT_Traffic(Location, Year, AADT,
AADT_D1, AADT_D2, AADT_A0, AADT_A1, AADT_A2, AADT_B1, AADT_B2, AADT_B3, AADT_B4,
AADT_B5, AADT_C1, AADT_C2, AADT_X)

BD.AADTByCategories(Location, Year, AADT_A0, AADT_A1, AADT_A2, AADT_B1, AADT_B2,
AADT_B3, AADT_B4, AADT_B5, AADT_C1, AADT_C2, AADT_X) ⊆ VT_Traffic(Location, Year, AADT,
AADT_D1, AADT_D2, AADT_A0, AADT_A1, AADT_A2, AADT_B1, AADT_B2, AADT_B3, AADT_B4,
AADT_B5, AADT_C1, AADT_C2, AADT_X)
Once schemas for the Big Data sources BD.AADT, BD.AADTByDirections and BD.
AADTByCategories are designed, the designing of Big Data analysis begins so as to get the results
described in the above schemas. This is when the schema on read modeling approach comes into
play. It is applied to a Big Data source in situations when one knows what kind of data schema is
required. In other words, the data on a Big Data platform are organized according to the schema
derived through the successive application of GAV and LAV schema mappings. Once a Big Data
source is created according to the above schemas, it should be linked with a data virtualization
server. After that, the unbound virtual table VT_Traffic is linked with the real Big Data source. This
solves the task of integrating the results of Big Data analysis with EIS on the corporate data model
level.
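As a hedged HiveQL sketch of this step (the raw_counts table and its columns are hypothetical; the case study in the next section describes the actual data), a Big Data source matching the BD.AADTByDirections schema could be derived on read with a Create Table As Select query:

-- raw_counts is assumed to hold one row per counted vehicle:
-- (location, count_date, direction, vehicle_category).
CREATE TABLE bd_aadt_by_directions AS
SELECT location,
       year(count_date) AS count_year,                                        -- the Year attribute
       ROUND(SUM(CASE WHEN direction = 1 THEN 1 ELSE 0 END) / 365.0) AS aadt_d1,
       ROUND(SUM(CASE WHEN direction = 2 THEN 1 ELSE 0 END) / 365.0) AS aadt_d2
FROM raw_counts
GROUP BY location, year(count_date);

Once such tables exist on the Big Data platform, they can be registered as source tables on the data virtualization server and bound to the corresponding unbound virtual tables.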
The third task, integration of new reports with EIS on the corporate data model level, will be
simply solved by linking the BI tool with the virtual schema Mediate on a data virtualization server.
We can now say that the key factors of the proposed model of Big Data integration include
the following:
a top-down approach to solving the integration problem, i.e. starting with reports and
moving down to data sources,
application of GAV schema mappings in order to create a uniform view over the domain (a
mediated schema), using the concept of unbound nested virtual tables on a data virtualization
server,
application of LAV schema mappings in order to create the required local and external data
source schemas, using the concept of unbound nested virtual tables on a data virtualization
server,
application of the schema on read modeling approach in creating data schemas for Big Data
sources, derived by using the above combined GLAV (Global-as-Local-as-View) schema map-
ping approach.
Although some authors, such as Dong and Srivastava (2015), believe that schema alignment is
not an appropriate Big Data integration method, we have shown that it can be effectively
implemented using unbound nested virtual tables and bound virtual tables on the data virtualiza-
tion server.
Big data analytics integration scenarios
A two-way data exchange is necessary between the enterprise information system and the Big Data
analytic tool. In Big Data analysis for business purposes, apart from data originating
from external sources, such as sensor data, data generated by various machines, social networking
data etc., corporate data are used, too. Corporate data that are used in Big Data analysis or are
crossed with Big Data analysis results frequently appear on their own as a result of some
predefined analysis in a business intelligence system. Thus, it is necessary to enable the integration
of corporate data and the other data that are the object of Big Data analysis. One part of the corporate
data, which is archived and traditionally used for business reporting, is structured. However, a
significant part of the corporate data are semi-structured and unstructured.
On the other hand, external sources generate heterogeneous data that are stored in different
types of data stores. The amount of external data that are of interest for corporate analysis as a
rule increases. The results of Big Data analysis should become available to business analysts and
other business users, and sometimes even end users, such as buyers, service users, etc. This can be
achieved through data integration or through integration on the report level. Integration of
corporate data, external data and Big Data is done in the phase of preparing input data for various
advanced Big Data analysis techniques. After Big Data analysis is completed, it is necessary to
integrate the results of the analysis with the corporate data. Big Data analysis scenarios can be
different. Sometimes only the data on a Big Data platform are analyzed, without the use of
corporate data. In this case, the only remaining task is to integrate the Big Data analysis results with
EIS. The Big Data analytics integration framework we suggest allows us to integrate Big Data
analysis and EIS on three levels: data warehouse level, corporate data model level and report level
(Figure 3). The example described in the previous section demonstrates all three levels of integra-
tion, as shown in Figure 3. It has been mentioned earlier that all data integration phases can be
conducted on the data virtualization server. Consequently, as seen in Figure 3, integration on the
corporate data model level is performed directly between the Big Data platform and the data
virtualization server, without the mediation of ETL tools.
Integration on the data warehouse level means that the Big Data platform is used to design a
schema on read which is identical to one segment of the data warehouse model. Data from the
Big Data platform can be obtained, transformed and loaded into data warehouse tables by using
some ETL tool. It has been mentioned earlier that, among other things, the goal of the schema on
read approach to modeling Big Data is to prepare Big Data so that ETL operations can be more
efficient. As seen in Figure 3, the ETL tool is linked with one of the "schemas on read" on the Big
Data platform. In the case of the data warehouse level of integration, the first four phases of the Big
Data integration process from Figure 2: discovery, data preparation, model planning and model
building are executed on the Big Data platform, or within ETL tools, and most often combined in
both environments (Figure 3). The last two phases from Figure 2: communicate results and
operationalize, are executed in the data warehouse (Figure 3). This kind of integration is suitable
for batch-oriented Big Data analysis which is repeated periodically (monthly, quarterly, yearly) or on
demand (Figure 4).
Integration on the corporate data model level can be carried out in two ways. The rst method
involves the prior preparation of the organization and storage of data on the Big Data platform and
the creation of schemas on read according to the corporate data model. The difference between
this method of integration and integration on the data warehouse level is that, in this way, the
integration is done on the virtual level. The data virtualization server connects virtual tables derived
from internal data sources and virtual tables generated from external Big Data sources (schemas
on read). In the case of integration on the corporate data model level, the first two phases of the
Big Data integration process from Figure 2: discovery and data preparation are executed on the Big
Data platform (Figure 3). The remaining four phases from Figure 2: model planning, model
building, communicate results and operationalize, are realized on the data virtualization server
(Figure 3). The second method involves the implementation of the schema on read modeling
approach only within the design module of the data virtualization server. This means that by
designing unbound virtual tables, a data model is created, which is subsequently associated with
real data sources.
The key stages of the schema on read modeling approach are Explore Data and Develop Model
(Figure 1). Both of these phases, according to our integration framework, can be performed on Big
Data platforms over Big Data sources, but also on the data virtualization server, over integrated
internal and external data sources. This is shown by the "schemas on read" represented in the form of
puzzle segments in Figure 3.
If we observe the three mentioned levels of integration, only integration on the corporate data
model level enables all types of Big Data analysis applications: batch-oriented processing, stream
processing, OLTP (Online Transaction Processing) and interactive ad-hoc queries and analysis
(Figure 4).
Integration on the report level means creating schemas on read on Big Data platforms. These
schemas are created with the aim of representing the data sources for the predened reports and
are designed so as to suit the reports' requirements. In the case of integration on the report level,
the first four phases of the Big Data integration process from Figure 2: discovery, data preparation,
model planning and model building are executed on the Big Data platform (Figure 3). The
remaining two phases from Figure 2: communicate results and operationalize, are executed on
the BI tool (Figure 3). This kind of integration is used for the following Big Data analysis applica-
tions: batch-oriented processing, stream processing and OLTP (Figure 4).
Figure 3. Proposed framework for Big Data analytics integration in EISs.
In existing batch data integration solutions, which are based only on the use of ETL tools, the
phases of discovery, data preparation, model planning and model building include copying and
temporary storage of large amounts of data in a data staging area. Our integration framework
does not envision a data staging area, because these Big Data integration phases are performed
either on the Big Data platform or at the virtual level, on the data virtualization server. If
these four phases are implemented on the Big Data platform, our integration framework does
not exclude the use of some existing solutions, such as Apache Hadoop components useful
for ETL.
When it comes to real-time data integration scenarios, our integration framework does not
exclude existing ESB-based solutions. On the contrary, our approach enables the development of a
traditional ESB approach, by the implementation of the "hub-and-spoke" design on the data
virtualization server. As described in the previous section, data from all sources are transformed
into a unified, shared format called the mediated schema.
If we do not want to store the data permanently in a data warehouse, integration on the data
warehouse level can be replaced with integration on the corporate data model level. Additionally,
integration on the report level can be replaced with integration on the corporate data model level.
The prerequisite for that is to apply data virtualization as an integration approach.
As the needs of business analysts and data analysts are becoming similar, the proposed
approach enables the integration of reporting and analytical tools with enterprise data warehouse
and external data sources. Depending on the categories of Big Data analytics use cases and the
specific needs and skills of a particular user, the proposed framework enables the following
integration scenarios:
(1) integration on the data warehouse level, for data analysts and developers;
(2) integration on the corporate data model level, for business analysts (self-service analysis),
data analysts and developers;
(3) integration on the report level, for end users, business analysts, data analysts and
developers;
(4) integration on the corporate data model level and data warehouse level, for data analysts
and developers;
Figure 4. Levels of integration and Big Data analysis applications.
(5) integration on the corporate data model level and report level, for business analysts (self-
service analysis), data analysts and developers;
(6) integration on the data warehouse level and report level, for data analysts and developers,
and
(7) integration on the corporate data model level, the data warehouse level and the report
level, for data analysts and developers.
The integration scenarios appropriate for particular user categories are presented in Figure 5.
Implementation of integration framework in transportation domain
Traffic data are an excellent example of heterogeneous data that arrive continuously, creating
a demand for Big Data storage and analysis. Excellent tailor-made traffic data are the best basis for
excellent transportation models (Janković et al. 2016a). We want to provide the traffic engineers
and authorities with pre-attributed maps tailored to their specific needs. For the analysis of traffic
flow, the traffic engineers calculate the indicators on an annual basis. For example, Annual Average
Daily Traffic (AADT), along with its main characteristics of composition and time distribution
(minutes, hourly, daily, monthly, yearly), is the basic and key input to the traffic-technical dimen-
sioning of road infrastructure and road facilities. This parameter is used in capacity analysis, level of
service analysis, cost-benefit analysis, safety analysis, environmental impact assessment analysis of
noise emission and air pollution, analyses of pavement construction, as well as for the static
calculation of road infrastructure objects, traffic forecasting, etc.
To count the traffic at the specified locations on the state roads in the Republic of Serbia, 391
inductive loop detectors were used (Lipovac et al. 2015). These detectors are QLTC-10C automatic
traffic counters (ATCs). The case study included the analysis of traffic data at ten locations on the
state roads and streets in the city of Novi Sad, Serbia, which the traffic counters generated during
2015. In order to make use of the sensor data, it is necessary to link them to the traffic infrastructure data.
As two different data categories exist, namely one that is continually generated and the other that is
changed rarely, we recognized the need to process them differently. The traffic data that are
continually generated in our case study are analyzed on the Big Data platform, while the data
related to the traffic infrastructure are stored in the local relational database. Obviously, there is a
need for their integration. In this study, we have integrated Big Data analytics with the existing EIS,
first traditionally, without a data virtualization layer, and then by using a data virtualization server.
Figure 5. Levels of integration from Big Data analytics use cases point of view.
Before developing the integration solution for our use case, we needed to go through the
following phases:
(1) A relational data model was developed and the SQL Server database STATE ROADS created.
These enable storing the data on the state road reference system in the Republic of Serbia
and the data on the automatic traffic counters used on these roads. The most important
entities of the relational model are the following: road, road section, intersection, automatic
traffic counter, etc.
(2) Each automatic traffic counter generated 365 text files in 2015. Each file contained about
10,000 records on average, so that the collected data amounted to 10 × 365 ×
10,000 = 36,500,000 records.
(3) For the storage and processing of traffic data, the Apache Hadoop platform was chosen.
Using the Apache Ambari user interface on the Hortonworks Sandbox (a single-node
Hadoop cluster), and with the help of the Apache Hive data warehouse software and the
HiveQL query language, a Hive database named TRAFFIC ANALYSIS was created.
(4) An ETL application was designed to "clean up" the text files of any invalid records generated
by traffic counters. Also, for each counter, this application consolidated the content of all
365 .txt files into a single text file, which resulted in ten large .txt files. After that, we
uploaded each of the ten large .txt files into the HDFS (Hadoop Distributed File System).
White (2015) did useful work on HDFS. Using the HiveQL query language, we filled Hive
database tables with the data from the .txt files that are stored on HDFS, as sketched below.
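A hedged HiveQL sketch of this loading step follows; the database name, column layout and delimiter are illustrative, since the actual counter record format is not reproduced here:

-- Table in the Hive database over the consolidated counter files.
CREATE TABLE IF NOT EXISTS traffic_analysis.counter_records (
  record_time      STRING,
  counter_location STRING,
  direction        INT,
  vehicle_category STRING,
  speed_kmh        DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Move one consolidated .txt file from its HDFS upload directory into the table.
LOAD DATA INPATH '/user/traffic/upload/counter_01.txt'
INTO TABLE traffic_analysis.counter_records;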
Integration approach without data virtualization
The traditional integration solution without data virtualization is presented in Figure 6. This
integration solution was implemented in the following phases:
(5) We carried out numerous HiveQL queries on the Hadoop TRAFFIC ANALYSIS database,
resulting in useful information on traffic volumes, traffic structure, vehicle speeds, etc.
(Janković et al. 2016b). HiveQL has a powerful technique known as Create Table As Select
(CTAS). This type of HiveQL query allows us to quickly derive Hive tables from other tables in
order to build powerful schemas for Big Data analysis. This data modeling approach is known
as schema on read. Schemas of Hive tables are designed so as to be joined to the relational
model of the local SQL Server database. This enables the integration of Big Data analytics
with EIS on the corporate data model level. The query results include traffic volume and
traffic safety indicators for each counting place: AADT, AADT by directions and vehicle
categories, Monthly Average Daily Traffic (MADT), average speed of vehicles, the 85th percentile
of vehicle speed, the percentage of vehicles that exceed the speed limit, average speeding, etc.
(6) In the IDE Microsoft Visual Studio 2015, a Windows Forms geo-application called Traffic
Counting was developed. It has the following features:
An intuitive GUI that allows the traffic engineers to define the query parameters and start
executing the queries against Hive tables in the Hadoop database TRAFFIC ANALYSIS and
tables from the local SQL Server database STATE ROADS. This enables the integration of Big
Data analytics with EIS on the report level. Access to the Hadoop database TRAFFIC
ANALYSIS from the Windows Forms geo-application Traffic Counting was enabled with
the help of the Hortonworks ODBC Driver for Apache Hive.
A GUI for graphical and tabular visualization of query results and their geo-location. For the
geo-location of query results in the Traffic Counting application, we used Bing Maps and
OpenStreetMap.
(7) The results of the queries of the Hadoop database TRAFFIC ANALYSIS were stored in the SQL
Server database STATE ROADS with the help of the Hortonworks ODBC Driver for Apache Hive
and the Windows Forms geo-application Traffic Counting. This enabled the integration of Big
Data analytics with EIS on the data warehouse level.

Figure 6. Big Data analytics integration solution without data virtualization.
Integration approach based on data virtualization
The architecture of the integration solution based on data virtualization is presented in Figure 7.
This integration solution includes the first phase of the traditional integration approach, while its
second and third phases differ from the traditional approach:
(2) A virtual data source was created on the Denodo Express 6.0 data virtualization platform, by
virtualizing and integrating data from the local SQL Server database STATE ROADS and the
Hadoop database TRAFFIC ANALYSIS (Figure 8). In this way, the data on the volume, structure
and speed of traffic flow that are generated on the Big Data platform are connected with the
locally stored data on the state roads in the Republic of Serbia. This enables the integration of Big
Data analytics with EIS on the corporate data model level; a query sketch over this integrated
virtual source is given after this list. In Figure 9, a tree view and a
relationship view of the data schemas created by combining (merging) fields from the local and
Big Data sources are shown. One should notice that the local data source is a relational table,
while the Big Data source is a non-relational table that does not even have a primary key. The
query results based on which field combining from the mentioned heterogeneous data sources
is performed are presented in Figure 10.
(3) The Traffic Counting geo-application, which was developed during the sixth phase of the
traditional approach to integration, was linked to the unique virtual data source on the
Denodo Express 6.0 data virtualization platform. In this way, the Traffic Counting geo-
application uses the results of Big Data analysis from the Hadoop database TRAFFIC
ANALYSIS and data from the local SQL Server database STATE ROADS, integrated on the
data virtualization platform. Figure 11 shows one window from the Traffic Counting geo-
application that displays average speeding for each counting place. As seen in Figure 11,
visualization is achieved on the tabular and graphical levels and on the maps.

Figure 7. Architecture of the proposed Big Data analytics integration solution.
Figure 8. Big Data analytics integration solution based on data virtualization.
Figure 9. Tree view and relationships view of data source on data virtualization platform.
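To illustrate the kind of request the Traffic Counting geo-application can now issue against the integrated virtual source, a hedged query sketch is given below; the virtual table and column names are illustrative, and Denodo exposes its own SQL-like dialect through JDBC/ODBC rather than exactly this syntax:

-- Join locally stored road-network data with Big Data analysis results,
-- both exposed as virtual tables on the data virtualization server.
SELECT r.RoadName,
       s.SectionName,
       c.Location,
       t.avg_speeding_kmh
FROM vt_counter  AS c
JOIN vt_section  AS s ON c.SectionID = s.SectionID
JOIN vt_road     AS r ON s.RoadID = r.RoadID
JOIN vt_speeding AS t ON t.Location = c.Location
WHERE t.count_year = 2015
ORDER BY t.avg_speeding_kmh DESC;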
Conclusions
The continuous emergence of new data sources, data models, database management systems and data integration platforms, coupled with the pronounced need for the self-service analytics used by business analysts, makes it increasingly necessary to integrate Big Data analytics with traditional EISs on demand. The IFAC TC5.3 Technical Committee for Enterprise Integration and Networking of the International Federation for Automatic Control has recognized the most serious challenges that must be solved in the development of the Next Generation Enterprise Information System (NG EIS). The following have been selected as its key required features: omnipresence, a model-driven architecture and openness. In the ideal scenario, the NG EIS will become a software shell, a core execution environment with an integrated interoperability infrastructure. Such an environment is foreseen as highly flexible and scalable, deployable on any and every platform, and using the external models and services infrastructure, exclusively or on a sharing basis (Zdravković and Trajanović 2015). Our approach to the integration of Big Data analytics with EISs is based on flexibly appending external models and joining them with the existing corporate data model on the virtual level. In this research, an approach is developed that enables flexible integration of Big Data analytics and heterogeneous external data sources with EISs. The key drivers of our integration approach are flexibility, the reuse of raw/atomic data, and querying multiple data stores and data types at once.
The proposed Big Data analytics integration framework enables seven integration scenarios, which can include integration on the corporate data model level, on the data warehouse level and on the report level. Only integration on the corporate data model level enables all kinds of Big Data analysis applications, including batch-oriented processing, stream processing, OLTP and interactive ad-hoc queries and analysis. Integration on the data warehouse level supports Big Data analysis applications based on batch-oriented processing. Integration on the report level supports Big Data analysis applications based on batch-oriented processing, stream processing and OLTP. All integration scenarios start with the design of schemas for data analysis at the time of reading the raw Big Data sources. The schemas on read are designed so as to be integrated into the existing relational corporate data models and/or the existing business reports, taking into account the structure of the source data files.
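As a hedged illustration of this schema-on-read step, the sketch below declares a Hive external table over the raw counter files only at query-definition time; the DSN, HDFS location, field delimiter and column names are illustrative assumptions rather than the actual counter file format.

```python
# Minimal schema-on-read sketch: the schema is projected onto the raw files when
# they are read, while the files themselves stay unchanged in HDFS.
# The DSN, HDFS path, delimiter and columns are illustrative assumptions.
import pyodbc

hive = pyodbc.connect("DSN=HiveDSN", autocommit=True)
hive.cursor().execute(
    """
    CREATE EXTERNAL TABLE IF NOT EXISTS traffic_analysis.vehicle_records (
        counting_place_id STRING,
        record_time       TIMESTAMP,
        direction         INT,
        vehicle_category  STRING,
        speed             DOUBLE,
        speed_limit       DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
    LOCATION '/data/raw/automatic_traffic_counters/'
    """
)
```

Because the table is external, dropping or redefining it affects only the schema; the raw data remain in place and can be read through different schemas, which is what allows the same files to serve different integration scenarios.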
From the point of view of Big Data analytics use cases, integration scenarios can be divided into three categories. For end users, integration scenarios that include integration on the report level are appropriate. For business analysts and business reporting, integration scenarios that include integration on the corporate data model level and/or on the report level are appropriate. For data analysts, data discovery and developers, integration scenarios that include integration on all three levels, namely the corporate data model level, the data warehouse level and the report level, are appropriate.
The case study conducted has confirmed that the use of a data virtualization layer offers numerous advantages. These can be classified into three groups. The first group of advantages comes into play even if the user accesses only one data source, and it consists of the following: a data virtualization layer capable of translating between the language and API supported by the data warehouse and the language and API suitable for the data users, independence from data source technologies (in the era of the IoT and Big Data, the possibility of exchanging a non-SQL data warehouse with a SQL warehouse is very important), and minimal negative user influence on data warehouse performance. The second group of advantages is connected to the metadata specification, which covers table structures, cleansing and transformation operations, aggregations and similar. When data virtualization is used, the metadata specification is implemented only once and does not need to be copied for each data user. In other words, data users share and reuse the metadata specifications, with which they achieve simpler table structures, centralized data transformation, centralized data cleansing, simplified application development, more consistent application behavior and more consistent results. The third group refers to data integration from multiple data sources and includes the following: a unified approach to different types of data warehouses (SQL Server databases, Excel worksheets, index-sequential files, NoSQL databases, XML files, HTML web pages, etc.), centralized data integration and sharing of integration programming code, consistent report results and efficient distributed data access.
In view of the positive experiences gained while using a data virtualization platform, the authors' future research will focus on the use of the above platform in the integration of Big Data analytics with NoSQL databases, such as column and key-value databases.
Disclosure statement
No potential conflict of interest was reported by the authors.
Funding
This paper has been partially supported by the Ministry of Education, Science and Technological Development of the Republic of Serbia under project No. 36012, and by the project Novel Decision Support Tool for Evaluating Strategic Big Data Investments in Transport and Intelligent Mobility Services (NOESIS). The NOESIS project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 769980. The data generated by the automatic traffic counters have been provided by the company MHM-Project from Novi Sad.
References
Alooma. 2018. "ETL Tools." January 4. https://www.etltools.net/
Arputhamary, B., and L. Arockiam. 2015. "A Review on Big Data Integration." International Journal of Computer Applications, Proceedings on International Conference on Advanced Computing and Communication Techniques for High Performance Applications 5: 21–26.
Chen, C. L. P., and C. Y. Zhang. 2014. "Data-Intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data." Information Sciences 275: 314–347. doi:10.1016/j.ins.2014.01.015.
Doan, A., A. Halevy, and Z. Ives. 2012. Principles of Data Integration. Waltham: Morgan Kaufmann.
Dong, X. L., and D. Srivastava. 2015. Big Data Integration (Synthesis Lectures on Data Management). Williston: Morgan & Claypool Publishers.
EMC Education Services, ed. 2015. Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Indianapolis: John Wiley & Sons.
Florea, A. M. I., V. Diaconita, and R. Bologa. 2015. "Data Integration Approaches Using ETL." Database Systems Journal VI (3): 19–27.
Gal, A. 2011. Uncertain Schema Matching (Synthesis Lectures on Data Management). Williston: Morgan & Claypool Publishers.
Gandomi, A., and M. Haider. 2015. "Beyond the Hype: Big Data Concepts, Methods, and Analytics." International Journal of Information Management 35: 137–144. doi:10.1016/j.ijinfomgt.2014.10.007.
Gokhe, P. 2016. Enterprise Real-Time Integration. E-book. http://www.enterpriserealtimeintegration.com/enterprise-real-time-integration/
Intel Corporation. 2013. "Extract, Transform, and Load Big Data with Apache Hadoop." White Paper, Big Data Analytics. https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
Intel Corporation. 2014. "Real-Time Big Data Analytics for the Enterprise." White Paper, Intel® Distribution for Apache Hadoop. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-hadoop-real-time-analytics-for-the-enterprise-paper.pdf
Janković, S., D. Mladenović, S. Mladenović, S. Zdravković, and A. Uzelac. 2016a. "Big Data in Traffic." In Proceedings of the First International Conference Transport for Today's Society TTS 2016, edited by M. M. Todorova, 28–37. Bitola, Macedonia: Faculty of Technical Science.
Janković, S., S. Zdravković, S. Mladenović, D. Mladenović, and A. Uzelac. 2016b. "The Use of Big Data Technology in the Analysis of Speed on Roads in the Republic of Serbia." In Proceedings of the Third International Conference on Traffic and Transport Engineering - ICTTE Belgrade 2016, edited by O. Čokorilo, 219–226. Belgrade: City Net Scientific Research Center.
Lipovac, K., M. Vujanić, T. Ivanišević, and M. Rosić. 2015. "Effects of Application of Automatic Traffic Counters in Control of Exceeding Speed Limits on State Roads of Republic of Serbia." In Proceedings of the 10th Road Safety in Local Community International Conference, edited by Prof. K. Lipovac and M. Nešić, 131–140. Belgrade: Academy of Criminalistic and Police Studies.
Loshin, D. 2013. Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph. Waltham: Elsevier.
Macura, M. 2014. "Integration of Data from Heterogeneous Sources Using ETL Technology." Computer Science 15 (2): 109–132. doi:10.7494/csci.2014.15.2.109.
Reeve, A. 2013. Managing Data in Motion. Waltham: Elsevier.
Ribeiro, A., A. Silva, and A. R. da Silva. 2015. "Data Modeling and Data Analytics: A Survey from a Big Data Perspective." Journal of Software Engineering and Applications 8: 617–634. doi:10.4236/jsea.2015.812058.
Russom, P. 2013. Integrating Hadoop into Business Intelligence and Data Warehousing. Renton, WA: The Data Warehousing Institute.
Sridhar, P., and N. Dharmaji. 2013. "A Comparative Study on How Big Data Is Scaling Business Intelligence and Analytics." International Journal of Enhanced Research in Science Technology & Engineering 2 (8): 87–96.
van der Lans, R. F. 2012. Data Virtualization for Business Intelligence Systems. Waltham: Elsevier.
Wang, H., Z. Xu, H. Fujita, and S. Liu. 2016. "Towards Felicitous Decision Making: An Overview on Challenges and Trends of Big Data Technologies." Information Sciences 367–368: 747–765. doi:10.1016/j.ins.2016.07.007.
White, T. 2015. Hadoop: The Definitive Guide. Sebastopol, CA: O'Reilly Media.
Zdravković, M., F. Luis-Ferreira, R. Jardim-Goncalves, and M. Trajanović. 2015. "On the Formal Definition of the Systems' Interoperability Capability: An Anthropomorphic Approach." Enterprise Information Systems 11 (3): 389–413. doi:10.1080/17517575.2015.1057236.
Zdravković, M., and H. Panetto. 2017. "The Challenges of Model-Based Systems Engineering for the Next Generation Enterprise Information Systems." Information Systems and e-Business Management 15 (2): 225–227. doi:10.1007/s10257-017-0353-z.
Zdravković, M., and M. Trajanović. 2015. "On the Runtime Models for Complex, Distributed and Aware Systems." In Proceedings of the 5th International Conference on Information Society and Technology ICIST 2015, edited by M. Zdravković, M. Trajanović, and Z. Konjović, 236–240. Kopaonik, Serbia: Society for Information Systems and Computer Networks.