Paper 7, ICIQ 2016, Ciudad Real (Spain), June 22-23, 2016
Data Quality Problems in ETL: The State of the Practice in Large
Organisations
PHILIP WOODALL, University of Cambridge
ALEXANDER BOREK, Gartner
MARTIN OBERHOFER, IBM
JING GAO, University of South Australia
This paper presents a review of the data quality problems that arise because of Extract, Transform and Load
(ETL) technology in large organisations by observing the context in which the ETL is deployed. Using a case
study methodology, information about the data quality problems and their context arising from deployments
in six large organisations is reported. The findings indicate that ETL deployments most commonly introduce
data accessibility problems which are caused by (1) the ETL failing part way and not delivering the data on
time, (2) the information systems being locked during ETL execution, and (3) users not being able to find
data in the target because of errors in the way the primary keys are transformed. Furthermore, accuracy,
timeliness, believability, and representational consistency problems were also found to be caused by the ETL
technology.
INTRODUCTION
With the ever-growing digitalization of our economy, the dependence of business
operations and decision making on deriving new insights from data is constantly
increasing [Davenport and Harris 2007]. An essential requirement for this is that data
needs to be with decision makers, analysts and managers when and where they need
it [Smith et al. 2008], and one of the well-established enabling technologies for
transferring data is Extract, Transform and Load (ETL). However, ETL is not a simple
technology to implement, and while it can be used to address data quality problems
[Kimball and Caserta 2004], it can also yield data quality problems in different
business settings.
This paper surveys the ways in which ETL is actually used in organisations to
determine the data quality problems that arise from this use. The results can be used
to determine what data quality problems future ETL solutions (and other data
integration technologies) must address to be most useful to organisations. Other
survey research into ETL has focussed on the technology itself and how it can be
improved with respect to the state of the art [Vassiliadis 2009] or has focussed on the
ETL-related tools [Thomsen and Pedersen 2005; Thoo and Randall 2015]. Our focus is
different: this paper takes a state of the practice viewpoint and our particular focus is
on identifying the data quality problems that are caused by the way the ETL
is used, implemented and configured (i.e. because of the particular context in which it
is deployed) in an organisation. It is important to note that we do not intend to report
the data quality problems that may already exist in the source databases and that may
be simply transferred by the ETL to the target; it is the problems that are introduced
by the ETL itself which we aim to identify.
ETL use and the business context are likely to change dramatically with the size of the business (e.g. small vs. very large organisations), and this research therefore focuses only on large organisations. In order to determine what data quality problems are caused by ETL in large organisations, this paper first answers the following sub-questions:
1. What is ETL used for, and why is it used like this in the organisation?
2. How and why is it configured/used like this?
Using a case study approach with findings from six different organisations, the results
indicate that data accessibility problems are most common followed by accuracy,
timeliness, believability and representational consistency problems.
EXISTING ETL RESEARCH
Existing work spanning ETL and data quality focusses on how the ETL technology can be used to address data quality issues (see, for example, [Rodic and Baranovic 2009; Galhardas et al. 2001]). Furthermore, existing research describes some of the problems caused by the ETL itself: these include data unavailability/lack of freshness caused by the ETL not finishing on time, either because of a long execution time or because a failure means it must be run again [Simitsis et al. 2005; Simitsis et al. 2010]. In the future, new uses of ETL, such as in data lakes (see Figure 1), are likely to put further demands on ETL technologies as they become part of the core business, especially with regard to automation: much of the ETL is still manual, and a current limitation of the self-service approach is that commercial tools have limited support for automatically generating and including complex data cleansing processes in the data movement logic [Maier et al. 2012].
Figure 1: Future uses of ETL in a data lake including self-service access
Gartner lists five data integration scenarios [Thoo and Randall 2015]: data acquisition for business intelligence (BI), analytics and data warehousing (A); sourcing and delivery of master data in support of master data management (MDM) (B); data migrations/conversions (C); data consistency between operational applications (D); and inter-enterprise data sharing (E). The analysis of the results from the case studies uses these scenarios to classify the ETL uses found.
METHODOLOGY
In order to answer the research question, a case study approach was applied which
utilised, primarily, interviews and direct observations [Yin 2009]. A detailed inspection
of each case was obtained to fully understand the deployment context and to determine the actual data quality problems that arise because of it. For the
selection of organisations to inspect, a convenience sample was used based on the
different companies that the authors had access to (either by conducting a consulting
or research project at the company, or by having access to experts from the
organisation who were available to answer the questions). The only selection criterion applied was that the organisations were required to have more than $1B in revenue. In
the case of the healthcare organisation, it is a national service for a particular country
and so is large in scale. The data collection was designed to be completely confidential
(no names of experts or companies are revealed) because this gave the best chances of
obtaining uncensored answers, especially as respondents were asked to report
problems/failures. The ways in which the data extraction was carried out for each case
are summarised in Table I.
Table I. Selected cases and the primary data extraction method used

Case | Business Sector | Primary Data Extraction Method
1 | Manufacturing | Multiple telephone interviews with a user of the related IT systems
2 | Manufacturing | Multiple interviews with various users of the IT systems
3 | Retail Bank | Direct observation via a consulting project
4 | Financial Asset Manager | Direct observation via a consulting project
5 | Healthcare Authority | Multiple interviews
6 | Wholesale Distributor | Direct observation via a consulting project and telephone interviews with a user of the IT systems
RESULTS: STATE OF THE PRACTICE ETL USE, CONFIGURATION AND DATA QUALITY
PROBLEMS
The following subsections describe in detail the results for each organisation (case).
The descriptions are structured in terms of what the ETL is used for and why, how and
why it is configured the way it is, and finally the data quality problems associated with
the ETL processes. Using the conceptual framework of data quality (see figure 2 in
[Wang and Strong 1996]), the data quality problems are classified in accordance with
the dimensions in that framework.
Case 1: Manufacturer
What is it used for, and why is it used like this?
The manufacturer exports data from their ERP system (SAP in this case) into bespoke
reporting tools via ETL because the ERP system does not provide the reporting fidelity
required. In particular, reports are used to determine what engineering parts have
increased in volume over the last year, what parts are late (have not yet been delivered
when expected), and to predict what parts are going to be late, etc. Furthermore, it is
necessary to observe these reports on a detailed part-by-part basis as well as an
aggregated view. The main problem is that the ERP system cannot provide a
convenient way to view and manipulate aggregations of parts. One of the reporting
tools used is IBM’s Cognos, which helps to satisfy the aggregation reporting
requirements. The data volume is large as the entire bill of materials (BOM) for the
various aerospace products that the company manufactures is contained in the ERP
system. Some sub-components have in excess of 300 parts, and so the overall total for the final product, and for all of the different types of products, is large. The exact number
of parts cannot be revealed because of confidentiality reasons.
How and why is it configured/used like this?
The ETL executes every Sunday and takes at least 12 hours to complete the transfer
of all the data. Batch loading is used where all the data from the ERP system is
extracted and loaded into the target reporting systems. The reporting systems are used
from Monday to Friday by various users, as is the ERP system. The developers of the
ETL chose to use the most reliable and uncomplicated method of transferring the data
to ensure that the data is always available (i.e. to minimise the likelihood of data
transfer failure). It is known that their method of batch loading sacrifices data
timeliness, but this is a trade-off that is accepted. Other options like incremental
loading are not used because of their complexity and therefore increased likelihood of
failure. The complexity referred to is the need to develop code to determine what data
needs updating, which is relatively easy compared to the additional need to ensure the
integrity of the transactions in the target systems (for example, to ensure no duplicate
records or incorrect deletion of records which should still be present).
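As an illustration of this trade-off, the sketch below shows a weekly truncate-and-reload batch job of the kind described; it is not the manufacturer's actual implementation, and the table, columns and rows are assumptions. The approach needs no change detection and cannot leave duplicates or orphaned records behind, which is why it is favoured for reliability, but the target is only as fresh as the last run.

```python
# Illustrative only: 'parts_report', its columns, and the sample rows are assumed.
import sqlite3

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE parts_report (part_no TEXT PRIMARY KEY, qty INTEGER)")

def full_batch_load(conn, extracted_rows):
    """Truncate-and-reload: no change detection, no risk of duplicates or missed deletes."""
    with conn:                                     # one transaction for the whole load
        conn.execute("DELETE FROM parts_report")   # discard last week's copy
        conn.executemany("INSERT INTO parts_report VALUES (?, ?)", extracted_rows)

full_batch_load(target, [("P-100", 42), ("P-200", 7)])   # rows extracted from the ERP
print(target.execute("SELECT * FROM parts_report").fetchall())
```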
What are the problems?
Clearly, the main problem is the timeliness of the data, with the reporting tools only
being up to date on Monday morning, and with the data becoming increasingly stale
throughout the week. Note that the ERP system is always kept up to date throughout
the week, and, importantly, there are many updates to the system which should ideally
be transferred to the reporting tools as soon as possible. Another problem that occurred
once, according to the respondent, was that the ETL failed and the data did not
transfer correctly. The result was that the reporting system was unavailable for a
whole day while the data transfer was re-run. This is classified as an accessibility
problem.
Case 2: Manufacturer
What is it used for, and why is it used like this?
In this case, the manufacturer transfers data via ETL between various operational
systems in order to complete the business process of procurement of its engineering
parts. For instance, an engineer will request a particular part to be procured and this
enters a queue in the procurement agents’ system. Once the procurement agent has
obtained quotations from suppliers, the quotations are transferred via ETL to the
purchase ordering system so that the order can be placed. There are also other ETL
processes that transfer data from the purchase order system to various other systems,
such as a final order release system, which can send an order to a supplier. Various
COTS systems were procured and installed because the company has not found one
system that can deal with all of the scenarios in procurement (e.g. buying ordinary
consumable items individually from a retailer and buying hundreds of thousands of
specialist parts via a contract with a specific supplier over a long period). Hence, a
different system is needed for each scenario that existing systems cannot cater for.
They currently have at least 5 different systems which enable procurement agents to
buy parts for various scenarios.
How and why is it configured/used like this?
All the ETL processes in this case are batch processes. Note that some of these may
act as an incremental load if the source system maintains a queue of orders that are
removed from the system as they are transferred to another system. Most ETL
processes are executed “out of working hours” while some are executed in the afternoon.
Some legacy systems don’t support real-time updating, and the system is locked while
the data is extracted/loaded. Therefore, in these cases, ETL “out of working hours” is
preferable as it does not lock users out of the systems when they need access to them.
Furthermore, batch loading is a simple and fast way to implement data transfer
between these systems, and so that is what is used rather than incremental loading,
which is more complex.
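The locking behaviour referred to above can be illustrated with a minimal sketch; the systems in the case are not named, so SQLite and the table below are stand-ins. While the loader holds an exclusive lock for the duration of the batch, a concurrent user query fails until the load commits.

```python
# Illustrative only: SQLite stands in for the legacy systems described in the case.
import os, sqlite3, tempfile

db_path = os.path.join(tempfile.mkdtemp(), "procurement.db")
setup = sqlite3.connect(db_path)
setup.execute("CREATE TABLE purchase_orders (id INTEGER PRIMARY KEY, status TEXT)")
setup.commit()
setup.close()

loader = sqlite3.connect(db_path, isolation_level=None)   # autocommit; lock managed explicitly
loader.execute("BEGIN EXCLUSIVE")                         # lock held for the whole batch load
loader.execute("INSERT INTO purchase_orders VALUES (1, 'released')")

reader = sqlite3.connect(db_path, timeout=0.1)            # a user querying during the load
try:
    reader.execute("SELECT * FROM purchase_orders").fetchall()
except sqlite3.OperationalError as exc:
    print("user locked out during the batch load:", exc)  # 'database is locked'

loader.execute("COMMIT")                                  # lock released when the ETL finishes
print(reader.execute("SELECT * FROM purchase_orders").fetchall())
```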
What are the problems?
If the ETL process is executed during working hours, then it often effectively “locks” the systems as it extracts and loads data, resulting in the systems and data within
being unavailable for decision makers until they are released. We observed this
situation in one organisation where staff must stop their work on Friday afternoons
while a batch of data is sent via ETL from one system to another. The reason was
because of computing power which was being taken up with the ETL load process, and
so other systems would slow down to a point of becoming unusable (and effectively
“locked”). This is an accessibility data problem (rather than a timeliness problem)
because the data exists in the system in an up-to-date fashion but it cannot be accessed.
Furthermore, the different source systems reference the same data product (e.g. a work
order) in different ways (for example, in one system a “project code” is used and in
another system “buyer id” is used for work orders). When data is transferred to another
system, it becomes a problem to find work orders for people who use a different
identifier to the one selected by the ETL in the target system (hence this is also an
accessibility problem). There are also various problems related to translation errors,
synchronization errors and issues about the timing of updates, which result in
inconsistencies across systems. Hence it is sometimes difficult for the procurement
agents and planners to be sure if the data is correct or not. The relevant data quality
dimension for this latter point is believability (due to the staff not being confident in the
data).
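The key-related accessibility problem can be made concrete with a small sketch; the record layouts and the cross-reference below are assumptions rather than the organisation's design. When the ETL standardises on one identifier, users who know only the other identifier can no longer find the work order unless a cross-reference between identifiers is carried along.

```python
# Hypothetical layouts: system A identifies work orders by project code, system B
# by buyer id; the ETL keeps only the project code in the target.
source_a = [{"project_code": "PC-1001", "part": "bracket"}]
target = [{"key": r["project_code"], "part": r["part"]} for r in source_a]

def find(rows, key):
    return [r for r in rows if r["key"] == key]

print(find(target, "B-77"))                    # [] -> a system B user cannot find the order

# One possible remedy (assumed, not from the paper): carry a cross-reference of
# identifiers so either key can be resolved to the one used in the target.
xref = {"B-77": "PC-1001"}
print(find(target, xref.get("B-77", "B-77")))  # the order is found again
```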
Case 3: Retail Bank
What is it used for, and why is it used like this?
The retail bank uses ETL for integrating data from over 50 source systems to create a
360 degree view of the customer. This includes customer master data (such as address,
birth date, phone number etc.), all contracts and products that the customer has with
the bank, a history of interactions across all possible channels (e.g. physical branch,
mobile app banking, online website banking, call center etc.), risk scorings,
advertisements and promotions sent out to this customer, and many other pieces of
relevant data. IBM DataStage is used as the ETL tool alongside custom made SQL
scripts. The data is transferred via batch runs from the source systems during the
night. The integrated data is used by many stakeholders in the bank. For example, it
is used to create performance reports for marketing and sales using a combination of
off the shelf reporting tools, namely IBM Cognos and SAS. The 360 customer view data
is also used as input for self-service Business Intelligence (BI) and discovery tools,
where additional ETL processes are run on a desktop environment of the business
users to integrate additional data sources on an ad-hoc basis and to make further data
transformations. Tools used by the business users for this purpose are, for instance,
Qlikview, TIBCO Spotfire, and SAS. Finally, this data is also used as input for data
mining and advanced analytics model building for sales and marketing by data
scientists working on the business side, who also run further ETL processes on their
machines. Such models are used, for example, to identify customers that are most
likely to buy.
How and why is it configured/used like this?
Every night an incremental load executes and copies only the changes made during
the day. A full copy of the data is made once a week on Sunday, as it takes at least
eight hours for the full batch run to complete. The daily incremental load ensures that
data is up to date every day, while the weekly full run is needed to fix data
inconsistencies that result from the incremental loads during the week. To date, real-time data has not been a business requirement, although this is likely to change in the near future.
What are the problems?
A problem that occurs frequently is that the ETL process is interrupted because the
data loaded from the sources does not meet the specifications. The ETL process is in
some cases very intolerant to any inconsistencies in the source data. The result is that
the process needs to be restarted after the source data has been corrected, which causes
long delays for the data users. This can be caused by (1) data defects in the source systems, e.g. if the data has not been entered accurately by the personnel in the branch, and (2) the source systems not being able to provide the required data in time, which delays the start of the ETL processes (this happens less
frequently). Ultimately, these cause a data accessibility problem as users of the target
system need to wait for the data.
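A minimal sketch of the behaviour described follows; the field and the format rule are assumptions, and the bank's actual DataStage jobs are not reproduced here. One record that violates the specification aborts the whole nightly batch, leaving the target stale until the source is corrected and the load is re-run.

```python
# Illustrative only: field names and the date-format rule are assumed.
from datetime import datetime

nightly_batch = [
    {"customer_id": 1, "birth_date": "1980-05-17"},
    {"customer_id": 2, "birth_date": "17/05/1980"},        # violates the assumed specification
]

def load_strict(rows):
    """All-or-nothing load: any bad record stops the whole batch."""
    for row in rows:
        datetime.strptime(row["birth_date"], "%Y-%m-%d")   # raises ValueError on the bad row
    return rows                                             # reached only if every row passes

try:
    load_strict(nightly_batch)
except ValueError as err:
    print("batch rejected; target left without fresh data:", err)
```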
Case 4: Financial Asset Manager
What is it used for, and why is it used like this?
A data warehouse is used to calculate a weekly view of the market risk of all financial
assets owned by the investment management company. To do this, data about all
trading activities during the day and the current financial assets owned (quantity and
market value) is extracted from 12 different trading and operational systems. Then,
the market risk is calculated by simulating different events and volatilities that could
happen in the market and generating an evaluation for each asset class and sub-class.
The market risk figures are the key input for risk management and to inform the
customers of the asset manager about the risk evaluations of their financial assets.
The data can be analysed with a custom written business intelligence tool after further
transformation using self-service BI.
How and why is it configured/used like this?
There is a weekly batch run to transfer the data, which takes half a day and runs
during the night. At the end of the ETL transfer, the simulations are run to calculate
the market risk evaluations. The ETL process uses self-coded SQL scripts and moves
the data into an ODBC database platform. The data is transformed into a common
format and an asset class and a sub-class are assigned to each financial asset. The
amount of input data is limited before the simulation run (less than 1 GB), but the
simulated data has a higher volume (>5 GB). After completion of the ETL process, a
number of plausibility checks are run before the data is released. There is currently no requirement or business need to calculate the market risk more often; hence, a significant benefit would be needed to justify changing the system to daily ETL runs.
What are the problems?
In some cases, financial investments are mapped to a wrong asset class during the
ETL, which results in an inaccurate risk evaluation. This is clearly an accuracy
problem in the data which causes any calculations to be inaccurate also (i.e. “garbage
in garbage out”). The problem can be detected if there is a large change in the daily market risk value that cannot be explained by events in the real world; this often indicates that some parts of the simulations and/or calculations have gone wrong.
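The detection approach described can be sketched as a simple plausibility check; the threshold and figures below are illustrative assumptions, not the company's actual checks. A large relative change in the risk figure that has no real-world explanation is flagged for investigation before the data is released.

```python
# Illustrative only: the 25% threshold and the example figures are assumed.
def risk_is_plausible(previous_risk: float, current_risk: float,
                      max_rel_change: float = 0.25) -> bool:
    """Flag a suspiciously large change in the market risk figure between runs."""
    if previous_risk == 0:
        return True                                   # nothing to compare against
    rel_change = abs(current_risk - previous_risk) / abs(previous_risk)
    return rel_change <= max_rel_change

print(risk_is_plausible(10_000_000, 10_400_000))      # True  -> figure released
print(risk_is_plausible(10_000_000, 19_500_000))      # False -> hold and investigate the ETL mapping
```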
Case 5: Healthcare Authority
What is it used for, and why is it used like this?
In this case, the healthcare organisation’s health funding model is being changed to an
activity-based funding model, and BI and reporting are key enablers. The state department therefore has an increasing demand to harvest data from various public care service providers, local hospitals, etc. on a daily basis. This data is used for
reporting purposes, for example to report the performance of all the local health
services. The performance evaluation unit at the state level controls a data warehouse
and runs regular reports to monitor local activities and alert the local health services
if unusual behaviour is detected (e.g. they are over budget, too many surgeries of the same kind within a short period, unusually long waiting periods for elective surgeries, etc.). The BI setup involves a two-level data warehouse infrastructure. On a daily basis,
data is extracted from systems such as patient admission systems and emergency
service systems (at the local health districts), and is then transferred to the data
warehouse at the local health district level. Data is then transferred via ETL from the local health districts to the data warehouse at the state department level, where a data quality profiling engine scans all loaded data first.
How and why is it configured/used like this?
An incremental load ETL process between the various local health district data source
systems and the local health data warehouse is run daily. Also, weekly incremental
loading is scheduled between the local health district data warehouse and the state
data warehouse. Incremental loading is used because of a slow network infrastructure
which charges by data usage. Furthermore, because patient data is sensitive,
incremental loading minimises the data volume that is exposed for any data
breaches/releases. The autonomous nature of health service providers enables them to
choose and implement IT systems themselves. Over time, the state-level data
warehouse therefore plays an important role in integrating data from these
heterogeneous and disparate source systems.
What are the problems?
The maintenance of data quality rules causes problems. For example, the ETL matches a treatment to a health coding system in which the general dental code 101 contains the sub-codes d01 and d02 (e.g. check-up and simple filling). With this ETL rule, everything
is validated and loaded into the target. However, if a new rule is enforced that changes
d01 to no longer be part of code 101, then the new rule will reject all 101/d01 records
in the transformations of the ETL. Furthermore, the 101/d01 records that already exist
in the data warehouse will no longer be correct and need to be updated. This is
therefore both an accessibility problem (because of the failure of the ETL to transfer
the data to the target) and an accuracy problem.
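The rule-maintenance problem can be illustrated with a small sketch; the rule-table representation below is an assumption. Records that passed validation under the old rule are rejected once d01 is removed from code 101, and the 101/d01 rows already loaded become inaccurate.

```python
# Illustrative representation of the coding rule described in the case.
rules_v1 = {"101": {"d01", "d02"}}       # general dental code -> allowed sub-codes
rules_v2 = {"101": {"d02"}}              # new rule: d01 no longer belongs to code 101

records = [{"code": "101", "sub_code": "d01"},
           {"code": "101", "sub_code": "d02"}]

def validate(rows, rules):
    accepted = [r for r in rows if r["sub_code"] in rules.get(r["code"], set())]
    rejected = [r for r in rows if r not in accepted]
    return accepted, rejected

print(validate(records, rules_v1))       # both records load into the warehouse
print(validate(records, rules_v2))       # the 101/d01 record is now rejected by the ETL
```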
Case 6: Wholesale distributor
What is it used for, and why is it used like this?
The respondent mentioned two main examples of how ETL is used in the organisation:
for sharing of operational data between applications, and as a one-off transfer for
system migrations.
Sub case 1 (sharing operational data between applications): In the first case,
the data is transferred from an ERP system (SAP in this case) into various target
systems. One example is into a Master Data Management (MDM) system (via a batch
transfer), which includes data for sellers (50-60K records periodically, with an annual
batch update of approx. 300K records), and billing and accounts data (1000 records per
day, the largest number being 50K records). Data is extracted from the ERP system so
that it can be entered into the MDM solution and data stewards can, for example,
ensure the integrity of the data including reconciling complex cases of duplicate
records. Also, any non-core business activities that need data from the ERP system
have it extracted so that the core activities (such as placement of orders) have
maximum performance from the ERP system and are not burdened by other processes.
Sub case 2 (one-off system migrations):
Due to an upgrade in the platform used to host the company website, a migration from
the old systems to the new platform was needed. This involved transferring the data
via ETL to the new platform. This opportunity was taken to clean the data during the
transforms. Although each migration is a one-off case, the respondent indicated that
the company is constantly doing these; other examples besides the website include onboarding new applications, integrating external data (such as from data analytics providers) into the organisation, and sending data to external cloud/Software as a Service (SaaS) solutions for external analysis (in this case, Salesforce). These are always done in batch and involve large amounts of data: 1.5-3 million records monthly, and approximately 24 million records annually for the external provider case.
How and why is it configured/used like this?
Sub case 1 (sharing operational data between applications): Data is mainly
transferred periodically, as the business need is for batch updates rather than
continuous real-time data. All data is extracted into a staging area which is then used
to identify and to load only the changed data (i.e. batch extract and incremental load).
The previous snapshot of the data is compared to the new data in the staging area in
order to determine what data has changed. ETL is generally done in the
evenings/overnight, so that daily business processes are not disrupted.
Sub case 2 (one-off system migrations): Clearly, for one-off migrations of data,
real-time synchronisation is not needed, hence batch loading is done. When integrating
data from external providers it is also done in batch because the data is provided in
batches (e.g. a monthly subscription gives a monthly update of data). It is still possible
to do an incremental load if the new monthly data is compared to the old data before
loading and only new/updated data is loaded. However, an entire load is done annually
as a refresh.
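The snapshot comparison described for sub case 1 can be sketched as follows; the record layout is an assumption. The full extract lands in a staging area, and only rows that are new or changed relative to the previous snapshot are loaded into the target.

```python
# Illustrative only: keys and attributes are assumed, not from the case.
previous_snapshot = {
    "S-001": {"name": "Acme Ltd", "city": "Lyon"},
    "S-002": {"name": "Brite GmbH", "city": "Bonn"},
}
staging_area = {
    "S-001": {"name": "Acme Ltd", "city": "Lyon"},      # unchanged -> not loaded
    "S-002": {"name": "Brite GmbH", "city": "Berlin"},  # changed   -> loaded
    "S-003": {"name": "Corto SA", "city": "Madrid"},    # new       -> loaded
}

# Incremental load: compare the new extract against the previous snapshot and
# keep only the records that differ.
delta = {key: row for key, row in staging_area.items()
         if previous_snapshot.get(key) != row}
print(delta)   # only S-002 and S-003 are sent on to the target systems
```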
What are the problems?
Sub case 1 (sharing operational data between applications): The ERP system
has validation rules that are different to the target systems’ rules, so when the data is
transferred to the MDM system and subsequently to the target systems, the data may
have passed the ERP rules, but then fails the target system rules (so it is considered
to be inaccurate in the target). Note that the MDM system is configured to be flexible
to allow any data to be entered (with the aim of checking and cleaning the data later).
An ongoing problem, currently managed with a workaround, is that the connection to the ERP system drops while data is being transferred. This causes the
ETL to fail and hence data to be unavailable (an accessibility problem).
Sub case 2 (one-off system migrations): Stale data (timeliness) is a problem when
integrating data from external data providers because updates are only done on a
monthly basis. Also, the inconsistencies in character sets between the source and
target systems (representational consistency) have caused problems: in one case,
Spanish and French characters were dropped during the transfer and so the target
system contained erroneous data.
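The character-set problem reported for sub case 2 can be reproduced with a short sketch; the value and encodings are illustrative, as the actual systems and character sets involved are not named in the paper. A target that accepts only ASCII silently drops the accented Spanish and French characters, whereas agreeing a common encoding such as UTF-8 end to end preserves them.

```python
# Illustrative only: the sample value and the ASCII-only target are assumptions.
source_value = "Señora Muñoz ordered a crème brûlée"      # as held in the source system

# Lossy transfer: anything outside ASCII is dropped on the way to the target.
lossy = source_value.encode("ascii", errors="ignore").decode("ascii")
print(lossy)                                              # 'Seora Muoz ordered a crme brle'

# Consistent transfer: the same encoding is used at both ends, so nothing is lost.
roundtrip = source_value.encode("utf-8").decode("utf-8")
print(roundtrip == source_value)                          # True
```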
DISCUSSION AND CONCLUSION
The second column of Table II shows that all of the scenarios (A-E) reported by Gartner (see the Existing ETL Research section) were found in the cases. In terms of the data quality problems found, the most common problem is accessibility of data (see the summary of problems in Table II and Figure 2). Accessibility problems are caused in three ways: the ETL failing and
hence not delivering the data on time to the target system(s), the information systems
being locked during ETL execution, and users not being able to find data in the target.
The ETL could fail because of transformation problems (which include mostly data
validation rule failures) and dropped connections. Even with the simplest form of ETL
(the batch load), the ETL can fail. Failures are exacerbated by the fact that most of the
ETL processes are executed out of working hours and so failures are often not
discovered early. The information systems are locked when the ETL executes to allow
it access to read and write data. Hence, users must wait until the ETL releases the
lock before they can use the data. Another cause of accessibility problems is users not
being able to find data in the target system because of inconsistencies introduced into
the primary keys (by the ETL transformations) which are used to find data.
Figure 2: Frequencies of data quality problems occurring from ETL processes
Table II. Summary of the ETL case results

Case | What is it used for? | Why is it used? | How is it configured? | Why is it configured/used like this? | What are the problems?
1 | Reporting and analytics (A) | ERP does not have necessary reporting capabilities | Batch transfer, executed once per week | For simplicity and reliability; once per week because it takes 12+ hours to run | Stale data as the week progresses (timeliness); data not transferred (unknown ETL failure) (accessibility)
2 | Connecting operational systems (D) | The company has not found one IT system that can deal with all of the scenarios in procurement | Batch transfer, executed overnight or once per week | Staff are already unsure about the consequences of making changes, so want as simplistic an ETL process as possible | Locked systems, and users cannot find data because of the ETL introducing inconsistent keys (accessibility); staff are not confident in the data (believability)
3 | Reporting and analytics (A) and MDM (B) | Customer data is found in multiple source systems | Incremental load (once per day and overnight); batch load on Sunday | Incremental load keeps data up to date, and the batch load is needed to correct data inconsistencies that occur during the week | Data fails ETL validation rules and leaves users without data in the target (accessibility)
4 | Transferring data to a data warehouse (A) | To integrate data and provide a target system on which to run simulations | Batch transfer, once per week overnight (takes approx. 12 hours) | No need to have the data more frequently | Incorrectly mapped values (accuracy)
5 | Reporting and analytics (A) | To integrate data | Incremental load daily (local level) and incremental load once per week (state level) | Slow network infrastructure with charges linked to amount of data transferred; minimises the data volume exposed for any data breaches/releases | Changes in data quality rules render data in the target inaccurate (accuracy) and cause the ETL to fail (accessibility)
6 (sub case 1) | Connecting operational systems (D) and MDM (B) | To unburden the ERP system and transfer data into MDM software | Batch transfers (full extract and incremental load) | No need to have the data more frequently, and avoids inefficient data transfers | Data fails target system validation rules (accuracy); dropped connections during ETL leave data inaccessible (accessibility)
6 (sub case 2) | Data migrations (C) and inter-enterprise data sharing (E) | It is a simple way to transfer data in bulk | Batch transfers | Batch is the only possible option (external data is not available in real-time) | Infrequent data updates (timeliness); inconsistencies in character sets between the source and target (representational consistency)
Inconsistencies also cause accuracy problems. These accuracy problems can be
introduced by ETL processes either because the ETL transforms incorrectly map the
data or because of differences in data validation rules between source and target
systems. These can also lead to believability problems when users lose confidence in
the data in the target system, or representational consistency problems with
incorrectly mapped characters (for example, in different languages).
Timeliness problems are also caused by the ETL processes when they cannot be
executed frequently enough to refresh the data as required. However, these are only
problematic if the data is actually needed before the next data refresh. ETL solutions
should therefore consider the refresh rate needed before developing solutions that
attempt to refresh data more often than needed.
ACKNOWLEDGMENTS
The authors would like to thank all the respondents for their time and willingness to share their experiences.
REFERENCES
Y. Cui and J. Widom. 2003. Lineage tracing for general data warehouse transformations. VLDB J. 12, 1
(2003), 41–58. DOI:http://dx.doi.org/10.1007/s00778-002-0083-8
Thomas H. Davenport and Jeanne G. Harris. 2007. Competing on Analytics: The New Science of Winning,
Harvard Business School Press.
Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian-Augustin Saita. 2001.
Declarative Data Cleaning: Language, Model, and Algorithms. In Proceedings of the 27th
International Conference on Very Large Data Bases. VLDB ’01. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 371–380.
Ralph Kimball and Joe Caserta. 2004. The Data Warehouse ETL Toolkit: Practical Techniques for
Extracting, Cleaning, Conforming, and Delivering Data 1 edition., Wiley.
Albert Maier, Martin Oberhofer, and Thomas Schwarz. 2012. Industrializing Data Integration Projects
using a Metadata Driven Assembly Line. It - Inf. Technol. 54, 3 (May 2012), 114–122.
DOI:http://dx.doi.org/10.1524/itit.2012.0671
Jasna Rodic and Mirta Baranovic. 2009. Generating data quality rules and integration into ETL process.
In ACM Press, 65. DOI:http://dx.doi.org/10.1145/1651291.1651303
A. Simitsis, P. Vassiliadis, and T. Sellis. 2005. Optimizing ETL Processes in Data Warehouses. In 21st
International Conference on Data Engineering, 2005. ICDE 2005. Proceedings. 564–575.
DOI:http://dx.doi.org/10.1109/ICDE.2005.103
A. Simitsis, K. Wilkinson, U. Dayal, and M. Castellanos. 2010. Optimizing ETL workflows for fault-
tolerance. In 2010 IEEE 26th International Conference on Data Engineering (ICDE). 385–396.
DOI:http://dx.doi.org/10.1109/ICDE.2010.5447816
K. Smith, L. Seligman, and V. Swarup. 2008. Everybody Share: The Challenge of Data-Sharing Systems.
Computer 41, 9 (September 2008), 54–61. DOI:http://dx.doi.org/10.1109/MC.2008.387
Christian Thomsen and Torben Bach Pedersen. 2005. A Survey of Open Source Tools for Business
Intelligence. In A.Min Tjoa & Juan Trujillo, eds. Data Warehousing and Knowledge Discovery.
Lecture Notes in Computer Science. Springer Berlin Heidelberg, 74–84.
E. Thoo and L. Randall. 2015. Gartner research report: Magic Quadrant for Data Integration Tools.
Panos Vassiliadis. 2009. A Survey of Extract-Transform-Load Technology. Int. J. Data Warehous. Min.
IJDWM 5, 3 (2009), 1–27.
R.Y. Wang and D.M. Strong. 1996. Beyond Accuracy: What Data Quality Means to Data Consumers. J.
Manag. Inf. Syst. 12, 4 (1996), 5–34.
Robert K. Yin. 2009. Case Study Research: Design and Methods, SAGE.