Big Data Quality: A Survey
Ikbal Taleb
CIISE
Concordia University
Montreal, QC, Canada
i_taleb@live.concordia.ca
Mohamed Adel Serhani
College of Information Technology
UAE University
Al Ain, UAE
serhanim@uaeu.ac.ae
Rachida Dssouli
CIISE
Concordia University
Montreal, QC, Canada
rachida.dssouli@concordia.ca
Abstract—With the advances in communication technologies and the high amount of data generated, collected, and stored, it becomes crucial to manage the quality of this data deluge in an efficient and cost-effective way. Storage, processing, privacy, and analytics are the key challenging aspects of Big Data that require quality evaluation and monitoring. Quality has been recognized by the Big Data community as an essential facet of its maturity. Moreover, it is a crucial practice that should be implemented at the earliest stages of the lifecycle and progressively applied across the other key processes. The earlier quality is incorporated, the more benefit can be drawn from the resulting insights. In this paper, we first identify the key challenges that necessitate quality evaluation. We then survey, classify, and discuss the most recent work on Big Data management. Consequently, we propose an across-the-board quality management framework describing the key quality evaluation practices to be conducted through the different Big Data stages. The framework can be used to leverage quality management and to provide a roadmap for data scientists to better understand quality practices, highlighting the importance of managing quality. We finally conclude the paper and point to some future research directions on the quality of Big Data.
Keywords—Big Data, Data Quality, Quality Management framework, Quality of Big Data.
I. INTRODUCTION
Big Data (BD) has become a very attractive area of research and development for both academia and industry. With the spread of broadband Internet everywhere and the large number of services that emerged recently (VoD, cloud storage and services, data clusters), a huge amount of data is generated every day, reinforcing the Big Data era. Many IT professionals, researchers, scientists, and companies are working heavily to define, describe, and analyze the new challenges and the possible technologies and approaches that might be used to address them. Exploring existing technologies and platforms, data scientists are processing and analyzing this huge amount of data to produce relevant insights that might have a big impact on society and human wellbeing, for instance, predicting market growth, tracking and isolating infectious diseases, managing road traffic, and forecasting the weather. However, traditional tools, techniques, and algorithms used for traditional datasets are no longer suitable, since Big Data is dynamic and continuous in nature, takes various formats, is largely unstructured, and is of very large size. Therefore, it is important to adapt, rewrite, or redesign these tools and algorithms from scratch to respond to the new data characteristics and related challenges.
In Big Data, data originates in different forms and from multiple sources, and must be cleaned, filtered, processed, integrated, merged, partitioned, transported, sketched, and stored. All these steps are executed in real time, in batch, or in parallel, and preferably on the cloud. While it is well known that, in theory, more high-quality data leads to better predictive power and overall insight, this raw data must be channeled through a quality assessment in the pre-processing phase, in which activities such as data cleansing, de-duplication, compression, filtering, and format conversion take place. This mandatory step is essential to refine and valuate the data and ensure its quality.
In order to keep track of data value and relevance, as well as the severity of the impact of the aforementioned pre-processing and processing transformations, the concept of data quality is of paramount importance. Moreover, the nature of targeted data, such as data generated from social networks, which is unstructured with no quality references, suggests that data must be profiled and provided with certain quality information at the inception phase. This also means that the quality of data attributes must be assessed, improved, and controlled all along the lifecycle, as it directly impacts the results of the analysis phase.
Data quality is a well-known concept within the database community and has been an active area of research for many years [1], [2]. However, a direct application of these quality concepts to Big Data faces severe challenges in terms of time and cost of data pre-processing. The problem is exacerbated by the fact that these techniques were developed for well-structured data. Big Data exhibits new characteristics that make its quality assessment very challenging. The variety of Big Data brings complex data structures, which increases the difficulty of its quality evaluation. Also, the high volume of Big Data requires time and resources for processing, which strongly affects the process of its quality evaluation. In addition, the variability, velocity, and volatility features introduce new challenges in managing and assessing the quality of Big Data, given the speed at which it is generated and fluctuates. To the best of our knowledge, no standard quality management framework for Big Data has emerged yet. Most existing work on Big Data quality management is still under investigation and has not reached a good level of maturity. Past work in the database community cannot be fully adopted as is because of the aforementioned new Big Data challenges. However, some quality assessment practices can be readapted to cope with these new issues.
In this context, the data quality model should be developed to follow key Big Data concepts such as the origin, domain, nature, format, and type of the data it is applied to. A proper management of these quality schemes is essential when dealing with large datasets. In addition, existing Big Data architectures do not support quality management processes, and the few existing initiatives are still limited to specific application domains and scopes. Moreover, the evaluation and estimation of quality must be handled in all the lifecycle phases, from data inception to analytics. This evaluation is crucial to provision value-added services and achieve the Big Data vision. Quality measurement, assessment, enforcement, monitoring, and adaptation are the key quality processes that illustrate what Big Data quality management means.
The rest of the paper is organized as follows: the next section introduces Big Data and data quality foundations, definitions, characteristics, and lifecycle. Section 3 introduces a holistic quality management model for the Big Data value chain. Section 4 surveys and classifies the most important research works on Big Data quality evaluation and management. Section 5 identifies the main challenges and the open research directions in Big Data quality management in general and along the quality management sub-processes. Finally, the last section concludes the paper with ongoing and challenging directions.
II. BIG DATA AND DATA QUALITY FOUNDATIONS
According to IBM [3], Gartner [4], [5], McKinsey [6], and [7]–[9], huge amounts of data are generated every day, representing 2.5 quintillion bytes (1 Exabyte (EB) = 10^18 bytes). In the year 2000, 800,000 Petabytes (1 PB = 10^15 bytes) of data were stored. Twenty years later, in 2020, this number is expected to reach 35 Zettabytes (1 ZB = 10^21 bytes) [10]. This exponential increase in stored data originated from Web search companies such as Google and Yahoo, which had to query very large distributed aggregations of loosely structured data sources. Moreover, application domains including Facebook, Amazon, Twitter, YouTube, Internet of Things sensors, and mobile smart phones are the main actors and data generators. The amount of data they generate daily ranges from 5 to 10 Terabytes (1 TB = 10^12 bytes).
A. Big Data
To define Big Data, we must trace its evolution through the years while linking it to its characteristics. As the name implies, it was initially about the large size of data files that could not be handled by traditional databases [1]. The term was then extended to cover the difficulty of analyzing these data using traditional software algorithms. Big Data now refers to the whole value chain that includes several stages: data generation, collection, acquisition, transportation, storage, preprocessing, processing, analytics, and visualization. The insights that can be extracted from this chain come from the continuous growth of data exploited using new techniques and new architectures.
1) Definition: there is no clear and final definition of Big Data according to many references such as [8], [9], [11], [12]. It is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making. Big Data is used to describe a massive volume of both structured and unstructured data; therefore, it is difficult to process using traditional database and software techniques. It also refers to the technologies and storage facilities that an organization requires to handle and manage the large amounts of data that derive from multiple sources.
2) Origins: the data originates from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos uploaded to media portals, purchase transaction records, and cell phone GPS signals, to name a few. The gigantic volume of data is not, however, the only characteristic to consider.
B. Characteristics
In 2011, in the early days of Big Data, a McKinsey Global Institute report [6] identified three main original dimensions that distinguish it from any other data concept: Volume, Velocity, and Variety, also called the 3 V's, as illustrated in Figure 2. These characteristics are not limited in number, because Big Data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make useful decisions. Lately, the number of dimensions has increased to 4, 7, and even 10 V's [13]–[15].
Figure 2. Big Data Original 3 V’s
C. Big Data Lifecycle
As mentioned earlier and depicted in Figure 3, the Big Data ecosystem is organized as a value chain lifecycle from data inception to visualization. In the following, we give a brief description of the main stages of the Big Data lifecycle.
Figure 3. Big Data Lifecycle
1) Data Generation/Inception: the phase where data is created; many data sources are responsible for this data: electrophysiology signals, sensors used to gather climate information, surveillance devices, posts to social media sites, videos and still images, and transaction records, to name a few.
2) Data Acquisition: consists of data collection, data transmission, and data pre-processing [12], [16].
o Data Collection: the data is gathered in specific data formats from different sources: real-world measurements using sensors and RFID, or data from any source using a specifically designed script to crawl the web.
o Data Transport: the collected data is transferred to storage data centers over interconnected networks.
o Data Pre-Processing: consists of the typical pre-processing activities such as data integration, enrichment, transformation, reduction, and cleansing.
3) Data Storage: the data center infrastructure where the data is stored and distributed among several clusters and data centers spread geographically. The storage systems ensure several fault-tolerance levels to achieve reliability and efficiency.
4) Data Processing & Analytics: application of Data Mining
algorithms, Machine Learning, Artificial Intelligence and
Deep Learning to process the data and extract useful insight
for better decision making. Data scientists are the most
expected users of this phase since they have the expertise to
apply what needed on what must be analyzed.
5) Data Visualization: the best way of assessing the value of
processed data is to examine it visually and taking decision
accordingly. Application of visualization methods in Big
Data is of an importance as it closes the loop value chain.
D. Data Quality
Most studies in the area of Data Quality (DQ) come from the database management research community [1], [2]. According to [17], data quality is not easy to define, and its definitions are data-domain aware. In general, there is a consensus that data quality always depends on the quality of the data source [18].
1) Definition: it is recognized that DQ has many definitions related to the context, domain, area, or field in which it is used [19], [20]. DQ is understood differently in academia than in industry. In [21], the authors summarized data quality from the best-known and most used definitions of the ISO 25012 standard. In the literature, data quality is commonly defined as “fitness for use”. In [19], data quality is defined as the appropriateness for use or meeting user needs.
2) Data Quality Dimensions (DQD's): according to [19], [22], [23], a DQD offers a way to measure and manage data quality. There are several quality dimensions, each of which is associated with specific metrics. DQD's usually fall into four categories, illustrated in Figure 4: intrinsic, contextual, representational, and accessibility [23]–[27]. For instance, the contextual dimensions are related to the context of the information, while the intrinsic dimensions refer to objective and native data attributes. Examples of intrinsic DQD's include:
Accuracy: measures whether data was recorded correctly and reflects realistic values.
Timeliness: measures whether data is up to date, in terms of data currency and volatility [28].
Consistency: measures whether data agrees with its format and structure, mostly the respect of data constraints.
Completeness: describes whether all relevant data are recorded. It measures missing values for an attribute.
Figure 4. Data Quality Dimensions (Characteristics)
3) Data Quality Metrics: each DQD needs to be quantified and measured. DQD metrics represent the steps to evaluate these dimensions. From simple formulas to more complex multivariate expressions, the metrics give the measurability property to DQD's. For example, the computation of missing values for an attribute is considered the measure to rate the DQD completeness [29]–[31].
4) Data Quality Evaluation: following a data-driven strategy requires handling quality evaluation on the already generated data. Hence, it is mandatory to measure and quantify the DQD's. For structured or semi-structured data, the data is available as a set of attributes represented in columns or rows, whose values are recorded. Any data quality metric should specify whether the values of the data respect the quality attributes (dimensions). The author in [32] noted that data quality measurement metrics tend to produce binary results, correct or incorrect (0% and 100%, respectively), and use universal formulas to compute these attributes applied to quality dimensions (e.g., accuracy). The measurements generate DQD scores using their related metrics (e.g., the accuracy score is the number of correct instance values divided by the total number of instances of the measured attributes). A minimal sketch of such a computation is shown below.
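To make these metric definitions concrete, here is a minimal, illustrative sketch (not taken from the surveyed works) that scores the completeness and accuracy dimensions of a small tabular dataset with pandas. The column names, the validity rule, and the helper functions are hypothetical and only serve to show how a DQD score can be computed as a ratio over attribute values.

```python
import pandas as pd

# Hypothetical sample with missing and out-of-range values.
df = pd.DataFrame({
    "age":  [34, None, 29, 131, 45],      # 131 is not a realistic age
    "city": ["Al Ain", "Montreal", None, "Dubai", "Abu Dhabi"],
})

def completeness(series: pd.Series) -> float:
    """Share of non-missing values for one attribute (DQD: completeness)."""
    return series.notna().mean()

def accuracy(series: pd.Series, is_valid) -> float:
    """Share of recorded values that satisfy a validity rule (DQD: accuracy)."""
    recorded = series.dropna()
    if recorded.empty:
        return 0.0
    return recorded.map(is_valid).mean()

scores = {
    "completeness(age)":  completeness(df["age"]),
    "completeness(city)": completeness(df["city"]),
    # Assumed validity rule: a realistic human age lies in [0, 120].
    "accuracy(age)":      accuracy(df["age"], lambda v: 0 <= v <= 120),
}
print(scores)  # {'completeness(age)': 0.8, 'completeness(city)': 0.8, 'accuracy(age)': 0.75}
```

The same ratio-based pattern generalizes to other dimensions; only the validity rule and the attribute under measurement change.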
E. Big Data Quality Evaluation
The importance of data quality in the Big Data lifecycle redefines the way data supervision is handled. Managing data quality involves adding more functionalities to each stage, with ongoing quality control and monitoring to avoid quality failures during all phases of the lifecycle. Big Data quality evaluation is concerned with properties such as performance, value, and cost. In the next section, we give more details on how quality must be managed and assessed in the Big Data lifecycle.
F. Big Data Quality Issues
Data quality issues take place when quality requirements are not met on data values [33]. These issues are due to several factors or processes occurring at different levels: 1) the data sources: unreliability, trust, data copying, inconsistency, multiple sources, and data domain; 2) the generation level: human data entry, sensor device readings, social media, unstructured data, and missing values; and 3) the process and/or application level (acquisition: collection, transmission). Data pre-processing improves data quality by executing many tasks and activities such as data transformation, integration, fusion, and normalization. The authors in [21], [34] enumerate many causes of poor data and come up with a list of elements that affect data quality and its related dimensions. In [19], [34], the authors compiled poor data causes classified by DQD, granularity level, and data source type, while highlighting the causality mapping between them.
III. A HOLISTIC QUALITY MANAGEMENT MODEL FOR BIG DATA
In Big Data, data management is inseparable from quality management. Therefore, we need to identify the quality issues and requirements at each stage of the lifecycle. To ensure a high-quality value chain, an improvement procedure is inevitable and should be built into each process of the lifecycle. The goal is that quality management activities should be undertaken without adding extra communication, processing, and cost overhead on the different Big Data ecosystem layers.
To initiate any quality project for Big Data, a set of parameters and concepts must be enumerated to identify the processes involved and to characterize the data and workflow types used as crucial inputs for its management. Two strategies are pursued in data quality: the data-driven strategy mentioned earlier in the paper, and the process-driven strategy. The data-driven approach acts on the data itself and assesses its quality with the objective of augmenting it. The process-driven approach is a predictive approach that focuses on the quality of the process used to generate or manipulate data.
In Figure 5, we propose a holistic quality management model that captures important quality aspects and explores how to deal with Big Data quality throughout its lifecycle. We identified the processes that must handle and address data quality problems and provide quality assessment schemes to ensure effective management. The most important stages, in this respective order, are: (1) data creation, (2) data source, (3) data collection, (4) data transport, (5) data storage, (6) data preprocessing, (7) processing and analytics, and (8) visualization. Moreover, Big Data lifecycle stages and their related quality information and processes must be addressed to achieve an end-to-end, quality-management-driven Big Data lifecycle.
In the following, we highlight the importance of quality
requirements considerations, quality implementation and
enforcement across different processes of the Big Data value
chain. We also describe quality propagation across different
processes of the value chain and the level of coordination
between these processes for the sake of supporting quality
management. Finally, we illustrate continuous improvement of
quality through loopbacks and inter-process interactivity.
A. Quality enabled data acquisition
Data acquisition is handled in two stages: data inception and data collection. Data sources (2 in Figure 5), however, relate to already existing data, generally stored in many formats. Before addressing Big Data quality at its origins, we first need to examine the existing data accumulated in the past and build a knowledge base from it to better predict and redesign its creation for Big Data. In this case, the data design must be an iterative process that comprises data inspection and data quality evaluation (usually following a data-driven approach when using existing data) to enhance the processes that create high-quality data ready to be used by the processes of the Big Data lifecycle.
A process-driven strategy is adopted in the data creation phase, where quality constraints are set to eliminate bad data before it is created. Moreover, after data is created, a collection process is initiated to gather the data into a more structured and organized format, thus making it useful data. The set of techniques used to reorganize the data themselves ensure and guarantee its quality. The data collection processes are considered data management processes that can positively impact data quality if they use proper data science methods and techniques [35].
The frameworks and tools used to design data vary from traditional database systems to more Big Data-related systems such as NoSQL databases (MongoDB, Cassandra, Hadoop HBase, and other associated Hadoop ecosystem tools). A database design based on NoSQL discourages complicated tabular structure queries and leads to the use of connector tools.
B. Quality-aware data transport and storage
For the data transport phase, it is very important to support QoS, considering the requirements that should be met during service provisioning. Ensuring data quality is more specifically related to the underlying networks used to transmit data, and the security measures that guarantee the transmission of data between multiple points without data losses or corruptions. The quality of data networks is based on clients' QoS requirements, providers' QoS offerings, and real-time QoS measurements.
Figure 5. A Holistic Quality Management Model for Big Data Value Chain
The primary goal of QoS is to provide priority, including dedicated bandwidth, controlled jitter and latency (e.g., for Big Data real-time stream processing), and improved loss characteristics. Moreover, SLAs and proactive and reactive provisioning policies should be considered by providers to deliver the required QoS.
Data storage is handled for Big Data quality through data distribution and replication. For example, storage using the Hadoop ecosystem relies on several nodes that duplicate data to avoid any catastrophic data loss and ensure continuity when failures happen. Moreover, data storage quality relies on the storage medium's I/O performance for several types of Big Data. Reading and writing ratios must follow the I/O requirements of each data type, e.g., HD video data and stream processing.
C. Quality driven data preprocessing
Preprocessing represents the last remedy for quality management in Big Data. It is the process of cleansing, integrating, normalizing, and transforming data to improve its quality before it is consumed by the processing stage. It follows a data-driven approach that relies heavily on data values. For example, data cleansing relies on completeness and consistency. The quality settings (e.g., requirements, baseline/model, and quality report) are essential for the process to achieve data quality improvements. They identify what quality is expected from the data, its dimensions, and its scores. The model or baseline represents the basic quality requirement bounds and can be adjusted based on the quality evaluation process and the matching of quality with its requirements. Finally, the quality report, similarly to data provenance, records the data path from its inception to its final analytics stage. The quality report content is iteratively updated and augmented with new parameters to serve as a Big Data quality repository.
Many data pre-processing tools have emerged; in addition, other existing tools used in the database domain have been updated to handle Big Data. Mostly, Hadoop- and Spark-based preprocessing tools and frameworks are widely used (e.g., Talend Open Studio for Big Data [36] and OpenRefine [37]). A simple data-driven cleansing step of the kind these tools automate is sketched below.
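As an illustration only, the following PySpark sketch shows a quality-driven pre-processing pass of the kind described above: it removes duplicates, drops rows that violate a completeness requirement, and computes per-column completeness scores that could feed a quality report. The file path, column names, and the 0.95 completeness baseline are hypothetical assumptions, not taken from the surveyed tools.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession.builder.appName("quality-driven-preprocessing").getOrCreate()

# Hypothetical raw input; schema and path are assumptions for this sketch.
raw = spark.read.csv("hdfs:///raw/events.csv", header=True, inferSchema=True)

# Data-driven cleansing: de-duplication and removal of rows missing key attributes.
cleaned = (raw
           .dropDuplicates()
           .na.drop(subset=["user_id", "timestamp"]))

# Completeness score per column = non-null values / total rows.
total = cleaned.count()
null_counts = cleaned.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in cleaned.columns]
).first().asDict()
completeness = {c: 1.0 - null_counts[c] / total for c in cleaned.columns}

# A simple quality report entry; 0.95 is an assumed baseline requirement.
quality_report = {
    "stage": "pre-processing",
    "rows_in": raw.count(),
    "rows_out": total,
    "completeness": completeness,
    "meets_baseline": all(score >= 0.95 for score in completeness.values()),
}
print(quality_report)
```

In a full pipeline, such a report entry would be forwarded to the processing stage, matching the role of the quality report described above.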
D. Quality enabled data processing and analytics
Processing and analytics take advantage of the quality assessments undertaken in the previous quality assessment activities. In the processing stage, quality management consists of validating that the preprocessed data conforms to the processing quality requirements, in addition to the quality evaluation of the processing algorithms (e.g., Hadoop-based processing). Analytics, however, consists of assessing different analytics schemes and algorithms (e.g., deep learning) to ensure the validity of the quality evaluation before visualization of the insights retrieved from analytics.
As for Big Data preprocessing, the processing frameworks are Hadoop and Spark related. In [38], many frameworks and methods for the preprocessing and processing of Big Data have been detailed and compared.
E. Quality enabled data visualization
This stage consists of aggregating the visualization tools' quality requirements, the quality report, and the quality of the diverse visualization tools. It also assesses whether the visualized data leads to accurate decisions. This stage is very important as it is the last process closing the quality assessment loop in the Big Data value chain, and it triggers improvement recommendations for continuous quality monitoring. Many visualization tools for Big Data are used, such as Google Charts, Tableau, and Datawrapper.
F. Quality propagation and continuous quality improvement
The quality management model proposed in Figure 5 illustrates how quality management activities propagate across the different processes of the Big Data value chain. At each stage, quality requirements are taken into consideration, quality measurements are reflected, and a quality report is generated. This quality report is forwarded to the next stage of the value chain to track, validate, enrich, and enhance the quality of both the data and the process. Continuous quality improvement is possible through loopbacks between processes to monitor, revise, and enhance quality whenever required. This is a very important feature that ensures accuracy and improvement of quality assessment across the Big Data value chain. A minimal sketch of such a propagated quality report is given below.
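Purely as an illustration of this propagation idea (not a structure defined in the surveyed works), the following sketch models a quality report that each stage enriches and forwards, with a loopback flag raised when a measured score falls below its requirement. Stage names, dimensions, and thresholds are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class StageQuality:
    stage: str                      # e.g. "collection", "pre-processing"
    scores: Dict[str, float]        # DQD scores measured at this stage
    requirements: Dict[str, float]  # minimum acceptable score per DQD

    def violations(self) -> List[str]:
        """Dimensions whose measured score misses the requirement."""
        return [d for d, req in self.requirements.items()
                if self.scores.get(d, 0.0) < req]

@dataclass
class QualityReport:
    history: List[StageQuality] = field(default_factory=list)

    def record(self, entry: StageQuality) -> bool:
        """Append a stage entry; return True if a loopback is needed."""
        self.history.append(entry)
        return bool(entry.violations())

# Example propagation across two stages (values are illustrative only).
report = QualityReport()
report.record(StageQuality("collection",
                           scores={"completeness": 0.97, "accuracy": 0.92},
                           requirements={"completeness": 0.95, "accuracy": 0.90}))
needs_loopback = report.record(StageQuality("pre-processing",
                           scores={"completeness": 0.99, "consistency": 0.85},
                           requirements={"completeness": 0.95, "consistency": 0.90}))
print(needs_loopback)  # True: consistency misses its requirement, triggering a loopback
```

The accumulated history plays the role of the quality repository mentioned earlier, while the boolean return drives the loopback between processes.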
IV. BIG DATA QUALITY: RESEARCH CLASSIFICATION
As shown in Figure 6, we have selected numerous papers that mainly address and debate data quality in Big Data. Some of them go deeply into specific quality properties, while others address Big Data quality issues and propose solutions based on quality assessment and improvement using diverse techniques. The selected literature reveals that data quality has been addressed from diverse perspectives (e.g., the data, processes, applications, and its management), with a focus on quality property assessment, quality evaluation processes, and quality models related to Big Data.
The goal of this classification is to extract the major research trends related to Big Data quality, identify what has been addressed so far in its quality management, and what needs further exploration to reach its full potential. We also project these findings to draw the path towards future research directions. We come up with 10 classes (I to X) that identify the key research trends in Big Data quality. These categories are also grouped into 5 main clusters (A to E) that define the most important areas of interest tied to data quality and Big Data.
A. Big Data Value Chain (I-II-III)
In [35], [39]–[49], the authors emphasized many data quality issues and their effects in Big Data lifecycle stages such as preprocessing, processing, and storage. They mostly proposed a management process flow as a taxonomy to control the quality of pre-processing and processing, for verification of data quality, data sources, and data formats before analytics. Others proposed a combined quality model to detect preprocessing defects and act upon them to correct task flaws.
For data processing, some studies leveraged analytics techniques, machine learning, and classification practices and evaluated their suitability for Big Data. They then analyzed the effects of data size on the accuracy of these methodologies. Others proposed to adapt these techniques to handle the Big Data characteristics and tackle different processing quality issues.
As storage is crucial for Big Data, authors addressed multi-provider storage and its impact on the performance and efficiency of pre-processing and processing when high data distribution across multiple cloud providers is adopted. Most of the previous works addressed quality in Big Data in an ad hoc manner, without following a comprehensive model that considers the quality characteristics, the processes, and the underlying infrastructure. Such a model would assure end-to-end quality management in the value chain. Most of the works addressed Big Data stages separately, without leveraging their effects throughout the whole lifecycle.
B. Big Data Management & Characteristics (IV-V)
In this category, two important aspects of Big Data were targeted in [31], [35], [39]–[41], [44], [46]–[59]: its characteristics (V's) and its management. Most of the authors agreed that the V's represent a significant aspect of Big Data and have a high impact on data quality when it is not managed in an efficient way. They typically linked the V's with DQD's in order to find correlations and their impact on each other. Others emphasized that scalability, integrity, and resource optimization are highly proportional to the Big Data V's, as these represent the key elements for its management solutions.
Regarding Big Data management, authors surveyed and proposed management models tackling storage, pre-processing, and processing. Moreover, an up-to-date review of techniques and methods for each process involved in management was carefully conducted. Managing Big Data requires a mapping of its value chain stages with its related processes and sub-processes. The importance of quality in Big Data management was generally not addressed. Such a framework with end-to-end quality management and enforcement is very challenging.
C. Big Data Problems & Data Quality Issues (VI)
Combining Big Data problems with data quality issues is justified by the strong relationship between these two concepts. Any data quality issue will be reflected in the analytics. The authors of the selected literature [25], [31], [41], [43], [45], [46], [48], [49], [51]–[56], [58]–[60] stressed that it is very important to discover quality issues and map them to Big Data problems in the lifecycle as early as possible. This helps isolate and adapt the processes that must handle both concerns. Most data quality issues have been addressed heavily by the research community, yet they have still not been adapted to Big Data. Further discussions considered the DQD's and the V's to be proportionally related.
D. Data Quality (VII-VIII-IX):
Data quality has been fully investigated in the following works [25], [31], [39], [41]–[46], [48]–[59], [61], which confirm its importance for Big Data. Most authors consider the DQD's a crucial model to use. They developed quality models for Big Data based on DQD's mapped with the V's to address scalability and reliability issues. Others addressed the problem of choosing DQD's and metrics for unusual data such as images, binary data, and unstructured data. They extracted features and combined many DQD's to measure a quality dimension score. There are many challenges that need to be tackled to come up with a quality assessment scheme for Big Data. The authors listed many aspects that must be considered, e.g., accuracy, consistency, provenance, and uncertainty.
Accordingly, there is no complete reference model for data quality and its management in Big Data. DQM must ensure data conformity and follow all the steps from data inception to quality assessment. In other words, new technological and decision-making challenges are making DQM applications more complicated.
E. Big Data Applications and Quality Improvements (X)
Big Data applications are value chain-based applications that follow the stages from data creation to visualization. Some authors focus on how to manage the quality within these applications by evaluating metrics for resource management such as storage and processing. Others proposed solutions to enhance the quality of the data by applying cleansing tasks and activities that are part of pre-processing (e.g., BigDansing and NADEEF) [51], [53], [56], [68]. Data quality consists of assessing quality dimensions, along with combining all the above to aggregate one quality score that reflects the quality of Big Data lifecycle applications.
V. DISCUSSION AND FUTURE DIRECTIONS
Ensuring quality is recognized as one of the most challenging issues in the Big Data era. Current approaches and solutions that have emerged from both academia and industry to tackle quality have not yet reached a convincing level of maturity. Evaluating the importance of assessing the quality of Big Data versus the value it generates for its users (e.g., governments, businesses) is of paramount importance. In addition, following a well-studied data quality management plan, using the right assessment scheme, adopting the appropriate quality measurement approaches, and utilizing suitable tools and platforms to conduct the different quality evaluation activities will all together help achieve high-quality assessment results. Furthermore, addressing quality across the Big Data value chain enforces end-to-end quality evaluation and management and leads to better quality improvements. Finally, evaluating the overhead of quality assessment guarantees cost-effective quality management processes.
Based on the above, future research directions in Big Data
quality should be geared towards the development of solutions
that consider the following:
a) Assessment of quality as early as possible, and its end-to-end integration across the lifecycle.
b) Implementation of continuous quality improvement and
enforcement mechanisms in Big Data quality management.
c) Specification of Big Data Quality metrics that should cope
with the data dynamic nature and its unconventional
characteristics.
d) Development of new quality dimensions with specific measurement attributes for unstructured and schema-less data.
e) Enforcement of quality requirements, and generation of quality reports and feedback to support assessment activities.
f) Development of more online, automated, real-time dashboards for Big Data quality monitoring.
g) Application of a higher degree of statistical proof in the different Big Data quality evaluation processes, including sampling, regression, correlation, and matching.
h) Development of effective quality outcomes prediction.
i) Evaluation of the quality of a representative set of data samples, then generation of a quality model to apply to the whole data. This gives a glimpse of the data quality and allows the quality results to be applied to all the data [45], [47]; a minimal sampling sketch is given after this list.
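To illustrate direction (i) only, here is a minimal sketch (with assumed column names, missing-value rate, and sample fraction) that estimates completeness on a random sample and reuses the estimate as a stand-in quality model for the whole dataset, reflecting the trade-off between assessment cost and coverage discussed in [45], [47].

```python
import numpy as np
import pandas as pd

def estimated_completeness(df: pd.DataFrame, frac: float = 0.1, seed: int = 42) -> dict:
    """Estimate per-column completeness from a random sample instead of the full data."""
    sample = df.sample(frac=frac, random_state=seed)
    return {col: float(sample[col].notna().mean()) for col in sample.columns}

# Synthetic stand-in for a large dataset: 100,000 rows with ~5% missing readings.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "sensor_id": rng.integers(0, 1000, size=100_000),
    "reading": np.where(rng.random(100_000) < 0.05, np.nan, rng.random(100_000)),
})

# Quality model built on a 10% sample, then applied as the estimate for the whole data.
quality_model = estimated_completeness(data, frac=0.1)
print(quality_model)  # e.g. {'sensor_id': 1.0, 'reading': ~0.95}
```

The sampled estimate is only an approximation; its confidence depends on the sample size and the sampling design, which is precisely where the statistical rigor called for in direction (g) applies.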
Finally, it is worth mentioning that research work and solutions on Big Data quality are still in their preliminary phase, and there is much to do in their development and standardization. It is a multidisciplinary, complex, and multi-variant domain where new assessment and processing techniques, analytics algorithms, storage technologies, and processing platforms will play a great role in the development and maturation of this active research area. We anticipate that researchers from academia will contribute to the development of new Big Data quality approaches, algorithms, and optimization techniques that go beyond the traditional ones used in databases and data warehouses. Industry, meanwhile, will lead development initiatives for new platforms, solutions, and technologies that support end-to-end quality management within the Big Data lifecycle.
VI. CONCLUSION
Big Data has emerged as a new paradigm for handling huge, continuous, varying, and complex data. Its quality is the key to its acceptance and usefulness. Poor data quality might lead to severe consequences, losing the benefit of analyzing and exploring large-scale data sets in an appropriate way. Using conventional techniques to manage Big Data is no longer appropriate. Therefore, the design and application of efficient approaches to manage its quality is highly demanded. In this paper, we identified the key research challenges in Big Data quality and highlighted their importance. We then surveyed, classified, and discussed the most comprehensive research initiatives. Afterwards, we proposed a holistic Big Data quality management model that emphasizes the key quality assessment activities to be conducted across the value chain. Finally, we discussed the main tendencies in Big Data quality assessment and pointed to some future research directions. We are planning to further extend the scope of this work and describe in depth how quality assessment activities can be implemented in the context of a real Big Data project where quality matters.
REFERENCES
[1] P. Z. Yeh and C. A. Puri, “An Efficient and Robust Approach for Discovering Data Quality Rules,” in 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2010, vol. 1, pp. 248–255.
[2] F. Chiang and R. J. Miller, “Discovering data quality rules,” Proc. VLDB Endow., vol. 1, no. 1, pp. 1166–1177, 2008.
[3] “IBM - What is big data?” [Online]. Available: http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html. [Accessed: 30-May-2016].
[4] “What Is Big Data? - Gartner IT Glossary - Big Data,” Gartner IT Glossary, 25-May-2012. [Online]. Available: http://www.gartner.com/it-glossary/big-data/. [Accessed: 30-May-2016].
[5] D. Laney, “The importance of ‘Big Data’: A definition,” Gart. Retrieved, vol. 21, pp. 2014–2018, 2012.
[6] J. Manyika et al., “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Glob. Inst., pp. 1–137, 2011.
[7] A. De Mauro, M. Greco, and M. Grimaldi, “What is big data? A consensual definition and a review of key research topics,” in AIP Conference Proceedings, 2015, vol. 1644, pp. 97–104.
[8] I. Emmanuel and C. Stanier, “Defining Big Data,” in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, New York, NY, USA, 2016, pp. 5:1–5:6.
[9] J. S. Ward and A. Barker, “Undefined by data: a survey of big data definitions,” ArXiv Prepr. ArXiv:1309.5821, 2013.
[10] P. Zikopoulos and C. Eaton, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data,” 2011.
[11] G. Press, “12 Big Data Definitions: What’s Yours?,” Forbes. [Online]. Available: https://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/. [Accessed: 29-Nov-2017].
[12] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward Scalable Systems for Big Data Analytics: A Technology Tutorial,” IEEE Access, vol. 2, pp. 652–687, 2014.
[13] P. Géczy, “Big data characteristics,” Macrotheme Rev., vol. 3, no. 6, pp. 94–104, 2014.
[14] M. Ali-ud-din Khan, M. F. Uddin, and N. Gupta, “Seven V’s of Big Data: understanding Big Data to extract value,” in American Society for Engineering Education (ASEE Zone 1), 2014 Zone 1 Conference of the, 2014, pp. 1–5.
[15] “Big Data Technology with 8 V’s,” M-Brain.
[16] M. Chen, S. Mao, and Y. Liu, “Big Data: A Survey,” Mob. Netw. Appl., vol. 19, no. 2, pp. 171–209, 2014.
[17] P. Oliveira, F. Rodrigues, and P. R. Henriques, “A Formal Definition of Data Quality Problems,” in IQ, 2005.
[18] M. Maier, A. Serebrenik, and I. T. P. Vanderfeesten, Towards a Big Data Reference Architecture. University of Eindhoven, 2013.
[19] F. Sidi, P. H. Shariat Panahy, L. S. Affendey, M. A. Jabar, H. Ibrahim, and A. Mustapha, “Data quality: A survey of data quality dimensions,” in 2012 International Conference on Information Retrieval Knowledge Management (CAMP), 2012, pp. 300–304.
[20] I. Caballero and M. Piattini, “CALDEA: a data quality model based on maturity levels,” in Third International Conference on Quality Software, 2003. Proceedings, 2003, pp. 380–387.
[21] M. Chen, M. Song, J. Han, and E. Haihong, “Survey on data quality,” in 2012 World Congress on Information and Communication Technologies (WICT), 2012, pp. 1009–1013.
[22] P. Glowalla, P. Balazy, D. Basten, and A. Sunyaev, “Process-Driven Data Quality Management: An Application of the Combined Conceptual Life Cycle Model,” in 2014 47th Hawaii International Conference on System Sciences (HICSS), 2014, pp. 4700–4709.
[23] C. Batini, C. Cappiello, C. Francalanci, and A. Maurino, “Methodologies for data quality assessment and improvement,” ACM Comput. Surv., vol. 41, no. 3, pp. 1–52, Jul. 2009.
[24] B. T. Hazen, C. A. Boone, J. D. Ezell, and L. A. Jones-Farmer, “Data quality for data science, predictive analytics, and big data in supply chain management: An introduction to the problem and suggestions for research and applications,” Int. J. Prod. Econ., vol. 154, pp. 72–80, 2014.
[25] B. Saha and D. Srivastava, “Data quality: The other face of Big Data,” in 2014 IEEE 30th International Conference on Data Engineering (ICDE), 2014, pp. 1294–1297.
[26] C. Cappiello, A. Caro, A. Rodriguez, and I. Caballero, “An Approach To Design Business Processes Addressing Data Quality Issues,” 2013.
[27] Y. Wand and R. Y. Wang, “Anchoring data quality dimensions in ontological foundations,” Commun. ACM, vol. 39, no. 11, pp. 86–95, 1996.
[28] W. Fan, F. Geerts, and J. Wijsen, “Determining the currency of data,” ACM Trans. Database Syst. (TODS), vol. 37, no. 4, p. 25, 2012.
[29] V. Goasdoué, S. Nugier, D. Duquennoy, and B. Laboisse, “An Evaluation Framework For Data Quality Tools,” in ICIQ, 2007, pp. 280–294.
[30] M. A. Serhani, H. T. E. Kassabi, I. Taleb, and A. Nujum, “An Hybrid Approach to Quality Evaluation across Big Data Value Chain,” in 2016 IEEE International Congress on Big Data (BigData Congress), 2016, pp. 418–425.
[31] D. Firmani, M. Mecella, M. Scannapieco, and C. Batini, “On the Meaningfulness of ‘Big Data Quality’ (Invited Paper),” in Data Science and Engineering, Springer Berlin Heidelberg, 2015, pp. 1–15.
[32] H. M. Sneed and K. Erdoes, “Testing big data (Assuring the quality of large databases),” in 2015 IEEE Eighth International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2015, pp. 1–6.
[33] C. Fürber and M. Hepp, “Towards a Vocabulary for Data Quality Management in Semantic Web Architectures,” in Proceedings of the 1st International Workshop on Linked Web Data Management, New York, NY, USA, 2011, pp. 1–8.
[34] N. Laranjeiro, S. N. Soydemir, and J. Bernardino, “A Survey on Data Quality: Classifying Poor Data,” in 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC), 2015, pp. 179–188.
[35] A. Siddiqa et al., “A survey of big data management: Taxonomy and state-of-the-art,” J. Netw. Comput. Appl., vol. 71, pp. 151–166, Aug. 2016.
[36] “Big Data: Talend Big Data Integration Products & Services.” [Online]. Available: https://www.talend.com/products/big-data/. [Accessed: 30-Jan-2018].
[37] “OpenRefine.” [Online]. Available: http://openrefine.org/. [Accessed: 30-Jan-2018].
[38] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, “Big data preprocessing: methods and prospects,” Big Data Anal., vol. 1, no. 1, Dec. 2016.
[39] S. Li et al., “Geospatial big data handling theory and methods: A review and research challenges,” ISPRS J. Photogramm. Remote Sens., vol. 115, pp. 119–133, May 2016.
[40] A. Oussous, F.-Z. Benjelloun, A. Ait Lahcen, and S. Belfkih, “Big Data technologies: A survey,” J. King Saud Univ. - Comput. Inf. Sci., Jun. 2017.
[41] M. Scannapieco and L. Berti, “Quality of Web Data and Quality of Big Data: Open Problems,” in Data and Information Quality, Springer, Cham, 2016, pp. 421–449.
[42] S.-T. Lai and F.-Y. Leu, “Data Preprocessing Quality Management Procedure for Improving Big Data Applications Efficiency and Practicality,” in Advances on Broad-Band Wireless Computing, Communication and Applications, vol. 2, L. Barolli, F. Xhafa, and K. Yim, Eds. Cham: Springer International Publishing, 2017, pp. 731–738.
[43] K. Sharma and others, “Quality issues with big data analytics,” in Computing for Sustainable Global Development (INDIACom), 2016 3rd International Conference on, 2016, pp. 3589–3591.
[44] J. Ding, D. Zhang, and X. H. Hu, “A Framework for Ensuring the Quality of a Big Data Service,” in 2016 IEEE International Conference on Services Computing (SCC), 2016, pp. 82–89.
[45] D. Becker, B. McMullen, and T. D. King, “Big data, big data quality problem,” in 2015 IEEE International Conference on Big Data (Big Data), 2015, pp. 2644–2653.
[46] A. F. Haryadi, J. Hulstijn, A. Wahyudi, H. Van Der Voort, and M. Janssen, “Antecedents of big data quality: An empirical examination in financial service organizations,” in Big Data (Big Data), 2016 IEEE International Conference on, 2016, pp. 116–121.
[47] M. H. ur Rehman, V. Chang, A. Batool, and T. Y. Wah, “Big data reduction framework for value creation in sustainable enterprises,” Int. J. Inf. Manag., vol. 36, no. 6, pp. 917–928, Dec. 2016.
[48] J. Gao, C. Xie, and C. Tao, “Big Data Validation and Quality Assurance – Issuses, Challenges, and Needs,” in 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE), 2016, pp. 433–441.
[49] Z. Khayyat et al., “BigDansing: A system for big data cleansing,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1215–1230.
[50] J. Merino, I. Caballero, B. Rivas, M. Serrano, and M. Piattini, “A Data Quality in Use model for Big Data,” Future Gener. Comput. Syst., vol. 63, pp. 123–130, Oct. 2016.
[51] G. A. Lakshen, S. Vraneš, and V. Janev, “Big data and quality: A literature review,” in 2016 24th Telecommunications Forum (TELFOR), 2016, pp. 1–4.
[52] N. Abdullah, S. A. Ismail, S. Sophiayati, and S. M. Sam, “Data quality in big data: a review,” Int. J. Adv. Soft Comput. Its Appl., vol. 7, no. 3, 2015.
[53] C. Batini, A. Rula, M. Scannapieco, and G. Viscusi, “From data quality to big data quality,” J. Database Manag., vol. 26, no. 1, pp. 60–82, 2015.
[54] J. Liu, J. Li, W. Li, and J. Wu, “Rethinking big data: A review on the data quality and usage issues,” ISPRS J. Photogramm. Remote Sens., vol. 115, pp. 134–142, May 2016.
[55] A. Immonen, P. Paakkonen, and E. Ovaska, “Evaluating the Quality of Social Media Data in Big Data Architecture,” IEEE Access, vol. 3, pp. 2028–2043, 2015.
[56] D. Brown, “Encyclopedia of Big Data,” in Encyclopedia of Big Data, L. A. Schintler and C. L. McNeely, Eds. Cham: Springer International Publishing, 2017, pp. 1–3.
[57] “A Suggested Framework for the Quality of Big Data.” [Online]. Available: https://statswiki.unece.org/display/bigdata/2014+Project. [Accessed: 11-Nov-2017].
[58] P. Pääkkönen and J. Jokitulppo, “Quality management architecture for social media data,” J. Big Data, vol. 4, no. 1, p. 6, Dec. 2017.
[59] M. Kläs, W. Putz, and T. Lutz, “Quality Evaluation for Big Data: A Scalable Assessment Approach and First Evaluation Results,” in 2016 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA), 2016, pp. 115–124.
[60] M. Janssen, H. van der Voort, and A. Wahyudi, “Factors influencing big data decision-making quality,” J. Bus. Res., vol. 70, pp. 338–345, Jan. 2017.
[61] P. Ciancarini, F. Poggi, and D. Russo, “Big Data Quality: A Roadmap for Open Data,” in 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 2016, pp. 210–215.
[62] I. Taleb, H. T. E. Kassabi, M. A. Serhani, R. Dssouli, and C. Bouhaddioui, “Big Data Quality: A Quality Dimensions Evaluation,” in 2016 Intl IEEE Conferences on Ubiquitous Intelligence Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), 2016, pp. 759–765.
[63] I. Taleb and M. A. Serhani, “Big Data Pre-Processing: Closing the Data Quality Enforcement Loop,” in 2017 IEEE International Congress on Big Data (BigData Congress), 2017, pp. 498–501.
[64] I. Taleb, R. Dssouli, and M. A. Serhani, “Big Data Pre-processing: A Quality Framework,” in 2015 IEEE International Congress on Big Data (BigData Congress), 2015, pp. 191–198.
... BD represents the collection, processing, and accessibility of huge volumes of fast streaming and heterogeneous data [3][4][5] . Organizations base their strategic decisions by taking insight from BD 6 . There are many benefits of BD analysis, such as identification of new revenue opportunities, better client benefit, more successful promotion and understanding of market demands. ...
... The BD processing and management solution enables an organization to capture and collect the data from many sources. Initially BD is unreliable and contains a considerable measure of inequalities and anomalies which lead to unreliable data 6 . The U.S. organizations believe that about 32% of their data is inaccurate 10 . ...
... Quality problems lead to business failure. BD has a lot of challenges due to its complex nature and multiple layers of both hardware and software which lead to poor data quality 6 . Challenges of BD are everywhere, with most of the challenges being caused by the Vs properties of BD 19 . ...
Preprint
Full-text available
Big Data has emerged as an asset for business practices and marketing strategies due to its data analysis capabilities. For an effective decision-making process, reliable data is critically important. Unreliable data results in wrong decision making ultimately leading to a negative reputation, customer churn, and financial loss. Therefore, testing Big Data is vital to ensure the reliability of data. Defining test strategies and setting up a testing environment are complex and challenging tasks. In this research, we tend to focus on data reliability, since a wrong decision based on unreliable data can have huge negative consequences. The testing framework for Big Data is validated via an exploratory case study in the telecommunication sector. Three-layer architecture for Big Data testing is used, namely Ingestion, Preparation, and Access. Errors are injected to verify fault injection in the ingestion and processing of data files of the Hadoop framework and technology stack. The results of the testing framework show errors in three layers of Big Data. The testing framework effectively detects data faults in Big Data,and a total of 76.92% errors in the presentation layer of total failed test cases are identified. 48.72% of test cases are developed to detect errors to check data accuracy, and after execution of these test cases 31.58% errors are detected. The highest proportion of errors detected are related to the completeness, which is 66.67% of the number of errors injected for completeness. The results show the effectiveness of the framework to identify accuracy and completeness errors.
... Assessing and ascertaining the quality of such data before using it is therefore important. Data quality has widely been studied in database management [9]- [11], and also in the big data context [12]. Poor data quality management can have adverse negative effects on business decisions [13]. ...
Preprint
Full-text available
Continued development of communication technologies has led to the widespread Internet of Things (IoT) integration into various domains, including health, manufacturing , automotive and precision agriculture. This has further led to the increased sharing of data amongst such domains to abet innovation. Most of these IoT deployments, however, are based on heterogeneous, pervasive sensors, which can lead to quality issues in the recorded data. This can lead to sharing of inaccurate or inconsistent data. There is a significant need to assess the quality of the collected data, should it be shared with multiple application domains, as inconsistencies in the data could have financial or health ramifications. This paper builds on the recent research on trust metrics and presents a framework to integrate such metrics into the IoT data cycle for real-time data quality assessment. Critically, this paper adopts a mechanism to facilitate end user parameterisation of a trust metric tailoring it's use in the framework. Trust is a well-established metric that has been used to determine the validity of a piece or source of data in crowd-sourced or other unreliable data collection techniques such as that in IoT.The paper further discusses how the trust based framework eliminates the requirement for a gold standard and provides visibility into data quality assessment throughout the big data model.To qualify the use of trust as a measure of quality, an experiment is conducted using data collected from an IoT deployment of sensors to measure air quality in which low cost sensors were co-located with a gold standard reference sensor. The calculated trust metric is compared with two well understood metrics for data quality, Root Mean Squares Error (RMSE) and Mean Absolute Error (MAE). A strong correlation between the trust metric and the comparison metrics shows that trust may be used as an indicative quality metric for data quality. The metric incorporates the additional benefit of its ability for use in low context scenarios, as opposed to RMSE and MAE, which require a reference for comparison.
... To ensure the value of the data and ensure producing high-quality results from it, three critical features are argued by Ref. [37] that must be examined, which are quality, quantity, and availability of data. A data quality test is essential in dataset evaluation [38]. A quantity test indicates whether there is enough data to train and test the models, which, for example, is necessary for ML models. ...
Article
Full-text available
File-Type Identification (FTI) is one of the essential functions that can be performed by examining the data blocks' magic numbers. However, this examination leads to a challenge when a file is corrupt, or these magic numbers are missing. Content-based analytics is the best way for file type identification when the magic numbers are not available. This paper prepares and presents a content-based dataset for eight common types of files based on twelve features. We designed our dataset to be used for supervised and unsupervised machine learning models. It provides the ability to classify and cluster these types into two levels, as a fine-grain level (by their file type exactly, JPG, PNG, HTML, TXT, MP4, M4A, MOV, and MP3) and as a coarse-grain level (by their broad type, image, text, audio, video). A dataset quality and features assessments are performed in this study. The obtained results show that our dataset is high-quality, non-biased, complete, and with an acceptable duplication ratio. In addition, several multi-class classifiers are learned by our data, and classification accuracy of up to 81.8% is obtained. The main contributions of this work are summarized in constructing a new publicly available dataset based on statistical and information content-related features with detailed assessments and evaluation. Abstract File-Type Identification (FTI) is one of the essential functions that can be performed by examining the data blocks' magic numbers. However, this examination leads to a challenge when a file is corrupt, or these magic numbers are missing. Content-based analytics is the best way for file type identification when the magic numbers are not available. This paper prepares and presents a content-based dataset for eight common types of files based on twelve features. We designed our dataset to be used for supervised and unsupervised machine learning models. It provides the ability to classify and cluster these types into two levels, as a fine-grain level (by their file type exactly, JPG, PNG, HTML, TXT, MP4, M4A, MOV, and MP3) and as a coarse-grain level (by their broad type, image, text, audio, video). A dataset quality and features assessments are performed in this study. The obtained results show that our dataset is high-quality, non-biased, complete, and with an acceptable duplication ratio. In addition, several multi-class classifiers are learned by our data, and classification accuracy of up to 81.8% is obtained. The main contributions of this work are summarized in constructing a new publicly available dataset based on statistical and information content-related features with detailed assessments and evaluation.
... The increased interest in software engineering for AI-based software has resulted in a growing number of publications in this area over the last four years (Nascimento et al. 2020). Although there are existing secondary studies on the topic of quality for big data systems (Rahman and Reza 2020) and surveys on the quality of ML software or data quality for big data software (Lakshen et al. 2016; Taleb et al. 2018), to the best of our knowledge there is no systematic literature review on software product quality for AI-based software specifically. This has motivated us to conduct a systematic literature review (SLR) in a broader way to provide an up-to-date, holistic, and comprehensive view of the state of the art of software quality for AI-based software research, especially from the product quality perspective, and to analyze the existing quality challenges, models, attributes, metrics and techniques concerning quality for AI-based software. ...
Article
Full-text available
There is a widespread demand for Artificial Intelligence (AI) software, specifically Machine Learning (ML). It is getting increasingly popular and is being adopted in various applications we use daily. AI-based software quality differs from traditional software quality because it generally addresses distinct and more complex kinds of problems. With the fast advance of AI technologies and related techniques, how to build high-quality AI-based software becomes a very prominent subject. This paper aims at investigating the state of the art of software quality (SQ) for AI-based systems and identifying quality attributes, applied models, challenges, and practices reported in the literature. We carried out a systematic literature review (SLR) covering 1988 to 2020 to (i) analyze and understand related primary studies and (ii) synthesize limitations and open challenges to drive future research. Our study provides a road map for researchers to better understand quality challenges, attributes, and practices in the context of software quality for AI-based software. From the empirical evidence gathered by this SLR, we suggest that future work on this topic be structured under three categories: Definition/Specification, Design/Evaluation, and Process/Socio-technical.
Chapter
Full-text available
Data input for administrative healthcare workflows is generally carried out in a hybrid form of organization and production that combines human labor and a computerized infrastructure. During the COVID-19 period, many healthcare organizations found themselves unable to address the increased supply and demand and the workload generated in hospitals and other healthcare services/products. This complicated scenario rests on both the human capacity to process the information generated by new data concerning the disease, methods, and technologies to prevent COVID-19 dissemination, and on the computational treatment of the new information available daily in order to predict the behavior of supply chains. At a stage before manufacturing and logistics, and after planning and policies, there is a lacuna in which public healthcare organizations need to be able to predict how their infrastructure and daily routines respond to unusual and stressful conditions. For this problem, this chapter focuses on the work productivity of the bidding sector responsible for delivering services/goods to the population during 2020-21 in the State of Paraná, Brazil, and how it failed to provide correct responses under the abnormal conditions imposed by COVID-19.
Article
Full-text available
Several organizations have migrated to the cloud for better quality in business engagements and security. Data quality is crucial in present-day activities. Information is generated and collected from data representing real-time facts and activities. Poor data quality affects organizational decision-making policy and customer satisfaction, and negatively influences the organization's scheme of execution. Data quality also has a massive influence on the accuracy, complexity and efficiency of machine and deep learning tasks. There are several methods and tools to evaluate data quality and ensure its smooth incorporation into model development. The bulk of data quality tools permit the assessment of data sources only at a certain point in time, and their arrangement and automation are consequently an obligation of the user. Ensuring automatic data quality involves several steps, from gathering data from different sources to monitoring data quality, and any problems with the data quality must be adequately addressed. There was a gap in the literature, as no attempts had previously been made to collate all the advances in the different dimensions of automatic data quality. This limited narrative review of the existing literature sought to address this gap by correlating the different steps and advancements related to automatic data quality systems. The six crucial data quality dimensions in organizations are discussed, and big data are compared and classified. This review highlights existing data quality models and strategies that can contribute to the development of automatic data quality systems.
Article
With the development of internet technology, information has become an increasingly important asset for the operation and planning of power systems. However, the existing studies and practices pay little attention to information value evaluation, i.e., the potential for information to be converted into actual economic benefits. To this end, this paper designs an information market framework and proposes a generalized information valuation model to help price data in smart grids efficiently. Here we analyze the information value of photovoltaic (PV)-related data in a power system operation problem. Specifically, we examine how additional meteorological and PV power data help to improve day-ahead forecasting accuracy, thus enhancing unit commitment (UC). In this paper, information quality is captured by two indices of a set of PV-related data, i.e., Shannon entropy and non-noise ratio. Then a neural network-based engine is employed to predict day-ahead hourly solar power on the premise of input datasets with different information quality. Here we define forecasting accuracy as information utility, and discover an exponential relationship between such utility and information quality. Finally, a two-stage stochastic UC model is formulated to quantify the contributions of different PV-related datasets, in which real-time solar power deviation is penalized. In this instance, the economic value of PV-related data is measured as the operational cost reduction induced by forecasting accuracy improvement, which we find can be estimated by information quality. Case studies based on the IEEE 30- and 118-bus systems validate the effectiveness of the proposed paradigm and method.
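Of the two information quality indices named above, Shannon entropy is the standard one; a minimal Python sketch of its computation on a discretized PV power series is shown below. The binning choice and the sample data are assumptions, not taken from the cited paper.

```python
import numpy as np

def shannon_entropy(series, bins=16):
    """Shannon entropy (bits) of a discretized time series, one possible information quality index."""
    hist, _ = np.histogram(series, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

pv_power = np.random.rand(24) * 50.0   # hypothetical hourly PV output (MW)
print(shannon_entropy(pv_power))
```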
Conference Paper
Full-text available
In the Big Data era, data is at the core of any governmental, institutional, or private organization. Efforts are geared towards extracting highly valuable insights, which cannot happen if the data is of poor quality. Therefore, data quality (DQ) is considered a key element of the Big Data processing phase, in which low-quality data is not allowed to penetrate the Big Data value chain. This paper addresses data quality rule (DQR) discovery after the evaluation of quality and prior to Big Data pre-processing. We propose a DQR discovery model to enhance and accurately target the pre-processing activities based on quality requirements. We define a set of pre-processing activities associated with data quality dimensions (DQDs) to automate the DQR generation process. Rule optimization is applied to the validated rules to avoid multi-pass pre-processing activities and to eliminate duplicate rules. The conducted experiments showed increased quality scores after applying the discovered and optimized DQRs to the data.
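A minimal sketch of the general idea, assuming a hypothetical mapping from data quality dimensions to pre-processing activities and an assumed score threshold (neither is specified in the abstract), might look as follows in Python.

```python
# Illustrative sketch (not the authors' implementation): map low data quality
# dimension (DQD) scores to candidate pre-processing rules, then de-duplicate.
ACTIVITY_FOR_DQD = {                      # hypothetical mapping of DQDs to activities
    "completeness": "impute_missing_values",
    "accuracy": "filter_outliers",
    "consistency": "normalize_formats",
}

def discover_rules(dqd_scores: dict, threshold: float = 0.8) -> list:
    """Generate one rule per DQD whose score falls below the required threshold."""
    rules = [
        (dqd, ACTIVITY_FOR_DQD[dqd])
        for dqd, score in dqd_scores.items()
        if dqd in ACTIVITY_FOR_DQD and score < threshold
    ]
    return sorted(set(rules))             # optimization step: drop duplicate rules

print(discover_rules({"completeness": 0.65, "accuracy": 0.92, "consistency": 0.70}))
```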
Article
Full-text available
Developing Big Data applications has become increasingly important in the last few years. In fact, several organizations from different sectors depend increasingly on knowledge extracted from huge volumes of data. However, in the Big Data context, traditional data techniques and platforms are less efficient: they show slow responsiveness and lack scalability, performance and accuracy. Much work has been carried out to face the complex Big Data challenges, and as a result, various types of distributions and technologies have been developed. This paper is a review that surveys recent technologies developed for Big Data. It aims to help select and adopt the right combination of Big Data technologies according to technological needs and specific applications' requirements. It provides not only a global view of the main Big Data technologies but also comparisons according to different system layers such as the Data Storage Layer, Data Processing Layer, Data Querying Layer, Data Access Layer and Management Layer. It categorizes and discusses the main technologies' features, advantages, limits and usages.
Article
Full-text available
Social media data has provided various insights into the behaviour of consumers and businesses. However, extracted data may be erroneous or could have originated from a malicious source; thus, the quality of social media data should be managed. It should also be understood how data quality can be managed across a big data pipeline, which may consist of several processing and analysis phases. The contribution of this paper is the evaluation of a data quality management architecture for social media data. Theoretical concepts based on previous work have been implemented for data quality evaluation of Twitter-based data sets. In particular, a reference architecture for quality management of social media data has been extended and evaluated based on the implementation architecture. Experiments indicate that 150-800 tweets/s can be evaluated with two cloud nodes, depending on the configuration.
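As an illustration only (the cited paper's reference architecture is not reproduced here), a per-tweet quality check covering field completeness and timeliness could be sketched in Python as follows; the field names and the age threshold are assumptions.

```python
from datetime import datetime, timezone, timedelta

def tweet_quality(tweet: dict, max_age_hours: int = 24) -> dict:
    """Illustrative per-tweet checks: field completeness and timeliness."""
    required = ("id", "text", "created_at", "user")
    completeness = sum(1 for f in required if tweet.get(f)) / len(required)
    created = datetime.fromisoformat(tweet["created_at"]) if tweet.get("created_at") else None
    timely = bool(created and datetime.now(timezone.utc) - created < timedelta(hours=max_age_hours))
    return {"completeness": completeness, "timely": timely}

print(tweet_quality({"id": 1, "text": "hello", "created_at": "2024-01-01T10:00:00+00:00", "user": "alice"}))
```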
Article
Full-text available
Massive growth in the scale of data has been observed in recent years and is a key factor of the Big Data scenario. Big Data can be defined as high-volume, high-velocity and high-variety data that require new high-performance processing. Addressing big data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis. This paper reviews data preprocessing methods for data mining in big data. The definition, characteristics, and categorization of data preprocessing approaches in big data are introduced. The connection between big data and data preprocessing across all families of methods and big data technologies is also examined, including a review of the state of the art. In addition, research challenges are discussed, with a focus on developments in different big data frameworks, such as Hadoop, Spark and Flink, and on encouraging substantial research efforts in some families of data preprocessing methods and in applications to new big data learning paradigms.
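To make the kind of pre-processing step discussed here concrete, a minimal PySpark sketch that removes duplicates and incomplete records is shown below; the dataset path and key columns are hypothetical, and a running Spark environment is assumed.

```python
from pyspark.sql import SparkSession

# Minimal sketch of distributed pre-processing, assuming a Spark cluster and a
# hypothetical CSV dataset at data_path with "id" and "timestamp" columns.
spark = SparkSession.builder.appName("preprocessing-sketch").getOrCreate()
data_path = "hdfs:///datasets/events.csv"                        # hypothetical location

df = spark.read.csv(data_path, header=True, inferSchema=True)
cleaned = (
    df.dropDuplicates()                                          # remove exact duplicate records
      .na.drop(subset=["id", "timestamp"])                       # drop rows missing key fields
)
print(cleaned.count())
spark.stop()
```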
Conference Paper
Full-text available
Over the past several years, we have built an online big data service called CMA that includes a group of scientific modeling and analysis tools, machine learning algorithms and a large-scale image database for biological cell classification and phenotyping studies. Due to the complexity and "non-testable" nature of scientific software and machine learning algorithms, adequately verifying and validating big data services is a grand challenge. In this paper, we introduce a framework for ensuring the quality of big data services. The framework includes an iterative metamorphic testing technique for testing "non-testable" scientific software, and an experiment-based approach with stratified 10-fold cross validation for validating machine learning algorithms. The effectiveness of the framework for ensuring the quality of big data services is demonstrated by verifying and validating the software and algorithms in CMA.
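Stratified 10-fold cross-validation is a standard validation protocol; a minimal scikit-learn sketch on a stand-in dataset (not the CMA image database) is shown below.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified 10-fold cross-validation: each fold preserves the class distribution,
# and the mean/std of the fold scores summarize model validity.
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```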
Conference Paper
Full-text available
While the potential benefits of Big Data adoption are significant, and some initial successes have already been realized, many research and technical challenges remain to be addressed to fully realize this potential. Big Data processing, storage and analytics are, of course, the major challenges that are most easily recognized. However, there are additional challenges related, for instance, to Big Data collection, integration, and quality enforcement. This paper proposes a hybrid approach to Big Data quality evaluation across the Big Data value chain. It consists of first assessing the quality of the Big Data itself, which involves processes such as cleansing, filtering and approximation, and then assessing the quality of the processes handling this Big Data, such as the processing and analytics processes. We conduct a set of experiments to evaluate the quality of data prior to and after its pre-processing, and the quality of the pre-processing and processing on a large dataset. Quality metrics have been measured to assess three Big Data quality dimensions: accuracy, completeness, and consistency. The results show that the combination of data-driven and process-driven quality evaluation leads to improved quality enforcement across the Big Data value chain. Hence, we recorded high prediction accuracy and low processing time after evaluating six well-known classification algorithms as part of the processing and analytics phase of the Big Data value chain.
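The exact metric definitions are not given in this abstract, so the following Python sketch only shows one simple way the three dimensions could be operationalized on a tabular sample; the domain rule used for accuracy and the sample data are assumptions.

```python
import pandas as pd

def quality_scores(df: pd.DataFrame) -> dict:
    """Illustrative scores for three quality dimensions; the cited paper's definitions may differ."""
    completeness = 1.0 - df.isna().mean().mean()                 # share of non-missing cells
    consistency = 1.0 - df.duplicated().mean()                   # share of non-duplicate rows
    accuracy = (df.select_dtypes("number") >= 0).mean().mean()   # toy domain rule: numeric values non-negative
    return {"completeness": completeness, "consistency": consistency, "accuracy": accuracy}

sample = pd.DataFrame({"age": [34, None, 29, 29], "income": [52000, 48000, -1, 48000]})
print(quality_scores(sample))
```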
Conference Paper
As Big Data becomes better understood, there is a need for a comprehensive definition of Big Data to support work in fields such as data quality for Big Data. Existing definitions of Big Data define it by comparison with existing, usually relational, definitions, define it in terms of data characteristics, or use an approach that combines data characteristics with the Big Data environment. In this paper we examine existing definitions of Big Data and discuss the strengths and limitations of the different approaches, with particular reference to issues related to data quality in Big Data. We identify the issues presented by incomplete or inconsistent definitions. We propose an alternative definition and relate this definition to our work on quality in Big Data.
Conference Paper
Big Data refers to data volumes in the range of exabytes (10^18 bytes) and beyond. Such volumes exceed the capacity of current online storage and processing systems. With characteristics like volume, velocity and variety, big data poses challenges to traditional IT establishments. Computer-assisted innovation, real-time data analytics, customer-centric business intelligence, industry-wide decision making and transparency are possible advantages of Big Data, to mention a few. There are many issues with Big Data that warrant quality assessment methods; these issues pertain to storage and transport, management, and processing. This paper sheds light on the present state of quality issues related to Big Data. It provides valuable insights that can be used to leverage Big Data science activities.
Conference Paper
High-quality data is a prerequisite for most types of analysis provided by software systems. However, since data quality does not come for free, it has to be assessed and managed continuously. The increasing quantity, diversity, and velocity that characterize big data today make these tasks even more challenging. We identified challenges that are specific to big data quality assessments, with particular emphasis on their usage in smart ecosystems, and propose a scalable, cross-organizational approach that addresses these challenges. We developed an initial prototype to investigate scalability in a multi-node test environment using big data technologies. Based on the observed horizontal scalability behavior, there is an indication that the proposed approach can also deal with increasing volumes of heterogeneous data.