Data Profiling Technology of Data Governance
Regarding Big Data: Review and Rethinking
Wei Dai1*, Isaac Wardlaw2%, Yu Cui3, Kashif Mehdi4, Yanyan Li5#, Jun Long6
1Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA
{wxdai*, iawardlaw%, yxli5#}
2,5Computer Science, University of Arkansas at Little Rock, Little Rock, AR, USA
3College of Information Engineering, Guangdong Mechanical and Electrical Polytechnic,
Guangzhou, Guangdong, China
4Software Development Group, Collibra Inc., New York, NY, USA
6Information Science and Engineering, Central South University, Changsha, Hunan, China
Abstract. Data profiling technology is very valuable for data governance and
data quality control because people need it to verify and review the quality of
structured, semi-structured, and unstructured data. In this paper, we first review
relevant works and discuss their definitions of data profiling. Second, we offer a
new definition and propose new classifications for data profiling tasks. Third,
the paper presents several free and commercial profiling tools. Fourth, the authors
offer new data quality metrics and a data quality score calculation. Finally, the
authors discuss a data profiling tool framework for big data.
Keywords: data profiling tools; data governance; big data; data quality control;
data management.
Data is ubiquitous. People use digital maps to navigate in major cities, send emails to
communicate, and buy e-tickets online every day. The more computers assist people,
the larger the volume of data that will be stored. Big data marks a new era in data
utilization and exploration; its key characteristics are volume, variety, and velocity [1].
Data can be structured, as is scientific and statistical data; semi-structured, such as
PDF and XML; or unstructured, like raw video or audio [2]. Structured data can be
accessed by tools like SQL, while XQuery is usually used to access semi-structured
data. Because of the complexity of unstructured data, data profiling tools usually fo-
cus on structured data and semi-structured data.
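As a minimal sketch (not drawn from the paper), semi-structured data such as XML can be profiled by walking its element tree with standard-library tools; the document and field names below are invented for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Toy XML document standing in for semi-structured data.
doc = """
<customers>
  <customer><name>Ada</name><zip>72204</zip></customer>
  <customer><name>Alan</name></customer>
</customers>
"""

root = ET.fromstring(doc)
# Count how often each child element appears: a simple structural profile
# that reveals optional or missing fields (here, one customer lacks <zip>).
tag_counts = Counter(child.tag for cust in root for child in cust)
print(tag_counts)  # Counter({'name': 2, 'zip': 1})
```

A real profiler would also record nesting depth, attribute usage, and value types, but the idea is the same: derive structure by traversal rather than from a fixed schema.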
Data quality plays an important role in data governance. According to [3] and [4],
data governance takes a unique role in companies because of the Sarbanes-Oxley
(SOX) Act and Basel II. Data is a valuable asset for customers, companies, and gov-
ernments. The target of Total Data Quality Management (TDQM) is to offer high-
quality information products (IP) to users, so definition, measurement, analysis, and
improvement of data quality are included in the TDQM cycle [5] [6].
Data profiling technology can improve data quality. In [7], [8], and [9], data profiling
tools discover the patterns of datasets and support scores of TDQM projects,
including data cleaning, data integration, and data analysis. Data scientists, end-users,
and IT engineers utilize these tools to improve data quality because data profiling
tools easily show the frequency patterns of address information, credit card numbers,
and phone numbers.
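The frequency-pattern idea can be illustrated with a small, hypothetical sketch: each value is reduced to a character-class mask, and the mask frequencies reveal the dominant formats (the phone numbers below are invented):

```python
import re
from collections import Counter

def mask(value):
    """Reduce a value to its character-class pattern (9 = digit, A = letter)."""
    value = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", value)

phones = ["501-555-0143", "501-555-9821", "(501) 555-7714"]
patterns = Counter(mask(p) for p in phones)
print(patterns.most_common())
# [('999-999-9999', 2), ('(999) 999-9999', 1)]
```

Values whose mask falls outside the dominant patterns are good candidates for review, which is how profiling tools surface malformed addresses, SSNs, or card numbers.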
The remainder of this paper is organized as follows: Section II reviews relevant works
and discusses a new definition of data profiling. Section III presents new classifica-
tions of data profiling tasks; Sections IV and V introduce several data profiling tools,
including both free and commercial software (Fig. 1). Section VI describes data quali-
ty metrics and an exponentially weighted data quality score in detail. Section VII
discusses a data profiling tool framework for big data. Section VIII concludes the main
ideas, and Section IX addresses future work.
Fig. 1. Selected Data Profiling Tools
Different authors give different definitions of data profiling. [7] describes it as “the
set of activities and processes to determine the metadata about a given dataset.” Ac-
cording to [10], “data profiling is a process whereby one examines the data available
in an existing database or flat file and collects statistics and information about that
data. These statistics become the real or true metadata.” [11] defines it as “referring to
the activity of creating small but informative summaries of a database.” These defini-
tions of data profiling seem convincing initially, but [7] considers only the metadata
regarding datasets, and [10] does not mention unstructured data at
all. The definition in [11] is vague because data has different types, structures, and utilizations.
Data profiling can be utilized at different stages of data governance. Thus, in our
opinion, profiling is the process of verifying users’ structured data, semi-structured
data, and unstructured data; gathering data structures, data patterns, statistical
information, and distribution details; and reviewing data attributes for data governance, data
management, data migration, and data quality control.
Data profiling encompasses multiple tasks for data governance. According to [8]
and [10], people need to profile data only when cleaning, managing, analyzing, or
integrating it; evaluating its quality; performing data archaeology; optimizing queries;
or engaging in Extract, Transform and Load (ETL) projects. However, data profiling
could also be used for compressing, verifying, masking, auditing, migrating, archiv-
ing, and recovering data as well as for generating testing data and data health reports.
[7] and [8] list classifications of data profiling tasks; however, these classifications are
not fine-grained enough. For example, [8] separates data quality jobs based only upon
whether they use a single data source or multiple sources. In our view, data profiling
tools perform five primary jobs:
1. Metadata Profiling: discovering metadata information, such as data structures,
creators, times of creation, primary keys, and foreign keys.
2. Presentation Profiling: finding data patterns, including text patterns, time patterns,
and number patterns, such as address patterns, date patterns, and telephone patterns.
3. Content Profiling: reviewing basic data information, including accuracy, precision,
timeliness, and null or non-null status.
4. Set Profiling: analyzing data from collections or groups; for example, statistics,
distribution, cardinality, frequency, uniqueness, row count, maximum or minimum
values, mean, and redundancy.
5. Logical Rule Profiling: reviewing data based on business logical rules or the business
glossary domain, such as data logical meanings, business rules, and functional
dependencies.
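As a rough illustration of job 4 (set profiling), and not any particular tool's implementation, collection-level statistics can be gathered in a few lines:

```python
from statistics import mean

def set_profile(values):
    """A toy 'set profiling' pass: collection-level statistics over a column."""
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "cardinality": len(set(non_null)),                 # distinct values
        "uniqueness": len(set(non_null)) / len(non_null),  # distinct / non-null
        "min": min(non_null),
        "max": max(non_null),
        "mean": mean(non_null),
    }

profile = set_profile([3, 1, 4, 1, None, 5])
print(profile)
```

Metadata, presentation, content, and logical-rule profiling would layer schema inspection, pattern masks, field-level validation, and business rules on top of such basic passes.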
Different data profiling missions can be used for different projects, as shown in Table 1.
Table 1. Data profiling missions for user scenarios
Metadata Profiling     | Data management, data integration, ETL, data migration
Presentation Profiling | Data compression, data audit
Content Profiling      | Data compression, data management
Set Profiling          | Data audit, data compression, data management
Logical Rule Profiling | Data audit, data management
Some sorts of data profiling (such as content profiling and logical rule profiling) are
strongly linked to the business domain, but others are not. Fig. 2 shows this relationship.
Fig. 2. Data Profiling in different technical or business domains.
Some data profiling tools are free; many, but not all, of these free tools are also open
source. In general, their functionality is more limited than that of commercial products,
and they may not offer free telephone or online support. Furthermore, their
documentation is not always thorough. Nevertheless, some small companies still use
these free tools instead of expensive commercial software, considering the benefits
that free tools provide.
1. Aggregate Profiler Tool
Aggregate Profiler (AP) is an open source project developed in Java [7]. AP supports
both traditional databases and big data platforms, such as Hadoop and Hive, and it offers
statistical analysis, pattern matching, distribution charts, basket analysis, etc. AP also supports
data generation, data preparation, data masking features, and address correction for
data quality projects. Moreover, this tool offers data validation (metadata profiling,
analytical profiling, and structural profiling), and data quality (removing duplicate
data, null values, and dirty data).
2. Talend Open Studio for Data Quality
Talend Open Studio for Data Quality (TOSDQ) [8] is also based on Java and is a
mature open source tool. TOSDQ offers a navigator interface to access databases and
data files. This tool supports catalog analysis, time correlation analysis, column anal-
ysis, table analysis, column correlation analysis, and schema analysis; it also supports
column functional dependency, redundancy analysis, numerical correlation analysis,
nominal correlation analysis, connection analysis, column set analysis, and match
analysis. Furthermore, TOSDQ reports several different types of statistics indicators,
including simple statistics, text statistics, summary statistics, pattern frequency statis-
tics, Soundex frequency statistics, phone number statistics, and fraud detection (Benford's law frequency).
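Benford's-law frequency checking of the kind TOSDQ reports can be sketched independently of Talend: compare observed leading-digit frequencies against the expected probability log10(1 + 1/d). The sample values below are invented:

```python
import math
from collections import Counter

def first_digit_freq(values):
    """Observed frequency of each leading digit 1-9 in a numeric column."""
    counts = Counter(str(abs(v))[0] for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(str(d), 0) / total for d in range(1, 10)}

def benford_expected(d):
    # Benford's law: P(d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)

values = [123, 187, 104, 291, 305, 412, 150, 178, 960, 133]
observed = first_digit_freq(values)
for d in (1, 2, 9):
    print(d, round(observed[d], 2), round(benford_expected(d), 2))
```

Large deviations between the observed and expected distributions flag columns worth investigating for fabricated or corrupted values.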
3. DataCleaner
DataCleaner [9] is a commercial tool for data profiling and data cleaning, but it has a
free version which offers multiple data profiling functions, including pattern match-
ing, boolean analysis, weekday distribution, completeness analysis, value matcher,
character set distribution, value distribution, date gap analysis, unique key check,
date/time analysis, string analysis, number analysis, referential integrity, and refer-
ence data matching.
Commercial data profiling products usually come packaged in data governance suites.
These products have multiple functions, high performance, and strong capabilities;
they can connect to other suites to provide comprehensive solutions for customers.
Moreover, this software is not only powerful; end-users can also find online
services and telephone support.
1. IBM InfoSphere Information Analyzer
IBM InfoSphere Information Analyzer (IIA) [10] is part of IBM’s data governance
suite that includes InfoSphere Blueprint Director, Metadata Workbench, DataStage,
QualityStage, Data Click, Business Glossary, and Information Services Director. IIA
supports column analysis (statistics, distribution, cardinality, and value analysis),
identifying keys and relationships, discovering redundant data, comparing data and
structures through history baselines, analyzing data via data rules, and importing and
exporting data rules.
2. Informatica Data Profiling
Informatica Data Profiling is a key component of PowerCenter [11]. This profiling
software supports aggregate functions (count null values, calculate averages, get max-
imum or minimum values, and get lengths of strings), candidate key evaluation
(unique or non-unique), distinct value count, domain inference, functional dependen-
cy analysis, redundancy evaluation, and row count. In addition, users can add busi-
ness rules (verbose mode) or configure profile functions in this tool.
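Aggregate functions of this kind can be mimicked with plain SQL. The following sketch uses SQLite (not PowerCenter) and an invented table to compute null counts, an average, min/max values, and the longest string:

```python
import sqlite3

# Toy table profiled with ordinary SQL aggregates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, balance REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ada", 120.0), ("Alan", 80.0), (None, 40.0)])

row = conn.execute("""
    SELECT COUNT(*) - COUNT(name),      -- null count in 'name'
           AVG(balance),                -- average
           MIN(balance), MAX(balance),  -- value range
           MAX(LENGTH(name))            -- longest string
    FROM customers
""").fetchone()
print(row)  # (1, 80.0, 40.0, 120.0, 4)
```

Commercial profilers add candidate-key evaluation, domain inference, and functional dependency analysis on top of such per-column aggregates.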
3. Oracle Enterprise Data Quality
Oracle Enterprise Data Quality (EDQ) [12] permits address verification, profiling data
(files, databases, and spreadsheets), standardization, audit reviews (incorrect values,
missing data, inconsistencies, duplicate records, and key quality metrics), matching
and merging columns (duplicate prevention, de-duplication, consolidation, and inte-
gration), and case management (data reviewing). Furthermore, the tool can utilize pre-
built templates or user-defined rules to profile data. EDQ can also connect to other
Oracle data governance products, including Oracle Data Integrator and Oracle Master
Data Management.
4. SAP Information Steward
SAP Information Steward can improve information quality and governance [13] via
the Data Insight module (data profiling and data quality monitoring), Metadata Man-
agement module (metadata analysis), Metapedia Module (business term taxonomy),
and cleansing package builder (cleansing rules). The data insight module can define
validation rules, determine profiling (column, address, uniqueness, dependency, and
redundancy), import and export metadata, and create views [13].
5. SAS DataFlux Data Management Studio
SAS DataFlux Data Management Studio (DDMS) [14] is a data governance suite that
consists of data profiling, master data management, and data integration. This data pro-
filing tool covers key analysis (primary and foreign keys), pattern frequency distribu-
tion analysis, redundant data analysis, and data profiling reports.
6. Collibra Data Stewardship Manager
Collibra Data Stewardship Manager (DSM) [15] module is part of Collibra’s Data
Governance Center that also includes Business Semantic Glossary (BSG) and Refer-
ence Data Accelerator (RDA) modules. DSM provides historical data quality reports
around trend analysis, as well as reports for understanding the impact of resolved data
issues. In addition, DSM provides a fully configurable data quality reporting dashboard
by bringing together data quality rules and metrics calculated in one or multiple
sources (data quality tools, databases, and big data).
Several academic papers offer data quality measurement methods. In [16], the authors
mention information quality dimensions including accessibility, completeness, and
security; however, these dimensions only focus on information quality for data gov-
ernance. [17] separates data quality problems according to how a data administrator
might view them: for example, single-source vs. multi-source, and instance-level vs.
schema-level. The authors state that these problems could be solved by ETL, but these
procedures only improve the quality of data after it has been collected and stored in a
data warehouse, making the solution inflexible. [18] presents algorithms for calculating
data quality measures, such as free-of-error, completeness, and appropriate-amount-of-data.
Some papers describe how to build metrics for data quality. In [19], the authors offer
many metrics (25 candidate metrics and 18 subsets of metrics) regarding data quality.
However, they do not mention how to implement them. [20] offers a blueprint for
improving data quality, but it only focuses on the business value of such improve-
ment. In [21], the authors discuss how to build the data metrics of data warehouse
from such components as table metrics, star metrics, and schema metrics, but they do
not mention how to profile data quality, enhance data quality of data structures, or
improve unstructured data quality. In [22], the authors mention dimensions of data
quality (for instance, uniqueness, accuracy, and consistency), and they also detail how
to build a model for quantifying data quality performance.
1. Data Quality Indicators
Qualitative indicators are a common way to measure data quality; we sometimes utilize
these dimensions to define data quality. In [23], the authors enumerate six indicators of
data quality (uniqueness, accuracy, consistency, etc.). Table 2 shows these qualitative
indicators and their definitions.
Table 2. Six indicators of data quality [23]
Accuracy     | The degree to which data correctly describes the real-world object or event being described.
Completeness | The proportion of stored data against the potential of 100% complete.
Consistency  | Similarity when comparing two or more representations of something against its definition.
Timeliness   | The degree to which data represents reality at the required point in time.
Validity     | Conformity of data's syntax (format, type, range) to its definition.
Uniqueness   | Nothing will be recorded more than once based upon how it is identified.
2. Data Quality Metrics
Qualitative indicators do not offer precise indexes of data quality, but quantitative met-
rics deliver accurate methods to monitor, control, and improve data quality, which is
valuable for data profiling. Metrics consist of directly measurable rules. [24] dis-
cusses twenty quantitative metrics of data governance for health organizations; how-
ever, these metrics only target the health field. In this paper, we offer comprehensive
metrics for data quality, shown in Table 3.
Table 3. Quantitative Data Quality Metrics
Accuracy     | Percent of data that is correct (correct data / total data) (e.g., ZIP code, SSN)
Completeness | Percent of data that is complete (e.g., phone number, address)
Consistency  | Percent of data that is consistent (such as business rules and logical rules of consistency)
Timeliness   | Percent of data that is timely (for example, age or educational degree at a specific point in time)
Validity     | Percent of data that is valid (such as first name, last name, and suffix)
Uniqueness   | Percent of data that is unique (e.g., primary keys, foreign keys)
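A hedged sketch of how three of these metrics might be computed for a single column; the validator and sample values are invented, and real tools apply much richer rules:

```python
import re

def metric_percentages(values, validator):
    """Completeness, validity, and uniqueness as percentages of total rows."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    valid = [v for v in present if validator(v)]
    return {
        "completeness": 100 * len(present) / total,
        "validity": 100 * len(valid) / total,
        "uniqueness": 100 * len(set(present)) / total,
    }

# Hypothetical validator: a US ZIP code is five digits.
zip_ok = lambda v: re.fullmatch(r"\d{5}", v) is not None
scores = metric_percentages(["72204", "72211", "7220", None], zip_ok)
print(scores)  # {'completeness': 75.0, 'validity': 50.0, 'uniqueness': 75.0}
```

Accuracy, consistency, and timeliness need reference data, business rules, and timestamps respectively, so they cannot be computed from the column alone.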
Moreover, we can get the same results if we use sample data instead of the full volume
of data. A Data Quality Metric (DQM) formula usually needs data quality indexes, or
weights, which depend on specific business scenarios. For example, banks usually send
bills via mail or email, so home and email addresses are important to them. However,
they may not care much about your cell phone number because they do not need it.
Customers could define their own weights for data quality metrics. Table 4 shows an
example of metric weights for the address column.
Table 4. Metric weights of the address column
Accuracy     : % of data that is correct (W1)
Completeness : % of data that is complete (W2)
Consistency  : % of data that is consistent (W3)
Timeliness   : % of data that is timely (W4)
Validity     : % of data that is valid (W5)
Uniqueness   : % of data that is unique (W6)
Total weight value: W1 + W2 + W3 + W4 + W5 + W6 = 100
The DQM formula is
DQM(t) = M1 × W1 + M2 × W2 + M3 × W3 + M4 × W4 + M5 × W5 + M6 × W6    (1)
where Mi is the measured fraction of good data for metric i at time period t.
3. Exponentially Weighted Data Quality Score
Exponentially Weighted Data Quality Score (EWDQS) is strongly related to time and
the DQM. To build a data quality score over time, we apply an exponentially
weighted moving average to the series of DQM values, which may be calculated recursively:
EWDQS(1) = DQM(1)    (3)
EWDQS(t) = λ × DQM(t) + (1 − λ) × EWDQS(t − 1), for t > 1    (4)
1) A higher λ discounts previous DQM values more quickly, and 0 < λ < 1.
2) DQM(t) is the value of the DQM at time period t.
3) EWDQS(t) is the value of the score at time period t.
4. Case Study
Imagine Bank A has a lot of address data, and the DBA wants to check its data quality.
After profiling the data via a data quality tool in January, he notices some problems: 5%
of the data is incorrect, 7.8% of the data contains null values, and 12.5% of the data is
inconsistent. The engineer utilizes our formulas to measure the data quality score.
According to Equations 1 and 3:
DQM (1) = (1-0.05) × 30 + 15 + 5 + (1-0.078) × 20 + (1-0.125) × 10 + 20 = 95.69
EWDQS (1) = 95.69
In February, the engineer checks it again; the data still has some quality problems: 4.5%
of the data is incorrect, 7.0% of the data contains null values, and 10.5% of the data is
inconsistent. According to Equations 1 and 4, and if λ = 0.75:
DQM (2) = (1-0.045) × 30 + 15 + 5+ (1-0.070) × 20 + (1-0.105) × 10 +20 = 96.2
EWDQS (2) = 0.75× DQM (2) + (1-0.75) × EWDQS (1)
= 0.75×96.2 +0.25 × 95.69 = 96.07
Therefore, the exponentially weighted data quality score of Bank A shifts from 95.69
in January to 96.07 in February.
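The case study's arithmetic can be checked with a short script. The weight vector below is read off the worked equations (the paper does not list the weights separately), and metrics with no reported problems receive an error rate of zero:

```python
def dqm(error_rates, weights):
    # Equation 1 with Mi = (1 - error_rate_i): weighted sum of good-data fractions.
    return sum((1 - e) * w for e, w in zip(error_rates, weights))

weights = [30, 15, 5, 20, 10, 20]  # taken from the worked equations; sums to 100

dqm1 = dqm([0.05, 0, 0, 0.078, 0.125, 0], weights)   # January:  incorrect, null, inconsistent
dqm2 = dqm([0.045, 0, 0, 0.070, 0.105, 0], weights)  # February

ewdqs1 = dqm1                             # Equation 3: initial score
lam = 0.75
ewdqs2 = lam * dqm2 + (1 - lam) * ewdqs1  # Equation 4: recursion
print(round(dqm1, 2), round(dqm2, 2), round(ewdqs2, 2))  # 95.69 96.2 96.07
```

Running this reproduces the paper's values of 95.69, 96.2, and 96.07.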
Few documents discuss data profiling tool frameworks, because commercial
companies consider them trade secrets [19] [20] [21] [22] [23] [24]. However, as
big data grows, data governance, data quality control, and data profiling technology
grow more important [1] [25] [26]. People need data profiling tools to perform the big
data analysis necessary to improve data quality, but they are currently challenged by
limited performance and insufficient robustness. Customers want these tools to profile
real-time data, static data, sample data, and full-volume data, and we should
strive to satisfy their requirements. The data profiling framework has six layers: the
hardware layer, data layer, parallelism layer, algorithms layer, function layer, and Web
UI layer (Fig. 3).
Fig. 3. Data Profiling Architecture
Each layer has a different application (Table 5). At the Algorithm Layer, for instance, machine
learning algorithms could be employed that automatically analyze the patterns and rules of data
as well as data structures in order to enhance data quality and facilitate its governance. These
algorithms could run on x86 machines or graphics processing units (GPUs), such as NVIDIA
CUDA devices [27].
Table 5. Data profiling framework
Web-UI Layer      | User interface for ER model, business rules, KPI dashboard, batch or real-time job maintenance, and input/output data source configuration.
Function Layer    | Scores of data profiling missions (Section III).
Algorithm Layer   | Utilizes data mining, machine learning, statistics, or other algorithms.
Parallelism Layer | Apache Spark [28] for static data and full-volume data; Apache Storm [29] for real-time data.
Data Layer        | Business rules, configuration data, and metadata stored in Hadoop or traditional databases.
Hardware Layer    | Integrates CPU and GPU clusters to improve performance, especially for real-time or machine learning tasks.
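The parallelism layer's core idea, profiling many columns concurrently, can be sketched with the standard library. A production system would use Spark or Storm as in Table 5, so the thread pool and table below are only an illustrative stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

# Invented two-column table; each column is profiled as an independent task.
table = {
    "zip":   ["72204", "72211", None, "72204"],
    "phone": ["501-555-0143", None, None, "501-555-9821"],
}

def profile_column(item):
    """Per-column profiling task: null count and distinct count."""
    name, values = item
    non_null = [v for v in values if v is not None]
    return name, {"nulls": len(values) - len(non_null),
                  "distinct": len(set(non_null))}

with ThreadPoolExecutor() as pool:
    profiles = dict(pool.map(profile_column, table.items()))
print(profiles)
```

Because column profiles are independent, the same task shape maps naturally onto Spark partitions for static data or Storm bolts for streams.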
As data profiling enriches data quality, data management, and data governance, it is
important for customers, data scientists, and DBAs to use data profiling tools. Data is an
asset for all users, so data quality should be controlled by procedures, rules, people,
and software. In this paper, after reviewing existing relevant works, we presented a new
data profiling definition and new classifications of data profiling tasks, and we discussed
several free and paid data profiling tools. Moreover, the paper introduced a method for building data
quality metrics and showed how to calculate data quality scores. Finally, a data profil-
ing tool framework was given.
Data profiling only grows more important in this new era of big data. We will continue
to study academic papers and technical documents, develop code and algorithms for
data profiling on big data, and find new ways to extend the functions of data profiling
in the future.
1. Zikopoulos, Paul, and Chris Eaton. Understanding big data: Analytics for enterprise class
hadoop and streaming data. McGraw-Hill Osborne Media, 2011.
2. Buneman, Peter. "Semistructured data." In Proceedings of the Sixteenth ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 117-121. ACM, 1997.
3. Buneman, Peter, Susan Davidson, Mary Fernandez, and Dan Suciu. "Adding structure to
unstructured data." In Database Theory (ICDT '97), pp. 336-350. Springer Berlin Heidel-
berg, 1997.
4. Khatri, Vijay, and Carol V. Brown. "Designing data governance." Communications of the
ACM 53, no. 1 (2010): 148-152.
5. Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. "Data quality assessment." Commu-
nications of the ACM 45, no. 4 (2002): 211-218.
6. Wang, Richard Y. "A product perspective on total data quality management." Communica-
tions of the ACM 41, no. 2 (1998): 58-65.
7. Kumar, Roushan, and Arun Yadav. "Aggregate Profiler -- Data Quality." Accessed October 20, 2015.
8. Talend Company. "Talend Open Studio for Data Quality." Accessed October 20, 2015.
9. DataCleaner Company. "DataCleaner Manual." Accessed October 20, 2015.
10. IBM Company. "InfoSphere Information Server: Information Center." Accessed October 20, 2015.
11. Informatica Company. "Data Profiling Solutions." Accessed October 20, 2015.
12. Oracle Company. “Oracle Enterprise Data Quality.” Accessed October 20, 2015.
13. SAP Company. “SAP Information Steward.” Accessed October 20, 2015.
14. SAS Company. "SAS Products: DataFlux Data Management Studio." Accessed October 20, 2015.
15. "A Data Governance Solution Tailored for Your Role." Collibra Solution Comments. Ac-
cessed November 12, 2015.
16. Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. "Data quality assessment." Commu-
nications of the ACM 45, no. 4 (2002): 211-218.
17. Rahm, Erhard, and Hong Hai Do. "Data cleaning: Problems and current approaches." IEEE
Data Eng. Bull. 23, no. 4 (2000): 3-13.
18. Lee, Yang W., Leo L. Pipino, James D. Funk, and Richard Y. Wang. Journey to data quali-
ty. The MIT Press, 2009.
19. Moody, Daniel L. "Metrics for evaluating the quality of entity relationship models." In
Conceptual Modeling–ER’98, pp. 211-225. Springer Berlin Heidelberg, 1998.
20. Ballou, Donald P., and Giri Kumar Tayi. "Enhancing data quality in data warehouse envi-
ronments." Communications of the ACM 42, no. 1 (1999): 73-78.
21. Calero, Coral, Mario Piattini, Carolina Pascual, and Manuel A. Serrano. "Towards Data
Warehouse Quality Metrics." In DMDW, p. 2. 2001.
22. Loshin, D. "Monitoring Data Quality Performance Using Data Quality Metrics: A White
Paper." Informatica. November (2006).
23. "The Six Primary Dimensions for Data Quality Assessment." The Six Primary Dimensions
for Data Quality Assessment. Accessed November 5, 2015.
24. "The Ultimate Guide to Data Governance Metrics : Healthcare Edition:40 Ways for Payers
and Providers to Measure Information Quality Success." The Ultimate Guide to Data Gov-
ernance Metrics : Healthcare Edition. 2012. Accessed November 5, 2015.
25. Zikopoulos, Paul, and Chris Eaton. Understanding big data: Analytics for enterprise class
hadoop and streaming data. McGraw-Hill Osborne Media, 2011.
26. LaValle, Steve, Eric Lesser, Rebecca Shockley, Michael S. Hopkins, and Nina Kruschwitz.
"Big data, analytics and the path from insights to value." MIT Sloan Management Review
52, no. 2 (2011): 21-32.
27. "CUDA GPUs." NVIDIA Developer. June 4, 2012. Accessed November 12, 2015.
28. "Apache Spark™ - Lightning-Fast Cluster Computing." Apache Spark™ - Lightning-Fast
Cluster Computing. Accessed November 12, 2015.
29. "Apache Storm." Apache Storm. Accessed November 12, 2015.
... The development of data governance activities can be demonstrated through a range of strategic actions-specific actions undertaken by the firm to ensure the smooth deployment of data governance (e.g., [6]). For example, Dai et al. [25] highlighted the importance of digital technologies in improving the data quality in the process of data governance. Additionally, Benfeldt et al. [11] theorized the six challenges of data governance by considering it as a collective action. ...
... The role of technology for the enterprise has thus risen from functional to a strategic level. Indeed, data governance in enterprises can be seen as a set of corporate governance activities around data supported by various technologies [25,98]. Moreover, the design of platform-related technologies can provide complex interactions between multiple parties using or sharing data, thus enhancing data governance's efficiency and eliminating data privacy hazards [56,96]. ...
While there has been a wealth of research exploring data governance, there are still some gaps in how firms deploy data governance and what strategic actions they take to do so, especially as the volume of data increases dramatically and the pace of data assetization accelerates. To achieve this end, through an in-depth case study of a Chinese gold mining company, namely Shandong Gold, we develop a framework to explain how firms configure data governance activities and conduct related strategic actions. Our study identifies four key data governance activities that are supported by two strategic actions. Overall, we contribute to research in data governance and strategic action fields and also provide an alternative implementation framework for practitioners.
... The categories of data profiling, data pre-processing, data assessment, and data assurance can be grouped under data fitness. Data profiling includes the processes, tools, and skilled resources required to identify characteristics and understand the meaning and structure of critical data, conduct root-causes, and impact analyses [28]. This leads to the benefits of understanding whether the data is fit for purpose, reduced cycle times for critical projects, and the possibility to compare the data with user expectations [28]. ...
... Data profiling includes the processes, tools, and skilled resources required to identify characteristics and understand the meaning and structure of critical data, conduct root-causes, and impact analyses [28]. This leads to the benefits of understanding whether the data is fit for purpose, reduced cycle times for critical projects, and the possibility to compare the data with user expectations [28]. Certain data pre-processing and improvement techniques are required to avoid costly consequences from low data quality. ...
Currently, data quality is in the spotlight of research and organizations. It derives from new technological developments, such as the Internet of Things (IoT), which provides unprecedented amounts of data and enables new ways of creating knowledge. The interim value is hidden in the flood of data and has already received many industries such as construction, manufacturing, and healthcare. These organizations adopt data applications to extract critical information to understandits purpose better, leading to a competitive advantage. In parallel to gain these positive effects based on struggling around with the data, the quality of data and the decisions are neglected very often. Decisions must be made quickly in response to changing requirements. Thereby, agile methods and approaches can be successfully and profitably applied. The decisive factor for success is the holistic view of architecture, organization, technology, and adapted process models. Due to the given ideas, research on data quality combined with agile companies is still in its infancy. This paper presents categories for data quality in combination agility based on literature and expert interviews to close this gap and establish a foundation for future research.
... Aggregate Profiler (AP) is a freely available DQ tool, which is dedicated to data profiling. The tool was discovered twice in our systematic search: once because it was mentioned by Dai et al. (2016) in the Springer search results, and once in the Google search results, since it is also published on Sourceforge as "Open Source Data Quality and Profiling, " 4 developed by arrah and arunwizz. In addition to its data profiling capabilities, like statistical analysis and pattern matching, Aggregate Profiler can also be used for data preparation and cleansing activities, like address correction or duplicate removal. ...
Full-text available
High-quality data is key to interpretable and trustworthy data analytics and the basis for meaningful data-driven decisions. In practical scenarios, data quality is typically associated with data preprocessing, profiling, and cleansing for subsequent tasks like data integration or data analytics. However, from a scientific perspective, a lot of research has been published about the measurement (i.e., the detection) of data quality issues and different generally applicable data quality dimensions and metrics have been discussed. In this work, we close the gap between data quality research and practical implementations with a detailed investigation on how data quality measurement and monitoring concepts are implemented in state-of-the-art tools. For the first time and in contrast to all existing data quality tool surveys, we conducted a systematic search, in which we identified 667 software tools dedicated to “data quality.” To evaluate the tools, we compiled a requirements catalog with three functionality areas: (1) data profiling, (2) data quality measurement in terms of metrics, and (3) automated data quality monitoring. Using a set of predefined exclusion criteria, we selected 13 tools (8 commercial and 5 open-source tools) that provide the investigated features and are not limited to a specific domain for detailed investigation. On the one hand, this survey allows a critical discussion of concepts that are widely accepted in research, but hardly implemented in any tool observed, for example, generally applicable data quality metrics. On the other hand, it reveals potential for functional enhancement of data quality tools and supports practitioners in the selection of appropriate tools for a given use case.
... Deep learning algorithms' accuracy can be improved by feeding them high-quality images [23]. Data accuracy is necessary because people need to verify and review the quality of data to ensure that a device is effective and efficient [20], [23], [24]. As an example, a study in [25] used a CNN for image classification. ...
Full-text available
This study focuses on developing an automated egg incubator with a camera-assisted candler for egg maturity detection of balut and penoy commercial duck eggs. The incubator is a four-layer chamber installed with a heater, fan, and DHT11 sensors. DHT11 sensors are interfaced with a Raspberry Pi 4 to observe and maintain the optimal parameters inside the incubator. Trays with built-in candlers made from fluorescent bulbs are placed per layer with a capacity of 20 eggs positioned on rollers. These rollers are programmed to drive every 8 hours for 5 minutes for the egg turning which is essential in incubating eggs. Cameras are installed to capture the images of the candled eggs on their 1st, 10th, and 18th day. The result will be shown on a monitor with a user-friendly GUI which will help the vendor to determine the condition and maturity of the eggs inside the incubator. A region-based convolutional neural network (R-CNN/RCNN) was used as the classifier algorithm for balut, penoy, and fresh eggs. The classification accuracy of the proposed system is 80.5%.
... For each generated sample, another set of samples is created by re-sampling with replacement. (b) Profiling: The data profiling module performs the data quality screening based on statistics and information summary [59][60][61]. Since profiling is meant to discover data characteristics from data sources, it is considered a data assessment process that provides a first summary of the data quality reported in its data profile. ...
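The re-sampling-with-replacement step described in this excerpt is a bootstrap. A minimal sketch, assuming an illustrative helper name and a null-rate statistic as the profiled quantity (neither is from the cited framework):

```python
import random
from statistics import mean

def bootstrap_estimate(sample, stat, n_rounds=1000, seed=42):
    """Re-sample with replacement and apply a profiling statistic to each
    replicate, giving a distribution of the statistic (a bootstrap)."""
    rng = random.Random(seed)
    replicates = [
        stat([rng.choice(sample) for _ in sample]) for _ in range(n_rounds)
    ]
    return mean(replicates), min(replicates), max(replicates)

# Example: how stable is a null-rate estimate from a small sample?
null_rate = lambda xs: sum(x is None for x in xs) / len(xs)
sample = [1, None, 3, 4, None, 6, 7, 8, 9, 10]
avg, lo, hi = bootstrap_estimate(sample, null_rate)
```

The spread between `lo` and `hi` indicates how much the quality estimate would vary across samples, which is why sampling-based profiling is cheaper than scanning the full Big Data source.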
Full-text available
Big Data is an essential research area for governments, institutions, and private agencies to support their analytics decisions. Big Data refers to all about data, how it is collected, processed, and analyzed to generate value-added data-driven insights and decisions. Degradation in Data Quality may result in unpredictable consequences. In this case, confidence and worthiness in the data and its source are lost. In the Big Data context, data characteristics, such as volume, multi-heterogeneous data sources, and fast data generation, increase the risk of quality degradation and require efficient mechanisms to check data worthiness. However, ensuring Big Data Quality (BDQ) is a very costly and time-consuming process, since excessive computing resources are required. Maintaining Quality through the Big Data lifecycle requires quality profiling and verification before its processing decision. A BDQ Management Framework for enhancing the pre-processing activities while strengthening data control is proposed. The proposed framework uses a new concept called Big Data Quality Profile. This concept captures quality outline, requirements, attributes, dimensions, scores, and rules. Using Big Data profiling and sampling components of the framework, a faster and efficient data quality estimation is initiated before and after an intermediate pre-processing phase. The exploratory profiling component of the framework plays an initial role in quality profiling; it uses a set of predefined quality metrics to evaluate important data quality dimensions. It generates quality rules by applying various pre-processing activities and their related functions. These rules mainly aim at the Data Quality Profile and result in quality scores for the selected quality attributes. 
The framework implementation and dataflow management across various quality management processes are discussed; the paper concludes with ongoing work on framework evaluation and deployment to support quality evaluation decisions.
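The abstract above describes quality scores computed over selected quality dimensions. A minimal sketch of such a weighted dimension-score aggregation; the dimension names and weights are illustrative assumptions, not the cited framework's actual metrics:

```python
def quality_score(dimension_scores, weights):
    """Aggregate per-dimension quality scores (each in [0, 1]) into one
    weighted overall score. Dimensions and weights are illustrative."""
    total = sum(weights.values())
    return sum(dimension_scores[d] * w for d, w in weights.items()) / total

scores = {"completeness": 0.95, "validity": 0.80, "uniqueness": 0.90}
weights = {"completeness": 2, "validity": 1, "uniqueness": 1}
overall = quality_score(scores, weights)  # (0.95*2 + 0.80 + 0.90) / 4 = 0.90
```

A score like this can then be compared against a threshold in the quality profile to decide whether pre-processing (cleansing, deduplication) is needed before the data moves on.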
... Figure 3 shows the image we used in this study. Noise could affect data quality [25,26,27,28,29], especially image quality in this research. Instead of discussing low-quality images, we chose high-resolution images. ...
... 53 Madera, C. et Laurent, A. (2016) 54 Gray, J. et al. (1997) 55 Berkowitz, B. T. et al. (2002) 56 Demchenko, Y. et al. (2013) 57 Dai, W. et al. (2016) With data governance and security tools such as Distributed Ledger Technology (DLT) 58 and blockchain, it is possible to implement automatic validation systems through a consensus algorithm that replicates, shares and synchronises digital data across different locations. Blockchain is a distributed digital ledger of all transactions or information shared across a peer-to-peer network verified by all participants 59. ...
Full-text available
The increasing use of financial technologies (FinTech) by market participants fostered the discussion among public authorities on the use of similar technologies for regulatory (RegTech) and supervisory (SupTech) purposes. In a similar vein, innovative technologies could be applied in the context of financial firms' crisis resolution. The resolution context is, however, peculiar: since resolution is not a profit-making activity, there is little market incentive for the private sector to foster innovation in this area. Therefore, resolution authorities may decide to drive the innovation by developing big data technologies and solutions to resolve financial crises. This paper sets the definition of Resolution Technology (ResTech), and outlines its areas of application. ResTech is built on innovative technologies, which could support the work of resolution authorities in developing resolution plans and in resolving financial firms. Also, ResTech could guide firms' compliance functions. This paper identifies four main areas of ResTech: resolution planning, resolution execution, cross-border exchange of information and automatised compliance for banks and financial firms. The adoption of big data architectures could transform resolution planning into a dynamic activity; in the same vein, machine learning algorithms could boost the application of existing resolution tools and support the determination of the resolution strategy. This paper concludes that the benefits of ResTech need to be measured against the increased risks taken by technology adopters. In a technology-driven environment, resilient IT infrastructures and e-governance processes are essential to prevent operational, reputational and legal risks.
Full-text available
The rapid development of deep learning improves the detection and classification of attacks on intrusion detection systems. However, the unbalanced data issue increases the complexity of the architecture model. This study proposes a novel deep learning model to overcome the problem of classifying multi-class attacks. The deep learning model consists of two stages. The pre-tuning stage uses automatic feature extraction with a deep autoencoder. The second stage is fine-tuning using deep neural network classifiers with fully connected layers. To reduce imbalanced class data, the feature extraction was implemented using the deep autoencoder and an improved focal loss function in the classifier. The model was evaluated using 3 loss functions, including cross-entropy, weighted cross-entropy, and focal losses. The results could correct the class imbalance in deep learning-based classifications. Attack classification using automatic feature extraction with the focal loss on the CSE-CIC-IDS2018 dataset yielded a high-quality classifier with 98.38% precision, 98.27% sensitivity, and 99.82% specificity.
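The focal loss used in the study above down-weights easy examples so training focuses on the hard, minority-class ones. A minimal sketch of the standard binary form (the function name and default `gamma`/`alpha` values are the commonly used ones, not necessarily this study's exact configuration):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted probability of the positive class; y: true label (0 or 1).
    The (1 - p_t)^gamma factor shrinks the loss of well-classified
    examples, counteracting class imbalance."""
    p_t = p if y == 1 else 1 - p
    a_t = alpha if y == 1 else 1 - alpha
    return -a_t * (1 - p_t) ** gamma * math.log(p_t)

# An easy, correctly classified example contributes far less loss
# than a hard one near the decision boundary.
easy = focal_loss(0.95, 1)  # confident and correct -> tiny loss
hard = focal_loss(0.10, 1)  # confident and wrong  -> large loss
```

With `gamma = 0` and `alpha = 0.5` this reduces (up to scaling) to ordinary cross-entropy, which is why the study can compare the three losses within one architecture.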
In this work we have developed an approach for assessing the quality of data related to business processes in quality projects. The approach combines the cost of implementing quality with the impact of quality, broken down into data benefit and data efficiency; the Shapley value helps us choose the business processes that will collaborate to reduce the cost of improvement, and deep learning helps us calculate the quality values for any dimension based on the history of previous improvements. To reach our goal, we used the cost-benefit approach (ACB) and the cost-effectiveness approach (ACE) to extract the impact and cost factors; then, using a multi-objective optimization algorithm, we minimize the cost and maximize the impact for each business process, and the deep learning component complements our approach by learning from previous improvements, after validation of the chosen processes and of the values calculated after improvement. The importance of this research lies in the use of impact factors and the cost of the quality evaluation, which represent the basis of any improvement. Our approach uses generic multi-objective optimization algorithms that help choose the minimum value of each business process before improvement, adding a layer that predicts and estimates the quality value of the data generated by the business process even before the improvement, while the Shapley value aims to minimize the cost of quality projects during fissions and mergers of companies, and even within a company composed of several services and departments, so as to obtain the lowest possible total cost and help companies manage their portfolios of quality projects. Keywords: Artificial neural network, data quality assessment, data quality improvement, deep learning, prediction of improvement in data completeness, Shapley value.
Full-text available
The article reports on enhancement of data quality in data warehouse environments. A conceptual framework is offered for enhancing data quality in data warehouse environments. Factors are explored such as the current level of data quality, the levels of quality needed by the relevant decision process, and the potential benefits of projects designed to enhance data quality. Those who are responsible for data quality have to understand the importance of such factors. For warehouses supporting a limited number of decision processes, awareness of these issues coupled with good judgment should suffice. Data warehousing efforts may not succeed for various reasons, but nothing is more certain to yield failure than lack of concern for the quality of the data. Data supporting organizational activities in a meaningful way should be warehoused. A distinguishing characteristic of warehoused data is that it is used for decision making, rather than for operations. Data warehousing efforts have to address several potential problems.
Full-text available
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
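One of the classic data cleaning problems this abstract refers to is duplicate detection across heterogeneous sources. A minimal sketch using a normalized grouping key; the function name, key normalization (trim + lowercase), and sample records are illustrative assumptions:

```python
from collections import defaultdict

def find_duplicates(records, key_fields):
    """Group records by a normalized key (trimmed, lowercased field values),
    a simple instance of the duplicate-detection step in data cleaning / ETL.
    Returns only the groups that contain more than one record."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        groups[key].append(rec)
    return [recs for recs in groups.values() if len(recs) > 1]

# Two spellings of the same customer collapse onto one key.
customers = [
    {"name": "Ada Lovelace ", "city": "London"},
    {"name": "ada lovelace", "city": "LONDON"},
    {"name": "Alan Turing", "city": "London"},
]
dupes = find_duplicates(customers, ["name", "city"])
```

Real ETL pipelines replace the exact key with fuzzy matching (e.g., edit distance or phonetic keys), but the group-by-key structure stays the same.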
Full-text available
We develop a new schema for unstructured data. Traditional schemas resemble the type systems of programming languages. For unstructured data, however, the underlying type may be much less constrained and hence an alternative way of expressing constraints on the data is needed. Here, we propose that both data and schema be represented as edge-labeled graphs. We develop notions of conformance between a graph database and a graph schema and show that there is a natural and efficiently computable ordering on graph schemas. We then examine certain subclasses of schemas and show that schemas are closed under query applications. Finally, we discuss how they may be used in query decomposition and optimization.
Conference Paper
This paper defines a comprehensive set of metrics for evaluating the quality of Entity Relationship models. This is an extension of previous research which developed a conceptual framework and identified stakeholders and quality factors for evaluating data models. However quality factors are not enough to ensure quality in practice, because different people will have different interpretations of the same concept. The objective of this paper is to refine these quality factors into quantitative measures to reduce subjectivity and bias in the evaluation process. A total of twenty five candidate metrics are proposed in this paper, each of which measures one of the quality factors previously defined. The metrics may be used to evaluate the quality of data models, choose between alternatives and identify areas for improvement.
Introduction: Organizations are becoming increasingly serious about the notion of "data as an asset" as they face increasing pressure for reporting a "single version of the truth." In a 2006 survey of 359 North American organizations that had deployed business intelligence and analytic systems, a program for the governance of data was reported to be one of the five success "practices" for deriving business value from data assets. In light of the opportunities to leverage data assets as well as ensure legislative compliance to mandates such as the Sarbanes-Oxley (SOX) Act and Basel II, data governance has also recently been given significant prominence in practitioners' conferences, such as TDWI (The Data Warehousing Institute) World Conference and DAMA (Data Management Association) International Symposium. The objective of this article is to provide an overall framework for data governance that can be used by researchers to focus on important data governance issues, and by practitioners to develop an effective data governance approach, strategy and design. Designing data governance requires stepping back from day-to-day decision making and focusing on identifying the fundamental decisions that need to be made and who should be making them. Based on Weill and Ross, we also differentiate between governance and management as follows: • Governance refers to what decisions must be made to ensure effective management and use of IT (decision domains) and who makes the decisions (locus of accountability for decision-making). • Management involves making and implementing decisions. For example, governance includes establishing who in the organization holds decision rights for determining standards for data quality. Management involves determining the actual metrics employed for data quality. Here, we focus on the former.
Corporate governance has been defined as a set of relationships between a company's management, its board, its shareholders and other stakeholders that provide a structure for determining organizational objectives and monitoring performance, thereby ensuring that corporate objectives are attained. Considering the synergy between macroeconomic and structural policies, corporate governance is a key element in not only improving economic efficiency and growth, but also enhancing corporate confidence. A framework for linking corporate and IT governance (see Figure 1) has been proposed by Weill and Ross. Unlike these authors, however, we differentiate between IT assets and information assets: IT assets refers to technologies (computers, communication and databases) that help support the automation of well-defined tasks, while information assets (or data) are defined as facts having value or potential value that are documented. Note that in the context of this article, we do not differentiate between data and information. Next, we use the Weill and Ross framework for IT governance as a starting point for our own framework for data governance. We then propose a set of five data decision domains, why they are important, and guidelines for what governance is needed for each decision domain. By operationalizing the locus of accountability of decision making (the "who") for each decision domain, we create a data governance matrix, which can be used by practitioners to design their data governance. The insights presented here have been informed by field research, and address an area that is of growing interest to the information systems (IS) research and practice community.
Conference Paper