Data Profiling Technology of Data Governance
Regarding Big Data: Review and Rethinking
Wei Dai1*, Isaac Wardlaw2%, Yu Cui3, Kashif Mehdi4, Yanyan Li5#, Jun Long6
1Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA
{wxdai*, iawardlaw%, yxli5#}@ualr.edu
2,5Computer Science, University of Arkansas at Little Rock, Little Rock, AR, USA
3College of Information Engineering, Guangdong Mechanical and Electrical Polytechnic,
Guangzhou, Guangdong, China
cuiyu@gdmec.edu.cn
4Software Development Group, Collibra Inc., New York, NY, USA
kashif.mehdi@collibra.com
6Information Science and Engineering, Central South University, Changsha, Hunan, China
jlong@csu.edu.cn
Abstract. Data profiling technology is very valuable for data governance and
data quality control because people need it to verify and review the quality of
structured, semi-structured, and unstructured data. In this paper, we first review
relevant works and discuss their definitions of data profiling. Second, we offer a
new definition and propose new classifications for data profiling tasks. Third,
the paper presents several free and commercial profiling tools. Fourth, the authors offer new data quality metrics and a data quality score calculation. Finally, the authors discuss a data profiling tool framework for big data.
Keywords: data profiling tools; data governance; big data; data quality control;
data management.
1 INTRODUCTION
Data is ubiquitous. People use digital maps to navigate in major cities, send emails to
communicate, and buy e-tickets online every day. The more computers assist people, the larger the volume of data that will be stored. Big data marks a new era in data utilization and exploration; its key characteristics are volume, variety, and velocity [1].
Data can be structured, as is scientific and statistical data; semi-structured, such as
PDF and XML; or unstructured, like raw video or audio [2]. Structured data can be
accessed by tools like SQL, while XQuery is usually used to access semi-structured
data. Because of the complexity of unstructured data, data profiling tools usually fo-
cus on structured data and semi-structured data.
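To make this distinction concrete, the short sketch below accesses a small structured table with SQL and a semi-structured XML document with a path expression. It is only an illustration; the table, column names, and XML layout are hypothetical, and Python's standard library stands in for full SQL and XQuery engines.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured data: query a relational table with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, zip TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "72204"), (2, "10001")])
rows = conn.execute("SELECT id, zip FROM customers WHERE zip LIKE '7%'").fetchall()
print(rows)  # [(1, '72204')]

# Semi-structured data: navigate an XML document with a path expression
# (the XPath-style access that XQuery generalizes).
doc = ET.fromstring("<orders><order id='A1'><zip>72204</zip></order></orders>")
zips = [z.text for z in doc.findall("./order/zip")]
print(zips)  # ['72204']
```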
Data quality plays an important role in data governance. According to [3] and [4],
data governance takes a unique role in companies because of the Sarbanes-Oxley
(SOX) Act and Basel II. Data is a valuable asset for customers, companies, and gov-
ernments. The target of Total Data Quality Management (TDQM) is to offer high-
quality information products (IP) to users, so definition, measurement, analysis, and
improvement of data quality are included in the TDQM cycle [5] [6].
Data profiling technology can improve data quality. In [7], [8], and [9], data profiling tools discover the patterns of datasets and support many TDQM projects, including data cleaning, data integration, and data analysis. Data scientists, end-users, and IT engineers utilize these tools to improve data quality because data profiling tools easily show the frequency patterns of address information, credit cards, and phone numbers.
The remainder of this paper is organized as follows: Section II reviews relevant works and discusses a new definition of data profiling. Section III presents new classifications of data profiling tasks. Sections IV and V introduce several data profiling tools, including both free and commercial software (Fig. 1). Section VI discusses data quality metrics and an exponentially weighted data quality score in detail. Section VII discusses a data profiling tool framework for big data. Section VIII concludes the main ideas, and Section IX addresses future work.
Fig. 1. Selected Data Profiling Tools
2 DATA PROFILING DEFINITION
Different authors give different definitions of data profiling. [7] describes it as “the
set of activities and processes to determine the metadata about a given dataset.” Ac-
cording to [10], “data profiling is a process whereby one examines the data available
in an existing database or flat file and collects statistics and information about that
data. These statistics become the real or true metadata.” [11] defines it as “referring to the activity of creating small but informative summaries of a database.” These definitions seem convincing at first, but [7] restricts data profiling to metadata about datasets, [10] does not mention unstructured data at all, and the definition in [11] is vague because data has different types, structures, and uses.
Data profiling can be utilized at different stages of data governance. Thus, in our opinion, data profiling is the process of verifying users’ structured data, semi-structured data, and unstructured data; gathering data structures, data patterns, statistical information, and distribution information; and reviewing data attributes for data governance, data management, data migration, and data quality control.
3 DATA PROFILING TASKS
Data profiling includes multiple tasks for data governance. According to [8]
and [10], people need to profile data only when cleaning, managing, analyzing, or
integrating it; evaluating its quality; performing data archaeology; optimizing queries;
or engaging in Extract, Transform and Load (ETL) projects. However, data profiling
could also be used for compressing, verifying, masking, auditing, migrating, archiv-
ing, and recovering data as well as for generating testing data and data health reports.
[7] and [8] list classifications of data profiling tasks; however, these classifications are incomplete. For example, [8] separates data quality jobs only by whether they use a single data source or multiple sources. In our view, data profiling tools perform five primary jobs (a small illustrative sketch follows the list):
1. Metadata Profiling: discovering metadata information, such as data structures,
creators, times of creation, primary keys, and foreign keys.
2. Presentation Profiling: finding data patterns, including text, time, and number patterns, such as address, date, and telephone patterns.
3. Content Profiling: reviewing basic data information, including accuracy, precision, timeliness, and null or non-null values.
4. Set Profiling: analyzing data from collections or groups; for example, statistics, distributions, cardinality, frequency, uniqueness, row counts, maximum or minimum values, means, and redundancy.
5. Logical Rule Profiling: reviewing data based on business logic rules or the business glossary domain, such as data logical meanings, business rules, and functional dependencies.
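As a concrete illustration of several of these jobs, the following minimal sketch profiles a single hypothetical phone-number column with the Python standard library. The column values and the digit-masking pattern are assumptions made for this example; a real profiling tool would cover many more checks.

```python
import re
from collections import Counter

# Hypothetical sample column: customer phone numbers.
phones = ["501-555-0123", "501-555-0188", None, "5015550199", "501-555-0123"]

# Content profiling: null vs. non-null counts.
nulls = sum(1 for v in phones if v is None)
non_null = [v for v in phones if v is not None]

# Presentation profiling: frequency of text patterns (every digit masked as '9').
patterns = Counter(re.sub(r"\d", "9", v) for v in non_null)

# Set profiling: row count and distinct count.
distinct = len(set(non_null))

print(f"rows={len(phones)}, nulls={nulls}, distinct={distinct}")
print("pattern frequencies:", dict(patterns))
# rows=5, nulls=1, distinct=3
# pattern frequencies: {'999-999-9999': 3, '9999999999': 1}
```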
Different data profiling missions could be used for different projects, as shown in Table 1.
Table 1. Data profiling missions for user scenarios

Profiling Mission        | Profiling Scenario
Metadata Profiling       | Data management, data integration, ETL, data migration
Presentation Profiling   | Data compression, data audit
Content Profiling        | Data compression, data management
Set Profiling            | Data audit, data compression, data management
Logical Rule Profiling   | Data audit, data management
Some sorts of data profiling (such as content profiling and logical rule profiling) are strongly linked to the business domain, but others are not. Fig. 2 shows this relationship.
Fig. 2. Data Profiling in different technical or business domains.
4 FREE DATA PROFILING TOOLS
Many, but not all, free data profiling tools are open-source projects. In general, their functionality is more limited
than that of commercial products, and they may not offer free telephone or online
support. Furthermore, their documentation is not always thorough. However, some
small companies still use these free tools instead of expensive commercial software,
considering the benefits that free tools provide.
1. Aggregate Profiler Tool
Aggregate Profiler (AP) is an open source project developed in Java [7]. AP supports both traditional databases and big data platforms such as Hadoop and Hive, and it offers statistical analysis, pattern matching, distribution charts, basket analysis, etc. AP also supports
data generation, data preparation, data masking features, and address correction for
data quality projects. Moreover, this tool offers data validation (metadata profiling,
analytical profiling, and structural profiling), and data quality (removing duplicate
data, null values, and dirty data).
2. Talend Open Studio for Data Quality
Talend Open Studio for Data Quality (TOSDQ) [8] is also based on Java and is a
mature open-source tool. TOSDQ offers a navigator interface to access databases and
data files. This tool supports catalog analysis, time correlation analysis, column anal-
ysis, table analysis, column correlation analysis, and schema analysis; it also supports
column functional dependency, redundancy analysis, numerical correlation analysis,
nominal correlation analysis, connection analysis, column set analysis, and match
analysis. Furthermore, TOSDQ reports several different types of statistics indicators,
including simple statistics, text statistics, summary statistics, pattern frequency statis-
tics, Soundex frequency statistics, phone number statistics, and fraud detection (Benford's law frequency).
3. DataCleaner
DataCleaner [9] is a commercial tool for data profiling and data cleaning, but it has a
free version which offers multiple data profiling functions, including pattern match-
ing, boolean analysis, weekday distribution, completeness analysis, value matcher,
character set distribution, value distribution, date gap analysis, unique key check,
date/time analysis, string analysis, number analysis, referential integrity, and refer-
ence data matching.
5 COMMERCIAL DATA PROFILING PRODUCTS
Commercial data profiling products usually come packaged in data governance suites.
These products have multiple functions, high performance, and strong capabilities;
they can connect to other suites to provide comprehensive solutions for customers.
Moreover, this software is not only powerful, but end-users can also find online services and telephone support.
1. IBM InfoSphere Information Analyzer
IBM InfoSphere Information Analyzer (IIA) [10] is part of IBM’s data governance
suite that includes InfoSphere Blueprint Director, Metadata Workbench, DataStage,
QualityStage, Data Click, Business Glossary, and Information Services Director. IIA
supports column analysis (statistics, distribution, cardinality, and value analysis),
identifying keys and relationships, discovering redundant data, comparing data and
structures through history baselines, analyzing data via data rules, and importing and
exporting data rules.
2. Informatica Data Profiling
Informatica Data Profiling is a key component of PowerCenter [11]. This profiling
software supports aggregate functions (count null values, calculate averages, get max-
imum or minimum values, and get lengths of strings), candidate key evaluation
(unique or non-unique), distinct value count, domain inference, functional dependen-
cy analysis, redundancy evaluation, and row count. In addition, users can add busi-
ness rules (verbose mode) or configure profile functions in this tool.
3. Oracle Enterprise Data Quality
Oracle Enterprise Data Quality (EDQ) [12] permits address verification, profiling data
(files, databases, and spreadsheets), standardization, audit reviews (incorrect values,
missing data, inconsistencies, duplicate records, and key quality metrics), matching
and merging columns (duplicate prevention, de-duplication, consolidation, and inte-
gration), and case management (data reviewing). Furthermore, the tool can utilize pre-
built templates or user-defined rules to profile data. EDQ can also connect to other
Oracle data governance products, including Oracle Data Integrator and Oracle Master
Data Management.
4. SAP Information Steward
SAP Information Steward can improve information quality and governance [13] via
the Data Insight module (data profiling and data quality monitoring), Metadata Man-
agement module (metadata analysis), Metapedia Module (business term taxonomy),
and cleansing package builder (cleansing rules). The data insight module can define
validation rules, determine profiling (column, address, uniqueness, dependency, and
redundancy), import and export metadata, and create views [13].
5. SAS DataFlux Data Management Studio
SAS DataFlux Data Management Studio (DDMS) [14] is a data governance suite that consists of data profiling, master data management, and data integration. This data pro-
filing tool covers key analysis (primary and foreign keys), pattern frequency distribu-
tion analysis, redundant data analysis, and data profiling reports.
6. Collibra Data Stewardship Manager
Collibra Data Stewardship Manager (DSM) [15] is a module of Collibra’s Data Governance Center, which also includes the Business Semantic Glossary (BSG) and Reference Data Accelerator (RDA) modules. DSM provides historical data quality reports around trend analysis as well as reports to understand the impact of resolved data issues. In addition, DSM provides a fully configurable data quality reporting dashboard by bringing together data quality rules and metrics calculated from one or multiple sources (data quality tools, databases, and big data).
6 DATA QUALITY METRICS
Several academic papers offer data quality measurement methods. In [16], the authors
mention information quality dimensions including accessibility, completeness, and
security; however, these dimensions only focus on information quality for data gov-
ernance. [17] separates data quality problems according to how a data administrator
might view them, for example, single-source, multi-source, instance level, and schema level. The authors state that these problems could be solved by ETL, but these procedures only improve the quality of data after it has been collected and stored in a data warehouse, making the solution inflexible. [18] presents algorithms for calculating
data quality measures, such as free-of-error, completeness, and appropriate-amount-
of-data.
Some papers describe how to build metrics for data quality. In [19], the authors offer
many metrics (25 candidate metrics and 18 subsets of metrics) regarding data quality.
However, they do not mention how to implement them. [20] offers a blueprint for
improving data quality, but it only focuses on the business value of such improve-
ment. In [21], the authors discuss how to build data metrics of a data warehouse from such components as table metrics, star metrics, and schema metrics, but they do
not mention how to profile data quality, enhance data quality of data structures, or
improve unstructured data quality. In [22], the authors mention dimensions of data
quality (for instance, uniqueness, accuracy, and consistency), and they also detail how
to build a model for quantifying data quality performance.
1. Data Quality Indicators
Qualitative indicators are a common way to describe data quality, and we sometimes utilize dimensions to define data quality. In [23], the authors enumerate six indicators of data quality (uniqueness, accuracy, consistency, etc.). Table 2 shows these qualitative indicators and their definitions.
Table 2. Six Indicators of Data Quality [23]

Indicator    | Definition
Accuracy     | The degree to which data correctly describes the ‘real world’ object or event being described.
Completeness | The proportion of stored data against the potential of ‘100% complete’.
Consistency  | Similarity when comparing two or more representations of something against its definition.
Timeliness   | The degree to which data represents reality at the required point in time.
Validity     | Conformity of data’s syntax (format, type, range) to its definition.
Uniqueness   | Nothing will be recorded more than once based upon how it is identified.
2. Data Quality Metrics
Qualitative indicators do not offer precise indexes of data quality, but quantitative met-
rics deliver accurate methods to monitor, control, and improve data quality, which is
valuable for data profiling. Metrics consist of directly measured regulations. [24] dis-
cusses twenty quantitative metrics of data governance for health organizations; howev-
er, these metrics only target at health field. In this paper, we offer comprehensive met-
rics for data quality, shown in Table 3.
Table 3. Quantitative Data Quality Metrics

Indicator    | Metric
Accuracy     | Percent of data that is correct (correct data / total data) (e.g., ZIP code, SSN)
Completeness | Percent of data that is complete (e.g., phone number, address)
Consistency  | Percent of data that is consistent (such as business rules and logical consistency rules)
Timeliness   | Percent of data that is timely (for example, age or educational degree at a specific time or date)
Validity     | Percent of data that is valid (such as first name, last name, and suffix)
Uniqueness   | Percent of data that is unique (e.g., primary keys, foreign keys)
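As a rough illustration of how such percentages might be computed, the sketch below scores completeness, validity, and uniqueness for a hypothetical ZIP-code column with a regular expression. The data, column, and pattern are assumptions for the example, not part of any particular tool.

```python
import re

# Hypothetical ZIP-code column.
zips = ["72204", "10001", None, "ABCDE", "72204"]

zip_pattern = re.compile(r"^\d{5}$")
total = len(zips)

# Completeness: share of non-null values.
completeness = sum(v is not None for v in zips) / total * 100

# Validity: share of values whose syntax matches the expected format.
validity = sum(1 for v in zips if v is not None and zip_pattern.match(v)) / total * 100

# Uniqueness: share of values recorded only once.
non_null = [v for v in zips if v is not None]
uniqueness = sum(1 for v in set(non_null) if non_null.count(v) == 1) / total * 100

print(f"completeness={completeness:.1f}%, validity={validity:.1f}%, uniqueness={uniqueness:.1f}%")
# completeness=80.0%, validity=60.0%, uniqueness=40.0%
```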
Moreover, we can get nearly the same results if we use sampled data instead of the full data volume.
The Data Quality Metric (DQM) formula usually needs data quality indexes, or weights, which depend on specific business scenarios. For example, banks usually send bills via mail or email, so home and email addresses are important to them. However, they may not care much about a customer’s cell phone number because they do not need it. Customers can define their own weights for data quality metrics. Table 4 shows an example of metric weights for an address column.
Table 4. Metric Weights of the Address Column

Metric                                                     | Weight
Accuracy: % of data that is correct (W1)                   | 30
Completeness: % of data that is complete (W2)              | 15
Consistency: % of data that is consistent (W3)             | 10
Timeliness: % of data that is timely (W4)                  | 5
Validity: % of data that is valid (W5)                     | 20
Uniqueness: % of data that is null or non-null values (W6) | 20
Total weight value                                         | 100
The DQM formula is

DQM(t) = \sum_{i=1}^{n} M_i(t) \times W_i    (1)

where M_i(t) is the measured score of the i-th quality indicator at time period t (a fraction between 0 and 1), W_i is its weight, n is the number of indicators (here, n = 6), and the weights sum to 100 (Table 4).
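A minimal sketch of this weighted sum, using the address-column weights from Table 4 and hypothetical indicator scores (chosen to match the January figures in the case study below), might look like this.

```python
# Weights for the address column from Table 4 (they must sum to 100).
weights = {"accuracy": 30, "completeness": 15, "consistency": 10,
           "timeliness": 5, "validity": 20, "uniqueness": 20}

def dqm(scores, weights):
    """Weighted data quality metric: each score is a fraction in [0, 1]."""
    assert sum(weights.values()) == 100
    return sum(scores[name] * weight for name, weight in weights.items())

# Hypothetical scores: 5% inaccurate, 12.5% inconsistent, 7.8% failing the null/validity check.
january = {"accuracy": 0.95, "completeness": 1.0, "consistency": 0.875,
           "timeliness": 1.0, "validity": 0.922, "uniqueness": 1.0}
print(round(dqm(january, weights), 2))  # 95.69
```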
3. Exponentially Weighted Data Quality Score
The Exponentially Weighted Data Quality Score (EWDQS) is strongly related to time and to the DQM. To build a data quality score over time, we apply an exponentially weighted moving average to the series of DQM values, which may be calculated recursively:

EWDQS(1) = DQM(1)    (3)

EWDQS(t) = \lambda \times DQM(t) + (1 - \lambda) \times EWDQS(t-1), \quad t > 1    (4)

where:
1) \lambda is the smoothing weight with 0 < \lambda < 1; a higher \lambda discounts previous DQM values more quickly.
2) DQM(t) is the data quality metric value at time period t.
3) EWDQS(t) is the exponentially weighted data quality score at time period t.
4. Case Study
Imagine Bank A has a lot of address data, and the DBA wants to check its data quality. After profiling the data with a data quality tool in January, he notices some problems: 5% of the data is incorrect, 7.8% of the data contains null values, and 12.5% of the data is inconsistent. The engineer utilizes our formulas to measure the data quality score.
According to Equations 1 and 3:
DQM (1) = (1-0.05) × 30 + 15 + 5 + (1-0.078) × 20 + (1-0.125) × 10 + 20 = 95.69
EWDQS (1) = 95.69
In February, the engineer checks the data again and finds some data quality problems: 4.5% of the data is incorrect, 7.0% of the data contains null values, and 10.5% of the data is inconsistent.
According to Equations 1 and 4, and if λ =0.75:
DQM (2) = (1-0.045) × 30 + 15 + 5 + (1-0.070) × 20 + (1-0.105) × 10 + 20 = 96.2
EWDQS (2) = 0.75× DQM (2) + (1-0.75) × EWDQS (1)
= 0.75×96.2 +0.25 × 95.69 = 96.07
Therefore, the exponentially weighted data quality score of Bank A shifts from 95.69 in January to 96.07 in February.
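The case-study arithmetic can be reproduced with a short script; this is only a sketch of the recursion in Equations 3 and 4, not any vendor's implementation.

```python
def ewdqs(dqm_series, lam=0.75):
    """Exponentially weighted data quality score (Equations 3 and 4)."""
    score = dqm_series[0]            # EWDQS(1) = DQM(1)
    scores = [score]
    for dqm_t in dqm_series[1:]:
        score = lam * dqm_t + (1 - lam) * score
        scores.append(score)
    return scores

# January and February DQM values from the case study.
print([round(s, 2) for s in ewdqs([95.69, 96.2])])  # [95.69, 96.07]
```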
7 DATA PROFILING TOOL FRAMEWORK FOR BIG
DATA
Few documents describe data profiling tool frameworks because commercial companies consider them a trade secret [19] [20] [21] [22] [23] [24]. However, as big data grows, data governance, data quality control, and data profiling technology become more important [1] [25] [26]. People need data profiling tools to perform the big data analysis necessary to improve data quality, but they are currently challenged by limited performance and robustness. Customers want these tools to profile real-time data, static data, sampled data, and full-volume data, and we should strive to satisfy their requirements. The data profiling framework has six layers: the hardware layer, data layer, parallelism layer, algorithms layer, function layer, and Web UI layer (Fig. 3).
Fig. 3. Data Profiling Architecture (layers: Web UI, functions, algorithms, parallelism, data, and hardware).
Each layer has a different application (Table 5). At the Algorithm Layer, for instance, machine learning algorithms could be employed that automatically analyze the patterns and rules of data as well as data structures in order to enhance data quality and facilitate its governance. These algorithms could run on x86 machines or graphics processing units (GPU), such as NVidia CUDA devices [27]. A small sketch of column profiling on the parallelism layer follows Table 5.
Table 5. Data profiling framework

Layer             | Application
Web-UI Layer      | User interface for the ER model, business rules, KPI dashboard, batch or real-time job maintenance, and input/output data source configuration.
Function Layer    | The data profiling missions of Section III.
Algorithm Layer   | Data mining, machine learning, statistics, or other algorithms.
Parallelism Layer | Apache Spark [28] for static data and full-volume data; Apache Storm [29] for real-time data.
Data Layer        | Business rules, configuration data, and metadata stored in Hadoop or traditional databases.
Hardware Layer    | Integrated CPU and GPU clusters for improving performance, especially for real-time or machine learning tasks.
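The sketch below is one possible way the parallelism and algorithm layers could cooperate: it computes a few set-profiling statistics for a single column with Apache Spark's Python API. The column name and sample data are hypothetical, and a production framework would of course cover far more profiling missions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

# Hypothetical data for the data layer; a Hive table or JDBC source could be used instead.
df = spark.createDataFrame(
    [(1, "72204"), (2, "10001"), (3, None), (4, "72204")], ["id", "zip"])

# Set profiling on the 'zip' column: row count, null count, distinct count, min/max.
profile = df.agg(
    F.count(F.lit(1)).alias("rows"),
    F.sum(F.col("zip").isNull().cast("int")).alias("nulls"),
    F.countDistinct("zip").alias("distinct"),
    F.min("zip").alias("min"),
    F.max("zip").alias("max"))
profile.show()
```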
8 CONCLUSION
As data profiling enriches data quality, data management, and data governance, it is important for customers, data scientists, and DBAs to use data profiling tools. Data is an asset for all users, so data quality should be controlled by procedures, rules, people, and software. In this paper, after reviewing existing relevant works, a new data profiling definition and new classifications of data profiling tasks were presented. The authors discussed several free and commercial data profiling tools. Moreover, the paper introduced a method for building data quality metrics and showed how to calculate data quality scores. Finally, a data profiling tool framework was given.
9 FUTURE WORK
Data profiling only grows more important in this new era of big data. We will continue to read academic papers and technical documents, develop code for data profiling algorithms on big data, and find new ways to extend the functions of data profiling in the future.
REFERENCES
1. Zikopoulos, Paul, and Chris Eaton. Understanding big data: Analytics for enterprise class
hadoop and streaming data. McGraw-Hill Osborne Media, 2011.
2. Buneman, Peter. "Semistructured data." In Proceedings of the sixteenth ACM SIGACT-
SIGMOD-SIGART symposium on Principles of database systems, pp. 117-121. ACM,
1997.
3. Buneman, Peter, Susan Davidson, Mary Fernandez, and Dan Suciu. "Adding structure to
unstructured data." In Database Theory—ICDT'97, pp. 336-350. Springer Berlin Heidel-
berg, 1997.
4. Khatri, Vijay, and Carol V. Brown. "Designing data governance." Communications of the
ACM 53, no. 1 (2010): 148-152.
5. Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. "Data quality assessment." Commu-
nications of the ACM 45, no. 4 (2002): 211-218.
6. Wang, Richard Y. "A product perspective on total data quality management." Communica-
tions of the ACM 41, no. 2 (1998): 58-65.
7. Kumar, Roushan, and Arun Yadav. "Aggregate Profiler -- Data Quality." Accessed Octo-
ber 20, 2015. “http://sourceforge.net/projects/dataquality/”.
8. Talend Company. "Talend Open Studio for Data Quality." Accessed October 20, 2015.
“http://www.talend.com/products/data-quality”.
9. DataCleaner Company. "DataCleaner Manual." Accessed October 20, 2015.
“http://datacleaner.org/resources/docs/4.0.10/pdf/datacleaner-reference.pdf”
10. IBM Company. “InfoSphere Information Server: Information Center.” Accessed October
20, 2015. ”http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_9.1.0/”
11. Informatica Company. “Data Profiling Solutions.” Accessed October 20, 2015. “https://www.informatica.com/data-profiling.html”
12. Oracle Company. “Oracle Enterprise Data Quality.” Accessed October 20, 2015.
“http://www.oracle.com/us/products/middleware/data-integration/enterprise-data-
quality/overview/index.html”
13. SAP Company. “SAP Information Steward.” Accessed October 20, 2015.
“http://scn.sap.com/docs/DOC-8751”
14. SAS Company. “SAS Products: DataFlux Data Management Studio.” Accessed October 20, 2015. “http://support.sas.com/software/products/dfdmstudioserver/”
15. "A Data Governance Solution Tailored for Your Role." Collibra Solution Comments. Ac-
cessed November 12, 2015. https://www.collibra.com/solution/.
16. Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. "Data quality assessment." Commu-
nications of the ACM 45, no. 4 (2002): 211-218.
17. Rahm, Erhard, and Hong Hai Do. "Data cleaning: Problems and current approaches." IEEE
Data Eng. Bull. 23, no. 4 (2000): 3-13.
18. Lee, Yang W., Leo L. Pipino, James D. Funk, and Richard Y. Wang. Journey to data quali-
ty. The MIT Press, 2009.
19. Moody, Daniel L. "Metrics for evaluating the quality of entity relationship models." In
Conceptual Modeling–ER’98, pp. 211-225. Springer Berlin Heidelberg, 1998.
20. Ballou, Donald P., and Giri Kumar Tayi. "Enhancing data quality in data warehouse envi-
ronments." Communications of the ACM 42, no. 1 (1999): 73-78.
21. Calero, Coral, Mario Piattini, Carolina Pascual, and Manuel A. Serrano. "Towards Data
Warehouse Quality Metrics." In DMDW, p. 2. 2001.
22. Loshin, D. "Monitoring Data Quality Performance Using Data Quality Metrics: A White
Paper." Informatica. November (2006).
23. "The Six Primary Dimensions for Data Quality Assessment." The Six Primary Dimensions
for Data Quality Assessment. Accessed November 5, 2015.
http://www.enterprisemanagement360.com/white_paper/six-primary-dimensions-data-
quality-assessment/.
24. "The Ultimate Guide to Data Governance Metrics : Healthcare Edition:40 Ways for Payers
and Providers to Measure Information Quality Success." The Ultimate Guide to Data Gov-
ernance Metrics : Healthcare Edition. 2012. Accessed November 5, 2015.
http://www.ajilitee.com/wp-content/uploads/2013/09/Ultimate-Guide-to-Data-
Governance-Metrics-for-Healthcare-Ajilitee-June-2012.pdf.
25. Zikopoulos, Paul, and Chris Eaton. Understanding big data: Analytics for enterprise class
hadoop and streaming data. McGraw-Hill Osborne Media, 2011.
26. LaValle, Steve, Eric Lesser, Rebecca Shockley, Michael S. Hopkins, and Nina Kruschwitz. "Big data, analytics and the path from insights to value." MIT Sloan Management Review 21 (2013).
27. "CUDA GPUs." NVIDIA Developer. June 4, 2012. Accessed November 12, 2015.
https://developer.nvidia.com/cuda-gpus.
28. "Apache Spark™ - Lightning-Fast Cluster Computing." Apache Spark™ - Lightning-Fast
Cluster Computing. Accessed November 12, 2015. http://spark.apache.org/.
29. "Apache Storm." Apache Storm. Accessed November 12, 2015. http://storm.apache.org/.