Content uploaded by Salahuddin A. Azad
Author content
All content in this area was uploaded by Salahuddin A. Azad on Nov 28, 2019
Content may be subject to copyright.
XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE
Business Data Enrichment: Issues and Challenges
Salahuddin A. Azad
School of Engineering and
Technology
Central Queensland University
Melbourne, Vic 3000, Australia
Email: s.azad@cqu.edu.au
Saleh Wasimi
School of Engineering and
Technology
Central Queensland University
Melbourne, Vic 3000, Australia
Email: s.wasimi@cqu.edu.au
ABM Shawkat Ali
School of Mathematical and
Computing Sciences
Fiji National University
Samabula, Fiji Islands
Email: shawkat.ali@fnu.ac.fj
Abstract—Companies are collecting vast amounts of
data to gain insight from it to make better decisions and
enhance customer experience. If the quality of data
collected is poor, no matter how powerful the data
analytics is, businesses cannot gain the expected benefits
from it. To achieve maximum benefits, data needs to be
cleaned, organized, and refined with additional
information or intelligence. Data enrichment goes
beyond correcting errors and improving accuracy. It
unveils the relationships, clusters and semantic ontologies in
the data to glean new insight about customers, suppliers
and peer businesses. This paper discusses the issues and
challenges of business data enrichment. It provides
insight into processes involved in data enrichment and
methods for selecting data providers. The paper also
explains semantic data enrichment that improves the
usability and value of unstructured data. Finally, the
paper presents the challenges in data enrichment that
need to be addressed.
Keywords—business data enrichment, semantic data
enrichment, crowd-powered data enrichment, data fusion
I. INTRODUCTION
The vast amount of data available today has created
a great opportunity for businesses to make faster and
better business decisions, enhance customer
experience and leverage business assets. Companies
are collecting vast amounts of data and using data
analytics to leverage big data for competitive
advantage. But data can only be beneficial to a
business if it is clean and organized. To achieve
maximum benefit from the data, it needs to be refined
and enhanced using additional information or
intelligence. Data enrichment refers to the process of
enhancing, refining or organizing data to extract
valuable insight from it. Data enrichment is more than
correcting errors and improving data accuracy. The
purpose of data enrichment is to discern relationships,
clusters and semantic ontologies within collections of
data that unveil new insights to make informed
decisions [1]. Businesses collect and utilise a large
volume of data every day to make informed decisions.
The correctness, relevance and quality of data are of
utmost importance to leverage the business value of
the data. Data enrichment can turn existing data into an
insightful lead-generating asset. Enhanced customer
profiling allows businesses to identify and segment
customers according to age, gender, marital status,
income, profession, children and many other
characteristics to target customers with more personal
and tailored messages. It also enables businesses to
identify the anomalies in customer behaviour and
detect fraudulent activities. The benefits of data
enrichment include, but are not limited to, the following:
• Deeper insight into customer behaviour
• Better targeting of customers
• Personalisation of communication
• Prompt identification of new opportunities
• Diagnosis and mitigation of potential risks
• Improved capital equipment maintenance
• Better competitive intelligence
Nowadays people spend a significant amount of
time on social media, creating a rich history of
information about themselves. Businesses can harness this
information to find out more about their customers’
passions, associations and lifestyles, which would
facilitate better segmentation of customers and
personalisation of communication. This would
eventually enable businesses to increase conversion
rates and earn more profit. However, before customer
segmentation is performed, businesses should make
sure they have sufficiently detailed information about
their customers [2].
As the world is changing constantly, static data
cannot be trusted completely since it does not capture
the most recent facts on which dynamic
decisions could be based. Some examples of the
dynamic nature of B2B data, as mentioned in [3], are as
follows.
• People switch employers.
• Employees take on new roles or get promoted.
• The financial outlooks of companies change.
• The management structure changes.
• The corporate goals shift.
• New units or teams are launched, or existing ones
are dissolved or merged.
So, to make the right decision based on the most
up-to-date information, static data should be enriched
with external data. Also, the data enrichment process
should be run at regular intervals, often enough
that recent changes in the facts are not missed.
The rest of the paper is organized as follows.
Section II discusses the processes involved in data
enrichment, Section III explains the ways to choose the
right data provider for internal data augmentation,
Section IV sheds light on semantic data enrichment, which
improves the usability and value of unstructured data,
and Section V presents some challenges in data
enrichment. Finally, Section VI concludes the paper.
II. DATA ENRICHMENT PROCESS
Poor-quality data wastes a business’s valuable time
and money. The accuracy of the existing data must
be verified before any downstream data enhancement,
profiling and scoring takes place, to avoid costs that
serve little purpose. Data enrichment is about
supplying the data with semantics and introducing
further structure, which involves tasks such
as fixing source information, rectifying errors or
adding missing details. One very common example of
data enrichment is address correction. When a
customer enters address details on an e-commerce
site, the address is transformed into a standard format
that includes street, city, state and post code. The
removal of erroneous duplicates and redundancies and
the detection of outliers are part of preprocessing the
data before the data augmentation process starts. It is very
important to identify and remove unnecessary
information, since enriching it would simply waste time.
Once the anomalies in the existing data are detected and
corrected, the business should look for external
data sources from which additional data (e.g. life
events, interests, financial data and automotive data)
can be collected to augment the existing data. Data
enrichment can use different sources of data such as
geographic, behavioral, demographic, psychographic
and census data [4]. The particular enrichment strategy
adopted depends on the type of insight the business is
looking for and on its marketing strategy. At the final stage,
predictive analytics should be used to produce accurate
and complete customer profiles so that communication
with customers can be personalised. According to
Experian Data Quality [5], a robust data enrichment
process should have three capabilities. Firstly, it should
enhance the accuracy of internal data so that the
business can get the most benefit from its marketing
campaigns. Secondly, it should provide additional insight
by augmenting internal data with external data sources
and allow the business to build comprehensive profiles
of its customers. Thirdly, it should enable the business to
make better decisions and drive more profit through
predictive analytics. If a business depends on data to
develop business strategies and improve processes,
data enrichment should be performed
frequently.
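The preprocessing tasks described above (address standardisation, duplicate removal and outlier detection) can be illustrated with a minimal Python sketch. The function names, the assumed input layout "street, city, state postcode" and the z-score rule for outliers are illustrative assumptions and are not taken from the cited sources.

```python
import statistics


def standardise_address(raw):
    """Split a raw address into street, city, state and post code.

    Assumes the hypothetical input layout 'street, city, state postcode'.
    """
    street, city, state_post = [part.strip() for part in raw.split(",")]
    state, post_code = state_post.split()
    return {
        "street": street.title(),
        "city": city.title(),
        "state": state.upper(),
        "post_code": post_code,
    }


def remove_duplicates(records, key_fields):
    """Keep the first record for each case-insensitive key combination."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(str(record[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def flag_outliers(values, threshold=3.0):
    """Return the indices of values whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]
```

For small samples a lower threshold (for example 2.0) is usually needed, because a single extreme value also inflates the standard deviation against which it is measured.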
Pereira et al. [2] described the steps involved in a
successful data enrichment process, which are as
follows.
i. Data Fusion: This is the process of putting
together data from multiple sources that represent
the same entity into a representation that is
consistent, accurate and useful. The process of
combining data sets from different databases is
typically accomplished through record linkage.
Record linkage involves correcting encodings and
matching differently coded variables (such as
addresses) and may require additional lookup
databases. Another method, called
statistical matching, finds the matching
partner of an entity in dataset A from dataset B
using statistical inference that leverages the data
overlap between the two sets. This matching is
possible when there is significant overlap
between the datasets.
ii. Data Entity Recognition: This is the process of
tagging sequences of words in a text. In
particular, it finds and recognises names of
people, companies, organizations, cities and
other predefined types of entities.
iii. Data Disambiguation: Disambiguation is the
process of identifying the correct entity of data
given the non-uniform variations and ambiguity
in entity names.
iv. Data Segmentation: Segmentation
is the process of clustering data according to a
set of predefined attributes.
v. Data Imputation: This process estimates the
values for missing or conflicting data items.
vi. Data Categorization: This is the process of
labeling data into different categories based on
the topic, sentiment, event or other
characteristics.
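Two of the steps listed above, data fusion through record linkage and data imputation, can be sketched as follows. This is a minimal illustration under stated assumptions: the string similarity from Python's difflib stands in for a real record linkage method, the most frequent observed value stands in for a real imputation model, and the function names and the 0.85 threshold are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text, drop punctuation and collapse whitespace."""
    return " ".join(text.lower().replace(".", "").replace(",", " ").split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalised strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link_records(left, right, field, threshold=0.85):
    """Link each left record to its most similar right record, if any."""
    links = []
    for i, l_rec in enumerate(left):
        best_j, best_score = None, threshold
        for j, r_rec in enumerate(right):
            score = similarity(l_rec[field], r_rec[field])
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            links.append((i, best_j, round(best_score, 2)))
    return links


def impute_mode(records, field):
    """Fill missing (None) values with the most frequent observed value."""
    observed = [r[field] for r in records if r[field] is not None]
    if not observed:
        return records
    mode = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{field: mode}) if r[field] is None else r
            for r in records]
```

With the default threshold, "Acme Pty. Ltd." links to "ACME Pty Ltd", whereas "Globex Corporation" and "globex corp" stay unlinked, which shows why the threshold has to be tuned to the data at hand.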
Data enrichment can be a batch or a continuous
process [6]. The batch process involves manual work. The
problem with this approach is the need to feed data
back to the company database. The continuous process
is generally an automated process which may be
standalone or integrated into other processes [7]. As
mentioned by King [6], automation technologies
improve the efficacy of the enrichment process in the
following ways:
• Pre-cleaning the data prior to enrichment
• Pre- and post-processing of the new data to
make it compliant with the company data
standards
• Consolidating new data into primary data fields
• Incorporating new data into the CRM, marketing
automation, support, or other databases
• Aligning data enrichment with processes such as
list loading and lead routing
Lambert [4] suggested that the data substituted by
the enrichment process should be preserved together
with the original source data so that if the enriched
data does not meet expectations, the business can go
back to the original source data. The data analyst
should know the source of the data, how
it was loaded, how it was changed or refined and what
happened along the way.
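One way to keep the source value together with the enriched value, along with a trail of what happened along the way, is sketched below. The class name EnrichedField and its fields are hypothetical and serve only to illustrate the principle; they are not a structure proposed in [4].

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EnrichedField:
    """A field value that remembers its source value and every change."""
    original: object
    current: object
    history: list = field(default_factory=list)

    @classmethod
    def from_source(cls, value):
        # At load time the current value is simply the source value.
        return cls(original=value, current=value)

    def enrich(self, new_value, source, method):
        # Record what changed, where the new value came from and how.
        self.history.append({
            "previous": self.current,
            "new": new_value,
            "source": source,
            "method": method,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.current = new_value

    def revert(self):
        # Fall back to the source value; the history is kept for auditing.
        self.enrich(self.original, source="original", method="revert")
```

Reverting is recorded as one more history entry rather than erasing the trail, so the analyst can still see that an enrichment was tried and rejected.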
III. SELECTING DATA PROVIDERS
For successful enrichment of internal data, finding
the right external data source from which to extract
relevant, high-quality attributes is crucial. Since data
providers rarely meet all the data requirements of a
company, it is essential for a company to analyse its
source data to understand its enrichment needs
in order to select the right providers.
Gomadam et al. [8] proposed a Web-based data
source selection algorithm named Data Enrichment
Framework (DEF) that involves the following steps
towards data source selection.
i. Attribute Importance Assessment: The
algorithm assigns an importance to each attribute of a
data object. The more unique values an attribute
has over all instances of the data object, the
more important the attribute is.
ii. Data Source Selection: The algorithm selects
the data sources to enrich the attributes of the
data object which have missing values. The
algorithm continues until all attribute values are
filled or no more sources are left. There are two
main criteria for selection of the sources: (1)
how well the known values of the objects with
missing values match input values required by
the source, and (2) how many high-importance
attributes the source claims to provide.
iii. Data Source Utility Adaptation: The algorithm
expects that if the data source is provided with
high-confidence inputs, the source should be
able to deliver values for all the attributes
requested from it. The supplied attributes should not be
generic and should have low ambiguity. If the
expectations are not met, then the source is
penalized by assigning a low utility value to it.
The utility value of the data source influences
the priority of the data source in the data source
selection stage. The confidence level of the
output attribute returned by the data source is
devalued if the data source provides multiple
values of the output attribute and there is
ambiguity. Conversely, if the output attribute
value is backed up by the values returned by the
previous data sources, then the confidence is further
improved.
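The three steps above can be sketched in a few lines of Python. This is a simplified illustration and not the scoring used in [8]: the ratio of distinct values as the importance measure, the multiplicative source score, the halving penalty and the dictionary layout of a source (name, inputs, outputs, utility) are all assumptions made for the example.

```python
def attribute_importance(records):
    """Importance of an attribute = distinct values / non-missing values."""
    importance = {}
    attributes = {attr for record in records for attr in record}
    for attr in attributes:
        values = [r[attr] for r in records if r.get(attr) is not None]
        importance[attr] = len(set(values)) / len(values) if values else 0.0
    return importance


def score_source(source, record, importance):
    """Combine input match, importance of promised outputs and utility."""
    required = source["inputs"]
    known = [a for a in required if record.get(a) is not None]
    input_match = len(known) / len(required) if required else 0.0
    missing = [a for a, v in record.items() if v is None]
    coverage = sum(importance.get(a, 0.0)
                   for a in source["outputs"] if a in missing)
    return source["utility"] * input_match * coverage


def select_source(sources, record, importance):
    """Return the name of the best-scoring source, or None if none helps."""
    scored = [(score_source(s, record, importance), s["name"])
              for s in sources]
    best_score, best_name = max(scored)
    return best_name if best_score > 0 else None


def adapt_utility(source, delivered, promised, penalty=0.5):
    """Penalise a source that delivered fewer attributes than promised."""
    if promised and delivered < promised:
        source["utility"] *= penalty
    return source["utility"]
```

A source whose required inputs are unknown for a record scores zero in this sketch, so it is never selected for that record regardless of what it claims to provide.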
Data providers have rich databases covering vast
numbers of consumers. They use various methods, channels and
tools to collect data. Some common means of
collecting data, as mentioned in [9], are:
i. Web crawling, especially of social media and forum
sites
ii. Subscriber databases from media publishers such
as magazines and newspapers
iii. Crawling of government open data sites
iv. Crowdsourcing through manual entry, address
book scraping, email scraping or business card
scanning
v. Manual research.
Businesses can purchase an entire dataset from the
data provider or pay for each query. Alternatively,
businesses can use data enrichment services (such as
Yelp). In the latter case, the business will not have
direct access to the data [10]. Instead of using the services of
data providers, the enrichment process can retrieve the
missing values from the Web through
imputation queries. The values of the complete data
fields can be used as keywords in the imputation
queries to get the missing values [11]. When multiple
answers are received from different sources, the most
probable answer needs to be chosen using some sort of
voting method.
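The idea of forming an imputation query from the known field values and then voting over the returned answers can be sketched as follows. The function names are hypothetical, appending the name of the missing field to the query is an assumption of this example, and a simple majority vote stands in for whichever voting method is actually used; no particular search API is implied.

```python
from collections import Counter


def build_imputation_query(record, missing_field):
    """Use the known field values, plus the missing field name, as keywords."""
    known = [str(v) for k, v in record.items()
             if k != missing_field and v is not None]
    return " ".join(known + [missing_field])


def vote(answers):
    """Return the most frequent normalised answer and its share of votes.

    Ties are broken in favour of the answer that was seen first.
    """
    cleaned = [" ".join(a.lower().split()) for a in answers if a and a.strip()]
    if not cleaned:
        return None, 0.0
    value, count = Counter(cleaned).most_common(1)[0]
    return value, count / len(cleaned)
```

The share of votes returned alongside the winning answer can serve as a rough confidence level, so that weakly supported answers can be left unfilled rather than imputed.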
IV. SEMANTIC DATA ENRICHMENT
A significant amount of the data that is generated,
collected or owned by businesses is unstructured.
Examples include the social media updates, reviews, videos and
images posted by customers, and business documents
created by companies, such as reports, manuals and
proposals. The value of this unstructured data can be
improved through semantic enrichment. This is
accomplished by adding structured semantic mark-up
and metadata, which enable businesses to find
content when needed, reuse it, link it
to other relevant content and route it to
the right process. An example of structured metadata
is the classification and tagging of images, videos or
documents. According to IXXUS [12], semantic
enrichment offers the following benefits.
i. Content can be discovered more easily and
is more relevant to what is being sought.
ii. Search can surface relevant content,
allowing the business to cross-sell products.
iii. Content can be repackaged around a specific
theme, leading to a new stream of revenue.
Although semantic enrichment could be
accomplished by automation (e.g., data mining), there
are some tasks that are either beyond the capabilities
of machines or require higher-order cognitive
reasoning or human judgment. Semantic data
enrichment typically uses a data model in the form of a
network, whose framework consists of objects and the
relationships between them. Objects, relationships,
data connections and dependencies lead to
higher-order structures. Data enrichment is then about
providing semantic stability, which involves solving a
number of problems, including the search for
semantically unstable objects, their classification by
type of instability and the identification of ways to
overcome the instability. However, the entire operation
might turn out to be very complex and time consuming
and produce unsatisfactory results. Keeping humans in
the computing loop can leverage their higher cognitive
ability to perform data enrichment tasks with better
reliability. While it is possible to hire dedicated
workers to enrich unstructured data, it might not be
financially viable for a business. This is where
crowd-powered semantic data enrichment comes into
play. A crowd-powered semantic data enrichment
system brings an external, web-based casual workforce
into the data enrichment task by posting data to a web
platform via an API or a web service [13].
Crowd-powered data enrichment jobs can be paid or unpaid.
For paid data enrichment jobs, the worker is paid a
small amount (a few cents) for each job, while for
unpaid enrichment jobs, an incentive is provided via
alternative means such as incorporating the task into
an interactive mobile game. The crowd-powered semantic
data enrichment process can be active or passive [14].
In active semantic data enrichment, the workers are
given very precise tasks which are easy for humans to
perform but difficult for machines to execute. Passive
data enrichment is about utilizing the online
activities of users for data enrichment. This involves
analyzing vast amounts of user-generated content to
glean the behaviour, passions and associations of social
media users.
V. CHALLENGES OF DATA ENRICHMENT
While data enrichment offers immense benefits to
businesses through better segmentation and targeting
of customers and improved decision making, it is not
without challenges, which need to be addressed to
make it a robust process. Finding the right sources of
data and judging the quality and relevance of external
data sources are currently done manually. Automating
these tasks is quite challenging and could be a topic for
future research. The Internet is one of the main sources for
data providers. A major part of the data available on the
Internet is hidden inside the deep web. Pulling data
from the deep web can be overwhelming as the
contents of the deep web cannot be extracted using
classical search engines; however, this
data is of very good quality and superior value. The
information on the deep web can only be accessed
through a restricted interface such as a keyword-search
API [10]. Social media is another source of rich data,
but assessing the reliability of the content produced
by social media users is quite difficult due to the
uncontrolled nature of social media. The challenges of
crowd-powered semantic data enrichment are judging
the quality of the answers provided by the human workers
and distinguishing spammers from genuine workers.
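A common baseline for the last challenge is to derive a consensus answer per task by majority vote and then flag workers who rarely agree with that consensus. The sketch below illustrates this baseline only; the tuple layout (task, worker, answer), the function names and the 0.5 agreement cut-off are assumptions and are not taken from [13] or [14].

```python
from collections import Counter, defaultdict


def aggregate(labels):
    """Majority answer per task from (task, worker, answer) tuples."""
    by_task = defaultdict(list)
    for task, _worker, answer in labels:
        by_task[task].append(answer)
    return {task: Counter(answers).most_common(1)[0][0]
            for task, answers in by_task.items()}


def worker_agreement(labels, consensus):
    """Fraction of each worker's answers that match the consensus."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, worker, answer in labels:
        totals[worker] += 1
        hits[worker] += int(answer == consensus[task])
    return {worker: hits[worker] / totals[worker] for worker in totals}


def suspected_spammers(labels, min_agreement=0.5):
    """Workers whose agreement with the consensus is below the cut-off."""
    rates = worker_agreement(labels, aggregate(labels))
    return sorted(w for w, rate in rates.items() if rate < min_agreement)
```

Majority voting assumes that most workers on a task are genuine; where that does not hold, tasks with known answers can be mixed in to measure worker accuracy directly.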
VI. CONCLUSION
This paper provides a review of the issues and
challenges of business data enrichment. The paper
provides insight into the processes involved in data
enrichment and the methods for selecting relevant data
providers that are needed to augment internal data. An
overview is provided of semantic data enrichment,
which makes unstructured data more usable and
valuable to businesses. The paper also sheds light on
crowd-powered semantic data enrichment, which keeps
humans in the computing loop to use the higher
cognitive ability that data mining techniques
lack. Finally, the paper presents the challenges in
data enrichment, which need to be overcome to make
data enrichment more robust so as to enable businesses
to make the most of it.
REFERENCES
[1] M. Toussant (2018, Mar. 22). The Importance of
Human-Curated Data Enrichment of Big Data Analysis
[Blog]. Available:
https://www.cas.org/blog/importance-human-curated-
data-enrichment-big-data-analysis
[2] A. C. Pereira, A. A. Veloso, G. L. Pappa, and W.
Meira, “Recommended Best Practices for Data
Enrichment Scenarios,” W3C Member Submission,
2016. Available:
http://www.inweb.org.br/w3c/dataenrichment/
[3] K. Searight (2017, Sep. 5). What Data Enrichment
means for B2B Marketers: Better Leads, Content &
Sales [Blog]. Available:
https://www.convertrmedia.com/data-enrichment-in-
b2b-marketing/
[4] B. Lambert (2014, Jan. 30). Guiding Principles for Data
Enrichment [Blog]. Available:
https://www.captechconsulting.com/blogs/Guiding-
Principles-for-Data-Enrichmnent
[5] Experian Data Quality, “What every Organization
should Know when Selecting a Data Enrichment
Vendor,” Data Enrichment Buyer’s Guide, 2016.
[6] E. King (2017, Dec. 7). Data Enrichment Part VIII:
Implementation Tips [Blog]. Available:
https://www.openprisetech.com/data-enrichment-part-
viii-implementation-tips/
[7] OPENPRISE. The complete data enrichment survival
guide for Marketing and Sales [Online]. Available:
https://launchpoint.marketo.com/assets/1598-
openprise/10543-analyze-clean-enrich-and-unify-your-
data/The-Complete-Enrichment-Survival-Guide-for-
Marketing-and-Sales.pdf
[8] K. Gomadam, R. J. Yeh, and K. Verma, “Data
Enrichment Using Data Sources on the Web,”
Accenture Technology Labs, AAAI Technical Report
SS-12-04, San Jose, CA, 2012, pp. 34-38.
[9] E. King (2017, Nov. 3). Data Enrichment Part III:
Determining your target market [Blog]. Available:
https://www.openprisetech.com/data-enrichment-blog-
series-part-3/
[10] P. Wang, W. Hey, R. Shea, J. Wang, and E. Wu,
“Deeper: A Data Enrichment System Powered by Deep
Web,” in Proc. of the 2018 International Conference on
Management of Data, pp. 1801-1804. doi:
10.1145/3183713.3193569
[11] Z. Li, S. Shang, Q. Xie, and Z. Zhang, “Cost Reduction
for Web-based Data Imputation,” in Proc. International
Conference on Database Systems for Advanced
Applications (DASFAA 2014), pp. 438-452.
[12] IXXUS. Semantic Enrichment [Blog]. Available:
https://www.ixxus.com/solutions/semanticenrichment/
[13] A. Kass and M. Mehta, “Crowd Powered Data
Enrichment: Combining Crowdsourcing and
Automation to Drive Smarter Business Processes from
Unstructured Data,” Accenture, 2016.
[14] P. Milano, “Humans in the Loop: Optimization of
Active and Passive Crowdsourcing,” Doctoral
Dissertation, Polytechnic University of Milan, Italy,
2015.