
Business Data Enrichment: Issues and Challenges

Authors:
Salahuddin A. Azad
School of Engineering and
Technology
Central Queensland University
Melbourne VIC 3000, Australia
Email: s.azad@cqu.edu.au
Saleh Wasimi
School of Engineering and
Technology
Central Queensland University
Melbourne VIC 3000, Australia
Email: s.wasimi@cqu.edu.au
ABM Shawkat Ali
School of Mathematical and
Computing Sciences
Fiji National University
Samabula, Fiji Islands
Email: shawkat.ali@fnu.ac.fj
Abstract: Companies are collecting vast amounts of data to gain insight from it, make better decisions, and enhance customer experience. If the quality of the collected data is poor, businesses cannot gain the expected benefits from it, no matter how powerful the data analytics is. To achieve maximum benefit, data needs to be cleaned, organized, and refined with additional information or intelligence. Data enrichment goes beyond correcting errors and improving accuracy: it unveils the relationships, clusters, and semantic ontologies in the data to glean new insight about customers, suppliers and peer businesses. This paper discusses the issues and challenges of business data enrichment. It provides insight into the processes involved in data enrichment and the methods for selecting data providers. The paper also explains semantic data enrichment, which improves the usability and value of unstructured data. Finally, the paper presents the challenges in data enrichment that need to be addressed.
Keywords: business data enrichment, semantic data enrichment, crowd-powered data enrichment, data fusion
I. INTRODUCTION
The vast amount of data available today has created
a great opportunity for businesses to make faster and
better business decisions, enhance customer
experience and leverage business assets. Companies
are collecting vast amounts of data and using data
analytics to leverage big data for competitive
advantage. But data can only be beneficial to a
business if it is clean and organized. To achieve
maximum benefit from the data, it needs to be refined
and enhanced using additional information or
intelligence. Data enrichment refers to the process of
enhancing, refining or organizing data to extract
valuable insight from it. Data enrichment is more than correcting errors and improving data accuracy: its purpose is to discern relationships, clusters and semantic ontologies within collections of data that unveil new insights for making informed decisions [1]. Businesses collect and utilise a large volume of data every day to make informed decisions.
The correctness, relevance and quality of data are of
utmost importance to leverage the business value of
the data. Data enrichment can turn existing data into an
insightful lead generating asset. Enhanced customer
profiling allows businesses to identify and segment
customers according to age, gender, marital status,
income, profession, children and many other
characteristics to target customers with more personal
and tailored messages. It also enables businesses to
identify the anomalies in customer behaviour and
detect fraudulent activities. The benefits of data enrichment include, but are not limited to, the following:
• Deeper insight into customer behaviour
• Better targeting of customers
• Personalisation of communication
• Prompt identification of new opportunities
• Diagnosis and mitigation of potential risks
• Improved capital equipment maintenance
• Better competitive intelligence
Nowadays, people spend a significant amount of time on social media, creating a rich history of information about themselves. Businesses can harness this information to learn more about their customers' passions, associations and lifestyles, which facilitates better segmentation of customers and personalisation of communication. This would eventually enable businesses to increase conversion rates and earn more profit. However, before customer segmentation is performed, businesses should make sure they have sufficiently detailed information about their customers [2].
As the world changes constantly, static data cannot be trusted completely, since it does not capture the most recent facts on which dynamic decisions could be based. Some examples of the dynamic nature of B2B data, as mentioned in [3], are as follows.
• People switch employers.
• Employees take on new roles or get promoted.
• The financial outlooks of companies change.
• The management structure changes.
• The corporate goals shift.
• New units or teams are launched, or existing ones are dissolved or merged.
Therefore, to make the right decisions based on the most up-to-date information, static data should be enriched with external data. Moreover, the data enrichment process should be run at regular intervals, often enough that recent changes in the facts are not missed.
The rest of the paper is organized as follows. Section 2 discusses the processes involved in data enrichment, Section 3 explains how to choose the right data provider for internal data augmentation, Section 4 sheds light on semantic data enrichment, which improves the usability and value of unstructured data, and Section 5 discusses some challenges in data enrichment. Finally, Section 6 concludes the paper.
II. DATA ENRICHMENT PROCESS
Poor quality data wastes a business's valuable time and money. The accuracy of the existing data must be verified before any downstream data enhancement, profiling and scoring takes place, to avoid costs that serve little purpose. Data enrichment is about supplying the data with semantics and introducing further structure, which involves tasks such as fixing source information, rectifying errors or adding missing details. One very common example of data enrichment is address correction: when a customer enters address details on an e-commerce site, the address is transformed into a standard format comprising street, city, state, and post code.
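To make this concrete, the following Python sketch standardises a free-form address string into street, city, state and post code fields. The regular expression and field names are illustrative assumptions for simple Australian-style addresses; a production system would normally validate against a postal-authority reference database rather than a single pattern.

```python
import re

# Illustrative pattern for a simple Australian-style address such as
# "120 Spencer St, Melbourne VIC 3000". Real systems validate against
# a postal reference database rather than one regular expression.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street>.+?),\s*"
    r"(?P<city>[A-Za-z .'-]+)\s+"
    r"(?P<state>[A-Z]{2,3})\s+"
    r"(?P<postcode>\d{4})\s*$"
)

def standardise_address(raw: str) -> dict | None:
    """Return street/city/state/postcode fields, or None if unparseable."""
    match = ADDRESS_PATTERN.match(raw)
    if not match:
        return None
    fields = {k: v.strip() for k, v in match.groupdict().items()}
    fields["city"] = fields["city"].title()
    fields["state"] = fields["state"].upper()
    return fields

if __name__ == "__main__":
    print(standardise_address("120 spencer st, melbourne VIC 3000"))
    # {'street': '120 spencer st', 'city': 'Melbourne', 'state': 'VIC', 'postcode': '3000'}
```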
The removal of erroneous duplicates and redundancies and the detection of outliers are parts of preprocessing the data before the augmentation process starts. It is very important to identify and remove unnecessary information, since enriching it would simply waste time. Once the anomalies in the existing data are detected and corrected, the business should look for external data sources from which additional data (e.g. life events, interests, financial data, and automotive data) can be collected to augment the existing data. Data enrichment can use different sources of data such as geographic, behavioral, demographic, psychographic, and census data [4]. The particular enrichment strategy adopted depends on the type of insight the business is looking for and on its marketing strategy. At the final stage, predictive analytics should be used to produce accurate and complete customer profiles so that communication with customers can be personalised.
According to Experian Data Quality [5], a robust data enrichment process should have three capabilities. Firstly, it should enhance the accuracy of internal data so that the business can get the most benefit from its marketing campaigns. Secondly, it should provide additional insight by augmenting internal data with external data sources, allowing the business to build comprehensive profiles of its customers. Thirdly, it should enable the business to make better decisions and drive more profit through predictive analytics. If a business depends on data to develop business strategies and to improve processes, data enrichment should be done frequently.
Pereira et al. [2] described the steps involved in a
successful data enrichment process, which are as
follows.
i. Data Fusion: This is the process of putting together data from multiple sources that represent the same entity into a consistent, accurate and useful representation. The combination of data sets from different databases is typically accomplished through record linkage. Record linkage involves encoding correction and the matching of differently coded variables (such as addresses), and may require additional lookup databases. Another method, statistical matching, finds the matching partner of an entity in dataset A from dataset B using statistical inference that leverages the overlap between the two sets; this matching is possible only when there is significant overlap between the datasets. (A minimal record-linkage sketch follows this list.)
ii. Data Entity Recognition: This is the process of tagging a series of words in a text. In particular, it finds and recognizes the names of people, companies, organizations, cities and other predefined types of entities.
iii. Data Disambiguation: This is the process of identifying the correct entity despite non-uniform variations and ambiguity in entity names.
iv. Data Segmentation: This is the process of clustering data according to a set of predefined attributes.
v. Data Imputation: This process estimates the
values for missing or conflicting data items.
vi. Data Categorization: This is the process of labeling data into different categories based on topic, sentiment, event or other characteristics.
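As a minimal illustration of the record-linkage idea in step (i), the sketch below links records from an internal dataset and an external provider on a simple normalised key (name plus postcode) and merges the matched attributes. The datasets, key choice and field names are assumptions for illustration; real record linkage also handles fuzzy matching and encoding correction.

```python
from typing import Iterable

def link_key(record: dict) -> tuple:
    """Blocking key for linkage: normalised name plus postcode (illustrative choice)."""
    return (record["name"].strip().lower(), record["postcode"])

def fuse(internal: Iterable[dict], external: Iterable[dict]) -> list:
    """Merge external attributes into internal records that share a link key.

    Internal values win on conflict so that source data is never overwritten;
    a production system would also keep provenance for every fused field.
    """
    external_index = {link_key(r): r for r in external}
    fused = []
    for record in internal:
        match = external_index.get(link_key(record))
        fused.append({**match, **record} if match else dict(record))
    return fused

# Toy example: a CRM record enriched with an income band from a provider.
crm = [{"name": "Ann Lee", "postcode": "3000", "email": "ann@example.com"}]
provider = [{"name": "ann lee", "postcode": "3000", "income_band": "80-100k"}]
print(fuse(crm, provider))
# [{'name': 'Ann Lee', 'postcode': '3000', 'income_band': '80-100k', 'email': 'ann@example.com'}]
```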
Data enrichment can be a batch or a continuous process [6]. A batch process involves manual work; the problem with this approach is the need to feed data back into the company database. The continuous process is generally automated and may be standalone or integrated into other processes [7]. As mentioned by King [6], automation technologies improve the efficacy of the enrichment process in the following ways (a minimal pipeline sketch follows the list):
• Pre-cleaning the data prior to enrichment
• Pre- and post-processing the new data to make it compliant with the company's data standards
• Consolidating new data into primary data fields
• Incorporating new data into the CRM, marketing automation, support, or other databases
• Aligning data enrichment with processes such as list loading and lead routing
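One way to picture such an automated, continuous flow is sketched below: hypothetical pre-clean, enrich and consolidate steps are chained into a single pipeline. The step functions and field names are placeholders, not the API of any particular product.

```python
from typing import Callable

Record = dict
Step = Callable[[Record], Record]

def pipeline(*steps: Step) -> Step:
    """Compose enrichment steps into one continuous, automated pass."""
    def run(record: Record) -> Record:
        for step in steps:
            record = step(record)
        return record
    return run

# Placeholder steps standing in for pre-cleaning, enrichment and consolidation.
def pre_clean(r: Record) -> Record:
    return {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}

def enrich_from_provider(r: Record) -> Record:
    return {**r, "industry": r.get("industry") or "unknown"}

def push_to_crm(r: Record) -> Record:
    # In practice this step would call the CRM or marketing-automation API.
    return r

enrich = pipeline(pre_clean, enrich_from_provider, push_to_crm)
print(enrich({"company": "  Acme Pty Ltd ", "industry": None}))
# {'company': 'Acme Pty Ltd', 'industry': 'unknown'}
```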
Lambert [4] suggested that the data substituted by the enrichment process should be preserved together with the original source data, so that if the enriched data does not meet expectations, the business can go back to the original source data. The data analyst should know the source of the data, how it was loaded, how it was changed or refined, and what happened to it along the way.
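A simple way to follow this guideline, sketched below under assumed field names, is to store each enriched value alongside the original value and its source so that the enrichment can be audited or rolled back.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EnrichedField:
    """Keeps the enriched value next to the original so it can be rolled back."""
    name: str
    original: object
    enriched: object
    source: str  # where the new value came from
    loaded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def rollback(self):
        """Return the value as it was before enrichment."""
        return self.original

income = EnrichedField("income_band", original=None,
                       enriched="80-100k", source="provider_x")
print(income.enriched, "| fallback:", income.rollback())
```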
III. SELECTING DATA PROVIDERS
For successful enrichment of internal data, finding the right external data source from which to extract relevant, high-quality attributes is crucial. Since a single data provider rarely meets all the data requirements of a company, it is essential for the company to analyse its source data, understand its enrichment needs, and select the right providers accordingly. Gomadam et al. [8] proposed a Web-based data source selection approach named the Data Enrichment Framework (DEF), which involves the following steps towards data source selection.
i. Attribute Importance Assessment: The algorithm assigns an importance score to each attribute of a data object. The more unique values an attribute has over all instances of the data object, the higher its importance. (A rough sketch of steps i and ii follows this list.)
ii. Data Source Selection: The algorithm selects data sources to enrich the attributes of the data object that have missing values, and continues until all attribute values are filled or no more sources are left. There are two main criteria for selecting a source: (1) how well the known values of the objects with missing values match the input values required by the source, and (2) how many high-importance attributes the source claims to provide.
iii. Data Source Utility Adaptation: The algorithm expects that, if a data source is provided with high-confidence inputs, the source should be able to deliver values for all the attributes it claims to provide. The supplied attribute values should not be generic and should have low ambiguity. If these expectations are not met, the source is penalized by assigning it a low utility value. The utility value of a data source influences its priority in the data source selection stage. The confidence level of an output attribute returned by a data source is devalued if the source provides multiple values for the output attribute or if there is ambiguity. Conversely, if the output attribute value is backed up by the values returned by previous data sources, its confidence is further improved.
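The sketch below gives a rough, simplified rendering of steps i and ii: attribute importance is scored by the number of distinct values an attribute takes, and sources are then selected greedily until the missing attributes are covered. It is an illustration of the idea only, not the actual DEF implementation; the source descriptions and field names are assumptions.

```python
def attribute_importance(records: list) -> dict:
    """Step i (sketch): score each attribute by how many distinct values it takes."""
    values_seen = {}
    for record in records:
        for attr, value in record.items():
            values_seen.setdefault(attr, set()).add(value)
    return {attr: len(vals) for attr, vals in values_seen.items()}

def select_sources(missing_attrs: set, sources: list, importance: dict) -> list:
    """Step ii (sketch): greedily pick sources that cover the most important gaps."""
    remaining = set(missing_attrs)
    candidates = list(sources)
    chosen = []
    while remaining and candidates:
        best = max(candidates,
                   key=lambda s: sum(importance.get(a, 0)
                                     for a in s["provides"] if a in remaining))
        covered = set(best["provides"]) & remaining
        if not covered:
            break  # no remaining source fills any missing attribute
        chosen.append(best["name"])
        remaining -= covered
        candidates.remove(best)
    return chosen

# Toy example with two internal records and two hypothetical external sources.
importance = attribute_importance([
    {"name": "Acme", "industry": "retail", "country": "AU"},
    {"name": "Beta", "industry": "mining", "country": "AU"},
])
sources = [{"name": "registry_api", "provides": ["industry", "employees"]},
           {"name": "web_directory", "provides": ["country"]}]
print(select_sources({"industry", "employees"}, sources, importance))  # ['registry_api']
```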
Data providers maintain rich databases covering vast numbers of consumers. They use various methods, channels and tools to collect data. Some common means of collecting data, as mentioned in [9], are:
i. Web crawling, especially of social media and forum sites
ii. Subscriber databases from media publishers such as magazines and newspapers
iii. Crawling of government open data sites
iv. Crowdsourcing through manual entry, address book scraping, email scraping or business card scanning
v. Manual research.
Businesses can purchase an entire dataset from a data provider or pay per query. Alternatively, businesses can use data enrichment services (such as Yelp); in the latter case, the business does not have direct access to the data [10]. Instead of using the services of data providers, the enrichment process can retrieve missing data values from the Web through imputation queries: the values of the complete data fields are used as keywords in the imputation queries to obtain the missing values [11]. When multiple answers are received from different sources, the most probable answer needs to be chosen using some form of voting.
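A minimal voting step of this kind might look like the following sketch, which simply takes the most frequent (normalised) answer; weighting answers by the past reliability of each source would be a natural refinement.

```python
from collections import Counter

def impute_by_vote(answers: list) -> str | None:
    """Pick the most frequent answer returned by different imputation queries.

    A simple majority vote; more elaborate schemes weight each source by its
    past reliability, as discussed above.
    """
    if not answers:
        return None
    normalised = [a.strip().lower() for a in answers]
    value, _count = Counter(normalised).most_common(1)[0]
    return value

# Answers to the same missing field gathered from three web sources.
print(impute_by_vote(["Melbourne", "melbourne ", "Sydney"]))  # melbourne
```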
IV. SEMANTIC DATA ENRICHMENT
A significant amount of the data that is generated, collected or owned by businesses is unstructured. Examples include the social media updates, reviews, videos and images posted by customers, and the business documents created by companies, such as reports, manuals and proposals. The value of this unstructured data can be improved through semantic enrichment. This is accomplished by adding structured semantic mark-up and meta-data, which enables businesses to find content when needed, reuse it, link it to other relevant content and route it to the right process. An example of structured meta-data is the classification and tagging of images, videos or documents (a minimal tagging sketch follows the list below). According to IXXUS [12], semantic enrichment offers the following benefits.
i. Content can be discovered more easily and is more relevant to what is being sought.
ii. Search can produce related content, allowing the business to cross-sell products.
iii. It facilitates repackaging of content around a specific theme, leading to a new stream of revenue.
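As a toy illustration of such tagging, the sketch below attaches semantic tags to an unstructured document using a small controlled vocabulary. The vocabulary and tag names are assumptions; in practice the tags would come from an ontology or taxonomy maintained by the business, and richer text analysis would replace the keyword lookup.

```python
import json

# A toy controlled vocabulary mapping keywords to semantic tags; a real
# deployment would use an ontology or taxonomy maintained by the business.
VOCABULARY = {
    "refund": "topic:customer-service",
    "invoice": "topic:billing",
    "outage": "topic:incident",
}

def tag_document(doc_id: str, text: str) -> dict:
    """Attach structured metadata (tags) to an unstructured document."""
    lowered = text.lower()
    tags = sorted({tag for kw, tag in VOCABULARY.items() if kw in lowered})
    return {"id": doc_id, "tags": tags}

print(json.dumps(tag_document("doc-42", "Customer asked for a refund on invoice 881.")))
# {"id": "doc-42", "tags": ["topic:billing", "topic:customer-service"]}
```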
Although semantic enrichment can be accomplished through automation (e.g., data mining), some tasks are either beyond the capabilities of machines or require higher-order cognitive reasoning or human judgment. Semantic data enrichment typically uses a data model in the form of a network whose framework is objects and the relationships between them. Objects, relationships, data connections and dependencies lead to higher-order structures. Data enrichment is then about providing semantic stability, which involves solving several problems: searching for semantically unstable objects, classifying them by type of instability, and identifying ways to overcome the instability. However, the entire operation might turn out to be very complex and time consuming and produce unsatisfactory results. Keeping humans in the computing loop makes it possible to leverage their higher cognitive ability to perform data enrichment tasks with better reliability. While it is possible to hire dedicated workers to enrich unstructured data, this might not be financially viable for a business. This is where crowd-powered semantic data enrichment comes into play.
A crowd-powered semantic data enrichment system involves an external, web-based casual workforce in the data enrichment task by posting data to a web platform via an API or a web service [13]. Crowd-powered data enrichment jobs can be paid or unpaid. For paid jobs, the worker is paid a small amount (a few cents) for each job, while for unpaid jobs an incentive is provided via alternative means, such as incorporating the task into an interactive mobile game. Crowd-powered semantic data enrichment can be active or passive [14]. In active semantic data enrichment, workers are given very precise tasks that are easy for humans to perform but difficult for machines to execute. Passive data enrichment utilizes the online activities of users for data enrichment; this involves analyzing vast amounts of user-generated content to glean the behaviour, passions and associations of social media users.
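The sketch below shows how one active-enrichment microtask might be packaged and posted to a crowd platform over HTTP. The endpoint, payload fields and reward value are hypothetical; real crowdsourcing platforms each have their own APIs and authentication schemes.

```python
import json
import urllib.request

# Hypothetical endpoint of a crowd platform; real microtask services
# each define their own APIs and authentication.
CROWD_API_URL = "https://crowd.example.com/api/tasks"

def post_enrichment_task(record: dict, question: str, reward_cents: int = 5) -> dict:
    """Package one active-enrichment microtask and post it to the platform."""
    payload = json.dumps({
        "question": question,
        "data": record,
        "reward_cents": reward_cents,
    }).encode("utf-8")
    request = urllib.request.Request(
        CROWD_API_URL, data=payload,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example microtask: a question that is easy for a person, hard for a machine.
# post_enrichment_task({"company": "Acme Pty Ltd"},
#                      "Which industry does this company operate in?")
```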
V. CHALLENGES OF DATA ENRICHMENT
While data enrichment offers immense benefits to
businesses through better segmentation and targeting
of customers and improved decision making, it is not
without challenges, which need to be addressed to
make it a robust process. Finding the right sources of
data and judging the quality and relevance of external
data sources are currently done manually. Automating
this task is quite challenging and could be a future
research issue. The Internet is one of the main sources for data providers. A major part of the data available on the Internet is hidden inside the deep web. Pulling data from the deep web can be overwhelming, since deep web content cannot be extracted using classical search engines; however, the quality of this data is very good and it carries superior value. The information on the deep web can only be accessed through a restricted interface, such as a keyword-search API [10]. Social media is another source of rich data, but assessing the reliability of the content produced by social media users is quite difficult due to the uncontrolled nature of social media. The challenges of crowd-powered semantic data enrichment are judging the quality of the answers provided by human workers and distinguishing spammers from genuine workers.
VI. CONCLUSION
This paper provides a review of the issues and challenges of business data enrichment. The paper provides insights into the processes involved in data enrichment and the methods for selecting the relevant data providers needed to augment internal data. An overview of semantic data enrichment is provided, which makes unstructured data more usable and valuable to businesses. The paper also sheds light on crowd-powered semantic data enrichment, which keeps humans in the computing loop to make use of the higher cognitive abilities that data mining techniques lack. Finally, the paper presents the challenges in data enrichment, which need to be overcome to make data enrichment more robust so that businesses can make the most of it.
REFERENCES
[1] M. Toussant (2018, Mar. 22). The Importance of
Human-Curated Data Enrichment of Big Data Analysis
[Blog]. Available:
https://www.cas.org/blog/importance-human-curated-
data-enrichment-big-data-analysis
[2] A. C. Pereira, A. A. Veloso, G. L. Pappa, and W.
Meira, “Recommended Best Practices for Data
Enrichment Scenarios,” W3C Member Submission,
2016. Available:
http://www.inweb.org.br/w3c/dataenrichment/
[3] K. Searight (2017, Sep. 5). What Data Enrichment
means for B2B Marketers: Better Leads, Content &
Sales [Blog]. Available:
https://www.convertrmedia.com/data-enrichment-in-
b2b-marketing/
[4] B. Lambert (2014, Jan. 30). Guiding Principles for Data
Enrichment [Blog]. Available:
https://www.captechconsulting.com/blogs/Guiding-
Principles-for-Data-Enrichmnent
[5] Experian Data Quality, “What every Organization
should Know when Selecting a Data Enrichment
Vendor,” Data Enrichment Buyer’s Guide, 2016.
[6] E. King (2017, Dec. 7). Data Enrichment Part VIII:
Implementation Tips [Blog]. Available:
https://www.openprisetech.com/data-enrichment-part-
viii-implementation-tips/
[7] OPENPRISE. The complete data enrichment survival
guide for Marketing and Sales [Online]. Available:
https://launchpoint.marketo.com/assets/1598-
openprise/10543-analyze-clean-enrich-and-unify-your-
data/The-Complete-Enrichment-Survival-Guide-for-
Marketing-and-Sales.pdf
[8] K. Gomadam, R. J. Yeh, and K. Verma, “Data
Enrichment Using Data Sources on the Web,”
Accenture Technology Labs, AAAI Technical Report
SS-12-04, San Jose, CA, 2012, pp.34-38.
[9] E. King (2017, Nov. 3). Data Enrichment Part III:
Determining your target market [Blog]. Available:
https://www.openprisetech.com/data-enrichment-blog-
series-part-3/
[10] P. Wang, W. Hey, R. Shea, J. Wang, and E. Wu,
“Deeper: A Data Enrichment System Powered by Deep
Web,” in Proc. of the 2018 International Conference on
Management of Data, pp. 1801-1804. doi:
10.1145/3183713.3193569
[11] Z. Li, S. Shang, Q. Xie, and Z. Zhang, “Cost Reduction
for Web-based Data Imputation,” in Proc. International
Conference on Database Systems for Advanced
Applications (DASFAA 2014), pp. 438-452.
[12] IXXUS. Semantic Enrichment [Blog]. Available:
https://www.ixxus.com/solutions/semanticenrichment/
[13] A. Kass, and M. Mehta, “Crowd Powered Data
Enrichment: Combining Crowdsourcing and
Automation to Drive Smarter Business Processes from
Unstructured Data,” Accenture, 2016.
[14] P. Milano, “Humans in the Loop: Optimization of
Active and Passive Crowdsourcing,” Doctoral
Dissertation, Polytechnic University of Milan, Italy,
2015.