CHAPTER 9
Data Cleaning: Current Approaches and Issues

Vaishali Chandrakant Wangikar¹ and Ratnadeep R. Deshmukh²
¹MCA Department, Maharashtra Academy of Engineering, Alandi, Pune (MS), India
²Department of Computer Science & IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MS), India
E-mail: vaishali.wangikar@gmail.com, ratnadeep_deshmukh@yahoo.co.in
ABSTRACT
Data cleaning is the process of identifying and removing errors in a data warehouse. While collecting and combining data from various sources into a data warehouse, ensuring high data quality and consistency becomes a significant, often expensive and always challenging task. Without clean and correct data, the usefulness of data mining and data warehousing is diminished. This paper analyzes the problem of data cleansing and the identification of potential errors in data sets. The differing views of data cleansing are surveyed and reviewed, and a brief overview of existing data cleansing techniques is given. We also give an outlook on research directions that complement existing systems.
Keywords: Sorted Neighborhood Methods, Fuzzy Match, Clustering and Association, Token-Based Data, Record Linkage.
1. INTRODUCTION
Common data quality problems (anomalies) include inconsistent data conventions among sources, such as different abbreviations or synonyms; data entry errors such as spelling mistakes; inconsistent data formats; missing, incomplete, outdated or otherwise incorrect attribute values; data duplication; and irrelevant objects or data. Data that is incomplete or inaccurate is known as dirty data.
Several types of anomalies occur in data and have to be eliminated, and they can be classified into a number of categories. Based on this classification, we evaluate and compare existing approaches for data cleansing with respect to the types of anomalies they handle and eliminate.
The paper categorizes data cleansing into two categories: cleansing string data, and record or attribute de-duplication.
A data cleaning framework offers fundamental services such as attribute selection, formation of tokens, selection of a clustering algorithm, selection of a similarity function, selection of an elimination function and a merge function.
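A minimal, illustrative Python sketch of how such services could fit together for a single attribute is given below; the function names, the character-level similarity choice and the 0.75 threshold are assumptions for illustration, not part of any of the surveyed systems.

```python
# Illustrative sketch of a generic attribute-cleaning pipeline (assumed design,
# not from any surveyed paper): token formation, similarity, clustering, merge.
from collections import Counter
from difflib import SequenceMatcher

def form_tokens(value):
    """Token formation: lower-case the value and split it into words."""
    return value.lower().split()

def similarity(a, b):
    """Similarity function: character-level ratio on the normalized values."""
    return SequenceMatcher(None, " ".join(form_tokens(a)),
                           " ".join(form_tokens(b))).ratio()

def cluster_values(values, threshold=0.75):
    """Clustering: greedily group values whose similarity exceeds the threshold."""
    clusters = []
    for v in values:
        for c in clusters:
            if similarity(v, c[0]) >= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

def merge(cluster):
    """Elimination/merge: keep the most frequent variant as the clean value."""
    return Counter(cluster).most_common(1)[0][0]

dirty = ["New York", "new york", "Newyork", "Boston", "Boston", "Bostn"]
print({v: merge(c) for c in cluster_values(dirty) for v in c})
```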
The paper is organized as follows: Related Research Work describes various existing data cleaning techniques, followed by a Comparison of the existing techniques, and finally the Conclusion and Future Work.
2. RELATED RESEARCH WORK
2.1 Cleansing String Data
This category of data cleaning removes dirt in strings (words). An algorithm of this kind identifies a group of strings that consists of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings. All strings in a group are replaced by the most frequent string of the group.
2.1.1 Data Cleaning for Misspelled Proper Nouns (Border Detection Algorithm)
The method was proposed by Arturas Mazeika and Michael H. Böhlen in 2006. It targets proper noun databases, including names and addresses, which are not handled by dictionaries. Center Calculation and Border Detection algorithms [1] are suggested. Data cleansing is done in two steps: first, the string data is clustered by identifying the center and border of hyper-spherical clusters, and second, the cluster strings are cleansed with the most frequent string of the cluster. All strings within the overlap threshold from the center of the cluster are assigned to one cluster.
The Border Detection algorithm is a simple and effective strategy to compute clusters in string data. One starts with a string in the database and selects the border that separates the cluster from the other clusters. If the initial string was chosen close to the center of the cluster, the border detection will yield good and robust results; if one chooses the initial string close to the border, two separate clusters might be assigned. As the cluster size increases, the relative clustering error decreases. The algorithm successfully identifies borders of clusters provided a sufficient sample size, and its robustness is not affected by the cluster size.
Experiments show that border detection is robust provided a sufficient sample size. The investigation indicates that very few q-grams of the center strings are sufficient to identify the strings of a cluster. An algorithm that robustly finds the identifying q-grams of a cluster remains an interesting challenge.
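As a loose illustration of the general center/border idea, the Python sketch below clusters strings by q-gram similarity to a seed string and cuts the cluster at the largest similarity gap, then cleanses with the most frequent member. The gap-based border heuristic and the parameter values are assumptions for illustration, not the algorithm of Mazeika and Böhlen [1].

```python
# Loose sketch of center/border-style string clustering with q-grams.
# The gap heuristic (0.3) and min_sim (0.5) are illustrative assumptions.
from collections import Counter

def qgrams(s, q=2):
    s = f"#{s.lower()}#"                      # pad so string ends contribute q-grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_sim(a, b, q=2):
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

def cluster_around(seed, strings, min_sim=0.5):
    """Order strings by similarity to the seed and cut at the largest gap
    (the 'border'); everything before the cut joins the seed's cluster."""
    ranked = sorted(strings, key=lambda s: qgram_sim(seed, s), reverse=True)
    sims = [qgram_sim(seed, s) for s in ranked]
    cut = len(ranked)
    for i in range(1, len(sims)):
        if sims[i - 1] - sims[i] > 0.3 or sims[i] < min_sim:   # border found
            cut = i
            break
    return ranked[:cut]

names = ["Schwarzenegger", "Schwarzenegger", "Schwartzenegger",
         "Shwarzenegger", "Stallone"]
members = cluster_around("Schwarzenegger", names)
clean = Counter(members).most_common(1)[0][0]   # cleanse with most frequent string
print(members, "->", clean)
```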
2.1.2 Robust and Efficient Fuzzy Match for Online Data Cleaning (Fuzzy Match Similarity Algorithm)
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. A new similarity function that overcomes the limitations of commonly used similarity functions was proposed, and an efficient fuzzy match algorithm developed, in 2003 by Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti and Rajeev Motwani. Edit distance similarity [2] is generalized by incorporating the notion of tokens and their importance to develop an accurate fuzzy match similarity function for matching erroneous input tuples with clean tuples from a reference relation. The error tolerant index [2] and an efficient algorithm are developed for identifying, with high probability, the closest fuzzy matching reference tuples. Using real datasets, the high quality of the proposed similarity function and the efficiency of the algorithms are demonstrated.
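The Python sketch below gives a rough feel for a token- and weight-aware fuzzy match against a reference table. It is only loosely inspired by the fms function of Chaudhuri et al. [2]; the IDF weighting formula, the use of difflib as a stand-in for edit similarity, and the sample reference table are all assumptions.

```python
# Rough sketch of weighted token matching against a reference table
# (illustrative; not the exact fms function or error tolerant index of [2]).
import math
from difflib import SequenceMatcher

reference = ["sony vaio laptop", "dell inspiron laptop", "apple macbook pro"]

def token_idf(table):
    """Rarer tokens get larger weights, so errors in them matter more."""
    docs = [set(r.split()) for r in table]
    vocab = set().union(*docs)
    return {t: math.log(len(docs) / sum(t in d for d in docs)) + 1.0 for t in vocab}

IDF = token_idf(reference)

def token_sim(a, b):
    return SequenceMatcher(None, a, b).ratio()   # stand-in for edit similarity

def fuzzy_score(query, ref):
    """Weighted fraction of reference tokens that find a close query token."""
    q_tokens, r_tokens = query.split(), ref.split()
    total = sum(IDF.get(t, 1.0) for t in r_tokens)
    matched = sum(IDF.get(t, 1.0) * max(token_sim(t, qt) for qt in q_tokens)
                  for t in r_tokens)
    return matched / total

query = "dell inspiron lapptop"                 # misspelled incoming tuple
best = max(reference, key=lambda r: fuzzy_score(query, r))
print(best, round(fuzzy_score(query, best), 2))
```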
2.1.3 Data Cleaning by Clustering and Association Methods (Data Mining Algorithms)
Two applications of data mining techniques in the area of attribute correction, context-independent attribute correction implemented using clustering techniques and context-dependent attribute correction using associations, were proposed by Lukasz Ciszak in 2008 [4].
Existing attribute correction solutions require reference data in order to provide satisfactory results.
Context-independent attribute correction means that all record attributes are examined and cleaned in isolation, without regard to the values of other attributes of a given record.
Context-dependent correction means that attribute values are corrected with regard not only to the reference data value they are most similar to, but also taking into consideration the values of other attributes within a given record.
Experimental results of both algorithms created by the author show that attribute correction is possible without external reference data and can give good results. As the experiments revealed, the effectiveness of a method depends strongly on its parameters. The optimal parameters discovered here may give optimal results only for the data examined, and it is very likely that different data sets would need different parameter values to achieve a high ratio of correctly cleaned data.
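A hedged Python sketch of context-independent attribute correction in the spirit of this approach is shown below: frequent values act as reference data discovered from the column itself, and rare values are mapped to the most similar frequent one. The frequency cut-off, the similarity measure and the threshold are illustrative assumptions, not Ciszak's parameters.

```python
# Illustrative context-independent attribute correction: frequent values serve
# as reference data mined from the column itself (assumed parameters).
from collections import Counter
from difflib import SequenceMatcher

def correct_attribute(values, min_support=3, min_sim=0.75):
    counts = Counter(values)
    frequent = [v for v, c in counts.items() if c >= min_support]
    corrected = []
    for v in values:
        if v in frequent:
            corrected.append(v)
            continue
        best = max(frequent, key=lambda f: SequenceMatcher(None, v, f).ratio(),
                   default=v)
        sim = SequenceMatcher(None, v, best).ratio() if frequent else 0.0
        corrected.append(best if sim >= min_sim else v)   # leave unknowns alone
    return corrected

cities = ["Warsaw"] * 5 + ["Warsow", "Cracow", "Cracow", "Cracow", "Krakow"]
print(correct_attribute(cities))
```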
2.2 Record or Attribute De-duplication
This is a process for determining whether two or more records defined differently in a database actually represent the same real-world object. During data cleaning, multiple records representing the same real-life object are identified, assigned a single unique database identification, and only one copy of exact duplicate records is retained.
2.2.1 A Token-Based Data Cleaning Technique
Most existing work on data cleaning identifies record duplicates by computing match scores that are compared against a given match score threshold. Some approaches use the entire records for long string comparisons involving a number of passes. Determining an optimal match score threshold in a domain is hard, and straight long-string comparisons with many passes are inefficient.
The token-based technique proposed by Timothy E. Ohanekwu and C.I. Ezeife eliminates the need to rely on a match threshold by defining smart tokens that are used for identifying duplicates. This approach also eliminates the need to use the entire long string records, with multiple passes, for duplicate identification.
Existing algorithms use token keys extracted from records only for sorting and/or clustering. The experimental results show that the proposed token-based algorithm [5] outperforms the other two algorithms.
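To illustrate the flavor of token-based duplicate detection, the sketch below builds a compact key per record so that records whose keys coincide are treated as duplicates, with no match-score threshold. The key construction (sorted tokens, numeric tokens kept, alphabetic tokens abbreviated to their first letter) is an assumption for illustration, not the exact smart-token scheme of Ohanekwu and Ezeife [5].

```python
# Illustrative token-key duplicate detection (assumed key construction, not [5]).
from collections import defaultdict

def smart_token(field):
    toks = []
    for t in field.lower().replace(",", " ").split():
        toks.append(t if t.isdigit() else t[0])   # keep numbers, abbreviate words
    return "".join(sorted(toks))

def record_key(record, fields):
    return "|".join(smart_token(record[f]) for f in fields)

records = [
    {"name": "John A. Smith", "addr": "12 Oak Street"},
    {"name": "Smith, John A.", "addr": "12 Oak St."},
    {"name": "Jane Doe", "addr": "7 Elm Road"},
]

groups = defaultdict(list)
for r in records:
    groups[record_key(r, ["name", "addr"])].append(r)

duplicates = [g for g in groups.values() if len(g) > 1]
print(duplicates)
```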
2.2.2 Record Linkage: Similarity Measures and Algorithms
In the presence of data quality errors, a central problem is the ability to identify whether two entities (e.g., relational tuples) are approximately the same. The techniques used for this are record linkage and approximate join. A variety of approximate match predicates have been proposed to quantify the degree of similarity or closeness of two data entities. The authors Nick Koudas, Sunita Sarawagi and Divesh Srivastava have compared and contrasted them based on their applicability to various data types, algorithmic properties, computational overhead and adaptability.
Most approximate match predicates return a score between 0 and 1 (with 1 being assigned to identical entities) that effectively quantifies the degree of similarity between data entities. Such approximate match predicates consist of three parts; a small illustrative sketch of how the first two parts interact follows after the three parts below.
Atomic similarity measures: This part assesses atomic (attribute value) similarity between a pair of data entities. Several approximate match predicates are covered, including edit distance, phonetic distance (soundex), the Jaro and Winkler measures, tf.idf and many variants thereof. Several approaches to fine-tune the parameters of such measures are also considered.
Functions to combine similarity measures: Given a set of pairs of attributes belonging to two entities (tuples), in which each pair is tagged with its own approximate match score (possibly applying distinct approximate match predicates for each attribute pair), how does one combine such scores to decide whether the entire entities (tuples) are approximately the same? For this basic decision problem, several proposed methodologies (statistical and probabilistic, predictive, cost based, rule based, user assisted as well as learning based) are given. Moreover, several specific functions, including Naive Bayes, the Fellegi-Sunter model, linear support vector machines (SVM) and approaches based on voting theory, are covered.
Similarity between linked entities: Often the entities over which we need to resolve duplicates are linked together via foreign keys in a multi-relational database. The authors present various graph-based similarity measures that capture transitive contextual similarity in combination with the intrinsic similarity between two entities.
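The sketch below illustrates the first two parts only: per-attribute atomic similarities combined into one linkage score. The choice of an edit-style similarity for names, a Jaccard similarity for addresses, the fixed weights and the 0.7 threshold are invented for illustration; a real system would learn the combination (e.g., Fellegi-Sunter or an SVM) as the tutorial describes.

```python
# Illustrative combination of per-attribute similarity scores (assumed weights
# and threshold, not a method prescribed by the tutorial [6]).
from difflib import SequenceMatcher

def edit_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard_sim(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

WEIGHTS = {"name": 0.6, "address": 0.4}          # assumed relative importance

def match_score(r1, r2):
    scores = {"name": edit_sim(r1["name"], r2["name"]),
              "address": jaccard_sim(r1["address"], r2["address"])}
    return sum(WEIGHTS[a] * s for a, s in scores.items())

r1 = {"name": "Jonathan Smith", "address": "12 Oak Street Springfield"}
r2 = {"name": "Jon Smith", "address": "12 Oak St Springfield"}
score = match_score(r1, r2)
print(round(score, 2), "match" if score > 0.7 else "non-match")
```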
Record linkage algorithms: Once the basic techniques for quantifying the degree of approximate match for a pair (or subsets) of attributes have been identified, the next challenging operation is to embed them into an approximate join framework between two data sets. A common feature of all such algorithms is the ability to keep the total number of pairs (and subsequent decisions) low by utilizing various pruning mechanisms. These algorithms can be classified into two main categories.
1. Algorithms inspired by relational duplicate elimination and join techniques, including sort-merge, band join and indexed nested loops: In this context, techniques like Merge/Purge [9] (based on the concept of sorted neighborhoods), Big Match (based on indexed nested loop joins) and Dimension Hierarchies (based on the concept of hierarchically clustered neighborhoods) are reviewed.
2. Algorithms inspired by information retrieval that treat each tuple as a set of tokens and return those set pairs whose (weighted) overlap exceeds a specified threshold: In this context, a variety of set join algorithms are reviewed [6].
2.2.3 Adaptive Sorted Neighborhood Methods for Efficient Record Linkage
A variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during execution. Since finding the ideal values of such parameters is not straightforward, or no such single ideal value even exists, the applicability of existing solutions to new scenarios or domains is greatly hampered. To remedy this problem, Su Yan, Dongwon Lee, Min-Yen Kan and C. Lee Giles argue that one can achieve significant improvement by adaptively and dynamically changing such parameters of record linkage algorithms. To validate the hypothesis, a classical record linkage algorithm, the Sorted Neighborhood Method (SNM) [7], is used, and it is demonstrated how one can achieve improved accuracy and performance by adaptively changing its fixed sliding window size.
Two adaptive versions of the SNM algorithm, the incrementally-adaptive SNM (IA-SNM) and the accumulatively-adaptive SNM (AA-SNM), are proposed; both dynamically adjust the sliding window size, a key parameter of SNM, during the blocking phase to adaptively fit the duplicate distribution. Comprehensive experiments with both real and synthetic data sets from three domains validate the effectiveness and efficiency of the proposed adaptive schemes.
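For context, the sketch below shows the basic Sorted Neighborhood Method with a fixed window: records are sorted by a blocking key and only records inside the sliding window are compared. IA-SNM and AA-SNM would grow or shrink the window adaptively, which is not shown here; the blocking key and similarity test are illustrative assumptions.

```python
# Basic fixed-window SNM sketch (the adaptive window adjustment of IA-/AA-SNM
# is not implemented; key and similarity choices are assumptions).
from difflib import SequenceMatcher

def blocking_key(record):
    """Sort key: first three letters of the surname plus the house number."""
    return record["surname"][:3].lower() + record["house_no"]

def similar(r1, r2, threshold=0.8):
    s = SequenceMatcher(None,
                        f'{r1["surname"]} {r1["house_no"]}',
                        f'{r2["surname"]} {r2["house_no"]}').ratio()
    return s >= threshold

def snm(records, window=3):
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similar(rec, ordered[j]):         # only compare within the window
                pairs.append((rec, ordered[j]))
    return pairs

people = [
    {"surname": "Hernandez", "house_no": "127"},
    {"surname": "Hernandes", "house_no": "127"},
    {"surname": "Stolfo",    "house_no": "12"},
]
print(snm(people))
```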
3. COMPARISON
3.1 The following comparison covers three algorithms used especially for cleaning string-type data.

Border Detection Algorithm
Features: (1) Simple and effective for computing clusters in string data. (2) It helps in selecting the border that separates one cluster from the other. If the initial string is chosen close to the center of the cluster, the border detection will yield good and robust results; if the initial string is chosen close to the border, two separate clusters might be assigned.
Significance/Performance: It produces good cleansing results for string data with large distances between the centers of clusters and small distances within the clusters.
Limitations: The data cleansing algorithm is less applicable for natural language databases.

Data Mining Algorithm (Attribute Correction)
Features: (1) The given attributes are validated against reference data to provide a cleansing solution. (2) The two applications of data mining techniques in the area of attribute correction are context-independent attribute correction implemented using clustering techniques and context-dependent attribute correction using associations.
Significance/Performance: The algorithm displays better performance for long strings, as short strings would require a higher value of the parameter to discover a correct reference value. The method produces 92% correctly altered elements, which is an acceptable value.
Limitations: The major drawback of this method is that it may classify as incorrect a value that is correct in the context of the other attributes of the record but does not have enough occurrences within the cleaned data set.

Fuzzy Match Similarity Function Algorithm
Features: (1) If the tuple or attribute fails to match the reference data exactly, a fuzzy match operation is applied to it. (2) The fuzzy match similarity (fms) function explicitly considers IDF token weights and input errors while comparing tuples.
Significance/Performance: The quality of fms is better than ed (edit distance) on two datasets. The algorithms are 2 to 3 orders of magnitude faster than the naive algorithm.
Limitations: There is always a cost associated with transformations of IDF tokens.
3.2 The following comparison covers three algorithms used for cleaning duplicate entries of attributes as well as records.

Token-Based Algorithm
Features: For finding duplicates of attributes as well as records, smart tokens are used instead of comparing match scores against a match threshold. This approach also eliminates the need to use the entire long string records, with multiple passes, for duplicate identification.
Significance/Performance: By using short tokens for record comparisons, high recall and precision are achieved. It drastically lowers the dependency of data cleaning on the choice of match threshold. It has a recall close to 100% as well as negligible false positive errors. It succeeded in reducing the number of token tables to a constant of 2, irrespective of the number of fields selected by the user. The smart tokens are more likely applicable to domain-independent data cleaning and could be used as warehouse identifiers to enhance the process of incremental cleaning and refreshing of integrated data.
Limitations: Existing algorithms use token keys extracted from records only for sorting and/or clustering. Token-based cleaning of unstructured and semi-structured data is yet to be considered.

Record Linkage: Similarity Measures and Algorithms
Features: Approximate match and approximate join techniques are proposed to quantify the degree of similarity. Atomic similarity measures, functions to combine similarity measures, and similarity between linked entities are considered for matching. Two approximate join approaches are covered: the first is concerned with procedural algorithms operating on data, applying approximate match predicates without a particular storage or query model in mind; the second is concerned with declarative specifications of data cleaning operations.
Significance/Performance: A non-declarative specification offers greater algorithmic flexibility and possibly improved performance (e.g., implemented on top of a file system without incurring RDBMS overheads). A declarative specification offers unbeatable ease of deployment (as a set of SQL queries), direct processing of data in their native store (RDBMS) and flexible integration with existing applications utilizing an RDBMS.
Limitations: The output of the approximate join needs to be post-processed to cluster together all tuples that refer to the same entity. The approximate join operation might produce seemingly inconsistent results: tuple A joins with tuple B, tuple A joins with tuple C, but tuple B does not join with tuple C. A straightforward way to resolve such inconsistencies is to cluster together all tuples via a transitive closure of the join pairs; in practice, this can lead to extremely poor results, since unrelated tuples might get connected through noisy links.

Adaptive Sorted Neighborhood Methods for Efficient Record Linkage
Features: Among the many parameters of record linkage algorithms, the main focus is on the size of the sliding window in SNM, and adaptive versions of SNM are proposed. The size of the window in SNM amounts to the size of the block, which in turn is related to the aggressiveness of a blocking method. Two adaptive versions of the SNM algorithm, the incrementally-adaptive SNM (IA-SNM) and the accumulatively-adaptive SNM (AA-SNM), are proposed.
Significance/Performance: Adaptive sorted neighborhood methods significantly outperform the original SNM, and AA-SNM performs better than IA-SNM. The F-score of AA-SNM is 49% larger than that of SNM. The F-score difference between IA-SNM and AA-SNM is about 4%, but the PC difference is around 13%; this shows that AA-SNM is a better blocking method than IA-SNM, since it finds more potential duplicate pairs with a similar F-score.
Limitations: The adaptive schemes are robust to the variance in the size of each individual block, which can range from moderate to severe, and they show better resistance to errors in the blocking fields. The performance of the algorithm depends on an appropriate window size; several methods for adjusting window sizes, used in the adaptive methods, are proposed and compared, and among them the full adjustment method is shown to be near optimal.
4. CONCLUSION
Various data cleaning algorithms and techniques are presented in this paper, but each method can be used to identify only a particular type of error in the data; a technique suitable for one type of data cleaning may not be suitable for another. Data cleaning covers a wide variety of situations that need to be catered for efficiently by a comprehensive data cleaning framework. Future research directions include the review and investigation of various methods to address this wide area of data cleaning. Better integration of data cleaning approaches into such frameworks and into data-driven decision processes should be achieved.
Acknowledgements
For reviewing the different data cleaning algorithms, the experiments and contributions of several authors were consulted. Their papers are listed in the references.
References
1. Arturas Mazeika and Michael H. Böhlen: Cleansing Databases of Misspelled Proper Nouns. CleanDB, Seoul, Korea, 2006.
2. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti and Rajeev Motwani: Robust and Efficient Fuzzy Match for Online Data Cleaning. ACM SIGMOD 2003, June 9-12, 2003, San Diego, CA.
3. Rohit Ananthakrishna, Surajit Chaudhuri and Venkatesh Ganti: Eliminating Fuzzy Duplicates in Data Warehouses. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
4. Lukasz Ciszak: Application of Clustering and Association Methods in Data Cleaning. Proceedings of the International Multiconference on Computer Science and Information Technology, ISBN 978-83-60810-14-9, pp. 97-103.
5. Timothy E. Ohanekwu and C.I. Ezeife: A Token-Based Data Cleaning Technique for Data Warehouse Systems.
6. Nick Koudas, Sunita Sarawagi and Divesh Srivastava: Record Linkage: Similarity Measures and Algorithms. SIGMOD 2006, June 27-29, 2006, Chicago, Illinois, USA.
7. Su Yan, Dongwon Lee, Min-Yen Kan and C. Lee Giles: Adaptive Sorted Neighborhood Methods for Efficient Record Linkage. JCDL'07, June 17-22, 2007, Vancouver, British Columbia, Canada.
8. M. Hernandez and S. Stolfo: The Merge/Purge Problem for Large Databases. Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 127-138, May 1995.
9. M.A. Hernandez and S.J. Stolfo: Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery, Vol. 2, pp. 9-37, 1998.