CHAPTER 9
Data Cleaning: Current Approaches and Issues

Vaishali Chandrakant Wangikar¹ and Ratnadeep R. Deshmukh²
¹MCA Department, Maharashtra Academy of Engineering, Alandi, Pune (MS), India
²Department of Computer Science & IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad (MS), India
E-mail: vaishali.wangikar@gmail.com, ratnadeep_deshmukh@yahoo.co.in
ABSTRACT
Data cleaning is the process of identifying and removing errors in a data warehouse. While collecting and combining data from various sources into a data warehouse, ensuring high data quality and consistency becomes a significant, often expensive and always challenging task. Without clean and correct data, the usefulness of data mining and data warehousing is diminished. This paper analyzes the problem of data cleansing and the identification of potential errors in data sets. Differing views of data cleansing are surveyed and reviewed, and a brief overview of existing data cleansing techniques is given. We also give an outlook on research directions that complement the existing systems.
Keywords: Sorted Neighborhood Methods, Fuzzy Match, Clustering and Association, Token-Based Data, Record Linkage.
1. INTRODUCTION
Common data quality problems (anomalies) include inconsistent data conventions among sources, such as different abbreviations or synonyms; data entry errors such as spelling mistakes; inconsistent data formats; missing, incomplete, outdated or otherwise incorrect attribute values; data duplication; and irrelevant objects or data. Data that is incomplete or inaccurate is known as dirty data.
Anomalies occurring in data have to be eliminated, and they can be classified into several types. Based on this classification we evaluate and compare existing approaches to data cleansing with respect to the types of anomalies they handle and eliminate.
The paper divides data cleansing into two categories: cleansing string data, and record or attribute de-duplication.
Data cleaning provides fundamental services such as attribute selection, formation of tokens, selection of a clustering algorithm, selection of a similarity function, and selection of an elimination function and a merge function.
The paper is organized as follows. Related Research Work describes various existing data cleaning techniques, followed by a comparison of the existing techniques, the conclusion and future work.
2. RELATED RESEARCH WORK
2.1 Cleansing String Data
This category of data cleaning removes dirt in strings (words). An algorithm of this kind identifies a group of strings that consists of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings; all strings in the group are then replaced by the most frequent string of the group.
2.1.1 Data Cleaning for Misspelled Proper Nouns (Border Detection Algorithm)
The method was proposed by Arturas Mazeika and Michael H. Böhlen in 2006. It targets proper noun databases, including names and addresses, which are not handled by dictionaries. Center Calculation and Border Detection algorithms [1] are suggested. Data cleansing is done in two steps: first, the string data is clustered by identifying the center and border of hyper-spherical clusters; second, the cluster strings are cleansed with the most frequent string of the cluster. All strings within the overlap threshold from the center of the cluster are assigned to one cluster.
The Border Detection algorithm is a simple and effective strategy for computing clusters in string data. One starts with a string in the database and selects the border that separates its cluster from the other clusters. If the initial string was chosen close to the center of the cluster, border detection will yield good and robust results. If one chooses the initial string close to the border, two separate clusters might be assigned. As the cluster size increases, the relative clustering error decreases. The algorithm successfully identifies borders of clusters provided a sufficient sample size, and its robustness is not affected by the cluster size.
Experiments show that border detection is robust provided a sufficient sample size. The investigation indicates that very few q-grams of the center strings are sufficient to identify the strings of a cluster. An algorithm that robustly finds the identifying q-grams of a cluster remains an interesting challenge.
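To make the cluster-and-replace idea concrete, the sketch below groups strings by q-gram overlap around frequent "center" strings and rewrites every member of a group to that center. It is only a simplified illustration, not the authors' Center Calculation and Border Detection algorithms; the q-gram size, the similarity threshold and the greedy clustering loop are assumptions made for the example.

from collections import Counter

def qgrams(s, q=3):
    # Pad the string so boundary characters also form q-grams.
    s = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=3):
    # Jaccard overlap of q-gram sets, in [0, 1].
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def cleanse_proper_nouns(strings, threshold=0.45):
    # Greedy clustering: the most frequent unassigned string becomes a cluster
    # center; every unassigned string within the threshold joins that cluster
    # and is rewritten to the center (the cluster's most frequent member).
    counts = Counter(strings)
    unassigned = set(counts)
    mapping = {}
    for center, _ in counts.most_common():
        if center not in unassigned:
            continue
        cluster = [s for s in unassigned if qgram_similarity(center, s) >= threshold]
        for s in cluster:
            mapping[s] = center
            unassigned.discard(s)
    return [mapping[s] for s in strings]

# Misspelled proper nouns collapse onto the dominant spelling.
print(cleanse_proper_nouns(["Aurangabad", "Aurangabad", "Aurangbad", "Pune", "Punee"]))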
2.1.2 Robust and Efficient Fuzzy Match for Online Data Cleaning (Fuzzy Match Similarity Algorithm)
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. A similarity function that overcomes limitations of commonly used similarity functions was proposed, and an efficient fuzzy match algorithm developed, in 2003 by Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti and Rajeev Motwani. Edit distance similarity [2] is generalized by incorporating the notion of tokens and their importance to develop an accurate fuzzy match similarity function for matching erroneous input tuples with clean tuples from a reference relation. An error tolerant index [2] and an efficient algorithm are developed for identifying, with high probability, the closest fuzzy matching reference tuples. Using real datasets, the high quality of the proposed similarity function and the efficiency of the algorithms are demonstrated.
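The sketch below illustrates, in simplified form, how tokenization, IDF weighting and per-token string similarity can be combined into a fuzzy match against a reference relation. It is not the fms function or the error tolerant index of [2]; the tiny reference table, the use of SequenceMatcher as a stand-in for edit-distance similarity, and the normalization are assumptions for illustration only.

import math
from difflib import SequenceMatcher

# A tiny, assumed reference relation (organization, city, state).
reference = [
    ("Boeing Company", "Seattle", "WA"),
    ("Bon Corporation", "Seattle", "WA"),
    ("Companions", "Seattle", "WA"),
]

def tokens(tup):
    return [t.lower() for field in tup for t in field.split()]

# IDF weights: tokens that are rare in the reference relation matter more.
doc_freq = {}
for ref_tuple in reference:
    for tok in set(tokens(ref_tuple)):
        doc_freq[tok] = doc_freq.get(tok, 0) + 1

def idf(tok):
    return math.log((len(reference) + 1) / (doc_freq.get(tok, 0) + 1)) + 1.0

def token_similarity(a, b):
    # Stand-in for an edit-distance-based per-token similarity in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def fuzzy_match(input_tuple):
    # Return the reference tuple with the highest IDF-weighted token similarity.
    in_toks = tokens(input_tuple)
    best, best_score = None, -1.0
    for ref_tuple in reference:
        ref_toks = tokens(ref_tuple)
        score = sum(idf(t) * max(token_similarity(t, r) for r in ref_toks)
                    for t in in_toks)
        score /= sum(idf(t) for t in in_toks)  # normalize to [0, 1]
        if score > best_score:
            best, best_score = ref_tuple, score
    return best, round(best_score, 3)

print(fuzzy_match(("Beoing Corporation", "Seattle", "WA")))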
2.1.3 Data Cleaning by Clustering and Association Methods (Data Mining Algorithms)
Two applications of data mining techniques in the area of attribute correction were proposed by Lukasz Ciszak in 2008 [4]: context-independent attribute correction implemented using clustering techniques, and context-dependent attribute correction using associations.
Attribute correction solutions usually require reference data in order to provide satisfying results.
Context-independent attribute correction means that all record attributes are examined and cleaned in isolation, without regard to the values of the other attributes of a given record.
Context-dependent attribute correction means that an attribute value is corrected with regard not only to the reference data value it is most similar to, but also taking into consideration the values of the other attributes within the given record.
Experimental results of both algorithms created by the author show that attribute correction is possible without external reference data and can give good results. As discovered in the experiments, the effectiveness of a method depends strongly on its parameters. The optimal parameters discovered there may give optimal results only for the data examined, and it is very likely that different data sets would need different parameter values to achieve a high ratio of correctly cleaned data.
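A minimal sketch of the context-independent idea follows: rare values in a column are treated as probable errors and corrected to the most similar frequent value, using the data itself as the reference. This is a simplification of the clustering approach in [4]; the support and similarity thresholds are illustrative assumptions.

from collections import Counter
from difflib import SequenceMatcher

def correct_column(values, min_support=3, min_similarity=0.7):
    # Values occurring at least min_support times are trusted; rarer values are
    # corrected to the most similar trusted value if it is similar enough.
    counts = Counter(values)
    frequent = [v for v, c in counts.items() if c >= min_support]
    corrected = []
    for v in values:
        if counts[v] >= min_support or not frequent:
            corrected.append(v)  # trusted value (or nothing to correct against)
            continue
        best = max(frequent, key=lambda f: SequenceMatcher(None, v, f).ratio())
        if SequenceMatcher(None, v, best).ratio() >= min_similarity:
            corrected.append(best)
        else:
            corrected.append(v)  # no confident correction; leave unchanged
    return corrected

cities = ["Pune"] * 5 + ["Punr", "Mumbai", "Mumbai", "Mumbai", "Mumbay"]
print(correct_column(cities))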
2.2 Record or Attribute De-duplication
De-duplication is the process of determining whether two or more records, represented differently in a database, actually refer to the same real-world object. During data cleaning, multiple records representing the same real-life object are identified and assigned a single unique database identifier, and only one copy of exact duplicate records is retained.
2.2.1 A Token-Based Data Cleaning Technique
Most existing work on data cleaning identifies record duplicates by computing match scores that are compared against a given match score threshold. Some approaches use the entire records for long string comparisons that involve a number of passes. Determining an optimal match score threshold in a domain is hard, and straight long string comparisons with many passes are inefficient.
The token-based technique proposed by Timothy E. Ohanekwu and C.I. Ezeife eliminates the need to rely on a match threshold by defining smart tokens that are used for identifying duplicates. This approach also eliminates the need to use entire long string records, with multiple passes, for duplicate identification.
Existing algorithms use token keys extracted from records only for sorting and/or clustering. The results from the experiments show that the proposed token-based algorithm [5] outperforms the other two algorithms.
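The sketch below illustrates the general token idea: each field is reduced to a few short tokens, the tokens form a record key, and records with equal keys are grouped as duplicates, with no match-score threshold involved. The exact smart-token formation rules of [5] differ; the prefix length, the token ordering and the sample records here are assumptions for illustration.

import re

def field_tokens(value):
    # Reduce a field to short, order-insensitive tokens: numeric parts are kept
    # whole, alphabetic words contribute only a short prefix.
    toks = []
    for word in re.findall(r"[A-Za-z]+|\d+", value.lower()):
        toks.append(word if word.isdigit() else word[:3])
    return tuple(sorted(toks))

def record_key(record, fields):
    return tuple(field_tokens(record[f]) for f in fields)

def find_duplicates(records, fields):
    # Records whose token keys are identical are grouped as duplicates.
    groups = {}
    for rec in records:
        groups.setdefault(record_key(rec, fields), []).append(rec)
    return [group for group in groups.values() if len(group) > 1]

rows = [
    {"name": "John Smith",  "addr": "401 Sunset Avenue"},
    {"name": "Smith, John", "addr": "401 Sunset Ave."},
    {"name": "Jane R. Doe", "addr": "12 Main Street"},
]
print(find_duplicates(rows, fields=["name", "addr"]))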
2.2.2 Record Linkage: Similarity Measures and Algorithms
In the presence of data quality errors, a central problem is the ability to identify whether two entities (e.g., relational tuples) are approximately the same. The techniques used here are record linkage and approximate joins.
A variety of approximate match predicates have been proposed to quantify the degree of similarity or closeness of two data entities. The authors Nick Koudas, Sunita Sarawagi and Divesh Srivastava have compared and contrasted them based on their applicability to various data types, algorithmic properties, computational overhead and adaptability.
Most approximate match predicates return a score between 0 and 1 (with 1 being assigned to identical entities) that effectively quantifies the degree of similarity between data entities. Such approximate match predicates consist of three parts.
Atomic Similarity Measures: These measures assess atomic (attribute value) similarity between a pair of data entities. Several approximate match predicates are covered, including edit distance, phonetic distance (Soundex), the Jaro and Winkler measures, tf.idf and many variants thereof. Several approaches to fine-tune the parameters of such measures are considered.
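As a concrete example of an atomic measure, the following sketch computes Levenshtein edit distance and normalizes it into a [0, 1] similarity score, matching the score convention described above; the normalization by the longer string length is one common choice, not the only one.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    # Map the distance into a [0, 1] score; identical strings score 1.0.
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(edit_similarity("Srivastava", "Shrivastava"))  # one insertion: ~0.91
print(edit_similarity("Koudas", "Sarawagi"))         # mostly dissimilar: low score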
Functions to combine similarity measures: Given a set of pairs of attributes belonging to two entities (tuples), in which each pair is tagged with its own approximate match score (possibly applying distinct approximate match predicates for each attribute pair), how does one combine such scores to decide whether the entire entities (tuples) are approximately the same? For this basic decision problem, several proposed methodologies are given: statistical and probabilistic, predictive, cost-based, rule-based, user-assisted, as well as learning-based. Moreover, several specific functions are covered, including Naive Bayes, the Fellegi-Sunter model, linear support vector machines (SVM) and approaches based on voting theory.
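One simple combiner, shown below as a hedged sketch, is a weighted linear combination of per-attribute scores compared against a decision threshold; the weights and threshold are illustrative assumptions and stand in for the richer probabilistic, rule-based and learning-based combiners surveyed in [6].

def combine_scores(attribute_scores, weights, threshold=0.8):
    # attribute_scores and weights are dicts keyed by attribute name.
    # Returns the combined score and the match decision.
    total_weight = sum(weights[a] for a in attribute_scores)
    combined = sum(weights[a] * s for a, s in attribute_scores.items()) / total_weight
    return combined, combined >= threshold

scores = {"name": 0.92, "address": 0.70, "city": 1.00}  # from atomic measures
weights = {"name": 0.5, "address": 0.3, "city": 0.2}    # assumed importance
print(combine_scores(scores, weights))                  # combined score of about 0.87 -> match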
Similarity between linked entities: Often the entities over which we need to resolve duplicates are linked together via foreign keys in a multi-relational database. The authors present various graph-based similarity measures that capture transitive contextual similarity in combination with the intrinsic similarity between two entities.
Record Linkage Algorithms: Once the basic techniques for quantifying the degree of approximate match for a pair (or subsets) of attributes have been identified, the next challenging step is to embed them into an approximate join framework between two data sets. A common feature of all such algorithms is the ability to keep the total number of pairs (and subsequent decisions) low by utilizing various pruning mechanisms. These algorithms can be classified into two main categories.
1. Algorithms inspired by relational duplicate elimination and join techniques, including sort-merge, band join and indexed nested loops: in this context, techniques like Merge/Purge [9] (based on the concept of sorted neighborhoods), BigMatch (based on indexed nested loop joins) and Dimension Hierarchies (based on the concept of hierarchically clustered neighborhoods) are reviewed.
2. Algorithms inspired by information retrieval that treat each tuple as a set of tokens and return those set pairs whose (weighted) overlap exceeds a specified threshold: in this context, a variety of set join algorithms are reviewed [6]. A minimal sketch of this weighted-overlap idea follows the list.
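The sketch below illustrates the weighted-overlap idea referenced in the second category: tuples become token sets, tokens are IDF-weighted, and pairs whose weighted (Jaccard-normalized) overlap exceeds a threshold are reported. Real set join algorithms add prefix filtering and indexing to avoid the all-pairs loop used here; the threshold and the normalization are assumptions for the example.

import math
from itertools import combinations

def weighted_overlap_join(tuples, threshold=0.45):
    # Tokenize every tuple, weight tokens by IDF, and report pairs whose
    # weighted Jaccard overlap meets the (illustrative) threshold.
    token_sets = [set(" ".join(t).lower().split()) for t in tuples]
    n = len(token_sets)
    df = {}
    for s in token_sets:
        for tok in s:
            df[tok] = df.get(tok, 0) + 1
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}

    matches = []
    for i, j in combinations(range(n), 2):
        shared = token_sets[i] & token_sets[j]
        union = token_sets[i] | token_sets[j]
        score = sum(idf[t] for t in shared) / sum(idf[t] for t in union)
        if score >= threshold:
            matches.append((tuples[i], tuples[j], round(score, 2)))
    return matches

rows = [("ACME Trading Co", "Chicago"),
        ("ACME Trading Company", "Chicago"),
        ("Globex Corp", "Chicago")]
print(weighted_overlap_join(rows))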
2.2.3 Adaptive Sorted Neighborhood Methods for Efficient Record Linkage
A variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during execution. Since finding the ideal values of such parameters is not straightforward, or no single ideal value even exists, the applicability of existing solutions to new scenarios or domains is greatly hampered. To remedy this problem, Su Yan, Dongwon Lee, Min-Yen Kan and C. Lee Giles argue that one can achieve significant improvement by adaptively and dynamically changing such parameters of record linkage algorithms.
To validate the hypothesis, a classical record linkage algorithm, the Sorted Neighborhood Method (SNM) [7], is used to demonstrate how one can achieve improved accuracy and performance by adaptively changing its fixed sliding window size.
Two adaptive versions of the SNM algorithm, named the incrementally-adaptive SNM (IA-SNM) and the accumulatively-adaptive SNM (AA-SNM), are proposed; both dynamically adjust the sliding window size, a key parameter of SNM, during the blocking phase to adaptively fit the duplicate distribution. Comprehensive experiments with both real and synthetic data sets from three domains validate the effectiveness and efficiency of the proposed adaptive schemes.
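For reference, the sketch below shows the basic SNM with a fixed sliding window: records are sorted on a blocking key and only records that fall within the same window are compared. The adaptive window-size adjustment of IA-SNM and AA-SNM [7] is not reproduced; the blocking key, the pairwise comparison function and the window size are illustrative assumptions.

from difflib import SequenceMatcher

def blocking_key(record):
    # Illustrative key: first three letters of the surname + first letter of the name.
    return record["surname"][:3].lower() + record["name"][:1].lower()

def is_duplicate(r1, r2, threshold=0.85):
    a = " ".join(r1.values()).lower()
    b = " ".join(r2.values()).lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def sorted_neighborhood(records, window=3):
    # Sort on the blocking key, then compare only record pairs that fall
    # within the same fixed-size sliding window.
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            if is_duplicate(ordered[i], ordered[j]):
                pairs.append((ordered[i], ordered[j]))
    return pairs

people = [
    {"name": "John", "surname": "Smith"},
    {"name": "Jon",  "surname": "Smith"},
    {"name": "Jane", "surname": "Doe"},
]
print(sorted_neighborhood(people))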
3. COMPARISON
3.1 The following comparison covers three algorithms used for cleaning string data.

Border Detection Algorithm
Features: (1) Simple and effective for computing clusters in string data. (2) It helps in selecting the border that separates one cluster from another. If the initial string is chosen close to the center of the cluster, border detection yields good and robust results; if the initial string is chosen close to the border, two separate clusters might be assigned.
Significance/Performance: It produces good cleansing results for string data with large distances between cluster centers and small distances within the clusters.
Limitations: The data cleansing algorithm is less applicable to natural language databases.

Data Mining Algorithm (Attribute Correction)
Features: (1) The given attributes are validated against reference data to provide a cleansing solution. (2) The two applications of data mining techniques in the area of attribute correction are context-independent attribute correction implemented using clustering techniques, and context-dependent attribute correction using associations.
Significance/Performance: The algorithm displays better performance for long strings, as short strings would require a higher parameter value to discover a correct reference value. The method produces 92% correctly altered elements, which is an acceptable value.
Limitations: The major drawback of this method is that it may classify as incorrect a value that is correct in the context of the other attributes of the record but does not have enough occurrences within the cleaned data set.

Fuzzy Match Similarity Function Algorithm
Features: (1) If a tuple or attribute fails to match the reference data exactly, a fuzzy match operation is applied to it. (2) The fuzzy match similarity (fms) function explicitly considers IDF token weights and input errors while comparing tuples.
Significance/Performance: The quality of fms is better than ed (edit distance) on the two datasets used. The algorithms are 2 to 3 orders of magnitude faster than the naive algorithm.
Limitations: There is always a cost associated with transformations of IDF tokens.
3.2 The following comparison covers three algorithms used for cleaning duplicate entries of attributes as well as records.

Token-Based Algorithm
Features: For finding duplicates of attributes as well as records, smart tokens are used instead of comparing match scores against a match threshold. This approach also eliminates the need to use entire long string records, with multiple passes, for duplicate identification.
Significance/Performance: By using short-length tokens for record comparisons, a high recall/precision is achieved. It drastically lowers the dependency of the data cleaning on the choice of match threshold. It has a recall close to 100%, as well as negligible false positive errors. It succeeded in reducing the number of token tables to a constant of 2, irrespective of the number of fields selected by the user. The smart tokens are more likely applicable to domain-independent data cleaning, and could be used as warehouse identifiers to enhance the process of incremental cleaning and refreshing of integrated data.
Limitations: Existing algorithms use token keys extracted from records only for sorting and/or clustering. Token-based cleaning of unstructured and semi-structured data is yet to be considered.

Record Linkage: Similarity Measures and Algorithms
Features: Approximate match and approximate join techniques are proposed to quantify the degree of similarity. Atomic similarity measures, functions to combine similarity measures, and similarity between linked entities are considered for matching. Two approximate join approaches are covered: the first is concerned with procedural algorithms operating on data, applying approximate match predicates without a particular storage or query model in mind; the second is concerned with declarative specifications of data cleaning operations.
Significance/Performance: A non-declarative specification offers greater algorithmic flexibility and possibly improved performance (e.g., implemented on top of a file system without incurring RDBMS overheads). A declarative specification offers unbeatable ease of deployment (as a set of SQL queries), direct processing of data in their native store (RDBMS) and flexible integration with existing applications utilizing an RDBMS.
Limitations: The output of the approximate join needs to be post-processed to cluster together all tuples that refer to the same entity. The approximate join operation might produce seemingly inconsistent results: tuple A joins with tuple B, tuple A joins with tuple C, but tuple B does not join with tuple C. A straightforward way to resolve such inconsistencies is to cluster together all tuples via a transitive closure of the join pairs; in practice, this can lead to extremely poor results since unrelated tuples might get connected through noisy links.

Adaptive Sorted Neighborhood Methods for Efficient Record Linkage
Features: Among the many parameters of record linkage algorithms, the main focus is on the size of the sliding window in SNM, and adaptive versions of SNM are proposed. The size of the window in SNM amounts to the size of the block, which in turn is related to the aggressiveness of a blocking method. Two adaptive versions of the SNM algorithm, the incrementally-adaptive SNM (IA-SNM) and the accumulatively-adaptive SNM (AA-SNM), are proposed.
Significance/Performance: The adaptive sorted neighborhood methods significantly outperform the original SNM method, and AA-SNM performs better than IA-SNM. The F-score of AA-SNM is 49% larger than that of SNM. The F-score difference between IA-SNM and AA-SNM is about 4%, but the PC difference is around 13%; this shows that AA-SNM is a better blocking method than IA-SNM since it finds more potential duplicate pairs with a similar F-score.
Limitations: The adaptive schemes are robust to the variance in the size of each individual block, which can range from moderate to severe, and show better resistance to errors in the blocking fields. However, the performance of the algorithm depends on an appropriate window size. Several methods for adjusting window sizes, used in the adaptive methods, are proposed and compared; among them, the full adjustment method is shown to be near optimal.
4. CONCLUSION
Various data cleaning algorithms and techniques are presented in this paper, but each method can only identify a particular type of error in the data; a technique suitable for one type of data cleaning may not be suitable for another. Data cleaning covers a wide variety of situations that need to be catered for efficiently by a comprehensive data cleaning framework. Future research directions include the review and investigation of methods that address this wide area of data cleaning. Better integration of data cleaning approaches into such frameworks and data decision processes should be achieved.
Acknowledgements
For the review of the different data cleaning algorithms, the experiments and contributions of several authors are referred to; their papers are listed in the references.
References
1. Arturas Mazeika, Michael H. Böhlen: Cleansing Databases of Misspelled Proper Nouns. CleanDB, Seoul, Korea, 2006.
2. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani: Robust and Efficient Fuzzy Match for Online Data Cleaning. ACM SIGMOD 2003, June 9-12, 2003, San Diego, CA.
3. Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti: Eliminating Fuzzy Duplicates in Data Warehouses. Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
4. Lukasz Ciszak: Application of Clustering and Association Methods in Data Cleaning. Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 97-103. ISBN 978-83-60810-14-9.
5. Timothy E. Ohanekwu, C.I. Ezeife: A Token-Based Data Cleaning Technique for Data Warehouse Systems.
6. Nick Koudas, Sunita Sarawagi, Divesh Srivastava: Record Linkage: Similarity Measures and Algorithms. SIGMOD 2006, June 27-29, 2006, Chicago, Illinois, USA.
7. Su Yan, Dongwon Lee, Min-Yen Kan, C. Lee Giles: Adaptive Sorted Neighborhood Methods for Efficient Record Linkage. JCDL '07, June 17-22, 2007, Vancouver, British Columbia, Canada.
8. M. Hernandez, S. Stolfo: The Merge/Purge Problem for Large Databases. Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 127-138, May 1995.
9. M.A. Hernandez, S.J. Stolfo: Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery, Vol. 2, pp. 9-37, 1998.