Article

Generalized Hamming Distance

Abstract

Many problems in information retrieval and related fields depend on a reliable measure of the distance or similarity between objects that, most frequently, are represented as vectors. This paper considers vectors of bits. Such data structures implement entities as diverse as bitmaps that indicate the occurrences of terms and bitstrings indicating the presence of edges in images. For such applications, a popular distance measure is the Hamming distance. The value of the Hamming distance for information retrieval applications is limited by the fact that it counts only exact matches, whereas in information retrieval, corresponding bits that are close by can still be considered to be almost identical. We define a “Generalized Hamming distance” that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently. We envision many uses for such a measure. In this paper we define and prove some basic properties of the “Generalized Hamming distance”, and illustrate its use in the area of object recognition. We evaluate our implementation in a series of experiments, using autonomous robots to test the measure's effectiveness in relating similar bitstrings.
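To make the idea concrete, here is a minimal sketch of how such a measure can be computed, assuming the bitstrings are first reduced to the positions of their 1-bits and that inserting or deleting a 1-bit costs 1 while sliding a 1-bit sideways costs a configurable amount per position; the function name and cost values are illustrative, not the paper's exact formulation.

```python
def generalized_hamming(x: str, y: str, shift_cost: float = 0.5) -> float:
    """Hamming-like distance with partial credit for near misses (sketch).

    The bitstrings are reduced to the positions of their 1-bits; an
    edit-distance dynamic program then charges 1 to insert or delete a
    1-bit and shift_cost per position to slide a 1-bit sideways.
    The costs assumed here are illustrative, not the paper's parameters.
    """
    a = [i for i, bit in enumerate(x) if bit == "1"]
    b = [i for i, bit in enumerate(y) if bit == "1"]
    m, n = len(a), len(b)
    # d[i][j]: cost of turning the first i 1-bits of x into the first j of y
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)                    # delete all remaining 1-bits
    for j in range(1, n + 1):
        d[0][j] = float(j)                    # insert all needed 1-bits
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,              # delete the 1-bit at a[i-1]
                d[i][j - 1] + 1,              # insert a 1-bit at b[j-1]
                d[i - 1][j - 1]
                + shift_cost * abs(a[i - 1] - b[j - 1]),  # shift a 1-bit
            )
    return d[m][n]

print(generalized_hamming("10010", "01010"))  # 0.5; plain Hamming gives 2
```

For the shift operation to reward near misses, the per-position shift cost must stay below the combined cost of a delete plus an insert; otherwise the dynamic program never shifts and the result coincides with the ordinary Hamming distance.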
... The Hamming distance measures the minimum number of substitutions required to change one string x into the other y [12]. ...
... We match monolingual ontologies from BioPortal, a web-based application for accessing and sharing biomedical ontologies and providing alignments between them. We match the Sample Processing and Separation Techniques Ontology (SEP) with several ontologies from BioPortal, such as the Plant Experimental Conditions (PECO), Plant (PO), Plant Trait (PTO), and Units of Measurement (UO) ontologies. Table 5 shows the statistics of the BioPortal ontologies and the degree of overlap between SEP and each ontology. ...
Article
Full-text available
The amount of multilingual data on the Web is proliferating; therefore, developing ontologies in various natural languages is attracting considerable attention. In order to achieve semantic interoperability for the multilingual Web, cross-lingual ontology matching techniques are in high demand. This paper proposes a Multilingual Ontology Matching (MoMatch) approach for matching ontologies in different natural languages. MoMatch uses machine translation and various string similarity techniques to identify correspondences across different ontologies. Furthermore, we propose a Quality Assessment Suite for Ontologies (QASO) that comprises 14 metrics, of which seven are used to assess the quality of the matching process and seven to evaluate the quality of the ontology. We present an in-depth comparison of different string similarity techniques across various languages to determine the most effective similarity measure(s) between multilingual terms. To illustrate the applicability of our approach and how it can be used in different domains, we present two use cases. MoMatch has been implemented using Scala and Apache Spark under an open-source license. We have compared our results with the results from the Ontology Alignment Evaluation Initiative (OAEI 2020). MoMatch has achieved significantly higher precision, recall, and F-measure compared to five state-of-the-art matching approaches.
... A number of distance measures can be used. The Hamming distance [11][12] quantifies the difference between two equal-length sequences of symbols as the number of positions at which the corresponding symbols differ. For example, the Hamming distance between "rose" and "ruse" is 1, while the Hamming distance between "110110" and "000101" is 4. The Levenshtein distance [13][14] corresponds to the minimum number of characters that must be deleted, inserted, or replaced to transform one sequence into another. ...
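As a side-by-side illustration of the two measures described in this excerpt, the following minimal Python sketch reproduces the quoted examples; the implementations are the textbook versions, not code from the cited works.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x: str, y: str) -> int:
    """Minimum number of deletions, insertions, and replacements turning
    x into y, computed with the standard row-rolling dynamic program."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        curr = [i]
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # replacement
        prev = curr
    return prev[-1]

print(hamming("rose", "ruse"))      # 1
print(hamming("110110", "000101"))  # 4
print(levenshtein("rose", "ruse"))  # 1
```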
Preprint
Full-text available
The species inventory of global biodiversity is constantly revised and refined by taxonomic research, through the addition of newly discovered species. This almost three-century-old project provides a knowledge foundation essential for humankind, notably for developing appropriate conservation strategies. The task relies on the study of millions of specimens housed all around the world in natural history collections. For two decades, taxonomy has generated an enormous amount of digital data every year, notably through the digitization of collection specimens, and is gradually transforming into a big-data science. In this vein, the French National Museum of Natural History (MNHN) has embarked on a major research and engineering challenge within its information system, in order to facilitate the transition towards cyber-taxonomic practices, which require facilitated access to data on reference collection specimens housed all over the world. To this end, a first mandatory step is to automatically complete the classification data usually associated with collection specimens found in multiple databases. We use fuzzy approaches here to connect one database with others and match identical specimens found in each database.
... To attain optimal results, the right measure must be selected for the specific binary data analysis [112]. The classic Hamming distance does not take nearby bits (the close bits) into account [113]. To enhance activity detection for kitchen ADL, the sensor fusion technique extracts the pertinent information from each type of sensor data and combines it. ...
Article
Full-text available
Abnormal behavior detection (ABD) systems are built to automatically identify and recognize abnormal behavior from various input data types, such as sensor-based and vision-based input. Despite the attention ABD systems have received, the number of studies on ABD in activities of daily living (ADL) is limited. Owing to the increasing rate of elderly accidents in the home compound, research on ABD in ADL should be given comparable attention, since such systems can prevent accidents by sending out signals when abnormal behavior such as falling is detected. In this study, we compare and contrast the formation of ABD systems in ADL, from input data types (sensor-based and vision-based input) to modeling techniques (conventional and deep learning approaches). We scrutinize the publicly available datasets and provide solutions for one of the significant issues: the lack of datasets for ABD in ADL. This work aims to guide new research toward a better understanding of the field of ABD in ADL and to serve as a reference for future studies of better Ambient Assisted Living amid the growing smart home trend.
Article
Recommender Systems (RS) are used to generate recommendations of items that a user may be interested in. Several commercial wine recommender systems exist but are largely tailored to consumers outside of South Africa (SA). Consequently, these systems are of limited use to novice wine consumers in SA. In this research, a system soMLier (a combination of the terms ‘sommelier’ and ‘Machine Learning’) is developed for SA consumers that yields high-quality wine recommendations, maximises the accuracy of predicted ratings for those recommendations and provides insights into why those suggestions were made. This system is developed using two datasets – a database containing several attributes of SA wines and the corresponding numeric 5-star ratings made by users on Vivino.com. Using these datasets, several recommendation methodologies are investigated and it is found that collaborative filtering succeeds at generating lists of relevant wine recommendations, matrix factorisation techniques accurately predict ratings and content-based methods are most appropriate for explaining wine recommendations. These methods are optimally combined in the soMLier system. Though it would benefit from more explicit user data to establish a richer model of user preferences, soMLier can assist consumers in discovering wines they will likely enjoy and understanding their preferences of SA wine. Abbreviations: SA: South Africa(n); RS: Recommender System(s); IBCF: Item-based Collaborative Filtering; CB: Content-Based; MF: Matrix Factorisation; RMSE: Root Mean Square Error; COV: Coverage; PER: Personalisation; ARHR: Average Reciprocal Hit Rate
Preprint
Full-text available
Objectives: Although specifiers for a major depressive episode (MDE) are supposed to reduce diagnostic heterogeneity, recent literature challenges the idea that the atypical and melancholic features identify more homogeneous or coherent subgroups. We attempt to replicate these findings and explore whether symptom heterogeneity is reduced in depression subgroups using novel data-analytic techniques. Methods: Using data derived from the National Epidemiological Survey on Alcohol and Related Conditions (NESARC Wave I; N = 5,749) and Sequenced Treatment Alternatives to Relieve Depression (STAR*D; N = 2,498), we computed Hamming and Manhattan distance ratios comparing within- and between-individual distances for the melancholic and atypical specifier subgroups. Results: In neither of the datasets was the heterogeneity between subgroups higher than the heterogeneity within subgroups, suggesting that the melancholic and atypical specifiers do not create more coherent (i.e., more homogeneous) subgroups. Conclusion: Replicating prior work, melancholic and atypical depression subtypes appear to have limited utility in reducing heterogeneity. The current study does not support the claim that symptom and course specifiers create more coherent subgroups, as operationalized by similarity in symptoms and their severity.
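The exact statistic is described in the paper rather than on this page; as one plausible reading, the sketch below computes the ratio of mean within-subgroup to mean between-subgroup Hamming distance over binary symptom profiles, where a ratio at or above 1 indicates that the split does not yield more homogeneous subgroups. All names and the toy data are illustrative assumptions, not the study's materials.

```python
from itertools import combinations, product

def hamming(u, v):
    """Number of symptom indicators on which two binary profiles differ."""
    return sum(a != b for a, b in zip(u, v))

def within_between_ratio(group_a, group_b):
    """Mean pairwise Hamming distance within the two subgroups divided by
    the mean pairwise distance between them (an assumed formulation)."""
    within = [hamming(u, v) for g in (group_a, group_b)
              for u, v in combinations(g, 2)]
    between = [hamming(u, v) for u, v in product(group_a, group_b)]
    return (sum(within) / len(within)) / (sum(between) / len(between))

# Toy 4-symptom binary profiles for two hypothetical specifier subgroups
melancholic = [(1, 1, 0, 0), (1, 0, 0, 0), (1, 1, 1, 0)]
atypical = [(0, 0, 1, 1), (0, 1, 1, 1), (1, 0, 1, 1)]
print(within_between_ratio(melancholic, atypical))  # ~0.46 on this toy data
```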
Conference Paper
Full-text available
Knowledge of the structural organization of information in documents can be of significant assistance to information systems that use documents as their knowledge bases. In particular, such knowledge is of use to information retrieval systems that retrieve documents in response to user queries. This chapter presents an approach to mining free-text documents for structure that is qualitative in nature. It complements the statistical and machine-learning approaches, insomuch as the structural organization of information in documents is discovered through mining free text for content markers left behind by document writers. The ultimate objective is to find scalable data mining (DM) solutions for free-text documents in exchange for modest knowledge-engineering requirements. The problem of mining free text for structure is addressed in the context of finding structural components of files of frequently asked questions (FAQs) associated with many USENET newsgroups. The chapter describes a system that mines FAQs for structural components. The chapter concludes with an outline of possible future trends in the structural mining of free text.
Article
From the Publisher: A straightforward, practical examination of the fundamentals of computer vision using a minimum of mathematics. Concentrates on explanation, illustration, implementation and the various types of vision imaging problems, including grey-level images, recognizing objects, computer-readable codes, scientific images, etc. Contains authentic examples in C from a variety of disciplines, as well as immediate access to images with which users can test ideas and software.
Article
Keyword search of multimedia collections lacks precision and automatic parsing of unrestricted natural language annotations lacks accuracy. We propose a structure for natural language descriptions of the semantic content of visual materials that requires descriptions to be (modified) keywords, phrases, or simple sentences, with components that are grammatical relations common to many languages. This structure makes it easy to implement a collection's descriptions as a relational database, enabling efficient search via the application of well-developed database-indexing methods. Description components may be elements from external resources (thesaurus, ontology, database, or knowledge base). This provides a rich superstructure for the meaningful retrieval of images by their semantic contents.
Conference Paper
The advent of the CD-ROM as a means of distributing massive bodies of textual data increases the importance of developing automatic techniques for textual analysis. To accomplish this task, we should be alert to existing techniques, perhaps developed for other purposes, that can be of value. We report here on observations, made while carrying out research on information storage and retrieval, that promise to be helpful. Specifically, auxiliary information and data structures created incidentally during our IR investigations are rich in semantic content, and can be useful in suggesting or confirming relations among concepts in text. Two examples are given: one based on a term weighting scheme for IR, the other on a tree structure for compressing bitmaps.
Conference Paper
Many problems depend on a reliable measure of the distance or similarity between objects that, frequently, are represented as vectors. We consider here vectors that can be expressed as bit sequences. For such problems, the most heavily used measure is the Hamming distance, perhaps normalized. The value of the Hamming distance is limited by the fact that it counts only exact matches, whereas in various applications, corresponding bits that are close by, but not exactly matched, can still be considered to be almost identical. We here define a "fuzzy Hamming distance" that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently. We envision many uses for such a measure.
Article
Bitmaps are data structures occurring often in information retrieval. They are useful, but are also large and expensive to store. For this reason, considerable effort has been devoted to finding techniques for compressing them. These techniques are most effective for sparse bitmaps. We propose a preprocessing stage, in which bitmaps are first clustered and the clusters used to transform their member bitmaps into sparser ones, which can be more effectively compressed. The clustering method efficiently generates a graph structure on the bitmaps. In some situations, it is desired to impose restrictions on the graph; finding the optimal graph satisfying these restrictions is shown to be NP-complete. The results of applying our algorithm to the Bible are presented: for some sets of bitmaps, our method almost doubled the compression savings.
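The clustering and transformation details are given in the paper itself; as a rough illustration of the general idea only, the sketch below XORs each bitmap in a cluster against a simple representative, so that only the (ideally much sparser) differences need to be compressed. The majority-vote representative is a simplified stand-in, not the authors' graph-based construction.

```python
def sparsify_cluster(bitmaps):
    """Illustrative transform: take the bitwise-majority bitmap of a
    cluster as its representative, then replace each member by its XOR
    difference from it. Members close to the representative become much
    sparser, and each original is recoverable as representative XOR diff."""
    n = len(bitmaps[0])
    # Majority vote per position serves as a simple cluster representative
    rep = tuple(int(2 * sum(bm[i] for bm in bitmaps) > len(bitmaps))
                for i in range(n))
    diffs = [tuple(b ^ r for b, r in zip(bm, rep)) for bm in bitmaps]
    return rep, diffs

cluster = [(1, 1, 0, 1, 0, 1), (1, 1, 0, 0, 0, 1), (1, 0, 0, 1, 0, 1)]
rep, diffs = sparsify_cluster(cluster)
print(rep)    # (1, 1, 0, 1, 0, 1)
print(diffs)  # each difference here has at most one set bit
```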