Using Hybrid Semantic Similarity Methods
when Examining Corpora with Limited Content
Dorian Cougias, Steven Piliero
{dcougias, spiliero}@unifiedcompliance.com
March 7, 2023
Abstract
Semantic similarity is ever evolving. It has roots in similarity and dissimilarity measures used in
data science and has grown into four major categories of employable methodologies. There is no single
correct answer for employing any of the similarity methods. Employment should be based on the
corpora at hand coupled with the desired outcome the researcher is looking for. This paper presents
the various methods, broken down by category, along with associated research papers for each method
to be used by semantic similarity practitioners. It concludes with a single use case (our organization)
and why we chose the methods we are employing to date.
1. Semantic Similarity Methods
Four major categories of semantic similarity methods can be deployed, all of which have their roots in the analysis of similarity and dissimilarity measures used in data science [1]. An insanely simple article on sentence similarity can be found on the Hugging Face website [2]. How semantic similarity has evolved over the years (at least up until 2020) is covered in a very well-done survey by Dhivya Chandrasekaran and Vijay Mago [3], as well as in Roberto Navigli and Federico Martelli's overview of word and sense similarity [4]. What we present here is the latest knowledge of the various semantic similarity methods, divided into the four categories shown below.
1. Knowledge-based methods exploit the underlying ontologies to disambiguate synonyms.
2. Corpus-based methods are versatile, as they can be used across languages.
3. Deep neural network-based methods, though computationally expensive, provide better results.
4. Finally, hybrid methods attempt to overcome the limitations inherent in the other approaches.
Figure: Semantic similarity methods
A. Knowledge-based Methods
There are four major knowledge-based methods, as described below:
Edge-counting methods [5]
Feature-counting methods [6]
Information-content based methods [7]
Combined knowledge-based methods [8]

[1] A good overview of the basics of similarity and dissimilarity measures can be found in Harmouch, “17 Types of Similarity and Dissimilarity Measures Used in Data Science.”
[2] “What Is Sentence Similarity?”
[3] Chandrasekaran and Mago, “Evolution of Semantic Similarity -- A Survey.”
[4] Navigli and Martelli, “An Overview of Word and Sense Similarity.”
[5] Achananuparp, Hu, and Shen, “The Evaluation of Sentence Similarity Measures”; Gao, Zhang, and Chen, “A Simplified Edge-Counting Method for Measuring Semantic Similarity of Concepts”; K, Shet, and Acharya, “A New Similarity Measure for Taxonomy Based on Edge Counting.”
Knowledge-based semantic similarity methods use external knowledge sources, such as ontologies, thesauri, or
word co-occurrence statistics to measure the similarity between the meanings of words or phrases. There are several
types of knowledge-based methods, including:
Edge-counting methods: These methods measure semantic similarity by counting the number of edges
(relationships) in an ontology that must be traversed to get from one word to the other. The fewer edges
separating two words, the more similar they are considered to be.
Feature-counting methods: These methods use predefined features to measure the similarity between words or
phrases. For example, words that share a common parent node in an ontology are considered to be similar.
Information-content-based methods: These methods use information-theoretic measures, such as entropy, to
measure the similarity between words or phrases. For example, two words whose closest shared concept is
highly specific are considered more similar than two words that share only a very general concept.
Combined knowledge-based methods: These methods use multiple knowledge-based methods, as described in
the other bullets, to measure semantic similarity. This approach can often produce more accurate results, as it
incorporates the strengths of different techniques into a single score.
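To make the edge-counting idea concrete, here is a minimal sketch in Python, assuming a hand-built toy taxonomy (the terms and hierarchy are invented for illustration, not drawn from any real ontology):

```python
from collections import deque

# Toy is-a taxonomy: each term maps to its parent concept (invented).
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal",
}

def path_length(a, b):
    """Shortest number of edges between two terms in the taxonomy."""
    # Build an undirected adjacency list from the parent links.
    adj = {}
    for child, par in PARENT.items():
        adj.setdefault(child, set()).add(par)
        adj.setdefault(par, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

def edge_similarity(a, b):
    """Path-based similarity: fewer connecting edges -> higher score."""
    d = path_length(a, b)
    return None if d is None else 1.0 / (1.0 + d)

print(edge_similarity("dog", "wolf"))     # 2 edges via "canine" -> 1/3
print(edge_similarity("dog", "sparrow"))  # 5 edges via "animal" -> 1/6
```

Note the transform at the end: since a shorter path means greater similarity, a function such as 1/(1 + d) converts the edge count into a similarity score.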
Knowledge-based methods are generally effective for measuring semantic similarity in structured domains, such as
the biomedical or legal domains, where the relationships between concepts can be well-defined. However, these
methods can be limited by the quality and coverage of external knowledge sources.
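The information-content idea can be sketched the same way. Following a Resnik-style measure, similarity is the information content -log p(c) of the most specific concept that subsumes both words (the taxonomy and corpus counts below are invented for illustration):

```python
import math

# Toy taxonomy (child -> parent) and invented corpus counts per concept.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
          "sparrow": "animal"}
COUNTS = {"dog": 20, "cat": 20, "sparrow": 10, "mammal": 10, "animal": 5}

def ancestors(c):
    """The concept itself followed by its ancestors, most specific first."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def prob(c):
    """p(c): share of corpus mass at c or anywhere beneath it."""
    total = sum(COUNTS.values())
    mass = sum(n for term, n in COUNTS.items() if c in ancestors(term))
    return mass / total

def resnik_similarity(a, b):
    """-log p(LCS): higher when the shared ancestor is more specific."""
    shared = [c for c in ancestors(a) if c in set(ancestors(b))]
    lcs = shared[0]  # first shared ancestor is the most specific one
    return -math.log(prob(lcs))
```

Here "dog" and "cat" share the fairly specific subsumer "mammal" and score well; "dog" and "sparrow" share only the root "animal", whose probability is 1, so their score is 0.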
Knowledge-based semantic similarity models rely on structured semantic network knowledge, such as WordNet or
a knowledge graph, to measure the similarity between words, concepts, sentences, or documents. Some of the pros
and cons of these models are:
Pros
They can capture the semantic relations and meanings of words and concepts that are not lexicographically
similar.
They can explain how two terms are similar with the help of rich knowledge in the semantic network.
They can use the semantic network's structural knowledge, such as depth, path length, least common subsumer,
and information content, to compute similarity scores.
Cons
They depend on the quality and coverage of the semantic network, which may not be complete or consistent.
They may not capture similarity's contextual or pragmatic aspects, such as word sense disambiguation, domain
specificity, or user preferences.
They may not reflect the current usage or trends of words and concepts, as the semantic network may not be
updated frequently.
[6] Bullock, “Machine Learning Foundations”; Rydbeck et al., “ClusTrack”; You et al., “Few-Shot Object Counting with Similarity-Aware Feature
Enhancement”; Zhang et al., “An Adaptive Method for Organization Name Disambiguation with Feature Reinforcing.”
[7] Mohd, “Information Content Based Semantic Similarity Measure”; Pedersen, “Information Content Measures of Semantic Similarity Perform
Better Without Sense-Tagged Text”; Sánchez and Batet, “A Semantic Similarity Method Based on Information Content Exploiting Multiple
Ontologies”; Zhang, Sun, and Zhang, “An Information Content-Based Approach for Measuring Concept Semantic Similarity in WordNet”;
“Issues in Crosswalking Content Metadata Standards - National Information Standards Organization.”
[8] Gonçalo Oliveira, “Distributional and Knowledge-Based Approaches for Computing Portuguese Word Similarity”; Stefanescu et al.,
“Combining Knowledge and Corpus-Based Measures for Word-to-Word Similarity”; Yan and Webster, “A Corpus-Based Approach to Linguistic
Function”; Rada Mihalcea, Carlo Strapparavay, and Courtney Corle, “Corpus-Based and Knowledge-Based Measures of Text Semantic
Similarity.”
B. Corpus-Based Methods
There are a great many models for corpus-based semantic similarity methods, as shown below:
Latent Semantic Analysis (LSA) [9]
Hyperspace Analogue to Language (HAL) [10]
Explicit Semantic Analysis (ESA) [11]
Word-alignment models [12]
Latent Dirichlet Allocation (LDA) [13]
Normalized Google Distance (NGD) [14]
Dependency-based models [15]
Kernel-based models [16]
Word-attention models [17]
[9] Anna Rozeva and Silvia Zerkova, “Assessing Semantic Similarity of Texts Methods and Algorithms Including Latent Semantic Analysis”;
Doshi, “Latent Semantic Analysis Deduce the Hidden Topic from the Document”; Babcock, Ta, and Ickes, “Latent Semantic Similarity and
Language Style Matching in Initial Dyadic Interactions”; Simmons et al., “Using Latent Semantic Analysis to Estimate Similarity”; Suleman and
Korkontzelos, “Extending Latent Semantic Analysis to Manage Its Syntactic Blindness.”
[10] Hyperspace Analogue to Language Algorithm - GM-RKB; Wu et al., Using an Analogical Reasoning Framework to Infer Language
Patterns for Negative Life Events; Yan et al., Event-Based Hyperspace Analogue to Language for Query Expansion; KEVIN LUND and
CURT BURGESS, Producing High-Dimensional Semantic Spaces from Lexical Co-Occurrence.
[11] Scholl et al., Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources; dramé, Mougin, and Diallo,
Large Scale Biomedical Texts Classification; Oosten, Wikipedia-Based Explicit Semantic Analysis; Mrhar and Abik, Towards Optimize-
ESA for Text Semantic Similarity; bogatron, Answer to What Is the Difference between Latent and Explicit Semantic Analysis.’”
[12] Songyot and Chiang, Improving Word Alignment Using Word Similarity; Kosar, Word Alignment for Sentence Similarity; Vinh-Trung
Luu et al., A Review of Alignment Based Similarity Measures for Web Usage Mining; Der and Tiedemann, Finding Synonyms Using
Automatic Word Alignment and Measures of Distributional Similarity; Ismail, Shishtawy, and Alsammak, A New Alignment Word-Space
Approach for Measuring Semantic Similarity for Arabic Text.
[13] Rus, Niraula, and Banjade, Similarity Measures Based on Latent Dirichlet Allocation; Tiedan Zhu and Kan Li, The Similarity Measure
Based on LDA for Automatic Summarization; galoosh33, Answer to Using LDA to Calculate Similarity’”; W. BEN TOWNE, CAROLYN P.
ROSÉ, and JAMES D. HERBSLEB, Measuring Similarity Similarly: LDA and Human Perception; Micheal Olalekan Ajinaja et al., Semantic
Similarity Measure for Topic Modeling Using Latent Dirichlet Allocation and Collapsed Gibbs Sampling; Niraula et al., Experiments with
Semantic Similarity Measures Based on LDA and LSA; Toni Cvitanic et al., LDA v. LSA: A Comparison of Two Computational Text
Analysis Tools for the Functional Categorization of Patents; Garbhapu and Bodapati, A Comparative Analysis of Latent Semantic Analysis and
Latent Dirichlet Allocation Topic Modeling Methods Using Bible Data.
[14] Cilibrasi and Vitanyi, The Google Similarity Distance; Gligorov et al., Using Google Distance to Weight Approximate Ontology Matches;
Lopes and Moura, Normalized Google Distance in the Identification and Characterization of Health Queries; Hamza Osman and Abobieda,
An Adaptive Normalized Google Distance Similarity Measure for Extractive Text Summarization; Rudi L. Cilibrasi and Paul M.B. Vitanyi,
Normalized Web Distance and Word Similarity; Create a Manual Similarity Measure | Machine Learning; Measuring Similarity from
Embeddings | Machine Learning; Hong Nhung BUI, Quang-Thuy HA, and Tri-Thanh NGUYEN, AN NOVEL SIMILARITY MEASURE
FOR TRACE CLUSTERING BASED ON NORMALIZED GOOGLE DISTANCE.
[15] Özateş, Özgür, and Radev, Sentence Similarity Based on Dependency Tree Kernels for Multi-Document Summarization; Campiteli et al., A
Reliable Measure of Similarity Based on Dependency for Short Time Series; Calvo, Segura-Olivares, and García, Dependency vs. Constituent
Based Syntactic N-Grams in Text Similarity Measures for Paraphrase Recognition; Marek Rei, Minimally Supervised Dependency-Based
Methods for Natural Language Processing; Özateş, Özgür, and Radev, Sentence Similarity Based on Dependency Tree Kernels for Multi-
Document Summarization; Liu and El-Gohary, Similarity-Based Dependency Parsing for Extracting Dependency Relations from Bridge
Inspection Reports; Chen and Liu, Automated Scoring System Using Dependency-Based Weighted Semantic Similarity Model; Richa Dhagat,
Arpana Rawal, and Sunita Soni, Comparison of Free Text Semantic Similarity Measures Using Dependency Relations; Papers with Code -
Sentence Similarity Based on Dependency Tree Kernels for Multi-Document Summarization.
[16] Similarity and Dissimilarity Metrics - Kernel Distance; Nathan Srebro, How Good Is a Kernel When Used as a Similarity Measure?; Zhao
Kang, Chong Peng, and Qiang Cheng, Kernel-Driven Similarity Learning; Andre T. Martins, Mario A. T. Figueiredo, and Pedro M. Q. Aguiar,
KERNELS AND SIMILARITY MEASURES FOR TEXT CLASSIFICATION; Zhao Kang et al., Similarity Learning via Kernel Preserving
Embedding; Niazmardi, Safari, and Homayouni, Similarity-Based Multiple Kernel Learning Algorithms for Classification of Remotely Sensed
Images; Slimene and Zagrouba, Overlapping Area Hyperspheres for Kernel-Based Similarity Method; Maria-Florina Balcan and Avrim Blum,
On a Theory of Learning with Similarity Functions; Chen, Wang, and Hu, Kernel-Based Similarity Learning.
[17] Wang et al., Sentence Similarity Learning Method Based on Attention Hybrid Model; Ji and Zhang, A Short Text Similarity Calculation
Method Combining Semantic and Headword Attention Mechanism; Lopez-Gazpio et al., Word N-Gram Attention Models for Sentence
Similarity and Inference; Shen, Lai, and Mohaghegh, Effects of Similarity Score Functions in Attention Mechanisms on the Performance of
Neural Question Answering Systems; Zhuang and Chang, Neobility at SemEval-2017 Task 1; Sonkar, Waters, and Baraniuk, Attention Word
Embedding.”; Lopez-Gazpio et al., Word N-Gram Attention Models for Sentence Similarity and Inference.
All of the above corpus-based semantic similarity methods use large amounts of text data to measure the similarity
between the meanings of words or phrases. These methods do not rely on external knowledge sources, but instead,
use the relationships between words that can be derived from the text. Some standard corpus-based methods for
measuring semantic similarity include:
Latent Semantic Analysis (LSA): LSA is a method that uses singular value decomposition to identify the
underlying latent concepts in a large corpus of text. Semantic similarity between words is then measured based
on the cosine similarity between their respective latent concept representations.
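As a minimal sketch of the LSA pipeline (the tiny term-document matrix below is invented for illustration), truncated SVD projects terms into a latent-concept space and cosine similarity is computed there:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# d1 and d2 are about cars; d3 is about fruit. Counts are invented.
terms = ["car", "auto", "engine", "banana"]
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

# Truncated SVD: keep the k strongest latent concepts.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # each row is a term's latent-concept vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vec = dict(zip(terms, term_vecs))
print(cosine(vec["car"], vec["auto"]))    # high: same latent topic
print(cosine(vec["car"], vec["banana"]))  # near zero: different topic
```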
Hyperspace Analogue to Language (HAL): HAL builds a high-dimensional co-occurrence matrix by sliding a
fixed-width window over a corpus, weighting nearby words more heavily than distant ones. Each word's vector
is its row (and column) of this matrix, and the similarity between words is then calculated from the distance
or cosine between their vectors.
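A simplified, direction-agnostic sketch of the HAL idea (real HAL keeps separate left and right co-occurrence counts; the toy corpus and window size are invented):

```python
import math

# HAL-style sliding-window co-occurrence vectors (window of 2 words).
corpus = "the cat chased the mouse and the dog chased the cat".split()
WINDOW = 2
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
vectors = {w: [0.0] * len(vocab) for w in vocab}

for i, w in enumerate(corpus):
    lo, hi = max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)
    for j in range(lo, hi):
        if i != j:
            # Closer neighbours get higher weight, as in HAL.
            vectors[w][index[corpus[j]]] += 1.0 / abs(i - j)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# "cat" and "dog" occur in near-identical contexts, so their vectors align.
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["and"]))
```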
Explicit Semantic Analysis (ESA): ESA represents a word as a weighted vector of explicit concepts, typically
the Wikipedia articles in which the word occurs, and measures the similarity between words as the cosine
between their concept vectors.
Word-alignment models: These models use word-to-word alignments derived from parallel corpora to measure
the similarity between words in different languages.
Latent Dirichlet Allocation (LDA): LDA is a topic modeling technique that can identify the underlying topics
in a corpus of text. The similarity between words can then be measured based on the degree of overlap between
the associated topics.
Normalized Google Distance (NGD): NGD is a method that uses search-engine page counts to estimate the
similarity between words. The method is based on the idea that the more similar two words are, the more often
they co-occur on the same web pages.
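As a sketch of the formula, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts pages containing the term(s) and N is the total number of indexed pages. The hit counts below are invented stand-ins for real search-engine counts:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance: 0 = always together, larger = less related."""
    lfx, lfy, lfxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lfx, lfy) - lfxy) / (math.log(n) - min(lfx, lfy))

# "car"/"automobile" co-occur on many pages; "car"/"banana" on very few.
related = ngd(fx=1_000_000, fy=800_000, fxy=500_000, n=1_000_000_000)
unrelated = ngd(fx=1_000_000, fy=900_000, fxy=1_000, n=1_000_000_000)
print(related < unrelated)  # True: smaller distance for the related pair
```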
Dependency-based models: These models use the dependency relationships between words in a sentence to
measure their similarity. For example, two words that share a common dependency relationship, such as a
subject-verb relationship, are considered more similar than two words that do not share a relationship.
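A toy sketch of scoring sentence pairs by their shared dependency triples (the triples are written by hand here; a real system would obtain them from a dependency parser, which this sketch assumes away):

```python
def dep_similarity(deps_a, deps_b):
    """Jaccard overlap of (head, relation, dependent) triples."""
    sa, sb = set(deps_a), set(deps_b)
    return len(sa & sb) / len(sa | sb)

# Hand-written dependency triples for three short sentences.
s1 = [("chased", "nsubj", "dog"), ("chased", "obj", "cat")]
s2 = [("chased", "nsubj", "dog"), ("chased", "obj", "mouse")]
s3 = [("rose", "nsubj", "stocks")]
print(dep_similarity(s1, s2))  # 1/3: one shared triple out of three
print(dep_similarity(s1, s3))  # 0.0: nothing in common
```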
Kernel-based models: These models use kernel functions to measure word similarity based on their co-
occurrence patterns in the corpus.
Word-attention models: These models use attention mechanisms to weigh the contribution of different words
to the similarity score between two phrases.
Corpus-based methods can scale to vast text corpora and can be trained on any language for which there is a large
corpus of text available. However, these methods are limited by the quality and domain-specificity of the text corpus
used.
Corpus-based semantic similarity models are methods that use statistical information from extensive collections of
texts to measure the similarity of words or texts based on their co-occurrence patterns. Some examples of corpus-
based models are word embeddings, such as Word2Vec, GloVe, fastText, etc., and vector space models, such as TF-
IDF, LSI, etc.
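A minimal TF-IDF vector-space sketch of this co-occurrence idea (the three toy documents are invented):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the cat chased a mouse",
        "stocks rallied as markets opened"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def tfidf(doc):
    """TF-IDF vector over the shared vocabulary for one tokenized document."""
    tf = Counter(doc)
    vec = []
    for w in vocab:
        df = sum(1 for d in tokenized if w in d)  # document frequency
        idf = math.log(len(tokenized) / df) if df else 0.0
        vec.append(tf[w] * idf)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vecs = [tfidf(d) for d in tokenized]
# The two cat sentences share terms; the finance sentence shares none.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```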
Some pros and cons of corpus-based semantic similarity models are:
Pros
They can capture the contextual and pragmatic aspects of word meaning, such as synonyms, antonyms,
collocations, etc.
They can be easily applied to different domains and languages as long as enough corpus data is available.
They can be combined with other sources of information, such as lexical taxonomies, to improve the accuracy
and coverage of similarity measures.
Cons
They require extensive and representative corpora to be reliable and robust, which may not always be available
or accessible.
They may not account for semantic relations that are not reflected in co-occurrence, such as hypernymy,
meronymy, etc.
They may be sensitive to noise, ambiguity, and variability in the corpus data, which may affect the quality of
the similarity measures.
C. Deep Neural Network-Based Methods
Convolutional Neural Network-based model (CNN) [18]
Long short-term memory-based model (LSTM) [19]
Bi-LSTM-based model [20]
Combined Neural Network model [21]
Transformer-based model [22]
Deep neural network-based semantic similarity methods leverage the advancements in deep learning to measure
the similarity between two or more words, phrases, or sentences. These methods are based on training large neural
networks on large amounts of text data to learn the underlying semantic representations of words.
Convolutional Neural Network-based model (CNN): CNN-based models use convolutional filters to identify
the meaningful patterns and relationships between words in a sentence. The filters are trained to extract local
context representations that capture the semantic meaning of the words.
Long short-term memory-based model (LSTM): LSTM-based models are designed to handle sequential data
using memory cells to store information about the sequence. The model can be trained to understand the context
and meaning of words in a sentence, making it well-suited for semantic similarity tasks.
Bi-LSTM-based model: Bi-LSTM-based models are an extension of LSTM-based models that process input
sequences in both directions, resulting in a more comprehensive understanding of the context and meaning of
words.
Combined Neural Network model: Combined Neural Network models utilize a combination of different neural
network architectures, such as CNN and LSTM, to model the semantic representations of words. This approach
allows the model to leverage the strengths of different network architectures to produce more robust and
accurate results.
Transformer-based model: Transformer-based models, such as BERT, are pre-trained on large amounts of text
data and capture the relationships between words in a sentence via a self-attention mechanism. These models
have achieved state-of-the-art results in various NLP tasks, including semantic similarity.
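In practice, sentence-level similarity with these models comes down to pooling token embeddings and comparing with cosine similarity. A dependency-free sketch of just that comparison step (the toy token embeddings below are invented; in a real pipeline they would come from a transformer such as BERT):

```python
import numpy as np

# Invented token embeddings for three sentences (rows = tokens).
sent_a = np.array([[1.0, 0.2, 0.0], [0.8, 0.1, 0.1], [1.1, 0.3, 0.0]])
sent_b = np.array([[0.9, 0.25, 0.05], [1.0, 0.2, 0.0]])   # near-paraphrase of a
sent_c = np.array([[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]])     # unrelated topic

def embed(tokens):
    """Mean-pool token embeddings into a single sentence vector."""
    return tokens.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ea, eb, ec = embed(sent_a), embed(sent_b), embed(sent_c)
print(cosine(ea, eb) > cosine(ea, ec))  # True: the paraphrase pair is closer
```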
[18] Moitreya Chatterjee and Yunan Luo, Similarity Learning with (or without) Convolutional Neural Network; Zhang, Liang, and Lin, Sentence
Similarity Measurement with Convolutional Neural Networks Using Semantic and Syntactic Features; Zhou, Comparison of CNN Based and
Self-Similarity Based Denoising Methods; Saraswathi et al., A Hybrid Multi-Feature Semantic Similarity Based Online Social
Recommendation System Using CNN.
[19] Zhao et al., Re-LSTM; LSTM | Introduction to LSTM | Long Short Term Memory Algorithms; Lin et al., LSTM Based Similarity
Measurement with Spectral Clustering for Speaker Diarization; Sebastian Bångerius, LSTM Feature Engineering Through Time Series
Similarity Embedding; Chi and Zhang, A Sentence Similarity Estimation Method Based on Improved Siamese Network.
[20] Zhang et al., Text Similarity Measurement Method Based on BiLSTM-SECapsNet Model; Viji and Revathy, A Hybrid Approach of
Weighted Fine-Tuned BERT Extraction with Deep Siamese Bi LSTM Model for Semantic Text Similarity Identification”; Yao, Pan, and Ning,
Unlabeled Short Text Similarity With LSTM Encoder; Zhang, Zhu, and He, Hierarchical Attention-Based BiLSTM Network for Document
Similarity Calculation; Zongkui Zhu et al., A Semantic Similarity Computing Model Based on Siamese Network for Duplicate Questions
Identification.
[21] Moitreya Chatterjee and Yunan Luo, Similarity Learning with (or without) Convolutional Neural Network; Pravallika Mallela, Analysis of
Learning Algorithms for the Similarity Neural Network; Supervised Similarity Measure | Machine Learning; Ranasinghe, Orasan, and Mitkov,
Semantic Textual Similarity with Siamese Neural Networks; Kulmanov et al., Semantic Similarity and Machine Learning with Ontologies.
[22] Cheng, Semantic Similarity Using Transformers; Semantic Textual Similarity Sentence-Transformers Documentation; Semantic
Similarity With Sentence Transformers; Nemani and Vollala, A Cognitive Study on Semantic Similarity Analysis of Large Corpora;
Sentence Transformers and Embeddings; Mastering Sentence Transformers For Sentence Similarity Predictive Hacks; Papers with Code -
Sentence Similarity Based on Dependency Tree Kernels for Multi-Document Summarization; Measuring Text Similarity Using BERT -
Analytics Vidhya.
Deep neural network-based semantic similarity models use multiple layers of non-linear transformations to learn
high-level representations of words or texts and measure their similarity based on their proximity in a semantic space.
Some examples of deep neural network-based models are DSSM, Siamese network, BiLSTM, etc.
Some pros and cons of deep neural network-based semantic similarity models are:
Pros
They can learn complex and abstract word or text meaning features, such as syntactic, semantic, and pragmatic
aspects.
They can handle large amounts of data and learn from various sources of information, such as word
embeddings, lexical taxonomies, etc.
They can achieve state-of-the-art performance on semantic similarity tasks, such as semantic role labeling,
paraphrase identification, textual entailment, etc.
Cons
They require a lot of computational resources and time to train and optimize, which may not always be feasible
or efficient.
Depending on the quality and quantity of training data and the network architecture, they may suffer from
overfitting, underfitting, or generalization problems.
They may not be interpretable or explainable, as the internal representations and decisions of the network are
often opaque and complex.
D. Hybrid Methods
Unified Compliance Hybrid AI
Novel Approach to a Semantically-Aware Representation of Items (NASARI) [23]
Most Suitable Sense Annotation (MSSA) [24]
Unsupervised Ensemble Semantic Textual Similarity Methods (UESTS) [25]
Hybrid semantic similarity methods combine the strengths of different approaches to measure semantic similarity
between words, phrases, or sentences. These methods leverage the advantages of knowledge-based methods, corpus-
based methods, and deep neural network-based methods to achieve more accurate and robust results.
Unified Compliance Hybrid AI: This is our approach, and as such, is covered in the section below entitled
“Unified Compliance’s Hybrid Semantic Similarity AI.”
Novel Approach to a Semantically-Aware Representation of Items (NASARI): NASARI is a hybrid method that
combines knowledge-based and corpus-based methods to create a dense vector representation of words that
captures their semantic meaning. The representation is trained on a large corpus of text data to capture the
relationships between words in context.
Most Suitable Sense Annotation (MSSA): MSSA is a hybrid method that combines knowledge-based methods
and corpus-based methods to annotate words with the most appropriate sense for the context. The method uses
knowledge-based methods to determine the most appropriate sense for a word and corpus-based methods to
validate the sense in the sentence context.
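The sense-selection step at the heart of such annotation can be sketched with a simplified Lesk-style overlap (the toy sense inventory and glosses below are invented, and real systems use far richer signals):

```python
def lesk_sense(word, context, sense_inventory):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = set(context.split())
    best, best_overlap = None, -1
    for sense, gloss in sense_inventory[word].items():
        overlap = len(ctx & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Invented two-sense inventory for "bank".
senses = {
    "bank": {
        "finance": "an institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    }
}
print(lesk_sense("bank", "we walked along the river to the water", senses))
# -> "river"
```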
[23] Camacho-Collados, Pilehvar, and Navigli, NASARI; Zhang et al., A Novel Word Similarity Measure Method for IoT-Enabled Healthcare
Applications; Nemika Tyagi et al., Word Sense Disambiguation Models Emerging Trends: A Comparative Analysis.
[24] Ruas, Grosky, and Aizawa, Multi-Sense Embeddings through a Word Sense Disambiguation Process; Iacobacci, Pilehvar, and Navigli,
SensEmbed; Erbs, Gurevych, and Zesch, Sense and Similarity; Ramakrishnan B. Guru, A Similarity Based Concordance Approach to Word
Sense Disambiguation; Dagan, Lee, and Peteira, Similarity-Based Methods For Word Sense Disambiguation; Yael Karov and Shimon
Edelman, Similarity-Based Word Sense Disambiguation; Bhagwan Parshuram Institute of Technology, India et al., WORD SENSE
DISAMBIGUATION METHOD USING SEMANTIC SIMILARITY MEASURES AND OWA OPERATOR.
[25] Hassan et al., UESTS; Poerner, Waltinger, and Schütze, Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity;
Kohail, Salama, and Biemann, STS-UHH at SemEval-2017 Task 1.
Unsupervised Ensemble Semantic Textual Similarity Methods (UESTS): UESTS is a hybrid method that
combines multiple corpus-based methods to measure semantic similarity. The method aggregates the results of
multiple similarity measures to produce a final score that captures the semantic similarity between two words,
phrases, or sentences.
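The ensemble idea can be sketched as a weighted aggregation of several base similarity scores (the component scorers and weights below are illustrative stand-ins, not the actual UESTS components):

```python
def jaccard(a, b):
    """Token-overlap similarity between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def length_ratio(a, b):
    """Crude structural similarity: ratio of sentence lengths."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb)

def ensemble_similarity(a, b, scorers=(jaccard, length_ratio), weights=(0.7, 0.3)):
    """Weighted average of the base scores, as an ensemble would combine them."""
    return sum(w * s(a, b) for s, w in zip(scorers, weights))

s1 = ensemble_similarity("the cat sat on the mat", "the cat sat on a mat")
s2 = ensemble_similarity("the cat sat on the mat", "markets rallied sharply today")
print(s1 > s2)  # True: the paraphrase pair scores higher
```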
Hybrid semantic similarity models combine different sources of information, such as corpus-based, knowledge-
based, or geometric-based, to measure the similarity of words or texts based on their features, relations, and positions
in a semantic space. Some examples of hybrid semantic similarity models are HydMethod, HSSM, and HSM.
Some pros and cons of hybrid semantic similarity models are:
Pros
They can provide interpretable and explainable results.
They can leverage the strengths and overcome the weaknesses of different models, such as feature-based,
distance-based, or network-based.
They can provide a richer and more comprehensive representation of word or text meaning, considering
various aspects, such as properties, relations, types, ranges, etc.
They can improve the accuracy and robustness of similarity measures, especially for complex and
heterogeneous domains, such as geospatial entities.
Cons
They may require more computational resources and time to integrate and process different sources of
information, which may not always be available or efficient.
Depending on the task and the data, they may face challenges in determining the optimal combination and
weighting of different models.
They may introduce noise or inconsistency in the similarity measures due to the variability and quality of
different sources of information.
2. Unified Compliance’s Hybrid Semantic Similarity AI
There are several reasons why Unified Compliance chose a hybrid semantic similarity AI solution using combined
knowledge-based methods, word attention corpus-based methods, and transformer-based deep machine learning
methods. Together, these represent the state-of-the-art method in each semantic similarity category.
Combined knowledge-based methods provide a flexible and adaptable solution that can be tailored to the
specific needs of each regulation or framework. Unlike statistical methods such as edge-counting, feature-
counting, and information-content-based methods, they rely on expert knowledge and domain-specific
expertise to identify semantic similarities, leading to more accurate and reliable results, and they can be
customized to a specific domain by leveraging the expertise of individuals with domain-specific knowledge.
Word attention corpus-based methods were chosen because they accurately capture the relationships
between words in a sentence, are more accurate and reliable than most other corpus-based methods, and are
more efficient and scalable.
Transformer-based deep machine learning methods were selected because they are highly effective at
capturing long-range dependencies in language data. They have also been shown to be more efficient and
scalable, and to generalize well to new data, thanks to self-attention mechanisms that allow the model to
focus on different parts of the input sequence during training.
Leveraging each of these methods in a hybrid semantic similarity AI solution enables us to overcome the limitations
of individual methods and combine their strengths to provide the best:
Accuracy: Each method has its strengths and weaknesses, and by combining them, Unified Compliance can
increase the accuracy of the crosswalking and harmonization process. Knowledge-based methods rely on expert
knowledge and domain expertise, while corpus-based methods rely on large bodies of text. Deep learning-based
methods like transformer-based models can capture complex relationships between words and phrases. By
using a hybrid approach, Unified Compliance can leverage each method’s strengths to increase its solution’s
accuracy.
Scalability: Each method has its limitations in terms of scalability. For example, knowledge-based methods
are limited by the individuals’ expertise, while corpus-based methods may not be accurate if the corpus is
incomplete or biased. Deep learning-based methods require large amounts of data and computational resources.
By using a hybrid approach, Unified Compliance can increase its solution’s scalability by leveraging each
method’s strengths.
Flexibility: Different regulations and frameworks may require different approaches to crosswalking and
harmonization. By using a hybrid approach, Unified Compliance can customize its solution to meet the specific
needs of each regulation or framework. For example, knowledge-based methods may be more appropriate for
regulations that require a high degree of domain expertise, while deep learning-based methods may be more
appropriate for regulations with complex relationships between words and phrases.
Explainability: Deep learning-based methods can be opaque, making it difficult to understand how the model
arrived at its decisions. By combining knowledge-based and corpus-based methods with deep learning-based
methods, Unified Compliance can create a more transparent and explainable solution. This can increase trust
and confidence in the compliance management process.
Future-proofing: The regulatory landscape is constantly changing, and new regulations and frameworks will
continue to emerge. Using a hybrid approach, Unified Compliance can future-proof its solution by adapting to
new regulations and frameworks. This can help organizations stay ahead of compliance requirements and
reduce the risk of non-compliance.
The combination of knowledge-based methods, corpus-based methods, and deep learning-based methods in a hybrid
solution creates a more flexible and adaptable solution that can meet the needs of different regulations and frameworks.
Knowledge-based methods are effective for regulations that require a high degree of domain expertise, while corpus-
based methods can be effective for regulations with a large amount of text. Deep learning-based methods are effective
for regulations with complex relationships between words and phrases.
In addition, by using a hybrid approach, the solution can be customized to meet the specific needs of each regulation
or framework. This customization can increase the accuracy and efficiency of the crosswalking and harmonization
process, reducing compliance costs and mitigating risks.
Finally, the hybrid approach provides an explainable and transparent solution that increases trust and confidence in
the compliance management process. This transparency is essential for regulatory compliance and ensures that
organizations understand how the solution arrived at its crosswalking and harmonization decisions.
The technical reasons for choosing a hybrid semantic similarity AI solution, then, are its flexibility, adaptability,
customizability, transparency, and explainability. By taking a hybrid approach, Unified Compliance can future-proof its
solution and provide a powerful tool for organizations looking to manage their compliance programs more effectively.
References
[1] Achananuparp, Palakorn, Xiaohua Hu, and Xiajiong Shen. “The Evaluation of Sentence Similarity
Measures.” In Data Warehousing and Knowledge Discovery, edited by Il-Yeol Song, Johann Eder, and
Tho Manh Nguyen, 305–16. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2008.
https://doi.org/10.1007/978-3-540-85836-2_29.
[2] Martins, Andre T., Mario A. T. Figueiredo, and Pedro M. Q. Aguiar. “Kernels and Similarity
Measures for Text Classification.” Accessed March 7, 2023. https://andre-
martins.github.io/docs/Martins_Aguiar_Figueiredo_CONFTELE2007.pdf.
[3] Rozeva, Anna, and Silvia Zerkova. “Assessing Semantic Similarity of Texts – Methods and Algorithms
Including Latent Semantic Analysis.” Accessed March 7, 2023.
https://aip.scitation.org/doi/pdf/10.1063/1.5014006.
[4] Babcock, Meghan J., Vivian P. Ta, and William Ickes. “Latent Semantic Similarity and Language
Style Matching in Initial Dyadic Interactions.” Journal of Language and Social Psychology 33, no. 1
(January 1, 2014): 78–88. https://doi.org/10.1177/0261927X13499331.
[5] Mittal, Kanika, and Amita Jain. “Word Sense Disambiguation Method Using Semantic Similarity
Measures and OWA Operator.” ICTACT Journal on Soft Computing 5 (2015): 896–904.
https://doi.org/10.21917/ijsc.2015.0126.
[6] bogatron. “Answer to ‘What Is the Difference between Latent and Explicit Semantic Analysis.’” Data
Science Stack Exchange, May 14, 2015. https://datascience.stackexchange.com/a/5794.
[7] Bullock, Jamie. “Machine Learning Foundations: Features and Similarity.” Medium, March 9, 2020.
https://towardsdatascience.com/machine-learning-foundations-features-and-similarity-a6ef2901f09f.
[8] Calvo, Hiram, Andrea Segura-Olivares, and Alejandro García. “Dependency vs. Constituent Based
Syntactic N-Grams in Text Similarity Measures for Paraphrase Recognition.” Computación y Sistemas
18, no. 3 (September 2014): 517–54. https://doi.org/10.13053/CyS-18-3-2044.
[9] Camacho-Collados, José, Mohammad Taher Pilehvar, and Roberto Navigli. “NASARI: A Novel
Approach to a Semantically-Aware Representation of Items.” In Proceedings of the 2015 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 567–77. Denver, Colorado: Association for Computational Linguistics, 2015.
https://doi.org/10.3115/v1/N15-1059.
[10] Campiteli, Mônica G., Frederico M. Soriani, Iran Malavazi, Osame Kinouchi, Carlos AB Pereira, and
Gustavo H. Goldman. “A Reliable Measure of Similarity Based on Dependency for Short Time Series:
An Application to Gene Expression Networks.” BMC Bioinformatics 10, no. 1 (August 28, 2009): 270.
https://doi.org/10.1186/1471-2105-10-270.
[11] Chandrasekaran, Dhivya, and Vijay Mago. “Evolution of Semantic Similarity -- A Survey.” ACM
Computing Surveys 54, no. 2 (March 31, 2022): 1–37. https://doi.org/10.1145/3440755.
[12] Chen, Liang, and Yajun Liu. “Automated Scoring System Using Dependency-Based Weighted
Semantic Similarity Model.” In 2009 Second International Symposium on Knowledge Acquisition and
Modeling, 1:241–44, 2009. https://doi.org/10.1109/KAM.2009.77.
[13] Chen, Long-Bin, Yan-Ni Wang, and Bao-Gang Hu. “Kernel-Based Similarity Learning,” 2152–56
vol.4, 2002. https://doi.org/10.1109/ICMLC.2002.1175419.
[14] Cheng, Raymond. “Semantic Similarity Using Transformers.” Medium, January 16, 2021.
https://towardsdatascience.com/semantic-similarity-using-transformers-8f3cb5bf66d6.
[15] Chi, Ziming, and Bingyan Zhang. “A Sentence Similarity Estimation Method Based on Improved
Siamese Network.” Journal of Intelligent Learning Systems and Applications 10, no. 4 (October 24,
2018): 121–34. https://doi.org/10.4236/jilsa.2018.104008.
[16] Cilibrasi, Rudi, and Paul M. B. Vitanyi. “The Google Similarity Distance.” arXiv, May 30, 2007.
http://arxiv.org/abs/cs/0412098.
[17] Google Developers. “Create a Manual Similarity Measure | Machine Learning.” Accessed March 7,
2023. https://developers.google.com/machine-learning/clustering/similarity/manual-similarity.
[18] Dagan, Ido, Lillian Lee, and Fernando Pereira. “Similarity-Based Methods For Word Sense
Disambiguation,” May 4, 2002. https://doi.org/10.3115/976909.979625.
[19] Plas, Lonneke van der, and Jörg Tiedemann. “Finding Synonyms Using Automatic Word Alignment and
Measures of Distributional Similarity,” 2006. https://doi.org/10.3115/1273073.1273184.
[20] Doshi, Sanket. “Latent Semantic Analysis Deduce the Hidden Topic from the Document.” Medium,
February 26, 2020. https://towardsdatascience.com/latent-semantic-analysis-deduce-the-hidden-topic-
from-the-document-f360e8c0614b.
[21] dramé, Khadim, Fleur Mougin, and Gayo Diallo. “Large Scale Biomedical Texts Classification: A
KNN and an ESA-Based Approaches.” Journal of Biomedical Semantics 7 (June 16, 2016).
https://doi.org/10.1186/s13326-016-0073-1.
[22] Erbs, Nicolai, Iryna Gurevych, and Torsten Zesch. “Sense and Similarity: A Study of Sense-Level
Similarity Measures.” In Proceedings of the Third Joint Conference on Lexical and Computational
Semantics (*SEM 2014), 30–39. Dublin, Ireland: Association for Computational Linguistics and
Dublin City University, 2014. https://doi.org/10.3115/v1/S14-1004.
[23] galoosh33. “Answer to ‘Using LDA to Calculate Similarity.’” Cross Validated, April 2, 2017.
https://stats.stackexchange.com/a/271368.
[24] Gao, Jian-Bo, Bao-Wen Zhang, and Xiao-Hua Chen. “A Simplified Edge-Counting Method for
Measuring Semantic Similarity of Concepts.” In 2015 International Conference on Machine Learning
and Cybernetics (ICMLC), 1:176–81, 2015. https://doi.org/10.1109/ICMLC.2015.7340918.
[25] Garbhapu, Vasantha Kumari, and Prajna Bodapati. “A Comparative Analysis of Latent Semantic
Analysis and Latent Dirichlet Allocation Topic Modeling Methods Using Bible Data.” Indian Journal
of Science and Technology 13, no. 44 (December 13, 2020): 4474–82.
https://doi.org/10.17485/IJST/v13i44.1479.
[26] Gligorov, Risto, Warner ten Kate, Zharko Aleksovski, and Frank van Harmelen. “Using Google
Distance to Weight Approximate Ontology Matches.” In Proceedings of the 16th International
Conference on World Wide Web, 767–76. Banff, Alberta, Canada: ACM, 2007.
https://doi.org/10.1145/1242572.1242676.
[27] Gonçalo Oliveira, Hugo. “Distributional and Knowledge-Based Approaches for Computing Portuguese
Word Similarity.” Information 9, no. 2 (February 2018): 35. https://doi.org/10.3390/info9020035.
[28] Hamza Osman, Ahmed, and Albaraa Abobieda. “An Adaptive Normalized Google Distance Similarity
Measure for Extractive Text Summarization,” October 1, 2020.
https://doi.org/10.1109/ICCIS49240.2020.9257668.
[29] Harmouch, Mahmoud. “17 Types of Similarity and Dissimilarity Measures Used in Data Science.”
Medium, April 2, 2021. https://towardsdatascience.com/17-types-of-similarity-and-dissimilarity-
measures-used-in-data-science-3eb914d2681.
[30] Hassan, Basma, Samir E. Abdelrahman, Reem Bahgat, and Ibrahim Farag. “UESTS: An Unsupervised
Ensemble Semantic Textual Similarity Method.” IEEE Access 7 (2019): 85462–82.
https://doi.org/10.1109/ACCESS.2019.2925006.
[31] Bui, Hong Nhung, Quang-Thuy Ha, and Tri-Thanh Nguyen. “An Novel Similarity Measure for
Trace Clustering Based on Normalized Google Distance.” Accessed March 7, 2023.
https://eprints.uet.vnu.edu.vn/eprints/id/eprint/3160/1/18_9_An%20novel%20similarity%20measure%
20for%20trace%20clustering%20based%20on%20Normalized%20Google%20Distance.pdf.
[32] “Hyperspace Analogue to Language Algorithm - GM-RKB.” Accessed March 7, 2023.
http://www.gabormelli.com/RKB/Hyperspace_Analogue_to_Language_Algorithm.
[33] Iacobacci, Ignacio, Mohammad Taher Pilehvar, and Roberto Navigli. “SensEmbed: Learning Sense
Embeddings for Word and Relational Similarity.” In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), 95–105. Beijing, China: Association for
Computational Linguistics, 2015. https://doi.org/10.3115/v1/P15-1010.
[34] Ismail, Shimaa, Tarek EL Shishtawy, and Abdelwahab Kamel Alsammak. “A New Alignment Word-
Space Approach for Measuring Semantic Similarity for Arabic Text.” International Journal on
Semantic Web and Information Systems (IJSWIS) 18, no. 1 (January 1, 2022): 1–18.
https://doi.org/10.4018/IJSWIS.297036.
[35] “Issues in Crosswalking Content Metadata Standards - National Information Standards Organization.”
Accessed November 19, 2018. https://groups.niso.org/publications/white_papers/crosswalk/.
[36] Ji, Mingyu, and Xinhai Zhang. “A Short Text Similarity Calculation Method Combining Semantic and
Headword Attention Mechanism.” Scientific Programming 2022 (May 21, 2022): e8252492.
https://doi.org/10.1155/2022/8252492.
[37] K, Manjula Shenoy, K. C. Shet, and U. Dinesh Acharya. “A New Similarity Measure for Taxonomy
Based on Edge Counting.” arXiv, November 20, 2012. http://arxiv.org/abs/1211.4709.
[38] Lund, Kevin, and Curt Burgess. “Producing High-Dimensional Semantic Spaces from Lexical
Co-Occurrence.” Accessed March 7, 2023.
https://link.springer.com/content/pdf/10.3758/BF03204766.pdf.
[39] Kohail, Sarah, Amr Rekaby Salama, and Chris Biemann. “STS-UHH at SemEval-2017 Task 1:
Scoring Semantic Textual Similarity Using Supervised and Unsupervised Ensemble.” In Proceedings
of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 175–79. Vancouver,
Canada: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/S17-2025.
[40] Kosar, Vaclav. “Word Alignment for Sentence Similarity,” April 2, 2022.
https://vaclavkosar.com/ml/monolingual-word-alignment-for-sentence-similarity.
[41] Kulmanov, Maxat, Fatima Zohra Smaili, Xin Gao, and Robert Hoehndorf. “Semantic Similarity and
Machine Learning with Ontologies.” Briefings in Bioinformatics 22, no. 4 (July 1, 2021): bbaa199.
https://doi.org/10.1093/bib/bbaa199.
[42] Lin, Qingjian, Ruiqing Yin, Ming Li, Hervé Bredin, and Claude Barras. “LSTM Based Similarity
Measurement with Spectral Clustering for Speaker Diarization.” In Interspeech 2019, 366–70, 2019.
https://doi.org/10.21437/Interspeech.2019-1388.
[43] Liu, Kaijian, and Nora El-Gohary. “Similarity-Based Dependency Parsing for Extracting Dependency
Relations from Bridge Inspection Reports,” June 13, 2017, 316–23.
https://doi.org/10.1061/9780784480823.038.
[44] Lopes, Carla Teixeira, and Diogo Moura. “Normalized Google Distance in the Identification and
Characterization of Health Queries.” In 2019 14th Iberian Conference on Information Systems and
Technologies (CISTI), 1–4, 2019. https://doi.org/10.23919/CISTI.2019.8760964.
[45] Lopez-Gazpio, I., M. Maritxalar, M. Lapata, and E. Agirre. “Word N-Gram Attention Models for
Sentence Similarity and Inference.” Expert Systems with Applications 132 (October 15, 2019): 1–11.
https://doi.org/10.1016/j.eswa.2019.04.054.
[46] “LSTM | Introduction to LSTM | Long Short Term Memory Algorithms.” Accessed March 7, 2023.
https://www.analyticsvidhya.com/blog/2021/03/introduction-to-long-short-term-memory-lstm/.
[47] Rei, Marek. “Minimally Supervised Dependency-Based Methods for Natural Language Processing.”
Accessed March 7, 2023. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-840.pdf.
[48] Balcan, Maria-Florina, and Avrim Blum. “On a Theory of Learning with Similarity Functions.”
Accessed March 7, 2023. http://www.cs.cmu.edu/~ninamf/papers/similarity_icml.pdf.
[49] “Mastering Sentence Transformers For Sentence Similarity Predictive Hacks.” Accessed March 7,
2023. https://predictivehacks.com/mastering-sentence-transformers-for-sentence-similarity/.
[50] Google Developers. “Measuring Similarity from Embeddings | Machine Learning.” Accessed March 7,
2023. https://developers.google.com/machine-learning/clustering/similarity/measuring-similarity.
[51] “Measuring Text Similarity Using BERT - Analytics Vidhya.” Accessed March 7, 2023.
https://www.analyticsvidhya.com/blog/2021/05/measuring-text-similarity-using-bert/.
[52] Ajinaja, Micheal Olalekan, Olusola Adebayo Adetunmbi, Chukwuemeka Christian Ugwu, and
Popoola Olugbemiga Solomon. “Semantic Similarity Measure for Topic Modeling Using Latent Dirichlet
Allocation and Collapsed Gibbs Sampling.” Accessed March 7, 2023.
https://assets.researchsquare.com/files/rs-1968318/v1_covered.pdf?c=1661194432.
[53] Mohd, Ayesha. “Information Content Based Semantic Similarity Measure,” July 21, 2020.
[54] Chatterjee, Moitreya, and Yunan Luo. “Similarity Learning with (or without) Convolutional Neural
Network.” Accessed March 7, 2023. https://slazebni.cs.illinois.edu/spring17/lec09_similarity.pdf.
[55] Mrhar, Khaoula, and Mounia Abik. “Towards Optimize-ESA for Text Semantic Similarity: A Case
Study of Biomedical Text.” International Journal of Electrical and Computer Engineering (IJECE)
10, no. 3 (June 1, 2020): 2934–43. https://doi.org/10.11591/ijece.v10i3.pp2934-2943.
[56] Srebro, Nathan. “How Good Is a Kernel When Used as a Similarity Measure?” Accessed March 7,
2023. https://home.ttic.edu/~nati/Publications/SrebroCOLT07.pdf.
[57] Navigli, Roberto, and Federico Martelli. “An Overview of Word and Sense Similarity.” Natural
Language Engineering 25, no. 6 (November 2019): 693–714.
https://doi.org/10.1017/S1351324919000305.
[58] Nemani, Praneeth, and Satyanarayana Vollala. “A Cognitive Study on Semantic Similarity Analysis of
Large Corpora: A Transformer-Based Approach.” arXiv, July 24, 2022.
https://doi.org/10.48550/arXiv.2207.11716.
[59] Tyagi, Nemika, Sudeshna Chakraborty, Jyotsna, Aditya Kumar, and Nzanzu Katasohire Romeo.
“Word Sense Disambiguation Models Emerging Trends: A Comparative Analysis.” Accessed March
7, 2023. https://iopscience.iop.org/article/10.1088/1742-6596/2161/1/012035/pdf.
[60] Niazmardi, Saeid, Abdolreza Safari, and Saeid Homayouni. “Similarity-Based Multiple Kernel
Learning Algorithms for Classification of Remotely Sensed Images.” IEEE Journal of Selected Topics
in Applied Earth Observations and Remote Sensing 10, no. 5 (May 2017): 2012–21.
https://doi.org/10.1109/JSTARS.2017.2662484.
[61] Niraula, Nobal, Rajendra Banjade, Dan Stefanescu, and Vasile Rus. “Experiments with Semantic
Similarity Measures Based on LDA and LSA,” 188–99, 2013. https://doi.org/10.1007/978-3-642-
39593-2_17.
[62] Oosten, Philip van. “Wikipedia-Based Explicit Semantic Analysis.” Java, January 28, 2023.
https://github.com/pvoosten/explicit-semantic-analysis.
[63] Özateş, Şaziye Betül, Arzucan Özgür, and Dragomir Radev. “Sentence Similarity Based on
Dependency Tree Kernels for Multi-Document Summarization.” In Proceedings of the Tenth
International Conference on Language Resources and Evaluation (LREC’16), 2833–38. Portorož,
Slovenia: European Language Resources Association (ELRA), 2016. https://aclanthology.org/L16-
1452.
[64] “Papers with Code - Sentence Similarity Based on Dependency Tree Kernels for Multi-Document
Summarization.” Accessed March 7, 2023. https://paperswithcode.com/paper/sentence-similarity-
based-on-dependency-tree.
[65] Pedersen, Ted. “Information Content Measures of Semantic Similarity Perform Better Without Sense-
Tagged Text.” In Human Language Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, 329–32. Los Angeles, California:
Association for Computational Linguistics, 2010. https://aclanthology.org/N10-1047.
[66] Poerner, Nina, Ulli Waltinger, and Hinrich Schütze. “Sentence Meta-Embeddings for Unsupervised
Semantic Textual Similarity.” arXiv, June 24, 2020. http://arxiv.org/abs/1911.03700.
[67] Mallela, Pravallika. “Analysis of Learning Algorithms for the Similarity Neural Network.” Accessed
March 7, 2023.
https://upcommons.upc.edu/bitstream/handle/2117/336006/154362.pdf?sequence=1&isAllowed=y.
[68] Mihalcea, Rada, Carlo Strapparava, and Courtney Corley. “Corpus-Based and Knowledge-Based
Measures of Text Semantic Similarity.” Accessed March 7, 2023.
https://cdn.aaai.org/AAAI/2006/AAAI06-123.pdf.
[69] Guru, Ramakrishnan B. “A Similarity Based Concordance Approach to Word Sense Disambiguation.”
Accessed March 7, 2023. https://thekeep.eiu.edu/cgi/viewcontent.cgi?article=2382&context=theses.
[70] Ranasinghe, Tharindu, Constantin Orasan, and Ruslan Mitkov. “Semantic Textual Similarity with
Siamese Neural Networks.” In Proceedings of the International Conference on Recent Advances in
Natural Language Processing (RANLP 2019), 1004–11. Varna, Bulgaria: INCOMA Ltd., 2019.
https://doi.org/10.26615/978-954-452-056-4_116.
[71] Richa Dhagat, Arpana Rawal, and Sunita Soni. “Comparison of Free Text Semantic Similarity
Measures Using Dependency Relations.” Accessed March 7, 2023.
https://www.irjet.net/archives/V8/i9/IRJET-V8I924.pdf.
[72] Ruas, Terry, William Grosky, and Akiko Aizawa. “Multi-Sense Embeddings through a Word Sense
Disambiguation Process.” Expert Systems with Applications 136 (December 2019): 288–303.
https://doi.org/10.1016/j.eswa.2019.06.026.
[73] Cilibrasi, Rudi L., and Paul M.B. Vitanyi. “Normalized Web Distance and Word Similarity.” Accessed
March 7, 2023. https://homepages.cwi.nl/~paulv/papers/crc08.pdf.
[74] Rus, Vasile, Nobal Niraula, and Rajendra Banjade. “Similarity Measures Based on Latent Dirichlet
Allocation.” In Computational Linguistics and Intelligent Text Processing, edited by Alexander
Gelbukh, 459–70. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2013.
https://doi.org/10.1007/978-3-642-37247-6_37.
[75] Rydbeck, Halfdan, Geir Kjetil Sandve, Egil Ferkingstad, Boris Simovski, Morten Rye, and Eivind
Hovig. “ClusTrack: Feature Extraction and Similarity Measures for Clustering of Genome-Wide Data
Sets.” PLOS ONE 10, no. 4 (April 16, 2015): e0123261. https://doi.org/10.1371/journal.pone.0123261.
[76] Sánchez, David, and Montserrat Batet. “A Semantic Similarity Method Based on Information Content
Exploiting Multiple Ontologies.” Expert Systems with Applications 40, no. 4 (March 1, 2013): 1393–99.
https://doi.org/10.1016/j.eswa.2012.08.049.
[77] Saraswathi, K., V. Mohanraj, Y. Suresh, and J. Senthilkumar. “A Hybrid Multi-Feature Semantic
Similarity Based Online Social Recommendation System Using CNN.” International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems 29, no. Supp02 (December 2021): 333–52.
https://doi.org/10.1142/S0218488521400183.
[78] Scholl, Philipp, Doreen Böhnstedt, Renato Domínguez García, Christoph Rensing, and Ralf Steinmetz.
“Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources.” In
Sustaining TEL: From Innovation to Learning and Practice, edited by Martin Wolpers, Paul A.
Kirschner, Maren Scheffel, Stefanie Lindstaedt, and Vania Dimitrova, 324–39. Lecture Notes in
Computer Science. Berlin, Heidelberg: Springer, 2010. https://doi.org/10.1007/978-3-642-16020-2_22.
[79] Bångerius, Sebastian. “LSTM Feature Engineering Through Time Series Similarity Embedding.”
Accessed March 7, 2023. https://www.diva-portal.org/smash/get/diva2:1698439/FULLTEXT01.pdf.
[80] Vennify Inc. “Semantic Similarity With Sentence Transformers,” June 8, 2022.
https://www.vennify.ai/semantic-similarity-sentence-transformers/.
[81] “Semantic Textual Similarity Sentence-Transformers Documentation.” Accessed March 7, 2023.
https://www.sbert.net/docs/usage/semantic_textual_similarity.html.
[82] Pinecone. “Sentence Transformers and Embeddings.” Accessed March 7, 2023.
https://www.pinecone.io/learn/sentence-embeddings/.
[83] Shen, Yuanyuan, Edmund M.-K. Lai, and Mahsa Mohaghegh. “Effects of Similarity Score Functions
in Attention Mechanisms on the Performance of Neural Question Answering Systems.” Neural
Processing Letters 54, no. 3 (June 1, 2022): 2283–2302. https://doi.org/10.1007/s11063-021-10730-4.
[84] “Similarity and Dissimilarity Metrics - Kernel Distance,” May 4, 2022.
https://eranraviv.com/similarity-dissimilarity-metrics-kernel-distance/.
[85] Simmons, Sabrina, and Zachary Estes. “Using Latent Semantic Analysis to Estimate Similarity,”
January 1, 2006.
[86] Slimene, Alya, and Ezzeddine Zagrouba. “Overlapping Area Hyperspheres for Kernel-Based
Similarity Method.” Pattern Analysis and Applications 20, no. 4 (November 1, 2017): 1227–43.
https://doi.org/10.1007/s10044-017-0604-0.
[87] Songyot, Theerawat, and David Chiang. “Improving Word Alignment Using Word Similarity.” In
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 1840–45. Doha, Qatar: Association for Computational Linguistics, 2014.
https://doi.org/10.3115/v1/D14-1197.
[88] Sonkar, Shashank, Andrew Waters, and Richard Baraniuk. “Attention Word Embedding.” In
Proceedings of the 28th International Conference on Computational Linguistics, 6894–6902.
Barcelona, Spain (Online): International Committee on Computational Linguistics, 2020.
https://doi.org/10.18653/v1/2020.coling-main.608.
[89] Stefanescu, Dan, Vasile Rus, Nobal Niraula, and Rajendra Banjade. “Combining Knowledge and
Corpus-Based Measures for Word-to-Word Similarity,” 2014.
[90] Suleman, Raja Muhammad, and Ioannis Korkontzelos. “Extending Latent Semantic Analysis to
Manage Its Syntactic Blindness.” Expert Systems with Applications 165 (March 1, 2021): 114130.
https://doi.org/10.1016/j.eswa.2020.114130.
[91] Google Developers. “Supervised Similarity Measure | Machine Learning.” Accessed March 7, 2023.
https://developers.google.com/machine-learning/clustering/similarity/supervised-similarity.
[92] Zhu, Tiedan, and Kan Li. “The Similarity Measure Based on LDA for Automatic Summarization.”
Accessed March 7, 2023. https://pdf.sciencedirectassets.com/278653/1-s2.0-S1877705812X00043/1-
s2.0-S1877705812004298/main.pdf
[93] Cvitanic, Toni, Bumsoo Lee, Hyeon Ik Song, Katherine Fu, and David Rosen. “LDA v. LSA: A
Comparison of Two Computational Text Analysis Tools for the Functional Categorization of Patents.”
Accessed March 7, 2023. https://par.nsf.gov/servlets/purl/10055536.
[94] Viji, D., and S. Revathy. “A Hybrid Approach of Weighted Fine-Tuned BERT Extraction with
Deep Siamese Bi LSTM Model for Semantic Text Similarity Identification.” Multimedia Tools and
Applications 81, no. 5 (2022): 6131–57. https://doi.org/10.1007/s11042-021-11771-6.
[95] Luu, Vinh-Trung, Germain Forestier, Jonathan Weber, Paul Bourgeois, Fahima Djelil, and Pierre-Alain
Muller. “A Review of Alignment Based Similarity Measures for Web Usage Mining.” Accessed
March 7, 2023. https://hal.science/hal-02632870/document.
[96] Towne, W. Ben, Carolyn P. Rosé, and James D. Herbsleb. “Measuring Similarity
Similarly: LDA and Human Perception.” Accessed March 7, 2023.
http://www.cs.cmu.edu/~cprose/TIST.pdf.
[97] Wang, Yue, Xiaoqiang Di, Jinqing Li, Huamin Yang, and Lin Bi. “Sentence Similarity Learning
Method Based on Attention Hybrid Model.” Journal of Physics: Conference Series 1069, no. 1
(August 2018): 012119. https://doi.org/10.1088/1742-6596/1069/1/012119.
[98] “What Is Sentence Similarity? - Hugging Face,” July 11, 2022. https://huggingface.co/tasks/sentence-
similarity.
[99] Wu, Jheng-Long, Xiang Xiao, Liang-Chih Yu, Shao-Zhen Ye, and K. Robert Lai. “Using an
Analogical Reasoning Framework to Infer Language Patterns for Negative Life Events.” BMC Medical
Informatics and Decision Making 19 (August 28, 2019): 173. https://doi.org/10.1186/s12911-019-
0895-8.
[100] Karov, Yael, and Shimon Edelman. “Similarity-Based Word Sense Disambiguation.” Accessed March
7, 2023. https://dl.acm.org/doi/pdf/10.5555/972719.972722.
[101] Yan, Hengbin, and Jonathan Webster. “A Corpus-Based Approach to Linguistic Function,” n.d., 7.
[102] Yan, Tingxu, Tamsin Maxwell, Dawei Song, Yuexian Hou, and Peng Zhang. “Event-Based
Hyperspace Analogue to Language for Query Expansion.” Uppsala, Sweden, 2010.
http://dl.acm.org/citation.cfm?id=1858864.
[103] Yao, Lin, Zhengyu Pan, and Huansheng Ning. “Unlabeled Short Text Similarity With LSTM
Encoder.” IEEE Access 7 (2019): 3430–37. https://doi.org/10.1109/ACCESS.2018.2885698.
[104] You, Zhiyuan, Kai Yang, Wenhan Luo, Xin Lu, Lei Cui, and Xinyi Le. “Few-Shot Object Counting
with Similarity-Aware Feature Enhancement.” arXiv, September 10, 2022.
https://doi.org/10.48550/arXiv.2201.08959.
[105] Zhang, Dehai, Xiaoqiang Xia, Yun Yang, Po Yang, Cheng Xie, Menglong Cui, and Qing Liu. “A
Novel Word Similarity Measure Method for IoT-Enabled Healthcare Applications.” Future
Generation Computer Systems 114 (January 1, 2021): 209–18.
https://doi.org/10.1016/j.future.2020.07.053.
[106] Zhang, Jiang, Qun-Xiong Zhu, and Yan-Lin He. “Hierarchical Attention-Based BiLSTM Network for
Document Similarity Calculation.” In Proceedings of the 2020 4th International Symposium on
Computer Science and Intelligent Control, 15. ISCSIC 2020. New York, NY, USA: Association for
Computing Machinery, 2021. https://doi.org/10.1145/3440084.3441188.
[107] Zhang, Shanping, Xiaowei Xu, Ye Tao, Xiaodong Wang, Qiuchen Wang, and Fangfang Chen. “Text
Similarity Measurement Method Based on BiLSTM-SECapsNet Model.” In 2021 6th International
Conference on Image, Vision and Computing (ICIVC), 414–19, 2021.
https://doi.org/10.1109/ICIVC52351.2021.9527010.
[108] Zhang, Shiru, Zhiyao Liang, and Jian Lin. “Sentence Similarity Measurement with Convolutional
Neural Networks Using Semantic and Syntactic Features.” Computers, Materials & Continua 63, no. 2
(2020): 943–57. https://doi.org/10.32604/cmc.2020.08800.
[109] Zhang, Shu, Jianwei Wu, Dequan Zheng, Yao Meng, and Hao Yu. “An Adaptive Method for
Organization Name Disambiguation with Feature Reinforcing.” In Proceedings of the 26th Pacific
Asia Conference on Language, Information, and Computation, 237–45. Bali, Indonesia: Faculty of
Computer Science, Universitas Indonesia, 2012. https://www.aclweb.org/anthology/Y12-1025.
[110] Zhang, Xiaogang, Shouqian Sun, and Kejun Zhang. “An Information Content-Based Approach for
Measuring Concept Semantic Similarity in WordNet.” Wireless Personal Communications 103, no. 1
(November 1, 2018): 117–32. https://doi.org/10.1007/s11277-018-5429-7.
[111] Kang, Zhao, Chong Peng, and Qiang Cheng. “Kernel-Driven Similarity Learning.” Accessed March 7,
2023. https://pdf.sciencedirectassets.com/271597/1-s2.0-S0925231217X00404/1-s2.0-
S0925231217310603/Zhao_Kang_Similarity_measure_2017.pdf
[112] Kang, Zhao, Yiwei Lu, Yuanzhang Su, Changsheng Li, and Zenglin Xu. “Similarity Learning via
Kernel Preserving Embedding.” Accessed March 7, 2023.
https://dl.acm.org/doi/pdf/10.1609/aaai.v33i01.33014057.
[113] Zhao, Weidong, Xiaotong Liu, Jun Jing, and Rongchang Xi. “Re-LSTM: A Long Short-Term Memory
Network Text Similarity Algorithm Based on Weighted Word Embedding.” Connection Science 34,
no. 1 (December 31, 2022): 2652–70. https://doi.org/10.1080/09540091.2022.2140122.
[114] Zhou, Minchuan. “Comparison of CNN Based and Self-Similarity Based Denoising Methods,” 2020.
https://www.semanticscholar.org/paper/Comparison-of-CNN-based-and-self-similarity-based-
Zhou/c5d60b43b9ea45ca5f1f01c6268a2b1e753263a7.
[115] Zhuang, WenLi, and Ernie Chang. “Neobility at SemEval-2017 Task 1: An Attention-Based Sentence
Similarity Model.” In Proceedings of the 11th International Workshop on Semantic Evaluation
(SemEval-2017), 16469. Vancouver, Canada: Association for Computational Linguistics, 2017.
https://doi.org/10.18653/v1/S17-2023.
[116] Zhu, Zongkui, Zhengqiu He, Ziyi Tang, Baohui Wang, and Wenliang Chen. “A Semantic Similarity
Computing Model Based on Siamese Network for Duplicate Questions Identification.” Accessed
March 7, 2023. https://ceur-ws.org/Vol-2242/paper08.pdf.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Natural language processing text similarity calculation is a crucial and difficult problem that enables matching between various messages. This approach is the foundation of many applications. The word representation features and contextual relationships extracted by current text similarity computation methods are insufficient, and too many factors increase the computational complexity. Re-LSTM, a weighted word embedding long and short-term memory network, has therefore been proposed as a text similarity computing model. The two-gate mechanism of Re-LSTM neurons is built on the foundation of the conventional LSTM model and is intended to minimise the parameters and computation to some level. The hidden features and state information of the layer above each gate are considered for extracting more implicit features. By fully utilising the feature word and its domain association, the feature word’s position, and the word frequency information, the TF-IDF method and the χ²-C algorithm may effectively improve the representation of the weights on the words. The Attention mechanism is used in Re-LSTM to combine dependencies and feature word weights for deeper text semantic mining. The experimental results demonstrate that the Re-LSTM model outperforms baselines in terms of precision, recall, accuracy, and F1 values, all of which reach above 85% when applied to the QQPC and ATEC datasets.
Article
Full-text available
This work presents a new alignment word-space approach for measuring the similarity between two snipped texts. The approach combines two similarity measurement methods: alignment-based and vector space-based. The vector space-based method depends on a semantic net that represents the meaning of words as vectors. These vectors are lemmatized to enrich the search space. The alignment-based method generates an alignment word space matrix (AWSM) for the snipped texts according to the generated semantic word spaces. Finally, the degree of sentence semantic similarity is measured using some proposed alignment rules. Four experiments were carried out to evaluate the performance of the proposed approach, using two different datasets. The experimental results proved that applying the lemmatization process for the input text and the vector model has a better effect. The degree of correctness of the results reaches 0.7212 which is considered one of the best two results of the published Arabic semantic similarities.
Article
Full-text available
Short text similarity computation plays an important role in various natural language processing tasks. Siamese neural networks are widely used in short text similarity calculation. However, due to the complexity of syntax and the correlation between words, siamese networks alone cannot achieve satisfactory results. Many studies show that the use of an attention mechanism will improve the impact of key features that can be utilized to measure sentence similarity. In this paper, a similarity calculation method is proposed which combines semantics and a headword attention mechanism. First, a BiGRU model is utilized to extract contextual information. After obtaining the headword set, the semantically enhanced representations of the two sentences are obtained through an attention mechanism and character splicing. Finally, we use a one-dimensional convolutional neural network to fuse the word embedding information with the contextual information. The experimental results on the ATEC and MSRP datasets show that the recall and F1 values of the proposed model are significantly improved through the introduction of the headword attention mechanism.
Article
Full-text available
Attention mechanisms have been incorporated into many neural network-based natural language processing (NLP) models. They enhance the ability of these models to learn and reason with long input texts. A critical part of such mechanisms is the computation of attention similarity scores between two elements of the texts using a similarity score function. Given that these models have different architectures, it is difficult to comparatively evaluate the effectiveness of different similarity score functions. In this paper, we proposed a baseline model that captures the common components of recurrent neural network-based Question Answering (QA) systems found in the literature. By isolating the attention function, this baseline model allows us to study the effects of different similarity score functions on the performance of such systems. Experimental results show that a trilinear function produced the best results among the commonly used functions. Based on these insights, a new T-trilinear similarity function is proposed which achieved the higher predictive EM and F1 scores than these existing functions. A heatmap visualization of the attention score matrix explains why this T-trilinear function is effective.
Article
Full-text available
Conventional semantic text-similarity methods require large amounts of labeled training data as well as human intervention. They generally neglect contextual information and word order, resulting in data-sparseness and dimensionality-explosion issues. Recently, deep-learning methods have been used for determining text similarity. Hence, this study investigates the use of NLP application tasks in detecting the text similarity of question pairs or documents and explores similarity score predictions. A new hybridized approach using weighted fine-tuned BERT feature extraction with a Siamese Bi-LSTM model is implemented. The technique is employed for determining semantic text similarity on question pair sets from the Quora dataset. The text features are extracted using BERT, followed by word embedding with weights. The features, along with their weight values, are represented as embedded vectors and passed through the various layers of the Siamese network. The embedded vectors of the input text features are trained using a deep Siamese Bi-LSTM model across its layers. Finally, similarity scores are determined for each sentence, and the semantic text similarity is learned. The performance of the proposed framework is evaluated with respect to accuracy, precision, F1 score, and recall, compared with other existing text-similarity detection methods. The proposed framework exhibited higher efficiency, with 91% accuracy in determining semantic text similarity compared with the other existing algorithms.
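The Siamese structure described above — one shared encoder applied to both inputs, followed by a similarity score over the two resulting vectors — can be illustrated with a deliberately tiny stand-in encoder. Everything here is a toy assumption (hashed random token vectors and mean pooling in place of BERT features and the Bi-LSTM towers); only the shared-weights-plus-similarity shape matches the cited design:

```python
import zlib
import numpy as np

def token_vec(tok, dim=16):
    """Deterministic pseudo-embedding for a token, seeded by a
    stable CRC32 hash so repeated runs give identical vectors."""
    rng = np.random.default_rng(zlib.crc32(tok.encode("utf-8")))
    return rng.standard_normal(dim)

def encode(sentence, dim=16):
    """The shared 'tower' of the Siamese setup: embed tokens,
    then mean-pool into a single sentence vector."""
    return np.mean([token_vec(t, dim) for t in sentence.split()], axis=0)

def similarity(s1, s2):
    """Apply the SAME encoder to both inputs, then compare
    the two sentence vectors with cosine similarity."""
    u, v = encode(s1), encode(s2)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Because the encoder weights are shared between the two branches, identical inputs map to identical vectors (similarity 1.0), and training — omitted here — would pull paraphrase pairs together and push unrelated pairs apart in the shared embedding space.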
Article
Full-text available
Estimating the semantic similarity between text data is one of the challenging and open research problems in the field of Natural Language Processing (NLP). The versatility of natural language makes it difficult to define rule-based methods for determining semantic similarity measures. To address this issue, various semantic similarity methods have been proposed over the years. This survey article traces the evolution of such methods beginning from traditional NLP techniques such as kernel-based methods to the most recent research work on transformer-based models, categorizing them based on their underlying principles as knowledge-based, corpus-based, deep neural network–based methods, and hybrid methods. Discussing the strengths and weaknesses of each method, this survey provides a comprehensive view of existing systems in place for new researchers to experiment and develop innovative ideas to address the issue of semantic similarity.
Article
Modern organizations are keen to work towards their customers' needs. To achieve this, analyzing customer activities and identifying their interest in any entity becomes important. Every user is treated as a critical asset from the organization's point of view, and organizations are unwilling to give up even a single user. Several approaches have been discussed earlier that use artificial intelligence to mine users and their interests. Deep-learning algorithms have been identified as the most efficient at identifying user interest but struggle to achieve higher performance. To address this issue, an efficient multi-feature semantic similarity-based online social recommendation system has been proposed. The method uses a Convolutional Neural Network (CNN) to train on and predict user interest in any topic. Each layer is assigned to a single interest, and the neurons of the layers are initialized with a large dataset. The neurons estimate the Multi-Feature Semantic Similarity (MFSS) towards each interest of the user. Finally, the method identifies the single dominant interest for the user by ranking each interest to produce recommendations. The proposed algorithm improves the performance of recommendation generation with a lower false ratio.