Article

Measuring the Semantic Similarity of Texts


Abstract

This paper presents a knowledge-based method for measuring the semantic similarity of texts. While there is a large body of previous work focused on finding the semantic similarity of concepts and words, the application of these word-oriented methods to text similarity has not yet been explored. In this paper, we introduce a method that combines word-to-word similarity metrics into a text-to-text metric, and we show that this method outperforms the traditional text similarity metrics based on lexical matching.
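The abstract does not spell out the combination scheme, but a common formulation in this line of work (a hedged sketch; the exact weighting used in the paper may differ) scores each text directionally by pairing every word with its most similar word in the other text, weighting by inverse document frequency, and then averages the two directions:

\[ \mathrm{sim}(T_1,T_2)_{T_1} = \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2)\,\mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)}, \qquad \mathrm{sim}(T_1,T_2) = \tfrac{1}{2}\bigl(\mathrm{sim}(T_1,T_2)_{T_1} + \mathrm{sim}(T_1,T_2)_{T_2}\bigr) \]

where maxSim(w, T_2) is the highest word-to-word similarity between w and any word in T_2.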


... distance between items based on their meaning or semantic content. Both are mathematical tools used to estimate the relationship "strength" between text corpora through a numerical description obtained from the comparison [24][25][26][27]. Recently published language models, 'BERT' 12 and its variations, have achieved state-of-the-art results by calculating similarities between sentence embeddings. ...
... Semantic Textual Similarity (STS), the task of estimating the semantic equivalence between two text contents, is a fundamental component in a variety of NLP tasks; yet, even today, it is considered to be a complicated open research problem in low-resource languages 27,48. This is especially true for Hebrew, which has a complex morphological structure in addition to insufficient resources. ...
... In comparison with the STS and FM3S methods, for  = 1, the recall value obtained by the proposed method is 0.4883, whereas for STS it is 0.0054 and for FM3S it is 0.4557. ...
... We also compare our method with other ontology-based algorithms for calculating the semantic relatedness score between two text sentences. Table 22 shows baseline methods and several other methods from (Corley and Mihalcea, 2005) and (Mihalcea et al., 2006) on test data; results are also evaluated in terms of accuracy. The figures quoted alongside these methods are:
(Corley and Mihalcea, 2005): 71.5, 72.3, 92.5, 81.2
Combined (U) (Mihalcea et al., 2006): 70.3, 69.6, 97.7, 81.3
Baseline, Threshold-1 (Mihalcea et al., 2006): 33.8, 100.0, 0.44, 0.87
Baseline, Vector-based (Mihalcea et al., 2006): 65.4, 71.6, 79.5, 75.3
Baseline, Random (Mihalcea et al., 2006): 51 ...
Article
Full-text available
Finding the semantic relatedness score between two sentences is useful in many research areas. Existing relatedness methods do not consider word sense while computing the semantic relatedness score between two sentences. In this study, a Word Sense Disambiguation (WSD) method is proposed and used to compute a sense-oriented sentence semantic relatedness measure. The WSD method finds the correct sense of a word present in a sentence. The proposed method uses both the WordNet lexical dictionary and the Wikipedia corpus. The sense-oriented sentence semantic relatedness measure combines an edge-based score between words depending on the context of the sentence, a sense-based score which finds sentences having similar senses, and a word-order score. We have evaluated the proposed WSD method on publicly available English WSD corpora and compared our sense-oriented sentence semantic relatedness measure on standard datasets. Experimental analysis illustrates the significance of the proposed method over many baseline and current systems such as Lesk, UKB, IMS, and Babelfy.
... Semantic Similarity. Semantic similarity refers to similitude based on the context rather than the structure [14]. It can be used for information retrieval, question answering, machine translation, data mining, and knowledge extraction. ...
... Introducing the sentence similarity technique in the production of a CA is often more efficient than other techniques, as in each clause it substitutes the predetermined patterns with a few natural language sentences. The scripting effort is subsequently reduced to a minimum [14]. There are two main approaches that use semantic similarity to calculate the similarity between words: knowledge-based and corpus-based. ...
Chapter
Full-text available
In our modern day and age, students with different abilities tend to undertake professional examinations (PEs) to obtain certifications that would prove their knowledge and expertise in their respective fields. This will help them when seeking a career boost in various domains. However, a challenging point arises in which many students express a lack of awareness about which PEs to consider, and which PE is best suited to their professional needs. In this research, a solution is proposed to overcome this challenge by designing and developing a web-based recommendation system based on a textual Conversational Agent (CA) called the Conversational Agent for Professional Examinations (CAPEs) Advisory. The CAPEs Advisory provides smart recommendations for better exam pathways that would suit the student’s various types of knowledge and skill level at University. The proposed architecture for the CAPEs Advisory uses Natural Language Processing (NLP) techniques by applying both Pattern Matching (PM) and a semantic similarity algorithm to extract keywords from the user’s utterances to match patterns in the scripted conversation. An evaluation methodology and experiments have been designed and conducted by using subjective and objective methods to evaluate the CAPEs Advisory components. The results showed a statistically significant impact on the effectiveness of the CAPEs Advisory engine in recognizing 97.36% of the utterances. In addition, the results show that the CAPEs Advisory is effective as a Professional Examinations Advisory with the majority user satisfaction being 83.3%.
... An important example of the use of WordNet is to determine the similarity between words. Various algorithms have been proposed [17], [18], [19], and these include considering the distance between the conceptual categories of words, as well as considering the hierarchical structure of the WordNet ontology. ...
... FINDING RESEARCH TOPIC OF AUTHORS. In this section, we also conduct experiments on NIPS 13-22. The top 2 authors who publish articles most on NIPS 13-22 are: Michael I. Jordan and Andrew Y. Ng. In our experiment, we extracted 48 articles published by Michael Jordan and 29 articles published by Andrew Ng from the corpus. ...
Article
Full-text available
A large volume of research documents is available online for us to access and analyze. It is very important to detect and mine the dynamics of the research topics from these large corpora. In this paper, we propose an improved method by introducing WordNet to LDA. This approach finds latent topics of large corpora, and we then propose several methods to analyze the dynamics of those topics. We apply the methodology to two large document collections: 1,940 papers from NIPS 00-13 (1987-2000) and 2,074 papers from NIPS 14-23 (2001-2010). Six experiments are conducted on the two corpora, and the experimental results show that our method is better than LDA in finding research topics and is feasible for discovering the dynamics of research topics.
... The major problem is that more and more basic methods capable of estimating the semantic similarity of pieces of text are being proposed (Navigli & Martelli, 2019). In the end, a plethora of reasonable methods are available, each based on very different concepts and assumptions, and the knowledge engineer does not know which one to use from the classical techniques (Corley & Mihalcea, 2005;Deerwester et al., 1990;Han et al., 2013;Huang et al., 2012;Janowicz et al., 2008;Jiang & Conrath, 1997;Leacock & Chodorow, 1998;Levenshtein, 1966;Li et al., 2003;Lin, 1998;Pedersen et al., 2007;Resnik, 1999;Rodríguez & Egenhofer, 2003;Sánchez et al., 2011;Seco et al., 2004) to the most recent ones (Aouicha et al., 2016;Bojanowski et al., 2017;Cer et al., 2018;Deudon, 2018;Devlin et al., 2019;He & Lin, 2016;Lastra-Díaz et al., 2017;Levy et al., 2015;Mikolov et al., 2013;Peters et al., 2018;Pilehvar & Navigli, 2015;Qu et al., 2018;Zhang et al., 2020). Therefore, many researchers agree that appropriately combining different semantic similarity measures could avoid fatal errors when implementing solutions working in production settings (Mihalcea et al., 2006;Pirrò, 2009;Potash et al., 2016). ...
Article
This article presents a comprehensive review of stacking methods commonly used to address the challenge of automatic semantic similarity measurement in the literature. Since more than two decades of research have produced a wide range of semantic similarity measures, scientists and practitioners often find it difficult to choose the best method to put into production. For this reason, a novel generation of strategies has been proposed that uses basic semantic similarity measures as base estimators to achieve better performance than could be gained from any of the individual measures. In this work, we analyze different stacking techniques, ranging from the classical algebraic methods to the most powerful ones based on hybridization, including blending, neural, fuzzy, and genetic-based stacking. Each technique excels in aspects such as simplicity, robustness, accuracy, interpretability, transferability, or a favorable combination of several of those aspects. The goal is that the reader can have an overview of the state-of-the-art in this field.
... Semantic textual similarity (STS) is an important component of many NLP tasks, including QA, document summarization, and IR [8]. Moreover, the semantic question similarity task is an application of the STS task. ...
Article
Full-text available
With the rapid increase of Arabic content on the web comes an increased need for short and accurate answers to queries. Machine question answering has appeared as an important emerging field for progress in natural language processing techniques. Machine learning performance surpasses that of humans in some areas, such as natural language processing and text analysis, especially with large amounts of data. There are two main contributions of this research. First, we propose the Tawasul Arabic question similarity (TAQS) system with four Arabic semantic question similarity models using deep learning techniques. Second, we curated and used an Arabic customer service question-similarity dataset with 44,404 question–answer pairs, called “Tawasul.” For TAQS, we first use transfer learning to extract the contextualized bidirectional encoder representations from transformers (BERT) embedding with bidirectional long short-term memory (BiLSTM) in two different ways. Specifically, we propose two architectures: the BERT contextual representation with BiLSTM (BERT-BiLSTM) and the hybrid transfer BERT contextual representation with BiLSTM (HT-BERT-BiLSTM). The hybrid transfer representation combines two transfer learning techniques. Second, we fine-tuned two versions of bidirectional encoder representations from transformers for Arabic language (AraBERT). The results show that the HT-BERT-BiLSTM with the features of Layer 12 reaches an accuracy of 94.45%, whereas the fine-tuned AraBERTv2 and AraBERTv0.2 achieve 93.10% and 93.90% accuracy, respectively, for the Tawasul dataset. Our proposed TAQS model surpassed the performance of the state-of-the-art BiLSTM with SkipGram by a gain of 43.19% in accuracy.
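As a rough illustration of the BERT-plus-BiLSTM idea described above, the sketch below feeds contextual token embeddings from a pretrained encoder into a bidirectional LSTM and a classification head. The model name, pooling choice, and layer sizes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn
from transformers import AutoModel

class BertBiLSTM(nn.Module):
    """Hedged sketch of a BERT-BiLSTM text-pair classifier (not the authors' exact model)."""

    def __init__(self, encoder_name="bert-base-multilingual-cased", hidden=128, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # assumed encoder checkpoint
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        # Contextual token embeddings from the transformer encoder.
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        seq_out, _ = self.bilstm(tokens)
        # Mean-pool the BiLSTM outputs before the similarity/label decision.
        pooled = seq_out.mean(dim=1)
        return self.classifier(pooled)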
... To obtain semantically related services, some scholars have proposed ontology-based [14] service discovery methods. This method can effectively identify the function of the service and calculate the semantic similarity [15] between the request and the service. However, the construction of the ontology is time-consuming and labor-intensive, and it is also prone to errors. ...
Article
Full-text available
Mashup is a new type of application that integrates multiple Web APIs. For mashup application development, the quality of the selected APIs is particularly important. However, with the rapid development of Internet technology, the number of Web APIs is increasing rapidly. It is unrealistic for mashup developers to manually select appropriate APIs from a large number of services. Existing methods suffer from a data sparsity problem, because one mashup is related to only a few APIs, and from over-reliance on semantic information. To solve these problems in current service discovery approaches, we propose a service discovery approach based on a knowledge map (SDKG). We embed service-related information into the knowledge graph, alleviating the impact of data sparsity and mining deep relationships between services, which improves the accuracy of service discovery. Experimental results show that our approach has obvious advantages in accuracy compared with the existing mainstream service discovery approaches.
... Cosine similarity is a distance measure that can be used to calculate the similarity between two words, sentences, paragraphs, or the whole document 36. It is an effective measure to estimate the similarity of vectors in high-dimensional space 37. ...
Preprint
High quality software systems typically require a set of clear, complete and comprehensive requirements. In the software development life cycle, a software requirement specification (SRS) document lays the foundation of product development by defining the set of functional and nonfunctional requirements. It also improves the quality of software products and ensures timely delivery of the projects. These requirements are typically documented in natural language, which might lead to misinterpretations and conflicts between the requirements. In this study, we aim to identify the conflicts in requirements by analyzing their semantic compositions and contextual meanings. We propose an approach for automatic conflict detection, which consists of two phases: identifying conflict candidates based on textual similarity, and using semantic analysis to filter the conflicts. The similarity-based conflict detection strategy involves finding the appropriate candidate requirements with the help of sentence embeddings and cosine similarity measures. Semantic conflict detection is an additional step applied over all the candidates identified in the first phase, where the useful information is extracted in the form of entities to be used for determining the overlapping portions of texts between the requirements. We test the generalizability of our approach using five SRS documents from different domains. Our experiments show that the proposed conflict detection strategy can capture the conflicts with high accuracy, and help automate the entire conflict detection process.
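The candidate-identification phase above relies on sentence embeddings and cosine similarity (see also the citation context preceding this entry). A minimal sketch of that step, assuming the sentence-transformers library and an arbitrary pretrained checkpoint, might look as follows; the threshold and the example requirements are purely illustrative.

from sentence_transformers import SentenceTransformer, util

# Assumed model checkpoint; any sentence-embedding model could be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

requirements = [
    "The system shall encrypt all stored user data.",
    "User data shall be stored in plain text for auditing.",
    "The UI shall support a dark color theme.",
]

emb = model.encode(requirements, convert_to_tensor=True)
scores = util.cos_sim(emb, emb)                      # pairwise cosine similarities

# Pairs above an (illustrative) similarity threshold become conflict candidates
# that are then passed to the semantic-analysis phase.
THRESHOLD = 0.5
candidates = [(i, j) for i in range(len(requirements))
              for j in range(i + 1, len(requirements))
              if scores[i][j] > THRESHOLD]
print(candidates)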
... Tasks STS: Semantic textual similarity assesses the degree of semantic equivalence between two pieces of text (Corley and Mihalcea, 2005). The aim is to predict a similarity score for a sentence pair (S1, S2), generally in the range [0, 5], where 0 indicates complete dissimilarity and 5 indicates equivalence in meaning. ...
Article
Full-text available
State-of-the-art classification and regression models are often not well calibrated, and cannot reliably provide uncertainty estimates, limiting their utility in safety-critical applications such as clinical decision-making. While recent work has focused on calibration of classifiers, there is almost no work in NLP on calibration in a regression setting. In this paper, we quantify the calibration of pre-trained language models for text regression, both intrinsically and extrinsically. We further apply uncertainty estimates to augment training data in low-resource domains. Our experiments on three regression tasks in both self-training and active-learning settings show that uncertainty estimation can be used to increase overall performance and enhance model generalization.
... Nouns, verbs, adjectives and adverbs are grouped into series of synonyms called synsets [14]. WordNet is used to find word similarity, semantic relations, and close relationships or similarity between words [15]. ...
... Cosine similarity is one of the successful similarity measures used with text documents in many information retrieval and clustering applications [21], [35]. The cosine similarity algorithm utilizes the angle between two vectors in the vector space to define the difference in content between two vectors [36]. ...
Article
Full-text available
Web service (WS) discovery is an essential task for implementing complex applications in a service oriented architecture (SOA), such as selecting, composing, and providing services. This task is semantically limited in matching the customer's request with the web services. Furthermore, applying suitable similarity methods to the increasing number of WSs is all the more relevant for efficient web service discovery. To overcome these limitations, we propose a new approach for web service discovery integrating multiple similarity measures and k-means clustering. The approach identifies services more appropriate to the customer's request by calculating different similarity scores between the customer's request and the web services. The global semantic similarity is determined by applying k-means clustering to the obtained similarity scores. The experimental results demonstrated that the proposed semantic web service discovery approach outperforms the state-of-the-art approaches in terms of precision (98%), recall (95%), and F-measure (96%). The proposed approach is efficiently designed to support and facilitate the web service selection and composition phases in complex applications.
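As a hedged illustration of combining several similarity scores with k-means (the paper's exact clustering and selection strategy is not reproduced here), one could cluster services by their score vectors against the request and keep the cluster whose centroid is most similar. All names and numbers below are illustrative.

import numpy as np
from sklearn.cluster import KMeans

# Rows = candidate services, columns = similarity measures against the request
# (e.g., lexical, structural, semantic scores); illustrative values only.
score_matrix = np.array([
    [0.92, 0.88, 0.81],
    [0.15, 0.22, 0.10],
    [0.85, 0.79, 0.90],
    [0.20, 0.30, 0.25],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(score_matrix)
best_cluster = int(np.argmax(km.cluster_centers_.mean(axis=1)))
relevant_services = np.where(km.labels_ == best_cluster)[0]
print(relevant_services)        # indices of services most similar to the request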
... With a combination of the Cosine Similarity Method and a website-based programming language, the similarity testing process can run quickly, accurately, efficiently, and effectively. The results also have very high accuracy [13,14]. ...
Article
Full-text available
Technology development makes everything unlimited, and everyone gets information easily. There are negative and positive impacts. One negative impact is plagiarism of the work of others, which of course has bad consequences. This study aims to examine the similarity of Student Final Reports at the Politeknik Unggul LP3M Medan. The method used is the Cosine Similarity Method. This method was chosen because it works based on mathematical calculations. It works by comparing the final project done by a student with the final projects that already exist. With the Cosine Similarity Method, a percentage of similarity is obtained; if the similarity is high, the final project is suspected of being plagiarized.
... Shahzad et al. [56] use five word-level WordNet semantic similarity measures (Resnik [33], Jiang [57], Leacock [58], Lin [35] and Wu [59] similarity) and three sentence-level aggregation techniques (Greedy Pairing [60], Optimal Matching [61] and Quadratic Assignment Problem [62]) to evaluate the effectiveness of different combinations in the context of Process Model Matching (PMM). They also emphasize the importance of considering semantics in PMM. ...
Article
Full-text available
Many companies have implemented their business processes in Web applications which must be frequently adapted so as to stay aligned with new business process requirements. Service-oriented architectures (SOAs) constitute an appropriate option to manage the continuous changes in those processes by facilitating their alignment with the changing underlying system services. In this context, firms are trying to migrate their Web applications to new software architectures such as SOAs. However, this migration is usually carried out ad-hoc by means of non-reusable and error-prone manual processes. Similarly, the alignment between the business processes and the underlying services identified is usually done by hand. This work presents a model-driven semiautomatic approach to modernize legacy Web applications to SOAs. The approach is focused on an automatic semantic process aimed at discovering the services that can be used to implement the business processes (defined by the companies), then aligning these processes with the underlying services. A semantic algorithm is provided to aid the migration architect during the alignment process. The case study carried out shows that the alignment process results obtained by the semantic algorithm presented in this paper are similar to those obtained by the experts manually. Finally, SOA orchestration artifacts are generated from the semantic algorithm results.
... To overcome the difficult operation and common error of information retrieval methods, an automatic evaluation algorithm based on word cooccurrence [18] was proposed, but common words in sentences affected the scores. Therefore, knowledge- [19] and corpus-based methods were applied to measure the semantic similarity [3] and decrease the error rate. However, the performance of these methods depended on the quality of the knowledge base/corpus. ...
Article
Full-text available
The determination of semantic similarity between sentences is an important component in natural language processing (NLP) tasks such as text retrieval and text summarization. Many approaches have been proposed for estimating sentence similarity, and Siamese neural networks (SNN) provide a better approach. However, the sentence semantic representation, generated by sharing weights in the SNN without any attention mechanism, ignores the different contributions of different words to the overall sentence semantics. Furthermore, the attention operation within only a single sentence neglects interactive semantic influence on similarity estimation. To address these issues, an interactive self-attention (ISA) mechanism is proposed in this paper and integrated with an SNN, named an interactive self-attentive Siamese neural network (ISA-SNN) which is used to verify the effectiveness of ISA. The proposed model obtains the weights of words in a single sentence by means of self-attention and extracts inherent interactive semantic information between sentences via interactive attention to enhance sentence semantic representation. It achieves better performances without feature engineering than other existing methods on three biomedical benchmark datasets (a Pearson correlation coefficient of 0.656 and 0.713/0.658 on DBMI and CDD-ful/-ref, respectively).
... Once the textual content for each student is available, the instructor can then employ natural language processing techniques from artificial intelligence to discover similarities against their own expert textual information. For instance, semantic similarity techniques can be employed on the text produced by learners and the instructor [118,19,82] and can be used as a form of formative assessment for each of the five aforementioned principles. Similarly, the instructor can employ graph-based methods for assessing conceptual similarity [57,78] between their own representations of instructional material and those produced by learners [115]. ...
Conference Paper
Full-text available
Artificial Intelligence is one of the fastest growing disciplines, disrupting many sectors. Originally mainly for computer scientists and engineers, it has been expanding its horizons and empowering many other disciplines contributing to the development of many novel applications in many sectors. These include medicine and health care, business and finance, psychology and neuroscience, physics and biology to mention a few. However, one of the disciplines in which artificial intelligence has not been fully explored and exploited yet is education. In this discipline, many research methods are employed by scholars, lecturers and practitioners to investigate the impact of different instructional approaches on learning and to understand the ways skills and knowledge are acquired by learners. One of these is qualitative research, a scientific method grounded in observations that manipulates and analyses non-numerical data. It focuses on seeking answers to why and how a particular observed phenomenon occurs rather than on its occurrences. This study aims to explore and discuss the impact of artificial intelligence on qualitative research methods. In particular, it focuses on how artificial intelligence has empowered qualitative research methods so far, and how it can be used in education for enhancing teaching and learning.
... For text data, the loss function is (18), indicating that we can use a weighted cosine similarity method [33] to calculate the deviation. ...
Article
An important task in big data integration is to derive accurate data records from noisy and conflicting values collected from multiple sources. Most existing truth finding methods assume that the reliability is consistent on the whole data set, ignoring the fact that different attributes, objects and object groups may have different reliabilities even with respect to the same source. These reliability differences are caused by the hardness differences in obtaining attribute values, non-uniform updates to objects and the differences in group privileges. This paper addresses the problem of how to compute truths by effectively estimating the reliabilities of attributes, objects and object groups in a multi-source heterogeneous data environment. We first propose an optimization framework TFAR, its implementation and Lagrangian duality solution for Truth Finding by Attribute Reliability estimation. We then present a Bayesian probabilistic graphical model TFOR and an inference algorithm applying Collapsed Gibbs Sampling for Truth Finding by Object Reliability estimation. Finally we give an optimization framework TFGR and its implementation for Truth Finding by Group Reliability estimation. All these models lead to a more accurate estimation of the respective attribute, object and object group reliabilities, which in turn can achieve a better accuracy in inferring the truths. Experimental results on both real data and synthetic data show that our methods have better performance than the state-of-the-art truth discovery methods.
... Due to the lack of good metrics that will measure the quality of NCA knowledge, as stated in Vinyals and Le (2015), we performed a qualitative evaluation using the Bilingual Evaluation Understudy (BLEU) score (Papineni, Roukos, Ward, & Zhu, 2002) to measure the similarity between the responses of the two bots to the same question. The BLEU score algorithm compares the N-grams of two text fragments and counts the number of matches; the similarity score of these texts is a function of the number of matches (Corley & Mihalcea, 2005). The outcome of the qualitative evaluation is shown in Table 2. ...
... A metric over a set of documents defines the semantic similarity between them by measuring the direct and indirect relationships [11], [33]. ...
Article
Full-text available
This work focuses on bolstering the pre-existing Interpretable Semantic Textual Similarity (iSTS) method, which will enable a user to understand the behaviour of an artificial intelligent system. The proposed iSTS method explains the similarities and differences between a pair of sentences. The objective of the iSTS problem is to formalize the alignment between a pair of text segments and to label the relationship between the text fragments with a relation type and relatedness score. The overall objective of this work is to develop a 1:M multi-chunk aligner for an iSTS method, which is trained on the SemEval 2016 Task 2 dataset. The obtained result outperforms many state-of-the-art aligners which were part of the SemEval 2016 iSTS task.
... To compute the similarity between two texts using individual word similarities, the words in both texts first have to be aligned by creating word pairs based on semantic similarity and then these similarity scores are combined to yield a similarity measure for the whole text. Corley and Mihalcea [38] propose a text similarity measure, where the most similar word pairs in two texts are determined based on semantic word similarity measures as implemented in the WordNet similarity package [39]. The similarity score of two texts is then computed as the weighted and normalized sum of the single word pairs' similarity scores. ...
Article
Full-text available
More than ever, technical inventions are the symbol of our society's advance. Patents guarantee their creators protection against infringement. For an invention being patentable, its novelty and inventiveness have to be assessed. Therefore, a search for published work that describes similar inventions to a given patent application needs to be performed. Currently, this so-called search for prior art is executed with semi-automatically composed keyword queries, which is not only time consuming, but also prone to errors. In particular, errors may systematically arise by the fact that different keywords for the same technical concepts may exist across disciplines. In this paper, a novel approach is proposed, where the full text of a given patent application is compared to existing patents using machine learning and natural language processing techniques to automatically detect inventions that are similar to the one described in the submitted document. Various state-of-the-art approaches for feature extraction and document comparison are evaluated. In addition to that, the quality of the current search process is assessed based on ratings of a domain expert. The evaluation results show that our automated approach, besides accelerating the search process, also improves the search results for prior art with respect to their quality.
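The citation context preceding this entry summarizes the Corley and Mihalcea text similarity measure: pair each word with its most similar word in the other text using a WordNet-based measure, then take a weighted, normalized sum. The sketch below is one possible rendering of that idea in Python with NLTK's WordNet interface; Wu-Palmer similarity stands in for the WordNet::Similarity measures, and the idf table is assumed to come from some background corpus.

from nltk.corpus import wordnet as wn

def max_word_sim(word, other_words):
    """Highest Wu-Palmer similarity between `word` and any word of the other text."""
    best = 0.0
    for other in other_words:
        for s1 in wn.synsets(word):
            for s2 in wn.synsets(other):
                sim = s1.wup_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
    return best

def directional_sim(t1, t2, idf):
    """Idf-weighted, normalized sum of the best word pairings from t1 into t2."""
    num = sum(max_word_sim(w, t2) * idf.get(w, 1.0) for w in t1)
    den = sum(idf.get(w, 1.0) for w in t1)
    return num / den if den else 0.0

def text_sim(t1, t2, idf=None):
    idf = idf or {}
    return 0.5 * (directional_sim(t1, t2, idf) + directional_sim(t2, t1, idf))

print(text_sim("a cat climbed the tree".split(), "the kitten scaled the oak".split()))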
... Automated extraction from narrative clinical notes has played an important role in meaningful use of EHRs for clinical and translational research (Wang et al. 2018a). The earliest methods to compute the similarity between two sentences used word-to-word similarity methods (Corley and Mihalcea 2005) computed using measures from the WordNet similarity package (Pedersen et al. 2004) as well as simple vector space models (Salton et al. 1975). There are two main resources leveraged for measurement of semantic similarity: massive corpora of text documents (Barzilay and McKeown 2005;Islam and Inkpen 2008) and semantic resources and knowledge bases (Li et al. 2006;Corley 2007) such as WordNet (Miller 1995) and Wikipedia. ...
Article
Full-text available
The adoption of electronic health records (EHRs) has enabled a wide range of applications leveraging EHR data. However, the meaningful use of EHR data largely depends on our ability to efficiently extract and consolidate information embedded in clinical text where natural language processing (NLP) techniques are essential. Semantic textual similarity (STS) that measures the semantic similarity between text snippets plays a significant role in many NLP applications. In the general NLP domain, STS shared tasks have made available a huge collection of text snippet pairs with manual annotations in various domains. In the clinical domain, STS can enable us to detect and eliminate redundant information that may lead to a reduction in cognitive burden and an improvement in the clinical decision-making process. This paper elaborates our efforts to assemble a resource for STS in the medical domain, MedSTS. It consists of a total of 174,629 sentence pairs gathered from a clinical corpus at Mayo Clinic. A subset of MedSTS (MedSTS_ann) containing 1068 sentence pairs was annotated by two medical experts with semantic similarity scores of 0–5 (low to high similarity). We further analyzed the medical concepts in the MedSTS corpus, and tested four STS systems on the MedSTS_ann corpus. In the future, we will organize a shared task by releasing the MedSTS_ann corpus to motivate the community to tackle the real world clinical problems.
Article
Full-text available
Context. Paraphrased textual content, or rewriting, is one of the difficult problems of detecting academic plagiarism. Most plagiarism detection systems are designed to detect common words, sequences of linguistic units, and minor changes, but are unable to detect significant semantic and structural changes. Therefore, most cases of plagiarism using paraphrasing remain unnoticed. The objective of the study is to develop a technology for detecting paraphrasing in text based on a classification model and machine learning methods, through the use of Siamese neural networks based on recurrent and Transformer-type (RoBERTa) models, to analyze the level of similarity of sentences of text content. Method. For this study, the following semantic similarity metrics or indicators were chosen as features: Jaccard coefficient for shared N-grams, cosine distance between vector representations of sentences, Word Mover’s Distance, distances according to WordNet dictionaries, and predictions of two ML models: Siamese neural networks based on recurrent and Transformer-type (RoBERTa) models. Results. An intelligent system for detecting paraphrasing in text based on a classification model and machine learning methods has been developed. The developed system uses the principle of model stacking and feature engineering. Additional features indicate the semantic affiliation of the sentences or the normalized number of common N-grams. An additional fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This specificity of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both the overall classification accuracy and the model’s sensitivity to pairs of sentences that are not paraphrases of each other. Conclusions. The created model shows excellent classification results on PAWS test data: precision – 93%, recall – 92%, F1-score – 92%, accuracy – 92%. The results of the study showed that Transformer-type NNs can be successfully applied to detect paraphrasing in a pair of texts with fairly high accuracy without the need for additional feature generation.
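Two of the hand-crafted features named above, the Jaccard coefficient over shared N-grams and a cosine distance between sentence representations, are straightforward to compute; the sketch below shows simple bag-of-words versions of both (illustrative only; the paper also stacks Word Mover's Distance, WordNet distances, and model predictions).

from collections import Counter
import math

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_ngrams(s1, s2, n=2):
    a, b = ngrams(s1.split(), n), ngrams(s2.split(), n)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cosine_bow(s1, s2):
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

pair = ("the cat sat on the mat", "a cat was sitting on the mat")
features = [jaccard_ngrams(*pair), cosine_bow(*pair)]   # fed to the stacked classifier
print(features)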
Article
Full-text available
The filter bubble phenomenon and its negative societal effects have been extensively explored in the literature in the past decade. However, the ability of modern AI‐based systems to create personalized information bubbles, that is, to classify similar contents and users into clusters according to their interests and behavior, can actually be quite beneficial if utilized and managed properly and ethically. In this article we present ongoing research that aims to refine such bubble‐building smart systems by adopting an ethical, multi‐perspective approach that allows for linking isolated bubbles into a consolidated bubblesphere and offering users a choice to explore diverse bubbles related to their topics of interest. To implement the proposed approach, content matching should be based on diverse similarity, which can be derived from a multi‐viewpoint KOS. In addition, the study explores how such a multi‐viewpoint KOS and bubblesphere can be constructed using Wikidata's ranks and qualifiers.
Article
Full-text available
The creation of automatic e-mail responder systems with human-quality responses is challenging due to the ambiguity of meanings and difficulty in response modeling. In this paper, we present the Personal Email Responder (PER); a novel system for email categorization and semi-automatic response generation. The key novelty presented in this paper is an approach to email categorization that distinguishes query and non-query email messages using Natural Language Processing (NLP) and Neural Network (NN) methods. The second novelty is the use of Artificial Intelligence Markup Language (AIML)-based chatbot for semiautomatic response creation. The proposed methodology was implemented as a prototype mobile application, which was then used to conduct an experiment. Email messages logs collected in the experimental phase are used to evaluate the proposed methodology and estimate the accuracy of the presented system for email categorization and semi-automatic response generation.
Conference Paper
Full-text available
This paper presents an overview of the open access datasets in Serbian that have been manually annotated for the tasks of semantic textual similarity and short-text sentiment classification. In addition, it describes several kinds of statistical models that have been trained and evaluated on these datasets and discusses their results.
Chapter
Nowadays, the Altshuller contradiction matrix is used by many TRIZ practitioners, especially beginners, thanks to its simplicity. However, establishing the link between users' specific problems, issued from their experience in their domain of knowledge, and the matrix is often difficult. Mapping the specific terms of a domain onto the formalized language of TRIZ tools necessitates an expertise that users often do not have time to build. Our previous work, based on Natural Language Processing (NLP) tools and techniques, made it possible to process a corpus of patents from a given field and, thanks to a Topic Modelling technique, to link the technical parameters extracted from patents to their context representation in a vector space. However, this approach is not suitable for identifying the contradictory relations between extracted parameters. For this reason, we applied an antonym identification technique in order to better capture the relations of opposition between extracted parameters. The goal of this research is to automatically extract potential contradictions and set them up in an Altshuller-like matrix. Such an approach could facilitate the application of this famous TRIZ tool to practical users' problems. Moreover, setting up the matrix for patents of a new domain of knowledge could help construct the state of the art for such domains and keep users informed without spending a lot of time and human resources reading and analyzing the large quantities of text appearing continuously in each domain.
Article
Web page segmentation (WPS) aims to break a web page into different segments with coherent intra- and inter-semantics. By evidencing the morpho-dispositional semantics of a web page, WPS has traditionally been used to demarcate informative from non-informative content, but it has also evidenced its key role within the context of non-linear access to web information for visually impaired people. For that purpose, a great number of ad hoc solutions have been proposed that rely on visual, logical, and/or text cues. However, such methodologies highly depend on manually tuned heuristics and are parameter-dependent. To overcome these drawbacks, principled frameworks have been proposed that provide the theoretical bases to achieve optimal solutions. However, existing methodologies only combine few discriminant features and do not define strategies to automatically select the optimal number of segments. In this article, we present a multi-objective clustering technique called MCS that relies on K-means, in which (1) visual, logical, and text cues are all combined in an early fusion manner and (2) an evolutionary process automatically discovers the optimal number of clusters (segments) as well as the correct positioning of seeds. As such, our proposal is parameter-free, combines many different modalities, does not depend on manually tuned heuristics, and can be run on any web page without any constraint. An exhaustive evaluation over two different tasks, where (1) the number of segments must be discovered or (2) the number of clusters is fixed with respect to the task at hand, shows that MCS drastically improves over most competitive and up-to-date algorithms for a wide variety of external and internal validation indices. In particular, results clearly evidence the impact of the visual and logical modalities towards segmentation performance.
Article
This work shows the use of WEKA, a tool that implements the most common machine learning algorithms, to perform a Text Mining analysis on a set of documents. Applying these methods requires initial steps where the text is converted into a structured format. Both the processing phase and the analysis of the transformed dataset, using classification and clustering algorithms, can be carried out entirely with this tool, in a rigorous and simple way. The work describes the construction of two classification models starting from two different sets of documents. These models are not meant to be good or realistic, but just illustrate how WEKA can be used for a Text Mining analysis.
Chapter
Participatory Budget (PB) is a process that distributes part of a city's budget among projects submitted and selected by the dwellers. The key challenge for IT-supported e-PB is the comparison and ranking of projects. This paper focuses on an empirical test of a hybrid method for comparing PB projects. In this study, we investigate two dimensions that are difficult to measure: beneficiaries and categories. We use an ontology to describe and map distances between concepts and then generate a ranking based on the fuzzy TOPSIS method. The method is validated through experiments with annotators and working methods. The results surpass the semantic measure alone, but also show room for further development.
Chapter
Even though machine translation (MT) systems have shown promise for automatic translation, the quality of translations produced by MT systems is still far behind professional human translations (HTs), because of the complexity of grammar and word usage in natural languages. As a result, HTs are still commonly used in practice. Nevertheless, the quality of HTs depends strongly on the skills and knowledge of the translators. How to measure the quality of translations produced by MT systems and human translators in an automatic manner poses many challenges. The traditional way of manually checking translation quality with bilingual speakers is expensive and time-consuming. Therefore, we propose an unsupervised method to assess HT and MT quality without having access to any labelled data. We compare a range of methods which are able to automatically grade the quality of HTs and MTs, and observe that the Bidirectional Minimum Word Mover's Distance (BiMWMD) obtains the best performance on both the HT and MT datasets.
Article
As more and more datasets become available, their utilization in different applications increases in popularity. Their volume and production rate, however, means that their quality and content control is in most cases non-existent, resulting in many datasets that contain inaccurate information of low quality. Especially in the field of conversational assistants, where the datasets come from many heterogeneous sources with no quality assurance, the problem is aggravated. We present here an integrated platform that creates task- and topic-specific conversational datasets to be used for training conversational agents. The platform explores available conversational datasets, extracts information based on semantic similarity and relatedness, and applies a weight-based score function to rank the information based on its value for the specific task and topic. The finalized dataset can then be used for the training of an automated conversational assistant over accurate data of high quality.
Chapter
This chapter discusses the topic of linkage discovery for data and their applications. This chapter enhances a previous study by the authors and includes additional references that pertain to applications of linkage discovery not requiring a glossary framework.
Article
Full-text available
Semantic similarity is used to perceive the meaning of words in textual data and has several applications in the fields of computational linguistics and natural language processing. Semantic similarity and semantic relatedness are often used interchangeably; in semantic relatedness all types of semantic relationships are considered (e.g., has, part-of, contains), whereas in semantic similarity only the "is-a" type of relationship is used. Computing accurate similarity improves the efficiency of query processing and the understanding of textual data. In this work, a novel hybrid approach combining corpus-based and knowledge-based methods is proposed to compute the semantic relatedness among words. In this hybrid method, latent semantic analysis is used to find the concepts and their correlation values, and mapping is done for different words with the help of fuzzy formal concept analysis. To compute semantic relatedness between different words, a fuzzy set similarity measure is used. The proposed approach has been evaluated on the solar domain and attains improved results compared to other baseline measures. This method can be used for any domain as it is a generalized approach for similarity computation.
Chapter
This paper addresses the task of learning sentence similarity on pairs of relevant sentences retrieved from a Quranic Retrieval Application (QRA). With the existing keyword and semantic concept extraction, a long list of relevant verses (sentences) is retrieved that matches the query. However, as Quranic concepts are repeatedly conveyed in scattered sentences, it is important to classify which of the retrieved sentences are similar not only in word function but also in context with subsequent words. Information context on similar sentences is realized with the evaluation of both word similarity and relatedness. This paper proposes a multi-word Term Similarity and Retrieval (mTSR) model that uses an n-gram score function to measure the relatedness of subsequent words. Bigram similarity scores are computed between every pair of the relevant Quranic sentences, which boosts the conventional keyword-matched QRA. A similarity score is established to refine the list of relevant sentences, aimed at helping the user understand the scattered content of the documents. The results are presented to the user as a refined list of similar sentences, by re-ranking the first-retrieved results from a keyword search using the bigram score. When the score was tested on the Malay Quranic Retrieval Application (myQRA) prototype, the refined results accurately matched the manually perceived similar sentences (iS) extracted by the three volunteers.
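A hedged sketch of the kind of bigram-overlap score described above, used to re-rank a keyword-retrieved list; the exact scoring function of the mTSR model may differ, and the query and candidate sentences below are placeholders.

def bigrams(tokens):
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

def bigram_score(query, sentence):
    """Fraction of query bigrams that also appear in the candidate sentence."""
    q, s = bigrams(query.split()), bigrams(sentence.split())
    return len(q & s) / len(q) if q else 0.0

query = "mercy upon the believers"
retrieved = [
    "and he is merciful to the believers",
    "mercy upon the believers is promised",
    "the earth and the heavens",
]
reranked = sorted(retrieved, key=lambda v: bigram_score(query, v), reverse=True)
print(reranked)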
Conference Paper
Semantic similarity is important information with which decision-makers can cluster, classify, or compare documents in text mining. Statistical and topological methods are two major ways to determine semantic similarity. However, conventional methods ignore the time factor when calculating the similarity between documents. It should be highlighted that narrative emotions play a critical role in comparing documents. In this paper, copula-based econometric models, including ARMA and GARCH families, are used to calculate the narrative semantic similarity between documents.
Article
Advances in linked geospatial data, recommender systems, and geographic information retrieval have led to an urgent need to assess the overall semantic relatedness between keyword sets of geographic metadata. In this study, a new model is proposed for computing the semantic relatedness between any two keyword sets of geographic metadata stored in current global spatial data infrastructures. In this model, the overall semantic relatedness is derived by pairing the keywords that are found to be most relevant to each other and averaging their relatedness. To find the most relevant keywords across two keyword sets precisely, the keywords in a keyword set of geographic metadata are divided into three kinds: thesaurus elements, WordNet elements, and statistical elements. The thesaurus-lexical relatedness measure (TLRM), the extended thesaurus-lexical relatedness measure (ETLRM), and the Longest Common Substring method are proposed to compute the semantic relatedness between two thesaurus elements, between two WordNet elements or a thesaurus element and a WordNet element, and between two statistical elements, respectively. A human dataset, the geographic-metadata keyword set relatedness dataset, was created to evaluate the precision of the semantic relatedness measures for keyword sets of geographic metadata. The proposed method was evaluated against the human-generated relatedness judgments and was compared with the Jaccard method and the Vector Space Model. The results demonstrated that the proposed method achieved a high correlation with human judgments and outperformed the existing methods. Finally, the proposed method was applied to quantitatively linked geospatial data.
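The pairing-and-averaging idea above can be sketched as follows: each keyword in one set is matched with its most related keyword in the other set, and the pairwise relatedness values are averaged (here symmetrically over both directions). The `relatedness` callable stands in for whichever of TLRM, ETLRM, or longest-common-substring applies to the element types; the example relatedness function is purely illustrative.

def set_relatedness(set_a, set_b, relatedness):
    """Average, over both directions, of each keyword's best match in the other set."""
    if not set_a or not set_b:
        return 0.0

    def one_way(src, dst):
        best = [max(relatedness(a, b) for b in dst) for a in src]
        return sum(best) / len(best)

    return 0.5 * (one_way(set_a, set_b) + one_way(set_b, set_a))

# Illustrative stand-in relatedness: normalized longest-common-substring length.
def lcs_relatedness(a, b):
    longest = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            longest = max(longest, k)
    return longest / max(len(a), len(b))

print(set_relatedness({"river", "flood"}, {"rivers", "hydrology"}, lcs_relatedness))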
Article
Paraphrase identification is a natural language processing (NLP) problem that involves the determination of whether two text segments have the same meaning. Various NLP applications rely on a solution to this problem, including automatic plagiarism detection, text summarization, machine translation (MT), and question answering. The methods for identifying paraphrases found in the literature fall into two main classes: similarity-based methods and classification methods. This paper presents a critical study and an evaluation of existing methods for paraphrase identification and its application to automatic plagiarism detection. It presents the classes of paraphrase phenomena, the main methods, and the sets of features used by each particular method. All the methods and features used are discussed and enumerated in a table for easy comparison. Their performances on benchmark corpora are also discussed and compared via tables. Automatic plagiarism detection is presented as an application of paraphrase identification. The performances on benchmark corpora of existing plagiarism detection systems able to detect paraphrases are compared and discussed. The main outcome of this study is the identification of word overlap, structural representations, and MT measures as feature subsets that lead to the best performance results for support vector machines in both paraphrase identification and plagiarism detection on corpora. The performance results achieved by deep learning techniques highlight that these techniques are the most promising research direction in this field.
Chapter
Data protection and insider threat detection and prevention are significant steps that organizations should take to enhance their internal security. Data loss prevention (DLP) is an emerging mechanism that is currently being used by organizations to detect and block unauthorized data transfers. Existing DLP approaches, however, face several practical challenges that limit their effectiveness. In this chapter, by extracting and analyzing document content semantics, we present a new DLP approach that addresses many existing challenges. We introduce the notion of a document semantic signature as a summarized representation of the document's semantics. We show that the semantic signature can be used to detect a data leak by experimenting on a public dataset, yielding very encouraging detection effectiveness results, including an average false positive rate (FPR) of 6.71% and an average detection rate (DR) of 84.47%.
Article
As new requirements are introduced and implemented in a software system, developers must identify the set of source code classes which need to be changed. Therefore, past effort has focused on predicting the set of classes impacted by a requirement. In this paper, we introduce and evaluate a new type of information based on the intuition that the set of requirements which are associated with historical changes to a specific class are likely to exhibit semantic similarity to new requirements which impact that class. This new Requirements to Requirements Set (R2RS) family of metrics captures the semantic similarity between a new requirement and the set of existing requirements previously associated with a class. The aim of this paper is to present and evaluate the usefulness of R2RS metrics in predicting the set of classes impacted by a requirement. We consider 18 different R2RS metrics by combining six natural language processing techniques to measure the semantic similarity among texts (e.g., VSM) and three distribution scores to compute overall similarity (e.g., average among similarity scores). We evaluate whether R2RS is useful for predicting impacted classes in combination with and against four other families of metrics that are based upon temporal locality of changes, direct similarity to code, complexity metrics, and code smells. Our evaluation features five classifiers and 78 releases belonging to four large open-source projects, which result in over 700,000 candidate impacted classes. Experimental results show that leveraging R2RS information increases the accuracy of predicting impacted classes by, on average, more than 60% across the various classifiers and projects.
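A hedged sketch of one R2RS-style metric: the new requirement's similarity to each requirement historically associated with a class is computed with some text similarity function, and a distribution score (average here; maximum is another option mentioned above) summarizes the set. The `text_sim` argument is a placeholder for any of the six NLP similarity techniques; all names and examples below are illustrative.

def r2rs(new_req, past_reqs, text_sim, score="avg"):
    """Similarity of a new requirement to the set of requirements of a class."""
    sims = [text_sim(new_req, r) for r in past_reqs]
    if not sims:
        return 0.0
    if score == "avg":
        return sum(sims) / len(sims)
    if score == "max":
        return max(sims)
    raise ValueError(f"unknown distribution score: {score}")

# Illustrative usage with a trivial token-overlap similarity.
def overlap_sim(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

history = ["export report as pdf", "add csv export of reports"]
print(r2rs("allow exporting reports to excel", history, overlap_sim))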
Chapter
While studies investigating the semantic similarity among concepts, sentences and short text fragments have been fruitful, the problem of document-level semantic matching remains largely unexplored due to its complexity. In this paper, we explore the document-level semantic similarity issue in the academic literatures using an interpretable method. To integrally describe the semantics of an article, we construct a topic event model that utilizes multiple information facets, such as the study purposes, methodologies and domains. Furthermore, to better understand the documents and achieve a more accurate similarity comparison, we incorporate external knowledge into the topic event construction and similarity calculation. Our approach achieves significant improvements over state-of-the-art methods.
Conference Paper
Full-text available
This paper will focus on the semantic representation of verbs in computer systems and its impact on lexical selection problems in machine translation (MT). Two groups of English and Chinese verbs are examined to show that lexical selection must be based on interpretation of the sentences as well as selection restrictions placed on the verb arguments. A novel representation scheme is suggested, and is compared to representations with selection restrictions used in transfer-based MT. We see our approach as closely aligned with knowledge-based MT approaches (KBMT), and as a separate component that could be incorporated into existing systems. Examples and experimental results will show that, using this scheme, inexact matches can achieve correct lexical selection.
Conference Paper
Full-text available
This paper describes the PASCAL Network of Excellence Recognising Textual Entailment (RTE) Challenge benchmark. The RTE task is defined as recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other. This application-independent task is suggested as capturing major inferences about the variability of semantic expression which are commonly needed across multiple applications. The Challenge has raised noticeable attention in the research community, attracting 17 submissions from diverse groups, suggesting the generic relevance of the task.
Article
Full-text available
Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (Landauer & Dumais, 1997). The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and, as reported in 3 following articles in this issue, it accurately estimates passage coherence, learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay.
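A minimal LSA-style sketch (an assumed setup, not the original implementation): a term-document matrix is factored with truncated SVD, and similarity is read off as cosine in the reduced latent space.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
]

X = TfidfVectorizer().fit_transform(docs)                            # term-document weights
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)    # latent space
print(cosine_similarity(Z))                                          # document similarity in LSA space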
Article
Full-text available
We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms. Keywords: memory-based learning, natural language learning, edited nearest neighbor classifier, decision-tree learning
Article
Full-text available
Five different proposed measures of similarity or semantic distance in WordNet were experimentally compared by examining their performance in a real-word spelling correction system. It was found that Jiang and Conrath's measure gave the best results overall. That of Hirst and St-Onge seriously over-related, that of Resnik seriously under-related, and those of Lin and of Leacock and Chodorow fell in between.
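Several of the measures compared above are available through NLTK's WordNet interface; a hedged usage example, assuming the WordNet and Brown information-content data are installed, is:

from nltk.corpus import wordnet as wn, wordnet_ic

ic = wordnet_ic.ic("ic-brown.dat")                 # information-content statistics
c1, c2 = wn.synset("car.n.01"), wn.synset("bicycle.n.01")

print("Leacock-Chodorow:", c1.lch_similarity(c2))
print("Resnik:          ", c1.res_similarity(c2, ic))
print("Jiang-Conrath:   ", c1.jcn_similarity(c2, ic))
print("Lin:             ", c1.lin_similarity(c2, ic))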
Article
This paper presents a new measure of semantic similarity in an is-a taxonomy, based on the notion of information content. Experimental evaluation suggests that the measure performs encouragingly well (a correlation of r = 0.79 with a benchmark set of human similarity judgments, with an upper bound of r = 0.90 for human subjects performing the same task), and significantly better than the traditional edge counting approach (r = 0.66). Evaluating semantic relatedness using network representations is a problem with a long history in artificial intelligence and psychology, dating back to the spreading activation approach of Quillian [1968] and Collins and Loftus [1975]. Semantic similarity represents a special case of semantic relatedness: for example, cars and gasoline would seem to be more closely related than, say, cars and bicycles, but the latter pair are certainly more similar. Rada et al. [1989] suggest that the assessment of similarity in semantic n...
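In symbols, the information-content measure described above scores two concepts by the information content of their lowest common subsumer in the taxonomy (a hedged restatement of the standard formulation):

\[ \mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \mathrm{IC}\bigl(\mathrm{lso}(c_1, c_2)\bigr), \qquad \mathrm{IC}(c) = -\log p(c) \]

where lso(c1, c2) is the most specific common ancestor of the two concepts and p(c) is the probability of encountering an instance of concept c in a reference corpus.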
Article
This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human subjects replicate the same task.
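The simplified form of this combined edge/information-content distance, as it is usually cited (a hedged reconstruction; the full formulation also weights edges by link type, depth, and local density), is:

\[ \mathrm{dist}_{\mathrm{JC}}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}\bigl(\mathrm{lso}(c_1, c_2)\bigr) \]

with smaller distances indicating more similar concepts.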
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of HLT-NAACL 2003, pages 16-23, Edmonton, Canada.
Ido Dagan, Bernardo Magnini and Oren Glickman. 2005. The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the PASCAL Challenge Workshop on Recognizing Textual Entailment.
J.R. Landis and G.G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33:159-174.