Figure 2

Pearson correlation between BERTScore (computed across different layers) and human judgment of the similarity of answer pairs on the SQuAD dev set. BERTScore vanilla is pretrained only on Wikipedia, whereas BERTScore trained is fine-tuned on the STS benchmark dataset (Cer et al., 2017).
Contexts in source publication
Context 1
... the embeddings are typically extracted from the last layer of the model, but they can be extracted from any of its layers, and related work has shown that for some tasks the last layer is not the best (Liu et al., 2019). The experiment visualized in Figure 2 evaluates the correlation between human judgment of semantic answer similarity and a vanilla and a trained BERTScore model. Comparing the extraction of embeddings from the different layers, we find that for the trained model the last layer drastically outperforms all other layers. ...
Context 2
... the vanilla BERTScore model, the choice of the layer has a much smaller influence on the performance, with the first two layers resulting in the strongest correlation with human judgment. For comparison, Figure 2 also includes the results of a cross-encoder model, which does not have the option to choose different layers due to its architecture. ...
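As a rough illustration of the per-layer evaluation described above, the following sketch scores toy answer pairs with the bert-score package, selecting one encoder layer at a time through its num_layers argument, and correlates the resulting F1 scores with hypothetical human judgments. The checkpoint, answer pairs, and labels are placeholder assumptions, not the paper's actual setup.

# Sketch only: per-layer BERTScore vs. hypothetical human similarity labels.
from bert_score import score
from scipy.stats import pearsonr

predictions  = ["Albert Einstein", "the theory of relativity", "Paris"]   # hypothetical predicted answers
references   = ["Einstein", "general relativity", "Lyon"]                 # hypothetical gold answers
human_labels = [1.0, 0.5, 0.0]                                            # hypothetical human judgments

for layer in range(1, 13):  # bert-base-uncased has 12 transformer layers
    _, _, f1 = score(predictions, references,
                     model_type="bert-base-uncased",  # "vanilla" model; an STS-tuned checkpoint would play the "trained" role
                     num_layers=layer)
    r, _ = pearsonr(f1.tolist(), human_labels)
    print(f"layer {layer:2d}: Pearson r = {r:.3f}")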
Citations
... Owing to their lexical nature, the exact and partial matching modes as well as the F1 measure have the drawback that they focus on whether the extracted answer is literally the same as the ground-truth answer rather than on whether it provides equivalent information [70]. For example, consider question Q1 in Fig. 1. ...
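For concreteness, here is a minimal sketch of the two lexical metrics criticised in this excerpt. The normalisation is deliberately simplified (the official SQuAD script also strips articles and punctuation), and the example strings are made up.

# Sketch only: exact match (EM) and token-level F1 compare surface strings,
# so co-referent answers with little word overlap score low.
from collections import Counter

def normalize(text: str) -> list[str]:
    return text.lower().split()

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred), normalize(gold)
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Albert Einstein", "Einstein"))   # 0.0
print(token_f1("Albert Einstein", "Einstein"))      # ~0.667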
By virtue of being prevalently written in natural language (NL), requirements are prone to various defects, e.g., inconsistency and incompleteness. As such, requirements are frequently subject to quality assurance processes. These processes, when carried out entirely manually, are tedious and may further overlook important quality issues due to time and budget pressures. In this paper, we propose QAssist -- a question-answering (QA) approach that provides automated assistance to stakeholders, including requirements engineers, during the analysis of NL requirements. Posing a question and getting an instant answer is beneficial in various quality-assurance scenarios, e.g., incompleteness detection. Answering requirements-related questions automatically is challenging since the scope of the search for answers can go beyond the given requirements specification. To that end, QAssist provides support for mining external domain-knowledge resources. Our work is one of the first initiatives to bring together QA and external domain knowledge for addressing requirements engineering challenges. We evaluate QAssist on a dataset covering three application domains and containing a total of 387 question-answer pairs. We experiment with state-of-the-art QA methods, based primarily on recent large-scale language models. In our empirical study, QAssist localizes the answer to a question to three passages within the requirements specification and within the external domain-knowledge resource with an average recall of 90.1% and 96.5%, respectively. QAssist extracts the actual answer to the posed question with an average accuracy of 84.2%. Keywords: Natural-language Requirements, Question Answering (QA), Language Models, Natural Language Processing (NLP), Natural Language Generation (NLG), BERT, T5.
... The binary grammatical features of questions and answers are learned and aggregated into similarity matrices. For example, Risch et al. [7] designed multiple CNN structures to match question and answer embeddings. CNNs excel at extracting local features, while RNNs focus on contextual information, but long-range dependencies and vanishing gradients remain difficult to address. ...
... Finally, we use a fully connected network to obtain the final features. We use the GESD (Geometric mean of Euclidean and Sigmoid Dot product) method proposed by Risch et al. [7] to calculate the similarity of features, as shown in Formula 10 below. ...
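A minimal sketch of a commonly used GESD formulation (the geometric mean of a Euclidean-distance term and a sigmoid-dot-product term) is shown below; the hyperparameters gamma and c and the exact variant may differ from the one used in the cited work.

# Sketch only: GESD similarity between two feature vectors.
import numpy as np

def gesd(x: np.ndarray, y: np.ndarray, gamma: float = 1.0, c: float = 1.0) -> float:
    euclidean_term = 1.0 / (1.0 + np.linalg.norm(x - y))            # decays with distance
    sigmoid_term = 1.0 / (1.0 + np.exp(-gamma * (np.dot(x, y) + c)))  # grows with dot product
    return euclidean_term * sigmoid_term

a = np.array([0.2, 0.4, 0.1])
b = np.array([0.3, 0.5, 0.0])
print(gesd(a, b))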
Question-answering understanding systems are of central importance to many natural language processing tasks. A successful question-answering system first needs to accurately mine the semantics of the question text and then match the semantic similarity between the question and the answer. Most current pre-trained language models jointly encode questions and answers; by relying on a single unified pre-training step, they avoid extracting features from the multi-level structure of the text, but in doing so they ignore semantic expression at different granularities and levels, which limits semantic understanding. In this paper, we focus on the problem of multi-granularity, multi-level expression of text semantics in question-answer understanding and design a question-answering understanding method based on multi-granularity hierarchical features. First, we extract features from two sources, a traditional language model and a deep matching model, and fuse them to construct a similarity matrix. The similarity matrix is then learned by three different models, and after ranking, the overall similarity is obtained from the similarities of the multi-granularity features. Experiments on the public WikiQA dataset show that adding the multi-granularity hierarchical feature learning method improves results compared with traditional deep learning methods.
... Semantic Answer Similarity (SAS): SAS uses a Transformer-based cross-encoder architecture to evaluate the semantic similarity of two answers, rather than their lexical overlap. SAS is particularly helpful for identifying answers that have no lexical overlap but are still semantically similar [32]. Average Response Time (ART): the average time the QA system takes to respond to a question. ...
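As a rough sketch of how such a cross-encoder similarity score can be computed, the snippet below uses the sentence-transformers CrossEncoder API with the STS checkpoint mentioned elsewhere on this page; the answer pairs are invented, and this is not necessarily the exact SAS implementation.

# Sketch only: cross-encoder similarity between predicted and gold answers.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-large")
pairs = [
    ("The vaccine was approved in December 2020.", "Approval came in late 2020."),
    ("The vaccine was approved in December 2020.", "The trial enrolled 40,000 people."),
]
scores = model.predict(pairs)   # one similarity score per answer pair
for (a, b), s in zip(pairs, scores):
    print(f"{s:.3f}  {a!r} vs {b!r}")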
Aviation flight crews rely on a large number of complex standard documents and operation manuals when performing flight tasks. To relieve the pressure of manually retrieving documents, intelligent question-answering technology based on reading comprehension is gradually being applied. In this paper, a SQuAD-style dataset of flight crew operation manuals is built and studied, on the basis of which the reader-retriever framework of a text-content-based question answering system (TCQA) is analyzed and established. Experiments compare the relevant metrics of the QA system for different combinations of reader and retriever models using the open-source tool Haystack. Based on a comparison of response speed and retrieval capability, the best model combination for the flight crew operation manual dataset is identified, and suggestions are made for improving model performance.
... The goal of STS is to establish the extent to which the meanings of two short texts are similar to each other, which is typically encoded as a numerical score on a Likert scale. The similarity scores can subsequently be used in more complex tasks, such as Question Answering (Risch et al., 2021) or Text Summarisation (Mnasri et al., 2017). ...
This paper presents the Serbian datasets developed within the project Advancing Novel Textual Similarity-based Solutions in Software Development (AVANTES), intended for the study of Cross-Level Semantic Similarity (CLSS). CLSS measures the level of semantic overlap between texts of different lengths, and it also refers to the problem of establishing such a measure automatically. The problem was first formulated about a decade ago, but research on it has been sparse and limited to English. The AVANTES project aims to change this through the study of CLSS in Serbian, focusing on two text domains (newswire and software code comments) and two text-length combinations (phrase-sentence and sentence-paragraph). We present and compare two newly created datasets, describing the process of their annotation with fine-grained semantic similarity scores and outlining a preliminary linguistic analysis. We also give an overview of the ongoing detailed linguistic annotation targeted at detecting the core linguistic indicators of CLSS.
... In addition, it is preferable to have an automatic, simple metric as opposed to expensive, manual annotation or a highly configurable and parameterisable metric so that the development and the hyperparameter tuning do not add more layers of complexity. SAS, a cross-encoder-based metric for the estimation of semantic answer similarity [1], provides one such metric to compare answers based on semantic similarity. ...
... Firstly, lexical-based metrics are not well suited for automated QA model evaluation as they lack a notion of context and semantics. Secondly, most metrics, specifically SAS and BERTScore, as described in [1], find some data types more difficult to assess for similarity than others. ...
... The authors of [1] expand on this idea and further address the issues with existing evaluation metrics for general machine translation and natural language generation (NLG), which also cover generative and extractive QA. These issues include reliance on string-based methods, such as EM, F1-score, and top-n-accuracy. ...
... After familiarising ourselves with the current state of research in the field in Section 2, we describe the datasets provided in [1] and the new dataset of names that we purposefully tailor to our model in Section 3. This is followed by Section 4, introducing the four new semantic answer similarity approaches described in [1], our fine-tuned model as well as three lexical n-gram-based automated metrics. ...
There are several issues with the existing general machine translation and natural language generation evaluation metrics, and question-answering (QA) systems are no different in that regard. To build robust QA systems, we need equally robust evaluation systems to verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics, as opposed to pure string overlap, is important to compare models fairly and to indicate more realistic acceptance criteria in real-life applications. We build upon, to our knowledge, the first paper that uses transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder-augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used both for training and as a benchmark.
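The bi-encoder side of such an approach can be sketched as follows: each answer string is embedded independently and the embeddings are compared by cosine similarity. The checkpoint and the name pairs are illustrative assumptions, not the authors' trained model or dataset.

# Sketch only: bi-encoder cosine similarity between co-referent name strings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
answers_a = ["Joe Biden", "Barack Obama"]
answers_b = ["Joseph R. Biden Jr.", "George W. Bush"]

emb_a = model.encode(answers_a, convert_to_tensor=True)
emb_b = model.encode(answers_b, convert_to_tensor=True)
cosine = util.cos_sim(emb_a, emb_b)            # pairwise similarity matrix
for i, a in enumerate(answers_a):
    print(a, "vs", answers_b[i], "->", float(cosine[i][i]))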
... The level of semantic similarity is commonly expressed as a numerical score on a Likert scale. Establishing such similarity measurements is an integral part of various Natural Language Processing (NLP) tasks, such as Information Retrieval (Hliaoutakis et al., 2006), Question Answering (Risch et al., 2021), Text Summarization (Mnasri, de Chalendar, and Ferret, 2017), etc. Semantic similarity tasks typically focus on texts of similar length, such as individual words (Rubenstein and Goodenough, 1965), word senses (Budanitsky and Hirst, 2006), or sentences (Li et al., 2006). A well-known task of this sort is Semantic Textual Similarity (STS) (Corley and Mihalcea, 2005; Mihalcea, Corley, and Strapparava, 2006; Islam and Inkpen, 2008), popularized via a series of SemEval shared tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017). ...
Cross-Level Semantic Similarity (CLSS) is a measure of the level of semantic overlap between texts of different lengths. Although this problem was formulated almost a decade ago, research on it has been sparse and limited exclusively to the English language. In this paper, we present the first CLSS dataset in another language, in the form of CLSS.news.sr, a corpus of 1000 phrase-sentence and 1000 sentence-paragraph newswire text pairs in Serbian, manually annotated with fine-grained semantic similarity scores using a 0-4 similarity scale. We describe the methodology of data collection and annotation, and compare the resulting corpus to its preexisting counterpart in English, SemEval CLSS, following up with a preliminary linguistic analysis of the newly created dataset. State-of-the-art pre-trained language models are then fine-tuned and evaluated on the CLSS task in Serbian using the produced data, and their settings and results are discussed. The CLSS.news.sr corpus and the guidelines used in its creation are made publicly available.
... Besides, it is unknown whether a metric pre-trained on one dataset can generalize well to new domains (T. Zhang et al., 2020). Notable mentions in this group include RUSE (Shimanaka et al., 2018), BERTScore (T. Zhang et al., 2020), and SAS (Risch et al., 2021). ...
The research on open-domain, knowledge-grounded dialogue systems has been advancing rapidly due to the paradigm shift introduced by large language models (LLMs). While these strides have improved the performance of dialogue systems, the scope is mostly monolingual and English-centric. The lack of multilingual in-task dialogue data further discourages research in this direction. This thesis explores the use of transfer learning techniques to extend English-centric dialogue systems to multiple languages. In particular, this work focuses on five typologically diverse languages, such that well-performing models could generalize to languages in the same families as the target languages, hence widening the accessibility of the systems to speakers of various languages. I propose two approaches: a Multilingual Retrieval-Augmented Dialogue Model and a Multilingual Generative Dialogue Model. The retrieval-augmented model is adopted from a pre-trained multilingual question answering (QA) system and comprises a neural retriever and a multilingual generation model. Prior to response generation, the retriever fetches relevant knowledge and conditions the generator on the retrievals as part of the dialogue context. This approach can incorporate knowledge into conversational agents, thus improving the factual accuracy of a dialogue model. In addition, the retrieval-augmented model has an advantage over the generative model because of its modularity, which allows the fusion of QA and dialogue systems so long as appropriate pre-trained models are employed. The generative model, on the other hand, takes advantage of an existing English dialogue model and performs a zero-shot cross-lingual transfer by training sequentially on English dialogue and multilingual QA datasets. Both automated and human evaluation were carried out to measure the models' performance against a machine translation baseline. The results showed that one of the two models significantly outperformed the other and surpassed the baseline in most metrics, particularly in terms of relevance and engagingness. While its performance was promising to some extent, a detailed analysis revealed that the generated responses were not actually grounded in the retrieved paragraphs. Suggestions were offered to mitigate the issue, which hopefully could lead to significant progress in multilingual knowledge-grounded dialogue systems in the future.
... Accuracy is defined as the proportion of correctly classified items, either as relevant or as irrelevant (Teufel 2007). The SAS metric (Risch et al. 2021) takes into account whether the meaning of a predicted answer is similar to the annotated answer, rather than just comparing the exact words. We employ the Transformer-based pre-trained model "cross-encoder/stsb-RoBERTa-large" to determine the semantic similarity of two answers. ...
Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. Due to the growing literature on COVID-19, it is hard to get precise, up-to-date information about the virus. Practitioners, front-line workers, and researchers require expert-specific methods to stay current on scientific knowledge and research findings. However, so many research papers are being written on the subject that it is hard to keep up with the most recent research. This problem motivates us to propose the design of the COVID-19 Search Engine (CO-SE), an algorithmic system that finds relevant documents for each user query and answers complex questions by searching a large corpus of publications. The CO-SE has a retriever component based on a TF-IDF vectorizer that retrieves the relevant documents from the system. It also has a reader component, a Transformer-based model, which reads the paragraphs and finds the answers related to the query in the retrieved documents. The proposed model outperforms previous models, obtaining an exact match ratio score of 71.45% and a semantic answer similarity score of 78.55%. It also outperforms baselines on other benchmark datasets, demonstrating the generalizability of the proposed approach.
... The SAS metric [66] takes into account whether the meaning of a predicted answer is similar to the annotated gold answer, rather than just comparing the exact words as in other IR measures (F1 score, EM). We employ "cross-encoder/stsb-RoBERTa-large", a Transformer model, to determine the semantic similarity of the two answers. ...
Background
Due to the growing amount of COVID-19 research literature, medical experts, clinical scientists, and researchers frequently struggle to stay up to date on the most recent findings. There is a pressing need to assist researchers and practitioners in mining and responding to COVID-19-related questions in a timely manner.
Methods
This paper introduces CoQUAD, a question-answering system that can extract answers to COVID-19-related questions in an efficient manner. Two datasets are provided in this work: a reference-standard dataset built using the CORD-19 and LitCOVID initiatives, and a gold-standard dataset prepared by experts from the public health domain. CoQUAD has a Retriever component based on the BM25 algorithm that searches the reference-standard dataset for documents relevant to a COVID-19-related question. CoQUAD also has a Reader component consisting of a Transformer-based model, namely MPNet, which reads the paragraphs and finds the answers to a question in the retrieved documents. In comparison to previous works, the proposed CoQUAD system can answer questions related to early, mid, and post-COVID-19 topics.
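A minimal retriever-reader sketch in the spirit of this setup is given below, using the rank_bm25 package for retrieval and a Hugging Face question-answering pipeline as the reader; the corpus, question, and reader checkpoint are illustrative assumptions rather than the actual CoQUAD components.

# Sketch only: BM25 retriever feeding an extractive Transformer reader.
from rank_bm25 import BM25Okapi
from transformers import pipeline

corpus = [
    "SARS-CoV-2 is the virus that causes COVID-19.",
    "BM25 is a ranking function used by search engines.",
    "Vaccines reduce the risk of severe COVID-19.",
]
question = "Which virus causes COVID-19?"

# Retriever: rank documents by BM25 score and keep the best match.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
top_docs = bm25.get_top_n(question.lower().split(), corpus, n=1)

# Reader: extract the answer span from the retrieved context.
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")
print(reader(question=question, context=top_docs[0]))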
Results
Extensive experiments on the CoQUAD Retriever and Reader modules show that CoQUAD can provide effective and relevant answers to COVID-19-related questions posed in natural language, with a higher level of accuracy than previous systems. When compared to state-of-the-art baselines, CoQUAD outperforms the previous models, achieving an exact match ratio score of 77.50% and an F1 score of 77.10%.
Conclusion
CoQUAD is a question-answering system that mines COVID-19 literature using natural language processing techniques to help the research community find the most recent findings and answer any related questions.