Florian Matthes’s research while affiliated with Technical University of Munich and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (516)


Hit Rate over all RAG configurations, where er = Ensemble Retriever, cpr = Child-Parent-Retriever, icl = In-Context-Learning, mq = Multi-Query.
Metric: Faithfulness evaluated by GPT-4 on a scale of 1-5.
Towards Optimizing a Retrieval Augmented Generation using Large Language Model on Academic Data
  • Preprint
  • File available

November 2024 · 44 Reads · Gentrit Fazlija · [...] · Florian Matthes

Given the growing trend of many organizations integrating Retrieval Augmented Generation (RAG) into their operations, we assess RAG on domain-specific data and test state-of-the-art models across various optimization techniques. We incorporate four optimizations: Multi-Query, Child-Parent-Retriever, Ensemble Retriever, and In-Context-Learning, to enhance functionality and performance in the academic domain. We focus on data retrieval, specifically targeting various study programs at a large technical university. We additionally introduce a novel evaluation approach, the RAG Confusion Matrix, designed to assess the effectiveness of various configurations within the RAG framework. By exploring the integration of both open-source (e.g., Llama2, Mistral) and closed-source (GPT-3.5 and GPT-4) Large Language Models, we offer valuable insights into the application and optimization of RAG frameworks in domain-specific contexts. Our experiments show a significant performance increase when including multi-query in the retrieval phase.
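As an illustration of the multi-query optimization that the abstract credits with the largest gain, the following minimal sketch rephrases the user question, retrieves for each variant, and merges the hits. It is a hypothetical illustration, not the paper's implementation: the `embed` and `rephrase` callables (an embedding model and an LLM-based paraphraser) are assumed stubs.

```python
from typing import Callable
import numpy as np

def multi_query_retrieve(
    question: str,
    corpus: list[str],
    embed: Callable[[str], np.ndarray],         # assumed embedding model
    rephrase: Callable[[str, int], list[str]],  # assumed LLM paraphraser
    k: int = 5,
) -> list[str]:
    """Retrieve documents for the original question and several LLM-generated
    rephrasings, keeping each document's best similarity score."""
    queries = [question] + rephrase(question, 3)
    doc_vecs = np.stack([embed(d) for d in corpus])
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-9
    best: dict[int, float] = {}
    for q in queries:
        q_vec = embed(q)
        q_vec /= np.linalg.norm(q_vec) + 1e-9
        sims = doc_vecs @ q_vec
        for idx in np.argsort(-sims)[:k]:
            best[idx] = max(best.get(idx, -1.0), float(sims[idx]))
    ranked = sorted(best, key=best.get, reverse=True)
    return [corpus[i] for i in ranked[:k]]
```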


Enhancing Answer Attribution for Faithful Text Generation with Large Language Models

October 2024 · 73 Reads

The increasing popularity of Large Language Models (LLMs) in recent years has changed the way users interact with and pose questions to AI-based conversational systems. An essential aspect for increasing the trustworthiness of generated LLM answers is the ability to trace the individual claims from responses back to relevant sources that support them, a process known as answer attribution. While recent work has started exploring the task of answer attribution in LLMs, some challenges still remain. In this work, we first perform a case study analyzing the effectiveness of existing answer attribution methods, with a focus on subtasks of answer segmentation and evidence retrieval. Based on the observed shortcomings, we propose new methods for producing more independent and contextualized claims for better retrieval and attribution. The new methods are evaluated and shown to improve the performance of answer attribution components. We end with a discussion and outline of future directions for the task.
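The two subtasks examined in the case study, answer segmentation and evidence retrieval, can be made concrete with the short sketch below. It is a simplified illustration under assumed components (a regex sentence splitter and an `embed` stub), not one of the methods proposed in the paper.

```python
import re
from typing import Callable
import numpy as np

def attribute_answer(
    answer: str,
    sources: list[str],
    embed: Callable[[str], np.ndarray],  # assumed embedding model
) -> list[tuple[str, str, float]]:
    """Split the answer into claim-like sentences and attach the most
    similar source passage to each claim (cosine similarity)."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    src_vecs = np.stack([embed(s) for s in sources])
    src_vecs /= np.linalg.norm(src_vecs, axis=1, keepdims=True) + 1e-9
    attributions = []
    for claim in claims:
        c = embed(claim)
        c /= np.linalg.norm(c) + 1e-9
        sims = src_vecs @ c
        best = int(np.argmax(sims))
        attributions.append((claim, sources[best], float(sims[best])))
    return attributions
```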


Figure 2: Number of papers in each category of the arXiv dataset.
Figure 3: FusionSent micro F1 scores for few-shot classification on 8 different datasets using either extensive label descriptions or simple label names. We report the average score over the random training splits of each dataset using |N| = 8 training examples per class.
Efficient Few-shot Learning for Multi-label Classification of Scientific Documents with Many Classes

October 2024 · 12 Reads

Scientific document classification is a critical task and often involves many classes. However, collecting human-labeled data for many classes is expensive and usually leads to label-scarce scenarios. Moreover, recent work has shown that sentence embedding model fine-tuning for few-shot classification is efficient, robust, and effective. In this work, we propose FusionSent (Fusion-based Sentence Embedding Fine-tuning), an efficient and prompt-free approach for few-shot classification of scientific documents with many classes. FusionSent uses available training examples and their respective label texts to contrastively fine-tune two different sentence embedding models. Afterward, the parameters of both fine-tuned models are fused to combine the complementary knowledge from the separate fine-tuning steps into a single model. Finally, the resulting sentence embedding model is frozen to embed the training instances, which are then used as input features to train a classification head. Our experiments show that FusionSent significantly outperforms strong baselines by an average of 6.0 F1 points across multiple scientific document classification datasets. In addition, we introduce a new dataset for multi-label classification of scientific documents, which contains 183,565 scientific articles and 130 classes from the arXiv category taxonomy. Code and data are available at https://github.com/sebischair/FusionSent.
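The fusion step described in the abstract, merging the parameters of the two contrastively fine-tuned sentence encoders into a single model, can be sketched as a plain weight average. This is an illustrative sketch assuming two PyTorch modules with identical architectures; the actual implementation is in the linked repository.

```python
import torch

def fuse_models(model_a: torch.nn.Module, model_b: torch.nn.Module,
                weight: float = 0.5) -> torch.nn.Module:
    """Average the parameters of two identically shaped models
    (e.g., the two contrastively fine-tuned sentence encoders),
    then freeze the fused model so it can serve as a fixed embedder."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    fused_state = {}
    for name, param_a in state_a.items():
        if param_a.is_floating_point():
            fused_state[name] = weight * param_a + (1.0 - weight) * state_b[name]
        else:
            fused_state[name] = param_a  # integer buffers are copied as-is
    model_a.load_state_dict(fused_state)
    for p in model_a.parameters():
        p.requires_grad = False
    return model_a
```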


Thinking Outside of the Differential Privacy Box: A Case Study in Text Privatization with Language Model Prompting

October 2024 · 22 Reads

The field of privacy-preserving Natural Language Processing has risen in popularity, particularly at a time when concerns about privacy grow with the proliferation of Large Language Models. One solution consistently appearing in recent literature has been the integration of Differential Privacy (DP) into NLP techniques. In this paper, we take a critical view of these approaches, discussing the restrictions that DP integration imposes and bringing to light the challenges that such restrictions entail. To accomplish this, we focus on DP-Prompt, a recent method for text privatization leveraging language models to rewrite texts. In particular, we explore this rewriting task in multiple scenarios, both with DP and without DP. To drive the discussion on the merits of DP in NLP, we conduct empirical utility and privacy experiments. Our results demonstrate the need for more discussion on the usability of DP in NLP and its benefits over non-DP approaches.
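For background on the mechanism under discussion: DP-Prompt derives its privacy guarantee from how each token is sampled during the rewrite, with logits clipped to a bounded range and a sampling temperature tied to the per-token privacy budget (an instance of the exponential mechanism). The sketch below is a simplified rendering of that general construction, not the paper's code; the clipping bounds and budget are arbitrary assumptions.

```python
import numpy as np

def dp_sample_token(logits: np.ndarray, epsilon_per_token: float,
                    clip_min: float = -5.0, clip_max: float = 5.0) -> int:
    """Sample one token under the exponential mechanism: clip the logits to a
    bounded range, then sample from a softmax whose temperature is derived
    from the per-token privacy budget (larger budget -> lower temperature)."""
    clipped = np.clip(logits, clip_min, clip_max)
    sensitivity = clip_max - clip_min
    temperature = 2.0 * sensitivity / epsilon_per_token
    probs = np.exp((clipped - clipped.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```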


Figure 1: Architectural components of the conversational exploratory search system.
Figure 2: Comparison of accuracy and F1-scores for three topic classification approaches.
Figure 3: Comparison of rating distribution between the conversational and graphical user interface (UI) for readability, correctness, usefulness, and overall satisfaction. The thick line intersecting the box marks the median.
Figure 4: Conversational search flow illustrated as dialogue states (S1-S7). The three-phase search process encompasses: first, identifying a research topic (S3); second, choosing clusters of publications (S4); and third, comparing publications via short summaries (S5-S6).
Figure 5: Semantic data model of the scholarly knowledge graph.
Conversational Exploratory Search of Scholarly Publications Using Knowledge Graphs

October 2024 · 3 Reads

Traditional search methods primarily depend on string matches, while semantic search targets concept-based matches by recognizing underlying intents and contextual meanings of search terms. Semantic search is particularly beneficial for discovering scholarly publications where differences in vocabulary between users' search terms and document content are common, often yielding irrelevant search results. Many scholarly search engines have adopted knowledge graphs to represent semantic relations between authors, publications, and research concepts. However, users may face challenges when navigating these graphical search interfaces due to the complexity and volume of data, which impedes their ability to discover publications effectively. To address this problem, we developed a conversational search system for exploring scholarly publications using a knowledge graph. We outline the methodical approach for designing and implementing the proposed system, detailing its architecture and functional components. To assess the system's effectiveness, we employed various performance metrics and conducted a human evaluation with 40 participants, demonstrating how the conversational interface compares against a graphical interface with traditional text search. The findings from our evaluation provide practical insights for advancing the design of conversational search systems.
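To make the knowledge-graph-backed retrieval step concrete, the toy sketch below resolves a topic mentioned in the conversation to matching publications via a triple lookup, the kind of step a dialogue state such as "identify a research topic" would trigger. The schema, predicates, and identifiers are invented placeholders, not the semantic data model from Figure 5.

```python
# A toy scholarly knowledge graph as subject-predicate-object triples.
# Predicates like "has_topic" and "authored_by" are hypothetical.
TRIPLES = [
    ("pub:1", "has_topic", "conversational search"),
    ("pub:1", "authored_by", "author:a1"),
    ("pub:2", "has_topic", "knowledge graphs"),
    ("pub:2", "authored_by", "author:a1"),
]

def find_publications(topic: str, triples=TRIPLES) -> list[str]:
    """Resolve a user's topic mention to publication nodes via graph lookup."""
    return [s for s, p, o in triples if p == "has_topic" and topic.lower() in o]

print(find_publications("knowledge graphs"))  # -> ['pub:2']
```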



AGB-DE: A Corpus for the Automated Legal Assessment of Clauses in German Consumer Contracts

August 2024 · 1 Read · 2 Citations

Legal tasks and datasets are often used as benchmarks for the capabilities of language models. However, openly available annotated datasets are rare. In this paper, we introduce AGB-DE, a corpus of 3,764 clauses from German consumer contracts that have been annotated and legally assessed by legal experts. Together with the data, we present a first baseline for the task of detecting potentially void clauses, comparing the performance of an SVM baseline with three fine-tuned open language models and the performance of GPT-3.5. Our results show the challenging nature of the task, with no approach exceeding an F1-score of 0.54. While the fine-tuned models often performed better with regard to precision, GPT-3.5 outperformed the other approaches with regard to recall. An analysis of the errors indicates that one of the main challenges could be the correct interpretation of complex clauses, rather than the decision boundaries of what is permissible and what is not.
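As a point of reference for the SVM baseline mentioned above, a minimal void-clause classifier could look like the sketch below (TF-IDF features plus a linear SVM via scikit-learn). The toy clauses, labels, and feature settings are invented placeholders; the actual AGB-DE corpus and baseline configuration are described in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: clauses labeled 1 (potentially void) or 0 (valid).
clauses = [
    "Der Anbieter haftet nicht für Schäden jeglicher Art.",
    "Die Lieferung erfolgt innerhalb von 14 Tagen.",
]
labels = [1, 0]

# TF-IDF unigrams/bigrams feeding a linear SVM, a common text-classification baseline.
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
baseline.fit(clauses, labels)
print(baseline.predict(["Eine Haftung des Anbieters ist ausgeschlossen."]))
```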


Bridging Information Gaps in Dialogues With Grounded Exchanges Using Knowledge Graphs

August 2024 · 6 Reads

Knowledge models are fundamental to dialogue systems for enabling conversational interactions, which require handling domain-specific knowledge. Ensuring effective communication in information-providing conversations entails aligning user understanding with the knowledge available to the system. However, dialogue systems often face challenges arising from semantic inconsistencies in how information is expressed in natural language compared to how it is represented within the system's internal knowledge. To address this problem, we study the potential of large language models for conversational grounding, a mechanism to bridge information gaps by establishing shared knowledge between dialogue participants. Our approach involves annotating human conversations across five knowledge domains to create a new dialogue corpus called BridgeKG. Through a series of experiments on this dataset, we empirically evaluate the capabilities of large language models in classifying grounding acts and identifying grounded information items within a knowledge graph structure. Our findings offer insights into how these models use in-context learning for conversational grounding tasks and common prediction errors, which we illustrate with examples from challenging dialogues. We discuss how the models handle knowledge graphs as a semantic layer between unstructured dialogue utterances and structured information items.
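As a rough illustration of the in-context-learning setup evaluated above, the sketch below builds a few-shot prompt that asks a model to label the grounding act of an utterance and to return the grounded item as JSON. The label names, example, and `call_llm` stub are assumptions for illustration; they do not reproduce the BridgeKG annotation scheme or the paper's prompts.

```python
from typing import Callable

# Hypothetical few-shot example; not taken from the BridgeKG corpus.
FEW_SHOT = (
    'Utterance: "Which venue published this paper?"\n'
    "Grounding act: request\n"
    'Grounded item: {"property": "publishedIn"}\n'
)

def classify_grounding(utterance: str, call_llm: Callable[[str], str]) -> str:
    """Assemble a few-shot prompt and return the model's raw labeling,
    which downstream code would parse into an act label and a JSON item."""
    prompt = (
        "Label the grounding act of the final utterance and return the "
        "grounded knowledge-graph item as JSON.\n\n"
        + FEW_SHOT
        + f'\nUtterance: "{utterance}"\n'
    )
    return call_llm(prompt)
```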



Figure 1: Performance comparison of precision, recall, and F1-score by grounding act for the Llama-3-70B model with all input utterances (n=all).
Figure 2: Count of predictions in JSON-LD format with valid properties, valid values, or identical content for evaluated models in zero- (Z) and few-shot (F) settings.
Figure 3: Confusion matrices for few-shot classification results of GPT-4o with three input utterances and Llama-3-70B with all input utterances.


Citations (38)


... More recent research has expanded the scope of LLM evaluation to include domain-specific applications. Afzal et al. [30] evaluated eleven models across medical, scientific, and governmental datasets, prioritizing the domain adaptation capabilities of LLMs. Their findings indicate that smaller LLMs can achieve competitive performance in domain-shift tasks even with minimal training samples, underscoring the potential of compact models in resource-limited or highly specialized settings. ...

Reference:

Evaluating Large Language Models on Financial Report Summarization: An Empirical Study
AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization
  • Citing Conference Paper
  • January 2024

... The second one intends to evaluate LLMs' performance in clinical question-answering tasks. Besides MedQA (Jin et al., 2021) and MedMCQA (Pal et al., 2022), many recent benchmarks have been built to test the medical knowledge of LLMs in different aspects (Korgul et al., 2023; Vladika et al., 2024; Shoham & Rappoport, 2024). For example, Chen et al. (2024c) and [...] have developed QA benchmarks to assess the diagnostic performance of LLMs in rare diseases. ...

MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering
  • Citing Conference Paper
  • January 2024

... Consequently, automatically categorizing scientific documents must be approached as a multi-label classification problem over large label spaces. Previous works approach this task either in an unsupervised (Shen et al., 2018; Salatino et al., 2019; Mustafa et al., 2021; Toney and Dunham, 2022; Schopf and Matthes, 2024) or in a fully supervised (Gialitsis et al., 2022; Sadat and Caragea, 2022; E. Mendoza et al., 2022; Schopf et al., 2023) manner. ...

NLP-KG: A System for Exploratory Search of Scientific Literature in Natural Language Processing

... Implementations of D01.1 and D01.2 have already been given in related papers [18,28], while the convenience of D01.3 has also been thoroughly explained in [38]. Similarly, the design choices outlined in Section 4.2 with regard to DPP granularity (R-02) have implementations in other papers such as [20] (since it describes how to use VCs to control access to a system, as required in D02.1), [28] (for D02.2) and [18,19] (for D02.3). Similarly, option D03.2 (change of a DID controller) is an issue that has already been discussed in Section B.12 of the DID recommendations [35], while D03.3 (VC transfers in DPP) becomes a subcase of D03.2 when a DPP jointly supports DID and ISO-15459 identifiers, as suggested in D01.1, since in that case each product will have its own DID and the current product owner will be the DID controller. ...

A Universal System for OpenID Connect Sign-ins with Verifiable Credentials and Cross-Device Flow
  • Citing Conference Paper
  • May 2024

... A more common setup involves retrieving the references either before the answer generation or after generating it (Malaviya et al., 2024). When attributing claims to scientific sources, the more recent and better-cited publications were found to be the most trustworthy evidence (Vladika and Matthes, 2024). Some approaches to the problem include finetuning smaller LMs on NLP datasets (Yue et al., 2023) or using human-in-the-loop methods (Kamalloo et al., 2023). ...

Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval
  • Citing Conference Paper
  • January 2024

... These systems have progressed significantly in recent years, largely driven by the rapid adoption of large language models (LLMs). A growing body of research focuses on augmenting conversational search systems with LLMs (Schneider et al., 2024c), including utterance understanding (Kuhn et al., 2023), dialogue management (Friedman et al., 2023), knowledge retrieval (Lewis et al., 2020), and response generation (Sekulic et al., 2024; Schneider et al., 2024b). While LLMs hold great potential for conversational search systems, they are not without shortcomings. ...

Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components

... While Fernandes et al. (2019) proposed an early implementation of metric DP, [...] were the first to design a word-level MLDP mechanism for static word embeddings. Ensuing works aim to improve word-level methods through various means, including differing metrics (Xu et al., 2020), nearest neighbor mapping (Xu et al., 2021b; Meisenbacher et al., 2024a), or noise mechanisms (Xu et al., 2021a; Carvalho et al., 2023). Other works focus on the selection of words to privatize (Yue et al., 2021; Chen et al., 2022). ...

1-Diffractor: Efficient and Utility-Preserving Text Obfuscation Leveraging Word-Level Metric Differential Privacy

... Similarly, mutual information can measure the amount of information shared between two variables or messages, but it cannot specifically measure which information poses a threat to privacy. To define a suitable metric, exploiting differential privacy [15] could be a promising direction. The rationale behind differential privacy is a mathematical guarantee: when individual data records are added to or removed from a dataset, the distribution of the output results remains virtually unchanged. ...

A Comparative Analysis of Word-Level Metric Differential Privacy: Benchmarking The Privacy-Utility Trade-off
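The "mathematical definition" referenced in the excerpt is the standard ε-differential-privacy guarantee, stated here for completeness:

```latex
% A randomized mechanism M satisfies \varepsilon-differential privacy if, for all
% neighboring datasets D and D' (differing in a single record) and every set of
% outputs S:
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```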

... LLM-based evaluation metrics are known to perform on par with human evaluation [2]. Since it is costly to perform human evaluation over such a wide range of heuristics, we employ GPT-4 as a proxy evaluator to narrow down the search space and thus find the optimal configurations for RAG. ...

Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human-in-the-Loop
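To make the proxy-evaluation setup concrete, a minimal LLM-as-judge call for a 1-5 faithfulness scale could look like the sketch below. It assumes the openai Python client (v1 interface) and an invented rubric prompt; it is not the evaluation prompt used in the cited work.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_faithfulness(answer: str, context: str, model: str = "gpt-4") -> str:
    """Ask the model to rate how faithful the answer is to the retrieved
    context on a 1-5 scale (1 = unsupported, 5 = fully supported)."""
    prompt = (
        "Rate the faithfulness of the answer to the context on a scale of 1-5.\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\nReply with a single number."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```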