Norman Meuschke

Norman Meuschke
University of Göttingen | GAUG · Institute of Computer Science

Doctor of Engineering in Computer Science

About

81
Publications
65,097
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,949
Citations
Introduction
My main research interests are methods for semantic similarity analysis and their application for information retrieval. Beyond my core research, I am interested in applied data science and knowledge management challenges and the application of blockchain technology to tackle these challenges. My research spans the fields of: Information Retrieval for text, images, and mathematical content, Plagiarism Detection, Citation and Link Analysis, Blockchain Technology, Information Visualization
Additional affiliations
September 2018 - April 2022
University of Wuppertal
Position
  • Researcher
February 2015 - August 2018
University of Konstanz
Position
  • PhD Student
March 2014 - January 2015
National Institute of Informatics
Position
  • Visiting Researcher

Publications

Publications (81)
Chapter
Full-text available
Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research pap...
Thesis
Full-text available
Identifying academic plagiarism is a pressing problem, among others, for research institutions, publishers, and funding organizations. Detection approaches proposed so far analyze lexical, syntactical, and semantic text similarity. These approaches find copied, moderately reworded, and literally translated text. However, reliably detecting disguise...
Article
Full-text available
This article summarizes the research on computational methods to detect academic plagiarism by systematically reviewing 239 research papers published between 2013 and 2018. To structure the presentation of the research contributions, we propose novel technically oriented typologies for plagiarism prevention and detection efforts, the forms of acade...
Conference Paper
Full-text available
Current plagiarism detection systems reliably find instances of copied and moderately altered text, but often fail to detect strong paraphrases, translations, and the reuse of non-textual content and ideas. To improve upon the detection capabilities for such concealed content reuse in academic publications, we make four contributions: i) We present...
Conference Paper
Full-text available
This paper presents, to our knowledge, the first study on analyzing mathematical expressions to detect academic plagiarism. We make the following contributions. First, we investigate confirmed cases of plagiarism to categorize the similarities of mathematical content commonly found in plagiarized publications. From this investigation, we derive pos...
Chapter
Full-text available
This chapter presents Citation-based Plagiarism Detection (CbPD)—the first plagiarism detection approach that analyzed non-textual content elements. The author started researching CbPD as part of his final year thesis [336], which Bela Gipp supervised after proposing the approach [169]. Bela Gipp’s doctoral thesis [173] presents the approach in det...
Chapter
Full-text available
This chapter provides background information on academic plagiarism and reviews technical approaches to detect it. Section Chapter 1 derives a definition and typology of academic plagiarism that is suitable for the technical research focus of this thesis. Section 2.2 provides a holistic overview of the research on academic plagiarism to contextuali...
Chapter
Full-text available
This chapter is structured as follows. Section 4.1 presents related work on Image-based Plagiarism Detection to point out the research gap we address. Section 4.2 describes typical forms of image similarity we observed in practice.
Chapter
Full-text available
This chapter presents Math-based Plagiarism Detection (MathPD)—a new approach we proposed to improve the detection of academic plagiarism, primarily in the science, technology, engineering, and mathematics (STEM) fields.
Chapter
Full-text available
This chapter concludes the thesis by summarizing our research in Section 7.1, presenting an overview of our research contributions in Section 7.2, and discussing ideas for future research in Section 7.3.
Chapter
Full-text available
This chapter describes the working prototype of a plagiarism detection system that combines the analysis of citation patterns, images, mathematical content, and text. Our system, HyPlag, implements all detection methods presented in Chapters 3–5 and provides a web-based user interface for examining the results.
Preprint
Full-text available
This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair-TEIMMA. Annotating content reuse is particularly useful to develop plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes it challenging to identify such cases. TEIMMA allows entering the ob...
Preprint
Full-text available
This project investigated new approaches and technologies to enhance the accessibility of mathematical content and its semantic information for a broad range of information retrieval applications. To achieve this goal, the project addressed three main research challenges: (1) syntactic analysis of mathematical expressions, (2) semantic enrichment o...
Preprint
Full-text available
Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many, technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few conte...
Chapter
Full-text available
Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many, technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few conte...
Article
Full-text available
Centralized networks inevitably exhibit single points of failure that malicious actors regularly target. Decentralized networks are more resilient if numerous participants contribute to the network’s functionality. Most decentralized networks employ incentive mechanisms to coordinate the participation and cooperation of peers and thereby ensure the...
Chapter
Full-text available
Eine Einführung in die rechtlichen, methodischen und technologischen Probleme und Lösungen für die Prävention und Erkennung wissenschaftlicher Plagiate
Preprint
Full-text available
Researchers and scientists increasingly rely on specialized information retrieval (IR) or recommendation systems (RS) to support them in their daily research tasks. Paper recommender systems are one such tool scientists use to stay on top of the ever-increasing number of academic publications in their field. Improving research paper recommender sys...
Chapter
Full-text available
A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers proposed many methods for flagging online misinformation related to COVID-19. However, these methods pr...
Preprint
Full-text available
We present a free and open-source tool for creating web-based surveys that include text annotation tasks. Existing tools offer either text annotation or survey functionality but not both. Combining the two input types is particularly relevant for investigating a reader's perception of a text which also depends on the reader's background, such as ag...
Preprint
Full-text available
Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Oppos...
Preprint
Full-text available
A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers proposed many methods for flagging online misinformation related to COVID-19. However, these methods pr...
Conference Paper
Full-text available
Neural language models such as BERT allow for human-like text paraphrasing. This ability threatens academic integrity, as it aggravates identifying machine-obfuscated plagiarism. We make two contributions to foster the research on detecting these novel machine-paraphrases. First, we provide the first large-scale dataset of documents paraphrased usi...
Conference Paper
Full-text available
We present a free and open-source tool for creating web-based surveys that include text annotation tasks. Existing tools offer either text annotation or survey functionality but not both. Combining the two input types is particularly relevant for investigating a reader's perception of a text which also depends on the reader's background, such as ag...
Preprint
Full-text available
We present two supervised (pre-)training methods to incorporate gloss definitions from lexical resources into neural language models (LMs). The training improves our models' performance for Word Sense Disambiguation (WSD) but also benefits general language understanding tasks while adding almost no parameters. We evaluate our techniques with seven...
Preprint
Full-text available
Identifying academic plagiarism is a pressing problem, among others, for research institutions, publishers, and funding organizations. Detection approaches proposed so far analyze lexical, syntactical, and semantic text similarity. These approaches find copied, moderately reworded, and literally translated text. However, reliably detecting disguise...
Preprint
Full-text available
The rise of language models such as BERT allows for high-quality text paraphrasing. This is a problem to academic integrity, as it is difficult to differentiate between original and machine-generated content. We propose a benchmark consisting of paraphrased articles using recent language models relying on the Transformer architecture. Our contribut...
Preprint
Full-text available
Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research paper...
Article
Full-text available
Media bias describes differences in the content or presentation of news. It is an ubiquitous phenomenon in news coverage that can have severely negative effects on individuals and society. Identifying media bias is a challenging problem, for which current information systems offer little support. News aggregators are the most important class of sys...
Preprint
Full-text available
Plagiarism detection systems are essential tools for safeguarding academic and educational integrity. However, today's systems require disclosing the full content of the input documents and the document collection to which the input documents are compared. Moreover, the systems are centralized and under the control of individual, typically commerci...
Preprint
Full-text available
In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science,...
Preprint
Full-text available
This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as course-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to improve th...
Chapter
Full-text available
Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classif...
Chapter
Full-text available
We report on an exploratory analysis of the forms of plagiarism observable in mathematical publications, which we identified by investigating editorial notes from zbMATH. While most cases we encountered were simple copies of earlier work, we also identified several forms of disguised plagiarism. We investigated 11 cases in detail and evaluate how c...
Preprint
Full-text available
Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual c...
Preprint
Full-text available
We report on an exploratory analysis of the forms of plagiarism observable in mathematical publications, which we identified by investigating editorial notes from zbMATH. While most cases we encountered were simple copies of earlier work, we also identified several forms of disguised plagiarism. We investigated 11 cases in detail and evaluate how c...
Conference Paper
Full-text available
Identifying plagiarized content is a crucial task for educational and research institutions, funding agencies, and academic publishers. Plagiarism detection systems available for productive use reliably identify copied text, or near-copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, a...
Article
Full-text available
Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in...
Preprint
Full-text available
Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in...
Conference Paper
Full-text available
We present Citolytics - a novel link-based recommendation system for Wikipedia articles. In a preliminary study, Citolytics achieved promising results compared to the widely used text-based approach of Apache Lucene's MoreLikeThis (MLT). In this demo paper, we describe how we plan to integrate Citolytics into the Wikipedia infrastructure by using E...
Preprint
Full-text available
Mathematical expressions can be represented as a tree consisting of terminal symbols, such as identifiers or numbers (leaf nodes), and functions or operators (non-leaf nodes). Expression trees are an important mechanism for storing and processing mathematical expressions as well as the most frequently used visualization of the structure of mathemat...
Conference Paper
Full-text available
Mathematical formulae in academic texts significantly contribute to the overall semantic content of such texts, especially in the fields of Science, Technology, Engineering and Mathematics. Knowing the definitions of the identifiers in mathematical formulae is essential to understand the semantics of the formulae. Similar to the sense-making proces...
Conference Paper
Full-text available
Mathematical expressions can be represented as a tree consisting of terminal symbols, such as identifiers or numbers (leaf nodes), and functions or operators (non-leaf nodes). Expression trees are an important mechanism for storing and processing mathematical expressions as well as the most frequently used visualization of the structure of mathemat...
Conference Paper
Full-text available
Detecting academic plagiarism is a pressing problem, e.g., for educational and research institutions, funding agencies, and academic publishers. Existing plagiarism detection systems reliably identify copied text, or near copies of text, but often fail to detect disguised forms of academic plagiarism, such as paraphrases, translations, and idea pla...
Conference Paper
Full-text available
In this vision paper, we suggest combining two lines of research to study the collective behavior of Wikipedia contributors. The first line of research analyzes Wikipedia's edit history to quantify the quality of individual contributions and the resulting reputation of the contributor. The second line of research surveys Wikipedia contributors to g...
Article
Full-text available
The proportion of information that is exclusively available online is continuously increasing. Unlike physical print media, online news outlets, magazines, or blogs are not immune to retrospective modification. Even significant editing of text in online news sources can easily go unnoticed. This poses a challenge to the preservation of digital cult...
Conference Paper
Full-text available
The amount of news published and read online has increased tremendously in recent years, making news data an interesting resource for many research disciplines , such as the social sciences and linguistics. However, large scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data. We present new...
Conference Paper
Full-text available
Depending on the news source, a reader can be exposed to a different narrative and conflicting perceptions for the same event. Today, news aggregators help users cope with the large volume of news published daily. However, aggregators focus on presenting shared information, but do not expose the different perspectives from articles on same topics....
Article
Full-text available
Aside from improving the visibility and accessibility of scientific publications, many scientific Web repositories also assess researchers' quantitative and qualitative publication performance, e.g., by displaying metrics such as the h-index. These metrics have become important for research institutions and other stakeholders to support impactful d...
Conference Paper
Full-text available
Mathematical formulae are essential in science, but face challenges of ambiguity, due to the use of a small number of identifiers to represent an immense number of concepts. Corresponding to word sense disambiguation in Natural Language Processing, we disambiguate mathematical identifiers. By regarding formulae and natural text as one monolithic in...
Conference Paper
Full-text available
Literature recommender systems support users in filtering the vast and increasing number of documents in digital libraries and on the Web. For academic literature, research has proven the ability of citation-based document similarity measures, such as Co-Citation (CoCit), or Co-Citation Proximity Analysis (CPA) to improve recommendation quality. In...
Conference Paper
Full-text available
This paper compares the search capabilities of a single human brain supported by the text search built into Wikipedia with state-of-the-art math search systems. To achieve this, we compare results of manual Wikipedia searches with the aggregated and assessed results of all systems participating in the NTCIR-12 MathIR Wikipedia Task. For 26 of the 3...
Article
Full-text available
Trusted timestamping is a process for proving that certain information existed at a given point in time. This paper presents a trusted timestamping concept and its implementation in form of a web-based service that uses the decentralized Bitcoin block chain to store anonymous, tamper-proof timestamps for digital content. The service allows users to...
Conference Paper
Full-text available
Citation-based similarity measures such as Bibliographic Coupling and Co-Citation are an integral component of many information retrieval systems. However, comparisons of the strengths and weaknesses of measures are challenging due to the lack of suitable test collections. This paper presents CITREC, an open evaluation framework for citation-based...