# Moritz SchubotzUniversität Konstanz | Uni-Konstanz · Department of Computer and Information Science

Moritz Schubotz

Ph.D. 2017

## About

124

Publications

36,259

Reads

**How we measure 'reads'**

A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more

1,283

Citations

Introduction

Moritz Schubotz currently works at the Department of Computer and Information Science, Universität Konstanz. Moritz does research in Information Science, Data Mining and Artificial Intelligence. Their current project is 'Math Information Retrieval'.

Additional affiliations

March 2012 - present

Education

October 2007 - September 2011

October 2006 - September 2007

## Publications

Publications (124)

The Strategic Research and Innovation Agenda (SRIA) of the European Commission identifies Knowledge Graphs (KGs) as one of the most important technologies for building an interoperability framework and enabling data exchange among users across countries, sectors, and disciplines [1]. KG is a graph-structured knowledge base containing a terminology...

Citation-based Information Retrieval (IR) methods for scientific documents have proven effective for IR applications, such as Plagiarism Detection or Literature Recommender Systems in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often employ mathematical concepts through formula no...

We present »PhysWikiQuiz«, a physics question generation and test engine. Our system, hosted by Wikimedia, utilizes Wikidata, an open, community-managed database, to acquire physics knowledge. For the teacher's input of a given concept name, it produces personalized questions for each individual student. Subsequently, it uses a Computer Algebra Sys...

In scientific publications, citations allow readers to assess the authenticity of the presented information and verify it in the original context. News articles, however, for various reasons do not contain citations and only rarely refer readers to further sources. As a result, readers often cannot assess the authenticity of the presented informati...

We tackle the problem of neural machine translation of mathematical formulae between ambiguous presentation languages and unambiguous content languages. Compared to neural machine translation on natural language, mathematical formulae have a much smaller vocabulary and much longer sequences of symbols, while their translation requires extreme preci...

This demo paper presents the first tool to annotate the reuse of text, images, and mathematical formulae in a document pair-TEIMMA. Annotating content reuse is particularly useful to develop plagiarism detection algorithms. Real-world content reuse is often obfuscated, which makes it challenging to identify such cases. TEIMMA allows entering the ob...

This project investigated new approaches and technologies to enhance the accessibility of mathematical content and its semantic information for a broad range of information retrieval applications. To achieve this goal, the project addressed three main research challenges: (1) syntactic analysis of mathematical expressions, (2) semantic enrichment o...

It allows any two parties that are either both on the same network or connected via the internet to transfer the contents of a file based on a particular sequence of words. Peer discovery happens via multicast DNS if both peers are on the same network or via entries in the distributed hash table (DHT) of the InterPlanetary File-System (IPFS) if bot...

Citation-based Information Retrieval (IR) methods for scientific documents have proven effective for IR applications, such as Plagiarism Detection or Literature Recommender Systems in academic disciplines that use many references. In science, technology, engineering, and mathematics, researchers often employ mathematical concepts through formula no...

Centralized networks inevitably exhibit single points of failure that malicious actors regularly target. Decentralized networks are more resilient if numerous participants contribute to the network’s functionality. Most decentralized networks employ incentive mechanisms to coordinate the participation and cooperation of peers and thereby ensure the...

The increasing number of questions on Question Answering (QA) platforms like Math Stack Exchange (MSE) signifies a growing information need to answer math-related questions. However, there is currently very little research on approaches for an open data QA system that retrieves mathematical formulae using their concept names or querying formula ide...

Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypothes...

Since the COVID-19 outbreak, the use of digital learning or education platforms has significantly increased. Teachers now digitally distribute homework and provide exercise questions. In both cases, teachers need to continuously develop novel and individual questions. This process can be very time-consuming and should be facilitated and accelerated...

Recent years have witnessed growing consolidation of web operations. For example, the majority of web traffic now originates from a few organizations, and even micro-websites often choose to host on large pre-existing cloud infrastructures. In response to this, the "Decentralized Web" attempts to distribute ownership and operation of web services m...

Wikipedia combines the power of AI solutions and human reviewers to safeguard article quality. Quality control objectives include detecting malicious edits, fixing typos, and spotting inconsistent formatting. However, no automated quality control mechanisms currently exist for mathematical formulae. Spell checkers are widely used to highlight textu...

Good research data management is essential in modern-day lab work. Various solutions exist that are either highly specific or need a significant effort to be customized appropriately. This paper presents an integrated solution for individuals and small groups of researchers in data-driven deductive research. Our electronic lab book generates itself...

Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypothes...

Digital mathematical libraries assemble the knowledge of years of mathematical research. Numerous disciplines (e.g., physics, engineering, pure and applied mathematics) rely heavily on compendia gathered findings. Likewise, modern research applications rely more and more on computational solutions, which are often calculated and verified by compute...

Digital mathematical libraries assemble the knowledge of years of mathematical research. Numerous disciplines (e.g., physics, engineering, pure and applied mathematics) rely heavily on compendia gathered findings. Likewise, modern research applications rely more and more on computational solutions, which are often calculated and verified by compute...

Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Oppos...

Documents from Science, Technology, Engineering, and Mathematics (STEM) disciplines usually contain a significant amount of mathematical formulae alongside text. Some Mathematical Information Retrieval (MathIR) systems, e.g., Mathematical Question Answering (MathQA), exploit knowledge from Wikidata. Therefore, the mathematical information needs to...

In scientific publications, citations allow readers to assess the authenticity of the presented information and verify it in the original context. News articles, however, for various reasons do not contain citations and only rarely refer readers to further sources. As a result, readers often cannot assess the authenticity of the presented informati...

We have developed an automated procedure for symbolic and numerical testing of formulae extracted from the NIST Digital Library of Mathematical Functions (DLMF). For the NIST Digital Repository of Mathematical Formulae, we have developed conversion tools from semantic LaTeX to the Computer Algebra System (CAS) Maple which relies on Youssef's part-o...

Document preparation systems like LaTeX offer the ability to render mathematical expressions as one would write these on paper. Using LaTeX, LaTeXML, and tools generated for use in the National Institute of Standards (NIST) Digital Library of Mathematical Functions, semantically enhanced mathematical LaTeX markup (semantic LaTeX) is achieved by usi...

Mathematical formulae carry complex and essential semantic information in a variety of formats. Accessing this information with different systems requires a standardized machine-readable format that is capable of encoding presentational and semantic information. Even though MathML is an official recommendation by W3C and an ISO standard for represe...

Document subject classification is essential for structuring (digital) libraries and allowing readers to search within a specific field. Currently, the classification is typically made by human domain experts. Semi-supervised Machine Learning algorithms can support them by exploiting the labeled data to predict subject classes for unclassified new...

In our experiment, we created a cluster of containers in Docker to evaluate a private IPFS cluster for an academic data store focusing on availability, GET/PUT performance, and storage needs. As sample data, we used PDF files to analyze the data transport in our peer-to-peer network with Wireshark. We found that a bandwidth of at least 100 kbit/s i...

Ten years ago, the Mathematics Subject Classification MSC 2010 was released, and a corresponding machine-readable Linked Open Data collection was published using the Simple Knowledge Organization System (SKOS). Now, the new MSC 2020 is out. This paper recaps the last ten years of working on machine-readable MSC data and presents the new machine-rea...

Ten years ago, the Mathematics Subject Classification MSC 2010 was released, and a corresponding machine-readable Linked Open Data collection was published using the Simple Knowledge Organization System (SKOS). Now, the new MSC 2020 is out. This paper recaps the last ten years of working on machine-readable MSC data and presents the new machine-rea...

We present zbMATH Open, the most comprehensive collection of reviews and bibliographic metadata of scholarly literature in mathematics. Besides our website https://zbMATH.org which is openly accessible since the beginning of this year, we provide API endpoints to offer our data. The API improves interoperability with others, i.e., digital libraries...

We present procd [pʁoːst], a python implementation for privacy preserving contact discovery. procd is a trustless solution that requires neither plaintext numbers nor hashes of single phone numbers to retrieve contacts. Instead, we transfer hashed combinations of multiple phone numbers, which increases the effort for dictionary attacks.

Mathematical information retrieval (MathIR) applications such as semantic formula search and question answering systems rely on knowledge-bases that link mathematical expressions to their natural language names.
For database population, mathematical formulae need to be annotated and linked to semantic concepts, which is very time-consuming.
In this...

The zbMATH database contains more than 4 million bibliographic entries. We aim to provide easy access to these entries. Therefore, we maintain different index structures, including a formula index. To optimize the findability of the entries in our database, we continuously investigate new approaches to satisfy the information needs of our users. We...

Authors of research papers in the fields of mathematics, and other math-heavy disciplines commonly employ the Mathematics Subject Classification (MSC) scheme to search for relevant literature. The MSC is a hierarchical alphanumerical classification scheme that allows librarians to specify one or multiple codes for publications. Digital Libraries in...

Scientists increasingly rely on computer algebra systems and digital mathematical libraries to compute, validate, or experiment with mathematical formulae. However, the focus in digital mathematical libraries and scientific documents often lies more on an accurate presentation of the formulae rather than providing uniform access to the semantic inf...

Word embedding, which represents individual words with semantically fixed-length vectors, has made it possible to successfully apply deep learning to natural language processing tasks such as semantic role-modeling, question answering, and machine translation. As math text consists of natural text, as well as math expressions that similarly exhibit...

Authors of research papers in the fields of mathematics, and other math-heavy disciplines commonly employ the Mathematics Subject Classification (MSC) scheme to search for relevant literature. The MSC is a hierarchical alphanumerical classification scheme that allows librarians to specify one or multiple codes for publications. Digital Libraries in...

Plagiarism detection systems are essential tools for safeguarding academic and educational integrity. However, today's systems require disclosing the full content of the input documents and the document collection to which the input documents are compared. Moreover, the systems are centralized and under the control of individual, typically commerci...

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science,...

Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classificat...

This poster summarizes our contributions to Wikimedia's processing pipeline for mathematical formulae. We describe how we have supported the transition from rendering formulae as course-grained PNG images in 2001 to providing modern semantically enriched language-independent MathML formulae in 2020. Additionally, we describe our plans to improve th...

Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classif...

Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation...

Many sectors, like finance, medicine, manufacturing, and education, use blockchain applications to profit from the unique bundle of characteristics of this technology. Blockchain technology (BT) promises benefits in trustability, collaboration, organization, identification, credibility, and transparency. In this paper, we conduct an analysis in whi...

In scientific publications, citations allow readers to assess the authenticity of the presented information and verify it in the original context. News articles, however, do not contain citations and only rarely refer readers to further sources. Readers often cannot assess the authenticity of the presented information as its origin is unclear. We p...

Documents from science, technology, engineering and mathematics (STEM) often contain a large number of mathematical formulae alongside text. Semantic search, recommender, and question answering systems require the occurring formula constants and variables (identifiers) to be disambiguated. We present a first implementation of a recommender system t...

Handelte es sich bei dem Beitrag des Präsidenten der Freien Universität Berlin Günther M. Ziegler “Die Bedeutung der Verlage wandelt sich” wieder nur um ein “Loblied auf “Open Access" als Zukunft des wissenschaftlichen Publizierens”? Diesen Vorwurf jedenfalls machte Wolfgang Sander dem Präsidenten in seinem kurzen Kommentar vom 12. Juni 2019. Doch...