Preprint from https://www.gipp.com/pub/
P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents”. In: Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019
AnnoMathTeX - a Formula Identifier Annotation
Recommender System for STEM Documents
Philipp Scharpf¹, Ian Mackerracher¹, Moritz Schubotz², Joeran Beel³, Corinna Breitinger¹, Bela Gipp¹,²
¹ University of Konstanz, Germany (first.last@uni-konstanz.de)
² University of Wuppertal, Germany (last@uni-wuppertal.de)
³ Trinity College Dublin, Ireland (first.last@tcd.ie)
ABSTRACT
Documents from science, technology, engineering and mathematics (STEM) often contain a large number of mathematical formulae alongside text. Semantic search, recommender, and question answering systems require the occurring formula constants and variables (identifiers) to be disambiguated. We present a first implementation of a recommender system that enables and accelerates formula annotation by displaying the most likely candidates for formula and identifier names from four different sources (arXiv, Wikipedia, Wikidata, or the surrounding text). A first evaluation shows that in total, 78% of the formula identifier name recommendations were accepted by the user as a suitable annotation. Furthermore, document-wide annotation saved the user the annotation of ten times more other identifier occurrences. Our long-term vision is to integrate the annotation recommender into the edit-view of Wikipedia and the online LaTeX editor Overleaf.
CCS CONCEPTS
• Information systems → Information retrieval;
KEYWORDS
Information Retrieval, Mathematical Information Retrieval, Recommender Systems, Semantification, Wikipedia/Wikidata
1 INTRODUCTION
Documents from Science, Technology, Engineering, and Mathematics (STEM) often contain numerous mathematical formulae [1], which are crucial to understanding the semantics of the text. If the formula characters (constants or variables) are not annotated, the mathematical statement of a formula cannot be understood or queried. However, if, for example, the formula $S = 1 - \frac{|R|}{|I| \cdot |U|}$ is annotated with {S: sparsity, R: ratings, I: items, U: users}, the characters (in the following referred to as formula identifiers) are translated into words that represent their meaning. This enables semantic search, recommender, and mathematical question answering systems [5] to find documents with formulae that, for example,
• allow calculating sparsity, given ratings, items, and users, or
• contain specific variables, such as ratings and items, or
• relate ratings and users.
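The sparsity example and the kind of structured query it enables can be sketched in a few lines of Python. This is only an illustration: the data layout, the function names `sparsity` and `formulae_with`, the second corpus entry, and the toy numbers are assumptions made for this sketch and are not part of AnnoMathTeX.

```python
# Annotated formulae as plain dictionaries (hypothetical layout for this sketch).
corpus = [
    {
        "latex": r"S = 1 - \frac{|R|}{|I| \cdot |U|}",
        "annotation": {"S": "sparsity", "R": "ratings", "I": "items", "U": "users"},
    },
    {
        "latex": r"E = m c^2",
        "annotation": {"E": "energy", "m": "mass", "c": "speed of light"},
    },
]


def sparsity(num_ratings: int, num_items: int, num_users: int) -> float:
    """Evaluate S = 1 - |R| / (|I| * |U|) for given set sizes."""
    if num_items <= 0 or num_users <= 0:
        raise ValueError("need at least one item and one user")
    return 1.0 - num_ratings / (num_items * num_users)


def formulae_with(names, formulae):
    """Structured query: formulae whose annotation contains all given names."""
    wanted = set(names)
    return [f for f in formulae if wanted <= set(f["annotation"].values())]


# Toy numbers (made up): 50 ratings on 20 items by 10 users.
print(sparsity(50, 20, 10))
# Query: which formulae relate ratings and users?
print([f["latex"] for f in formulae_with(["ratings", "users"], corpus)])
```

Without the annotation, the second function would have nothing to match against: the query works on identifier names, not on the characters S, R, I, and U.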
These are examples of structured queries, which require machine-interpretability of mathematical documents to approach Mathematical Language Understanding (MLU). A large part of today's mathematical knowledge is contained either in research papers (LaTeX) or, in condensed form, in Wikipedia articles (Wikitext). Wikipedia articles are only semi-structured (linked). For the direct retrieval of specific facts and for systematic queries, Wikidata was launched in 2012 [7]. Language-independent items (identified by a unique ID) are linked by properties. In addition to natural language statements, mathematical formulae were transferred from Wikipedia [5] as items with a "defining formula" property that allows a LaTeX formula string as value. However, only a few formulae contain their identifier names. Thus, a large part is not machine-interpretable, that is, it does not allow structured queries.
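To make the "defining formula" property concrete, the following sketch builds a SPARQL request for items that carry such a statement. It is an assumption-laden illustration, not the query used by the authors: the property ID P2534 is assumed here to be Wikidata's "defining formula" property, and the helper names `build_query` and `build_request_url` are hypothetical. The code only constructs the request and does not contact the endpoint.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption of this sketch: P2534 is the "defining formula" property.
DEFINING_FORMULA = "P2534"


def build_query(limit: int = 10) -> str:
    """Return a SPARQL query for items that have a defining formula."""
    return (
        "SELECT ?item ?itemLabel ?formula WHERE {\n"
        f"  ?item wdt:{DEFINING_FORMULA} ?formula .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(limit: int = 10) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": build_query(limit), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


print(build_query(5))
```

A result row of such a query would contain the item, its English label, and the formula string, but no names for the identifiers inside the formula, which is the gap the paragraph above describes.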
Prior research has aimed to extract the identifier meaning from the text that surrounds the formula [6, 3], but all approaches lack an important element: the quality control afforded by a human expert verifier. Annotating large numbers of formulae can be tedious. Since the identifier annotation in a document must be globally consistent, annotating each instance individually should be avoided. We address these shortcomings by introducing an annotation recommender system¹ for formula identifiers at the document level. We evaluate our system's performance while comparing the user's acceptance of recommendations from four different sources.
2 ANNOMATHTEX
The workflow of our system is as follows²: a user uploads a mathematical document in Wikitext or LaTeX format. The system displays the text while highlighting formulae and identifiers. The formulae are located by searching for their environment tags (<math>, $, \begin{equation}, \begin{align}, etc.). Parsing the formulae yields their identifiers, which are then highlighted. If the user clicks on a formula identifier, AnnoMathTeX presents recommendations for its name, which we extracted from four different sources: 1) arXiv: candidates³ extracted from the surrounding text of 60 M formulae; 2) Wikipedia: candidates⁴ extracted from definitions in mathematical English articles; 3) Wikidata: candidates retrieved via a SPARQL query⁵; 4) a surrounding text window of ±5 words around the formula. The recommendations are then generated from static dump lists and ranked by their occurrence frequency in their sources.
¹ System hosted by the Wikimedia Foundation at annomathtex.wmflabs.org
² Demo video available at bit.ly/annomathtex
³ http://ntcir-math.nii.ac.jp/data
⁴ https://en.wikipedia.org/wiki/User:Physikerwelt
⁵ https://query.wikidata.org
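The workflow steps described in this section (locating formulae by their environment tags, extracting identifiers, collecting a ±5 word window, and ranking candidates by frequency) can be sketched as follows. This is a minimal sketch under stated assumptions, not the AnnoMathTeX implementation: the regular expressions cover only a few delimiters, identifiers are naively taken to be single Latin letters, and the stop-word list and all function names are hypothetical.

```python
import re
from collections import Counter

# A few formula delimiters; illustrative, not exhaustive.
FORMULA_PATTERN = re.compile(
    r"<math>(.+?)</math>"
    r"|\\begin\{(equation|align)\}(.+?)\\end\{\2\}"
    r"|\$(.+?)\$",
    re.DOTALL,
)

# Small stop-word list, an assumption of this sketch.
STOP_WORDS = {"the", "of", "is", "a", "an", "for", "and", "in"}


def find_formulae(text):
    """Yield (start, end, latex) for each formula found in the text."""
    for m in FORMULA_PATTERN.finditer(text):
        latex = m.group(1) or m.group(3) or m.group(4)
        yield m.start(), m.end(), latex.strip()


def identifiers(latex):
    """Naive identifier extraction: single Latin letters outside commands."""
    without_commands = re.sub(r"\\[A-Za-z]+", " ", latex)
    seen = []
    for ch in re.findall(r"[A-Za-z]", without_commands):
        if ch not in seen:
            seen.append(ch)
    return seen


def window_candidates(text, start, end, size=5):
    """Words within `size` words before and after the formula span."""
    before = re.findall(r"[A-Za-z]+", text[:start])[-size:]
    after = re.findall(r"[A-Za-z]+", text[end:])[:size]
    words = [w.lower() for w in before + after]
    return [w for w in words if w not in STOP_WORDS]


def rank(candidates, top=10):
    """Rank candidate names by occurrence frequency, most frequent first."""
    return [name for name, _ in Counter(candidates).most_common(top)]


text = "The sparsity $S$ of the rating matrix is $S = 1 - |R| / (|I| |U|)$ for ratings $R$."
formulae = list(find_formulae(text))
start, end, latex = formulae[0]
print(latex, window_candidates(text, start, end))
```

A naive letter-based extraction misses Greek identifiers such as \alpha, since they are written as commands; a real parser would have to treat those separately.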
RecSys 2019, 16th-20th September 2019, Copenhagen, Denmark
Figure 1 shows the recommendation table/matrix. Each column corresponds to one source and is presented to the user in shuffled order with anonymous labels to avoid bias. If no recommendation matches, the user can type in the correct identifier name directly. By default, identifiers are annotated globally, i.e., the annotation is automatically applied to every further occurrence within the document, which enables significant time savings. In the rare case of a double meaning within the same document, a locally different annotation is possible. All annotations made by the user are shown as rows at the top of the document and saved in a separate annotation file. Finally, the user's selection is stored in an evaluation file to compare the usefulness of the four sources.
Figure 1: AnnoMathTeX recommendations for formula identifier annotation.
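The two mechanisms described above, global annotation with an optional local override and source columns shown in shuffled order under anonymous labels, could be modelled roughly as below. The class `AnnotationStore`, the function `anonymised_columns`, and the label format "Source n" are hypothetical names chosen for this sketch; the actual data structures of the system are not described in the text.

```python
import random


class AnnotationStore:
    """Global identifier names with optional per-occurrence overrides."""

    def __init__(self):
        self.global_names = {}  # identifier -> name
        self.local_names = {}   # (identifier, occurrence index) -> name

    def annotate(self, identifier, name, occurrence=None):
        """Annotate globally by default, or locally for one occurrence."""
        if occurrence is None:
            self.global_names[identifier] = name
        else:
            self.local_names[(identifier, occurrence)] = name

    def lookup(self, identifier, occurrence):
        """Local annotation wins; otherwise fall back to the global one."""
        key = (identifier, occurrence)
        return self.local_names.get(key, self.global_names.get(identifier))


def anonymised_columns(recommendations, rng):
    """Shuffle the sources and hide their names behind neutral labels.

    Returns (columns shown to the user, hidden label -> source mapping).
    """
    sources = list(recommendations)
    rng.shuffle(sources)
    hidden = {f"Source {i + 1}": src for i, src in enumerate(sources)}
    shown = {label: recommendations[src] for label, src in hidden.items()}
    return shown, hidden


store = AnnotationStore()
store.annotate("S", "sparsity")                 # applies to every occurrence
store.annotate("S", "entropy", occurrence=7)    # rare double meaning
print(store.lookup("S", 3), store.lookup("S", 7))
```

Keeping the hidden label-to-source mapping separate from what is displayed is what would allow an evaluation file to attribute each accepted recommendation to its source afterwards.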
3 EVALUATION
As a proof of concept, we evaluate the performance of recommendations for formula identifiers, comparing the four sources. Annotating a sample of 100 identifiers from 10 different Wikipedia articles, we find that the acceptance distribution (item coverage) of the sources is {arXiv: 35%, Wikipedia: 16%, Wikidata: 13%, WordWindow: 35%}. Overall, 82% of the recommendations were accepted. On average, the accepted recommendation was ranked third (3.0) out of ten, with a ranking distribution of {arXiv: 2.3, Wikipedia: 4.0, Wikidata: 2.5, WordWindow: 3.1}. We conclude that in most cases, the recommendations are useful, and thus, the system can significantly speed up the annotation process.
Furthermore, 99% of the identifiers could be annotated globally, saving the user 1045 annotations, which is on average 105 per document and 10 per identifier.
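The averages reported above follow directly from the stated totals, as the short check below shows. The variable names are chosen for this sketch. How the overall rank of 3.0 was computed is not stated in the text; the unweighted mean of the four per-source ranks is shown only because it happens to be consistent with that figure.

```python
saved_annotations = 1045
documents = 10
identifiers_annotated = 100

# 1045 / 10 = 104.5, reported as 105 per document.
per_document = saved_annotations / documents
# 1045 / 100 = 10.45, reported as 10 per identifier.
per_identifier = saved_annotations / identifiers_annotated

# Per-source mean ranks of the accepted recommendation, from the text.
mean_rank = {"arXiv": 2.3, "Wikipedia": 4.0, "Wikidata": 2.5, "WordWindow": 3.1}
# Unweighted mean of the four values: 11.9 / 4 = 2.975.
unweighted_mean_rank = sum(mean_rank.values()) / len(mean_rank)

print(per_document, per_identifier, unweighted_mean_rank)
```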
4 CONCLUSION & OUTLOOK
We demonstrated a first recommender for mathematical identifier annotation. Our presented system enables researchers to quickly disambiguate formula identifiers, and thus contributes significantly towards the aim of making mathematical documents machine-interpretable. Converting mathematical language statements encoded in formulae into natural language is a crucial task for enabling semantic search queries, and for improving mathematical recommender and question answering systems.
In a preliminary evaluation, our system suggested correct names for 78% of the examined identifier instances. As a next step, we will implement the possibility to further deepen the annotation by referencing [2]. The user will be able to link formulae and identifiers to items of the semantic knowledge base Wikidata. Tagging documents with these items, i.e., substituting formulae and identifiers by numbers (Wikidata IDs), will yield a sparse semantic "fingerprint" index, which can be queried by ID.
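One way to picture such a "fingerprint" index is an inverted index from item IDs to the documents tagged with them. The sketch below is a guess at the general idea only, since the planned design is not specified in the text. The document names and the IDs are placeholders (deliberately not real Wikidata IDs), and `build_index` and `query` are hypothetical helpers.

```python
from collections import defaultdict

# Documents tagged with item IDs. All IDs here are placeholders, not real Wikidata IDs.
tagged_documents = {
    "doc_recsys": {"Q_sparsity", "Q_rating", "Q_user"},
    "doc_physics": {"Q_energy", "Q_mass"},
    "doc_survey": {"Q_rating", "Q_user"},
}


def build_index(documents):
    """Invert the tagging: item ID -> set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, item_ids in documents.items():
        for item_id in item_ids:
            index[item_id].add(doc_id)
    return index


def query(index, *item_ids):
    """Return the documents tagged with all of the given item IDs."""
    sets = [index.get(item_id, set()) for item_id in item_ids]
    return set.intersection(*sets) if sets else set()


index = build_index(tagged_documents)
print(sorted(query(index, "Q_rating", "Q_user")))
```

The index is sparse in the sense that each document touches only a handful of the possible IDs, so a lookup by ID inspects a small set rather than the full collection.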
Subsequently, we plan to carry out a large-scale user study in which we will evaluate the formula name recommendations from the following sources: 1) surrounding text, 2) a history of manual inserts, 3) a self-created database of annotated formulae, and 4) Wikidata.
Our long-term aim is to directly integrate our annotation recommender into the editing or composing views of both Wikipedia and Overleaf. This would allow the Wikipedia and research communities to be directly included in the semantification process of mathematical articles and research papers (see Figure 2).
Figure 2: Future integration of formula identifier annotation recommendation in Wikipedia articles (Wikitext) and Overleaf documents (LaTeX).
ACKNOWLEDGMENTS
This work was supported by the German Research Foundation
(DFG grant GI-1259-1). We thank the Wikimedia Foundation for
hosting the system.
REFERENCES
[1] R. Hambasan and M. Kohlhase. “Faceted Search for Mathematics”. In: LWA. Vol. 1458. CEUR Workshop Proceedings. CEUR-WS.org, 2015, pp. 33–44.
[2] M. Kohlhase. “Math Object Identifiers - Towards Research Data in Mathematics”. In: LWDA. Vol. 1917. CEUR Workshop Proceedings. CEUR-WS.org, 2017, p. 241.
[3] G. Y. Kristianto, G. Topic, and A. Aizawa. “Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers”. In: D-Lib Magazine 20.11/12 (2014).
[4] P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents”. In: Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019.
[5] M. Schubotz et al. “Introducing MathQA: a Math-Aware question answering system”. In: Information Discovery and Delivery 46.4 (2018), pp. 214–224.
[6] M. Schubotz et al. “Semantification of Identifiers in Mathematics for Better Math Information Retrieval”. In: SIGIR. ACM, 2016, pp. 135–144.
[7] D. Vrandecic and M. Krötzsch. “Wikidata: a free collaborative knowledgebase”. In: Commun. ACM 57.10 (2014), pp. 78–85.
Listing 1: Use the following BibTeX code to cite this article
@InProceedings{Scharpf2019b,
  Title     = {AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents},
  Author    = {Scharpf, Philipp and Mackerracher, Ian and Schubotz, Moritz and Beel, Joeran and Breitinger, Corinna and Gipp, Bela},
  Booktitle = {Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019)},
  Year      = {2019},
  Address   = {Copenhagen, Denmark},
  Month     = {Sept.},
  Publisher = {ACM},
  Topic     = {mathir}
}