Preprint from https://www.gipp.com/pub/
P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents”. In: Proceedings of
the 13th ACM Conference on Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019
AnnoMathTeX - a Formula Identifier Annotation
Recommender System for STEM Documents
Philipp Scharpf¹, Ian Mackerracher¹, Moritz Schubotz²,
Joeran Beel³, Corinna Breitinger¹, Bela Gipp¹,²
¹ University of Konstanz, Germany (first.last@uni-konstanz.de)
² University of Wuppertal, Germany (last@uni-wuppertal.de)
³ Trinity College Dublin, Ireland (first.last@tcd.ie)
ABSTRACT
Documents from science, technology, engineering, and mathematics (STEM) often
contain a large number of mathematical formulae alongside text. Semantic search,
recommender, and question answering systems require the constants and variables
occurring in formulae (identifiers) to be disambiguated. We present a first
implementation of a recommender system that enables and accelerates formula
annotation by displaying the most likely candidates for formula and identifier
names from four different sources (arXiv, Wikipedia, Wikidata, and the
surrounding text). A first evaluation shows that, in total, 78% of the formula
identifier name recommendations were accepted by the user as a suitable
annotation. Furthermore, document-wide annotation saved the user from annotating
about ten further occurrences per identifier. Our long-term vision is to
integrate the annotation recommender into the edit view of Wikipedia and the
online LaTeX editor Overleaf.
CCS CONCEPTS
• Information systems → Information retrieval;
KEYWORDS
Information Retrieval, Mathematical Information Retrieval, Recommender Systems,
Semantification, Wikipedia/Wikidata
1 INTRODUCTION
Documents from science, technology, engineering, and mathematics (STEM) often
contain numerous mathematical formulae [1], which are crucial to understanding
the semantics of the text. If the formula characters (constants or variables)
are not annotated, the mathematical statement of a formula cannot be understood
or queried. However, if, for example, the formula
$S = 1 - \frac{|R|}{|I| \cdot |U|}$ is annotated with
{S: sparsity, R: ratings, I: items, U: users}, the characters (in the following
referred to as formula identifiers) are translated into words that represent
their meaning. This enables semantic search, recommender, and mathematical
question answering systems [5] to find documents with formulae that, for example,
• allow calculating sparsity, or
• allow calculating sparsity, given ratings, items, and users, or
• contain specific variables, such as ratings and items, or
• relate ratings and users.
These are examples of structured queries, which require mathematical documents
to be machine-interpretable in order to approach Mathematical Language
Understanding (MLU). A large part of today's mathematical knowledge is contained
either in research papers (LaTeX) or, in condensed form, in Wikipedia articles
(Wikitext). Wikipedia articles are only semi-structured (linked). To support the
direct retrieval of specific facts and systematic queries, Wikidata was launched
in 2012 [7]. Language-independent items (identified by a unique ID) are linked
by properties. In addition to natural language statements, mathematical formulae
were transferred from Wikipedia [5] as items with a "defining formula" property
that allows a LaTeX formula string as its value. However, only a few formulae
contain their identifier names. Thus, a large part is not machine-interpretable,
that is, it does not allow structured queries.
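The structured queries listed in the introduction can be illustrated on a toy index of annotated formulae. The following is a minimal sketch and not part of AnnoMathTeX: the record layout (`AnnotatedFormula`, `lhs`, `names`) and the query helpers are hypothetical, and only the sparsity formula and its annotation are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedFormula:
    latex: str   # LaTeX formula string
    lhs: str     # identifier on the left-hand side (hypothetical field)
    names: dict  # identifier -> annotated name


# Toy index with the single example formula from the introduction.
INDEX = [
    AnnotatedFormula(
        latex=r"S = 1 - \frac{|R|}{|I| \cdot |U|}",
        lhs="S",
        names={"S": "sparsity", "R": "ratings", "I": "items", "U": "users"},
    ),
]


def calculates(index, target):
    """Formulae whose left-hand side is annotated with `target`."""
    return [f for f in index if f.names.get(f.lhs) == target]


def calculates_given(index, target, given):
    """Formulae that yield `target` and need no quantity outside `given`."""
    result = []
    for f in calculates(index, target):
        inputs = {name for ident, name in f.names.items() if ident != f.lhs}
        if inputs <= set(given):
            result.append(f)
    return result


def contains(index, names):
    """Formulae in which every name in `names` occurs (also covers 'relate')."""
    return [f for f in index if set(names) <= set(f.names.values())]


hits = calculates_given(INDEX, "sparsity", ["ratings", "items", "users"])
print([f.latex for f in hits])
```

Without the `names` mapping, none of these queries could be answered from the LaTeX string alone, which is the gap the annotation closes.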
Prior research has aimed to extract the identifier meaning from the text that
surrounds the formula [6, 3], but all approaches lack an important element: the
quality control afforded by a human expert verifier. Annotating a large number
of formulae can be tedious. Since the identifier annotation in a document must
be globally consistent, annotating each instance individually should be avoided.
We address these shortcomings by introducing an annotation recommender system¹
for formula identifiers at the document level. We evaluate our system's
performance by comparing the user's acceptance of recommendations from four
different sources.
2 ANNOMATHTEX
The workflow of our system is as follows²: a user uploads a mathematical
document in Wikitext or LaTeX format. The system displays the text while
highlighting formulae and identifiers. The formulae are located by searching for
their environment tags (<math>, $, \begin{equation}, \begin{align}, etc.).
Parsing the formulae yields their identifiers, which are then highlighted. If
the user clicks on a formula identifier, AnnoMathTeX presents recommendations
for its name, which we extracted from four different sources: 1) arXiv:
candidates³ extracted from the surrounding text of 60 M formulae; 2) Wikipedia:
candidates⁴ extracted from definitions in English mathematical articles;
3) Wikidata: candidates retrieved via a SPARQL query⁵; 4) a surrounding text
window of ±5 words around the formula. The recommendations are then generated
from static dump lists and ranked by the occurrence frequency in their sources.
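The first steps of this workflow can be sketched as follows. This is a simplified illustration under stated assumptions, not the AnnoMathTeX implementation: the regular expressions, the single-letter identifier heuristic, the stop-word list, and all function names are hypothetical, and the word-window candidates are ranked by their frequency within the window only.

```python
import re
from collections import Counter

# Delimiters named in the text; a real parser would cover more environments.
FORMULA_RE = re.compile(
    r"<math>(.+?)</math>"
    r"|\\begin\{(?:equation|align)\}(.+?)\\end\{(?:equation|align)\}"
    r"|\$(.+?)\$",
    re.DOTALL,
)
# Naive assumption: an identifier is a single Latin letter that is neither
# part of a word nor part of a LaTeX command such as \frac.
IDENTIFIER_RE = re.compile(r"(?<![\\A-Za-z])[A-Za-z](?![A-Za-z])")
STOPWORDS = {"the", "of", "is", "for", "and", "a", "an"}


def find_formulae(text):
    """Return (start, end, body) for every delimited formula in `text`."""
    spans = []
    for match in FORMULA_RE.finditer(text):
        body = next(g for g in match.groups() if g is not None)
        spans.append((match.start(), match.end(), body.strip()))
    return spans


def identifiers(formula):
    """Distinct identifiers of a formula, in order of first occurrence."""
    return list(dict.fromkeys(IDENTIFIER_RE.findall(formula)))


def window_candidates(text, start, end, size=5):
    """Name candidates from `size` words before and after the formula."""
    before = re.findall(r"[A-Za-z]+", text[:start])[-size:]
    after = re.findall(r"[A-Za-z]+", text[end:])[:size]
    words = [w.lower() for w in before + after if w.lower() not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common()]


text = (
    r"The sparsity of the matrix is $S = 1 - \frac{|R|}{|I| \cdot |U|}$ "
    r"for ratings, items and users."
)
start, end, formula = find_formulae(text)[0]
print(identifiers(formula))
print(window_candidates(text, start, end))
```

The single-letter heuristic misses multi-letter and indexed identifiers; it only serves to show how delimiter search, identifier extraction, and the ±5 word window fit together.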
¹ System hosted by the Wikimedia Foundation at annomathtex.wmflabs.org
² Demo video available at bit.ly/annomathtex
³ http://ntcir-math.nii.ac.jp/data
⁴ https://en.wikipedia.org/wiki/User:Physikerwelt
⁵ https://query.wikidata.org
RecSys 2019, 16th-20th September 2019, Copenhagen, Denmark
Figure 1 shows the recommendation table (matrix). Each column corresponds to
one source; the columns are presented to the user in shuffled order and under
anonymous labels to avoid bias. If no recommendation matches, the user can type
in the correct identifier name directly. By default, identifiers are annotated
globally, that is, the annotation is applied automatically to every further
occurrence within the document, which saves significant time. In the rare case
that an identifier has two meanings within the same document, a locally
different annotation is possible. All annotations made by the user are shown as
rows at the top of the document and saved in a separate annotation file.
Finally, the user's selection is stored in an evaluation file to compare the
usefulness of the four sources.
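The interplay of global annotation, local overrides, and anonymised source columns can be sketched as below. This is a hedged illustration only: `AnnotationStore`, `anonymised_columns`, the occurrence index, and the "Source n" labels are assumptions made for the example and are not taken from the AnnoMathTeX code base.

```python
import random


class AnnotationStore:
    """Global identifier annotations with optional per-occurrence overrides."""

    def __init__(self):
        self.global_names = {}  # identifier -> name
        self.local_names = {}   # (identifier, occurrence index) -> name

    def annotate(self, identifier, name, occurrence=None):
        # Without an occurrence index the annotation applies document-wide.
        if occurrence is None:
            self.global_names[identifier] = name
        else:
            self.local_names[(identifier, occurrence)] = name

    def resolve(self, identifier, occurrence):
        # A local annotation takes precedence over the global one.
        key = (identifier, occurrence)
        return self.local_names.get(key, self.global_names.get(identifier))


def anonymised_columns(recommendations, rng):
    """Shuffle the source columns and replace source names by neutral labels."""
    sources = list(recommendations)
    rng.shuffle(sources)
    labels = [f"Source {i}" for i in range(1, len(sources) + 1)]
    columns = {label: recommendations[src] for label, src in zip(labels, sources)}
    key = dict(zip(labels, sources))  # hidden from the user, used for evaluation
    return columns, key


store = AnnotationStore()
store.annotate("S", "sparsity")               # global: covers every occurrence
store.annotate("S", "entropy", occurrence=7)  # local: double meaning at one place
print(store.resolve("S", 0), store.resolve("S", 7))
```

Keeping the label-to-source key separate from the displayed columns is one way to record which source an accepted recommendation came from without revealing it to the annotator.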
Figure 1: AnnoMathTeX recommendations for formula identifier annotation.
3 EVALUATION
As a proof of concept, we evaluate the recommendations for formula identifiers
by comparing the four sources. Annotating a sample of 100 identifiers from 10
different Wikipedia articles, we find that the acceptance distribution (item
coverage) of the sources is {arXiv: 35%, Wikipedia: 16%, Wikidata: 13%,
WordWindow: 35%}. Overall, 78% of the recommendations were accepted. On average,
the accepted recommendation was ranked third (3.0) out of ten, with a ranking
distribution of {arXiv: 2.3, Wikipedia: 4.0, Wikidata: 2.5, WordWindow: 3.1}.
We conclude that the recommendations are useful in most cases and that the
system can thus significantly speed up the annotation process.
Furthermore, 99% of the identifiers could be annotated globally, saving the user
1045 annotations: on average 105 per document and 10 per identifier.
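The reported averages follow from the totals, assuming they are taken over the 10 articles and the 100 sampled identifiers of the evaluation. The variable names below are introduced for this check only.

```python
documents = 10            # Wikipedia articles in the sample
identifiers = 100         # annotated identifiers in the sample
saved_annotations = 1045  # occurrences covered by global annotation

saved_per_document = saved_annotations / documents      # 104.5, reported as 105
saved_per_identifier = saved_annotations / identifiers  # 10.45, reported as 10

print(saved_per_document, saved_per_identifier)
```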
4 CONCLUSION & OUTLOOK
We demonstrated a first recommender for mathematical identifier annotation. Our
system enables researchers to quickly disambiguate formula identifiers and thus
contributes significantly towards the aim of making mathematical documents
machine-interpretable. Converting mathematical language statements encoded in
formulae into natural language is a crucial task for enabling semantic search
queries and for improving mathematical recommender and question answering
systems.
In a preliminary evaluation, our system suggested correct names for 78% of the
examined identifier instances. As a next step, we will implement the possibility
to further deepen the annotation by referencing [2]. The user will be able to
link formulae and identifiers to items of the semantic knowledge base Wikidata.
Tagging documents with these items, that is, substituting formulae and
identifiers by numbers (Wikidata IDs), will yield a sparse semantic
"fingerprint" index, which can be queried by ID.
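One possible reading of such a "fingerprint" index is an inverted index from item IDs to documents. The sketch below is an assumption about how this could look, not the planned implementation; the IDs in `NAME_TO_ID` are placeholders and deliberately not real Wikidata IDs, and all function names are hypothetical.

```python
from collections import defaultdict

# Placeholder IDs for illustration only; they are NOT real Wikidata items.
NAME_TO_ID = {"sparsity": "Q-A", "ratings": "Q-B", "items": "Q-C", "users": "Q-D"}


def fingerprint(annotated_names):
    """Set of item IDs that stands in for the annotated names of a document."""
    return {NAME_TO_ID[name] for name in annotated_names if name in NAME_TO_ID}


def build_index(documents):
    """Inverted index: item ID -> set of documents tagged with that item."""
    index = defaultdict(set)
    for doc_id, names in documents.items():
        for item_id in fingerprint(names):
            index[item_id].add(doc_id)
    return index


def query(index, item_ids):
    """Documents tagged with all of the given item IDs."""
    sets = [index.get(item_id, set()) for item_id in item_ids]
    return set.intersection(*sets) if sets else set()


documents = {
    "doc1": ["sparsity", "ratings", "items", "users"],
    "doc2": ["ratings", "users"],
}
index = build_index(documents)
print(sorted(query(index, ["Q-B", "Q-D"])))
```

The index is sparse in the sense that each document is represented only by the few item IDs it was tagged with.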
Subsequently, we plan to carry out a large-scale user study in which we will
evaluate the formula name recommendations from the following sources:
1) surrounding text, 2) a history of manual inserts, 3) a self-created database
of annotated formulae, and 4) Wikidata.
Our long-term aim is to directly integrate our annotation recommender into the
editing or composing views of both Wikipedia and Overleaf. This would allow the
Wikipedia and research communities to be directly included in the
semantification process of mathematical articles and research papers (see
Figure 2).
Figure 2: Future integration of formula identifier annotation recommendation in
Wikipedia articles (Wikitext) and Overleaf documents (LaTeX).
ACKNOWLEDGMENTS
This work was supported by the German Research Foundation
(DFG grant GI-1259-1). We thank the Wikimedia Foundation for
hosting the system.
REFERENCES
[1] R. Hambasan and M. Kohlhase. “Faceted Search for Mathematics”. In: LWA.
Vol. 1458. CEUR Workshop Proceedings. CEUR-WS.org, 2015, pp. 33–44.
[2] M. Kohlhase. “Math Object Identifiers - Towards Research Data in
Mathematics”. In: LWDA. Vol. 1917. CEUR Workshop Proceedings. CEUR-WS.org, 2017,
p. 241.
[3] G. Y. Kristianto, G. Topic, and A. Aizawa. “Extracting Textual Descriptions
of Mathematical Expressions in Scientific Papers”. In: D-Lib Magazine 20.11/12
(2014).
[4] P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender
System for STEM Documents”. In: Proceedings of the 13th ACM Conference on
Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019.
[5] M. Schubotz et al. “Introducing MathQA: a Math-Aware question answering
system”. In: Information Discovery and Delivery 46.4 (2018), pp. 214–224.
[6] M. Schubotz et al. “Semantification of Identifiers in Mathematics for Better
Math Information Retrieval”. In: SIGIR. ACM, 2016, pp. 135–144.
[7] D. Vrandečić and M. Krötzsch. “Wikidata: a free collaborative
knowledgebase”. In: Commun. ACM 57.10 (2014), pp. 78–85.
Listing 1: Use the following BibTeX code to cite this article
@InProceedings{Scharpf2019b,
  Title     = {AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents},
  Author    = {Scharpf, Philipp and Mackerracher, Ian and Schubotz, Moritz and Beel, Joeran and Breitinger, Corinna and Gipp, Bela},
  Booktitle = {Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019)},
  Year      = {2019},
  Address   = {Copenhagen, Denmark},
  Month     = {Sept.},
  Publisher = {ACM},
  Topic     = {mathir}
}