Preprint from https://www.gipp.com/pub/

P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents”. In: Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019

AnnoMathTeX - a Formula Identifier Annotation

Recommender System for STEM Documents

Philipp Scharpf¹, Ian Mackerracher¹, Moritz Schubotz², Joeran Beel³, Corinna Breitinger¹, Bela Gipp¹,²

¹ University of Konstanz, Germany (first.last@uni-konstanz.de)

² University of Wuppertal, Germany (last@uni-wuppertal.de)

³ Trinity College Dublin, Ireland (first.last@tcd.ie)

ABSTRACT

Documents from science, technology, engineering and mathematics (STEM) often contain a large number of mathematical formulae alongside text. Semantic search, recommender, and question answering systems require the occurring formula constants and variables (identifiers) to be disambiguated. We present a first implementation of a recommender system that enables and accelerates formula annotation by displaying the most likely candidates for formula and identifier names from four different sources (arXiv, Wikipedia, Wikidata, or the surrounding text). A first evaluation shows that in total, 78% of the formula identifier name recommendations were accepted by the user as a suitable annotation. Furthermore, document-wide annotation saved the user the annotation of ten times more other identifier occurrences. Our long-term vision is to integrate the annotation recommender into the edit-view of Wikipedia and the online LaTeX editor Overleaf.

CCS CONCEPTS

• Information systems → Information retrieval;

KEYWORDS

Information Retrieval, Mathematical Information Retrieval, Recommender Systems, Semantification, Wikipedia/Wikidata

1 INTRODUCTION

Documents from Science, Technology, Engineering, and Mathematics (STEM) often contain numerous mathematical formulae [1], which are crucial to understanding the semantics of the text. If the formula characters (constants or variables) are not annotated, the mathematical statement of a formula cannot be understood and queried. However, if, for example, the formula $S = 1 - \frac{|R|}{|I| \cdot |U|}$ is annotated with {S: sparsity, R: ratings, I: items, U: users}, the characters (in the following referred to as formula identifiers) are translated into words that represent their meaning. This enables semantic search, recommender, and mathematical question answering systems [5] to find documents with formulae that, for example,

• allow calculating sparsity, or
• allow calculating sparsity, given ratings, items, and users, or
• contain specific variables, such as ratings and items, or
• relate ratings and users.
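To make the example concrete, the following is a minimal sketch of how an annotated formula could be stored and queried. The class `AnnotatedFormula`, the helper functions, and the second corpus entry (E = mc^2) are hypothetical and serve only as illustration; they are not part of the system described here.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedFormula:
    """A LaTeX formula together with its identifier annotations (hypothetical type)."""
    latex: str
    defined: str        # identifier on the left-hand side
    annotations: dict   # identifier symbol -> annotated name


def sparsity(num_ratings: int, num_items: int, num_users: int) -> float:
    """Evaluate S = 1 - |R| / (|I| * |U|)."""
    return 1.0 - num_ratings / (num_items * num_users)


def calculates(formulae, name):
    """Formulae whose defined (left-hand side) identifier carries the given name."""
    return [f for f in formulae if f.annotations.get(f.defined) == name]


def contains_all(formulae, names):
    """Formulae that contain identifiers annotated with all of the given names."""
    wanted = set(names)
    return [f for f in formulae if wanted <= set(f.annotations.values())]


# Tiny illustrative corpus; the second entry is only a contrast case.
corpus = [
    AnnotatedFormula(
        latex=r"S = 1 - \frac{|R|}{|I| \cdot |U|}",
        defined="S",
        annotations={"S": "sparsity", "R": "ratings", "I": "items", "U": "users"},
    ),
    AnnotatedFormula(
        latex=r"E = m c^2",
        defined="E",
        annotations={"E": "energy", "m": "mass", "c": "speed of light"},
    ),
]

print(sparsity(num_ratings=20, num_items=10, num_users=10))
print([f.latex for f in calculates(corpus, "sparsity")])
print([f.latex for f in contains_all(corpus, ["ratings", "items"])])
```

Without the annotations dictionary, neither query could distinguish the two formulae by meaning, which is the point the example queries above illustrate.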

These are examples of structured queries, which require machine-interpretability of mathematical documents to approach Mathematical Language Understanding (MLU). A large part of today's mathematical knowledge is contained either in research papers (LaTeX) or, in condensed form, in Wikipedia articles (Wikitext). Wikipedia articles are only semi-structured (linked). For the direct retrieval of specific facts and systematic queries, Wikidata was launched in 2012 [7]. Language-independent items (identified by a unique ID) are linked by properties. In addition to natural language statements, mathematical formulae were transferred from Wikipedia [5] as items with a "defining formula" property that allows a LaTeX formula string as value. However, only a few formulae contain their identifier names. Thus, a large part is not machine-interpretable, that is, it does not allow structured queries.
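As an illustration of how items with a "defining formula" property might be retrieved, the sketch below only builds a SPARQL query and a request URL; it does not contact any server. The property ID `P2534` for "defining formula" and the `/sparql` endpoint path are assumptions not stated in the text, and the query is not the one used by the authors.

```python
from urllib.parse import urlencode

# Assumed endpoint path of the Wikidata query service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata property ID for "defining formula".
DEFINING_FORMULA = "P2534"


def build_query(limit: int = 10) -> str:
    """Build a SPARQL query for items that have a defining formula."""
    return (
        "SELECT ?item ?itemLabel ?formula WHERE {\n"
        f"  ?item wdt:{DEFINING_FORMULA} ?formula .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(limit: int = 10) -> str:
    """Encode the query as a GET request URL asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": build_query(limit), "format": "json"})


print(build_query(5))
print(build_request_url(5))
```

Each result row would pair an item with its LaTeX formula string; the identifier names inside that string are exactly what remains unannotated for most items.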

Prior research has aimed to extract the identifier meaning from the text that surrounds the formula [6, 3], but all approaches lack an important element: the quality control afforded by a human expert verifier. Annotating a large number of formulae can be tedious. Since the identifier annotation in a document must be globally consistent, annotating each instance individually should be avoided. We address these shortcomings by introducing an annotation recommender system¹ for formula identifiers at the document level. We evaluate our system’s performance by comparing the user’s acceptance of recommendations from four different sources.

2 ANNOMATHTEX

The workflow of our system is as follows²: a user uploads a mathematical document in Wikitext or LaTeX format. The system displays the text while highlighting formulae and identifiers. The formulae are located by searching for their environment tags (<math>, $, \begin{equation}, \begin{align}, etc.). Parsing the formulae yields their identifiers, which are then highlighted. If the user clicks on a formula identifier, AnnoMathTeX presents recommendations for its name, which we extracted using four different sources: 1) arXiv: candidates³ extracted from the surrounding text of 60 M formulae; 2) Wikipedia: candidates⁴ extracted from definitions in English mathematical articles; 3) Wikidata: candidates retrieved via a SPARQL query⁵; 4) a surrounding text window of ±5 words around the formula. The recommendations are then generated from static dump lists and ranked by the occurrence frequency in their sources.
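The workflow above can be sketched roughly as follows: locate formulae by their delimiters, extract single-letter identifiers, collect words from a ±5 word window, and rank candidates by frequency. This is a simplified illustration, not the system's actual parser; the regular expressions, the stopword list, and all function names are assumptions.

```python
import re
from collections import Counter

# Delimiters covered in this sketch: <math>...</math>, $...$, and
# \begin{equation} / \begin{align} environments.
FORMULA_PATTERN = re.compile(
    r"<math>(?P<wiki>.+?)</math>"
    r"|\$(?P<inline>[^$]+)\$"
    r"|\\begin\{(?P<env>equation|align)\}(?P<block>.+?)\\end\{(?P=env)\}",
    re.DOTALL,
)

# Hypothetical minimal stopword list.
STOPWORDS = {"the", "and", "of", "a", "an", "is", "in", "for", "to"}


def find_formulae(text):
    """Return (start, end, formula body) for every delimited formula."""
    found = []
    for m in FORMULA_PATTERN.finditer(text):
        body = m.group("wiki") or m.group("inline") or m.group("block")
        found.append((m.start(), m.end(), body.strip()))
    return found


def extract_identifiers(formula):
    """Naive extraction: drop LaTeX commands, keep distinct single letters in order."""
    without_commands = re.sub(r"\\[A-Za-z]+", " ", formula)
    seen = []
    for letter in re.findall(r"[A-Za-z]", without_commands):
        if letter not in seen:
            seen.append(letter)
    return seen


def word_window(text, start, end, size=5):
    """Collect up to `size` words before and after the formula span."""
    before = re.findall(r"[A-Za-z]+", text[:start])[-size:]
    after = re.findall(r"[A-Za-z]+", text[end:])[:size]
    return before + after


def rank_candidates(words, top=10):
    """Rank candidate names by occurrence frequency, ignoring stopwords."""
    counts = Counter(w.lower() for w in words if w.lower() not in STOPWORDS)
    return [word for word, _ in counts.most_common(top)]


text = r"The sparsity $S = 1 - \frac{|R|}{|I| \cdot |U|}$ relates ratings, items and users."
for start, end, body in find_formulae(text):
    print(extract_identifiers(body))
    print(rank_candidates(word_window(text, start, end)))
```

In this sketch only the word-window source is computed on the fly; for the arXiv, Wikipedia, and Wikidata sources, the same ranking step would run over precomputed candidate lists.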

¹ System hosted by the Wikimedia Foundation at annomathtex.wmflabs.org
² Demo video available at bit.ly/annomathtex
³ http://ntcir-math.nii.ac.jp/data
⁴ https://en.wikipedia.org/wiki/User:Physikerwelt
⁵ https://query.wikidata.org

RecSys 2019, 16th-20th September 2019, Copenhagen, Denmark

Figure 1 shows the recommendation table/matrix. Each column corresponds to one source and is presented to the user in a shuffled order and under anonymous labels to avoid bias. If no recommendation matches, the user can type in the correct identifier name directly. By default, annotations are global: an identifier is automatically annotated at every further occurrence within the document, which enables significant time savings. In the rare case of a double meaning within the same document, a locally different annotation is possible. All annotations made by the user are shown as rows at the top of the document and saved in a separate annotation file. Finally, the user’s selection is stored in an evaluation file to compare the usefulness of the four sources.
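The global-by-default behaviour with an optional local override can be sketched as below. The class `AnnotationStore`, its methods, and the example name "residual" are hypothetical and chosen only to illustrate the idea; they do not reflect the system's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationStore:
    """Holds document-wide annotations plus per-occurrence overrides (hypothetical)."""
    global_names: dict = field(default_factory=dict)  # identifier -> name
    local_names: dict = field(default_factory=dict)   # (identifier, position) -> name

    def annotate(self, identifier, name, occurrence=None):
        """Annotate globally by default, or locally when a position is given."""
        if occurrence is None:
            self.global_names[identifier] = name
        else:
            self.local_names[(identifier, occurrence)] = name

    def resolve(self, identifier, occurrence):
        """A local annotation wins over the global one; None if unannotated."""
        return self.local_names.get(
            (identifier, occurrence), self.global_names.get(identifier)
        )


def annotate_document(occurrences, store):
    """Resolve the name of every identifier occurrence, in document order."""
    return [store.resolve(ident, pos) for pos, ident in enumerate(occurrences)]


# Identifier occurrences in document order (illustrative).
occurrences = ["S", "R", "I", "U", "R", "I", "R"]

store = AnnotationStore()
for ident, name in [("S", "sparsity"), ("R", "ratings"), ("I", "items"), ("U", "users")]:
    store.annotate(ident, name)

# Hypothetical double meaning: the last R stands for something else.
store.annotate("R", "residual", occurrence=6)

print(annotate_document(occurrences, store))
```

Here four global annotations cover seven occurrences, so three manual annotations are saved; the single local override handles the double meaning without breaking global consistency elsewhere.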

Figure 1: AnnoMathTeX recommendations for formula identifier annotation.

3 EVALUATION

As a proof-of-concept, we evaluate the performance of recommendations for formula identifiers by comparing the four sources. Annotating a sample of 100 identifiers from 10 different Wikipedia articles, we find that the acceptance distribution (item coverage) of the sources is {arXiv: 35%, Wikipedia: 16%, Wikidata: 13%, WordWindow: 35%}. Overall, 78% of the recommendations were accepted. On average, the accepted recommendation was ranked third (3.0) out of ten, with a ranking distribution of {arXiv: 2.3, Wikipedia: 4.0, Wikidata: 2.5, WordWindow: 3.1}. We conclude that in most cases, the recommendations are useful, and thus, the system can significantly speed up the annotation process.

Furthermore, 99% of the identifiers could be annotated globally, saving the user 1045 annotations - on average 105 per document and 10 per identifier.
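The reported averages follow directly from the sample sizes given above (10 articles, 100 identifiers). The short check below uses only numbers from the text. Taking the unweighted mean of the four per-source ranks is an assumption about how the overall average rank might have been obtained, since the aggregation method is not stated.

```python
saved_annotations = 1045
documents = 10
identifiers = 100

# Average savings per document and per identifier.
per_document = saved_annotations / documents      # 104.5, reported as 105
per_identifier = saved_annotations / identifiers  # 10.45, reported as 10

# Per-source average rank of the accepted recommendation.
ranks = {"arXiv": 2.3, "Wikipedia": 4.0, "Wikidata": 2.5, "WordWindow": 3.1}

# Assumption: unweighted mean over the four sources.
mean_rank = sum(ranks.values()) / len(ranks)

print(per_document, per_identifier, mean_rank)
```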

4 CONCLUSION & OUTLOOK

We demonstrated a first recommender for mathematical identifier annotation. Our system enables researchers to quickly disambiguate formula identifiers and thus contributes significantly towards the aim of making mathematical documents machine-interpretable. Converting mathematical language statements encoded in formulae into natural language is a crucial task for enabling semantic search queries and for improving mathematical recommender and question answering systems.

In a preliminary evaluation, our system suggested correct names for 78% of the examined identifier instances. As a next step, we will implement the possibility to further deepen the annotation by referencing [2]. The user will be able to link formulae and identifiers to items of the semantic knowledge base Wikidata. Tagging documents with these items, that is, substituting formulae and identifiers with numbers (Wikidata IDs), will yield a sparse semantic "fingerprint" index, which can be queried by ID.
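One possible reading of such a "fingerprint" index is an inverted index from item IDs to the documents tagged with them. The sketch below follows that reading; the function names, document names, and item IDs are placeholders rather than real Wikidata IDs, and the planned implementation may differ.

```python
from collections import defaultdict


def build_index(tagged_documents):
    """Map each item ID to the set of documents tagged with it."""
    index = defaultdict(set)
    for doc_id, item_ids in tagged_documents.items():
        for item_id in item_ids:
            index[item_id].add(doc_id)
    return index


def query(index, item_ids):
    """Return the documents that are tagged with all of the given item IDs."""
    postings = [index.get(item_id, set()) for item_id in item_ids]
    return set.intersection(*postings) if postings else set()


# Placeholder IDs standing in for Wikidata item IDs (not real ones).
tagged_documents = {
    "doc_recsys": {"ID_SPARSITY", "ID_RATINGS", "ID_ITEMS", "ID_USERS"},
    "doc_physics": {"ID_ENERGY", "ID_MASS"},
    "doc_survey": {"ID_RATINGS", "ID_USERS"},
}

index = build_index(tagged_documents)
print(sorted(query(index, ["ID_RATINGS", "ID_USERS"])))
print(sorted(query(index, ["ID_SPARSITY"])))
```

The index is sparse in the sense that each document touches only a few of all possible IDs, so a query costs a handful of set lookups rather than a scan over the document text.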

Subsequently, we plan to carry out a large-scale user study in which we will evaluate the formula name recommendations from the following sources: 1) surrounding text, 2) a history of manual inserts, 3) a self-created database of annotated formulae, and 4) Wikidata.

Our long-term aim is to directly integrate our annotation recommender into the editing or composing views of both Wikipedia and Overleaf. This would allow the Wikipedia and research communities to be directly included in the semantification process of mathematical articles and research papers (see Figure 2).

Figure 2: Future integration of formula identifier annotation recommendation in Wikipedia articles (Wikitext) and Overleaf documents (LaTeX).

ACKNOWLEDGMENTS

This work was supported by the German Research Foundation

(DFG grant GI-1259-1). We thank the Wikimedia Foundation for

hosting the system.

REFERENCES

[1] R. Hambasan and M. Kohlhase. “Faceted Search for Mathematics”. In: LWA. Vol. 1458. CEUR Workshop Proceedings. CEUR-WS.org, 2015, pp. 33–44.
[2] M. Kohlhase. “Math Object Identifiers - Towards Research Data in Mathematics”. In: LWDA. Vol. 1917. CEUR Workshop Proceedings. CEUR-WS.org, 2017, p. 241.
[3] G. Y. Kristianto, G. Topić, and A. Aizawa. “Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers”. In: D-Lib Magazine 20.11/12 (2014).
[4] P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents”. In: Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019.
[5] M. Schubotz et al. “Introducing MathQA: a Math-Aware question answering system”. In: Information Discovery and Delivery 46.4 (2018), pp. 214–224.
[6] M. Schubotz et al. “Semantification of Identifiers in Mathematics for Better Math Information Retrieval”. In: SIGIR. ACM, 2016, pp. 135–144.
[7] D. Vrandečić and M. Krötzsch. “Wikidata: a free collaborative knowledgebase”. In: Commun. ACM 57.10 (2014), pp. 78–85.

Listing 1: Use the following BibTeX code to cite this article

@InProceedings{Scharpf2019b,
  Title     = {AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents},
  Author    = {Scharpf, Philipp and Mackerracher, Ian and Schubotz, Moritz and Beel, Joeran and Breitinger, Corinna and Gipp, Bela},
  Booktitle = {Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019)},
  Year      = {2019},
  Address   = {Copenhagen, Denmark},
  Month     = {Sept.},
  Publisher = {ACM},
  Topic     = {mathir}
}