AnnoMathTeX: a formula identifier annotation recommender system for STEM documents

Preprint from https://www.gipp.com/pub/
P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents”. In: Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019
Philipp Scharpf¹, Ian Mackerracher¹, Moritz Schubotz², Joeran Beel³, Corinna Breitinger¹, Bela Gipp¹,²
¹ University of Konstanz, Germany (first.last@uni-konstanz.de)
² University of Wuppertal, Germany (last@uni-wuppertal.de)
³ Trinity College Dublin, Ireland (first.last@tcd.ie)
ABSTRACT
Documents from science, technology, engineering and mathematics (STEM) often contain a large number of mathematical formulae alongside text. Semantic search, recommender, and question answering systems require the occurring formula constants and variables (identifiers) to be disambiguated. We present a first implementation of a recommender system that enables and accelerates formula annotation by displaying the most likely candidates for formula and identifier names from four different sources (arXiv, Wikipedia, Wikidata, or the surrounding text). A first evaluation shows that in total, 78% of the formula identifier name recommendations were accepted by the user as a suitable annotation. Furthermore, document-wide annotation saved the user the annotation of ten times more other identifier occurrences. Our long-term vision is to integrate the annotation recommender into the edit-view of Wikipedia and the online LaTeX editor Overleaf.
CCS CONCEPTS
Information systems → Information retrieval;
KEYWORDS
Information Retrieval, Mathematical Information Retrieval, Recommender Systems, Semantification, Wikipedia/Wikidata
1 INTRODUCTION
Documents from Science, Technology, Engineering, and Mathematics (STEM) often contain numerous mathematical formulae [1], which are crucial to understanding the semantics of the text. If the formula characters (constants or variables) are not annotated, the mathematical statement of a formula cannot be understood and queried. However, if for example the formula $S = 1 - \frac{|R|}{|I| \cdot |U|}$ was annotated {S: sparsity, R: ratings, I: items, U: users}, the characters (in the following referred to as formula identifiers) are translated into words that represent their meaning. This enables semantic search, recommender and mathematical question answering systems [5] to find documents with formulae that for example
• allow calculating sparsity, given ratings, items, and users or
• contain specific variables, such as ratings and items or
• relate ratings and users.
These are examples of structured queries, which require machine-interpretability of mathematical documents to approach Mathematical Language Understanding (MLU). A large part of the mathematical knowledge today is either contained within research papers (LaTeX) or in condensed form in Wikipedia articles (Wikitext). Wikipedia articles are only semi-structured (linked). For the direct retrieval of specific facts and systematic queries, Wikidata was launched in 2012 [7]. Language-independent items (identified by a unique ID) are linked by properties. In addition to natural language statements, mathematical formulae were transferred from Wikipedia [5] as items with a "defining formula" property that allows a LaTeX formula string as value. However, only a few formulae contain their identifier names. Thus, a large part is not machine-interpretable (=allowing structured queries).
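To make the notion of a structured query concrete, the following minimal Python sketch stores annotated formulae and retrieves those that relate a given set of identifier names. The class `AnnotatedFormula`, the function `formulae_relating`, and the second corpus entry are illustrative assumptions and not part of AnnoMathTeX or Wikidata.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedFormula:
    """A LaTeX formula plus a mapping from identifier symbol to its name."""
    latex: str
    annotations: dict = field(default_factory=dict)


def formulae_relating(corpus, *names):
    """Return all formulae whose annotations contain every requested name."""
    wanted = set(names)
    return [f for f in corpus if wanted <= set(f.annotations.values())]


# The sparsity formula is the example from the text; the second entry is a
# made-up filler (hypothetical) so that the query has something to exclude.
corpus = [
    AnnotatedFormula(
        r"S = 1 - \frac{|R|}{|I| \cdot |U|}",
        {"S": "sparsity", "R": "ratings", "I": "items", "U": "users"},
    ),
    AnnotatedFormula(
        r"E = m c^2",
        {"E": "energy", "m": "mass", "c": "speed of light"},
    ),
]

# Structured query: which formulae relate ratings and users?
hits = formulae_relating(corpus, "ratings", "users")
```

Without the annotation dictionary, the same query would have to guess what `R` and `U` stand for, which is the gap that identifier annotation closes.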
Prior research has aimed to extract the identifier meaning from the text that surrounds the formula [6, 3], but all approaches lack an important element: the quality control afforded by a human expert verifier. Annotating multitudinous formulae can be tedious. Since the identifier annotation in a document must be globally consistent, annotating each instance individually should be avoided. We address these shortcomings by introducing an annotation recommender system¹ for formula identifiers at the document level. We evaluate our system’s performance while comparing the user’s acceptance of recommendations from four different sources.
2 ANNOMATHTEX
The workow of our system is as follows
2
: a user uploads a mathe-
matical document in Wikitext or LaTeX format. The system displays
the text while highlighting formulae and identiers. The formulae
are located by searching for their environment tags
(
<math>, $, \{equation}, \{align}
, etc.). Parsing the formulae
yields their identiers, which are then highlighted. If the user clicks
on a formula identier, AnnoMathTeX presents recommendations
for its name, which we extracted using four dierent sources : 1)
arXiv - candidates
3
extracted from the surrounding text of 60 M
formulae 2) Wikipedia - candidates
4
extracted from denitions in
mathematical English articles 3) Wikidata - candidates retrieved via
a SPARQL query
5
4) a surrounding text window of
±
5words around
the formula. The recommendations are then generated from static
dump lists and ranked by the occurrence frequency in their sources.
¹ System hosted by Wikimedia Foundation at annomathtex.wmflabs.org
² Demo video available at bit.ly/annomathtex
³ http://ntcir-math.nii.ac.jp/data
⁴ https://en.wikipedia.org/wiki/User:Physikerwelt
⁵ https://query.wikidata.org
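The formula location and word-window steps of the workflow can be sketched as follows. This is a simplified illustration, not the AnnoMathTeX implementation: it only handles the `<math>` and `$` delimiters, and the names `find_formulae`, `window_candidates`, `rank_by_frequency`, and the small `STOPWORDS` set are assumptions introduced here.

```python
import re
from collections import Counter

# Assumed subset of delimiters: <math>...</math> (Wikitext) and $...$ (LaTeX).
FORMULA_PATTERN = re.compile(r"<math>(.+?)</math>|\$(.+?)\$", re.DOTALL)
WORD_PATTERN = re.compile(r"[A-Za-z]+")
# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "a", "an", "is", "and", "in"}


def find_formulae(text):
    """Return (start, end, formula_source) for each delimited formula."""
    return [
        (m.start(), m.end(), m.group(1) or m.group(2))
        for m in FORMULA_PATTERN.finditer(text)
    ]


def window_candidates(text, start, end, size=5):
    """Collect up to `size` words on each side of the span [start, end)."""
    before = WORD_PATTERN.findall(text[:start])[-size:]
    after = WORD_PATTERN.findall(text[end:])[:size]
    return [w.lower() for w in before + after if w.lower() not in STOPWORDS]


def rank_by_frequency(candidates, top_k=10):
    """Order candidate names by how often they occur, most frequent first."""
    counts = Counter(candidates)
    return [name for name, _ in counts.most_common(top_k)]


text = "The sparsity <math>S</math> of the rating matrix measures the missing entries."
spans = find_formulae(text)
candidates = window_candidates(text, spans[0][0], spans[0][1])
ranked = rank_by_frequency(candidates)
```

For the three list-based sources (arXiv, Wikipedia, Wikidata), the same `rank_by_frequency` idea would apply to precomputed candidate lists rather than to a text window.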
RecSys 2019, 16th-20th September 2019, Copenhagen, Denmark
Figure 1 shows the recommendation table/matrix. Each column corresponds to one source and is presented to the user in a shuffled order and using anonymous labels to avoid bias. If no recommendation matches, the user can type in the correct identifier name directly. By default, identifiers are annotated globally and automatically annotated at any further occurrence within the document to enable significant time savings. In the rare case of a double meaning within the same document, a locally different annotation is possible. All annotations made by the user are shown as rows at the top of the document and saved in a separate annotation file. Finally, the user’s selection is stored in an evaluation file to compare the usefulness of the four sources.
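The two mechanisms described above, anonymized shuffled source columns and global annotation with local overrides, can be sketched in a few lines of Python. The function names, the `Source N` labels, and the data layout are assumptions chosen for illustration; the paper does not specify how AnnoMathTeX implements them.

```python
import random


def anonymized_columns(recommendations, rng=None):
    """Shuffle the source columns and relabel them with neutral names.

    Returns the table shown to the user and a key that maps each neutral
    label back to its source, so selections can be evaluated per source.
    """
    if rng is None:
        rng = random.Random()
    sources = list(recommendations)
    rng.shuffle(sources)
    table = {f"Source {i}": recommendations[s] for i, s in enumerate(sources, 1)}
    key = {f"Source {i}": s for i, s in enumerate(sources, 1)}
    return table, key


def annotate_globally(occurrences, global_names, local_overrides=None):
    """Assign a name to every identifier occurrence in document order.

    `global_names` maps symbol -> name for the whole document;
    `local_overrides` maps occurrence position -> name for the rare
    case of a double meaning within the same document.
    """
    local_overrides = local_overrides or {}
    return [
        local_overrides.get(position, global_names.get(symbol))
        for position, symbol in enumerate(occurrences)
    ]


recs = {
    "arXiv": ["sparsity", "entropy"],
    "Wikipedia": ["entropy"],
    "Wikidata": ["area"],
    "WordWindow": ["sparsity"],
}
table, key = anonymized_columns(recs, random.Random(0))

# One user decision for S covers both occurrences unless overridden locally.
names = annotate_globally(["S", "R", "S"], {"S": "sparsity", "R": "ratings"})
names_local = annotate_globally(
    ["S", "R", "S"], {"S": "sparsity", "R": "ratings"}, {2: "entropy"}
)
```

Keeping the `key` separate from the `table` is what allows the evaluation file to attribute each accepted recommendation to its source without revealing the source to the user.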
Figure 1: AnnoMathTeX recommendations for formula identifier annotation.
3 EVALUATION
As a proof-of-concept, we evaluate the performance of recommendations for formula identifiers comparing the four sources. Annotating a sample of 100 identifiers from 10 different Wikipedia articles, we find that the acceptance distribution (item coverage) of the sources is {arXiv: 35%, Wikipedia: 16%, Wikidata: 13%, WordWindow: 35%}. Overall, 78% of the recommendations were accepted. On average, the accepted recommendation was ranked third (3.0) out of ten, with a ranking distribution of {arXiv: 2.3, Wikipedia: 4.0, Wikidata: 2.5, WordWindow: 3.1}. We conclude that in most cases, the recommendations are useful, and thus, the system can significantly speed up the annotation process.
Furthermore, 99% of the identifiers could be annotated globally, saving the user 1045 annotations - on average 105 per document and 10 per identifier.
4 CONCLUSION & OUTLOOK
We demonstrated a rst recommender for mathematical identier
annotation. Our presented system enables researchers to quickly
disambiguate formula identiers, and thus contributes signicantly
towards the aim of making mathematical documents machine-
interpretable. Converting mathematical language statements en-
coded in formulae into natural language is a crucial task for en-
abling semantic search queries, and for improving mathematical
recommender and question answering systems.
In a preliminary evaluation, our system suggested correct names
for 78% of the examined identier instances. As a next step, we
will implement the possibility to further deepen the annotation by
referencing [2]. The user will be able to link formulae and identiers
to items of the semantic knowledge-base Wikidata. Having tagged
documents by these items, substituting formulae and identiers by
numbers (Wikidata IDs), will yield a sparse semantic "ngerprint"
index, which can be queried by ID.
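One possible reading of such a fingerprint index is an inverted index from item IDs to the documents tagged with them. The sketch below is an assumption about how this could look, not a description of planned AnnoMathTeX code; the IDs such as `Q_sparsity` are deliberately fake placeholders and not real Wikidata QIDs.

```python
from collections import defaultdict


def build_fingerprint_index(documents):
    """Invert {document: item IDs} into {item ID: set of documents}."""
    index = defaultdict(set)
    for doc_name, item_ids in documents.items():
        for item_id in item_ids:
            index[item_id].add(doc_name)
    return index


def query_by_ids(index, *item_ids):
    """Return the documents that are tagged with all of the given IDs."""
    doc_sets = [index.get(item_id, set()) for item_id in item_ids]
    return set.intersection(*doc_sets) if doc_sets else set()


# Placeholder IDs (hypothetical); real entries would be Wikidata QIDs.
documents = {
    "recsys_article": {"Q_sparsity", "Q_ratings", "Q_users"},
    "physics_article": {"Q_energy", "Q_mass"},
    "survey_article": {"Q_ratings", "Q_energy"},
}
index = build_fingerprint_index(documents)
matches = query_by_ids(index, "Q_ratings", "Q_users")
```

The index is sparse in the sense that each document touches only the few items it was tagged with, so a lookup by ID inspects only the documents that actually mention that concept.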
Subsequently, we plan to carry out a large-scale user study in which we will evaluate the formula name recommendations from the following sources: 1) surrounding text, 2) a history of manual inserts, 3) a self-created database of annotated formulae, and 4) Wikidata.
Our long-term aim is to directly integrate our annotation recommender into the editing or composing views of both Wikipedia and Overleaf. This would allow for the Wikipedia and research communities to be directly included in the semantification process of mathematical articles and research papers (see Figure 2).
Figure 2: Future integration of formula identifier annotation recommendation in Wikipedia articles (Wikitext) and Overleaf documents (LaTeX).
ACKNOWLEDGMENTS
This work was supported by the German Research Foundation (DFG grant GI-1259-1). We thank the Wikimedia Foundation for hosting the system.
REFERENCES
[1] R. Hambasan and M. Kohlhase. “Faceted Search for Mathematics”. In: LWA. Vol. 1458. CEUR Workshop Proceedings. CEUR-WS.org, 2015, pp. 33–44.
[2] M. Kohlhase. “Math Object Identifiers - Towards Research Data in Mathematics”. In: LWDA. Vol. 1917. CEUR Workshop Proceedings. CEUR-WS.org, 2017, p. 241.
[3] G. Y. Kristianto, G. Topic, and A. Aizawa. “Extracting Textual Descriptions of Mathematical Expressions in Scientific Papers”. In: D-Lib Magazine 20.11/12 (2014).
[4] P. Scharpf et al. “AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents”. In: Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019). Copenhagen, Denmark: ACM, Sept. 2019.
[5] M. Schubotz et al. “Introducing MathQA: a Math-Aware question answering system”. In: Information Discovery and Delivery 46.4 (2018), pp. 214–224.
[6] M. Schubotz et al. “Semantification of Identifiers in Mathematics for Better Math Information Retrieval”. In: SIGIR. ACM, 2016, pp. 135–144.
[7] D. Vrandecic and M. Krötzsch. “Wikidata: a free collaborative knowledgebase”. In: Commun. ACM 57.10 (2014), pp. 78–85.
Listing 1: Use the following BibTeX code to cite this article
@InProceedings{Scharpf2019b,
  Title     = {AnnoMathTeX - a Formula Identifier Annotation Recommender System for STEM Documents},
  Author    = {Scharpf, Philipp and Mackerracher, Ian and Schubotz, Moritz and Beel, Joeran and Breitinger, Corinna and Gipp, Bela},
  Booktitle = {Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019)},
  Year      = {2019},
  Address   = {Copenhagen, Denmark},
  Month     = {Sept.},
  Publisher = {ACM},
  Topic     = {mathir}
}