sTeX+ - a System for Flexible Formalization of Linked Data

Andrea Kohlhase, Michael Kohlhase, Christoph Lange

Journal Article: 06/2010; DOI: abs/1006.4474

Abstract

We present the sTeX+ system, a user-driven advancement of sTeX - a semantic extension of LaTeX that allows for producing high-quality PDF documents for (proof)reading and printing, as well as semantic XML/OMDoc documents for the Web or further processing. Originally sTeX had been created as an invasive, semantic frontend for authoring XML documents. Here, we used sTeX in a Software Engineering case study as a formalization tool. In order to deal with modular pre-semantic vocabularies and relations, we upgraded it to sTeX+ in a participatory design process. We present a tool chain that starts with an sTeX+ editor and ultimately serves the generated documents as XHTML+RDFa Linked Data via an OMDoc-enabled, versioned XML database. In the final output, all structural annotations are preserved in order to enable semantic information retrieval services. Comment: I-SEMANTICS 2010, September 1-3, 2010, Graz, Austria

Source: arXiv

stex-isem.tex 1265 2010-03-09 10:54:39Z ako
sTeX+ - a System for Flexible Formalization of Linked Data
Andrea Kohlhase
German Research Center for
Artificial Intelligence (DFKI)
Enrique-Schmidt-Str. 5
28359 Bremen, Germany
Andrea.Kohlhase@dfki.de
Michael Kohlhase
Jacobs University Bremen
P. O. Box 750561
28725 Bremen, Germany
m.kohlhase@jacobs-university.de
Christoph Lange
Jacobs University Bremen
P. O. Box 750561
28725 Bremen, Germany
ch.lange@jacobs-university.de
Categories and Subject Descriptors
D.2.1 [Software Engineering]: Requirements/Specifications -
Languages; I.2.4 [Artificial Intelligence]: Knowledge
Representation Formalisms and Methods - Representation
languages; I.7.2 [Document and Text Processing]: Document
Preparation
General Terms
Documentation, Human Factors, Languages, Management
Keywords
formalization, LaTeX, Linked Data, software engineering,
semantic authoring, annotation, metadata, RDFa, vocabularies,
ontologies
1. INTRODUCTION
An important issue in the Semantic Web community was and still
is the “Authoring Problem”: How can we convince people not only
to use semantic technologies, but also prepare them for
creating semantic documents (in a broad sense)?
I-SEMANTICS 2010, September 1–3, 2010, Graz, Austria
Here, we were interested in formalizing a collection of LaTeX
documents into a set of files in the OMDoc format, an XML
vocabulary specialized for managing mathematical information,
and further on to Linked Data for interactive browsing and
querying on the Semantic Web.
Concretely, the object of our study was the collection of
documents created in the course of the 3-year project
“Sicherungskomponente für Autonome Mobile Systeme (SAMS)” at
the German Research Center for Artificial Intelligence (DFKI).
SAMS built a software safety component for autonomous mobile
service robots and certified it as SIL-3 standard compliant
(see [13]). Certification required the software development to
follow the V-model (figure 1) and to be based on a verification
of certain safety properties in the proof checker Isabelle [33].
The V-model mandates, e. g., that relevant document fragments
get justified and linked to corresponding fragments in other
members of the document collection in an iterative refinement
process (the arms of the ‘V’ from the upper left over the
bottom to the upper right and in-between in figure 1).
Figure 1: A Document View on the V-Model
System development with respect to this regime results in a
highly interconnected collection of design documents,
certification documents, code, formal specifications, and
formal proofs. This collection of documents, “SAMSDocs” [35],
makes up the basis of a case study in the context of the
FormalSafe project [12] at DFKI Bremen, where it serves as a
basis for research on machine-supported change management,
information retrieval, and document interaction. In this paper,
we report on the formalization of the collection of LaTeX
documents in SAMSDocs (which we will also refer to simply as
SAMSDocs).
arXiv:1006.4474v1 [cs.SE] 23 Jun 2010
Not surprisingly, the interplay between the fields Semantic Web
and Human-Computer Interaction played an important role, as the
“Authoring Problem” of the first is often tackled via methods
of the second. One such approach is that of “invasive
technology” [21]: from a user’s perspective, semantic authoring
and general editing are the same, so why not offer semantic
functionalities as an extension of well-known editing systems,
thereby ‘invading’ the existing ones? We started with LaTeX not
only because a good portion of our case study was written in
it, but also because LaTeX constitutes the state-of-the-art
authoring solution for many scientific/technical/mathematical
document collections. Despite its text-based nature it is
widely considered the most efficient tool for the task.
Therefore, we used the invasive OMDoc frontend for LaTeX
documents called sTeX [26]. In the formalization process its
conceptual usability weaknesses (for the task) were identified,
and within a participatory design process it evolved into the
invasive formalization tool sTeX+.
In section 2, we will present the sTeX system, especially its
realization of Linked Data creation. Then we describe in
section 3 the formalization process of SAMSDocs with sTeX, our
challenges, and our (pre-)solutions. In section 4 we report the
enhancements of sTeX realized in and for the case study,
leading to sTeX+. Having sTeX+ documents with Linked Data and
ontological markup, we describe (potential) services and their
implementation design in section 5. Section 6 summarizes
related work, and section 7 concludes the paper.
2. sTeX: OBJ.-ORIENTED LaTeX MARKUP
sTeX [26, 37] is an extension of the LaTeX language that is
geared towards marking up the semantic structure underlying a
document. The main concept in sTeX is that of a “semantic
macro”, i. e., a TeX command sequence S that represents a
meaningful (mathematical) concept C: the TeX formatter will
expand S to the presentation of C. For instance, the command
sequence \positiveReals (from listing 1) is a semantic macro
that represents a mathematical symbol: the set ℝ⁺ of positive
real numbers. While the use of semantic macros is generally
considered a good markup practice for scientific documents
(e. g., because they allow notation to be adapted by macro
redefinition and thus increase reusability), regular TeX/LaTeX
does not offer any infrastructural support for this. sTeX does
just this by adopting a semantic, ‘object-oriented’ approach to
semantic macros: it groups them into “modules”, which are
linked by an “imports” relation. To get a better intuition,
consider
Listing 1: An sTeX Module for Real Numbers
\begin{module}[id=reals]
  \importmodule[../background/sets]{sets}
  \symdef{Reals}{\mathbb{R}}
  \symdef{greater}[2]{#1>#2}
  \symdef{positiveReals}{\Reals^+}
  \begin{definition}[id=posreals.def,
                     title=Positive Real Numbers]
    $\defeq\positiveReals
       {\setst{\inset{x}\Reals}{\greater{x}0}}$
  \end{definition}
  ...
\end{module}
which would be formatted to
Definition 2.1 (Positive Real Numbers): ℝ⁺ := {x ∈ ℝ | x > 0}
Here, sTeX’s \symdef macro (invasive due to its deliberate
resemblance to (La)TeX’s \def and \newcommand) generates a
respective semantic macro, for instance \positiveReals with
the representation ℝ⁺. Note the symbol inheritance scheme of
sTeX: The markup in the module reals has access to the semantic
macros \setst (“set such that”) and \inset (set membership)
from the module sets, which was imported by the \importmodule
directive from the file ../background/sets.tex. Furthermore, it
has access to \defeq (definitional equality), which was in turn
imported by the module sets.
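The inheritance scheme can be pictured as a transitive closure over the imports relation. The following Python sketch is only a toy model, not part of sTeX: the module table is an assumption reconstructed from Listing 1 and the surrounding text (in particular, that sets imports a module named mathtalk providing defeq), and the function name visible_symbols is hypothetical.

```python
from typing import Dict, List, Set

# Toy module table (assumed): which symbols each module defines
# and which modules it imports directly.
MODULES: Dict[str, Dict[str, List[str]]] = {
    "mathtalk": {"imports": [], "symbols": ["defeq"]},
    "sets": {"imports": ["mathtalk"], "symbols": ["setst", "inset"]},
    "reals": {
        "imports": ["sets"],
        "symbols": ["Reals", "greater", "positiveReals"],
    },
}


def visible_symbols(module: str,
                    modules: Dict[str, Dict[str, List[str]]] = MODULES) -> Set[str]:
    """Collect the symbols of a module and of everything it imports, recursively."""
    seen: Set[str] = set()      # modules already visited (guards against cycles)
    result: Set[str] = set()
    stack = [module]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        result.update(modules[current]["symbols"])
        stack.extend(modules[current]["imports"])
    return result


if __name__ == "__main__":
    print(sorted(visible_symbols("reals")))
```

In this model, reals sees defeq only because sets imports mathtalk; sets itself does not see Reals, since imports are directed.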
From this example we can already see an organizational
advantage of sTeX over LaTeX: we can define the (semantic)
macros close to where the corresponding concepts are defined,
and we can (recursively) import mathematical modules. But the
main advantage of markup in sTeX is that it can be transformed
to XML via the LaTeXML system [32]: Listing 2 shows the OMDoc
[25] representation generated from the sTeX sources in
listing 1. OMDoc is a semantics-oriented representation format
for mathematical knowledge that extends the formula markup
formats OpenMath [7] and MathML [2] to a document markup
format.
Listing 2: An XML Version of Listing 1
<theory xml:id="reals">
<imports from="../background/sets.omdoc#sets"/>
<symbol xml:id="Reals"/>
<notation>
5 <prototype><OMS cd="reals" name="Reals"/></prototype>
<rendering><m:mo>R</m:mo></rendering>
</notation>
<symbol xml:id="greater"/><notation>. . .</notation>
<symbol xml:id="positiveReals"/><notation>. . .</notation>
10 <definition xml:id="posreals.def" for="positiveReals">
<meta property="dc:title">Positive Real Numbers</meta>
<OMOBJ>
<OMA>
<OMS cd="mathtalk" name="defeq"/>
15 <OMS cd="reals" name="positiveReals"/>
<OMA>
<OMS cd="sets" name="setst"/>
<OMA>
<OMS cd="sets" name="inset"/>
20 <OMV name="x"/>
<OMS cd="reals" name="Reals"/>
</OMA>
<OMA>
<OMS cd="reals" name="greater"/>
25 <OMV name="x"/>
<OMI>0</OMI>
</OMA>
</OMA>
</OMA>
30 </OMOBJ>
</definition>
. . .
</theory>
One thing that jumps out from the XML in this listing is that
it incorporates all the information from the sTeX markup that
was invisible in the PDF produced by formatting it with TeX.
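To make the preserved structure tangible, here is a small Python sketch that reads the OpenMath object of the definition and prints it in prefix notation. It is an illustration only: the XML string is a simplified copy of the OMOBJ part of Listing 2 (without namespaces, and using the symbol name Reals as declared by the symbol element), and the helper to_prefix is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free copy of the OMOBJ element from Listing 2.
OMOBJ = """
<OMOBJ>
  <OMA>
    <OMS cd="mathtalk" name="defeq"/>
    <OMS cd="reals" name="positiveReals"/>
    <OMA>
      <OMS cd="sets" name="setst"/>
      <OMA>
        <OMS cd="sets" name="inset"/>
        <OMV name="x"/>
        <OMS cd="reals" name="Reals"/>
      </OMA>
      <OMA>
        <OMS cd="reals" name="greater"/>
        <OMV name="x"/>
        <OMI>0</OMI>
      </OMA>
    </OMA>
  </OMA>
</OMOBJ>
"""


def to_prefix(node: ET.Element) -> str:
    """Render an OpenMath element as a prefix-notation string."""
    if node.tag == "OMOBJ":                 # wrapper around a single object
        return to_prefix(node[0])
    if node.tag == "OMS":                   # symbol: theory (cd) plus name
        return f'{node.get("cd")}:{node.get("name")}'
    if node.tag == "OMV":                   # variable
        return node.get("name")
    if node.tag == "OMI":                   # integer literal
        return node.text.strip()
    if node.tag == "OMA":                   # application: head applied to arguments
        head, *args = [to_prefix(child) for child in node]
        return f'{head}({", ".join(args)})'
    raise ValueError(f"unsupported element: {node.tag}")


term = to_prefix(ET.fromstring(OMOBJ.strip()))

if __name__ == "__main__":
    print(term)
```

Every symbol in the resulting term names the theory it comes from, which is exactly the information that the PDF rendering discards.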
OMDoc itself has been used as a storage and exchange format for
automated theorem provers, software verification systems,
e-learning software, and other applications [25, chapter 26],
but due to its focus on semantic structures, it is not intended
to be consumed by human readers. The Java-based JOMDoc [19]
library uses the notation elements to generate human-readable
XHTML+MathML from OMDoc. Figure 2 shows the result of rendering
the document from listing 2 in a MathML-aware browser. In
contrast to the PDF output we can directly create from sTeX,
XHTML+MathML allows for interactivity. In particular, our JOBAD
JavaScript framework enables modular interactive services in
rendered XHTML+MathML documents [14]. These services utilize
the semantic structures of mathematical formulae. In our
rendered documents, each formula in human-readable Presentation
MathML carries the original semantic OpenMath representation of
the formula, as shown in listing 2, as a hidden annotation.
Client-side JOBAD services, which exclusively rely on
annotations given inside a document, have already been
implemented for folding and unfolding subterms of formulae and
for controlling the display of redundant brackets in complex
formulae. The symbol definition lookup service, shown in
figure 2, interacts with a server backend: It traverses the
links to symbol elements and their corresponding definition
elements that are established by the OMS elements in OpenMath
(for example, <OMS cd="sets" name="inset"/> encodes the URI
../background/sets.omdoc#inset) and retrieves the document at
that URI as XHTML+MathML (see footnote 1). JOBAD can integrate
an arbitrary number of services, which can talk to different
server backends and which are enabled depending on the context,
i. e., the semantic structure of the part of a mathematical
formula that the user has selected. This ability turns our
rendered mathematical documents into powerful mashups [28]. On
any symbol, for example, definition lookup is enabled. On any
expression where a number is multiplied with a special symbol
representing a unit of measurement, a unit conversion client
that talks to a remote unit conversion web service is enabled.
The JOBAD architecture has been designed without depending on a
particular backend; for most of our services we are using the
extensible XML-aware database TNTBase [39, 40, 11], which has
special support for OMDoc and integrates the JOMDoc rendering
library.
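The link traversal behind definition lookup can be sketched in a few lines of Python. This is a guess at the resolution rule, not JOBAD's actual code: it assumes that the document part of an OMS link is taken from the matching imports element (as in Listing 2) and that the symbol name becomes the fragment. The function names theory_locations and oms_uri are hypothetical.

```python
from typing import Dict, Iterable


def theory_locations(import_targets: Iterable[str]) -> Dict[str, str]:
    """Map each imported theory name to its document path.

    Each target has the form 'path#theory', as in the from attribute
    of the imports element in Listing 2.
    """
    locations: Dict[str, str] = {}
    for target in import_targets:
        path, _, theory = target.partition("#")
        locations[theory] = path
    return locations


def oms_uri(cd: str, name: str, locations: Dict[str, str]) -> str:
    """Build the URI that an OMS element with this cd and name points to."""
    return f"{locations[cd]}#{name}"


if __name__ == "__main__":
    locations = theory_locations(["../background/sets.omdoc#sets"])
    print(oms_uri("sets", "inset", locations))
```

With the import from Listing 2, the symbol inset of theory sets resolves to the URI quoted in the text above.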
Figure 2: Listing 1 as Dynamic XHTML+MathML
Footnote 1: This is the MathML way of representing Linked Data.
In section 5, we describe how we have now extended this feature
to cover RDFa Linked Data.
3. FORMALIZATION WITH sTeX TOWARDS sTeX+
In this section we describe the process of formalizing the
SAMSDocs collection of LaTeX documents created in the course of
the SAMS project with the sTeX system. We use the user’s
perspective to point to the requirements for sTeX+ that evolved
in this process.
As we all know all too well: Formalizing is never easily done.
In our project we had the additional challenge of doing it
without corrupting the PDF layout that was produced with LaTeX.
Here, sTeX fits well, as it both generates PDF and transforms
to XML. In figure 3 we can see the general course of action:
i) we identified document fragments (“objects”) that
constitute a coherent, meaningful unit, like the state of a
document “rd.” or its description “ready for certification”,
then
ii) we translated them into the sTeX format, realizing for
example that “rd.” is a recurring symbol and “ready for
certification” its definition (therefore designing the SAMSDocs
macro “SDdef”), and finally
iii) we polished these macros in the sTeX-specific sty files so
that the PDF layout remained as before and the generated XML
represented the intended logical structure, for instance the
use of the OMDoc XML elements symbol and definition.
Note that definitions are common objects in mathematical
documents; therefore sTeX naturally provides a definition
environment. So why didn’t we use that? Because the document
model of OMDoc, which we obtain by transforming sTeX using
LaTeXML, does not allow definitions in tables, as definitions
are stand-alone objects from an ontological perspective. If one
authors a formal document, this view is taken, so no problem
arises, but if one formalizes an existing document, layout and
cognitive side-conditions have to be taken into account. We
therefore realized that we could not simply add basic sTeX
markup to the LaTeX source to yield formal objects; rather, we
needed to add pre-formal markup in the formalization process
(we speak of (semantic) preloading).
Whenever we discovered project-wide (semantic) layout schemes
that were frequently used, we extended the macro set of sTeX
suitably. This enabled preloading “project structures” [22],
i. e. project-induced ones, which are quite different from
“document [layout] structures” [ibid.], e. g. subsections,
which are supported by sTeX core features (see DCMsubsection in
figure 3). The table layout, for example, was often used for
lists of symbol definitions. So we created the SDTab-def
environment, which can host as many SDdef commands as needed
(see fig. 3). This increased the efficiency of the
formalization process tremendously.
Another difference between authoring and semantic preloading
consisted in the order of the formalization steps. Authoring
typically proceeds by “chunking” (i. e., building up structure,
e. g. by setting up theories), “spotting” (i. e., coining
objects), and “relating” (i. e., making relationships between
objects or structures explicit), whereas preloading proceeds by
spotting, then relating or chunking. The last two were done
simultaneously, because sTeX offers a very handy inheritance
scheme for symbol macros, as long as the chunks are in order,
which could be sensibly done for some but not for all at this
stage in the formalization process. Generally, many ‘guiding’
services of sTeX, which sTeX considers to be features, turned
out to be too rigid.
Figure 3: The Formalization Workflow via sTeX: Definition Table of “document state”
As a consequence we heavily used very light annotations at the
beginning: It was sufficient to identify a certain document
fragment and to mark it with a referenceable ID like
“state-doc-rd”. Shortly afterwards, we realized that some more
basic markup was necessary, since we wanted to formalize our
knowledge of the types/categories of these objects and their
conceptual belonging. For this we developed a set of “ad-hoc
semantification macros” with named attributes like
SDobject[id], SDmore[id,cat,for],
SDisa[id,cat,for,follows,theory,imports,tab], or
SDreferences[id,file,refid] (see footnote 2). The ‘more’
functionality provided by SDmore was required due to logically
contiguous objects that were interspersed in a document. With
this set we preloaded “object structures” [ibid.], i. e.
object-induced ones. Note that the ad-hoc semantification
macros enabled the formalizer to develop her own metadata
vocabulary.
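The role of SDmore can be illustrated with a small Python model. It is a sketch under assumptions: the text only names the attributes, so the reading that the for attribute of an SDmore fragment points at the id of the object it continues is a guess, and the class Annotation, the function assemble, and all ids and texts below are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Annotation:
    macro: str                     # "SDobject" or "SDmore"
    id: str                        # referenceable ID of this fragment
    text: str                      # the marked-up document fragment
    for_id: Optional[str] = None   # assumed: id of the object an SDmore continues


def assemble(annotations: List[Annotation]) -> Dict[str, str]:
    """Reunite each object with its interspersed continuation fragments."""
    parts: Dict[str, List[str]] = {}
    # First pass: register every object, so that continuations may
    # appear anywhere in the document relative to their object.
    for a in annotations:
        if a.macro == "SDobject":
            parts[a.id] = [a.text]
    # Second pass: append continuations in document order.
    for a in annotations:
        if a.macro == "SDmore":
            parts[a.for_id].append(a.text)
    return {obj_id: " ".join(texts) for obj_id, texts in parts.items()}


if __name__ == "__main__":
    doc = [
        Annotation("SDobject", "obj-a", "First part of A."),
        Annotation("SDobject", "obj-b", "All of B."),
        Annotation("SDmore", "obj-a-more", "Second part of A.", for_id="obj-a"),
    ]
    print(assemble(doc))
```

The point of the model is that the two fragments of the first object stay separate in the document layout but form one logical unit in the formalized view.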
As soon as the document boundaries went down, we realized that
an object had many occurrences in several of the documents in
the SAMSDocs collection. For example, an object was first
introduced as a high-level concept in the contract, then
specified in another document, refined in a detailed
specification, implemented in the code, reviewed at some stage,
and so on until it was finally described in the manual. Thus,
we had to preload “collection structures” [ibid.] as well,
which consisted in the development process model, the V-model
as seen in figure 1. Here, we built our personal V-model
macros, e. g. SemVMrefines, SemVMimplements, or
SemVMdescribesUse.
Footnote 2: We use subsets of a general attribute set for all
of our sTeX extensions to lower the learning curve for the use
of the markup macros.
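Once such relations are recorded, the occurrences of an object across the collection can be followed mechanically. The Python sketch below is a toy model only: the macro names come from the text, but the triple encoding (source fragment, relation, target fragment), the direction of the relations, and all fragment identifiers are assumptions made for illustration.

```python
from typing import List, Tuple

Link = Tuple[str, str, str]   # (source fragment, relation, target fragment)

# Hypothetical links: each source fragment refines, implements, or
# describes the use of the target fragment.
LINKS: List[Link] = [
    ("spec#brake", "SemVMrefines", "contract#brake"),
    ("detailed-spec#brake", "SemVMrefines", "spec#brake"),
    ("code#brake", "SemVMimplements", "detailed-spec#brake"),
    ("manual#brake", "SemVMdescribesUse", "code#brake"),
]


def occurrences(root: str, links: List[Link]) -> List[Tuple[str, str]]:
    """Breadth-first search for all fragments that transitively point at root."""
    found: List[Tuple[str, str]] = []
    seen = {root}
    frontier = [root]
    while frontier:
        target = frontier.pop(0)
        for source, relation, dest in links:
            if dest == target and source not in seen:
                seen.add(source)
                found.append((source, relation))
                frontier.append(source)
    return found


if __name__ == "__main__":
    for fragment, relation in occurrences("contract#brake", LINKS):
        print(relation, fragment)
```

Starting from the contract fragment, the search walks down and up the arms of the ‘V’ and lists each later occurrence together with the relation that links it.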
Additionally, we created an sTeX extension especially suited
for preloading “organizational structures” [ibid.]. These are
considered different from project structures, as organizational
markup is very likely to be reusable for other projects with
the same organizational structures. For example, SAMS used
document version management as well as a document review
history, so the environments VMchangelist and VMcertification
with the corresponding list entry macros VMchange and
VMcertified were built. Another example is the processing state
of a document, which can be marked up easily by using the
VMdocstate macro as seen in figure 4.
We noted that the necessary formalization depth of some
documents was naturally deeper than that of others. For
example, it didn’t seem sensible to formalize the contract too
much, as it was created as a high-level communication document,
whereas the detailed specification needed a lot of
formalization. The manual had an interesting mixed state of
formality and informality, as it was again geared towards
communication, but it needed to be very precise. In conclusion
we note that the mathematical content of the documents (i. e.,
the mathematical objects and their relations) was only one of
the knowledge sources that needed to be formalized and marked
up. In the course of the formalization it became apparent that
the knowledge in such complex collections is multi-dimensional
(cf. [22] for an in-depth analysis). Thus, the requirements for
extending sTeX to sTeX+ were (i) to generate XML output that
preserves the semantics annotated in the preloading phase, and
(ii) to take into account the multi-dimensionality of our
ad-hoc semantification macros in a way that technically enables
browsing and querying. These requirements were satisfied by
enabling the generation of RDFa from our annotations and making
them accessible to Linked Data services, as we will describe in
the following sections.
Figure 4: Referencing a “document state”
4. sTeX+: A METADATA-EXTENSION OF sTeX
All the arrows in figure 1 are examples of relations between
document fragments in the SAMSDocs corpus that needed to be made
explicit in addition to the mathematical relations that sTeX
had originally supported; the revision histories of documents
and the social networks of their authors constitute further
dimensions of knowledge. For situations like these, we had
incorporated RDFa [1] as a flexible metadata framework into the
OMDoc format [31]. In the course of this case study, the RDFa
integration was revised and extended, and it will become part
of the upcoming OMDoc version 1.3 [27]. The main idea behind
this integration is that any concrete document markup format
can only treat a certain set of objects and their relations via
its respective native markup infrastructure. All other objects
and relations can be added via RDFa annotations to the host
language, assuming the latter is XML-based.
It is crucial to realize that, for machine support, the
metadata objects and relations are given a machine-processable
meaning via suitable ontologies. Moreover, ontologies are just
special cases of (mathematical) theories, which import
appropriate theories for the logical background, e. g.
description logic, and whose symbols are the entities (classes,
properties, individuals) of ontologies. Thus, sTeX and OMDoc
can play a dual role for Linked Data in documents with
mathematical content. They can be used as markup formats for
the documents and at the same time as the markup formats for
the ontologies. We have explored this correspondence for OMDoc
in previous work and implemented a translation between OMDoc
and OWL [31, 30].
To understand our contribution, note that we can view LaTeX and
sTeX as frameworks for defining domain-specific vocabularies in
classes and packages; LaTeX is used for layout aspects, and
sTeX can additionally handle the semantic aspects of the
vocabularies. sTeX uses this approach to define special markup,
e. g. for definitions (see lines 10 to 31 in listing 2). Note
that to define sTeX markup functionality like the definition
environment, we have to provide a LaTeX environment definition
(so that the formatting via LaTeX works) and a LaTeXML binding
(to specify the XML transformation for the definition
environment). As the OMDoc vocabulary is finite and fixed, sTeX
can (and does) supply special LaTeX macros and environments and
their LaTeXML bindings. But the situation is different for the
flexible, RDFa-based metadata extension in OMDoc 1.3 we
mentioned above, with a potentially infinite supply of
vocabularies. At the start of the SAMSDocs preloading effort,
sTeX already supported a common subset of metadata
vocabularies. For instance, the Dublin Core title metadata
element in line 11 of listing 2 is the transformation result of
using the KeyVal [9] pair title=. . . in the optional argument
of the definition environment.
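The correspondence between a KeyVal pair and a metadata element can be mimicked in Python. This is a rough illustration and not the LaTeXML binding itself: the only mapping taken from the text is title to dc:title, the naive comma splitting ignores brace groups that real KeyVal parsing would respect, and the names KEY_TO_PROPERTY, parse_keyval, and metadata_elements are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# The only key-to-property mapping confirmed by the text.
KEY_TO_PROPERTY: Dict[str, str] = {"title": "dc:title"}


def parse_keyval(options: str) -> Dict[str, str]:
    """Split 'k1=v1, k2=v2' into a dict (naive: no support for brace groups)."""
    pairs: Dict[str, str] = {}
    for item in options.split(","):
        key, _, value = item.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def metadata_elements(options: str) -> List[str]:
    """Serialize one meta element per key that has a known metadata property."""
    elements: List[str] = []
    for key, value in parse_keyval(options).items():
        prop = KEY_TO_PROPERTY.get(key)
        if prop is None:
            continue                      # e.g. id is handled elsewhere
        meta = ET.Element("meta", {"property": prop})
        meta.text = value
        elements.append(ET.tostring(meta, encoding="unicode"))
    return elements


if __name__ == "__main__":
    print(metadata_elements("id=posreals.def, title=Positive Real Numbers"))
```

Applied to the optional argument of the definition environment in Listing 1, the sketch yields the meta element that appears in line 11 of Listing 2.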
For the SAMSDocs case study we started in the same way by adding
a package with LaTeXML bindings to sTeX. The \VMdocstate macro
shown in the “sTeX” box of figure 4 allowed us to annotate a
document with its processing state. This is transformed to an
RDFa-annotated omdoc root element, as shown in the “OMDoc” box
underneath and in the black, solid parts of the RDF graph in
figure 5. We can already see that the sTeX extension for
SAMSDocs consists precisely of a domain-specific metadata
vocabulary extension, and that using the custom vocabulary
hides markup complexity from the author. Again, SAMSDocs only
needed a finite vocabulary extension, so this approach was
feasible, but of restricted applicability, since developing the
SAMSDocs package for sTeX required insights into sTeX internals
and LaTeXML bindings. Thus this extension approach lacks the
flexible user-extensibility that would be needed to scale up
further.
To enable user-extensibility, we add a new declaration form
\keydef to the core sTeX functionality (yielding sTeX+). It is
like \symdef in that it is inherited via the module imports
relation, except that it defines a KeyVal key instead of a
semantic macro. To understand its application, we rationally
reconstruct the v:hasState relation from the example in the
OMDoc box of figure 4. To do this, we use sTeX to create a
metadata vocabulary for document states: we create a
certification module, which defines the hasState metadata
relation and adds it to the KeyVal keys of the document
environment. The metalanguage macro is a variant of
importmodule that imports the meta language, i. e., the
language in which the meaning of the new symbols is expressed;
here we use OWL.
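The intended behaviour of such a key declaration can be modelled in Python. The sketch is an interpretation, not the sTeX+ implementation: it assumes that a key declared in one module becomes usable in every module that (transitively) imports it, and that each use yields one RDF-style triple with the prefix v from the example; the class Module, the function triples, the subject mydoc.omdoc, and the state value rd are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


class Module:
    """Toy stand-in for an sTeX module with imports and declared keys."""

    def __init__(self, name: str, imports: Sequence["Module"] = ()) -> None:
        self.name = name
        self.imports = list(imports)
        self.keys: Dict[str, Set[str]] = {}   # environment -> keys declared here

    def keydef(self, environment: str, key: str) -> None:
        """Declare a new KeyVal key for the given environment."""
        self.keys.setdefault(environment, set()).add(key)

    def allowed_keys(self, environment: str) -> Set[str]:
        """Own keys plus inherited ones (assumes acyclic imports)."""
        allowed = set(self.keys.get(environment, ()))
        for imported in self.imports:
            allowed |= imported.allowed_keys(environment)
        return allowed


def triples(subject: str, environment: str, options: Dict[str, str],
            module: Module, prefix: str = "v") -> List[Tuple[str, str, str]]:
    """Turn the KeyVal options of an environment into (s, p, o) triples."""
    allowed = module.allowed_keys(environment)
    result = []
    for key, value in options.items():
        if key not in allowed:
            raise KeyError(f"key {key!r} is not declared for {environment!r}")
        result.append((subject, f"{prefix}:{key}", value))
    return result


if __name__ == "__main__":
    certification = Module("certification")
    certification.keydef("document", "hasState")
    mydoc = Module("mydoc", imports=[certification])
    print(triples("mydoc.omdoc", "document", {"hasState": "rd"}, mydoc))
```

The design point this mirrors is that the author of a vocabulary only writes a declaration, while inheritance and the transformation to metadata are generic.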
Listing 3: A Metadata Ontology for Certification
\begin{module}[id=certification]
\metalanguage[../background/owl]{owl}
\keydef{document}{hasState}