W3C ITS 2.0 in OASIS XLIFF 2.1
Managing metadata throughout the multilingual content lifecycle
David Filip, ADAPT Centre at Trinity College
Dublin <david.filip@adaptcentre.ie>
Abstract
XLIFF is the XML Localization Interchange File Format. The current OASIS Standard version is XLIFF
Version 2.0 [XLIFF 2.0]. XLIFF Version 2.1 [XLIFF 2.1 [csprd01]] concluded the 1st public review period on
25th November 2016. The major new features added to [XLIFF 2.1] compared to [XLIFF 2.0] are the native
[ITS 2.0] support and the Advanced Validation feature via NVDL and Schematron. The Advanced Validation
feature for XLIFF 2 was first introduced at FEISGILTT 2014 [1] and covered extensively at XML London 2016
[2]. This paper and the accompanying XML Prague presentation look in detail at the native [ITS 2.0]
support feature: which W3C [ITS 2.0] metadata categories are supported in [XLIFF 2.1], and which ITS
data in XLIFF 2.1 are or are not accessible to generic ITS Processors despite the use of the W3C
namespace.
Table of Contents
Introduction
Lay of the land
ITS metadata categories and their purpose
    Source metadata that inform Extraction behavior
    Other metadata that inform localization behavior
    Subject Matter related datacats
    Metadata that are produced during or by localization transformations of content
    ITS metadata categories from the XLIFF representation point of view
The nitty gritty
Impact and what's next
Bibliography
This research was conducted at the ADAPT Research Centre, Trinity College Dublin, Ireland.
The ADAPT Centre is funded under the SFI (Science Foundation Ireland) Research Centres
Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
Introduction
This paper is about managing Internationalization metadata throughout the multilingual content
lifecycle. Even though corporations and governments routinely need to present the same, equivalent,
or comparable content in various languages, multilingual content is usually not consumed in more
than one language at the same time by the same end user. Typically the target audience consumes
the content in their preferred language and if everything works well they don't even need to be aware
that the monolingual content they consume is part of a multilingual content repository or a result of a
Translation, Localization, or cultural adaptation process.
Thus Multilingualism is transparent to the end user if implemented properly. To achieve end user
transparency, corporations, governments, and inter- or extra-national agencies need to develop and
employ Internationalization, Localization, and Translation capabilities. While Internationalization
is primarily done on monolingual content or a monolingual product, Localization and Translation,
when done at a certain level of maturity (as a repeatable process possibly aspiring to efficiencies
of scale and automation), require a persistent Bitext format.
This paper describes an open, standard, and transparent way of managing Internationalization
metadata in multilingual repositories from seeding them in monolingual source or pivot content
through extraction to an open Bitext format, manipulating or injecting relevant metadata categories
during the Bitext Roundtrip, to keeping, archiving or throwing away the metadata that arrived
processed in the target content.
Lay of the land
The foundational Internationalization Standard is of course [Unicode] along with some related
Unicode Annexes. But in this paper we are taking the Unicode support for granted and will be looking
at the domain standards W3C ITS and OASIS XLIFF that are the open standards relevant for covering
the industry process areas outlined in the Introduction.
For a long time, XML has been another unchallenged foundation of the multilingual content
interoperability and hence practically all Localization and Internationalization standards started as
or became at some point XML vocabularies. Paramount industry wisdom is stored in data models
that had been developed over decades as XML vocabularies at OASIS, W3C, LISA (RIP) and
elsewhere. Although ITS is based on abstract metadata categories, [ITS 1.0] had only provided
specific implementable recommendation for XML. The simple yet ingenious idea of ITS is to provide
a reusable namespace that can be injected into existing formats. Although the notion of a namespace
is not confined to XML, again ITS 1.0 was only specifically injectable into XML vocabularies.
[ITS 2.0] provides local and global methods for metadata storage not only in XML but also in HTML 5;
it also looked at mapping into non-XML formats such as [NIF], albeit in a non-normative way.
Because native HTML does not support the notion of namespaces, ITS 2.0 has to use attributes that
are prefixed with the string its- for the purpose of being recognized as an HTML 5 module. [ITS
2.0] also introduced many new metadata categories compared with [ITS 1.0]. ITS 1.0 only looked at
metadata in source content that would somehow help inform the Internationalization and Localization
processes down the line. ITS 2.0 brought brand new and sometimes complex metadata categories
that contain information produced during the localization processes or during the language service
transformations that are necessary to produce target content and are typically facilitated by Bitext. This
naturally led to a non-normative mapping of [ITS 2.0] to [XLIFF 1.2] and to [XLIFF 2.0]. Thus ITS 2.0
became a very useful extension to XLIFF. One of the main reasons why [XLIFF 2.0] is not backwards
compatible with [XLIFF 1.2] is that the OASIS XLIFF TC and the wider stakeholder community
wanted to create XLIFF 2 with a modularized data model. [XLIFF 2.0] has a small non-negotiable core
but at the same time it brings 8 namespace based modules for advanced functionality. The modular and
extensible design aims at easy production of "dot" revisions or releases of the standard. XLIFF 2.0 was
intended as the first in the future family of backwards compatible XLIFF 2 standards that will share the
maximally interoperable core (as well as successful modules surviving from 2.0). XLIFF 2 makes a
distinction between modules and extensions. While module features are optional, Conformant XLIFF
Agents are bound by an absolute prohibition to delete module based metadata, whereas deletion of
extension based data is discouraged but not prohibited. The ITS Module is the biggest feature that was
requested by the industry community and approved by the TC for specification as part of [XLIFF 2.1].
The easiest metadata category with which to explain the idea of ITS is Translate: a simple Boolean
flag that indicates whether or not source content is translatable.
Example 1. Translate expressed locally in HTML
<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8>
<title>Translate flag test: Default</title>
</head>
<body>
<p>The <span translate=no>World Wide Web Consortium</span> is
making the World Wide Web worldwide!</p>
</body>
</html>
Example 2. Translate expressed locally in XML
<messages its:version="2.0" xmlns:its="http://www.w3.org/2005/11/its">
<msg num="123">Click Resume Button on Status Display or <panelmsg its:translate="no"
>CONTINUE</panelmsg> Button on printer panel</msg>
</messages>
Because it is not always practical to create local annotations, and because a given source format
or XML vocabulary may already have elements or attributes with clear semantics with regard to some
Internationalization data categories such as Translate, ITS 2.0 also defines, for most data
categories, a way to express a given data category globally.
Example 3. Translate expressed globally in XML
<its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its">
<its:translateRule translate="no" selector="//code"/>
</its:rules>
In the above the global its:translateRule indicates that the content of <code> elements is
not to be translated.
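How an ITS Processor might apply such a global rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document is invented, only selectors of the form //name are handled, and inheritance and local overrides are ignored.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

RULES = """<its:rules version="2.0" xmlns:its="http://www.w3.org/2005/11/its">
  <its:translateRule translate="no" selector="//code"/>
</its:rules>"""

# Hypothetical source document, invented for this illustration.
DOC = """<doc>
  <p>Call <code>init()</code> before use.</p>
  <p>Plain paragraph.</p>
</doc>"""


def non_translatable(doc_root, rules_root):
    """Return the elements selected by translateRule elements with translate="no".

    Simplification: only selectors of the form //name are handled, by
    rewriting them to the .//name form that ElementTree understands.
    A later rule overrides an earlier one that selects the same element.
    """
    flags = {}
    for rule in rules_root.iter("{%s}translateRule" % ITS_NS):
        selector = rule.get("selector")
        if not selector.startswith("//"):
            raise ValueError("unsupported selector: " + selector)
        for elem in doc_root.findall("." + selector):
            flags[elem] = rule.get("translate")
    return [elem for elem, value in flags.items() if value == "no"]


hits = non_translatable(ET.fromstring(DOC), ET.fromstring(RULES))
print([(elem.tag, elem.text) for elem in hits])
```

A real ITS Processor would evaluate full XPath selectors and combine the result with local markup and inheritance; the sketch only shows the selection step.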
XLIFF 2 Core has its own native local method for expressing Translatability: the
xlf:translate attribute. Here and henceforth the prefix xlf: indicates the OASIS namespace
urn:oasis:names:tc:xliff:document:2.0. Because XLIFF is the Bitext format that is
used to manage the content structure during the service roundtrip in a source format agnostic way,
XLIFF needs to make a hard distinction between the structural and the inline data. We know the
structural vs inline distinction from many XML vocabularies and HTML. Some typical structural
elements are Docbook <section> or <para> as well as HTML <p>. This is how XLIFF 2 will
encode non-Translatability of a structural element:
Example 4. XLIFF Core @translate on a structural leaf element
<unit id='1' translate="yes">
<segment>
<source>Translatable text</source>
</segment>
</unit>
<unit id='2' translate="no">
<segment>
<source>Non-translatable text</source>
</segment>
</unit>
The above could be an Extraction of the following HTML snippet:
<p translate='yes'>Translatable text</p>
<p translate='no'>Non-translatable text</p>
The same snippet could also be represented like this:
Example 5. XLIFF representing ITS Translate by Extraction behavior w/o
explicit metadata
<unit id='1'>
<segment>
<source>Translatable text</source>
</segment>
</unit>
However, it is quite likely that the non-translatable structural elements could provide the translators
with some critical context information. Hence the non-extraction behavior can only be recommended
if the Extracting Agent, human or machine, can determine whether or not there is some sort of
contextual or linguistic relationship.
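The two structural Extraction strategies can be contrasted in a small Python sketch. The function name, the input format, and the keep_non_translatable switch are hypothetical; the units are built without the XLIFF namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: (text, translatable) pairs, as an Extractor might
# collect them from the two HTML paragraphs shown above.
PARAGRAPHS = [("Translatable text", True), ("Non-translatable text", False)]


def extract_units(paragraphs, keep_non_translatable=True):
    """Build a list of XLIFF-like <unit> elements (no namespace, for brevity)."""
    units = []
    for number, (text, translatable) in enumerate(paragraphs, start=1):
        if not translatable and not keep_non_translatable:
            continue  # strategy of Example 5: do not extract at all
        unit = ET.Element("unit", {"id": str(number)})
        if keep_non_translatable:
            # strategy of Example 4: extract everything, flag it explicitly
            unit.set("translate", "yes" if translatable else "no")
        source = ET.SubElement(ET.SubElement(unit, "segment"), "source")
        source.text = text
        units.append(unit)
    return units


for keep in (True, False):
    units = extract_units(PARAGRAPHS, keep)
    print([ET.tostring(unit, encoding="unicode") for unit in units])
```

With keep_non_translatable=True the non-translatable paragraph stays visible as context; with False it silently disappears from the Bitext, which is exactly the trade-off discussed above.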
In case of the Translate metadata category being expressed inline, XLIFF has to use its Translate
Annotation:
Example 6. XLIFF Core @translate used inline
<unit id='1'>
<segment>
<source>Text <pc id='1'><mrk id='m1' translate='no'>Code</mrk></pc></source>
</segment>
</unit>
The above could be an Extraction of the following HTML snippet:
<p>Text <code translate='no'>Code</code></p>
Also inline, there is an option to "hide" the non-translatable content like this:
Example 7. XLIFF representing ITS Translate by Extraction behavior w/o
explicit metadata
<unit id='1'>
<segment>
<source>Text <ph id='1'/></source>
</segment>
</unit>
Again, not displaying the non-translatable content can be detrimental to the process, as both human
and machine translation agents would produce unsuitable translations in case there is some linguistic
relationship between the displayed translatable text and the content hidden by the placeholder code.
Because XLIFF has its own native method of expressing translatability, generic ITS decorators could
not succeed. ITS processors can however access the translatability information within XLIFF using
the following global rule:
Example 8. ITS global rule to detect translatability in XLIFF
<its:rules version="2.0" queryLanguage="xpath"
  xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:xlf="urn:oasis:names:tc:xliff:document:2.0">
<!-- Rules for Translate -->
<its:translateRule selector="//xlf:*[@translate='no']" translate='no'/>
<its:translateRule selector="//xlf:*[@translate='yes']" translate='yes'/>
</its:rules>
In the section The nitty gritty we will explain why this rule will sometimes fail and how to address
the fail cases.
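What the two selectors of Example 8 pick up can be checked with a short Python sketch. The XLIFF document below is a hypothetical combination of the units shown earlier, and the selectors are mimicked with a plain tree walk rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"

# Hypothetical XLIFF 2 document, invented for this illustration.
XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       version="2.0" srcLang="en">
  <file id="f1">
    <unit id="1" translate="yes">
      <segment><source>Translatable text</source></segment>
    </unit>
    <unit id="2" translate="no">
      <segment><source>Text <mrk id="m1" translate="no">Code</mrk></source></segment>
    </unit>
  </file>
</xliff>"""

root = ET.fromstring(XLIFF)


def select(value):
    """Mimic //xlf:*[@translate='value']: every element in the XLIFF Core
    namespace whose own translate attribute has the given value."""
    prefix = "{%s}" % XLF
    return [elem for elem in root.iter()
            if elem.tag.startswith(prefix) and elem.get("translate") == value]


print([elem.get("id") for elem in select("no")])
print([elem.get("id") for elem in select("yes")])
```

Note that only elements carrying the attribute explicitly are selected; whatever applies to their descendants is left to the inheritance rules of the consuming processor.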
ITS metadata categories and their purpose
In this section we will first have a look at the 19 [ITS 2.0] metadata categories from the functional
point of view and later on from the XLIFF representation point of view.
Source metadata that inform Extraction behavior
Translate [http://www.w3.org/TR/its20/#trans-datacat], Locale Filter
[http://www.w3.org/TR/its20/#LocaleFilter], External Resource
[http://www.w3.org/TR/its20/#externalresource], and Elements Within Text
[http://www.w3.org/TR/its20/#elements-within-text] are actually all methods to inform
Extraction behavior. As such, these are quite important for the creation of Bitext management formats
such as XLIFF but don't necessarily need to be explicitly expressed within those formats. See the
detailed discussion of Translate above in Lay of the land and, for the other Extraction informing
datacats, the section ITS metadata categories from the XLIFF representation point of view below.
Locale Filter [http://www.w3.org/TR/its20/#LocaleFilter] is a method that can make content
conditional based on a target locale. This can be used on source as well as target content. For instance,
legal content will be different for the locales fr-FR and fr-CA.
External Resource [http://www.w3.org/TR/its20/#externalresource] indicates an external, usually
non-text, part of content that also needs to be changed in order to produce fully adapted target
content. Typically this points to external graphics, scripts or binaries.
Many formats don't contain all the localizable text in a linear structure. Elements Within Text [http://
www.w3.org/TR/its20/#elements-within-text] helps extractors properly handle and not lose context
for text placed in footnotes, endnotes, alt, or contextual hint text etc; on the other hand it can place
externally located text within a linear sequence that makes sense for human consumers, but where it
does not appear in the native environment.
Other metadata that inform localization behavior
Language Information [http://www.w3.org/TR/its20/#language-information] uses the [BCP 47] data
model via xml:lang to indicate the natural language of content. This is obviously very useful in
case you want to source translations or even just render the content with proper locale specifics.
Target Pointer [http://www.w3.org/TR/its20/#target-pointer] is used to lessen the pain when working
with multilingual documents. In a document that is to become multilingual, such as a packaging
desktop publishing file, certain areas are designated to hold translations in target languages. It is
a very bad idea to try to use multilingual documents or spreadsheets to actually manage versions of
multilingual content. Such formats should only be used when all target content has been produced via
some sort of a Bitext management roundtrip.
Localization Note [http://www.w3.org/TR/its20/#locNote-datacat] is basically a free text field with a
human readable localization instruction/warning or advice. Although a Localization Note can contain
any type of localization information it is not advisable to use it to prescribe Extraction behavior, except
perhaps as a preliminary step before using one of the interoperable datacats that inform Extraction.
Directionality [http://www.w3.org/TR/its20/#directionality] has quite a profound Internationalization
impact: it lets renderers decide at the protocol level (as opposed to the plain text or script level)
whether the content is to be displayed left to right (LTR - Latin script default) or right to left (RTL
- Arabic or Hebrew script default). But the Unicode Bidirectional Algorithm [UAX #9] as well as
directionality provisions in HTML and many XML vocabularies have changed since 2012/2013, so the ITS
2.0 specification text is actually not very helpful here. This obviously doesn't affect the importance of
the abstract data category and of having proper display behavior for bidirectional content.
Preserve Space [http://www.w3.org/TR/its20/#preservespace] indicates via xml:space whether or
not whitespace characters are significant. If whitespace is significant in source content, it is usually
also significant in the target content. This is more often than not an internal property of the content
format, but it is important to keep this characteristic through transformation pipelines. The danger
that this category is trying to prevent is the loss of significant whitespace characters that could not
be recovered.
ID Value [http://www.w3.org/TR/its20/#idvalue] indicates via xml:id a globally unique identifier
that should be preserved during translation and localization transformations mainly for the purposes
of reimport of target content to all the right places in the native environment.
Allowed Characters [http://www.w3.org/TR/its20/#allowedchars] and Storage Size
[http://www.w3.org/TR/its20/#storagesize] are basically a way to inform the Localization providers
about certain system limitations that happen to apply to the source but are also expected to be
fulfilled by the target. Often this is used to avoid issues from using localized content in insufficiently
internationalized environments, be it by sticking to the rule or by manual post-processing if the source
limitation cannot be possibly fulfilled by the target in certain locales. For instance a legacy system is
ASCII only and all Localizations are required to keep to this restriction (which is impossible for non-
Latin script based languages and isn't great for most Latin script based languages except for English).
Localizations might need to be held in certain database fields that happen to impose a Storage Size
restriction.
Subject Matter related datacats
Terminology [http://www.w3.org/TR/its20/#terminology] can simply indicate words or multi-word
expressions as terms or non-terms. This is how the category worked in ITS 1.0. In ITS 2.0,
Terminology can be more useful by pointing to definitions or indicating a confidence score, which
is especially useful in cases where the Terminology entry was seeded automatically. Terminology doesn't
belong exclusively here. Together with Text Analysis it can be actually injected into the content during
any stage of the lifecycle and is not limited to source. However, it is very important for the localization
process, human or machine driven, to have Terminology annotated be it even only the simple Boolean
flag.
Text Analysis [http://www.w3.org/TR/its20/#textanalysis] is a sister category to Terminology that
is new in ITS 2.0. It is intended to hold mostly automatically sourced (possibly semi-supervised)
entity disambiguation information. This can be useful for translators and reviewers but can also enrich
reading experience in purely monolingual settings.
Domain can be used to indicate the content topic, specialization, or subject matter focus that is
required to produce certain translations. This can be used, for instance, to select a suitably
specialized MT engine, such as one trained on an automotive bilingual corpus in case an automotive
domain is indicated.
In another use case, a language service provider will use a sworn translator and require in country
legal subject matter review in case the domain was indicated as legal. Although ITS data categories
are defined independently and don't have implementation dependencies, Domain information is well
suited for usage together with the Terminology and Text Analysis datacats.
Metadata that are produced during or by localization
transformations of content
It might seem that this type of metadata is completely new in [ITS 2.0], since [ITS 1.0] concentrated
almost exclusively on source metadata. However, as mentioned above, Terminology, which was already
present in ITS 1.0, can be injected at any point and is not confined to source. Also, Directionality
is a characteristic of source as well as target and has profound importance during the roundtrip. This
was however not the focus in ITS 1.0.
MT Confidence [http://www.w3.org/TR/its20/#mtconfidence], Localization Quality Issue
[http://www.w3.org/TR/its20/#lqissue], Localization Quality Rating [http://www.w3.org/TR/its20/
#lqrating], and Provenance [http://www.w3.org/TR/its20/#provenance] - all new categories in ITS 2.0
- can be only produced during Localization transformations; specifically, during Machine Translation,
during a review or quality assurance process, during or immediately after a manual or automated
translation or revision.
MT Confidence [http://www.w3.org/TR/its20/#mtconfidence] gives a simple score between 0 and 1
that encodes the automated translation system's internal confidence that the produced translation is
correct. This score isn't interoperable but can be used in single engine scenarios for instance to color
code the translations for readers or post-editors. It can also be used for storing the data for several
engines and running comparative studies to make the score interoperable first in specific environments
and later on maybe generally.
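The color coding use case for a single engine can be sketched as follows. The function name, the thresholds, and the color names are arbitrary assumptions; they are only meaningful relative to one particular MT engine.

```python
def confidence_color(score, low=0.4, high=0.8):
    """Map an MT Confidence score in [0, 1] to a display color.

    The thresholds are assumptions chosen for illustration; they would
    have to be calibrated separately for each MT engine.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("MT Confidence must lie between 0 and 1")
    if score >= high:
        return "green"
    if score >= low:
        return "amber"
    return "red"


for score in (0.95, 0.6, 0.1):
    print(score, confidence_color(score))
```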
Localization Quality Issue [http://www.w3.org/TR/its20/#lqissue] contains a taxonomy of possible
Translation and Localization errors that can be applied in annotations of arbitrary content spans.
The taxonomy ensures that this information can be exchanged among various Localization roundtrip
agents. Although this mark up is typically introduced in a Bitext environment on target spans, marking
up source isn't excluded and can be very practical, especially when implementing the feedback or even
reporting a source issue. Importantly, the issues and their descriptions can be Extracted into target
content and consumed by monolingual reviewers in the native environment.
Localization Quality Rating [http://www.w3.org/TR/its20/#lqrating] is again a simple score that gives
a percentage indicating the quality of any portion of content. This score is obviously only interoperable
within an indicated Localization Quality Rating system or metrics. Typically flawless quality is
considered 100 % and various issue rates per translated volume would strike down percentages,
possibly dropping under an acceptance threshold that can be also specified.
Provenance [http://www.w3.org/TR/its20/#provenance] in ITS is strictly specialized to indicate
only translation and revision agents. Agents can be organizations, people or tools or described by
combinations of those. For instance, Provenance can indicate that the Reviser John Doe from ACME
Language Quality Assurance Inc. produced a content revision with the Perfect Cloud Revision Tool.
ITS metadata categories from the XLIFF representation
point of view
When ITS is becoming an XLIFF module, the most important point of view is whether or not any of
the data categories already exist or are partially present in XLIFF. Fully overlapping categories need
to be mapped. Fully absent categories are simply implemented from scratch using the ITS provisions.
Most challenging are the cases of partial overlap. Several datacats were not implemented for various
reasons explained below.
Already in
The ITS data categories that already were available in [XLIFF 2.0] are Preserve Space [http://
www.w3.org/TR/its20/#preservespace], Translate [http://www.w3.org/TR/its20/#trans-datacat], and
External Resource [http://www.w3.org/TR/its20/#externalresource].
We've already covered how Translate is represented in XLIFF 2 back in Lay of the land.
Both ITS and XLIFF do use the xml:space to represent the Preserve Space [http://www.w3.org/
TR/its20/#preservespace] behavior. However, XLIFF 2 prohibits setting of xml:space lower than
on the <unit> element. The reason being that xml:space set lower would not evaluate properly
on XLIFF inline pseudo-spans. Thus XLIFF cannot explicitly express mixed Preserve space behavior
inline. It can however normalize all inline content spans that had xml:space="default" and set
xml:space="preserve" on the ancestor <unit>. Thus, XLIFF 2 can express the same Preserve
Space behavior as ITS, despite the pseudo-spans complication.
External Resource [http://www.w3.org/TR/its20/#externalresource] can be extracted into XLIFF 2
using the [XLIFF 2.0] Resource Data Module [http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/
xliff-core-v2.0-os.html#resourceData_module], which is actually more expressive and requires the
indication of the media type for each external resource. XLIFF Extractors have the extra onus to check
and set the media type when extracting the External Resource information. While ITS Processors can
be pointed to the external resources, they don't need to care for the extra media type info that is not
required by the ITS datacat.
Localization Note [http://www.w3.org/TR/its20/#locNote-datacat] was listed in [XLIFF 2.1 [csprd01]] as
another data category that is fully available through XLIFF. However, issue #5
[https://issues.oasis-open.org/browse/XLIFF-5], raised by Yves Savourel, revealed that there are some
subtle differences in scope between the ITS Localization Note and the XLIFF Core <note>. Although it
is possible to roundtrip Localization Note data in XLIFF 2, it is strictly speaking not accessible to
ITS Processors in exactly the same way as to XLIFF Agents. The culprit here is inheritance.
Inheritance is one of the basic principles of ITS data
categories and so the ITS Localization Note information actually applies to all children of the node
where it was specified. XLIFF Notes can apply to high level immutable structural elements <file>,
<group>, and <unit> and from the business point of view the Notes do apply to their whole
content, nevertheless the Notes content is isolated in a wrapper that is a sibling to the structural payload
children and no defaults or inheritance are defined for the structural descendants of the structural host.
Simply speaking, if a Docbook <section> element hosts a Localization Note, the information will
be retrieved by ITS Processors from all of its descendants, while the <group> element based Note
that will be created during extraction via the mapping will not be imposed on the <group> element's
descendants. Although from the business point of view, the information does apply to all of the nested
content, the Note information actually does not get inherited in the strict XML sense.
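The inheritance difference described above can be made concrete with a small Python sketch. Both sample documents are hypothetical, and the walk implements only a simplified notion of ITS inheritance (a value set on an ancestor applies to every descendant element unless overridden).

```python
import xml.etree.ElementTree as ET

ITS_LOC_NOTE = "{http://www.w3.org/2005/11/its}locNote"

# Hypothetical Docbook-like source with a local ITS Localization Note.
SOURCE = """<section xmlns:its="http://www.w3.org/2005/11/its"
         its:locNote="Keep the tone informal">
  <para>First <emphasis>paragraph</emphasis></para>
  <para>Second paragraph</para>
</section>"""

# Hypothetical XLIFF 2 counterpart: the note lives in a <notes> wrapper
# that is a sibling of the units, not an ancestor of them.
XLIFF = """<group id="g1" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <notes><note>Keep the tone informal</note></notes>
  <unit id="1"><segment><source>First paragraph</source></segment></unit>
  <unit id="2"><segment><source>Second paragraph</source></segment></unit>
</group>"""


def inherited_notes(elem, inherited=None, out=None):
    """Record, for each element, the its:locNote value that reaches it
    through simplified ITS-style inheritance from its ancestors."""
    out = {} if out is None else out
    note = elem.get(ITS_LOC_NOTE, inherited)
    out[elem] = note
    for child in elem:
        inherited_notes(child, note, out)
    return out


source_notes = inherited_notes(ET.fromstring(SOURCE))
xliff_notes = inherited_notes(ET.fromstring(XLIFF))
print(sum(note is not None for note in source_notes.values()), "of", len(source_notes))
print(sum(note is not None for note in xliff_notes.values()), "of", len(xliff_notes))
```

In the source every element receives the note through inheritance, while the same generic walk finds nothing on the XLIFF side, because the note text sits in a sibling wrapper rather than on an ancestor.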
Thus, the Localization Note will move to the Partial overlap data categories in all upcoming [XLIFF
2.1] publications.
Implemented from scratch
Allowed Characters [http://www.w3.org/TR/its20/#allowedchars], Domain, Locale Filter
[http://www.w3.org/TR/its20/#LocaleFilter], Localization Quality Issue
[http://www.w3.org/TR/its20/#lqissue], Localization Quality Rating
[http://www.w3.org/TR/its20/#lqrating], and Text Analysis
[http://www.w3.org/TR/its20/#textanalysis] are fully defined within the [XLIFF 2.1] ITS Module.
Each of the above defines its own custom annotation to be applicable on XLIFF Inline Content.
Application on structural levels is straightforward.
Localization Quality Issue [http://www.w3.org/TR/its20/#lqissue] and Localization Quality Rating
[http://www.w3.org/TR/its20/#lqrating] are injected during the XLIFF roundtrip and can be Extracted
to the target content in the native environment if supported by the Merger (with full Extractor
knowledge).
Locale Filter [http://www.w3.org/TR/its20/#LocaleFilter] is also listed under Not represented. It
shouldn't be used within XLIFF Documents that already have specified the target language.
xlf:trgLang is optional if there are no <xlf:target> elements in the XLIFF Document.
Corporate content owners participating in XLIFF 2 development expressed the requirement to be able
to produce preliminary XLIFF Documents with only source content and the Locale Filter metadata
that can be further processed by Localization Service Providers to create multiple XLIFF Documents
with the target language set and the Locale Filter metadata fully consumed (no longer present) by
inclusion of only the relevant source content.
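The splitting step described above can be sketched in Python. Everything here is an assumption made for illustration: the unit tuples, the function names, and in particular the matching, which uses simple prefix matching in the style of RFC 4647 basic filtering rather than the extended filtering that ITS 2.0 specifies.

```python
def range_matches(language_range, tag):
    """Simplified matching in the style of RFC 4647 basic filtering."""
    language_range, tag = language_range.lower(), tag.lower()
    return (language_range == "*" or tag == language_range
            or tag.startswith(language_range + "-"))


def applies_to(locale, filter_list, filter_type="include"):
    """Decide whether content carrying a Locale Filter is kept for `locale`.

    filter_list is a comma-separated list of language ranges; an empty
    list matches no locale.
    """
    ranges = [r.strip() for r in filter_list.split(",") if r.strip()]
    matched = any(range_matches(r, locale) for r in ranges)
    return matched if filter_type == "include" else not matched


# Hypothetical source-only units: (id, text, filter list, filter type).
UNITS = [
    ("1", "General terms", "*", "include"),
    ("2", "Terms for France", "fr-FR", "include"),
    ("3", "Terms for Canada", "fr-CA, en-CA", "include"),
    ("4", "Not for Canada", "fr-CA, en-CA", "exclude"),
]


def split_by_target(units, target_locales):
    """Return, per target locale, the ids of the units that remain once
    the Locale Filter metadata has been consumed."""
    return {
        locale: [uid for uid, _text, flist, ftype in units
                 if applies_to(locale, flist, ftype)]
        for locale in target_locales
    }


print(split_by_target(UNITS, ["fr-FR", "fr-CA"]))
```

Each resulting list corresponds to one XLIFF Document with the target language set and no Locale Filter metadata left in it.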
The remaining datacats work smoothly as intended in [ITS 2.0] and don't pose any specific
implementation challenges.
Partial overlap
Language Information [http://www.w3.org/TR/its20/#language-information], MT Confidence
[http://www.w3.org/TR/its20/#mtconfidence], Provenance [http://www.w3.org/TR/its20/#provenance],
Terminology [http://www.w3.org/TR/its20/#terminology], and Storage Size
[http://www.w3.org/TR/its20/#storagesize] have partial overlap with XLIFF 2.0 features.
Language Information [http://www.w3.org/TR/its20/#language-information] used on structural
elements is fully supported by XLIFF Core xlf:srcLang and xlf:trgLang attributes. The
culprit is in usage of foreign language spans inline within another source or target language. XLIFF
didn't have provisions to handle this use case because the xml namespace (and hence xml:lang)
is prohibited within the XLIFF inline data model. This is because xml:lang could not properly
interact with XLIFF Core pseudo-spans. itsm:lang was defined in [XLIFF 2.1 [csprd01]] but this
is now gone since [XLIFF 2.1] now operates with the W3C its namespace that doesn't have its own
language information attribute and reuses xml:lang.
MT Confidence [http://www.w3.org/TR/its20/#mtconfidence] has an overlap with the XLIFF
2.0 Translation Candidates Module
[http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#candidates].
Explicit usage of ITS based MT Confidence on XLIFF Core is problematic as it only makes sense
on unmodified MT suggestions and this cannot be reliably discerned within XLIFF Core. MT Enrichers
can be mandated to only use MT Confidence with unmodified MT suggestions. However, Core only
Modifiers cannot be mandated to delete module based data after a human intervention. ITS Processors
unfortunately cannot access the MT Confidence information contained in mtc:matchQuality (mtc:
indicates urn:oasis:names:tc:xliff:matches:2.0, which is the Translation Candidates module
namespace) because ITS does not specify a global pointer for MT Confidence. This is where ITS is
still in need of change or at least an extension.
Provenance [http://www.w3.org/TR/its20/#provenance] has an overlap with the Change Tracking
Module. Here the situation is particularly convoluted since XLIFF 2.1 deprecates the
urn:oasis:names:tc:xliff:changetracking:2.0 namespace and replaces it with the
urn:oasis:names:tc:xliff:changetracking:2.1 because of significant changes in
this module since XLIFF 2.0. In XLIFF Core, Provenance works fine as if from scratch; however,
in the Change Tracking Module, Provenance metadata needs to interact in an interoperable way with
some of its native attributes.
Terminology [http://www.w3.org/TR/its20/#terminology] overlaps with XLIFF
Core Term Annotation [http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#termAnnotation].
This was the easiest case of partial overlap, since the Core
Annotation could be complemented with an additional ITS Annotation to cover the additional ITS based
parts of the Terminology information. ITS Terminology and XLIFF Term also interact well with
the XLIFF 2.0 Glossary Module [http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#glossary-module],
which easily maps to and from TBX Basic.
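How a Core Term Annotation can carry complementary ITS information may be easier to see in a small example. The sketch below reads term annotations from a hypothetical unit. The attribute name its:termConfidence and its placement on <mrk> are assumptions of this sketch and should be checked against the ITS Module text of [XLIFF 2.1]; the sample content and the function name term_annotations are invented for illustration.

```python
import xml.etree.ElementTree as ET

XLF = "urn:oasis:names:tc:xliff:document:2.0"
ITS = "http://www.w3.org/2005/11/its"

# Hypothetical unit: a Core term annotation plus one ITS based attribute.
SAMPLE_UNIT = """<unit xmlns="urn:oasis:names:tc:xliff:document:2.0"
      xmlns:its="http://www.w3.org/2005/11/its" id="u1">
  <segment>
    <source>Switch on the <mrk id="m1" type="term" value="a device that prints"
      its:termConfidence="0.9">printer</mrk>.</source>
  </segment>
</unit>"""


def term_annotations(unit_text):
    """Return (term text, Core value, ITS confidence or None) per term."""
    root = ET.fromstring(unit_text)
    terms = []
    for mrk in root.iter(f"{{{XLF}}}mrk"):
        if mrk.get("type") != "term":
            continue  # other annotation types are out of scope here
        confidence = mrk.get(f"{{{ITS}}}termConfidence")
        terms.append((
            "".join(mrk.itertext()),
            mrk.get("value"),
            float(confidence) if confidence is not None else None,
        ))
    return terms


terms = term_annotations(SAMPLE_UNIT)
```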
Storage Size [http://www.w3.org/TR/its20/#storagesize] has been listed as a case of partial
overlap, since this data category can be expressed as an extension of the XLIFF 2.0 Size
and Length Restriction Module [http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-
os.html#size_restriction_module] (SLR). The specifics of the extension have not been given due to
lack of industry interest. Nevertheless, [XLIFF 2.1] states that Storage Size has to be implemented
(if at all) using the Size and Length Restriction Module, which is a readily extensible framework for
giving all sorts of "fitting somewhere" restrictions; not only size and length, but also for instance
custom display areas filled by letters in specific fonts and sizes, even volume or weight when needed.
ITS Storage Size is simply a special case of an SLR based storage restriction.
Not represented
Directionality [http://www.w3.org/TR/its20/#directionality], Elements Within Text [http://
www.w3.org/TR/its20/#elements-within-text], ID Value [http://www.w3.org/TR/its20/#idvalue],
Locale Filter [http://www.w3.org/TR/its20/#LocaleFilter], Target Pointer [http://www.w3.org/TR/
its20/#target-pointer] are listed as not represented datacats in [XLIFF 2.1].
Elements Within Text [http://www.w3.org/TR/its20/#elements-within-text] is fully consumed by
Extractor behavior. Properly extracted information will either result in formation of XLIFF Core Sub-
flows [http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#subflowsdesc] or a
standard linear order of XLIFF Core segments and units. The metadata will, however, not be represented
in XLIFF Documents, and the correct target creation behavior will only be accessible to a Merger
that has full Extractor knowledge [http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-
os.html#d0e333].
It is preferable that Locale Filter [http://www.w3.org/TR/its20/#LocaleFilter] is fully consumed by
Extraction, so that the metadata doesn't need to be represented within XLIFF Documents. See, however,
Implemented from scratch for an exception.
Target Pointer [http://www.w3.org/TR/its20/#target-pointer] information can form a part of the
Extractor / Merger knowledge that is required to populate the native multilingual document. There are,
however, no provisions to represent this in XLIFF Documents. On the other hand, XLIFF as a Bitext
format is a kind of multilingual document, and there is a fairly simple ITS rule (within the ITS Module
Schematron schema [http://docs.oasis-open.org/xliff/xliff-core/v2.1/csprd01/schemas/itsm.sch]) that
lets generic ITS processors parse XLIFF 2 for target content.
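The source to target relationship that such a rule exposes can be illustrated outside ITS as well. The following sketch does not reproduce the rule from the schema; it only pairs each <source> with its sibling <target> inside a <segment>, which is the pairing a target pointer would have to express. The sample document and the function name source_target_pairs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"xlf": "urn:oasis:names:tc:xliff:document:2.0"}

# Hypothetical sample: the second segment has not been translated yet.
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       version="2.0" srcLang="en" trgLang="fr">
  <file id="f1">
    <unit id="u1">
      <segment><source>Hello</source><target>Bonjour</target></segment>
      <segment><source>World</source></segment>
    </unit>
  </file>
</xliff>"""


def source_target_pairs(xliff_text):
    """Pair the text of each segment's source with its target, if any."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for segment in root.iterfind(".//xlf:segment", NS):
        source = segment.find("xlf:source", NS)
        target = segment.find("xlf:target", NS)
        pairs.append((
            "".join(source.itertext()),
            "".join(target.itertext()) if target is not None else None,
        ))
    return pairs


pairs = source_target_pairs(SAMPLE)
```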
XLIFF Core doesn't use xml:id; all XLIFF ids are of the type xs:NMTOKEN and uniqueness scopes
are defined at several separate levels. Thus the ID Value [http://www.w3.org/TR/its20/#idvalue]
information is fully consumed by the Extraction / Merge behavior.
XLIFF Core has its own fully fledged Directionality [http://www.w3.org/TR/its20/#directionality]
capability. So in a sense the abstract datacat should count as "already in"; nevertheless,
the XLIFF Core Directionality [http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#d0e9515]
doesn't need to be mapped to the specific ITS 2.0 Directionality provisions, as explained
in Other metadata that inform localization behavior.
The nitty gritty
In the discussion so far, we have seen that there is a fairly good semantic match between ITS
and XLIFF. However, there are a couple of principal challenges that are not at all easy to address
properly. The solutions adopted during the resolution of the 1st Public Review Comments [XLIFF 2.1
[csprd01]], especially issue #9 [https://issues.oasis-open.org/browse/XLIFF-9], were driven by the
intention to achieve the maximum possible interoperability and to make most of the XLIFF-encoded ITS
metadata accessible to generic ITS processors without the need to implement specific XLIFF Agent
conformance requirements.
The ITS Module namespace that was originally used to encode new ITS metadata within XLIFF
was urn:oasis:names:tc:xliff:itsm:2.1. This has been replaced by the original W3C
namespace http://www.w3.org/2005/11/its for the 2nd Public Review Draft, primarily
to make more ITS categories directly accessible by generic ITS Processors. The principal cause that
forced the OASIS XLIFF TC to reuse the W3C namespace was that ITS processors could not be
directed by ITS global rules (within the ITS Module Schematron schema [http://docs.oasis-open.org/
xliff/xliff-core/v2.1/csprd01/schemas/itsm.sch]) even to some data categories that were implemented
in XLIFF from scratch. This was impossible due to the lack of global pointers for those categories
within the [ITS 2.0] specification. While the theoretically proper action would have been to keep the
distinct OASIS ITS Module namespace and add the needed global pointers to ITS (a new dot release),
this would not be practically achievable. In W3C, Working Groups (WG) are only mandated to work
on specific work items and are disbanded after the intended Recommendations are published, hence a
new WG would need to be assembled and mandated to produce a new version of ITS.
In spite of [XLIFF 2.1] now using the W3C namespace for the ITS Module, there is a systematic scope
mismatch between the XLIFF defined ITS attributes and the ITS defined XML attributes. Because
ITS 2.0 has no provision to parse pseudo-spans, it will necessarily fail to identify spans formed by
XLIFF Core <sm/> and <em/> markers.
In XLIFF, Modifiers can always transform <mrk id="1">span of text</mrk> into <sm
id="1"/>span of text<em startRef="1"/>, which is fundamentally inaccessible by ITS
Processors without extended provisions. Unmodified or unextended ITS Rules will find the <sm/>
nodes, if those nodes do hold the W3C ITS namespace based attributes or native XLIFF attributes that
can be globally pointed to by ITS rules, yet they will fail to identify the pseudo-spans and will consider
the <sm/> nodes empty, ultimately failing to identify the proper scope of the correctly identified
datacat. XLIFF implementers who want to make their XLIFF Stores maximally accessible to ITS
processors are encouraged to avoid forming <sm/> based spans; this is, however, often not possible.
Had it been possible, XLIFF would not have needed to define <sm/> and <em/> delimited
pseudo-spans in the first place.
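The transformation discussed above can be sketched in a few lines. The example below works on a namespace-free fragment for brevity, which is an assumption of this sketch; real XLIFF elements live in the XLIFF Core namespace. The function name mrk_to_pseudo_span is hypothetical, and copying all <mrk> attributes onto <sm/> is a simplification rather than a statement of what the specification requires.

```python
import xml.etree.ElementTree as ET


def mrk_to_pseudo_span(parent, mrk):
    """Replace `mrk`, a direct child of `parent`, by <sm/> ... <em/> markers."""
    index = list(parent).index(mrk)
    sm = ET.Element("sm", dict(mrk.attrib))       # annotation data moves to <sm/>
    em = ET.Element("em", {"startRef": mrk.get("id")})
    sm.tail = mrk.text        # text that opened the span now follows <sm/>
    em.tail = mrk.tail        # text after the span now follows <em/>
    inner = list(mrk)         # inline children keep their own text and tails
    parent.remove(mrk)
    for offset, node in enumerate([sm] + inner + [em]):
        parent.insert(index + offset, node)


source = ET.fromstring(
    '<source>Before <mrk id="1">span of text</mrk> after</source>'
)
mrk_to_pseudo_span(source, source.find("mrk"))
serialized = ET.tostring(source, encoding="unicode")
```

After the call, the text "span of text" is no longer the content of any element; it is merely the tail of the empty <sm/> node, which is why an unextended ITS Processor sees an empty node instead of a span.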
Principal reasons to form pseudo-spans include the following requirements: 1) the capability to represent
non-XML content, 2) the need for overlapping annotations, 3) the capability to represent annotations
overlapping with formatting spans, and 4) annotations broken by segmentation (which has to be
represented as well formed structural, albeit transient, nodes).
Impact and what's next
[XLIFF 2.1] gives guidance on how to roundtrip each of the 19 ITS 2.0 datacats. All ITS Module
based metadata is accessible to ITS Processors, except for the pseudo-span issue described above. ITS
Processors can easily implement an additional capability to detect spans like this one: <sm id="1"/>span
of text<em startRef="1"/>, without going into any more XLIFF specific features.
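A minimal sketch of such a span detection capability follows. It is one possible approach under stated assumptions, not a prescribed algorithm: the function name pseudo_spans is hypothetical, the sample fragment is namespace-free, and the walk is limited to a single <source> or <target>, so spans that continue across segment boundaries are not resolved.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the sketch works with or without one."""
    return tag.rsplit("}", 1)[-1]


def pseudo_spans(container):
    """Map each <sm/> id to the text between it and its matching <em/>."""
    open_spans = {}   # id -> text pieces collected so far
    closed = {}       # id -> complete span text

    def feed(text):
        # Every currently open span covers this piece of text.
        if text:
            for pieces in open_spans.values():
                pieces.append(text)

    def walk(elem):
        feed(elem.text)
        for child in elem:
            name = local_name(child.tag)
            if name == "sm":
                open_spans[child.get("id")] = []
            elif name == "em":
                ref = child.get("startRef")
                if ref in open_spans:
                    closed[ref] = "".join(open_spans.pop(ref))
            else:
                walk(child)   # descend into other inline elements
            feed(child.tail)

    walk(container)
    return closed


# Hypothetical fragment with two overlapping pseudo-spans.
source = ET.fromstring(
    '<source>A <sm id="1"/>bold <sm id="2"/>and<em startRef="1"/>'
    ' italic<em startRef="2"/> end</source>'
)
spans = pseudo_spans(source)
```

The overlapping spans in the sample are exactly the kind of structure that cannot be expressed with nested <mrk> elements, which is one of the reasons pseudo-spans exist.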
Existing and overlapping features are not accessible in cases where ITS 2.0 lacks global pointers. It
is again relatively easy and straightforward to introduce these as extensions via the W3C ITS Interest
Group (IG). This IG does not have the mandate to produce normative additions to ITS 2.0 or a new
version, yet it can introduce new useful extensions and keep track of features needed for a potential
new major version.
The release of a technically stable public review draft of [XLIFF 2.1] constitutes another important step
in harmonization of Internationalization and Localization standards based at OASIS, W3C, Unicode
Consortium and elsewhere. Early adopters of XLIFF 2.1 should subscribe to the XLIFF TC Comment
List [https://www.oasis-open.org/committees/comments/index.php?wg_abbrev=xliff] to be notified
of further progress of the review drafts towards the official publication as an OASIS Standard,
hopefully in summer 2017. Importantly, XLIFF 2.1 contains the media type registration template
which will result in a definitive registration of the extension xlf for the XLIFF 2 family of standards.
The abstract object model for XLIFF [https://github.com/oasis-tcs/xliff-omos-om] as well as non-
XML serializations of that model (such as JLIFF [https://github.com/oasis-tcs/xliff-omos-jliff]) are
being developed at the OASIS XLIFF OMOS TC [https://www.oasis-open.org/committees/xliff-
omos/]. This TC also looks into mappings to and from other Internationalization and Localization
standards, as well as Localization service APIs and reference architectures. XLIFF proper, the original
XML serialization of XLIFF 2, continues to be developed at the OASIS XLIFF TC.
Bibliography
[1] S. Saadatfar and D. Filip: Advanced Validation Techniques for XLIFF 2. Localisation Focus, vol. 14, no. 1,
pp. 43-50, April 2015. http://www.localisation.ie/locfocus/issues/14/1
[2] S. Saadatfar and D. Filip: Best Practice for DSDL-based Validation. XML London 2016 Conference
Proceedings, May 2016.
[BCP 47] M. Davis, Ed. Tags for Identifying Languages, http://tools.ietf.org/html/bcp47 IETF (Internet
Engineering Task Force).
[NIF] S. Hellmann, J. Lehmann, S. Auer, and M. Brümmer: Integrating NLP using Linked Data. 12th
International Semantic Web Conference, Sydney, Australia, 2013. http://svn.aksw.org/papers/2013/
ISWC_NIF/public.pdf
[ITS 1.0] C. Lieske and F. Sasaki, Eds.: Internationalization Tag Set (ITS) Version 1.0. W3C Recommendation,
03 April 2007. W3C. https://www.w3.org/TR/its/
[ITS 2.0] D. Filip, S. McCance, D. Lewis, C. Lieske, A. Lommel, J. Kosek, F. Sasaki, Y. Savourel, Eds.:
Internationalization Tag Set (ITS) Version 2.0. W3C Recommendation, 29 October 2013. W3C. http://
www.w3.org/TR/its20/
[UAX #9] M. Davis, A. Lanin, and A. Glass, Eds.: UAX #9: Unicode Bidirectional Algorithm. Version: Unicode
9.0.0, Revision 35, 18 May 2016. Unicode Consortium. http://www.unicode.org/reports/tr9/tr9-35.html
[Unicode] K. Whistler et al., Eds.: The Unicode Standard. Version 9.0 - Core Specification, July 2016. Unicode
Consortium. http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf
[XLIFF 1.2] Y. Savourel, J. Reid, T. Jewtushenko, and R. M. Raya, Eds.: XLIFF Version 1.2. OASIS
Standard, 01 February 2008. OASIS. http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html
[XLIFF 2.0] T. Comerford, D. Filip, R. M. Raya, and Y. Savourel, Eds.: XLIFF Version 2.0. OASIS Standard, 05
August 2014. OASIS. http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html
[XLIFF 2.1 [csprd01]] D. Filip, T. Comerford, S. Saadatfar, F. Sasaki, and Y. Savourel, Eds.: XLIFF Version
2.1. Public Review Draft 01, 14 October 2016. OASIS http://docs.oasis-open.org/xliff/xliff-core/v2.1/
csprd01/xliff-core-v2.1-csprd01.html
[XLIFF 2.1] D. Filip, T. Comerford, S. Saadatfar, F. Sasaki, and Y. Savourel, Eds.: XLIFF Version 2.1. Public
Review Draft 02, February 2017. OASIS http://docs.oasis-open.org/xliff/xliff-core/v2.1/csprd02/xliff-
core-v2.1-csprd02.html