A No-Compromises Architecture
for Digital Document Preservation
Thomas A. Phelps and P.B. Watry
University of Liverpool
Liverpool, Great Britain
phelps@ACM.org, P.B.Watry@liverpool.ac.uk
Abstract. The Multivalent Document Model offers a practical, proven, no-
compromises architecture for preserving digital documents of potentially
any data format. We have implemented from scratch such complex and cur-
rently important formats as PDF and HTML, as well as older formats includ-
ing scanned paper, UNIX manual pages, TeX DVI, and Apple II AppleWorks
word processing. The architecture, stable since its definition in 1997, ex-
tends easily to additional document formats, defines a cross-format docu-
ment tree data structure that fully captures semantics and layout, supports
full expression of a format's often idiosyncratic concepts and behavior, en-
ables sharing of functionality across formats thus reducing implementation
effort, can introduce new functionality such as hyperlinks and annotation to
older formats that cannot express them, and provides a single interface (API)
across all formats. Multivalent contrasts sharply with emulation and con-
version, and advances Lorie's Universal Virtual Computer with high-level
architecture and extensive implementation.
1 Introduction
Digital preservation raises many issues: capture (reading data from old physical
media or harvesting web sites), provenance, metadata, data management, long-term
storage, availability, disaster prevention, multiple data types (scientific data, video),
protecting intellectual property, and others. This paper focuses on the problem of
obsolescence of digital document data formats. It is an important problem: “obsolescence
of media formats and data formats is the most demanding problem while preservation
of bitstreams can be mastered by using well-known techniques” [17].
Within this focus of digital document formats, it is worthwhile to consider what
constitutes successful preservation. For documents on paper, preservation of the
physical material implies that the content can be viewed, if sometimes under restricted
access. Digital works are unlike paper in that the preservation of the material itself,
the data files, is trivially accomplished, once the documents have been initially
collected, by successive copying.
Proceedings of the 9th European Conference on Research and Advanced Technology for
Digital Libraries (ECDL 2005), September 18-23, 2005, Vienna, Austria.
However, viewing of digital documents is complex. Beyond a few text-based formats
such as ASCII text, document formats are severely if not entirely unreadable
without decoding by specialized software. Digital documents often include time-based
content such as sounds, video, and animations. Digital documents often contain active
elements such as forms, scripts, and plug-ins. In the context of scientific data and
software, which parallel documents with embedded programs, Messerschmitt [12]
points out that the distinction between the two is “rapidly blurring” and states that
“data and software preservation targets are not separate, but should be assumed from
the beginning to be largely inseparable”. Marshall and Golovchinsky [9] consider the
additional nuanced dimensions of literary hypertexts, “that arise from the work's
on-screen appearance, its interactive behavior, and the ways a reader's interaction with the
work is recorded”. Similar arguments could be made for research systems in general
and any other document system with idiosyncratic concepts. Preservation of a viewing
capability is especially problematic if the software is proprietary, the software runs
only on obsolete hardware, and the data formats are not public.
And viewing is not enough. More is expected of digital documents than of paper. Users
expect to copy and paste text, images, videos, and other content types. Institutions,
web sites, and individuals all expect to search the contents of documents (this
function is so fundamental it is being built into next-generation operating systems).
For some users, text-to-speech and automatic Braille generation are essential. Compa-
nies and researchers want to perform text mining and automatic language translation.
People want to convert documents to the format du jour, such as for handheld devices
with small screens. Researchers want to infer the semantic structure of documents,
utilizing all the information the document contains, everything from layout coordi-
nates to style sheets (if any) to explicit semantic structure (if any). Users want to add
hyperlinks and annotate, even if the document format does not support those concepts.
We can expect the future to be increasingly demanding as new applications are in-
vented that rely upon potentially any aspect of document content.
2 Related Work
2.1 Hardware Preservation
Document software, like software in general, often requires specific supporting soft-
ware and, directly or indirectly (perhaps through the operating system), specific hard-
ware. The problem is hardware breaks down, and new generation hardware may not be
compatible with the software. Hardware preservation seeks to preserve the readability
of the original digital document by maintaining the original hardware and software
indefinitely. Such a hardware museum is destined to ultimately fail as the hardware breaks
down with no other like machine to cannibalize for parts, and parts are too specialized
to resume manufacture cost effectively.
2.2 Emulation
Required hardware can be emulated in software on current (more powerful) computers,
and therefore emulators can reproduce a document’s exact appearance and behavior. It
requires quite a bit of work by experts to emulate a computer, especially a modern
computer, but there are many applications for such emulators and several companies
sell them. When the current computer grows obsolete, a new emulator for it can run
the emulator of the previous generation, and so on, creating an ever-growing stack of
emulators, which may or may not be sustainable.
In any case, the document content remains trapped within the emulator. Somewhere
within the emulator's memory soup are program data structures that represent “the
document”. But finding the document and extracting it remains at least as difficult as
interpreting the document file's original bitstream. We would like to add the document
content to a search engine or send the document to others to read without the overhead
of the emulation stack, but cannot.
2.3 Conversion/Migration
Conversion, also called migration, takes material in an older format and recodes it into
a newer format. This can have some success for simple data; perhaps the many for-
mats of raster images can all be represented on a two-dimensional grid of color values.
But digital documents are more complex and in general semantically incompatible
with one another, and conversions from one to another almost always lose information
for the fundamental reason that some concepts in one document format cannot be
expressed in the other.
As a document format evolves every few years with new releases of the corresponding
software, the software can usually read the last couple of versions of its own format,
but documents more than a few years old may become unreadable. Thus, the
conversion process requires constant attention, constant migration. This chain from
format to format can lose information at every step, relentlessly degrading quality.
While data loss is almost guaranteed for conversions between document formats, it is
likely even within upgrades to the same software application.
Today, although emulation and conversion suffer well-known problems, they are
often seen as the only ways. The UK National Archives [3] tries to mitigate the dam-
age done by developing a database of file formats, called PRONOM, that “allows for
the automatic generation of migration pathways, by identifying every possible conversion
route between a source and target format, with information about how each conversion
stage will affect the content”. Nevertheless, even if the damage at each step is
limited, when multiplied by tens or hundreds of years of conversions, and such a time
span is after all the point of preservation in the first place, the data loss is substantial
and certain.
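As a rough illustration of how small per-step losses compound, consider a deliberately simplified model (an assumption made here for illustration, not something taken from PRONOM or the cited work): each migration independently retains a fraction $1-\varepsilon$ of a document's information. After $n$ successive migrations the retained fraction $R(n)$ is then

```latex
\[
  R(n) = (1-\varepsilon)^{n}, \qquad
  \text{e.g. } \varepsilon = 0.01,\; n = 50:\quad
  R(50) = 0.99^{50} = e^{50 \ln 0.99} \approx e^{-0.5025} \approx 0.605 .
\]
```

Under this assumed model, a loss of only 1% per step leaves roughly 60% of the original information after fifty conversions, which is why limiting the damage at each step does not by itself solve the problem.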
CAMiLEON [10] addresses the cumulative data loss problem by always converting
from the original bytestream. Documents are read into an intermediate format and
various output formats can be developed as needed. The architecture was demonstrated
on a selection of vector graphics formats. This is promising, but faces additional
issues when applied to more complex documents. Even among vector graphics for-
mats, semantic gaps required elements to be downgraded, and we can expect more of
this (even complete data loss in places) with complex document models, which may
or may not be an acceptable compromise. It does not address document behavior, such
as a JavaScript manipulation of an HTML DOM. The intermediate format seems to be
a union set of concepts from all supported formats, and as a practical matter would
likely become exceedingly large and unwieldy as the hundreds or thousands of document
formats were adopted, many with idiosyncratic concepts and almost all with innumerable
small but potentially important variations on common structures.
2.4 Universal Format
Some systems convert all sources into a single universal format, which they use for all
further operations. XML seems like an attractive candidate as it captures semantics and
structure, is extensible, and is easy to parse. Virtual Paper [2] and UpLib [6] (neither
of which claims to be a basis for digital preservation) solve the multiple format problem
by capturing image and text representations of all documents, one “projection” to
view and the other to search.
The most famous examples of universal formats are PostScript and PDF, which
boast the unique advantage that they can already capture any document that can be
printed (which is effectively all formats with static content) and increasingly more
applications are generating PDF directly and at a higher semantic level than what is
sent through a printer driver. In a single format, PDF supports high fidelity viewing
as well as text-based operations such as searching, and the PDF file format can bundle
the original document bitstream for future editing or more demanding preservation.
Adobe promotes PDF for archiving [1], pointing out that PDF is a publicly available
(but not open) standard and supports XML metadata records, among other features. A
PDF metastandard for archiving called PDF/A [5] identifies “the set of PDF compo-
nents that may be used and restrictions on the form of their use,” such as disallowing
the patented LZW compression filter and requiring that all fonts be embedded.
The approach presumes that the universal format is eternal, perhaps because it becomes
so important that society ensures this. Nevertheless, the approach has its limitations. It is simply not
practical to completely capture all aspects of all document formats in a union set
format. The format would be unwieldy, hostile to full implementation, and would
have to be updated constantly as new formats are introduced. So-called universal formats
must of practical necessity select certain features and leave others behind, and
thus their use entails a conversion step and corresponding data loss.
2.5 Universal Virtual Computer
Raymond Lorie proposes writing data interpreters “that can extract the data from the
bit stream and return it to the caller in an understandable way, so that it may be trans-
ferred to a new system” [8]. Programs are written against a Universal Virtual Com-
puter (UVC) so that in the future, all that is needed is an implementation of the UVC
on the computer of the day to run the interpreters and thus read the data.
The UVC is extremely cautious about what is certain about the future, and requires
little more than the equivalent of a simple microprocessor and memory. Considering
this approach from the practical point of view of software engineers charged with
building a system that embraces hundreds of document formats of sometimes great
complexity, this is not enough.
In practice, software engineers need an architecture outlining the large-scale organi-
zation of the software to be built and detailing the interactions among the many com-
ponents. For preservation of digital documents, this architecture should embrace such
domain-specific concepts as “document”, “metadata”, “text”, “behavior”, and “struc-
ture”. In practice, software engineers require a high level language, such as Java, and
libraries of pre-built functions (all of which can be compiled to the UVC). A level
above the UVC must interface with hardware, such as displays, keyboards, and mice.
Lorie's UVC is a solid start, and now it is time for higher-level architecture and
implementation.
3 The Multivalent Architecture’s Benefits for Preservation
We now examine the Multivalent* Document Model to see how its architectural quali-
ties support digital document preservation. The purpose here is not a presentation of
the architecture per se (for that see [15] or a concise presentation in [16]), but an
elucidation of how important aspects of the digital preservation problem are solved by
certain aspects of the architecture, sometimes uniquely so.
The architecture is powerful and versatile, as can be appreciated from the following
description of an earlier application of the architecture to a browser. The Multivalent
Browser natively displays many document formats (PDF, HTML, scanned paper,
UNIX manual pages, TeX DVI, others) and supports in situ annotation (highlights,
notes, executable copy editor markup, Notemarks) across all formats. Annotations can
attach to any point of a document (letter or image), can apply to documents that are
read-only (such as the New York Times home page), anchor with Robust Locations so
they can reattach correctly even if the source document has been extensively edited, and
exploit Robust Hyperlinks to find a document if it moves elsewhere on the Internet.
The architecture has been implemented. The system totals over 100,000 lines and
over 4 million characters of source code. The document parsers mentioned above and
the browser are freely available online [13]. In the past year, the implementation has
been deployed for preservation, first in the San Diego Supercomputer Center's Persis-
tent Archive Testbed project [18].
The architecture is proven over time. Since 1997, as the API has evolved and implementation
has advanced, the Multivalent architecture has remained stable. (This
predates Lorie's UVC, but it took Lorie to indirectly point out its suitability for
digital preservation.)
* Multivalent was born as a thesis project at UC Berkeley, and the creator has since moved
to the University of Liverpool.
The architecture has many interlocking concepts, and it can be instructive to first
briefly consider the totality. New document formats are supported by media adaptors,
which are code components that translate concrete document formats into runtime data
structures. The primary data structure is the document tree, which represents the entire
content of a document (as a scroll, or a page at a time), including everything from the
text and images, to scripts, to the semantic structure (hierarchy and attributes), to the
physical layout. Active (programmatic) elements of a specific document or a document
genre, such as hyperlinks or outline opening and collapsing, are implemented by
behaviors, which are program code with complete access to the document contents.
The particular behaviors that apply to a document or genre are listed in XML-format
hubs.
The remainder of this section fleshes out those architectural concepts that address
specific aspects of the digital document format preservation problem.
3.1 Media Adaptors
New document formats are supported by media adaptors, which are code components
that translate concrete document data formats into runtime data structures, primarily
the document tree. Currently implemented media adaptors include PDF, HTML,
scanned paper of two OCR formats, UNIX manual pages, TeX DVI, ASCII, and Ap-
ple II AppleWorks word processing, among others.
Media adaptors encapsulate format-specific parsing knowledge, and are obligated to
eliminate any need for further reference to the concrete bitstream. This entails correcting
the format wherever needed (coercing HTML to comply with a DTD), and presenting
the rest of the system with uniform word units, which may require splitting lines in
ASCII or pasting together word fragments in PDF or TeX DVI.
The core system has no media adaptors officially “built in”, although a few popular
ones happen to be bundled with the usual distribution. The core system merely associ-
ates a MIME type or file type suffix to a hunk of code. The system provides all of the
modern access to and control of documents in general. Because there is no distinction
between obsolete document formats and those in current use, obsolete document for-
mats are as vigorous as those in current use.
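The dispatch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Multivalent's actual API: the names `MediaAdaptorSketch`, `AsciiAdaptorSketch`, and `AdaptorRegistrySketch` are hypothetical, and the adaptor returns a flat list of words rather than a full document tree.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical adaptor contract: turn a concrete bitstream into uniform word units. */
interface MediaAdaptorSketch {
    List<String> parseWords(byte[] bitstream);
}

/** Illustrative ASCII adaptor: splits lines of text into words on whitespace. */
class AsciiAdaptorSketch implements MediaAdaptorSketch {
    @Override
    public List<String> parseWords(byte[] bitstream) {
        String text = new String(bitstream, StandardCharsets.US_ASCII);
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) words.add(w); // skip the empty token of blank input
        }
        return words;
    }
}

/** The core knows no formats; it only maps a MIME type or suffix to adaptor code. */
class AdaptorRegistrySketch {
    private final Map<String, MediaAdaptorSketch> byType = new HashMap<>();

    void register(String mimeTypeOrSuffix, MediaAdaptorSketch adaptor) {
        byType.put(mimeTypeOrSuffix.toLowerCase(Locale.ROOT), adaptor);
    }

    /** Returns the adaptor registered for the key, or throws if none is registered. */
    MediaAdaptorSketch lookup(String mimeTypeOrSuffix) {
        MediaAdaptorSketch adaptor = byType.get(mimeTypeOrSuffix.toLowerCase(Locale.ROOT));
        if (adaptor == null) {
            throw new IllegalArgumentException("no adaptor for " + mimeTypeOrSuffix);
        }
        return adaptor;
    }
}
```

In this sketch an obsolete format and a current one are registered in exactly the same way, which is the property the paragraph above relies on.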
Media adaptors directly read original concrete document data formats. This avoids a
problem of conversion in which bugs or approximations in one stage cumulatively
degrade quality. Bugs, while always undesirable, are more benign in media adaptors,
because once they are fixed, all subsequent viewings and other uses are automatically
corrected. In the same way, partial implementations of formats, such as ones being
painstakingly step-by-step reverse engineered, are incrementally improvable. Any
progress can be disseminated and exploited immediately, without delaying until perfec-
tion is reached, and improvements can be distributed as they are achieved.
The capability to read original document formats is bundled with the system and
therefore always available on demand. With conversion, the apparatus may have been
used once to convert all known documents in bulk, perhaps by a third party, and a user
who later encounters a new instance must revive that apparatus. With hardware preservation, the
museum of hardware and software has a geographic location and even if it is on the
network, it may not be amenable to opening its fragile, irreplaceable exhibits to ran-
dom poking from millions of Web surfers. With a universal format, the fact that the
new format may be easy to parse does us no good unless the document has been pre-
processed.
Media adaptors serve as operational definitions of document formats. When media
adaptors are the result of reverse engineering, their operational definitions also serve as
the de facto specifications. When media adaptors are based on separate specification
definitions, they remain essential as they illuminate the dark corners of real world
(ab)use that lie outside the light of the specification. For example, HTML as found on
the Web is almost never correct, and it often requires considerable correction, not to
the W3C’s HTML specification, but to the operational definition given by Microsoft
Internet Explorer. In Multivalent, these operational definitions are part of a live sys-
tem, so they are always being tuned and kept up to date.
As compared to conversion, media adaptors move the preservation problem from
constant massaging of billions of documents to maintaining one media adaptor per
format. Preserving individual documents is reduced to just copying the bits.
3.2 Document Tree
The primary data structure is the document tree, which represents the entire content of
a document (as a single scroll, or a page at a time). The hierarchical structure of a
document is directly reflected in the hierarchy of parent-child nodes in the tree, and all
nodes may contain attributes. For documents such as SGML, XML and SVG, the tree
directly reflects the parse tree of the document. HTML is similarly represented, but
after correction to a DTD. Internal nodes of the tree are structural, and leaves hold
content (text, bitmapped images). All nodes have layout bounding boxes (coordinates
and dimensions), with internal nodes containing the union rectangle of their children.
Ordinarily structure and layout coincide, but sometimes a special branch at the root of
the tree is required to accommodate divergences such as floating images and multiple
columns. Metadata is available as attributes on the root of the tree. Media adaptors can
introduce new node types when needed, unlike HTML’s Document Object Model.
Remarkably, all document formats seen so far fit comfortably into a common docu-
ment tree, from the fixed-format scanned paper and PDF at one end of the continuum,
to the flowed HTML and UNIX manual pages at the other.
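The tree structure described above can be sketched as follows. This is a minimal illustration with hypothetical class names (`BoxSketch`, `NodeSketch`), not the Multivalent classes themselves; it shows attributes on every node, content boxes on leaves, and internal nodes reporting the union rectangle of their children.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Illustrative axis-aligned box: coordinates plus dimensions. */
final class BoxSketch {
    final int x, y, width, height;

    BoxSketch(int x, int y, int width, int height) {
        this.x = x; this.y = y; this.width = width; this.height = height;
    }

    /** Smallest box that contains both this box and the other one. */
    BoxSketch union(BoxSketch o) {
        int left = Math.min(x, o.x);
        int top = Math.min(y, o.y);
        int right = Math.max(x + width, o.x + o.width);
        int bottom = Math.max(y + height, o.y + o.height);
        return new BoxSketch(left, top, right - left, bottom - top);
    }
}

/** Hypothetical tree node: internal nodes are structural, leaves hold content. */
final class NodeSketch {
    final String name;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<NodeSketch> children = new ArrayList<>();
    private final BoxSketch leafBox; // null for internal nodes

    private NodeSketch(String name, BoxSketch leafBox) {
        this.name = name;
        this.leafBox = leafBox;
    }

    static NodeSketch leaf(String name, BoxSketch box) {
        return new NodeSketch(name, Objects.requireNonNull(box));
    }

    static NodeSketch internal(String name) {
        return new NodeSketch(name, null);
    }

    NodeSketch add(NodeSketch child) {
        children.add(child);
        return this;
    }

    /** A leaf reports its own box; an internal node reports the union of its children. */
    BoxSketch bounds() {
        if (leafBox != null) return leafBox;
        BoxSketch result = null;
        for (NodeSketch child : children) {
            BoxSketch b = child.bounds();
            if (b == null) continue; // empty internal node contributes nothing
            result = (result == null) ? b : result.union(b);
        }
        return result; // null when an internal node has no laid-out content
    }
}
```

Metadata would sit in `attributes` on the root node in this sketch; the special layout branch for floating images and multiple columns is not modeled.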
The document trees of sophisticated document formats are decorated with spans.
Hypertext links and font styling are both span types. Spans provide leaf-to-leaf (and
within leaf) control over appearance: font family, size, style; foreground and back-
ground color; underlining; line width; and more. Spans also control interaction, report-
ing keys pressed and mouse activity within the span. A handful of spans are reused by
media adaptors for many different document formats.
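As a minimal sketch of the leaf-to-leaf (and within-leaf) addressing that spans provide, assume the leaves are simply an ordered list of words. The class name `SpanSketch`, its fields, and the type labels are hypothetical and chosen only for illustration; styling and event handling are omitted.

```java
import java.util.List;

/** Hypothetical span: runs from an offset in one leaf to an offset in a later leaf. */
final class SpanSketch {
    final String type;                 // e.g. "hyperlink" or "bold" (illustrative labels)
    final int startLeaf, startOffset;  // inclusive start position
    final int endLeaf, endOffset;      // exclusive end offset within endLeaf

    SpanSketch(String type, int startLeaf, int startOffset, int endLeaf, int endOffset) {
        this.type = type;
        this.startLeaf = startLeaf;
        this.startOffset = startOffset;
        this.endLeaf = endLeaf;
        this.endOffset = endOffset;
    }

    /** Collects the text covered by the span from the ordered leaf words. */
    String coveredText(List<String> leaves) {
        StringBuilder sb = new StringBuilder();
        for (int i = startLeaf; i <= endLeaf; i++) {
            String word = leaves.get(i);
            int from = (i == startLeaf) ? startOffset : 0;
            int to = (i == endLeaf) ? endOffset : word.length();
            if (sb.length() > 0) sb.append(' '); // leaves are separated by one space here
            sb.append(word, from, to);
        }
        return sb.toString();
    }
}
```

Because the span refers only to leaf positions, the same span type can be reused over a tree built from any format, which matches the reuse across media adaptors noted above.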
Note that the document tree is not an intermediate format or a universal format.
Unlike an intermediate format, the document tree is used directly for document appear-
ance and behavior, preserving full document expression. (The document tree employs
concepts common across formats where possible and can be used for conversion.)
Whereas universal formats can bloat as the union set of incorporated formats, the
document tree is tailored to an individual format at a time, free of overhead.
In support of preservation, the document tree opens access to all document content.
Conversion, far from opening everything, typically eliminates unusual content types.
Emulation hides content in an impenetrable box. In fact, in one way the document tree
is superior to the original software editors/viewers for a format, because that software
probably did not give access to other applications, at least not to such an extreme
comprehensiveness of text, images, structure, styling, and layout.
The document tree unifies the representation of all document formats and lifts them
to a common set of modern document abstractions. For example, operating system-
and application-specific character encodings, including the various ways of dealing
with large international character sets, are all normalized to Unicode. Tools and serv-
ices target the abstractions and automatically work across all formats. A document
analysis application has access to content, style, and layout, regardless of whether the source was
scanned paper or TeX DVI. A search engine that adopted the TeX DVI media adaptor
could also collect the hyperlinks in DVI documents to add to its crawl.
3.3 Behaviors
Active (programmatic) elements of a specific document or a document genre are implemented
by behaviors. A span on the document tree above is an example of a type
of behavior, a media adaptor is a behavior, and outline opening and collapsing is implemented
with a combination of behaviors.
Behaviors are arbitrary program code with complete access to the document con-
tents, the network, and the disk (subject to security restrictions, but no architectural
limitations). Behaviors can be arbitrarily large. Behaviors can arbitrarily edit the
document tree. The sole restriction on a behavior is that it adhere to a certain interface
for communication with the system and other behaviors.
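The single obligation placed on a behavior, adherence to a fixed interface, can be sketched as below. All names (`DocSketch`, `BehaviorSketch`, `CollapseOutlineSketch`, `BehaviorRunnerSketch`) are hypothetical, and the real protocol between behaviors and the system is richer than the one method assumed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for the document tree: just a mutable bag of attributes in this sketch. */
class DocSketch {
    final Map<String, String> attributes = new LinkedHashMap<>();
}

/** The one obligation of a behavior in this sketch: a fixed entry point. */
interface BehaviorSketch {
    void apply(DocSketch doc);
}

/** Illustrative behavior that edits the document it is attached to. */
class CollapseOutlineSketch implements BehaviorSketch {
    @Override
    public void apply(DocSketch doc) {
        doc.attributes.put("outline", "collapsed");
    }
}

/** Runs each attached behavior, in attachment order, against the same document. */
class BehaviorRunnerSketch {
    private final List<BehaviorSketch> behaviors = new ArrayList<>();

    void attach(BehaviorSketch behavior) {
        behaviors.add(behavior);
    }

    void run(DocSketch doc) {
        for (BehaviorSketch behavior : behaviors) {
            behavior.apply(doc);
        }
    }
}
```

What a behavior does inside `apply` is unconstrained in this sketch, mirroring the point that the architecture restricts the interface rather than the code behind it.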
For preservation, behaviors fully embrace the active and idiosyncratic aspects of a
document format. They are not restricted to, say, what JavaScript can access through an
inherently limited scripting layer. For example, PDF defines a set of annotation types,
such as ink and stamp, that none of the other implemented document formats do, and
with a set of properties that are unlike any other document format. Each annotation
type (some but not all of which are presently implemented) becomes a behavior type,
and each annotation instance a behavior instance. If a literary hypertext needs a new
hyperlink type with special features, it could introduce it as a behavior.
3.4 Hubs
The particular behaviors that apply to a document or genre are listed in XML-format
hubs. Hubs use XML attributes to customize general behaviors (passing a URL to a
hyperlink, for example) in the same way programming languages employ parameters
for functions. Hubs use XML hierarchy to nest more complex data associated with a
behavior; for example, when the user authors a note-type annotation, the note behavior
is saved under the top level, and the content of the note, its fonts and colors, and
even annotations on that annotation, are nested hierarchically within the note.
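To make the role of attributes and nesting concrete, the sketch below reads a small hub with the standard Java XML parser. The element and attribute names (`hub`, `behavior`, `name`, `url`, `content`) are invented for illustration and are not the actual Multivalent hub schema; `HubEntrySketch` and `HubReaderSketch` are likewise hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** One behavior listed in a hub: its name, one parameter, and any nested text. */
final class HubEntrySketch {
    final String behavior;
    final String url;
    final String nestedText;

    HubEntrySketch(String behavior, String url, String nestedText) {
        this.behavior = behavior;
        this.url = url;
        this.nestedText = nestedText;
    }
}

/**
 * Reads a hypothetical hub such as:
 *   <hub>
 *     <behavior name="Hyperlink" url="http://example.org/"/>
 *     <behavior name="Note"><content>check this</content></behavior>
 *   </hub>
 */
class HubReaderSketch {
    /** Returns the listed behaviors in document order. */
    static List<HubEntrySketch> read(String hubXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(hubXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = dom.getElementsByTagName("behavior");
            List<HubEntrySketch> entries = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                entries.add(new HubEntrySketch(
                        e.getAttribute("name"),        // which behavior to instantiate
                        e.getAttribute("url"),         // parameter; empty string when absent
                        e.getTextContent().trim()));   // nested data, e.g. note content
            }
            return entries;
        } catch (Exception ex) {
            throw new IllegalArgumentException("malformed hub", ex);
        }
    }
}
```

Attributes play the part of function parameters here, and nested elements carry the more complex data, as the paragraph above describes.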
Some behaviors apply only to one document (e.g., annotations), some to genres
(e.g., the manual page outliner control), and others to multiple formats (e.g., a pop-up
menu that can send the word under the cursor to a definition service).
Behaviors developed for one format (or for no particular format at all) can be associated
with other formats via hubs and thus bring new ideas to old formats. This is not
strictly required by preservation (fully expressing the format is sufficient) but is neces-
sary to bring older formats out of the ghetto and into parity with newer ones, and after
all, someday today's formats will be considered ancient too. Users want to annotate all
of their documents, whether PDF which has annotation types, or ASCII or WordStar
(or HTML!), which do not, and hubs associate function out-of-band and therefore are
free of limitations of expressiveness in these formats.
4 Practicalities
4.1 Realization
If the Multivalent Document Model defines an architecture well suited to the needs of
digital document preservation, it is immediately apparent that it will require a
large-scale implementation effort to fully embrace the hundreds or thousands of file formats. The
large number and the fact that they are generally semantically incompatible with one
another inherently force individual attention and thus demand considerable effort. But
the implementation effort is no greater than that of the UVC or CAMiLEON.
Fortunately, the work is highly parallelizable to independent teams implementing
media adaptors for different formats. The Multivalent architecture defines the necessary
technical points of coordination, but otherwise imposes no bureaucratic overhead, and
individual teams can choose formats of local importance or interest. Since all the
media adaptors are part of the same architecture, common components can be shared,
as is presently done for paragraph formatting of multiple fonts and for hyperlinks.
Our task is considerably lighter than the sum total of all the original document
software. Preservation emphasizes appearance and behavior, not the considerable editing
component of a system. Devising a document format requires considerable intellectual
effort, whereas we merely read the result from a specification. We use modern technology
and tools, whereas some original systems were written in assembly language to run in
48K bytes of memory.
4.2 Preservation of Preservation
Any preservation strategy will take maintenance to adapt it to future technologies, and
our system is preserved in the same way as Lorie's UVC. In the UVC, implementa-
tions target a simple core and the maintenance problem is reduced to porting the core.
Multivalent is implemented entirely in Java, and our UVC is the Java virtual machine
(JVM). The JVM is directly analogous to the UVC as both are virtual machines at
more or less the level of assembly language. Java's VM is somewhat more compli-
cated, but the primary consideration is not absolute simplicity but rather a complete,
rigorous definition (Java’s is given in [7]) coupled with a “reasonable” level of im-
plementation achievability. Perhaps it would not be politically auspicious for IBM to
point to a Sun technology as the bedrock of its preservation strategy, but the fact that
numerous companies have implemented compatible JVMs proves its viability.
It would be absurd to build a large system in the assembly language of virtual ma-
chines. Moving from satisfaction of the key self-preservation requirement to the soft-
ware engineering considerations of building a large system, we must choose a high-
level language. (The use of Java's VM does not imply the use of the Java language
itself, as many programming languages can compile to the VM, just as many pro-
grams can compile to the different microprocessors. Different groups could choose
different languages and effectively cooperate.)
5 Future
The Multivalent architecture is well suited for preservation, but was originally de-
signed for use in a browser. It could benefit from the insights of experts in preserva-
tion to ensure that the overall approach fully embraces all essential details before a
large-scale implementation effort is launched.
An important subproject will be the collection of document format specifications.
These are important for their intrinsic status as the intentional definitions, and consid-
ering how time consuming reverse engineering is, it is important for software engi-
neers to have easy access to these specifications. Wheatly [19] catalogs numerous
books, web sites (for example, [20]) and projects that collect many types of file for-
mats. Companion subprojects should collect implementations, which are the opera-
tional definitions of formats, and sample documents for developers.
Undoubtedly the present technology will need to be generalized and refined, and one
area in particular is already evident. Documents are often found in wrappers of
various kinds, sometimes for compression (such as .zip files) and sometimes in virtual
filesystems (such as the Structured Storage used by Microsoft Office applications).
A layer underneath document parsers would need to parse these structures and
provide access to the documents inside.
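For the .zip case, such an unwrapping layer might look like the following sketch, which uses the standard `java.util.zip` classes. The class `ZipUnwrapSketch` and its methods are hypothetical and not part of the described system; Structured Storage and other wrappers would need their own readers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipUnwrapSketch {
    /** Maps each file entry name in the archive to its raw, uncompressed bytes. */
    static Map<String, byte[]> unwrap(byte[] zipBytes) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) continue; // only files carry document bitstreams
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                files.put(entry.getName(), out.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    /** Helper for trying the sketch: builds a one-entry archive in memory. */
    static byte[] zipOf(String name, byte[] content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(name));
            zout.write(content);
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Each unwrapped bitstream could then be handed to whichever document parser is associated with the entry's file suffix.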
This paper has concentrated on digital documents, but there are many other media
types (some of which are embedded in documents) in need of preservation, such as
scientific data, audio, music scores, video, multimedia such as Macromedia Flash, and
DVD menu programs, to name a few. It is unclear whether these all can be accommodated
under a common architecture. But Multivalent has already demonstrated its applicability
to a variety of documents with text, images, vector graphics, and programmatic
manipulation of a document tree. Even if it is limited to this class of a couple
hundred formats, billions of document instances make a claim for some social value.
6 Conclusion
Compared to existing approaches to digital document preservation, the Multivalent
Document Model offers a step forward. Compared to conversion, the original document
remains perfectly preserved. Compared to emulation, the content of the document
is easily available. Compared to Lorie’s UVM, Multivalent defines the high-level
architecture necessary for software engineers, and Multivalent's implementation
of a number of complex and obsolete document formats proves the architecture's power
and no-compromises suitability for preservation. Multivalent is a proven plan in the
present for the future of preserving the past.
Table 1. Comparison of selected systems used for digital preservation

| | PDF | UVM | CAMiLEON | Multivalent |
|---|---|---|---|---|
| Defined | 1990 | 2001 | 2001 (?) | 1997 / applied to preservation in 2004 |
| Demonstrations | everything that can be printed | JPEG and GIF bitmapped images (claimed PDF in fact based on a conversion to HTML) | interconversion among SVG, Draw, WMF vector graphics | PDF, HTML, scanned paper, TeX DVI, UNIX manual pages, Apple II AppleWorks |
| Method | printer driver captures print stream, or app directly generates. Format eternally supported. | read original bitstream by document interpreter | read original bitstream into intermediate representation, convert to another file format | read original bitstream and build runtime data structures |
| Strengths | captures static aspects of all formats, well developed, well defined | potential to fully express document appearance and behavior | as compared to other conversion, only one level of quality degradation | fully expresses document appearance and behavior |
| Use by Applications | use Acrobat or third-party library | undefined | App du jour picks up output file format | Live runtime linking (also amenable to conversions) |
| Document Architecture | emphasis on graphical appearance; structure expressible but not common | undefined | intermediate representation (either unwieldy union set of all formats, or leave out idiosyncratic) | fully developed (media adaptors, document tree with structure and layout, behaviors, spans, hubs, ...; fixed and flowed layouts) |
| Software Engineering | File format well documented, Acrobat API, many third-party libraries | low-level assembly language UVM (in practice use Java) | (unknown) | well-exercised system API, high-level language (Java) |
| Maintenance | Upgrade money to Adobe | port UVM to new machines | develop new output formats, "software longevity principles" | port Java VM to new machines |
| Drawbacks | everything must look like PDF (fixed layout, paginated); lose idiosyncrasies and behavior | Implementation immature. No document architecture. UVM too low level for development | Conversion's semantic gap between formats downgrades or loses data. Loses behavior. | Intimate linking by apps, or develop own apps. (No compromises for document quality.) |
References
1. Adobe Systems, Inc. PDF as a Standard for Archiving, Adobe white paper.
http://www.adobe.com/products/acrobat/pdfs/pdfarchiving.pdf
2. Birrell, A. and McJones, P. Virtual Paper Web site, (1995–1997).
http://www.research.compaq.com/SRC/virtualpaper/
3. Brown, A. Preserving the Digital Heritage: Building a Digital Archive for UK Govern-
ment Records, In Proceedings of Online Information. (2003)
http://www.nationalarchives.gov.uk/preservation/digitalarchive/
4. IBM. Digital Asset Preservation Tool Web site,
http://www.alphaworks.ibm.com/tech/uvc?Open&ca=daw-flHts-120204
5. International Standards Organization. ISO/CD 19005-1, Document manage-
ment—Electronic document file format for long-term preservation—Part 1: Use of PDF
(PDF/A), November 31, (2003). http://www.aiim.org/documents/standards/ISO_19005-
1_(E).doc
6. Janssen, W.C. and Popat, K. UpLib: a universal personal digital library system, In Pro-
ceedings of the ACM symposium on Document Engineering, (2003) 234–242
7. Lindholm, T. and Yellin, F. The Java Virtual Machine Specification, 2nd edition,
Addison-Wesley Longman Publishing Co., Inc. (1999)
8. Lorie, R. Long term preservation of digital information, In Proceedings of the First
ACM/IEEE-CS Joint Conference on Digital Libraries (2001) 346–352
9. Marshall, C.C. and Golovchinsky, G. Saving Private Hypertext: Requirements and
pragmatic dimensions for preservation. In Proceedings of ACM Hypertext (2004)
130–138
10. Mellor, P., Wheatley, P., Sergeant, D. "Migration on Request: A Practical Technique for
Digital Preservation" ECDL (2002)
11. Meehan, J., Taft, E., Chernicoff, S., Rose, C., Karr, R. PDF Reference, fifth edition,
(2004)
12. Messerschmitt, D.G. Opportunities for Research Libraries in the NSF Cyberinfrastruc-
ture Program, ARL Bimonthly Report 229 (2003).
http://www.arl.org/newsltr/229/cyber.html
13. Multivalent Web site. http://multivalent.sourceforge.net
14. The National Archives. PRONOM Web site.
http://www.nationalarchives.gov.uk/pronom/
15. Phelps, T.A. Multivalent Documents: Anytime, Anywhere, Any Type, Every Way
User-Improvable Digital Documents and Systems, Ph.D. Dissertation, University of
California, Berkeley (1998)
16. Phelps, T.A. and Wilensky, R. The Multivalent Browser: a platform for new ideas, In
Proceedings of Document Engineering, (2001) 58–67
17. Rodig, P., Borghoff, U.M, Scheffczyk, J., and Schmitz, L. Preservation of digital
publications: An OAIS extension and implementation, In Proceedings of the ACM
Symposium on Document Engineering, (2003) 131–139.
18. San Diego Supercomputer Center. Persistent Archive Testbed (PAT).
http://www.sdsc.edu/PAT/
19. Wheatley, P. Survey and Assessment of Sources of Information on File Formats and
Software Documentation (2003)
http://www.jisc.ac.uk/uploaded_documents/FileFormatsreport.pdf
20. Wotsit's Format Web site. http://www.wotsit.org/