Metadata Standard for
Future Digital Preservation
Honours Thesis by:
Timothy Robert Hart
Supervisor: Denise de Vries
Date: June 2015
"Submitted to the School of Computer Science, Engineering, and Mathematics in the
Faculty of Science and Engineering in partial fulfilment of the requirements for the
degree of Bachelor of Information Technology - Honours at Flinders University -
Adelaide Australia"
"I certify that this work does not incorporate without acknowledgment any
material previously submitted for a degree or diploma in any university; and that to
the best of my knowledge and belief it does not contain any material previously
published or written by another person except where due reference is made in the
text."
Signed: ___________    Date: 18/06/2015
TABLE OF CONTENTS
TABLE OF CONTENTS .............................................................................................. i
LIST OF FIGURES ...................................................................................................... ii
LIST OF TABLES ....................................................................................................... ii
Dedication ................................................................................................................... iii
Acknowledgement ....................................................................................................... iv
Abstract ........................................................................................................................ v
1.0 Introduction ............................................................................................................ 1
1.1 Scope .................................................................................................................. 3
1.2 Structure ............................................................................................................. 3
2.0 Literature review ................................................................................................... 4
2.1 Digital Preservation ............................................................................................ 4
2.2 Preservation Techniques ..................................................................................... 6
2.2.1 Migration ..................................................................................................... 6
2.2.2 Emulation .................................................................................................... 9
2.3 Metadata ........................................................................................................... 12
2.4 Tools ................................................................................................................. 17
2.5 Training ............................................................................................................ 18
3.0 Motivation ............................................................................................................ 20
4.0 Development ........................................................................................................ 21
5.0 Case study ............................................................................................................. 32
5.1 Image > Copy > Upload (Social Media) .......................................................... 33
5.2 Digital Image Modification Test (Photoshop) .................................................. 34
5.3 PDF Default Metadata Test .............................................................................. 35
5.4 Word to PDF - Word vs. Adobe conversion. ................................................... 36
6.0 Discussion ............................................................................................................ 38
7.0 Conclusion and Recommendations ...................................................................... 42
8.0 Future work .......................................................................................................... 45
9.0 References ............................................................................................................ 47
Appendix A ................................................................................................................ 52
Appendix B ................................................................................................................. 55
Appendix C ................................................................................................................. 57
LIST OF FIGURES
Figure 1 - Changed pixels are displayed in grey, darker parts indicate loss of
transparent layer after migration. ................................................................................. 8
Figure 2 - PREMIS numbered schema ....................................................................... 22
Figure 3 - PREMIS data dictionary element breakdown ........................................... 23
Figure 4 - Dublin Core element breakdown ............................................................... 24
Figure 5 - Dublin Core XML representation (Book) ................................................. 25
Figure 6 - Dublin Core Descriptive Language (Book) ............................................... 25
Figure 7 - PREMIS XML ........................................................................................... 25
Figure 8 - Randomly selected digital image (JPEG) .................................................. 33
Figure 9 - Digital Image - Original > Photoshop ....................................................... 34
LIST OF TABLES
Table 1 - Crosswalk mapping example ...................................................................... 31
Table 2 - Test one - Original photo copied and modified .......................................... 33
Table 3 - Test two - Photoshop modification ............................................................. 34
Table 4 - PDF conversion data ................................................................................... 37
Dedication
I dedicate this thesis to my family and friends, for without their love and support, this
journey would not have been the same. There is no greater reward than the pride my
hard work has given them. Furthermore, I dedicate this thesis as proof to those who
doubt that a gamer like myself can achieve great things; we are passionate about what
we love, and it is when we share that passion with our work that anything becomes
possible.
Acknowledgement
My first acknowledgement is for my supervisor, Denise de Vries, for her support and
feedback throughout my honours. The ongoing feedback and meetings with Denise
kept me on track and filled me with confidence after each milestone was successfully
reached. I am grateful for her wisdom and kindness. I could not have asked for a
better supervisor.
I once again acknowledge my family for their love and support, but especially my
parents, Afroditi and Dennis, and my Papou, Tim. They never stopped supporting me
throughout my studies, always making sure I had everything I needed. Above all it
was their pride that pushed me to achieve what I have achieved to date. I would also
like to thank my Yiayia, Georgia, who encouraged my education from a young age. I
cannot go without mentioning my uncles, Tony and Andy, who always led by
example of how not to do things in life, always making sure I took the right pathway
and always looking out for me.
I thank my partner, Phoebe, for the love and support throughout my study, always
being there to give advice and keeping my spirit up to help me through the stressful
times. I cherished our long drives together as an escape and appreciate all the great
meals that were prepared at the end of the day, allowing me to push on through the
study with anticipation for that delicious reward.
I acknowledge the lecturers at Flinders University, namely Neville Williams, Carl
Mooney, Kate Deller-Evans, Romana Challans, and of course Denise de Vries. Each
lecturer played a role in influencing my decision to take the academic pathway.
Special thanks to Romana Challans for being my mentor throughout my postgraduate
studies and for always believing in me. You are and have been my teacher, boss,
colleague, fellow student, and above all, my dear friend.
My last acknowledgement is to my friends, especially the friends I made in my
undergraduate degree who were with me all the way and still support me. A very
special thanks to my friend Jonah, who has been a brother to me, for if it were not for
our Skype conversations, my journey would not have been as entertaining.
Abstract
Digital preservation is commonly accomplished by one of two techniques. The first,
migration, converts a digital object from one format to another and is the simpler of
the two. The second, emulation, recreates the computing environment so that a digital
object can be accessed in its original state, preserving its functionality and
interactivity.
Emulation removes the risk of data loss; however, with this comes great complexity.
Replicating a computing environment to meet all the requirements and dependencies
of certain file types requires detailed knowledge about the file itself, ranging from
the software that created it to the operating systems it is compatible with, as well as
any other dependent variables. This information can be extracted from the file's
metadata, assuming it is present, accurate, and complete.
Testing has revealed that original files of specific types often contain rich metadata;
however, more often than not, one is not dealing with an original. Some file types
have proven not to contain adequate metadata by default, a result of the creator not
establishing it at the time of creation. Testing has also revealed the delicacy of
metadata and how minor changes can impact it, especially once a file has been
uploaded to social media.
In this thesis, I analyse current metadata standards and propose possible solutions for
a global metadata standard that will enable memory institutions to better preserve
information about provenance, modification history, and descriptive information.
This will not only preserve files and their interactivity, but also add functionality to
the files within their repositories, allowing efficient searchability and manageability.
1.0 Introduction
Digital objects, unlike the old dusty books we find at our local library, cannot
withstand the sands of time. There are more threats to our digital objects than that of
a physical object. Digital objects face degradation, obsolescence, and possibly one of
the biggest threats, human error. Long gone are the days where we simply save our
files to a floppy disc, leaving them in air tight storage and forgetting about them. We
cannot simply store and forget anymore, not with the current nature of technology,
rapidly advancing and sometimes leaving legacy formats and software for dead.
Although we can still access these files using legacy devices, mapping old floppy
discs to digital images, it is best to avoid this effort and plan ahead. Digital
preservation has one goal, to preserve these objects, well that is at least how it
started. Through time and as research has shown, digital preservation needs to ensure
more than just preservation, e.g. access to a file, it needs to preserve validity,
authenticity, interaction, and various other functionality.
Digital preservation is often carried out by one of two techniques. Migration, the
easier of the two methods, simply migrates a file from one format to another. The
other technique is emulation, which recreates a computer system's environment to
allow a digital object to be accessed in its original state. Emulation has been around
since the 1960s but was only seen as a solution for digital preservation in the 1990s
(Kirschenbaum et al., 2013).
For many years migration has worked well and was trusted. However, as technology
advanced, so too did our software, and our digital files with it. Digital files became
dynamic, often made up of multiple files and dependencies, which led to the loss of
data in migrated files. Some data loss can be acceptable, depending on the file itself;
but for files that contain formulae and calculations, if the data loss alters any
parameters even by 1%, the damage could be great.
With emulation we remove the risk of data loss; however, along with this benefit
comes great complexity. Replicating a computing system environment requires
detailed knowledge about the digital files, ranging from which software and
operating systems they are compatible with to any other dependent software or
hardware they may require. This information can be found in the file's metadata,
assuming it is present, accurate, and complete.
Metadata is the key to digital preservation, not only for preservation purposes, but
also to make the preservation mean something. Preserving access to a file should not
be the only objective; by preserving the history of the file, we can see what has been
done to it and what changes it has gone through, hopefully detecting whether we are
dealing with an original file or not. We need to address why else we preserve these
files: what purpose does storing old historic data serve? Research, of course. Without
the appropriate metadata, what information can we extract from these preserved files
that has any relevant meaning?
The what and how of metadata is established in metadata standards. There exist
many, each generally focusing on one type or a subset of metadata types. No standard
covers all metadata, as this is not seen as a necessity. Each standard has its own rules,
unique elements, and its own way of presenting them. Some standards may share the
same elements, either using the same name or an alternative. If complete preservation
of an object is to take place, one must combine multiple standards for a complete set
of metadata. As these standards are extensive and often change, supporting
modification, conflicting elements and inconsistencies can arise. Although tools and
software exist for capturing and embedding metadata, they can only extract what is
there and embed data that is available, such as system or software data and any
context data that the creator of the object has established. Basically, if it is not there,
the tool is useless. This is why humans still perform the majority of the tasks
associated with digital preservation and why automation is difficult. The fragmented
world of metadata standards does not make curation or creation easy for humans,
which in turn makes it difficult for the existing tools.
This thesis suggests a new standard be created: a global standard that addresses each
type of metadata, bringing together existing elements along with new elements that
complement one another. A global standard will make things easier for users, namely
librarians and archivists, by providing a complete set of metadata elements presented
in a way that is easy to understand and interpret. The future possibilities of such a
standard exceed this; it could allow specific software and tools to be developed based
on the new standard, which increases the likelihood of automation.
It is important to realise the human element of preservation is crucial and will be so
for a long time. Therefore, this thesis suggests and stresses the importance of training
staff responsible for digitisation, creation, and curation of digital objects.
1.1 Scope
The focal point of this thesis is metadata and metadata standards, specifically the
elements within the standards. Emulation software already exists and is in use today;
therefore, further research into emulation and how it works is not in the scope of this
thesis. As there are many standards, only the main standards have been addressed.
Given time constraints, no changes have been made to existing elements other than to
structure and presentation. Tools for extracting and embedding metadata are touched
on, but other than for the use case tests, they are not addressed any further.
1.2 Structure
The thesis first introduces the digital preservation topics and narrows the scope of the
research. Following this is the literature review, which provides insight into each of
these elements, discussing them in greater detail and revealing the issues that burden
each topic. The review begins with the basics of digital preservation and then
discusses the two foremost digital preservation techniques. This is followed by
extensive research into metadata, ending the literature review with a briefing on
relevant tools and information regarding training. The following section discusses the
motivation behind this thesis and the problems it aims to solve. The development
section discusses the new standard with examples of how current standards present
their elements. This section includes an example of the new standard, presenting a
list of descriptive metadata elements from one of the existing standards and showing
new modifications and additions. The next section contains four use case tests that
show the delicacy of metadata and what happens to it through various changes. This
is followed by a discussion that examines the tests in further detail along with other
important points. The thesis closes with conclusions and recommendations, finishing
with information regarding future work in this area.
2.0 Literature review
2.1 Digital Preservation
Digital preservation has long been a topic concerning librarians, archivists, and many
institutes looking to the future. There have been a number of breakthroughs in the
area throughout the years, many of which are still in use today. It is important to
understand the nature of digital files as well as the environments to which they are
bound. It is equally important to understand the role humans play, the risks they pose,
and the necessity of their efforts.
Digital materials present archivists, librarians, and curators alike with many
challenges involving issues such as: identifying and capturing digital cultural
heritage, ethical concerns, data integrity, accessibility, recovery, and the cost of
preservation (Kirschenbaum, 2010). This includes provenance, something that is
much harder to establish in the digital world, yet is crucial for authenticity and
integrity. Examples include distinguishing original files from copies, or ensuring the
legitimacy of user access and content, as in the system proposed by
Rabinovici-Cohen et al., (2013), a preservation system utilising cloud computing
technology. As Routhier Perry, (2014) states, "Preserving the original material is
important, but the information contained within is often more important to users."
Proving the authenticity of this information is equally important (Becker et al.,
2009a). Authenticity is critical and is the primary focus for the InterPARES
(International Research on Permanent Authentic Records in Electronic Systems), a
collaborative research group focused on long-term preservation of authentic digital
records. The institutes that make up this collaboration are drawn from Australia,
Asia, Europe, and North America, each striving to develop methodologies and
theories to provide the best means to ensure the accuracy and authenticity of digital
records (Pal, 2014).
One thing is certain: digital preservation is something we all need to get on board
with. Nayak and Singh, (2014) describe digital preservation as having two distinct
meanings: one being the digitisation of analogue materials as well as the preservation
of digitally created materials; the other being the process that ensures digital
materials are preserved, protecting them from physical deterioration and technical
obsolescence. Reich, (2012) and Pellegrino, (2014) state that digital objects, unlike
books and other paper-based materials, cannot be neglected and require an
environment to render in. Digital files, software, and computer games each possess
dependencies on these environments, both the hardware and software. Separating one
from the other can be difficult if not impossible and can make the content unusable
(PREMIS, 2012) and (McDonough, 2013). However, this is something that is
unavoidable at times given the nature of technological environments, continuously
subjected to change, improvements, and advances in technology (Pellegrino, 2014).
Due to this, digital files are under continuous threat of obsolescence, whether it be
the file itself or the computing environment it functions on (Dappert, 2013).
Chakrabarty, (2014) regards technical obsolescence as the greatest threat to the
accessibility of digital content. Format obsolescence may also occur in a number of
ways: one, as specified above, where technology advances and the environment no
longer supports a specific format; the other when the dependent software is upgraded
and no longer supports backwards compatibility, making obsolete file formats no
longer recognisable and requiring the intervention of preservation methods
(Wheatley, 2004), (Kirschenbaum, 2010), (Phillips et al., 2013) and (Routhier Perry,
2014). Reich, (2012) identified this as not yet being a problem for digital content due
to file normalisation. Although this is a working solution, techniques such as this are
prone to risk and failure, jeopardising the validity and integrity of the file if not done
correctly, which is generally unavoidable.
Risk and risk management are elementary aspects of digital preservation (Strodl et
al., 2011). This includes the risk of data loss posed by a number of technological
variables and identifying whether the file should be preserved (Pellegrino, 2014). If
preservation is required, how to preserve, what to preserve, and the risk involved in
applying the chosen method must be addressed, a major concern for professionals
and one that requires specialised training (Routhier Perry, 2014). This
is further stressed by Chakrabarty, (2014), emphasising the effort needed to be
invested in the beginning of the preservation process, rather than in the continual
maintenance and conversion. However, Reich, (2012) argues that it must be an active
process as the bits and bytes of digital objects are fragile and require continuous
auditing and repair. Reich also states that digital content can be easily altered which
can make it difficult to authenticate the provenance of the file. Applying preservation
methods risks the integrity of the file and its contents, leading to an expensive
recovery process, however, doing nothing can result in greater costs. Becker et al.,
(2009) included the example of the BBC Domesday rescue in their study which
supports the need to address these issues earlier rather than later. Obsolescence led to
an almost irrecoverable data loss and required large amounts of money and effort to
access the data once more as well as ensuring preservation for the future. Another
example provided by Kirschenbaum, (2010), involving legacy software and operating
systems, describes the challenges in identifying the application used to create the file
in question and then developing a strategy that does not risk the file's integrity. It can
be challenging, as there are many considerations to make and steps to undergo, many
of which are often done manually. This increases the need for automation and risk
management to be implemented in preservation systems to mitigate the risk of data
loss (Strodl et al., 2011). There is further risk to digital objects involved with human
interaction. Reich, (2012) and Routhier Perry, (2014) state that evidence suggests
human factors, both intentional and unintentional, are the greatest cause of corruption
and data loss. This has been said to happen in centralised repositories and in multiple,
perhaps isolated, repositories administered by the same authority. Further evidence
shows humans currently perform the most necessary acts of digital preservation.
With human acts comes human error, and due to there being both machine and
human dependencies, risk is unavoidable. This leads further into the risk associated
with the preservation methods of interest as well as the importance of training.
2.2 Preservation Techniques
2.2.1 Migration
The first of two techniques being discussed is migration, currently regarded as the
most frequently deployed and trusted method (Pellegrino, 2014) and (von
Suchodoletz et al., 2013). As migration is a regularly used technique and well
known, details about the risks involved are addressed rather than the technique itself.
Research has shown that there is an agreed-upon understanding that migration is not
without risk. In fact, confidence in migration is lacking for many formats, as
described by Rimkus et al., (2014) in a study involving the members of the ARL, a
non-profit organisation of North American academic research libraries. The study
involved identifying confidence in a wide range of file formats and their ability to
be preserved. The results show very few file formats with a relatively high
confidence level; in fact, there are a number of formats with -100% relative
confidence given the metrics used in the study.
As modern digital objects may include dynamic content, multiple files, workflow,
and tool-chains, the pure data-centric strategy misses important contextual
information required for authentic re-enactment and cannot address non-linearity or
interaction properties (von Suchodoletz et al., 2013) and (Rechert et al., 2014). There
is also risk that some parts of the content may not be correctly converted, threatening
the object's integrity (Rechert et al., 2014) and (Becker et al., 2009a). Furthermore,
Dappert et al., (2013) explains that "the rendering or execution stack for any digital
object consists of inter-related components of files, software, virtual machines and
hardware. The boundaries between them are not inherently significant." The example
given involves a file containing a macro that is applied to its content; however, the
macro's functionality may instead be contained in the execution software. Because
artificial discontinuities may be drawn differently when migrated, parts of the
stack may be neglected. This hinders the reproduction of research results, complex
environments, and affects digital entities such as: executables, audio/video, software,
and computer games. In fact, Chakrabarty, (2014) perceives the risk to be extremely
high, especially in the case of mass format migration and ongoing frequent
preservation treatments. Routhier Perry, (2014) and Waugh et al., (2000) support
this, stating successive migration causes degradation, increasing the likelihood of
error. As new technologies emerge, migration is required to be performed repeatedly
which can eventuate into something that no longer accurately resembles the original
or could be lost completely. In terms of mass migration, when migrating a large
collection of objects, the process may in fact take a considerable amount of time.
This could lead to the unlikely, but not impossible scenario of the newly migrated
objects being obsolete at the end of the process (Pellegrino, 2014).
Becker et al., (2009b) demonstrated a simple conversion of an image with a
transparent background from GIF to JPG, resulting in a 59.27% change in the pixels
with an RMSE (root-mean-square error) of 24034 as displayed in Figure 1.
Figure 1 - Changed pixels are displayed in grey, darker parts indicate loss of transparent layer after migration.
However, this does not fall completely on the technique itself, but rather on how it
was delivered, e.g. the tool that was used. Nonetheless, it shows that migration can
alter a file beyond recognition or render it useless. Note that in this example it was
identified what had been changed; however, that is not always the case. Lawrence et
al., (2000) give an example of the subtle errors that may not easily be identified in
data such as a floating point number. If the original format supports 16 digits and the
target format supports up to 8, the output is altered ever so slightly. This may not be
an issue for certain types of data, but in vector calculations, such as in geographic
information systems, small errors can be quite significant. A spreadsheet converted
to ASCII will save the values within; however, if those values were derived from
embedded formulae, the data will be lost. Some loss may be acceptable, but certain
loss can jeopardise a file's authenticity and accuracy. Migrating Microsoft Office
documents will rarely preserve correct layouts, footnotes, hyperlinks, and page
breaks. If the text is the only important content, this is acceptable loss; however, if
these elements are required for functionality or authenticity, it is unacceptable loss
(Becker et al., 2009a).
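To make the floating point example concrete, the following toy sketch (my own illustration, not drawn from Lawrence et al.) truncates coordinate values from roughly 16 to 8 significant digits, as a crude stand-in for a lossy format migration, and shows how a small vector calculation silently degrades:

# Toy illustration: truncating numeric precision during a format migration can
# introduce subtle errors. Assume a source format keeps ~16 significant digits
# and a target format keeps only 8. Values are invented for the example.
import math

def truncate_sig(value: float, digits: int) -> float:
    """Round a value to a fixed number of significant digits."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    factor = 10 ** (digits - 1 - exponent)
    return round(value * factor) / factor

# Two nearby GPS-style coordinates (degrees), a stand-in for GIS vector data.
p1 = (138.6007254891234, -34.9284989123456)
p2 = (138.6007254991234, -34.9284989223456)

def distance(a, b):
    """Planar distance between two coordinate pairs (sufficient for the demo)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

original = distance(p1, p2)
migrated = distance(tuple(truncate_sig(v, 8) for v in p1),
                    tuple(truncate_sig(v, 8) for v in p2))

print(f"original distance: {original:.12e}")
print(f"after 8-digit migration: {migrated:.12e}")
# With only 8 significant digits the two points collapse onto the same value,
# so the computed distance drops to zero: a silent but significant loss.

Here the two points become indistinguishable after truncation, so the computed separation collapses to zero; the loss is real but leaves no visible trace in the migrated values.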
Knowledge of both the original format and the target format is required in order to
prepare a conversion program/tool as one may not support aspects of the other.
Specifications about the format are not always publicly available and evidence
suggests that most formats are not fully inter-changeable. Subsequently, it may be
impossible to determine what has been altered or lost (Lawrence et al., 2000) and
(Waugh et al., 2000).
The LOCKSS program (Reich, 2012) has a significant advantage that addresses
some of the issues migration has. It "migrates on access", creating a temporary copy
of the digital object and invoking appropriate format "migrators". This preserves the
original file, leaving it untouched and free from damage. While Becker et al., (2009a)
state that keeping the original bitstreams as backup is common practice, having
access to them does not guarantee they will be legible in the future. The same can be
said of the LOCKSS system. Creating temporary files works effectively while the
original is still intact, but what happens when an unforeseen element disturbs the
system's ability to work with the original file? The original will have to be converted
eventually. When migration must be carried out, the process must be described and
stored within metadata, capturing the date, description, previous format, tools used,
and any other relevant details (Lupovici and Masanès, 2000).
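As a minimal sketch of what such a record might look like, the following Python fragment captures a migration event in the spirit described by Lupovici and Masanès; the field names and the tool version are my own illustrative assumptions rather than elements prescribed by any standard cited here:

# Minimal sketch of recording a migration event alongside the migrated object,
# loosely modelled on the kind of event record PREMIS describes.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class MigrationEvent:
    source_format: str              # format before migration, e.g. "image/gif"
    target_format: str              # format after migration, e.g. "image/jpeg"
    tool: str                       # software and version used for the conversion
    description: str                # free-text note on why/how the migration was done
    outcome: str = "success"        # result of the event
    date: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

event = MigrationEvent(
    source_format="image/gif",
    target_format="image/jpeg",
    tool="ImageMagick 6.9.11 (hypothetical version)",
    description="Batch migration of legacy GIF masters; transparency flattened to white.",
)

# Store the event with the object's other metadata so the migration history survives.
print(json.dumps(asdict(event), indent=2))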
Research has shown the issues related to migration; however, it must be recognised
that although the technique as a whole is an issue for certain types of preservation,
the real issue lies within metadata, or rather the lack of it. As Lawrence et al., (2000)
states, a digital object should not just be seen as a single entity, but often as a
collection of files, made up of relationships and dependencies. How this collection of
files relates to one another and how they are used is the information that is important
when migrating such a complex object. This information must be captured in the
metadata and must also survive the migration process for it to be successful. In fact,
metadata is the make or break of digital preservation, increasingly so for the next
technique discussed, emulation. See section 2.3 for further information regarding
metadata.
2.2.2 Emulation
Migration also lacks suitable tools, as they are not usually available for dynamic and
interactive digital material such as interactive objects, software, scientific toolchains,
and databases, unnecessarily limiting the number of types that can be archived
(Rechert et al., 2012), (Rechert et al., 2014) and (von Suchodoletz et al., 2013).
500,000 electronic publications across multiple media are kept at the German
National Library; a subset was used in a use case which determined that many of the
artefacts were unable to run on modern-day systems within their reading rooms
(Rechert et al., 2014). Compatibility issues due to missing components such as
codecs, fonts, dependencies, libraries, applications, and version differences were the
cause; therefore, not only due to risk but also to limitations, we look towards
emulation.
Emulation is not a new concept; it has been used in the computing industry since the
1960s, but it was only considered for digital preservation starting in the 1990s, first
argued for by Jeff Rothenberg (Kirschenbaum et al., 2013). Various emulation
projects were developed through the 1990s and 2000s. In 2004, the National Library
of the Netherlands and the National Archief of the Netherlands accepted the need for
emulation in order to maintain authenticity and integrity for complex digital objects
(Van der Hoeven et al., 2007). From this a project was initiated and then completed
in 2007, delivering a durable x86 component-based computer emulator called
"Dioscuri", the first of its kind.
Emulation, rather than focusing on obsolete objects, focuses on the environments,
providing the means to replicate a virtual environment that allows deprecated digital
ecosystems to live on, allowing digital objects to be rendered or executed (Van der
Hoeven et al., 2007), (Becker et al., 2009a), (Pellegrino, 2014) and (Rechert et al.,
2014). Access is merely the first advantage; as stated by Routhier Perry, (2014) and
Rechert et al., (2012), emulation is the key to providing an authentic look and feel,
allowing users to see exactly how the content would have looked and allowing files
to remain interactive. This is important for preserving historical authenticity. There
are cases where emulation has been shown to be imperative for certain digital
objects. One in particular, discussed by Kirschenbaum et al., (2013), involves a piece
of electronic literature encoded to self-encrypt after a single reading. Within an
emulated environment, access to this material can be endless, removing the risk of
losing it due to encryption.
Complex and interactive objects have many dependencies, therefore requiring
detailed knowledge about the object. In most cases, using its creating application is
the best way to render an object, as most of its dependencies will be covered (Rechert
et al., 2012). In fact, traditional digital objects such as PDFs and images may be
complemented by research data and other materials (von Suchodoletz et al., 2013),
without which the object may be relatively useless. Rechert et al., (2012) go on
to describe "viewpaths", an ordered list of dependencies for the selected digital
object, covering complete systems, both hardware and software. Identifying and
describing viewpaths is one of the challenges faced when preservation takes the
emulation approach. Further challenges lie in the selection of the emulator as it is
"impossible to choose a one-size-fits-all solution" (Van der Hoeven et al., 2007),
however within the KEEP project, the Trustworthy Online Technical Environment
Metadata Database (TOTEM) was developed to aid in this process, covering
environment and technical metadata (Rechert et al., 2012). The next challenge is
using the emulators, as there are generally requirements for additional software and
configurations, which can be quite complex. Therefore, as suggested by Rechert et
al., (2014), a "user-friendly platform" that conveniently wraps it all together is
needed, adding automation where possible. Dappert et al., (2013) describe the
process this would follow in the use case described in their study, which was part of
the Keeping Emulation Environments Portable (KEEP) project. The use
case involved emulating a radar simulation for a racing boat training package.
Technical metadata was supplied, although not adequate. The object was stored on a
diskette. It was assumed this was already transferred and imaged. TOTEM was used
to select PC/MSDOS/OS compatible pairings, then for each pairing an emulator was
selected. Suitable environments were set up using the KEEP framework. The
simulation was run and then the pairing with the best performance was chosen. The
process of each step is identified diagrammatically and in greater detail within the
study, identifying the underlying complexity and in turn, strengthening the need for
accurate and consistent metadata for the emulation process.
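A viewpath of the kind Rechert et al., (2012) describe could be sketched as a simple ordered structure; the field names and example values below are my own assumptions rather than any published schema:

# Minimal sketch of a "viewpath": an ordered list of dependencies leading from
# the digital object up to the hardware or emulator that can host it.
from dataclasses import dataclass
from typing import List

@dataclass
class Dependency:
    layer: str      # e.g. "application", "operating system", "hardware/emulator"
    name: str
    version: str

@dataclass
class ViewPath:
    object_id: str
    dependencies: List[Dependency]   # ordered: the object's immediate needs first

    def describe(self) -> str:
        chain = " -> ".join(f"{d.name} {d.version}" for d in self.dependencies)
        return f"{self.object_id}: {chain}"

# Hypothetical viewpath for a DOS-era training package, in the spirit of the
# KEEP radar simulation use case; names and versions are invented.
radar_sim = ViewPath(
    object_id="radar-simulation.img",
    dependencies=[
        Dependency("application", "Radar trainer", "1.2"),
        Dependency("operating system", "MS-DOS", "6.22"),
        Dependency("hardware/emulator", "x86 PC emulator (e.g. Dioscuri)", "n/a"),
    ],
)

print(radar_sim.describe())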
One of the emerging facets in digital preservation involving emulation is
"Emulation-as-a-Service", where emulation services are delivered through cloud
computing (von Suchodoletz et al., 2013), (Rabinovici-Cohen et al., 2013) and
(Rechert et al., 2014). These services can then be distributed, effectively removing
the need to worry about the back end and allowing focus on front-end configuration,
which does not necessarily make it easier due to complexity. The PDS system (Rabinovici-
Cohen et al., 2013) utilises cloud technology to support logical preservation that
makes use of user-defined and system-defined metadata, brokering interconnections
between Open Archival Information System (OAIS) entities and multiple diverse
clouds. It groups the data with the metadata which then aids in automating the
preservation processes. The cloud computing model of scalability, elasticity, access
anywhere, and its pay-as-you-go cost effectiveness makes it an increasingly targeted
platform for emulation. However, further issues may lie in what is actually happening
in the cloud: are integrity, authenticity, and provenance guaranteed? von Suchodoletz
et al., (2013) suggest that authenticity is difficult to verify in this scenario. Is an
accurate representation displayed before the user? There are of course many
determining variables, however, the focus of this thesis is making emulation possible
through metadata, not specifically on emulators.
2.3 Metadata
Metadata is the foundation of digital preservation, yet it was only seen as such from
the mid-1990s, when awareness of the role it plays in long-term digital preservation
started to grow (Day, 2003). Bern, (2003) supports this, stating that metadata's
position as the key to digital preservation has been firmly established. Day, (2003) describes
metadata as structured information that is used to describe, find, manage, control,
and preserve other information over time and is the best way to minimise risk of
losing access to our digital objects. Furthermore, it accompanies and refers to digital
objects, providing descriptive, structural, administrative, rights management, and
other important information. In fact, the capture, creation, and maintenance of
metadata is the factor on which almost all digital preservation strategies depend.
However, when capturing metadata, all necessary parameters required to render the
digital object must be acquired, which involves a full list of requirements that go
beyond operating system versions and software dependencies (Rechert et al., 2014).
Strodl et al., (2011) identified the work being conducted on metadata at that time,
with research projects including: CASPAR, working on descriptive metadata;
PLANETS, investigating advanced characteristics and key preservation metadata
concepts; and further application-specific research conducted by KEEP, LiWA,
ARCOMEM, and PrestoPRIME, with a focus on supporting emulation, web
archiving, the social web, and audio/video.
Rahman and Masud, (2014) identified what metadata can describe, ranging from:
data source, database tables, tuples, data models, software process, system
environments, subroutines, events, and also people and their roles in an IT system, to
name a few. Furthermore, describing items such as: file formats, version, byte order,
encoding, codec, channels, and dependencies such as fonts, styles, etc, can be crucial
to ensure continued access (Hutchins, 2012). Emphasis has been made on metadata
describing the past and present states of the content, determining if any unknown
alterations have been made and that the file can be uniquely identifiable (Lupovici
and Masanès, 2000). It is important that all captured information meets the three
essential characteristics of completeness, accuracy, and consistency (Windnagel,
2014). Windnagel describes completeness as covering all possible metadata elements
to ensure full discoverability. For accuracy, the following examples are given: correct
file format and size, correct spelling of words, and sufficient subject keywords.
Consistency is how well the data conforms to semantic and structural standards.
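A purely illustrative check of these three characteristics, with invented required elements and rules of my own, might look like the following; a real repository would of course apply far richer validation:

# Toy check of the three characteristics described above: completeness,
# accuracy, and consistency. Required elements and rules are invented.
REQUIRED_ELEMENTS = {"title", "creator", "date", "format", "file_size_bytes"}

def check_record(record: dict) -> dict:
    issues = {"completeness": [], "accuracy": [], "consistency": []}

    # Completeness: every required element should be present and non-empty.
    for element in REQUIRED_ELEMENTS:
        if not record.get(element):
            issues["completeness"].append(f"missing or empty: {element}")

    # Accuracy (simplified): file size should be a positive integer.
    size = record.get("file_size_bytes")
    if size is not None and (not isinstance(size, int) or size <= 0):
        issues["accuracy"].append("file_size_bytes is not a positive integer")

    # Consistency (simplified): dates should follow one agreed pattern (YYYY-MM-DD).
    date = record.get("date", "")
    if date and not (len(date) == 10 and date[4] == "-" and date[7] == "-"):
        issues["consistency"].append("date does not follow YYYY-MM-DD")

    return issues

print(check_record({"title": "Sample", "creator": "", "date": "18/06/2015",
                    "format": "application/pdf", "file_size_bytes": 24576}))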
The "what" metadata describes can be broken down based on the type of metadata.
The main types of metadata that have been identified consist of: Administrative
Metadata, Descriptive Metadata, Structural Metadata, Technical Metadata, and
Preservation Metadata (PREMIS, 2012), (Phillips et al., 2013), (Gartner and Lavoie,
2013), and (Dappert, 2013). Phillips et al., (2013) identifies another type of metadata,
transformative. This logs any events that led to changes to the digital object. It is
further stated that all of these types can be generated computationally with the
exception of descriptive metadata. These metadata types can also be seen as
reference information, provenance, context, fixity, and representation as
recommended by the OAIS model, one of the most influential standards/models in
the area (Lupovici and Masanès, 2000), (Day, 2003) and (Pal, 2014).
Administrative metadata holds information that specifies attributes such as how and
when an object was created as well as permissions, e.g. who can access the file
(Phillips et al., 2013). Descriptive metadata describes intellectual entities, e.g.
publication information such as creator, title, and can characterise the content
through classification and subject terms (PREMIS, 2012). Structural Metadata
describes internal structure of an object and the relationships between their parts, e.g.
how a set of images assembles into a complete book (PREMIS, 2012). Technical
metadata specifies attributes about the object, e.g. width, height, and bit depth of an
image as well as describing the format (Hutchins, 2012), (PREMIS, 2012) and
(Allasia et al., 2014). Preservation metadata is perhaps an unfamiliar concept for
information professionals due to ambiguity surrounding its scope and purpose
(Gartner and Lavoie, 2013). PREMIS, (2012) defines preservation metadata as
information that supports the digital preservation process. Specifically supporting
functions such as maintaining authenticity, identity, renderability, understandability,
and viability (PREMIS, 2012) and (Gartner and Lavoie, 2013). Preservation
metadata does not fit into any one category, but spans across multiple, such as
administrative, descriptive, structural, and technical metadata (PREMIS, 2012) and
(Gartner and Lavoie, 2013). Therefore the function of the metadata should not be
used to understand preservation metadata, but instead the process and the larger
purpose it supports, that is long-term digital preservation (Gartner and Lavoie, 2013).
An example of what information may be present in preservation metadata is
described by Gartner and Lavoie, (2013) and includes: provenance of the object,
rights management, technical information, and the interpretative environment
information of the object. This does not exhaust the types of information that may be
present and it will differ from schema to schema, but it gives a basic idea of the
scope. For instance, the TIMBUS project identifies what is needed to preserve
computing environments so that business processes and services can be preserved.
The research conducted in the project has shown that specific environmental
information is required such as: technical environment description, dependencies,
and process descriptions, overall leading to a complete preservation metadata
(Dappert, 2013). Preservation metadata can be seen as the ultimate goal, built up of
lower level metadata types. However, it may not contain all the required data to
ensure complete functionally, it may only contain the means to preserve the object. It
may not contain appropriate descriptive metadata to give context and allow for
searching, sorting, and management functions. In fact the PREMIS data dictionary
does not focus on descriptive metadata at all as it has been established well by other
standards.
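As a purely illustrative sketch, not drawn from PREMIS or any other standard named here, the metadata types above might be grouped for a single digital image as follows, with preservation metadata cutting across the other categories; the element names and values are invented:

# Illustrative grouping of the metadata types discussed above for one image.
record = {
    "administrative": {
        "created": "2014-03-02T10:15:00Z",
        "creating_application": "Adobe Photoshop CS6",
        "access_permissions": ["archivist", "researcher"],
    },
    "descriptive": {
        "title": "Adelaide foreshore at dusk",
        "creator": "T. Hart",
        "subject": ["coastline", "South Australia"],
    },
    "structural": {
        "part_of": "photo-essay-2014",      # how this file relates to a larger object
        "sequence": 3,
    },
    "technical": {
        "format": "image/jpeg",
        "width_px": 4928,
        "height_px": 3264,
        "bit_depth": 24,
    },
}

# Preservation metadata spans the categories above; here it simply records the
# events and environment information needed to keep the object usable over time.
record["preservation"] = {
    "provenance": ["captured 2014-03-02", "migrated TIFF->JPEG 2015-01-10"],
    "rendering_environment": {"os": "Windows 7", "software": "any JPEG viewer"},
    "fixity": {"algorithm": "SHA-256", "value": "<checksum here>"},
}

for category, elements in record.items():
    print(category, "->", ", ".join(elements))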
There is a wide range of standards, each addressing different types of metadata and
scenarios. Not all standards will be covered, but the "de facto" standards will be
addressed. As metadata needs to fulfil a great deal, standards are something that are
not easily accomplished (Day, 2003), which explains why there are so many and why
multiple standards are generally used. Without consistent standards which can be
agreed upon, it becomes increasingly difficult to offer aggregated collecting,
discovery, and preservation services (Brazier, 2013). This may be one of the main
reasons a lot of standards use other well established standards to cover areas they do
not. Allasia et al., (2014) recognise and stress the importance of metadata standards
and created a novel standard, based around a model aimed at audio/visual objects.
The standard made use of each type of metadata from de facto standards such as
Dublin Core and MPEG-7 for descriptive elements, and PREMIS for preservation
metadata. Dublin Core, MAB, MARC are just some of the names often seen when it
comes to descriptive metadata and they are widely accepted and used, Dublin Core
being the most prominent (Strodl et al., 2011). The following standards addressed by
Bekaert et al., (2003) specify an XML-based data structure: "Metadata Encoding and
Transmission Standard (METS), the IMS Content Packaging XML Binding, the
Sharable Content Object Reference Model (SCORM) and the XML packaging
approach developed by CCSDS Panel 2". METS is classified as one of the most
prominent preservation metadata implementation strategies, the other being PREMIS
(Routhier Perry, 2014). The PREMIS standard started in 2003 and resulted in a data
dictionary that defines preservation metadata. Digital provenance and relationships
are some of the fields PREMIS pays close attention to, however the standard does
not cover descriptive metadata as it is identified as being domain specific and is less
crucial for the preservation process. However it is stated that it is important for
discovery and helping decision makers during the process (PREMIS, 2012). It is
increasingly important in multimedia objects as they are made up of multiple and
sometimes dynamic/interactive files. The metadata is used to describe, organise, and
package, e.g. describing the makeup of the multimedia object as a collection of
images, audio tracks, and video. Smith and Schirling, (2006) state this is necessary
because multimedia objects are not self-describing for computer interpretation,
making it difficult to identify semantic-level content. Without this information,
context is lost, and searching/retrieval becomes difficult as you do not have
information on the subject of an image, for example, what the image is of. Rechert et
al., (2012) believe PREMIS and METS to be migration-oriented; however, there have
been papers and presentations based on using it for emulation by (Dappert and
Peyrard, 2012). Dappert et al., (2012) displayed how additions can be made to the
standard, allowing it to link to a technical registry such as TOTEM. Dappert et al.,
(2013) further identified the gaps in the PREMIS version at the time, suggesting a
range of changes and additions, with clear examples displaying how the new
elements can be used to describe environments. Ongoing changes and updates are
being made to the standard, strengthening the capabilities it offers for preservation.
When it comes to describing multimedia objects, MPEG-7 is the standard used
(Rahman and Masud, 2014). It provides an extensive set of tools which allow
multimedia content description. These descriptions cover audio features such as
melody and rhythm, as well as: visual features, specifications for encoding and
transporting, and other information that may be queried in a search (Smith and
Schirling, 2006). Chang et al., (2001) identified the value in harmonising or
synchronising MPEG-7 with other domain-specific standards, as they each share the
same objective. However, MPEG-7 defines over 450 metadata types using XML and
although there have been efforts made to harmonise with other standards such as
Dublin Core, it was not always possible due to disjointed purposes (Smith and
Schirling, 2006). Although focused on audio/visual content, MPEG-7 can also be
used with non-MPEG formats (Chang et al., 2001). Another MPEG standard,
MPEG-21, provides a framework for exchanging digital objects across devices and
networks, specifying metadata for packaging, rights management, and environment
adaptation (Smith and Schirling, 2006). Although the primary focus has been
audio/video and the like, MPEG-21 can be used for a range of different complex
objects such as scientific datasets, electronic texts, and journals.
With a range of different standards, many of which address the same or similar
elements, there exists a form of mapping two or more standards together; this is
called crosswalking (OCLC, 2010). A crosswalk is generally represented as a two- or
three-column table addressing two metadata schemas, mapping each element to its
counterpart (a toy sketch follows this paragraph). Wheatley, (2004) describes the
process a repository follows from ingest through to access, involving the creation and
management of metadata, the use of extraction tools and automation, frameworks to
record changes in metadata, and the storage of metadata in a separate location. This is
said to make metadata a more manageable task, further allowing several objects of
the same format to point to the same metadata. Monitoring and updating then become
simpler tasks; however, convenience and simplification may come at the cost of great
risk. Due to human interaction, which leads to human error or misconduct, should
something happen to the metadata managing a large collection of items, one error
could become N errors.
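The toy crosswalk below is in the spirit of the two- or three-column mapping tables described above; the element pairings are my own illustrative assumptions and should not be read as an authoritative Dublin Core to MODS mapping:

# Toy crosswalk: translate a flat source record into target-schema element names.
CROSSWALK_DC_TO_MODS = {
    "dc:title":       "mods:titleInfo/mods:title",
    "dc:creator":     "mods:name/mods:namePart",
    "dc:date":        "mods:originInfo/mods:dateIssued",
    "dc:format":      "mods:physicalDescription/mods:form",
    "dc:identifier":  "mods:identifier",
}

def crosswalk(record: dict, mapping: dict) -> dict:
    """Map each element to its counterpart; collect unmapped elements for review."""
    translated, unmapped = {}, {}
    for element, value in record.items():
        if element in mapping:
            translated[mapping[element]] = value
        else:
            unmapped[element] = value
    return {"translated": translated, "needs_review": unmapped}

source = {
    "dc:title": "Metadata Standard for Future Digital Preservation",
    "dc:creator": "Hart, Timothy Robert",
    "dc:date": "2015",
    "dc:rights": "All rights reserved",   # no mapping defined above, so flagged
}

print(crosswalk(source, CROSSWALK_DC_TO_MODS))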
PREMIS, (2012) addresses where metadata should be stored, explaining that database
storage has the advantages of fast access, easy updating, and query/report
functionality. Storing the metadata with the digital object in a repository makes it
harder for the digital object to be separated from its metadata, and also allows the
same preservation strategies to be performed on the metadata. The recommended
practice is to store critical metadata in both ways. Ideally, as much information as possible should be
embedded with the object itself, as extraction tools will be able to capture it. Further
recommendations made by Pellegrino, (2014) state that metadata can be stored in
ASCII as it is a sure way to represent metadata that is backwards compatible and
easily understood by humans. Rousidis et al., (2014) emphasise that the "quality" of
metadata is vital for interoperability and go on to explain the issues faced with
the Dryad repository. Some of the issues described include consistency, something
that is affected by digital object evolution, which can transform the object's
metadata. The quality of the metadata stored in the repository is important if accurate
analysis and monitoring are to be performed on the growth of the repository; this
required a full analysis of the stored metadata, which led to the discovery of misuse,
consistency, and quality issues. Interoperability is further identified as a
requirement by (Day, 2003) and (Windnagel, 2014), and is the ultimate goal and
objective of MPEG-7 (Chang et al., 2001).
2.4 Tools
The tools in question are related to metadata extraction and creation. There are many
different tools that aim to achieve similar goals, each meeting both success and
failure as there is no perfect tool. Tools is a rather small focus in this study, as the
focus lies in the "what" rather than "how" as the first step is identifying what needs to
be captured and done in order to perform preservation. Hutchins, (2012) broadly
categorises metadata into extrinsic and intrinsic, the latter being included in the byte
stream of the object and the other is generated and stored externally. Thus stating
metadata extraction being considered as making intrinsic metadata extrinsic via
extraction tools that read an objects byte stream, presenting them in a different
format. Bern, (2003) states "as much automation as possible is a must" and now there
are tools that are available and some automation is possible, which is shown by
Hutchins, (2012) in the testing of metadata extraction tools such as: ExifTool,
MediaInfo, pdfinfo, and Apache Tika. Tests were performed on a number of digital
objects from image files, multimedia files to PDF and office files alike. Promising
results followed the testing process, however, errors occurred. Pellegrino, (2014) and
Kirschenbaum, (2010) identify a range of other tools: PRONOM, AONS(I and II),
DROID, JHOVE, FITS, NLNZ. All these tools are in one way or another the same;
however, how they perform and the output they deliver are different. In fact,
Kirschenbaum identifies that these tools are generally used in conjunction with one
another, as each can support the other. Furthermore, some of these tools are used as
part of the five key PREMIS tools, with the addition of the PREMIS Creation Tool,
HandS, and the PREMIS in METS Toolbox; these tools also work in conjunction
with each other. For example, JHOVE can identify file formats and validate files,
producing detailed technical metadata; however, the output does not follow the
PREMIS standard, so the PREMIS Creation Tool generates PREMIS object entities
from the JHOVE or DROID output (Gartner and Lavoie, 2013). Bern, (2003) puts emphasis
on the addition of manually captured metadata, especially for high-level metadata,
generally troublesome for automated tools. Therefore humans and tools must work
together in order to achieve completeness.
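As a minimal sketch of driving one of the extraction tools named above from Python, the following fragment calls ExifTool's JSON output mode; it assumes the exiftool binary is installed and on the PATH, and the input file name is a placeholder:

# Minimal sketch: run ExifTool on a file and read its embedded metadata.
import json
import subprocess

def extract_metadata(path: str) -> dict:
    """Return ExifTool's view of a file's embedded metadata as a dictionary."""
    completed = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    )
    # ExifTool's -json output is a list with one object per input file.
    return json.loads(completed.stdout)[0]

if __name__ == "__main__":
    meta = extract_metadata("example.jpg")          # hypothetical input file
    # Descriptive fields depend on what the creator filled in; they may be
    # missing entirely, which is exactly the limitation discussed below.
    for field in ("FileType", "CreateDate", "Author", "Creator"):
        print(f"{field}: {meta.get(field, '<not present>')}")

Fields such as "Author" will often come back empty or generic, which is exactly the limitation discussed in the next paragraph.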
Hutchins, (2012) identifies that the issue with extraction tools lies in the obvious fact
that if the metadata is not there in the first place, it cannot be extracted. Technical
metadata is generally present and can be extracted successfully, however descriptive
and administrative metadata are dependent on human input at the time of creation or
curation. This further leads to problems in automation, as fields such as "Author" and
other descriptive fields may end up with a generic name, such as the username of the
system from which the file was transferred. If the automation
workflow involved the extraction process and then depositing the metadata to a
registry or database which was then to sort the information using descriptive
metadata, consistency and authenticity are at risk.
2.5 Training
Although there are tools which can be used for authentication, extraction, and
providing access, there are still important techniques used to establish trust in digital
objects and they are organisational, requiring human behaviours (Kirschenbaum,
2010). Professional ethics, skills, attitude, and monitoring then become equally
important in digital preservation. The vast array of techniques, tools, and standards
make digital preservation rather complex and fragmented. This can make it a
daunting experience, regardless of skill level. As humans play a pivotal role, they
must have the required knowledge to bring this fragmented world together in an
effective preservation solution. Therefore, training must accommodate all of this. It is
also essential because there are many different standards, or a lack of standards, used
in metadata, meaning different variations of the same word or description can lead to
errors in automated systems; humans are better at identifying such things (Routhier
Perry, 2014).
It is clear not everybody is ready to start preserving on a scale that is needed, many
staff members in libraries and archives do not possess the required skills to select,
collect, manage, and curate, simply because they have not received training (Brazier,
2013) and (Routhier Perry, 2014).
Training needs to address the need to start collecting digital content much earlier,
further emphasising that it is far easier to recreate the context of an object while it is
still current, rather than relying on digital archaeology (Brazier, 2013). Knowing what
to preserve and the best methods to use is another great concern amongst
professionals, for which specialised training is recommended (Routhier Perry, 2014).
Metadata needs to be stressed, as it is imperative to making preservation mean
something, especially in libraries, where the history of an object, its provenance, is
equally as important as preserving the object itself. The same applies to academic
research, which may rely on precise and accurate representations of past experimental
results or mathematically based objects.
Brazier (2013) describes the training offered at the British Library, which covers the basics of working with researchers, collections in digital environments, use of social networking, presentation skills, working with digital objects, basics in web and programming, and an update on metadata touching on Dublin Core, METS, MODS, and XML. However, this hardly seems sufficient to truly train staff in digital preservation. Strodl et al. (2011) state that many research projects in this area offer training courses about their work, including training seminars, summer schools, and public events. DPE, PLANETS, and CASPAR were the first to consolidate training with the wePreserve platform; however, this remains very fragmented and only offers the basics, and only to those who attend.
Currently, Strodl et al. (2011) state that formal qualifications for professional training are missing and address the need to provide training across different sectors: academic, industry, cultural heritage, and private. In addition to formal training, education must be considered, offering programs in digital preservation targeted at librarians and archivists.
3.0 Motivation
The motivation for this thesis comes from wanting to solve the problems digital preservation faces and to raise awareness of the various topics discussed. The problem forms somewhat of a nested list of problems, starting with digital preservation as a whole, then metadata, and lastly humans. The problems stated should not be seen in a hierarchical manner, as each problem is the result of the other. Firstly, digital preservation may not be viewed as a critical requirement, often because there are simple solutions that have a good track record on the surface, namely migration. However, research has shown migration's track record to be very different under the surface; therefore, more consideration must be given to digital preservation. This is where awareness is crucial. Once awareness is raised and it is no longer deemed appropriate to simply execute a quick and simple solution, the rest of the problems are revealed. Assuming one wants to preserve a digital object so that it can be accessed and interacted with in the future, emulation is the required technique. As noted in the literature, it is a complex technique that relies on complete and accurate metadata. Therein lies the second problem: metadata.
Preserving an object so that it remains through time is relatively easy, but ensuring it still runs as it did in its original environment is the challenge, which emulation addresses by creating a replicated environment. For this environment to be created, information about the environment must be known; this information is held within the metadata, assuming the metadata is present. Now comes the problem of humans. As we still perform the majority of the operations in the preservation process, it is up to humans to ensure the metadata are sufficient for emulation. This is no easy task, and without computational automation it never will be. A standard can make it easier by identifying what metadata should be captured and how to create it, including what naming conventions and formats to use. Study has shown that one size does not fit all with standards, so there are many, each focusing on a select area. If one wants to simply preserve a digital object, an existing preservation metadata standard could be used. If the object is being stored in a repository where searching and sorting take place, descriptive metadata need to be established, requiring yet another standard. It is not certain that the same standards will be used for the same types of metadata each time; therefore the risk of inconsistencies increases. In fact, very few standards are readily available and easy to interpret for the typical user. If the elements of the MPEG-7 standard were desired, one must purchase the ISO, which is split into 12 parts (“MPEG-7,” 2014). With simplicity and completeness in mind, it would be beneficial to have a centralised standard that covers all types of metadata. The purpose of this would be to aid in the early and later stages of emulation, ensuring an object has the correct information to allow for preservation as well as information to allow for management, e.g. sorting, searching, and describing the object.
Newly found motivation comes from the United Nations Educational, Scientific and Cultural Organization (UNESCO, 2014), an organisation that is addressing digital preservation now and sees it as a current matter. The goals of this thesis align with the current goals of UNESCO, an international organisation, which only strengthens the importance of the work presented here. For all these reasons, awareness is the key. For digital preservation to be successful and even possible, we need to get on board now, ensuring newly created content is ready for the next leap in technology, and that is my motivation.
4.0 Development
The development of the standard is currently in its early stages, starting off as a seed intended to grow. As there are countless metadata that an object could possess, creating a standard that covers even a small percentage is no simple task for a team of one. The idea is to identify the bare minimum metadata needed to preserve an object and then to address metadata based on the functionality requirements. The requirements will differ for each object as well as for the intended location of the object. For example, if the object is destined for an archive or library, then querying, searching, and sorting functionality will be useful; therefore descriptive metadata are needed, and those elements exist within other standards. The new standard will consist of PREMIS (“PREMIS,” 2015) with Dublin Core (“DCMI Home,” 2015) amendments to begin with. The version of PREMIS available at the time of development is already outdated: Version 3 is in its draft stages, and although the privilege to view this draft was granted by one of the leading authors, distribution in any way was forbidden. This does not cause any problems, as the updates to PREMIS can be implemented at a later date with little impact on the structure. Depending on when the new version is made available, it may or may not be addressed in this thesis.
Most standards are presented in two forms: a data dictionary and an XML schema or Resource Description Framework (RDF) model. Both forms offer a simple way to display the metadata elements so that they can be easily interpreted. Figure 2 shows the numbering structure of the entity list in the PREMIS standard.
Figure 2 - PREMIS numbered schema
As specified in the PREMIS data dictionary (PREMIS, 2012), each element carries parentheses which indicate whether it is Mandatory (M) or Optional (O) and Repeatable (R) or Not Repeatable (NR); repeatable means the element can take multiple values. Some elements apply to only one or two of the object types, which are bitstream, file, and representation; if none are specified, all are applicable. A file is a named and ordered sequence of bytes known to an operating system. It has a file format, system characteristics, and access permissions, the characteristics comprising elements such as size and last modification date. A bitstream is data within a file that has properties relevant for preservation purposes. A representation is a set of files, including structural metadata, that specify how the files come together; for example, an article represented by an image for each page, accompanied by an XML file describing how the pages are ordered to make up the article. The numbered list hierarchy is unique to PREMIS, as Dublin Core uses an alphabetised index. Each element in both standards is broken down into an
informative table, offering greater detail. PREMIS offers quite a detailed breakdown of each element and easily identifies related elements, as seen in Figure 3. Although this method offers a great amount of detail, both on the element itself and on how to use it, accessing each of these elements involves searching through a document, as there is no online repository to interact with. Dublin Core is best known for its simplicity, and its detailed breakdown reflects exactly that. Dublin Core originally consisted of 15 elements; it has now grown, and the original 15 elements have been refined by what are now referred to as terms. Figure 4 displays the breakdown of the date element, which is a refined version of the original date element. It shows what sub-elements it is refined by, similar to the PREMIS semantic components cell.
Figure 3 - PREMIS data dictionary element breakdown
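To make these conventions more concrete, the following is a minimal, illustrative sketch of how a PREMIS object entity for a single file might be expressed in XML. The semantic unit names follow the PREMIS version 2 data dictionary, but the namespace usage, container structure, and values shown here are assumptions for illustration and should be checked against the published schema rather than taken as authoritative.

<!-- Illustrative sketch only; element names follow PREMIS v2 semantic units, values are hypothetical. -->
<premis:object xmlns:premis="info:lc/xmlns/premis-v2"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:type="premis:file">
  <!-- objectIdentifier is Mandatory and Repeatable (M, R) -->
  <premis:objectIdentifier>
    <premis:objectIdentifierType>local</premis:objectIdentifierType>
    <premis:objectIdentifierValue>IMG-2015-0001</premis:objectIdentifierValue>
  </premis:objectIdentifier>
  <!-- objectCharacteristics applies to files and bitstreams, not representations -->
  <premis:objectCharacteristics>
    <premis:compositionLevel>0</premis:compositionLevel>
    <premis:size>198656</premis:size>
    <premis:format>
      <premis:formatDesignation>
        <premis:formatName>JPEG</premis:formatName>
      </premis:formatDesignation>
    </premis:format>
  </premis:objectCharacteristics>
</premis:object>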
Dublin Core offers an advantage that PREMIS does not: an online repository that the user can interact with. Each element in the "refined by" cell can be clicked, taking the user to that element, and so forth.
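As a small illustration of how an original element and its refined terms can coexist in a record, the following sketch uses the published DCMI namespaces; the record wrapper element is a hypothetical container and not part of Dublin Core itself.

<!-- The wrapper element is hypothetical; only the dc: and dcterms: terms are Dublin Core. -->
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:date>2010-02-11</dc:date>                    <!-- the broad, original element -->
  <dcterms:created>2010-02-11</dcterms:created>    <!-- refined term: date of creation -->
  <dcterms:modified>2015-05-18</dcterms:modified>  <!-- refined term: date last changed -->
</record>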
Both data dictionaries can be easily read and interpreted, and although visually different, both achieve the same result. However, there are differences in the XML representations. PREMIS follows a set structure and is rather complex, but still relatively easy to interpret. Dublin Core is flexible; although predominantly displayed in RDF, it can be implemented using other standards or even a combination. Given its simplicity, adapting to other standards and new elements is easily achieved. Dublin Core XML representations are generally examples of the standard in use, as can be seen in Figure 5, which describes a book. Figure 6 shows the same information in a basic description language. The schema itself remains in RDF, and without software to organise and make sense of it, viewing it in its base form justifies why it is typically represented in XML.
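Since the figures cannot convey the markup itself here, the following is a minimal sketch of the kind of record Figure 5 refers to: a book described with simple Dublin Core elements inside an RDF wrapper. The element selection and values are illustrative assumptions, not a reproduction of the figure.

<!-- Illustrative sketch only; element choice and values are assumed. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="urn:isbn:0000000000">
    <dc:title>Example Book Title</dc:title>
    <dc:creator>Example Author</dc:creator>
    <dc:publisher>Example Publisher</dc:publisher>
    <dc:date>2015</dc:date>
    <dc:format>application/pdf</dc:format>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>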
Figure 4 - Dublin Core element breakdown
The PREMIS XML schema is well presented and is divided into sections for different types of elements. For example, Figure 7 shows the complexType objectCharacteristicsComplexType; being a complexType means it is made up of multiple metadata terms. The minOccurs and maxOccurs attributes are present in both the PREMIS and Dublin Core XML, indicating the minimum and maximum number of occurrences permitted for each metadata term.
Figure 5 - Dublin Core XML representation (Book)
Figure 6 - Dublin Core Descriptive Language (Book)
Figure 7 - PREMIS XML
PREMIS deals with complex preservation metadata, which is the reason it requires a more detailed and complicated schema. It also describes elements that are made up of options, allowing one of several sub-elements to be used, which is represented by the <xs:choice> tag. Preference therefore falls to the PREMIS schema and representation, which now influences the way the new standard will be presented.
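As a generic illustration of what such a choice looks like in XML Schema (not a reproduction of the actual PREMIS schema text), the following sketch defines a complex type in which an agent can be identified either by a structured name or by a single display name; the element names are hypothetical.

<!-- Generic XML Schema sketch; element names are hypothetical and not taken from PREMIS. -->
<xs:complexType name="agentIdentification">
  <xs:choice>
    <!-- Option one: a structured personal name -->
    <xs:sequence>
      <xs:element ref="FirstName" minOccurs="1" maxOccurs="1"/>
      <xs:element ref="LastName" minOccurs="1" maxOccurs="1"/>
    </xs:sequence>
    <!-- Option two: a single display name -->
    <xs:element ref="DisplayName" minOccurs="1" maxOccurs="1"/>
  </xs:choice>
</xs:complexType>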
Because PREMIS focuses on preservation metadata and Dublin Core focuses on descriptive metadata, both standards are being studied in detail to find what one can bring to the other in order to cover all areas of metadata. Of course this is not limited to these two standards, but it is a start. There is an issue with Dublin Core that lies in its simplicity, even though that simplicity is also one of its advantages. Whether it matters is determined on a case-by-case basis, depending on how and what metadata are stored. For example, although Dublin Core has now grown from its original 15 elements, many of the elements are still high level, such as the "name" element: represented as a full-name string, there are no individual elements for "first name" and "last name", so these must be borrowed from other standards. However, this problem has minimal effect and is easily addressed. High-level elements such as "name" can simply be refined by sub-elements, either keeping the original element or removing it entirely; this will be determined by the element itself and its uses. Using a library as a scenario, searching on either of these elements can serve different purposes; in fact, given that first name + last name = name, some would argue the name element should simply be removed.
Regarding the new standard, it is assumed that preservation metadata are mostly covered by PREMIS and in working order, meaning there will be no changes made to the current PREMIS schema. This will remain so until the new version is incorporated; therefore the focus is on adopting descriptive metadata to strengthen and add functionality to the existing schema. The following list is the predicted order of phases:
Phase 1 - Identify descriptive metadata elements.
Phase 2 - Incorporate new elements into high level hierarchical list.
Phase 3 - Add, remove, modify (expand) new elements (suggestive).
Phase 4 - Develop XML schema / add to existing PREMIS schema.
Phase 5 - Research and development (other standards, elements, etc.)
Phase 6 - Future work.
These six phases identify what is going to be done and what needs to be done, as not all will be completed fully. The objective is to establish each phase with a subset of elements so that the process can then be easily repeated for the remaining elements. The following list of elements uses a separate numbering structure, excluding PREMIS for the moment. The reason for this is that it will be easier to incorporate the Dublin Core elements once the new version of PREMIS is adopted, as there will be changes to the existing schema. Note that all existing elements will be included, some of which may be removed at a later date, as similar elements present in the PREMIS schema may prove more suitable. Elements in bold are new additions or modifications. As a reminder: M = Mandatory, O = Optional, R = Repeatable, NR = Not Repeatable. Repeatable means the element holds multiple values.
1.1 Description (M, R)
1.1.1 Abstract (O, NR)
1.1.2 TableOfContents (O, NR)
1.2 Rights (M, R)
1.2.1 AccessRights (M, R)
1.2.2 License (M, R)
1.2.3 RightsHolder (Established as a sub-element now)
1.3 Accrual (O, R) (Established as a high-element now)
1.3.1 AccrualMethod (O, R)
1.3.2 AccrualPeriodicity (O, R)
1.3.3 AccrualPolicy (O, R)
1.4 Title (M, NR)
1.4.1 Alternative (O, NR)
1.5 Audience (M, R)
1.5.1 EducationLevel (O, R)
1.5.2 Mediator (O, R)
1.6 Date (M, NR)
1.6.1 Available (O, NR)
1.6.2 Created (M, NR)
1.6.3 DateAccepted (O, NR)
1.6.4 DateCopyrighted (O, NR)
1.6.5 DateSubmitted (O, NR)
1.6.6 Issued (O, NR)
1.6.7 Modified (M, R)
1.6.8 Valid (M, NR)
1.7 Identifier (M, NR)
1.7.1 BibliographicCitation (M, NR)
1.8 Relation (M, R)
1.8.1 ConformsTo (O, R)
1.8.2 HasFormat (O, R)
1.8.3 HasPart (M, R)
1.8.4 HasVersion (O, R)
1.8.5 isFormatOf (O, R)
1.8.6 isPartOf (M, R)
1.8.7 isReferencedBy (O, R)
1.8.8 isReplacedBy (O, R)
1.8.9 isRequiredBy (M, R)
1.8.10 isVersionOf (O, R)
1.8.11 References (O, R)
1.8.12 Replaces (O, R)
1.8.13 Requires (M, R)
1.8.14 Source (M, R)
1.8.15 MemberOf (O, R) (Established as a sub-element now)
1.9 Ownership (M, NR) (Established as a high-element now)
1.9.1 Creator (M, R)
1.9.1.1 Contributor (O, R)
1.9.2 Publisher (M, R) (Established as a sub-element now)
1.9.3 Provenance (M, NR) (Established as a sub-element now)
1.9.3.1 HistoryAction (M, R)
1.9.3.2 HistoryWhen (M, R)
1.9.3.3 HistoryAgent (M, R)
1.9.3.4 HistoryParameters (M, R) (See discussion 6.0)
1.10 Coverage (O, R)
1.10.1 Spatial (O, R)
1.10.2 Temporal (O, R)
1.11 Format (M, R)
1.11.1 Extent (M, R)
1.11.2 Medium (M, R)
1.12 InstructionalMethod (O, R)
1.13 Language (M, R)
1.14 Subject (M, R)
1.15 Contact (O, R) (New entry, basic information, importance varies)
1.15.1 EmailAddress (O, R)
1.15.2 Phone (O, R)
1.15.3 Address (O, R)
1.15.4 URL (O, NR)
1.16 Identity (O, NR) (New entry, for identifying a person, importance varies)
1.16.1 First Name (O, NR)
1.16.2 Last Name (O, NR)
1.16.3 Middle Name (O, R)
1.16.4 Maiden Name (O, NR)
1.16.5 D.O.B (O, NR)
1.16.6 Y.O.D (O, NR)
The above is the suggested hierarchy representation for the Dublin Core elements. Note that the order of this list is not final and the list is subject to change. Minor changes were made to incorporate stand-alone elements as appropriate sub-elements and to bring related sub-elements together under a high-level element. New additions were included with various functionality in mind. For example, the Contact group could be important for tracking down the individual responsible for creating or modifying a particular item, where assistance or authorisation credentials are required from them. If a long period has passed between when these entries were made and when contact is needed, they may be outdated and no longer valid; however, there is a "Valid" metadata term in the "Date" category which indicates the last date the resource was valid, and this can also be incorporated into the Contact group. The Identity group was added to further break down how a person's metadata are stored, including the birth and death dates, which can be important information in certain contexts. One thing to keep in mind is that each disciplinary field values specific information differently; therefore, in a library scenario, specifically in a university, the value of each metadata element varies. This is why the standard must draw from all sources to cover a wider range of uses and users.
The following example illustrates what the XML representation could look like for the Dublin Core elements; it is, however, subject to change.
For a standard element, we simply make use of the <xs:element> tag, as in the PREMIS XML schema. Using the 1.15.2 Phone element from the list above, the representation is as follows:
<xs:element ref="Phone" minOccurs="0" maxOccurs="unbounded"/>
This representation states that the Phone element is not mandatory, specified by the minOccurs attribute being set to zero. The maximum number of records the Phone element can hold is unbounded, which allows multiple numbers to be associated with this element, given that a person typically has several contact numbers (home, work, mobile) and may have more.
When dealing with a high-level element that contains sub-elements, we refer to the <xs:complexType> tag, which is used when dealing with multiple elements. As mentioned earlier, there is an <xs:choice> tag which indicates a choice to be made between alternative elements or sequences of elements; this tag also comes under the complexType tag. Using the Contact group, the XML representation is as follows:
<xs:complexType name="Contact">
<xs:sequence>
<xs:element ref="EmailAddress" minOccurs="0" macOccurs="unbounded"/>
<xs:element ref="Phone" minOccurs="0" macOccurs="unbounded"/>
<xs:element ref="Address" minOccurs="0" macOccurs="unbounded"/>
<xs:element ref="URL" minOccurs="0" macOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
The reason each of these fields is unbounded is that multiple records can be made for each. The Contact group is not just used for people; it can also be used for a business or enterprise, which could have multiple websites, multiple locations, and different branches. Currently the <xs:complexType> and <xs:sequence> tags will suffice; however, in time, a study may be conducted using a list of alternative tags, testing them against a wide range of users of different skill levels to establish which tags convey their meaning most clearly.
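To show what a record conforming to that definition might look like in practice, the following is a sketch of a populated Contact group; the values are hypothetical and are used purely for illustration.

<!-- Illustrative instance of the Contact group; all values are hypothetical. -->
<Contact>
  <EmailAddress>library@example.edu</EmailAddress>
  <Phone>+61 8 0000 0001</Phone>  <!-- work number -->
  <Phone>+61 4 0000 0002</Phone>  <!-- mobile number; Phone is repeatable, so several values may appear -->
  <Address>Example Street, Adelaide, South Australia</Address>
  <URL>http://www.example.edu/library</URL>
</Contact>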
Regarding the existing elements, no changes have been made, and all properties and information remain in the Dublin Core repository. Changes may be made at a later date to improve or increase compatibility with other additions. There are still countless additions that could be made, something that must be addressed on a case-by-case basis.
If and when changes are made to existing elements, or one element replaces another, crosswalking tables will be created. For example, the following table shows how one schema is mapped to another; the elements used are for illustration purposes only. Note that the original element from the left-hand schema may not always be in a metadata schema form and may simply be a description, machine language, etc.
Table 1 - Crosswalk mapping example

Schema one                          →   Schema two
Surname                             →   LastName
Forename/GivenName/ChristianName    →   FirstName
Change history                      →   HistoryAction
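Where such a crosswalk is applied by software rather than read by a person, a transformation language such as XSLT is one way to express the mapping. The following is a minimal sketch under the assumption that the source records use flat Surname and Forename elements inside a record element; it is illustrative only and not part of the proposed standard.

<!-- Minimal XSLT sketch applying the example crosswalk; source element names are assumed. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Match each source record and emit only the mapped target elements. -->
  <xsl:template match="/record">
    <Identity>
      <FirstName><xsl:value-of select="Forename"/></FirstName>
      <LastName><xsl:value-of select="Surname"/></LastName>
    </Identity>
  </xsl:template>
</xsl:stylesheet>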
Something to consider when preserving digital objects is the content of the object. For example, if digital images, art, and the like are the types of files being preserved, one must address why these items are being preserved and how the preserved items will be used. A history student may be interested in what a painting from a particular date or event is conveying, yet an art student may require knowledge of the artist. If it were a digital image, the same may apply, but a photography student may need to know what type of camera was used, the type of lens, shutter speeds, and various other information. These elements are often found embedded within digital images, along with a range of other metadata. One relevant standard, which has its own schema and SDK, is the Extensible Metadata Platform (XMP) (Adobe, 2015). Its schema makes use of Dublin Core elements as well as a wide range of unique elements that cover information about cameras and editing/creation software. XMP is a widely supported platform and also allows metadata to be created and embedded within files. Understanding this metadata and what is automatically created at the time of execution, whether it be creating a file, modifying it, or any other form of handling, allows us to differentiate a copy from an original.
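The following is a trimmed sketch of the kind of XMP packet that can be embedded in a digital image, combining Dublin Core elements with an edit-history list. The namespace URIs follow Adobe's published XMP conventions, but the property selection and values are illustrative assumptions rather than output taken from the tests.

<!-- Trimmed, illustrative XMP sketch; values are hypothetical. -->
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
        xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#">
      <dc:creator><rdf:Seq><rdf:li>Example Photographer</rdf:li></rdf:Seq></dc:creator>
      <xmp:CreatorTool>Adobe Photoshop CS4 Windows</xmp:CreatorTool>
      <!-- Edit history: one event per save, conversion, or derivation -->
      <xmpMM:History>
        <rdf:Seq>
          <rdf:li rdf:parseType="Resource">
            <stEvt:action>saved</stEvt:action>
            <stEvt:when>2010-02-11T21:59:05</stEvt:when>
            <stEvt:softwareAgent>Adobe Photoshop CS4 Windows</stEvt:softwareAgent>
          </rdf:li>
        </rdf:Seq>
      </xmpMM:History>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>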
Section 5.0 provides case study tests that show what happens to the metadata during various operations and compare the results to the original states. Because metadata are lost over time as migrations, preservation actions, and copies are made, it is important that our librarians and archivists are not only capturing the necessary metadata but also ensuring the default metadata are present. Without this differentiation, validity is compromised, as is any history of changes made to the digital object, which further compromises the file's accuracy and completeness. This is why preserving provenance should be a top priority, possibly even more so than the preservation of the digital object itself. High-level representations of these metadata will eventually be standardised and incorporated into the standard; further study is required to establish this list.
5.0 Case study
The following case studies consist of different file types being tested against themselves after various changes are made to them. The purpose of these tests is to show how vulnerable metadata are to changes and how easy it is to lose this precious data. They also identify the unique metadata that are present within different file types. As there is a vast number of different file types, the tests were kept simple, utilising only a few: JPEG, PDF, and DOCX were the file types tested. Given that some tests provided extensive lists of metadata, only key metadata were selected for review. A secondary purpose of these tests is to identify what is missing in terms of metadata. All metadata were extracted using ExifTool (Harvey, 2015), which proved valuable for this purpose: it was easy to use and offered various methods of output. A range of online metadata extractors were also tested, each providing similar results; however, ExifTool provided the best results. See Appendix A through to C for a subset example of the metadata extracted in the following tests. The tests consist of these steps:
• Test one
  o Testing a digital image and extracting its metadata.
  o Copying the image and testing the copy for any changes in metadata.
  o Uploading the image to social media, downloading it, and repeating the test.
• Test two
  o Testing an original digital image and extracting its metadata.
  o Modifying the image using photo editing software (Photoshop) and testing the metadata for any changes.
• Test three
  o Basic PDF metadata test to see what could be derived from a collection of PDF files.
• Test four
  o Testing a Microsoft Office Word file and extracting the metadata.
  o Converting to PDF through MS Office and testing.
  o Converting to PDF with Adobe and testing.
5.1 Image > Copy > Upload (Social Media)
Table 2 - Test one - Original photo copied and modified

Key Metadata
  Original: 265 Lines. Dates, Technical Metadata, Creator, Colour Data, Image attributes, Creation Tool, Camera data, Change and Software history.
  Copy: No change.
  Upload: 62 Lines. Technical Metadata, Creator, Colour data, Image attributes, Media specific data (a subset of the original metadata).

New Metadata
  Copy: No change.
  Upload: No change.

Lost Metadata
  Copy: No change.
  Upload: 203 Lines. Dates, Colour Data, Creation tool, Camera data, Change and Software history.
Figure 8 - Randomly selected digital image (JPEG)
5.2 Digital Image Modification Test (Photoshop)
Table 3 - Test two - Photoshop modification

Key Metadata
  Original: 178 Lines. Dates, Technical Metadata, Extensive camera data.
  Modified: 201 Lines. Technical Metadata, Reduced camera data, Photoshop data, Change and software history.

New Metadata
  Modified: Photoshop data, change and software history.

Lost Metadata
  Modified: Extensive camera data.
Figure 9 - Digital Image - Original > Photoshop
Modifications: Lighting balance and contrast tweaked, then changed to black and
white. Cropped.
5.3 PDF Default Metadata Test
Multiple PDF research papers were tested, each providing similar results. The main metadata present in each PDF file consisted of: PDF version data, author, creator (software), creation and modification dates, and basic XMP data. However, certain PDFs are missing most of this data, and the author is often merely the name of the computer account on which the PDF was created, frequently "Administrator" or "User". No bibliographic data was present in the PDF files tested. However, PDF is known to support XMP metadata embedding, so there is no reason for this data not to be present. Further testing of these files using bibliographic software such as EndNote and Zotero offered the ability to "Extract Metadata", which revealed the bibliographic data. This information is not extracted from the file but from an online database that stores it: the software searches for a unique identifier within the PDF and then crawls the web until it finds a match. This is by no means a reliable method, as quite often the metadata could not be extracted because the unique identifier was not present or correct. In fact, one extraction test revealed bibliographic metadata for an entirely different paper. Testing has shown that some of this information may be lost or diluted through numerous submissions and uploads, which is the nature of these papers. An original PDF created for this test lacked XMP data; however, some bibliographic data was present, such as keywords, title, and subject. Although still not adequate, this does show that such data is present in some cases; nevertheless, PDF metadata remains quite inconsistent. See Appendix B for the complete set of metadata. The following test further identifies the problems with PDF metadata in the conversion from the original format, e.g. Microsoft Word DOCX files.
5.4 Word to PDF - Word vs. Adobe conversion.
The following test was conducted using a newly created Word document with various modifications. On creation, before the Word document was opened, no metadata was present; in fact, trying to extract the XMP metadata produced an error. Once the file had been opened and the modifications made, the metadata extraction was successful. Microsoft Office documents are written in XML, with the document parts stored in a ZIP archive package; this packaging information was present in the metadata extraction and conveys data specific to Microsoft Office Word. Note that although this data is held in separate parts of the package rather than in the visible document content, the extraction was still performed on the Office file and still returned the data. Unique metadata is also present showing creation and modification dates, how many times the document has been edited, and the total edit time (how long it has been edited for). Every element within the document contributes metadata that is presented in the ZIP data, which also specifies the required version, compression details, file names, and the locations of images used within the document. It is clear that if the formatting of these documents were to be preserved, this metadata would be crucial; without it, much can be lost.
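As an illustration of the kind of metadata involved, the core properties part of an Office Open XML package reuses Dublin Core terms alongside package-specific fields (the total edit time sits in a separate extended properties part). The sketch below shows roughly what such a core properties part contains; the values are hypothetical.

<!-- Sketch of an OOXML core properties part (docProps/core.xml); values are hypothetical. -->
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Test Document</dc:title>
  <dc:creator>User</dc:creator>
  <cp:lastModifiedBy>User</cp:lastModifiedBy>
  <cp:revision>3</cp:revision>
  <dcterms:created xsi:type="dcterms:W3CDTF">2015-05-18T15:20:00Z</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2015-05-18T15:45:00Z</dcterms:modified>
</cp:coreProperties>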
Making a copy of the original file preserves the metadata, which has been the case for each test. There are various ways to convert a Word document into a PDF. One method uses the Microsoft Office Word software itself, by selecting "Publish" instead of "Save As", which creates a PDF. The other is to create an Adobe PDF; this can be done in a few different ways, one of which is right-clicking a Word document and selecting the option to convert it to an Adobe PDF. This method requires Windows and Adobe software to be installed. Depending on which method is used, the metadata varies. Using method one, all the metadata is stripped away and all that remains is basic PDF metadata. Method two also stripped away the metadata, but it created various XMP metadata. Both methods left the file with no real relevant metadata. A summary table follows.
The tests presented in this section show how delicate metadata can be and how easy it is to lose this data. They have also shown that there is a lot of good information that can be extracted from various files, but there is also a severe lack of metadata for preservation, searchability, and giving a file context.
The following table summarises the Word conversion test, which made use of a newly created Word document with random edits. Some of the metadata elements listed returned empty fields simply because they were not established in the document, such as the title and subject.
Table 4 - PDF conversion data

Key Metadata
  Original Metadata: 139 Lines. File Size, Dates, Basic information. ZIP data (see above for description). Title, Subject, Creator, Description, Keywords. Last user to modify, how many revisions, creation/modification date, total edit time. Template, pages, words, characters, lines, paragraphs, etc.
  Converted (MS Office): 25 Lines. File Size, Dates. PDF version, page count, language, author, creator (MS Office), create/modify dates.
  Converted (Adobe): 54 Lines. File Size, Dates. PDF version, page count, language, author, creator, create/modify dates, company, comments, subject, title, layout.

New Metadata
  Converted (MS Office): Basic PDF data.
  Converted (Adobe): A few extra elements and various XMP metadata including unique identifiers, mostly irrelevant to the user. New elements were mostly empty fields.

Lost Metadata
  Converted (MS Office): All MS Word metadata (formatting and bibliographic).
  Converted (Adobe): All MS Word metadata (formatting and all but a few bibliographic elements).
6.0 Discussion
To begin this discussion, let the focus be on the case study tests from the previous sections. The amount of metadata present for each test was too large to list every element; therefore only the key metadata were summarised.
With the importance of provenance established, one particular set of metadata stood out in these tests, and it was a set that was easily lost. These elements contain the history of the digital object, namely "History Action", "History When", "History Software Agent", and "History Parameters". These four elements provide the more detailed data; there are also accompanying elements that provide unique identifiers for them. Using the initial image from test one, the data present in the specified tags conveyed the following information:
• History Action - saved, saved, saved, saved, converted, derived, saved.
• History When - the first save was at 2010:02:11 21:59:05 and the last at 2010:02:11 22:12:01, with each action having its own timestamp.
• History Software Agent - Adobe Photoshop CS4 Windows for each action.
• History Parameters - converted from TIFF to JPEG.
Although this particular set of information is basic, the elements are still important in identifying what has happened to the digital file. In other cases this data may be more revealing, increasing its importance. The second test in fact only revealed this information once modifications had been made, which demonstrates the ability of photo editing software to embed this important metadata. The first test also showed how easy it is to lose this historical data once the file has been uploaded online.
To summarise, the tests clearly show the delicacy of metadata when taken through various processes. Making copies of the files was without risk and had no effect. Modifying the files resulted in the loss of metadata and the creation of new metadata. Uploading an image to social media such as Facebook heavily modified the metadata, removing most of it. In this case it makes sense to strip some of the metadata from uploaded files for security reasons: preventing random users on the internet from extracting personal information from these uploads is a justifiable reason.
However, certain metadata should remain, especially the history. With this information, it would be easier to distinguish real photos from fake ones. We have all come across images where it is impossible to know whether they are real or have been fabricated; seeing that an image has been edited in Photoshop numerous times would be a good indication that it is fake. This change would also increase the ability to differentiate copies from originals. Although detecting a copy of an original file made on the same machine as the original is difficult, a copy created on another machine, or one that has been through a series of web-based transmissions, should be clearly indicated in the metadata. The PDF tests revealed how bibliographic metadata are obtained via online databases. This method makes sense, as there will most likely be numerous copies of a single PDF in cyberspace, submitted to multiple venues and stored in various repositories, and ensuring each has the appropriate metadata is a difficult task. However, for long-term preservation and in certain preservation locations (archives), internet access may be a problem: certain archives may not have the capability to query online databases to generate the bibliographic metadata. Therefore, this metadata should be embedded within the file that is preserved, removing the reliance on an internet connection.
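A minimal sketch of what such embedded bibliographic metadata could look like, reusing the XMP packet structure and Dublin Core elements discussed earlier, is given below; the identifier and other values are hypothetical.

<!-- Hypothetical bibliographic description for embedding in a PDF's XMP packet. -->
<rdf:Description rdf:about=""
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example Paper Title</dc:title>
  <dc:creator><rdf:Seq><rdf:li>Example Author</rdf:li></rdf:Seq></dc:creator>
  <dc:identifier>doi:10.0000/example.0001</dc:identifier>
  <dc:subject>
    <rdf:Bag>
      <rdf:li>digital preservation</rdf:li>
      <rdf:li>metadata</rdf:li>
    </rdf:Bag>
  </dc:subject>
</rdf:Description>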
What these tests further establish is that no matter what metadata are automatically created, they are never enough to ensure preservation, searchability, and validity. Ideally, successful preservation should meet all three criteria, which is why having a global standard is a step towards achieving this goal. As the study has indicated, there are many standards, each helping to meet specific criteria, but none trying to cover all areas. PREMIS allows for preservation but disregards descriptive metadata, thereby disregarding a file's context and reducing its functionality. Dublin Core and similar standards provide that context; however, they are not perfect and do not by any means cover every aspect. Most of these standards allow for modification, customisation, and the addition of new elements. Each case, each file, and each location is going to require a different set of metadata. To make additions, one needs to determine which standard to follow, or look into the many standards that exist to see whether what is needed is already established. If we have one global standard that covers each type of metadata and uses a simple and consistent naming convention, then adding new elements will be easier for the user. This is why the representation and presentation of the metadata is important, and why efforts are made to present the standard in numerous ways that are both informative and easy to interpret. It is also why the metadata schema will follow the PREMIS XML schema and will incorporate any new elements, along with the Dublin Core elements, into that format.
Furthermore, obsolescence, degradation, and updates to newer formats are threats not only to digital objects but also to the standard itself. However, given the presentation of the standard, if XML were no longer to exist, the data dictionary would still be present, and the XML is simple enough to interpret and re-develop in a newly established language.
As shown, minor changes can have a major impact on metadata, so it should be no surprise that the preservation technique of migration is as risky as the literature has portrayed it to be. If we want to be complete, consistent, and concise, and to ensure validity, migration should not be the go-to method of preservation. The disadvantages of this method have clearly been established and proven throughout many years of study. Although it should not be forgotten, as it does serve some purpose, for long-term preservation emulation is the key.
With the new standard being suggested and developed, the format has now been established and an example of the Dublin Core elements combined with new additions has been created. However, emulation still needs to be addressed in terms of the new standard. A standard will not solve the problems with emulation, but it can help. Currently the standard does not address emulation-specific metadata (preservation metadata), but it will incorporate these elements once the new version of PREMIS is addressed and implemented. Each digital object, whether a single file or a package of some sort, will require a unique set of metadata to establish a successful environment for preservation. Certain file types may require a standard environment to run in; in that case version metadata will be the important factor. The standard will address how each of these elements should be recorded, which can be matched with how emulation software reads and interprets these fields. What the standard cannot do is establish which digital object requires which metadata; that is another standard in itself. Without some form of collaboration with emulation software and its developers, the best that can be done is to cover as much metadata as possible. Through enough time and amendments, the standard will grow and emulation software will advance, and with this we move closer to automating the whole process. For now, the purpose of the standard is to ensure newly created objects are fit for preservation, not only so they can be preserved, but so that their preservation has meaning. Giving these objects context allows various disciplines within research-based institutes to get much more out of the preserved objects, firstly by having the functionality to perform specific queries and then by having tailored information extracted.
There is still much more work to be done on the standard; this is addressed in Section 8.0, Future work.
With that aside, let the focus now be on us, the people. It is going to be a while before automation replaces us completely; until then, it is going to take much more than a standard to get the ball rolling on successful preservation. As mentioned earlier in this thesis, humans still perform the majority of the roles in digital preservation. This standard will give an overall, easy-to-understand list of what metadata a file needs, as well as optional metadata which can be just as essential. This will make training the individuals responsible for handling digital files easier, because we are no longer presenting them with a fragmented world of various different standards and methods. With this training, it is hoped that newly created or digitised objects will be established with a complete set of metadata that gives the object functionality and eventually leads it to long-term preservation.
7.0 Conclusion and Recommendations
The goal of this thesis was to bring the dark areas of digital preservation into the light, to raise awareness of its complexity and its importance. The contribution this thesis makes to the field of digital preservation starts by further revealing the issues the current preservation techniques face, especially the risks involved with migration. The issues with metadata have been revealed, including the human element, the computational elements, and the metadata standards. The fragmented and vast array of metadata standards was identified as a problem that needed solving. This thesis establishes the need to create a new global standard and begins the creation of its elements, structure, format, and presentation. This starts with introducing descriptive metadata elements, taken firstly from Dublin Core; new elements and modifications to the existing Dublin Core elements were then designed. For reasons specified in earlier sections, preservation metadata was excluded. The following elements were modified from the Dublin Core list:
• 1.2.3 RightsHolder (now a sub-element)
• 1.3 Accrual (now a high-level element)
• 1.8.15 MemberOf (now a sub-element)
• 1.9 Ownership (now a high-level element)
• 1.9.2 Publisher (now a sub-element)
  o 1.9.3 Provenance (now a sub-element of Ownership and a high-level element for 1.9.3.1 - 1.9.3.4)
The following elements were additions:
• 1.9.3.1 HistoryAction
• 1.9.3.2 HistoryWhen
• 1.9.3.3 HistoryAgent
• 1.9.3.4 HistoryParameters
• 1.15 Contact
  o 1.15.1 EmailAddress
  o 1.15.2 Phone
  o 1.15.3 Address
  o 1.15.4 URL
• 1.16 Identity
  o 1.16.1 First Name
  o 1.16.2 Last Name
  o 1.16.3 Middle Name
  o 1.16.4 Maiden Name
  o 1.16.5 D.O.B
  o 1.16.6 Y.O.D
Each element is subject to change, including the structure and order.
From this, existing metadata standards are strengthened by one another, forming one global standard that stands a much better chance of increasing the success of digital preservation. The standard will make things easier for both the human and computational elements, in terms of use and of the creation of preservation software. The standard will also aid existing software, especially if that software chooses to adapt to the new standard. The standard has been designed to grow, and as future research unveils new best practices, it can easily adapt, becoming stronger.
Digital preservation is a necessity; it always has been, and it is becoming increasingly so. Without it, there is much to lose. It has been established that the main method of preservation used over the years is not as effective as some would like to believe, especially as time progresses and digital file complexity grows with it. We should not be taking risks with our data when there are other methods that reduce this risk considerably, if not remove it entirely. Emulation has been established as the preferred and better suited method of preservation. By using emulation, we allow our digital objects to be rendered in the closest resemblance to their original environments, which in turn allows us to interact with these objects as they were designed to be used. However, emulation is not currently in a state where it can be achieved without great effort. Although we see and use emulation more regularly than we might think, for example in video game emulators, experimental simulation software, virtual machines, and various other software, tasking it with long-term digital preservation is a much harder task to accomplish. This method is heavily reliant on metadata, and therefore heavily reliant on the software that creates this metadata and on the humans who fill in the gaps where software cannot. Each part, both computer and human, plays its part well; however, both are subject to making errors. Although these elements are what make emulation, they are also what break it.
Given that metadata is the key, numerous standards have been created to establish lists of suitable metadata that will allow digital preservation to take place. The standards are meant to make things easier, and they do to some degree; however, there are so many that the field is rather fragmented and there are inconsistencies. Whenever two or more entities are doing the same thing, there are bound to be inconsistencies, and the nature of human error further increases the risk of them. Computational automation has been established as a necessity of great importance; however, this is also reliant on metadata. Emulation software may be able to extract all the required metadata from an item and then create the emulated environment, but the metadata has to be there in the first place. So there will always be a human in the middle, at least for the time being. Technology may advance to the point where all these issues are addressed, but for now this is not the case. Therefore we must work on ourselves as much as on the software we are using.
Having established that our position in preservation is not going anywhere anytime soon, we must address ourselves once again. As humans, we like things to be in one place and we love it when one size fits all; although a one-size-fits-all method cannot work in this case, we can definitely work towards it with the suggested global standard. By creating the global standard, we create one list of metadata elements using one naming convention and one set of formats: XML and numbered, data dictionary-structured lists. The suggested elements are by no means a complete set; in fact, the standard will never be complete, it will be ever growing. Each case where the standard is used will bring new additions or variations of what is already present, and given its simplicity, these changes can easily be made. The standard was never meant to be completed by just one person, but grown by the community. Awareness is the key and has been the intention of this thesis all along. We need to get on board now if we want to ensure our digital history and objects are safe for the future. The focus falls upon librarians and archivists, as we have to start somewhere. By ensuring our libraries and archives are performing safe and effective metadata creation, we can ready ourselves for the advancements in emulation and fight off digital obsolescence.
8.0 Future work
The seed has been planted, and as it grows there will be much more work to do. The standard itself must grow, addressing new metadata and the benefits of other standards. Once all the selected elements are brought in and analysed, the presentation of each of these elements must be brought into the new standard format. The new version of PREMIS will be incorporated into the global standard; insight into the draft version indicates that it will be tailored to address emulation, and not just the migration that its earlier revisions are believed to have focused on.
On top of this, a new repository and data dictionary must be created that include every element specified in the global standard. Currently, all of this data is held in different locations: the Dublin Core online metadata repository holds all the Dublin Core data, and the PREMIS website holds all the PREMIS data. Creating one online, interactive repository where every element can be accessed will make it far easier for users to search and understand each element without having to go from one location to another. Having everything in one location further increases the efficiency of managerial tasks.
Further research into metadata creation and extraction tools will be conducted. The XMP platform will need to be tested for its efficiency in creating metadata, as well as its simplicity, to evaluate its place within institutions such as libraries. Testing of environment-establishing software and models such as TOTEM will be conducted to determine the accuracy of the resulting suggested specifications. Identifying the similarities across the range of tools and their workflows will help in identifying a possible way to bring some of these tools together into a new meta-tool that accomplishes in one task what each tool accomplishes individually. Alternative options will also be addressed, such as creating a dashboard rather than new software. This would allow users to interact with a dashboard-style user interface backed by different software modules, making the task much easier for the user.
Advances in emulation software will be monitored carefully, specifically how such software initialises the emulation, how the metadata are entered, and what metadata are used to create the environments.
Once the standard is at a point where it can be used to establish an effective set of metadata for a predefined set of file formats, a training module will be discussed. Targeting university libraries, collaboration with librarians and researchers will take place to gain an understanding of how metadata is perceived in these areas and what must be addressed. Understanding the level of metadata knowledge will allow a tailored training module to be developed to educate and train staff in effective metadata management. The training will increase awareness of metadata and its importance. It will train users to identify what metadata are needed for long-term preservation and what can be created to add searchability and manageability. Training with an assortment of tools will also need to be provided to enable effective metadata extraction and creation. Through this training, it is hoped that the material that moves through the library will not only be secure for future preservation, but also be given functionality that students and researchers can make use of now. Some of the metadata elements contain extremely useful information depending on the discipline of the user, and this data should be utilised.
Lastly, a complete crosswalk may be created, mapping the new standard schema to those of the de facto standards for the elements that cross over. The crosswalk will be updated each time a new element is brought in or new elements are created to overwrite existing elements from other standards. This will show how each element relates to its counterpart and, in turn, will help users adjust to the new standard if they are familiar with the existing standards.
9.0 References
Adobe, 2015. Extensible Metadata Platform (XMP). Available at:
http://www.adobe.com/products/xmp.html [Accessed May 22, 2015].
Allasia, W., Bailer, W., Gordea, S. & Chang, W., 2014. A Novel Metadata Standard
for Multimedia Preservation. Proceedings of iPres. Available at:
http://www.joanneum.at/uploads/tx_publicationlibrary/BAW-
ipres2014_mpaf_cr_v4.pdf [Accessed February 11, 2015].
Anon, 2015. Exchangeable image file format. Wikipedia, the free encyclopedia.
Available at:
http://en.wikipedia.org/w/index.php?title=Exchangeable_image_file_format&ol
did=664942639 [Accessed June 12, 2015].
Anon, 2014. MPEG-7. Wikipedia, the free encyclopedia. Available at:
http://en.wikipedia.org/w/index.php?title=MPEG-7&oldid=609717303
[Accessed May 17, 2015].
Anon, Tumblr Photography Photo: Tumblr Photography. Available at:
http://www.fanpop.com/clubs/tumblr-
photography/images/29376619/title/tumblr-photography-photo [Accessed June
12, 2015].
Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A. & Hofman, H.,
2009. Systematic planning for digital preservation: evaluating potential
strategies and building preservation plans. International Journal on Digital
Libraries, 10(4), pp.133–157.
Becker, C., Kulovits, H., Kraxner, M., Gottardi, R., Rauber, A. & Welte, R., 2009.
Adding quality-awareness to evaluate migration web-services and remote
emulation for digital preservation. In Research and Advanced Technology for
Digital Libraries. Springer, pp. 39–50. Available at:
http://link.springer.com/chapter/10.1007/978-3-642-04346-8_6 [Accessed
February 10, 2015].
Bekaert, J., Hochstenbach, P. & Sompel, H. van de, 2003. Using MPEG-21 DIDL to
represent complex digital objects in the Los Alamos National Laboratory
Digital Library. D-Lib Magazine; 2003 [9] 11, 9(11).
Bern, 2003. The Long-term Preservation of Databases, ERPANET.
Brazier, C., 2013. born. digital@ british. library: the opportunities and challenges of
implementing a digital collection development strategy. Available at:
http://library.ifla.org/id/eprint/222 [Accessed February 4, 2015].
Chakrabarty, M., 2014. REQUISITE OF DIGITAL PRESERVATION FOR
SHIFTING OF PRINT MEDIA TO DIGITALISED ONE AND THE ROLE OF
LIBRARY PROFESSIONAL TO ENHANCE THE GROWTH. International
Journal of Library and Information Studies, 4(2). Available at:
http://www.ijlis.org/img/2014_Vol_4_Issue_2/81-88.pdf [Accessed February 4,
2015].
Chang, S.-F., Sikora, T. & Purl, A., 2001. Overview of the MPEG-7 standard.
Circuits and Systems for Video Technology, IEEE Transactions on, 11(6),
pp.688–695.
Dappert, A., 2013. Metadata for Preserving Computing Environments. The
Preservation of Complex Objects, p.63.
Dappert, A. & Peyrard, S., 2012. Describing Digital Object Environments in
PREMIS. In Proceedings of the 9th International Conference on Preservation
of Digital Objects (iPRES2012). Citeseer, pp. 69–76.
Dappert, A., Peyrard, S., Chou, C.C.H. & Delve, J., 2013. Describing and Preserving
Digital Object Environments. New Review of Information Networking, 18(2),
pp.106–173.
Dappert, A., Peyrard, S., Delve, J. & Chou, C.C.H., 2012. Describing Digital Object
Environments in PREMIS.
Day, M., 2003. Integrating metadata schema registries with digital preservation
systems to support interoperability: a proposal. In International Conference on
Dublin Core and Metadata Applications (DC-2003). University of Bath.
Available at: http://opus.bath.ac.uk/23599/ [Accessed March 26, 2014].
DCMI, 2015. DCMI Home: Dublin Core Metadata Initiative (DCMI). Available at:
http://dublincore.org/ [Accessed May 7, 2015].
Gartner, R. & Lavoie, B., 2013. Preservation Metadata (2nd Edition), Digital
Preservation Coalition. Available at:
http://www.dpconline.org/component/docman/doc_download/894-dpctw13-03
[Accessed February 13, 2015].
Groenewegen, D. & Treloar, A., 2013. Adding Value by Taking a National and
Institutional Approach to Research Data: The ANDS Experience. International
Journal of Digital Curation, 8(2), pp.89–98.
Harvey, P., 2015. ExifTool. Available at: http://owl.phy.queensu.ca/~phil/exiftool/
[Accessed May 25, 2015].
Van der Hoeven, J., Lohman, B. & Verdegem, R., 2007. Emulation for digital
preservation in practice: The results. International journal of digital curation,
2(2), pp.123–132.
Hutchins, M., 2012. Testing software tools of potential interest for digital
preservation activities at the national library of australia. National Library of
Australia Staff Papers.
Kirschenbaum, M.G., 2010. Digital forensics and born-digital content in cultural
heritage collections, Washington, D.C: Council on Library and Information
Resources.
Kirschenbaum, M., Lee, C.A., Woods, K., Chassanoff, A. & others, 2013. From
Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting
Institutions. Available at: http://drum.lib.umd.edu/handle/1903/14736
[Accessed February 9, 2015].
Lawrence, G.W., Kehoe, W.R., Rieger, O.Y., Walters, W.H. & Kenney, A.R., 2000.
Risk Management of Digital Information: A File Format Investigation., ERIC.
Lupovici, C. & Masanès, J., 2000. Metadata for long-term preservation. Biblioteque
Nationale de France, NEDLIB Consortium. Available at:
http://www.kb.nl/sites/default/files/docs/preservationmetadata.pdf [Accessed
February 13, 2015].
McDonough, J., 2013. A tangled web: Metadata and problems in game preservation.
Gaming environments and virtual worlds, 3, pp.49–62.
Nayak, C. & Singh, V.P., 2014. IMPACT OF DIGITAL LIBRARY AND
INFORMATION SERVICES: A USER’S PERSPECTIVE. e-Library Science
Research Journal, 2(5). Available at: http://lsrj.in/UploadedArticles/213.pdf
[Accessed February 4, 2015].
OCLC, 2010. Metadata Schema Transformation Services. Available at:
http://www.oclc.org/research/themes/data-
science/schematrans.html?urlm=168910 [Accessed May 27, 2015].
Pal, M.K., 2014. Preservation of Print and Digital Sources of Information: Problems
and Solutions. Opportunities and Challenges of the Institutional Library in
Rural Areas, p.122.
Pellegrino, J., 2014. A Multi-agent Based Digital Preservation Model. arXiv preprint
arXiv:1408.6126. Available at: http://arxiv.org/abs/1408.6126 [Accessed
February 9, 2015].
Phillips, M., Bailey, J., Goethals, A. & Owens, T., 2013. The NDSA Levels of
Digital Preservation: An Explanation and Uses. IS&T Archiving, Washington,
USA.
PREMIS, 2015. PREMIS. Preservation Metadata Maintenance Activity (Library of
Congress). Available at: http://www.loc.gov/standards/premis/ [Accessed May
7, 2014].
PREMIS, 2012. PREMIS Data Dictionary for Preservation Metadata. Available at:
http://www.loc.gov/standards/premis/v2/premis-2-2.pdf.
Rabinovici-Cohen, S., Marberg, J., Nagin, K. & Pease, D., 2013. PDS Cloud: Long
Term Digital Preservation in the Cloud. In IEEE, pp. 38–45. Available at:
http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6529266
[Accessed February 10, 2015].
Rahman, A. & Masud, M., 2014. Review on Metadata Management and
Applications. International Journal of Advanced Computing Research, 1.
Rechert, K., von Suchodoletz, D., Liebetraut, T., de Vries, D. & Steinke, T., 2014.
Design and Development of an Emulation-Driven Access System for Reading
Rooms. Archiving Conference, 2014(1), pp.126–131.
Rechert, K., Valizada, I., von Suchodoletz, D. & Latocha, J., 2012. bwFLA – A
Functional Approach to Digital Preservation. PIK - Praxis der
Informationsverarbeitung und Kommunikation, 35(4). Available at:
http://www.degruyter.com/view/j/piko-2012-35-issue-4/pik-2012-0044/pik-
2012-0044.xml [Accessed February 11, 2015].
Reich, V.A., 2012. LOCKSS: ensuring access through time. Ciência da Informação,
41(1). Available at:
http://revista.ibict.br/index.php/ciinf/article/viewArticle/2125 [Accessed
February 5, 2015].
Rimkus, K., Padilla, T., Popp, T. & Martin, G., 2014. Digital Preservation File
Format Policies of ARL Member Libraries: An Analysis. D-Lib Magazine,
20(3/4). Available at: http://www.dlib.org/dlib/march14/rimkus/03rimkus.html
[Accessed February 5, 2015].
Rousidis, D., Garoufallou, E., Balatsoukas, P. & Sicilia, M.-A., 2014. Data Quality
Issues and Content Analysis for Research Data Repositories: The Case of
Dryad. In Let’s Put Data to Use: Digital Scholarship for the Next Generation,
18th International Conference on Electronic Publishing, Thessaloniki, Greece.
Available at: http://elpub.scix.net/data/works/att/106_elpub2014.content.pdf
[Accessed February 4, 2015].
Routhier Perry, S., 2014. Digitization and Digital Preservation: A Review of the
Literature. SLIS Student Research Journal, 4(1), p.4.
Smith, J.R. & Schirling, P., 2006. Metadata standards roundup. IEEE MultiMedia,
13(2), pp.84–88.
Strodl, S., Petrov, P., Rauber, A. & others, 2011. Research on digital preservation
within projects co-funded by the European Union in the ICT programme.
Vienna University of Technology, Tech. Rep. Available at:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.227.9406&rep=rep1
&type=pdf [Accessed February 13, 2015].
Von Suchodoletz, D., Rechert, K., Valizada, I. & Strauch, A., 2013. Emulation as an
Alternative Preservation Strategy – Use-Cases, Tools and Lessons Learned. In
GI-Jahrestagung. pp. 592–606.
Swanson, T.P., 2014. PREMIS-Lite, a Preservation Metadata Generator. Available
at: http://digitalcommons.unl.edu/libphilprac/1078/ [Accessed February 9,
2015].
UNESCO, 2014. PERSIST: UNESCO Digital Strategy for Information
Sustainability. Available at: http://www.unesco.org/new/en/media-
services/single-
view/news/persist_unesco_digital_strategy_for_information_sustainability/#.V
W6Hss-qpBd [Accessed March 6, 2015].
Waugh, A., Wilkinson, R., Hills, B. & Dell’Oro, J., 2000. Preserving digital
information forever. In Proceedings of the fifth ACM conference on Digital
libraries. ACM, pp. 175–184. Available at:
http://dl.acm.org/citation.cfm?id=336659 [Accessed February 11, 2015].
Wheatley, P., 2004. Institutional repositories in the context of digital preservation.
Microform & imaging review, 33(3), pp.135–146.
Windnagel, A., 2014. The Usage of Simple Dublin Core Metadata in Digital Math
and Science Repositories. Journal of Library Metadata, 14(2), pp.77–102.
Appendix A
Test one data (subset). Note: some metadata has been removed for reasons of
privacy, relevance, and readability.
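
The listings in this appendix were produced with ExifTool (version 9.95, as shown
in the output). The following Python sketch is an illustration only of how
comparable grouped output can be captured; the exiftool binary being on the
system path is an assumption, and the file names are taken from the "File Name"
fields in the listings below.

import subprocess

def dump_metadata(path):
    # -a keeps duplicate tags; -g1 groups the output under headings such as
    # "---- System ----" and "---- XMP-dc ----", matching the listings below.
    result = subprocess.run(["exiftool", "-a", "-g1", path],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Assumed file names, copied from the "File Name" fields in the listings.
before = dump_metadata("Tumblr-Photography-tumblr-photography-29376619-500-333.jpg")
after = dump_metadata("11150759_10153185027526072_6864029197881280694_n.jpg")
print(before)
print(after)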
Before upload
---- ExifTool ----
ExifTool Version Number : 9.95
Warning : Non-standard header for APP1 XMP segment
---- System ----
File Name : Tumblr-Photography-tumblr-photography-29376619-500-333.jpg
File Size : 194 kB
File Modification Date/Time : 2015:05:18 15:20:37+09:30
File Access Date/Time : 2015:05:18 15:21:23+09:30
File Creation Date/Time : 2015:05:18 15:20:37+09:30
File Permissions : rw-rw-rw-
---- File ----
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
Current IPTC Digest : 5f9dfe3525f833a10ed0ded29c8039aa
Image Width : 500
Image Height : 333
Encoding Process : Baseline DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:4:4 (1 1)
---- JFIF ----
JFIF Version : 1.01
Resolution Unit : inches
X Resolution : 300
Y Resolution : 300
---- ICC-header ----
Profile CMM Type : Lino
Profile Version : 2.1.0
Profile Class : Display Device Profile
Color Space Data : RGB
Profile Connection Space : XYZ
Profile Date Time : 1998:02:09 06:49:00
Profile File Signature : acsp
Primary Platform : Microsoft Corporation
CMM Flags : Not Embedded, Independent
Device Manufacturer : IEC
Device Model : sRGB
Device Attributes : Reflective, Glossy, Positive, Color
Rendering Intent : Perceptual
Connection Space Illuminant : 0.9642 1 0.82491
Profile Creator : HP
Profile ID : 0
---- ICC_Profile ----
Profile Copyright : Copyright (c) 1998 Hewlett-Packard Company
---- IPTC ----
Application Record Version : 2
By-line : Design d15 Peter Zvonar
---- Photoshop ----
IPTC Digest : 5f9dfe3525f833a10ed0ded29c8039aa
X Resolution : 300
Displayed Units X : inches
Y Resolution : 300
Displayed Units Y : inches
Global Angle : 30
Global Altitude : 30
Version Info : Adobe Photoshop.Adobe Photoshop CS4.
Photoshop Quality : 11
Photoshop Format : Standard
Progressive Scans : 3 Scans
---- XMP-x ----
XMP Toolkit : Adobe XMP Core 4.2.2-c063 53.352624, 2008/07/30-18:12:18
---- XMP-tiff ----
Make : Canon
Camera Model Name : Canon EOS 40D
Image Width : 3888
Image Height : 2592
Samples Per Pixel : 3
Photometric Interpretation : RGB
X Resolution : 300
Y Resolution : 300
Resolution Unit : inches
Compression : LZW
Planar Configuration : Chunky
Orientation : Horizontal (normal)
---- XMP-exif ----
Flash Fired : False
Flash Return : No return detection
Flash Mode : Off
Flash Function : False
Flash Red Eye Mode : False
---- XMP-xmp ----
Modify Date : 2010:02:11 22:12:01+01:00
Create Date : 2009:10:08 17:22:51.00+02:00
Creator Tool : Adobe Photoshop Lightroom
Metadata Date : 2010:02:11 22:12:01+01:00
---- XMP-dc ----
Creator : Design d15 Peter Zvonar
---- XMP-aux ----
Serial Number : 530122237
Lens Info : 17-40mm f/?
Lens : EF17-40mm f/4L USM
---- XMP-xmpMM ----
Instance ID : xmp.iid:E61975FE5117DF11B9C9C9C6636E2ED3
Document ID : xmp.did:AF90595E4F17DF11B9C9C9C6636E2ED3
Original Document ID : xmp.did:AF90595E4F17DF11B9C9C9C6636E2ED3
History Action : saved, saved, saved, saved, converted, derived, saved
History Instance ID : xmp.iid:AF90595E4F17DF11B9C9C9C6636E2ED3,
xmp.iid:B090595E4F17DF11B9C9C9C6636E2ED3, xmp.iid:B190595E4F17DF11B9C9C9C6636E2ED3,
xmp.iid:E51975FE5117DF11B9C9C9C6636E2ED3, xmp.iid:E61975FE5117DF11B9C9C9C6636E2ED3
History When : 2010:02:11 21:59:05+01:00, 2010:02:11 22:03:41+01:00, 2010:02:11
22:03:41+01:00, 2010:02:11 22:12:01+01:00, 2010:02:11 22:12:01+01:00
History Software Agent : Adobe Photoshop CS4 Windows, Adobe Photoshop CS4 Windows, Adobe
Photoshop CS4 Windows, Adobe Photoshop CS4 Windows, Adobe Photoshop CS4 Windows
History Changed : /, /, /, /, /
History Parameters : from image/tiff to image/jpeg, converted from image/tiff to image/jpeg
---- Composite ----
Aperture : 10.0
Flash : Off, Did not fire
Image Size : 500x333
Megapixels : 0.167
Scale Factor To 35 mm Equivalent: 1.6
Shutter Speed : 1/60
Circle Of Confusion : 0.019 mm
Field Of View : 66.4 deg
Focal Length : 17.0 mm (35 mm equivalent: 27.5 mm)
Hyperfocal Distance : 1.56 m
Light Value : 11.6
After upload to social media
---- ExifTool ----
ExifTool Version Number : 9.95
---- System ----
File Name : 11150759_10153185027526072_6864029197881280694_n.jpg
File Size : 53 kB
File Modification Date/Time : 2015:05:18 16:40:05+09:30
File Access Date/Time : 2015:05:18 16:40:09+09:30
File Creation Date/Time : 2015:05:18 16:40:05+09:30
File Permissions : rw-rw-rw-
---- File ----
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
Current IPTC Digest : 57f7ae01408bc07c864d532fa0a3fc23
Image Width : 500
Image Height : 333
Encoding Process : Progressive DCT, Huffman coding
Bits Per Sample : 8
Color Components : 3
Y Cb Cr Sub Sampling : YCbCr4:2:0 (2 2)
---- JFIF ----
JFIF Version : 1.02
Resolution Unit : None
X Resolution : 1
Y Resolution : 1
---- IPTC ----
By-line : Design d15 Peter Zvonar
Original Transmission Reference : 9z1dwMYf0SxVlhrhPWhZ
Special Instructions :
FBMD01000adb030000750e00002a230000ba260000472a00003c4c0000527b0000827d00009a830000ee880000
71d30000
---- ICC-header ----
Profile CMM Type : lcms
Profile Version : 2.1.0
Profile Class : Display Device Profile
Color Space Data : RGB
Profile Connection Space : XYZ
Profile Date Time : 2012:01:25 03:41:57
Profile File Signature : acsp
Primary Platform : Apple Computer Inc.
CMM Flags : Not Embedded, Independent
Device Attributes : Reflective, Glossy, Positive, Color
Rendering Intent : Perceptual
Connection Space Illuminant : 0.9642 1 0.82491
Profile Creator : lcms
---- ICC_Profile ----
Profile Copyright : FB
---- Composite ----
Image Size : 500x333
Megapixels : 0.167
Appendix B
Test three PDF metadata example. Note: Test two is excluded because of the large
amount of metadata it contains, which is similar to the test one metadata and
consists mainly of camera and Photoshop specifics.
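
Because the two PDF reports below share only some fields, it can help to capture
the same tags as structured data and compare them programmatically. This is a
minimal sketch under the assumption that exiftool is on the system path; the two
file names are taken from the "File Name" fields below.

import json
import subprocess

files = ["bitstreams-to-heritage.pdf", "PaperIJACR02-04-14.pdf"]
# -json emits one JSON object per input file; -G1 prefixes each tag with its
# group name, mirroring the "---- PDF ----" / "---- XMP-dc ----" headings.
raw = subprocess.run(["exiftool", "-json", "-G1"] + files,
                     capture_output=True, text=True, check=True).stdout
pdf_one, pdf_two = json.loads(raw)

# Tag names present in one report but missing from the other.
print("Only in PDF one:", sorted(set(pdf_one) - set(pdf_two)))
print("Only in PDF two:", sorted(set(pdf_two) - set(pdf_one)))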
PDF one
---- ExifTool ----
ExifTool Version Number : 9.95
---- System ----
File Name : bitstreams-to-heritage.pdf
Directory : -----
File Size : 998 kB
File Modification Date/Time : 2014:03:19 15:50:04+10:30
File Access Date/Time : 2015:05:18 17:06:33+09:30
File Creation Date/Time : 2015:05:18 17:06:32+09:30
File Permissions : rw-rw-rw-
---- File ----
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
---- PDF ----
PDF Version : 1.5
Linearized : Yes
Author : Rex
Create Date : 2013:11:13 08:43:21-05:00
Creator : Adobe InDesign CS6 (Macintosh)
Modify Date : 2013:11:13 08:43:21-05:00
Producer : Acrobat Distiller 11.0 (Macintosh)
Title : 65845 Bitcurator Book Justified.indd
Page Count : 44
---- XMP-x ----
XMP Toolkit : Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03
---- XMP-xmp ----
Create Date : 2013:11:13 08:43:21-05:00
Modify Date : 2013:11:13 08:43:21-05:00
Creator Tool : Adobe InDesign CS6 (Macintosh)
---- XMP-pdf ----
Producer : Acrobat Distiller 11.0 (Macintosh)
---- XMP-dc ----
Format : application/pdf
Creator : Rex
Title : 65845 Bitcurator Book Justified.indd
---- XMP-xmpMM ----
Document ID : uuid:03cdeea2-9d6e-a043-b864-7eaca94dcc18
Instance ID : uuid:c5974481-4f83-2a43-8144-4cb796bccc57
PDF two
---- ExifTool ----
ExifTool Version Number : 9.95
---- System ----
File Name : PaperIJACR02-04-14.pdf
Directory :
File Size : 235 kB
File Modification Date/Time : 2014:12:10 14:52:10+10:30
File Access Date/Time : 2015:05:18 17:07:19+09:30
File Creation Date/Time : 2015:05:18 17:07:17+09:30
File Permissions : rw-rw-rw-
---- File ----
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
---- PDF ----
PDF Version : 1.5
Linearized : No
Page Count : 8
Language : en-CA
Tagged PDF : Yes
Title :
Author :
Subject : IEEE Transactions on Magnetics
Creator : Microsoft® Office Word 2007
Create Date : 2014:05:19 21:42:23+03:00
Modify Date : 2014:05:19 21:42:23+03:00
Producer : Microsoft® Office Word 2007
Appendix C
Test four. Note: the following displays a subset of the original DOCX file's ZIP
data and bibliographic/descriptive data, compared with the metadata produced by
each conversion.
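
The Dublin Core and core-property fields shown below (Title, Subject, Creator,
Last Modified By, and the create/modify dates) are stored inside the DOCX ZIP
container in docProps/core.xml, while most of the remaining XML fields (Application,
Pages, Words, and so on) come from docProps/app.xml. The following Python sketch
illustrates reading the core properties directly from the package; the file name
exiftest.docx is an assumption inferred from the converted file names later in
this appendix.

import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Assumed source file name for test four.
with zipfile.ZipFile("exiftest.docx") as docx:
    core = ET.fromstring(docx.read("docProps/core.xml"))

# These elements correspond to the Title/Subject/Creator and date fields below.
for tag in ("dc:title", "dc:subject", "dc:creator",
            "cp:lastModifiedBy", "dcterms:created", "dcterms:modified"):
    element = core.find(tag, NS)
    print(tag, ":", element.text if element is not None and element.text else "")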
Word DOCX metadata (subset)
---- ZIP ---- (subset example)
Zip Required Version : 20
Zip Bit Flag : 0x0006
Zip Compression : Deflated
Zip Modify Date : 1980:01:01 00:00:00
Zip CRC : 0x75ef191e
Zip Compressed Size : 371
Zip Uncompressed Size : 1364
Zip File Name : [Content_Types].xml
Zip Required Version : 20
Zip Bit Flag : 0x0006
Zip Compression : Deflated
Zip Modify Date : 1980:01:01 00:00:00
Zip CRC : 0xb71a911e
Zip Compressed Size : 243
Zip Uncompressed Size : 590
---- XMP-dc ----
Title :
Subject :
Creator : user
Description :
---- XML ----
Keywords :
Last Modified By : user
Revision Number : 2
Create Date : 2015:05:20 06:38:00Z
Modify Date : 2015:05:20 06:39:00Z
Template : Normal.dotm
Total Edit Time : 1 minute
Pages : 1
Words : 13
Characters : 79
Application : Microsoft Office Word
Doc Security : None
Lines : 1
Paragraphs : 1
Scale Crop : No
Links Up To Date : No
Characters With Spaces : 91
Shared Doc : No
Hyperlinks Changed : No
App Version : 12.0000
Converted to PDF with MS Office Word
---- ExifTool ----
ExifTool Version Number : 9.95
---- System ----
File Name : exiftest - WtoPDF.pdf
Directory : -----
File Size : 416 kB
File Modification Date/Time : 2015:05:20 16:14:28+09:30
File Access Date/Time : 2015:05:20 16:14:34+09:30
File Creation Date/Time : 2015:05:20 16:14:34+09:30
File Permissions : rw-rw-rw-
---- File ----
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
---- PDF ----
PDF Version : 1.5
Linearized : No
Page Count : 1
Language : en-AU
Tagged PDF : Yes
Author : user
Creator : Microsoft® Office Word 2007
Create Date : 2015:05:20 16:14:28+09:30
Modify Date : 2015:05:20 16:14:28+09:30
Producer : Microsoft® Office Word 2007
Converted to Adobe PDF
---- ExifTool ----
ExifTool Version Number : 9.95
---- System ----
File Name : exiftest - convert.pdf
Directory : -----
File Size : 835 kB
File Modification Date/Time : 2015:05:20 16:15:55+09:30
File Access Date/Time : 2015:05:20 16:16:01+09:30
File Creation Date/Time : 2015:05:20 16:16:01+09:30
File Permissions : rw-rw-rw-
---- File ----
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
---- PDF ----
PDF Version : 1.5
Linearized : Yes
Author : user
Comments :
Company :
Create Date : 2015:05:20 16:15:52+09:30
Creator : Acrobat PDFMaker 10.1 for Word
Modify Date : 2015:05:20 16:15:55+09:30
Producer : Adobe PDF Library 10.0
Source Modified : D:20150520063923
Subject :
Title :
Tagged PDF : Yes
Page Layout : OneColumn
Page Count : 1
---- XMP-x ----
XMP Toolkit : Adobe XMP Core 5.2-c001 63.139439, 2010/09/27-13:37:26
---- XMP-xmp ----
Modify Date : 2015:05:20 16:15:55+09:30
Create Date : 2015:05:20 16:15:52+09:30
Metadata Date : 2015:05:20 16:15:55+09:30
Creator Tool : Acrobat PDFMaker 10.1 for Word
---- XMP-xmpMM ----
Document ID : uuid:49d1265d-db99-4c1c-bcea-96e2a243aba7
Instance ID : uuid:0828849b-f640-4890-a1f7-9e2aad23ba36
Subject : 2
---- XMP-dc ----
Format : application/pdf
Title :
Description :
Creator : user
---- XMP-pdf ----
Producer : Adobe PDF Library 10.0
Keywords :
---- XMP-pdfx ----
Source Modified : D:20150520063923
Company :
Comments :