Architecting an extensible digital repository
ABSTRACT The digital collection and archives (DCA) in partnership with Academic Technology (AT) at Tufts University developed a digital library solution for long-term storage and integration of existing digital collections, such as Perseus, TUSK, Bolles and Artifact. We describe the Tufts digital library (TDL) architecture. TDL is an extensible, modular, flexible and scalable architecture that uses Fedora at its core. The extensible nature of the TDL architecture allows for seamless integration of collections that may be developed in the future, while leveraging the extensive tools that are available as part of individual digital library applications at Tufts. We describe the functionality and implementation details of the individual components of TDL. Two applications that have successfully interfaced with TDL are presented. We conclude with some remarks about the future development of TDL.
Architecting an Extensible Digital Repository
Anoop Kumar, Ranjani Saigal
Academic Technology
Tufts University
16 Dearborn Road
Medford, MA 02155
{anoop.kumar,ranjani.saigal}@tufts.edu
Robert Chavez
Digital Collections and Archives
Tufts University
Tisch Library-35 Professors
Medford, MA 02155
Robert.chavez@tufts.edu
Nikolai Schwertner
Department of Computer Science
Tufts University
161 College Ave.
Medford, MA 02155
nikolai.schwertner@tufts.edu
ABSTRACT
The Digital Collection and Archives (DCA) in partnership with
Academic Technology (AT) at Tufts University developed a
digital library solution for long-term storage and integration of
existing digital collections, like Perseus, TUSK, Bolles and
Artifact. In this paper, we describe the Tufts Digital Library
(TDL) architecture. TDL is an extensible, modular, flexible and
scalable architecture that uses FEDORA at its core. The
extensible nature of the TDL architecture allows for seamless
integration of collections that may be developed in the future,
while leveraging the extensive tools that are available as part of
individual digital library applications at Tufts. We describe the
functionality and implementation details of the individual
components of TDL. Two applications that have successfully
interfaced with TDL are presented. We conclude with some
remarks about the future development of TDL.
Categories and Subject Descriptors
H.3.7 [Digital Libraries]: Collection, Dissemination, Standards,
System Issues, User Issues.
General Terms
Management, Performance, Design, Security.
Keywords
Digital Library, preservation, FEDORA, VUE.
1. Introduction
During the past decade colleges and universities have witnessed
an exponential growth in digital information available for
teaching and learning. There are many collections of digital
objects, including images, texts, audio and video, that have great
value in a diverse set of fields. As the quantity of information
continues to increase and these collections expand, there is a need
for a repository that can provide appropriate storage and access to
all this valuable material in a flexible and extensible manner for
the foreseeable future. This need has led many organizations to
select a digital library solution that can assimilate the current
collections and accommodate new materials as they become
available.
The Digital Collection and Archives (DCA) in partnership with
Academic Technology (AT) at Tufts University has developed a
digital library solution that provides for long-term storage and
integration of existing digital collections, while leveraging the
extensive tools developed by individual digital library projects at
Tufts. The Digital Library system developed to serve the needs at
Tufts builds on and extends successful models that are currently
in vogue in the digital library world. This paper describes the
architecture of the Tufts Digital Library (TDL) which is designed
to allow assimilation and interoperability of existing Tufts digital
libraries while allowing new creators of digital materials to add
their content and write new applications for using and managing
the material.
The paper also describes two applications that successfully
interface with this architecture to use the content of the
repository: the Visual Understanding Environment (VUE), a concept
mapping application, and Tufts Digital Library Search.
2. Related Work
The architecture for TDL incorporates the concepts described in
the emerging standards for trusted digital repositories [5], and
complies with the Reference Model for an Open Archival
Information System (OAIS) functional and information models
for archival information systems [6]. On a practical level this
means that we have applied the principles of the “trusted digital
repository” and OAIS guidelines in the arrangement of our system
architecture [6], matching requirements to system services.
The format used to assign persistent IDs, or Uniform Resource
Names (URNs), to objects across digital collections follows the
standard specified in the RFCs on URNs [8-11]. OCLC’s Persistent
Uniform Resource Locator (PURL) service was used as a basis to
create a “Naming Service” that creates and resolves URNs.
The implementation of FEDORA [1] by the University of Virginia
forms the core of the TDL architecture. It provides the
“plumbing” or framework for all the components and the services
in the architecture. Harvard’s LDI [12] uses a model that clearly
separates the collections infrastructure, access infrastructure and
common services to support their large and unusually
decentralized library system. This modular approach has been
used as a basis to develop the TDL component architecture. We also
used the ideas of interoperability, scalability and digital
preservation that form the core of the Making of America II [13]
project.
The abstract model that represents digital objects is drawn from
Lagoze’s Warwick Framework [3] and Kahn and Wilensky’s
Framework for Distributed Digital Objects [20]. The model was
developed taking into consideration the Digital Repository (DR)
Interface proposed by MIT’s Open Knowledge Initiative (OKI)
[7]. The objects in the repository can be easily accessed and
managed by applications that support the OKI-DR interface.
TDL uses the Dublin Core [21] metadata set for storing fields such
as author, title and subject. Administrative metadata, gathered
through the METS XML file, is used for processing FEDORA
objects. Metadata is also acquired and managed as
“Datastreams”, a concept proposed in the FEDORA object
model.
The TDL search application uses Lucene [21], which is part of the
Jakarta project. It is an open source search engine that provides
full text based search, metadata search and advanced search
features.
3. Designing the Tufts Digital Library
3.1 Need for a New Architecture
Three major discipline-specific digital library applications have
been developed at Tufts University. Perseus [15], a digital library
in the area of classics, has over a million objects and receives
about 8 million page hits a month. The collection includes
multilingual marked-up texts, images, audio and video. The
Tufts University Science Knowledgebase, or TUSK [17], is a
repository of materials specific to the health sciences field. It is
widely used by the Medical, Veterinary and Dental schools for
teaching a variety of courses. Artifact is a collection of about
2500 images that is used to teach courses in Art History [18].
Tufts also has two substantial collections of digital materials: the
Bolles Collection [18], a collection of unique historical maps of
London, and Crime and Punishment [19], a collection of videos,
images and cases used for simulations by Political Science
faculty. Table 1 provides details about these collections.
Table 1. Digital Libraries at Tufts
Digital Library | Size | Description
Perseus | 50 million words | The subset of Perseus Project Classics collection data that the Tufts Digital Library is working with is composed primarily of highly structured TEI encoded XML texts of many types, including various forms of Lexica, Grammars, Encyclopedias, ancient and modern language texts.
Bolles Collection | 13 million words, 25,000 images, geospatial datasets and multimedia objects | The Bolles collection contains highly structured TEI encoded XML texts, PDF documents, high resolution TIFF images, QuickTime Virtual Reality files.
TUSK | 15,000 documents, 125 courses (approximately) | Tufts University Science Knowledgebase (TUSK) contains full-text syllabi, digital slide images, lecture recordings (audio and video) and notes, exam questions, evaluation forms, bibliographies linked to full-text articles, and other resources made available by the faculty of Tufts University.
Artifact | 2500 images | Artifact contains over 2500 images and corresponding data, with links to the Art History slide collection database containing 120,000 entries. It integrates on-demand viewing and searching with Internet-based adaptations of traditional learning aids, such as flashcards, for review and study.
Crime and Punishment | 400 images and videos | The repository contains images in gif format and videos in mov format to support simulations used in Political Science courses.
Perseus, Artifact and TUSK have an extensive set of tools
associated with them that allow users to access content in a
manner that is most suited to their discipline-specific needs. The
collections are continuously expanding, adding content in a variety
of formats. The current architecture of these libraries is not built
to accommodate such expansion. There is a need for
a new digital library architecture that is modular, scalable and
economically viable. The architecture should allow for
persistence of objects across collections and reusability of content
by multiple applications.
While having a centralized university-wide repository application
is an attractive proposition, the diverse classes of digital objects
represented in the various collections pose several challenges.
The repository must be able to ingest and manage diverse
materials and the ingestion and management processes must be
able to scale to handle large volumes of content. In addition to
the preservation of the objects, the repository architecture must be
flexible enough to provide the appropriate hooks so that we can
design services that are capable of delivering the content in a
user-friendly manner to different user communities. For example,
digital objects from the Perseus project Classics collection that are
stored in the repository need to be disseminated through complex
language tools developed by the Perseus project that link syntax,
grammar, and references to particular people and things across the
entire collection [4]. On the other hand, digital objects from the
TUSK collection require a completely different kind of
dissemination, one that resembles a courseware environment.
The modular and interoperable nature of the TDL architecture
allows tools developed within one application, such as
Perseus, to be used on objects that belong to another application,
such as TUSK. This makes TDL a powerful architecture that can
effectively use tools that have been developed by disparate
applications.
3.2 System Specifications
A modular system that meets the necessary functional
requirements of a long-term digital repository was considered
the most suitable architecture for TDL. It had to be flexible and
extensible enough to meet the diverse storage and access needs of
data providers and application builders within the Tufts
community and in the potential federated digital library
community. Principles of the “trusted digital repository” [5] and
OAIS guidelines were applied in the arrangement of the system
architecture. Table 2 details the OAIS requirements along with
the matching system services.
Table 2. Requirements and system services.
Requirements | System Services
Unique and persistent identification of materials | Naming Service
Use of Archival Information Packages (AIP) | Digital Object Provider (DOP) Service
Use of Submission Information Packages (SIP) | Drop Box, Ingestion Service
Use of Dissemination Information Packages (DIP) | DOP Service
Authentication and integrity checking | DOP Service
Dissemination | Disseminators, Caching Service, Digital Library Application, Search Service
Access | Search Service and other applications
4. TDL Architecture
An architecture made of loosely coupled modular services
emerged as the solution best suited to create a flexible,
extensible and scalable digital library that could subsume
our current digital libraries while allowing for future
applications that are yet to be determined. This model extends
the framework provided by the UVA implementation of FEDORA,
which forms the core of TDL.
The architecture addresses scalability by defining a number of
logical units and their relations in the context of the digital
library. HTTP/HTTPS was chosen as the communication protocol
between these units. This choice allows the use of a wide array
of server tools in the implementation of each service, with the
prospect of using the Internet as the transport layer. Scalability
was the main motivation for minimizing the lines of dependency
between the services in the model.
Figure 1. TDL Architecture
Figure 1 shows the component services that comprise the TDL
architecture.
TDL was explicitly designed to facilitate the business processes
that are associated with the creation and use of the library. The
design of the component services was done in conjunction with
the design of the business processes associated with each service.
Each component was designed to effectively support the
corresponding business process and interface appropriately with
other components.
The architecture comprises five basic services.
• Drop Box and Ingestion Service provides a conduit for
objects to be uploaded into TDL. It validates and tags
the objects as part of the preprocessing and then ingests them.
• Naming Service creates a unique persistent identifier,
the Uniform Resource Name (URN), for the
object. The service also resolves URNs.
• FEDORA Repository Service provides management of
and access to named digital objects.
• Indexing and Search Service indexes the digital
objects and provides a search mechanism.
• Application Creation Service provides a mechanism
for external applications to interface with the repository.
4.1 Drop Box and Ingestion Service
The “Drop Box”, as the name suggests, is a location where users
can place digital objects that need to reside in TDL. It provides
temporary data storage during the pre-processing phase. The drop
box contains a template file provided by the archivists. The
template file has basic metadata that is associated with all
objects. The drop box allows additional metadata to be associated
with the objects. The drop box also tests the validity of object
types.
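The two Drop Box responsibilities described above, combining template metadata with object-specific metadata and checking object types, can be sketched as follows. This is only an illustration: the metadata field names, the list of accepted file extensions and both function names are assumptions, not part of the TDL implementation.

```python
from pathlib import PurePath

# Hypothetical template metadata supplied by archivists (field names are assumptions).
TEMPLATE_METADATA = {"publisher": "Tufts University", "language": "en"}

# Hypothetical list of accepted file extensions; the paper does not specify one.
ACCEPTED_EXTENSIONS = {".xml", ".tif", ".gif", ".jpg", ".pdf", ".mov"}


def is_valid_object_type(filename: str) -> bool:
    """Return True if the file extension is on the accepted list (case-insensitive)."""
    return PurePath(filename).suffix.lower() in ACCEPTED_EXTENSIONS


def build_metadata(template: dict, additional: dict) -> dict:
    """Combine template metadata with object-specific metadata.

    Object-specific values override template values when keys collide.
    The template itself is left unmodified.
    """
    merged = dict(template)
    merged.update(additional)
    return merged
```

In this sketch the template is copied before merging, so the archivists' template can be reused for every object placed in the drop box.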
The Ingestion Service automatically collects the objects from the
Drop Box. It validates the FEDORA object schema and waits for
archivists to perform content quality review before approving or
rejecting objects based on archival standards. It calls the Naming
Service to obtain a URN for approved objects. It takes the
content, binds it with the associated metadata and prepares the
METS object, which is then ingested into FEDORA. It gets the
PID from FEDORA and calls the Naming Service to associate
PIDs with URNs. Finally, it informs the contributors about the
success or failure of the attempt to ingest the object.
Figure 2. Ingestion Service
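The order of the ingestion steps described above can be sketched as a small pipeline. The classes below are in-memory stand-ins written for illustration only: the names, the simplified URN layout and the dictionary used in place of a METS object are all assumptions and do not reflect the actual TDL or FEDORA interfaces.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class DropBoxObject:
    name: str
    content: bytes
    metadata: dict
    schema_valid: bool = True  # outcome of FEDORA object schema validation
    approved: bool = True      # outcome of the archivists' quality review


class InMemoryNamingService:
    """Stand-in for the Naming Service: issues URNs and records URN-to-PID bindings."""

    def __init__(self):
        self.bindings = {}

    def create_urn(self, obj):
        return f"tufts:default:{obj.name}"  # simplified, hypothetical URN layout

    def bind(self, urn, pid):
        self.bindings[urn] = pid


class InMemoryRepository:
    """Stand-in for FEDORA: stores a packaged object and returns a 'string:number' PID."""

    def __init__(self):
        self._ids = count(1)
        self.objects = {}

    def ingest(self, package):
        pid = f"tufts:{next(self._ids)}"
        self.objects[pid] = package
        return pid


def ingest(obj, naming, repository):
    """Run one object through the steps in the order given in the text."""
    if not obj.schema_valid:
        return f"failure: {obj.name} failed schema validation"
    if not obj.approved:
        return f"failure: {obj.name} rejected in quality review"
    urn = naming.create_urn(obj)
    # A plain dict stands in for the METS object that binds content and metadata.
    package = {"urn": urn, "content": obj.content, "metadata": obj.metadata}
    pid = repository.ingest(package)
    naming.bind(urn, pid)
    return f"success: {urn} -> {pid}"
```

The returned string plays the role of the notification sent to contributors. A rejected object never reaches the repository, so no URN or PID is allocated for it.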
4.2 Naming Service
FEDORA provides a very limited system for referencing objects.
Every object in FEDORA is assigned a PID (Persistent Identifier)
in the format: “string:number”. This makes it difficult to track
and reference objects uniquely across collections. Furthermore,
objects may move between FEDORA servers creating a need for
an identifier that is uniquely associated with the object,
independent of the repository in which it resides. The Naming
Service creates a URN. It also creates a binding between the URN
and the FEDORA PID and provides a resolution service to locate
the object.
The convention developed for the TDL URN is as follows:
tufts:school name:owner:[collection:]item name.
The first field of the URN created through this service is always
tufts. The second field is ‘project name’ which is unique for any
project registered through this service. If an object is not
associated with a project, it is allocated to the default project.
‘collection’ is an optional field provided by the projects.
Collection helps further classify repositories in a project. ‘item
name’ can be provided by the project/repository
owners/contributors or it will be created by the service. The URN
formed by combining these four fields is guaranteed to be unique.
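A minimal sketch of a naming service along these lines is shown below. It follows the field description in the text (the constant tufts prefix, a project, an optional collection and an item name). The class, its method names, the in-memory dictionary and the use of a random hexadecimal string for service-generated item names are assumptions made for illustration.

```python
import uuid


class NamingService:
    """Illustrative URN registry: creates URNs, binds them to PIDs and resolves them."""

    def __init__(self):
        self._urn_to_pid = {}

    def create_urn(self, project="default", collection=None, item_name=None):
        # If no item name is supplied, the service generates one (assumed scheme).
        item = item_name or uuid.uuid4().hex
        parts = ["tufts", project]
        if collection:
            parts.append(collection)
        parts.append(item)
        urn = ":".join(parts)
        if urn in self._urn_to_pid:
            raise ValueError(f"URN already registered: {urn}")
        self._urn_to_pid[urn] = None  # reserved, not yet bound to a PID
        return urn

    def bind(self, urn, pid):
        """Associate a previously created URN with a FEDORA PID."""
        if urn not in self._urn_to_pid:
            raise KeyError(f"unknown URN: {urn}")
        self._urn_to_pid[urn] = pid

    def resolve(self, urn):
        """Return the PID bound to the URN (None if it is not bound yet)."""
        return self._urn_to_pid[urn]
```

Because the binding is kept separately from the repository, an object that moves to another FEDORA server only needs its URN rebound to the new PID; references that use the URN stay valid.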
4.3 FEDORA Repository Service
The FEDORA Repository Service forms the core of TDL. The
key features of the architecture are: (1) support for heterogeneous
data types; (2) accommodation of new types as they emerge; (3)
aggregation of mixed, possibly distributed, data into complex
objects; (4) the ability to specify multiple content disseminations
of these objects; and (5) the ability to associate rights
management schemes with these disseminations.
The following sections describe TDL’s implementation of the
repository model, objects, behaviors and disseminators and
custom modifications.
4.3.1 Repository Model
TDL’s implementation of FEDORA is a modification of the
implementation developed by the University of Virginia.
Modifications were necessary to create a fast and efficient
production system. Figure 3 details the different components of
the repository model.
Figure 3. The Repository Model
4.3.2 Objects, Behaviors and Disseminators
Each object in the repository is identified with a particular
content type. Consistent with the FEDORA model, each content
type in the repository has a set of associated behaviors and
disseminators. The following content types are
supported in TDL.
• TUFTS_STD_IMAGE
• XML_TO_HTMLDOC
• TUFTS_BINARY_FILE
• TUFTS_VUE_CONCEPT_MAP
• TUFTS_COLLECTION
All content types contain disseminators supported by FEDORA’s
Behavior Definition (bdef) fedora-system:3 and by demo:277, which
is a Behavior Definition to support indexing. fedora-system:3
supports a few basic disseminators such as getObjectProfile,
viewObjectProfile, getMethodIndex, viewMethodIndex,
getItemIndex, viewItemIndex, getDublinCore, getItem and
viewDublinCore [14]. Additional Behavior Definitions and
Disseminators were linked to the content types to make the
objects usable by the applications. Tables 3-7 show the
association of content types with FEDORA behaviors and
disseminators that have been developed for TDL.
Dissemination | Description
getThumbnail | Returns thumbnail sized image (120 x 120 pixels).
getImage | Returns image in jpeg or gif format.
getStandard | Returns a screen-sized version of the image (650-850 pixel width).
getResized | Returns image with specified width and height.
getZoomedImage | Returns image at specified magnification.
getImageTile | Returns a tile of image specified by location x, y and dimensions width and height.
Table 3. Dissemination Index for TUFTS_STD_IMAGE
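Since the services communicate over HTTP, a client can request a dissemination such as getResized by constructing a URL. The sketch below assumes FEDORA's REST-style access pattern of the form base/get/PID/bDefPID/method?parameters. The host, the PIDs, the Behavior Definition identifier and the parameter names width and height are placeholders chosen for illustration, not values taken from TDL.

```python
from urllib.parse import quote, urlencode


def dissemination_url(base_url, pid, bdef_pid, method, **params):
    """Build a dissemination request URL (assumed pattern: base/get/PID/bDefPID/method).

    Colons are kept unescaped because PIDs have the form 'string:number'.
    Parameters are sorted by name so the resulting URL is deterministic.
    """
    path = "/".join(quote(part, safe=":") for part in (pid, bdef_pid, method))
    url = f"{base_url.rstrip('/')}/get/{path}"
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


# Hypothetical request for a 200 x 100 pixel rendition of an image object.
example = dissemination_url(
    "http://localhost:8080/fedora",  # placeholder server address
    "tufts:1",                       # placeholder object PID
    "tufts:bdef-image",              # placeholder Behavior Definition PID
    "getResized",
    width=200,
    height=100,
)
```

Keeping URL construction in one helper means an application only needs to know the dissemination name and its parameters, not how the repository is addressed.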
Dissemination | Description
getXML | Returns the content of document in raw XML format.
getTOC | Returns the Table of Content
getInfo | Returns basic information about the document.
getDocument | Returns the document in browse-able format.
getChunk | Returns the specified chapter from the document.
Table 4. Dissemination Index for XML_TO_HTMLDOC
Dissemination | Description
getFile | Returns the binary file.
Table 5. Dissemination Index for TUFTS_BINARY_FILE
Dissemination | Description
getConceptMap | Returns the concept Map generated and used by VUE.
getManifest | Returns the manifest file describing the content in VUE concept map.
getResource | Returns the specified resource used by VUE concept map.
Table 6. Dissemination Index for TUFTS_VUE_CONCEPT_MAP
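The association between content types and disseminations listed in Tables 3-6 can be expressed as a simple lookup structure, which an application might use to check whether a dissemination is available before requesting it. The dictionary below copies the names from the tables; the two helper functions are hypothetical and are not part of TDL or FEDORA.

```python
# Disseminations per content type, as listed in Tables 3-6.
DISSEMINATIONS = {
    "TUFTS_STD_IMAGE": {
        "getThumbnail", "getImage", "getStandard",
        "getResized", "getZoomedImage", "getImageTile",
    },
    "XML_TO_HTMLDOC": {"getXML", "getTOC", "getInfo", "getDocument", "getChunk"},
    "TUFTS_BINARY_FILE": {"getFile"},
    "TUFTS_VUE_CONCEPT_MAP": {"getConceptMap", "getManifest", "getResource"},
}


def supports(content_type: str, dissemination: str) -> bool:
    """Return True if the content type offers the named dissemination."""
    return dissemination in DISSEMINATIONS.get(content_type, set())


def content_types_for(dissemination: str) -> list:
    """Return, in sorted order, the content types that offer the dissemination."""
    return sorted(
        ctype for ctype, names in DISSEMINATIONS.items() if dissemination in names
    )
```

An unknown content type yields False rather than an error, which suits a client that probes objects of types it has not seen before.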