Architecting an extensible digital repository
ABSTRACT The Digital Collection and Archives (DCA), in partnership with Academic Technology (AT) at Tufts University, developed a digital library solution for long-term storage and integration of existing digital collections, such as Perseus, TUSK, Bolles and Artifact. We describe the Tufts Digital Library (TDL) architecture. TDL is an extensible, modular, flexible and scalable architecture that uses FEDORA at its core. The extensible nature of the TDL architecture allows for seamless integration of collections that may be developed in the future, while leveraging the extensive tools that are available as part of individual digital library applications at Tufts. We describe the functionality and implementation details of the individual components of TDL. Two applications that have successfully interfaced with TDL are presented. We conclude with some remarks about the future development of TDL.
Anoop Kumar, Ranjani Saigal
16 Dearborn Road
Medford, MA 02155
Digital Collections and Archives
Tisch Library-35 Professors
Medford, MA 02155
Department of Computer Science
161 College Ave.
Medford Ma, 02155
Categories and Subject Descriptors
H.3.7 [Digital Libraries]: Collection, Dissemination, Standards,
System Issues, User Issues.
General Terms
Management, Performance, Design, Security.
Keywords
Digital Library, preservation, FEDORA, VUE.
1. Introduction
During the past decade, colleges and universities have witnessed
an exponential growth in digital information available for
teaching and learning. There are many collections of digital
objects, including images, texts, audio and video, that have great
value in a diverse set of fields. As the quantity of information
continues to increase and these collections expand, there is a need
for a repository that can provide appropriate storage of and access
to all of this valuable material in a flexible and extensible manner
for the foreseeable future. This need has led many organizations to
select a digital library solution that can assimilate the current
collections and accommodate new materials as they become
available.
The Digital Collection and Archives (DCA) in partnership with
Academic Technology (AT) at Tufts University has developed a
digital library solution that provides for long-term storage and
integration of existing digital collections, while leveraging the
extensive tools developed by individual digital library projects at
Tufts. The digital library system developed to serve the needs at
Tufts builds on and extends successful models that are currently
in vogue in the digital library world. This paper describes the
architecture of the Tufts Digital Library (TDL), which is designed
to allow assimilation and interoperability of existing Tufts digital
libraries while allowing new creators of digital materials to add
their content and write new applications for using and managing
The paper also describes two applications that successfully
interface with this architecture to use the content of the
repository: the Visual Understanding Environment (VUE), a concept
mapping application, and Tufts Digital Library Search.
2. Related Work
The architecture for TDL incorporates the concepts described in
the emerging standards for trusted digital repositories, and
complies with the Reference Model for an Open Archival
Information System (OAIS) functional and information models
for archival information systems. On a practical level this
means that we have applied the principles of the “trusted digital
repository” and OAIS guidelines in the arrangement of our system
architecture, matching requirements to system services.
The format used to assign persistent IDs or Uniform Resource Names
(URN) to objects across digital collections follows the standard
specified in the RFCs on URNs [8-11]. OCLC’s Persistent
Uniform Resource Locators (PURL) were used as a basis to create
a “Naming Service” that creates and resolves URNs.
The implementation of FEDORA by the University of Virginia
forms the core of the TDL architecture. It provides the
“plumbing” or framework for all the components and the services
in the architecture. Harvard’s LDI uses a model that clearly
separates the collections infrastructure, access infrastructure and
common services to support their large and unusually
decentralized library system. This modular approach has been
used as a basis to develop the TDL component architecture. We also
used the ideas of interoperability, scalability and digital
preservation that form the core of the Making of America II
project.
The abstract model that represents digital objects is drawn from
Lagoze’s Warwick Framework and Kahn and Wilensky’s
Framework for Distributed Digital Objects. The model was
developed taking into consideration the Digital Repository (DR)
interface proposed by MIT’s Open Knowledge Initiative (OKI).
The objects in the repository can be easily accessed and
managed by applications that support the OKI-DR interface.
TDL uses the Dublin Core metadata element set for storing fields
such as author, title, subject, etc. Administrative metadata,
gathered through the METS XML file, is used for processing FEDORA
objects. Metadata is also acquired and managed as
“Datastreams”, a concept that is proposed in the FEDORA object
model.
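To make the descriptive metadata concrete, the following is a minimal sketch of a Dublin Core record built with Python's standard library. The namespace URI is that of the Dublin Core Metadata Element Set 1.1, in which the author field corresponds to the `creator` element. The wrapper element `record`, the function name `build_dc_record` and the sample values are hypothetical and are not taken from TDL.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Build a record holding one Dublin Core element per (name, value) pair.

    The wrapper element name "record" is a hypothetical choice for this
    sketch; it is not the container that TDL or FEDORA actually uses.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Hypothetical sample values, for illustration only.
record = build_dc_record([
    ("title", "Map of London"),
    ("creator", "Example Cartographer"),
    ("subject", "London (England) -- Maps"),
])
xml_text = ET.tostring(record, encoding="unicode")
```

Serializing the record yields elements such as `dc:title` and `dc:creator`, which is the form in which descriptive fields could travel alongside an object's content.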
The TDL search application uses Lucene, which is part of the
Jakarta project. It is an open source search engine that provides
full-text search, metadata search and advanced search.
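The difference between full-text search and metadata search can be illustrated with a toy fielded inverted index. This sketch is not Lucene and does not use its API; the class `ToyIndex`, the field names and the sample objects are hypothetical and serve only to show how terms can be indexed per field.

```python
from collections import defaultdict


class ToyIndex:
    """A toy fielded inverted index (an illustration, not Lucene)."""

    def __init__(self):
        # Maps (field, term) to the set of object identifiers containing it.
        self._postings = defaultdict(set)

    def add(self, object_id, fields):
        """Index every whitespace-separated term of every field."""
        for field, text in fields.items():
            for term in text.lower().split():
                self._postings[(field, term)].add(object_id)

    def search(self, field, query):
        """Return the ids whose `field` contains every term of `query`."""
        terms = query.lower().split()
        if not terms:
            return set()
        matches = [self._postings.get((field, term), set()) for term in terms]
        return set.intersection(*matches)


# Hypothetical objects: "title" plays the role of a metadata field,
# "fulltext" the role of the object's textual content.
index = ToyIndex()
index.add("obj-1", {"title": "Historical maps of London",
                    "fulltext": "A survey of the city and its streets"})
index.add("obj-2", {"title": "Lecture notes",
                    "fulltext": "Anatomy of the heart"})
```

Searching the `title` field answers a metadata query, while searching `fulltext` answers a content query; a production engine adds tokenization, ranking and query parsing on top of this basic structure.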
3. Designing the Tufts Digital Library
3.1 Need for a New Architecture
Three major discipline-specific digital library applications have
been developed at Tufts University. Perseus, a digital library
in the area of classics, has over a million objects and receives
about 8 million page hits a month. The collection includes
multilingual marked-up texts, images, audio and video. The
Tufts University Science Knowledgebase (TUSK) is a
repository of materials specific to the health sciences field. It is
widely used by the Medical, Veterinary and Dental schools for
teaching a variety of courses. Artifact is a collection of about
2500 images that is used to teach courses in Art History.
Tufts also has two substantial collections of digital materials: the
Bolles Collection, which is a collection of unique
historical maps of London, and Crime and Punishment, which
is a collection of videos, images and cases used for simulations by
Political Science faculty. Table 1 provides details about these
collections.
Table 1. Digital Libraries at Tufts
Perseus 50 million words The subset of Perseus
collection data that the
Tufts Digital library is
composed primarily of
encoded XML texts of
many types, including
various forms of Lexica,
13 million words,
contains highly structured
TEI encoded XML texts,
PDF documents, high
resolution TIFF images,
TUSK 15,000 documents,
Tufts University Science
contains full-text syllabi,
lecture recordings (audio
and video) and notes,
bibliographies linked to
available by the faculty
of Tufts University.
Artifact 2500 images Artifact contains over
corresponding data, with
links to the Art History
slide collection database
entries. It integrates on-
searching with Internet-
traditional learning aids,
such as flashcards, for
review and study
and 400 images and
The repository contains
images in gif format and
videos in mov format to
support simulations used
Perseus, Artifact and TUSK have an extensive set of tools
associated with them that allow users to access content in a
manner that is most suited to their discipline-specific needs. The
collections are continuously expanding, adding content in a variety
of formats. The current architecture of these libraries is not built
to accommodate such expansion. There is a need for development
of a new digital library architecture that is modular, scalable and
economically viable. The architecture should allow for
persistence of objects across collections and reusability of content
by multiple applications.
While having a centralized university-wide repository application
is an attractive proposition, the diverse classes of digital objects
represented in the various collections pose several challenges.
The repository must be able to ingest and manage diverse
materials and the ingestion and management processes must be
able to scale to handle large volumes of content. In addition to
the preservation of the objects, the repository architecture must be
flexible enough to provide the appropriate hooks so that we can
design services that are capable of delivering the content in a
user-friendly manner to different user communities. For example,
digital objects from the Perseus project Classics collection that are
stored in the repository need to be disseminated through complex
language tools developed by the Perseus project that link syntax,
grammar, and references to particular people and things across the
entire collection. On the other hand, digital objects from the
TUSK collection require a completely different kind of
dissemination, one that resembles a courseware environment.
The modular and interoperable nature of the TDL architecture
allows tools developed within one application, such as
Perseus, to be used on objects that belong to another application,
such as TUSK. This makes TDL a powerful architecture that can
effectively use tools that have been developed by disparate
3.2 System Specifications
A modular system that meets the necessary functional
requirements of a long-term digital repository was considered
the most suitable architecture for TDL. It had to be flexible and
extensible enough to meet the diverse storage and access needs of
data providers and application builders within the Tufts
community and in the potential federated digital library
community. Principles of the “trusted digital repository” and
OAIS guidelines were applied in the arrangement of the system
architecture. Table 2 details the OAIS requirements along with
the matching system services.
Table 2. Requirements and system services.
Requirements System Services
identification of materials
and persistent Naming Service
Use of Archival Information Packages (AIP)        Digital Object Provider (DOP)
Use of Submission Information Packages (SIP)      Drop Box, Ingestion Service
Use of Dissemination Information Packages (DIP)   DOP Service
Authentication and integrity
Application, Search Service
Service and other
4. TDL Architecture
An architecture made of loosely coupled modular services
emerged as the solution that would be best suited to create a
flexible, extensible and scalable digital library that could subsume
our current digital libraries while allowing for future,
yet-to-be-determined applications. This model extends the framework
provided by the UVA implementation of FEDORA, which forms
the core of TDL.
To address scalability, we devised an architecture that defines a
number of logical units and their relationships in the context of
the digital library. HTTP/HTTPS was chosen as the communication
protocol between these units. This choice allows the use of a wide
array of server tools in the implementation of each service, with
the prospect of using the Internet as the transport layer.
Scalability was the main motivation
for minimizing the lines of dependency between the services in
Figure 1. TDL Architecture
Figure 1 shows the component services that comprise the TDL
architecture.
TDL was explicitly designed to facilitate the business processes
that are associated with the creation and use of the library. The
design of the component services was done in conjunction with
the design of the business processes associated with each service.
Each component was designed to effectively support the
corresponding business process and interface appropriately with
The architecture comprises five basic services:
Drop Box and Ingestion Service provides a conduit for
objects to be uploaded into TDL. It validates and tags
the objects as part of preprocessing and then ingests them.
Naming Service creates a unique persistent identifier,
the Uniform Resource Name (URN), for the
object. The service also resolves URNs.
FEDORA Repository Service provides management of
and access to named digital objects.
Indexing and Search Service indexes the digital
objects and provides a search mechanism.
Application Creation Service provides a mechanism
for external applications to interface with the repository.
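Because the services communicate only over HTTP/HTTPS, a client needs nothing more than a service's address to use it. The following minimal sketch shows that loose coupling as a registry of base URLs and a helper that builds request URLs. The host names, the paths `resolve` and `query`, and the parameter name `urn` are hypothetical; the paper does not specify the actual endpoints.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical base URLs for the five services. Each service is known to
# its clients only by its address, so it can be moved or reimplemented
# without changes to the other services.
SERVICES = {
    "ingestion": "https://ingest.example.edu/",
    "naming": "https://naming.example.edu/",
    "repository": "https://repository.example.edu/",
    "search": "https://search.example.edu/",
    "application": "https://apps.example.edu/",
}


def service_url(service, path, **params):
    """Build the URL of a request to one of the registered services."""
    url = urljoin(SERVICES[service], path)
    if params:
        url += "?" + urlencode(params)
    return url


# Example: the URL a client might use to resolve a URN (hypothetical path).
resolve_url = service_url("naming", "resolve", urn="tufts:demo:item1")
```

Relocating a service to another server then amounts to changing one entry in the registry, which is one way to keep the lines of dependency between services short.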
4.1 Drop Box and Ingestion Service
The “Drop Box”, as the name suggests, is a location where users
can place digital objects that need to reside in TDL. It provides
temporary data storage during the pre-processing phase. The Drop
Box contains a template file provided by the archivists. The
template file has basic metadata that is associated with all
objects. The Drop Box allows for association of additional metadata
with the objects. The drop box also tests for validity of object
The Ingestion Service automatically collects the objects from the
Drop Box. It validates the FEDORA object schema and waits for
archivists to perform content quality review before approving or
rejecting objects based on archival standards. It calls the Naming
Service to obtain a URN for approved objects. It takes the
content, binds it with the associated metadata and prepares the
METS object, which is then ingested into FEDORA. It gets the
PID from FEDORA and calls the Naming Service to associate
the PID with the URN. Finally, it informs the contributors about the
success or failure of the attempt to ingest the object.
Figure 2. Ingestion Service
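The ingestion steps described above can be sketched as a short pipeline. This is a minimal illustration under stated assumptions, not the TDL implementation: `StubNaming` and `StubRepository` stand in for the Naming Service and FEDORA, a plain dictionary stands in for the METS object, the validation rule is a placeholder for schema validation, and the project name `default` and the PID prefix `tufts` are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Submission:
    contributor: str
    item_name: str
    content: bytes
    metadata: dict
    approved: bool  # outcome of the archivists' content quality review


class StubNaming:
    """Stands in for the Naming Service (hypothetical interface)."""

    def __init__(self):
        self.bindings = {}

    def create_urn(self, item_name):
        return f"tufts:default:{item_name}"

    def bind(self, urn, pid):
        self.bindings[urn] = pid


class StubRepository:
    """Stands in for FEDORA; issues PIDs of the form 'string:number'."""

    def __init__(self):
        self._numbers = count(1)
        self.objects = {}

    def ingest(self, package):
        pid = f"tufts:{next(self._numbers)}"
        self.objects[pid] = package
        return pid


def ingest(submission, naming, repository):
    """Run one submission through the steps; return (ok, message)."""
    # Placeholder validation; the real service validates the object schema.
    if not submission.content or "title" not in submission.metadata:
        return False, "validation failed"
    if not submission.approved:
        return False, "rejected by archivist"
    urn = naming.create_urn(submission.item_name)
    # A dictionary stands in for the METS object that binds content and metadata.
    package = {"urn": urn, "metadata": submission.metadata,
               "content": submission.content}
    pid = repository.ingest(package)
    naming.bind(urn, pid)  # associate the PID with the URN
    return True, f"ingested as {urn}"
```

The returned message corresponds to the final step, in which the contributor is told whether the attempt succeeded or failed.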
4.2 Naming Service
FEDORA provides a very limited system for referencing objects.
Every object in FEDORA is assigned a PID (Persistent Identifier)
in the format: “string:number”. This makes it difficult to track
and reference objects uniquely across collections. Furthermore,
objects may move between FEDORA servers, creating a need for
an identifier that is uniquely associated with the object,
independent of the repository in which it resides. The Naming
Service creates a URN. It also creates a binding between the URN
and the FEDORA PID and provides a resolution service to locate
The convention developed for the TDL URN is as follows:
tufts:school name:owner:[collection:]item name.
The first field of the URN created through this service is always
tufts. The second field is ‘project name’ which is unique for any
project registered through this service. If an object is not
associated with a project, it is allocated to the default project.
‘collection’ is an optional field provided by the projects.
Collection helps further classify repositories in a project. ‘item
name’ can be provided by the project/repository
owners/contributors or it will be created by the service. The URN
formed by combining these four fields is guaranteed to be unique.
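The creation, binding and resolution of URNs can be sketched as follows. This minimal example follows the four fields described in the prose (the fixed prefix `tufts`, a project name, an optional collection and an item name). The class and method names, the project name `default` and the scheme for generated item names are hypothetical, and an in-memory dictionary stands in for persistent storage.

```python
class NamingService:
    """In-memory sketch of URN creation, binding and resolution."""

    def __init__(self):
        self._urn_to_pid = {}
        self._generated = 0

    def create_urn(self, project=None, collection=None, item_name=None):
        """Create and reserve a unique URN from the four fields."""
        project = project or "default"  # objects without a project
        if item_name is None:
            # The service creates an item name when none is provided.
            self._generated += 1
            item_name = f"item{self._generated}"
        parts = ["tufts", project]
        if collection:
            parts.append(collection)
        parts.append(item_name)
        urn = ":".join(parts)
        if urn in self._urn_to_pid:
            raise ValueError(f"URN already registered: {urn}")
        self._urn_to_pid[urn] = None  # reserved, not yet bound to a PID
        return urn

    def bind(self, urn, pid):
        """Bind a reserved URN to a FEDORA PID, or rebind it after a move."""
        if urn not in self._urn_to_pid:
            raise KeyError(urn)
        self._urn_to_pid[urn] = pid

    def resolve(self, urn):
        """Return the PID currently bound to the URN."""
        pid = self._urn_to_pid.get(urn)
        if pid is None:
            raise KeyError(urn)
        return pid
```

Because the URN is stored separately from the PID, an object that moves to another FEDORA server keeps its URN; only the binding is updated through a second call to `bind`.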
4.3 FEDORA Repository Service
The FEDORA Repository Service forms the core of TDL. The
key features of the architecture are: (1) support for heterogeneous
data types; (2) accommodation of new types as they emerge; (3)
aggregation of mixed, possibly distributed, data into complex
objects; (4) the ability to specify multiple content disseminations
of these objects; and (5) the ability to associate rights
management schemes with these disseminations.
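Features (3) and (4) can be illustrated with a small data model in which an object aggregates named datastreams and exposes named disseminations of them. This is a minimal sketch, not the FEDORA object model itself: the class and field names, the datastream identifier `FILE` and the sample location are hypothetical, and rights management is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Datastream:
    mime_type: str
    location: str  # may point at local or remote (distributed) content


@dataclass
class DigitalObject:
    pid: str
    content_type: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)
    # Each disseminator maps a method name to a function of the object.
    disseminators: Dict[str, Callable[["DigitalObject"], str]] = field(
        default_factory=dict)

    def disseminate(self, method):
        """Run the named dissemination, if the object supports it."""
        if method not in self.disseminators:
            raise KeyError(f"{self.content_type} does not support {method}")
        return self.disseminators[method](self)


# Hypothetical object with one datastream and one dissemination.
obj = DigitalObject(pid="demo:1", content_type="TUFTS_BINARY_FILE")
obj.datastreams["FILE"] = Datastream(
    "application/pdf", "https://example.edu/files/report.pdf")
obj.disseminators["getFile"] = lambda o: o.datastreams["FILE"].location
```

New content types can then be accommodated by registering different sets of datastreams and disseminators, without changing the object structure.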
The following sections describe TDL’s implementation of the
repository model, objects, behaviors and disseminators and
4.3.1 Repository Model
TDL’s implementation of FEDORA is a modification of the
implementation developed by the University of Virginia.
Modifications were necessary to create a fast and efficient
production system. Figure 3 details the different components of
the repository model.
Figure 3. The Repository Model
4.3.2 Objects, Behaviors and Disseminators
Each object in the repository is identified with a particular
content-type. Consistent with the FEDORA model, each content
type in the repository has a set of associated behaviors and
disseminators. The following is the list of content types that are
supported in TDL.
All content types contain disseminators supported by FEDORA’s
Behavior Definition (bdef) fedora-system:3 and by demo:277, which
is a Behavior Definition to support indexing. fedora-system:3
supports a few basic disseminators, including viewDublinCore.
Additional Behavior Definitions and
Disseminators were linked to the content types to make the
objects usable by the applications. Tables 3-7 show the
association of content types with FEDORA behaviors and
disseminators that have been developed for TDL.
getThumbnail     Returns thumbnail sized image (120 x 120 pixels).
getImage         Returns image in jpeg or gif format.
getStandard      Gets a screen size of the image (650-850 pixel width).
getResized       Returns image with specified width and height.
getZoomedImage   Returns a tile of image specified by location x, y and
                 dimensions width and height.
Table 3. Dissemination Index for TUFTS_STD_IMAGE
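As an illustration of how an application might request one of the image disseminations in Table 3, the sketch below checks the parameters each method takes and builds a request URL. The method and parameter names come from the table. The URL layout (base, `get`, PID, behavior definition, method), the base URL and the behavior definition identifier `tufts:bdef-image` are assumptions made for this example and are not documented TDL endpoints.

```python
from urllib.parse import urlencode

# Parameters taken by each disseminator, as described in Table 3.
IMAGE_DISSEMINATORS = {
    "getThumbnail": (),
    "getImage": (),
    "getStandard": (),
    "getResized": ("width", "height"),
    "getZoomedImage": ("x", "y", "width", "height"),
}


def dissemination_url(base_url, pid, bdef, method, **params):
    """Build a request URL for one dissemination of an image object.

    The path layout used here is an assumption for illustration.
    """
    if method not in IMAGE_DISSEMINATORS:
        raise ValueError(f"unknown disseminator: {method}")
    expected = IMAGE_DISSEMINATORS[method]
    if set(params) != set(expected):
        raise ValueError(f"{method} expects parameters {expected}")
    url = f"{base_url}/get/{pid}/{bdef}/{method}"
    if params:
        # Emit the parameters in the order listed for the method.
        url += "?" + urlencode({name: params[name] for name in expected})
    return url


# Hypothetical request for a 300 x 200 rendition of object tufts:1.
resized = dissemination_url(
    "https://repository.example.edu/fedora", "tufts:1",
    "tufts:bdef-image", "getResized", width=300, height=200)
```

Validating the parameters before the request is sent lets an application report a malformed call without a round trip to the repository.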
document in raw XML format.
the content of
getTOC Returns the Table of Content
about the document.
the document in
getChunk Returns the specified chapter
from the document.
Table 4. Dissemination Index for XML_TO_HTMLDOC
getFile Returns the binary file.
Table 5. Dissemination Index for TUFTS_BINARY_FILE
generated and used by VUE.
the concept Map
describing the content in VUE
the manifest file
getResource Returns the specified resource
used by VUE concept map.
Table 6. Dissemination Index for