YADDA2: Assemble your own digital library application from LEGO bricks

Wojtek Sylwestrzak
Centre for Open Science
ICM, Univ. of Warsaw
ul. Prosta 69
00-838 Warszawa, Poland
w.sylwestrzak@icm.edu.pl
Tomasz Rosiek
Centre for Open Science
ICM, Univ. of Warsaw
ul. Prosta 69
00-838 Warszawa, Poland
t.rosiek@icm.edu.pl
Łukasz Bolikowski
Centre for Open Science
ICM, Univ. of Warsaw
ul. Prosta 69
00-838 Warszawa, Poland
l.bolikowski@icm.edu.pl
ABSTRACT
YADDA2 is an open software platform which facilitates cre-
ation of digital library applications. It consists of versa-
tile building blocks providing, among others: storage, re-
lational and full-text indexing, process management, and
asynchronous communication. Its loosely-coupled service-
oriented architecture enables deployment of highly-scalable,
distributed systems.
Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Li-
braries
General Terms
Design, Performance
1. INTRODUCTION
This paper presents recent results in the development
of an open, flexible, high-performance software platform for
digital library applications, which we named YADDA2.
1.1 Motivation
Since the late 1990s, the Interdisciplinary Centre for
Mathematical and Computational Modelling (ICM) at the
University of Warsaw has been providing the Polish scientific
community with access to full texts from major scientific
publishers (Elsevier, Springer, and recently IEEE). We realized
very quickly that maintaining individual publishers’ platforms
is expensive and cumbersome, so we started building our own
software, named YADDA, to provide a full-text search engine
and a single point of access to the heterogeneous content.
Components of the YADDA software found their way into a
number of other digital library systems, including D-NET (the
DRIVER Network Evolution Toolkit software suite [1]) and
EuDML (the European Digital Mathematics Library [7]). As
both the software and the document collections grew, the
original design could no longer meet our needs, especially in
terms of flexibility and scalability. Based on lessons learned
and a careful examination of other available solutions, we
have designed and are currently developing a modular
software platform named YADDA2, which can then be used by
us and by others alike to build concrete digital library (DL)
applications.
Copyright is held by the author/owner(s).
JCDL’12, June 10–14, 2012, Washington, DC, USA.
ACM 978-1-4503-1154-0/12/06.
1.2 Related work
There are already several products and software frame-
works, often mature, often distributed under open-source
licenses, which boast high flexibility, modularity, and abil-
ity to work with third-party systems. Notable examples in-
clude: aDORe federation architecture [8], capable of stor-
ing hundreds of millions of digital objects and terabytes
of image files; CiteSeerX [4], a database of research pub-
lications in computer and information science and related
areas; dLibra Digital Library Framework [6] developed at
Poznan Supercomputing and Networking Center; D-NET
(DRIVER Network Evolution Toolkit) [5], originally created
for the DRIVER repository infrastructure; Greenstone [9, 10],
produced by the New Zealand Digital Library Project
at the University of Waikato; H2O platform infrastructure,
launched by HighWire Press¹; INVENIO², originally
developed at CERN; NCore [2], powered by Fedora [3], a
framework for creation, management, and preservation of
digital content.
2. SOFTWARE PLATFORM
2.1 Design goals
Design of the YADDA2 software platform was driven by
a need for high flexibility and scalability. We needed a
software platform that would facilitate the creation of several
types of products: stand-alone repositories with a web front-end
and a publishing application in the back-end; repository
federations consisting of multiple autonomous collections,
accessed through a central front-end; and publication data
warehouses aggregating content from multiple repositories in
order to provide long-term preservation of data and access for
researchers and analysts.
Ideally, building a new product should be reduced to
assembling reusable, configurable components. Generic
services such as metadata and content storage, full-text index,
relational index, batch processing engine, or authentication
and authorization should be readily available, and their
configuration and assembly should be straightforward. At the
same time, an option to build custom components (in various
programming languages) and bridges should be retained.
¹ See: http://highwire.stanford.edu/publishers/H2O.dtl
² See: http://invenio-software.org/
The platform should be able to handle objects of vari-
ous types (documents, data sets, audio-visual content) and
in various formats. Typically, components should be pre-
pared to handle tens or hundreds of millions of objects. It
should be easy to deploy distributed systems in a way that
would be transparent to the individual components. Finally,
the platform should seamlessly interoperate with third-party
systems by embracing open protocols and standards for in-
formation exchange.
2.2 Architecture
YADDA2 makes it possible to build distributed, heterogeneous
systems with a multi-layer architecture. In most cases, the
architecture of a YADDA2-based system consists of two
tiers: the base services tier and the applications tier. The
base services provide generic functionality which is
independent of the type of content being stored or otherwise
processed. The applications, on the other hand, use base
services to provide business logic and user interface.
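
The split between the two tiers can be illustrated with a minimal Java sketch. It is not taken from the YADDA2 code base: the names `StorageService`, `InMemoryStorage`, and `ArticleRepository` are hypothetical and were chosen only to show a content-agnostic base service being used by an application-tier class that adds business logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Base services tier (hypothetical interface, not the YADDA2 API):
// a generic storage contract that knows nothing about the content it holds.
interface StorageService {
    void store(String id, byte[] content);
    Optional<byte[]> fetch(String id);
}

// One possible implementation, suitable for a small local repository.
class InMemoryStorage implements StorageService {
    private final Map<String, byte[]> objects = new HashMap<>();

    @Override
    public void store(String id, byte[] content) {
        // Copy the array so later changes by the caller do not affect stored data.
        objects.put(id, content.clone());
    }

    @Override
    public Optional<byte[]> fetch(String id) {
        byte[] content = objects.get(id);
        return content == null ? Optional.empty() : Optional.of(content.clone());
    }
}

// Applications tier (hypothetical): business logic written only against
// the base-service interface, so the storage back-end can be swapped.
class ArticleRepository {
    private final StorageService storage;

    ArticleRepository(StorageService storage) {
        this.storage = storage;
    }

    void publish(String doi, String fullText) {
        storage.store("article:" + doi, fullText.getBytes(StandardCharsets.UTF_8));
    }

    String read(String doi) {
        return storage.fetch("article:" + doi)
                .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                .orElseThrow(() -> new IllegalArgumentException("No such article: " + doi));
    }
}
```

Because `ArticleRepository` depends only on the interface, the in-memory implementation could be replaced by a distributed one without touching the application code, which is the property the two-tier design aims for.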
The YADDA2 architecture was designed to provide
high performance and scalability, with its Service-Oriented
Architecture making it easy to plug additional processing
resources or mass storage into existing production
environments. At the same time it offers the flexibility of
an open architecture: users of the platform can easily add
new services or adapt applications to their specific needs.
One of the main features of the YADDA2 architecture is that
it can be deployed in a distributed environment spanning
multiple organizations with different licensing and
authentication policies. The platform thus supports flexible
maintenance and security management in infrastructure
managed by more than one institution. YADDA2 includes
advanced security context management tools and allows
managing either fine-grained security policies within a
particular application or coarse-grained licensing and
authentication policies governing access to particular
services by particular institutions. Because the platform’s
resources can be accessed effectively not only through
front-end applications but also through internal APIs,
researchers can easily create ad-hoc analysis and data
post-processing tools.
From the technical point of view, the YADDA2 architecture
identifies the following components of the hosting
platform: the hosting infrastructure (service registries and
service containers, responsible for instantiating and managing
individual services and for communication between them); core
services (components providing particular low-level aspects of
the platform’s functionality; the current version of the
platform includes Metadata and Content Storage, Full-text
Index, Similarity Index, Batch Processing Engine, Relational
Index, and User Annotation Service); and platform clients
(software components using the services of the infrastructure).
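
The relationship between a service registry, a core service, and a platform client can be sketched as follows. This is an illustrative assumption rather than the actual YADDA2 implementation: `ServiceRegistry`, `FullTextIndex`, `NaiveFullTextIndex`, and `SearchClient` are hypothetical names, and the index is a deliberately naive substring search.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hosting infrastructure (hypothetical): maps a service contract to a live instance.
class ServiceRegistry {
    private final Map<Class<?>, Object> services = new ConcurrentHashMap<>();

    <T> void register(Class<T> contract, T instance) {
        services.put(contract, instance);
    }

    <T> T lookup(Class<T> contract) {
        Object instance = services.get(contract);
        if (instance == null) {
            throw new IllegalStateException("No service registered for " + contract.getName());
        }
        return contract.cast(instance);
    }
}

// Core service contract (hypothetical), standing in for a full-text index.
interface FullTextIndex {
    void index(String id, String text);
    List<String> search(String term);
}

// Naive implementation: case-insensitive substring match over stored texts.
class NaiveFullTextIndex implements FullTextIndex {
    private final Map<String, String> documents = new LinkedHashMap<>();

    @Override
    public synchronized void index(String id, String text) {
        documents.put(id, text.toLowerCase(Locale.ROOT));
    }

    @Override
    public synchronized List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}

// Platform client (hypothetical): obtains the service through the registry
// instead of constructing it, so it never depends on a concrete class.
class SearchClient {
    private final FullTextIndex index;

    SearchClient(ServiceRegistry registry) {
        this.index = registry.lookup(FullTextIndex.class);
    }

    List<String> find(String term) {
        return index.search(term);
    }
}
```

In a distributed deployment the registry would presumably return a remote proxy rather than a local object; the client code in this sketch would stay the same in either case.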
YADDA2 is based on the Java language and supports a
number of communication standards, including HTTP, REST,
SOAP and RMI. In addition, applications built on
YADDA2 support interfaces specific to digital libraries,
such as OAI-PMH and OpenSearch.
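
To make the OAI-PMH interface concrete, the sketch below builds a harvesting request URL in Java. The verb `ListRecords` and the arguments `metadataPrefix`, `from`, and `set` come from the OAI-PMH 2.0 protocol; the class name `OaiPmhRequest` is hypothetical, and the endpoint in the usage note is a placeholder rather than the address of an actual YADDA2 deployment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that assembles an OAI-PMH ListRecords request URL.
class OaiPmhRequest {

    // from and set are optional protocol arguments; pass null to omit them.
    static String listRecords(String baseUrl, String metadataPrefix, String from, String set) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("verb", "ListRecords");
        params.put("metadataPrefix", metadataPrefix);
        if (from != null) {
            params.put("from", from);
        }
        if (set != null) {
            params.put("set", set);
        }

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return baseUrl + "?" + query;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `OaiPmhRequest.listRecords("http://repo.example.org/oai", "oai_dc", "2012-01-01", null)` yields `http://repo.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2012-01-01`, which a harvester could then fetch over HTTP.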
3. SUMMARY
We have presented YADDA2, a new software platform for
digital libraries. One differentiating factor is its open, mod-
ular, loosely-coupled design. Another one is its scalability,
not only in the sense of large amounts of data or large traf-
fic volumes that the derived applications can reliably handle,
but also its multi-scale capabilities, making it particularly
easy to tailor the size of an application to the specific needs,
from a small, local (or embedded) repository implementa-
tion, to large scale, complex distributed systems handling
heterogeneous content and services.
4. ACKNOWLEDGEMENTS
This work is supported by the National Centre for
Research and Development (NCBiR) under Grant No.
SP/I/1/77065/10 by the Strategic scientific research and
experimental development program: “Interdisciplinary System for
Interactive Scientific and Scientific-Technical Information”.
5. REFERENCES
[1] M. Artini, L. Candela, D. Castelli, P. Manghi,
M. Mikulicic, and P. Pagano. Sustainable Digital
Library Systems over the DRIVER Repository
Infrastructure. Lecture Notes in Computer Science,
5173:227–231, 2008.
[2] D. B. Krafft, A. Birkland, and E. J. Cramer. Ncore:
architecture and implementation of a flexible,
collaborative digital library. In Proceedings of the 8th
ACM/IEEE-CS joint conference on Digital libraries -
JCDL ’08, page 313, New York, New York, USA,
2008. ACM Press.
[3] C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora:
an architecture for complex objects and their
relationships. International Journal on Digital
Libraries, 6(2):124–138, Dec. 2005.
[4] H. Li, I. Councill, W.-C. Lee, and C. L. Giles.
CiteSeerX: An architecture and Web service design for
an academic document search engine. In Proceedings
of the 15th international conference on World Wide
Web - WWW ’06, page 883, New York, New York,
USA, 2006. ACM Press.
[5] P. Manghi, M. Mikulicic, L. Candela, D. Castelli, and
P. Pagano. Realizing and Maintaining Aggregative
Digital Library Systems: D-NET Software Toolkit and
OAIster System. D-Lib Magazine, 16(3/4), Mar. 2010.
[6] C. Mazurek, T. Parkoła, and M. Werla. Distributed
Digital Libraries Platform in the PIONIER Network.
Lecture Notes in Computer Science, 4172:488–491,
2006.
[7] W. Sylwestrzak, J. Borbinha, T. Bouche, A. Nowiński,
and P. Sojka. EuDML—Towards the European Digital
Mathematics Library. In Towards a Digital
Mathematics Library, pages 11–26, 2010.
[8] H. Van de Sompel, R. Chute, and P. Hochstenbach.
The aDORe federation architecture: digital
repositories at scale. International Journal on Digital
Libraries, 9(2):83–100, Oct. 2008.
[9] I. H. Witten, D. Bainbridge, and D. M. Nichols. How
to Build a Digital Library. Morgan Kaufmann, 2nd
edition, 2009.
[10] I. H. Witten, S. J. Boddie, D. Bainbridge, and R. J.
McNab. Greenstone: a comprehensive open-source
digital library software system. In Proceedings of the
fifth ACM conference on Digital libraries - DL ’00,
pages 113–121, New York, New York, USA, 2000.
ACM Press.