Giovanni Soda

University of Florence, Florence, Tuscany, Italy

Publications (119), 59.54 Total impact

  • ABSTRACT: Digital Libraries have become a widely used service for storing and sharing both born-digital documents and digital versions of works held by traditional libraries. Document images are intrinsically unstructured, and the structure and semantics of the digitized documents are for the most part lost during conversion. Several techniques from the Document Image Analysis research area have been proposed to deal with document image retrieval applications. This chapter surveys recent techniques for the recognition and retrieval of text and graphical documents, with particular attention to recognition-free approaches.
    09/2011: pages 181-204;
  • ABSTRACT: In this paper, the Earth Mover's Distance (EMD) is used as a similarity measure in the mathematical symbol retrieval task. The approach is based on the Bag-of-Visual-Words model. In our case the features extracted from each symbol are clustered by means of Self-Organizing Maps (SOM), and the occurrences of features in the clusters are then accumulated in a vector of visual words. These vectors are compared with the EMD, which naturally allows the topological organization of SOM clusters to be incorporated in the distance computation. The proposed approach is experimentally tested in a mathematical symbol retrieval task and compared with the cosine similarity and with some recently proposed variants.
    2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011; 01/2011
  • ABSTRACT: In recent years, interest in e-book readers has grown significantly. Two main document formats are supported by most devices: PDF and ePub. The PDF format is widely used to share documents because it allows cross-platform readability. However, it is not ideal for comfortable reading on small screens. In contrast, the ePub format is re-flowable and well suited for e-book readers. In this paper we describe a system for the conversion of PDF books to the ePub format that aims to invert the text formatting applied during pagination. To this end, layout analysis techniques are used to identify the book's table of contents and the main functional regions such as chapters, paragraphs, and notes.
    2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011; 01/2011
  • ABSTRACT: We describe a tool for Table of Contents (ToC) identification and recognition from PDF books. This task is part of ongoing research on the development of tools for the semi-automatic conversion of PDF documents to the Epub format, which can be read on several E-book devices. Among various sub-tasks, ToC extraction and recognition is particularly useful for easy navigation of book contents. The proposed tool first identifies the ToC pages. The bounding boxes of ToC titles in the book body are subsequently found in order to add suitable links in the Epub ToC. The proposed approach is tolerant of discrepancies between the ToC text and the corresponding titles. We evaluated the tool on several open access books edited by University Presses that are partners of the OAPEN EcontentPlus project.
    Proceedings of the 2010 ACM Symposium on Document Engineering, Manchester, United Kingdom, September 21-24, 2010; 01/2010
  • ABSTRACT: In this paper we describe our recent research on mathematical symbol indexing and its possible application in the Digital Library domain. The proposed approach represents mathematical symbols by means of Shape Context (SC) descriptions. Indexed symbols are represented with a vector space-based method, but peculiar to our approach is the use of Self Organizing Maps (SOM) to perform the clustering instead of the commonly used k-means algorithm. Retrieval performance is measured on a large collection of mathematical symbols gathered from the widely used INFTY database.
    Digital Libraries - 6th Italian Research Conference, IRCDL 2010, Padua, Italy, January 28-29, 2010. Revised Selected Papers; 01/2010
  • ABSTRACT: In this paper, we describe a general approach for script (and language) recognition from printed documents and for writer identification in handwritten documents. The method is based on a bag-of-visual-words strategy where the visual words correspond to characters and the clustering is obtained by means of Self Organizing Maps (SOM). Unknown pages (words in the case of script recognition) are classified by comparing their vectorial representations with those of a training set using a cosine similarity. The comparison is improved using a similarity score that takes into account the SOM organization of cluster centroids. Promising results are presented for both printed documents and handwritten musical scores.
    20th International Conference on Pattern Recognition, ICPR 2010, Istanbul, Turkey, 23-26 August 2010; 01/2010
  • ABSTRACT: This paper addresses the indexing and retrieval of mathematical symbols from digitized documents. The proposed approach exploits Shape Contexts (SC) to describe the shape of mathematical symbols. Indexed symbols are represented with a vector space-based method that is grounded on SC clustering. We explore the use of the Self Organizing Map (SOM) to perform the clustering and we compare several approaches to compute the SCs. Retrieval performance is measured on a large collection of mathematical symbols gathered from the widely used INFTY database.
    11/2009: pages 102-111;
  • ABSTRACT: This paper addresses the indexing and retrieval of mathematical symbols from digitized documents. The proposed approach exploits Shape Contexts (SC) to describe the shape of mathematical symbols. Starting from the vector space method, which is based on SC clustering, we explore the use of topologically ordered clusters to improve the retrieval performance. The clustering is computed by means of Self-Organizing Maps that organize the clusters in two-dimensional topologically ordered feature maps. The retrieval performance is compared with that obtained using K-means clustering on a large collection of mathematical symbols gathered from the widely used INFTY database.
    10th International Conference on Document Analysis and Recognition, ICDAR 2009, Barcelona, Spain, 26-29 July 2009; 01/2009
  • ABSTRACT: We describe a dimensionality reduction method used to perform similarity search that is tested on document image retrieval applications. The approach is based on data point projection into a low dimensional space obtained by merging together the layers of a Growing Hierarchical Self Organizing Map (GHSOM) trained to model the distribution of objects to be indexed. The low dimensional space is defined by embedding the GHSOM sub-maps in the space defined by a non-linear mapping of neurons belonging to the first level map. The latter mapping is computed with the Sammon projection algorithm. The dimensionality reduction is used in a similarity search framework whose aim is to efficiently retrieve similar objects on the basis of the distance among projected points corresponding to high dimensional feature vectors describing the indexed objects. We compare the proposed method with other dimensionality reduction techniques by evaluating the retrieval performance on three datasets.
    Image Analysis and Processing - ICIAP 2009, 15th International Conference Vietri sul Mare, Italy, September 8-11, 2009, Proceedings; 01/2009
  • ABSTRACT: In this paper we explore the effectiveness of three clustering methods used to perform word image indexing. The three methods are the Self-Organizing Map (SOM), the Growing Hierarchical Self-Organizing Map (GHSOM), and Spectral Clustering. We test these methods on a real data set composed of word images extracted from pages of a 19th-century encyclopedia. In essence, the word images are stored in the clusters defined by the clustering methods and subsequently retrieved by identifying the closest cluster to a query word. The accuracy of the methods is compared by considering the performance of the word retrieval algorithm developed in our previous work. From the experimental results we may conclude that methods designed to automatically determine the number and the structure of clusters, such as GHSOM, are particularly suitable in the context represented by our data set.
    The Eighth IAPR International Workshop on Document Analysis Systems, DAS 2008, September 16-19, 2008, Nara, Japan; 01/2008
  • ABSTRACT: We describe a dimensionality reduction method based on data point projection in an output space obtained by embedding the Growing Hierarchical Self Organizing Maps (GHSOM) computed from a training data-set. The dimensionality reduction is used in a similarity search framework whose aim is to efficiently retrieve similar objects on the basis of the Euclidean distance among high dimensional feature vectors projected in the reduced space. This research is motivated by applications aimed at performing Document Image Retrieval in Digital Libraries. In this paper we compare the proposed method with other dimensionality reduction techniques by evaluating the retrieval performance on three data-sets.
    Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshop, SSPR & SPR 2008, Orlando, USA, December 4-6, 2008. Proceedings; 01/2008
  • ABSTRACT: In this chapter, we discuss the use of Self Organizing Maps (SOM) to deal with various tasks in Document Image Analysis. The SOM is a particular type of artificial neural network that computes, during learning, an unsupervised clustering of the input data, arranging the cluster centers in a lattice. After an overview of previous applications of unsupervised learning in document image analysis, we present our recent work in the field. We describe the use of the SOM at three processing levels: character clustering, word clustering, and layout clustering, with applications to word retrieval, document retrieval and page classification. In order to improve the clustering effectiveness when dealing with small training sets, we propose an extension of the SOM training algorithm that considers the tangent distance so as to increase the SOM robustness with respect to small transformations of the patterns. Experiments on the use of this extended training algorithm are reported for both character and page layout clustering.
    12/2007: pages 193-219;
  • ABSTRACT: In this paper, we propose the combination of the self organizing map (SOM) and of the tangent distance for effective clustering in document image analysis. The proposed model (SOM_TD) is used for character and layout clustering, with applications to word retrieval and to page classification. By using the tangent distance it is possible to improve the SOM clustering so as to be more tolerant with respect to small local transformations of the input patterns.
    Image Analysis and Processing, 2007. ICIAP 2007. 14th International Conference on; 10/2007
  • ABSTRACT: In this paper, we describe a system to perform Document Image Retrieval in Digital Libraries. The system allows users to retrieve digitized pages on the basis of layout similarities and to make textual searches on the documents without relying on OCR. The system is discussed in the context of recent applications of document image retrieval in the field of Digital Libraries. We present the different techniques in a single framework in which the emphasis is put on the representation level at which the similarity between the query and the indexed documents is computed. We also report the results of some recent experiments on the use of layout-based document image retrieval. Document Image Retrieval (DIR) aims at identifying relevant documents relying on image features only. To date, the largest portion of documents held by libraries consists of printed books and journals. The electronic counterparts of these physical objects are scanned documents, which are traditionally the main subject of Document Image Analysis and Recognition research, which includes DIR. In this paper, we first review the current research in DIR with special interest in applications to digital libraries. Through this brief analysis we show how DIR techniques can offer new ways to explore large document collections. To support this view, we describe in the rest of the paper a document image retrieval system that has been developed by our research group. The system integrates tools aimed at performing word indexing at the image level with layout-based retrieval components. The paper is organized as follows. In Section 2 we review the recent work on Document Image Retrieval. The proposed system is sketched in Section 3, whereas Sections 4 and 5 analyze the word indexing and layout retrieval, respectively. Our final remarks are drawn in Section 6.
    Research and Advanced Technology for Digital Libraries, 11th European Conference, ECDL 2007, Budapest, Hungary, September 16-21, 2007, Proceedings; 01/2007
  • ABSTRACT: In this paper we propose recognizing logo images by using an adaptive model referred to as a recursive artificial neural network. At first, logo images are converted into a structured representation based on contour-trees. Recursive neural networks are then trained using the contour-trees as inputs. The contour-tree is constructed by associating a node with each exterior or interior contour extracted from the logo instance. Nodes in the tree are labeled by a feature vector, which describes the contour by means of its perimeter, surrounded area, and a synthetic representation of its curvature plot. The contour-tree representation contains the topological structured information of the logo and continuous values pertaining to each contour node. Hence symbolic and sub-symbolic information coexist in the contour-tree representation of a logo image. Experimental results are reported on 40 real logos distorted with artificial noise, and the performance of the recursive neural network is compared with that of two other types of neural approaches.
    11/2006: pages 104-117;
  • ABSTRACT: Taxonomic reasoning is a typical inference task performed by many AI knowledge representation systems. We illustrate the effectiveness of taxonomic reasoning techniques as an active support to knowledge acquisition and schema design in the advanced database environment LOGIDATA+, which supports complex objects and a rule-based language. The underlying idea is that, by extending complex object data models with defined classes, it is possible to infer ISA relationships (i.e. compute subsumption) between classes on the basis of their descriptions. From a theoretical point of view, this approach makes it possible to give a formal definition of consistency to a schema, while, from a pragmatic point of view, it is possible to automatically classify a new class in the correct position of a given taxonomy.
    10/2006: pages 79-84;
  • ABSTRACT: We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self Organizing Maps (SOM) to perform unsupervised character clustering, the definition of a suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 09/2006; 28(8):1187-99. DOI:10.1109/TPAMI.2006.162 · 5.69 Impact Factor
  • ABSTRACT: We describe a system for the retrieval on the basis of layout similarity of document images belonging to collections stored in digital libraries. Layout regions are extracted and represented with the XY tree. The proposed indexing method combines a new tree clustering algorithm (based on self organizing maps) with principal component analysis. The combination of these techniques allows us to retrieve the most similar pages from large collections without the need for a direct comparison of the query page with each indexed document.
    Document Image Analysis for Libraries, 2006. DIAL '06. Second International Conference on; 05/2006
  • ABSTRACT: In this paper, we describe a system capable of extracting textual information from images of structured documents. In particular, the model and the algorithms we describe are used to process forms in which the information fields cannot be located only by their position on the page, but can also be identified after locating the corresponding instruction fields. The proposed model is based on attributed relational graphs and performs form registration and location of information fields using algorithms based on the hypothesize-and-verify paradigm. The location of instruction fields is carried out in a holistic way, by using connectionist models.
    04/2006: pages 438-448;
  • ABSTRACT: In this paper we give a syntax and define a denotational semantics for describing classes in LOGIDATA+, and give a subsumption algorithm that is sound, complete, and polynomial. The syntax allows us to describe both primitive and defined acyclic classes by means of the tuple and set type constructors; in particular, it is also possible to state cardinality constraints on sets. Semantics is defined by means of an interpretation function that maps the descriptions given at an intensional level into the value domain. This interpretation also takes the undefined element nil into account, thus allowing us to deal with a form of incomplete knowledge in a way that is semantically correct. After introducing defined classes and the notion of interpretation function, it is possible to formally define the concept of subsumption, an inference technique that makes it possible to reason about objects and classes on the basis of their descriptions.
    04/2006: pages 85-104;

Publication Stats

2k Citations
59.54 Total Impact Points

Institutions

  • 1988–2010
    • University of Florence
      • Dipartimento di Ingegneria dell'Informazione
      Florence, Tuscany, Italy
  • 2001
    • Data Sciences International
      Maplewood, Minnesota, United States
  • 1999
    • Drug Study Institute
      Jupiter, Florida, United States
  • 1998
    • Università degli Studi di Siena
      • Department of Information Engineering and Mathematical
      Siena, Tuscany, Italy