... Although advanced deep learning models were recently utilized to empower several DLA systems [28][29][30][31], many typical techniques relied on standard pixel-based analysis and consecutive/integrative connected component (CC) analyses. Interestingly, these remain the predominant practical techniques for effective region segmentation and classification, adopted in most DLA frameworks [4], [20], [22], [23], [26]. ...
Vast numbers of handwritten document images, either historical or continuously produced in everyday life, are information-rich and await extensive technological attention from academic and industrial communities. Document layout analysis (DLA) is a master key that opens many doors for further sophisticated document image processing and understanding. It has been a primary interest of extensive research in various domains, including document image retrieval (DIR), whereas little attention has been paid to valuable semantic aspects. Recently, semantic DLA (SDLA) emerged to enable the derivation of semantic information and more invariant characterization. The viability of comparative-based SDLA for inferring comparative semantic characteristics has yet to be investigated. In this research, new comparative characteristics are proposed to empower more perceptive SDLA and improve retrieval capabilities. The proposed Comparative-SDLA not only utilizes the latent potency of semanticity in effectively characterizing handwritten document image layouts but also exploits the power of comparability between relative semantic characteristics for further robust and enhanced DIR. A novel methodological framework is thoroughly described, encompassing pairwise comparative-based automatic image annotation, document ranking by comparative characteristic, and comparative feature extraction. Several retrieval experiments on a sizable, complex handwritten document dataset are conducted, with extended performance evaluation, analysis, and comparison of comparative-based methods against non-comparative counterparts, highlighting promising capabilities that extend to other practical applications.
... Liebl and Burghardt [121] conducted a study to evaluate different deep neural network (DNN)-based page segmentation architectures for historical newspapers. The study aimed to assess the performance of these architectures while considering factors such as the size of the training data, the number of labels, and different settings of tiling (dividing an image into smaller, overlapping or non-overlapping regions, or tiles, for analysis) and scaling (adjusting the size of the image or its components) for separating text, tables, and column lines in newspapers. ...
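To make the tiling idea above concrete, the following is a minimal sketch that cuts a page image into fixed-size tiles with optional overlap. The tile size and stride are illustrative assumptions, not the settings used in the cited study.

```python
# Minimal tiling sketch: cut a 2-D page image into fixed-size, possibly
# overlapping tiles. Tile size and stride are illustrative choices only.
import numpy as np

def tile_image(page: np.ndarray, tile: int = 512, stride: int = 256):
    """Yield (y, x, crop) tiles covering a grayscale page image."""
    h, w = page.shape[:2]
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            yield y, x, page[y:y + tile, x:x + tile]

# Example: a synthetic 2048x1536 page produces 35 overlapping 512x512 tiles.
page = np.zeros((2048, 1536), dtype=np.uint8)
tiles = list(tile_image(page))
print(len(tiles), tiles[0][2].shape)
```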
Historical document processing (HDP) corresponds to the task of converting the physically bound form of historical archives into a web-based, centrally digitized form for their conservation, preservation, and ubiquitous access. Besides the conservation of these invaluable historical collections, the key agenda is to make these geographically distributed historical repositories available for information mining and retrieval in a web-centralized, touchless mode. Being a matter of interest for interdisciplinary scholars, the endeavor has garnered the attention of many researchers, resulting in an immense body of literature dedicated to digitization strategies. The present study first assembles the prevalent tasks essential for HDP into a pipeline and frames an outline for a generic workflow for historical document digitization. Then, it reports the latest task-specific state of the art, giving a brief discourse on the methods and open challenges in handling historical printed and handwritten script images. Next, grounded on various layout attributes, it discusses the evaluation metrics and datasets available for observational and analytical purposes. The current study attempts to trace the contours of ongoing research and its bottlenecks, thus providing readers with a comprehensive view and understanding of existing studies and unfolding open avenues for future work.
... Performing consecutive or cumulative connected component (CC) and pixel analyses on a document image was a typical dominant technique employed to initially identify regions and then classify them, as adopted by the majority of proposed DLA systems [17,19,22]. Furthermore, advanced deep learning models were also used for empowering different DLA frameworks [7][8][9][23]. ...
A document layout can be more informative than merely a document’s visual and structural appearance. Thus, document layout analysis (DLA) is considered a necessary prerequisite for advanced processing and detailed document image analysis to be further used in several applications and different objectives. This research extends the traditional approaches of DLA and introduces the concept of semantic document layout analysis (SDLA) by proposing a novel framework for semantic layout analysis and characterization of handwritten manuscripts. The proposed SDLA approach enables the derivation of implicit information and semantic characteristics, which can be effectively utilized in dozens of practical applications for various purposes, in a way bridging the semantic gap and providing more understandable high-level document image analysis and more invariant characterization via absolute and relative labeling. This approach is validated and evaluated on a large dataset of Arabic handwritten manuscripts comprising complex layouts. The experimental work shows promising results in terms of accurate and effective semantic characteristic-based clustering and retrieval of handwritten manuscripts. It also indicates the expected efficacy of using the capabilities of the proposed approach in automating and facilitating many functional, real-life tasks such as effort estimation and pricing of transcription or typing of such complex manuscripts.
... Along with conventional methods, machine learning-based works are also used for layout segmentation. According to a survey in [27], deep neural networks (DNNs) are used for segmenting pages of historical documents. However, the text and non-text detection methods and page layout analysis methods discussed so far only work with fully decompressed documents, as illustrated in Figure 1, and they cannot be directly applied to compressed document images. ...
JPEG 2000 is a popular image compression technique that uses the Discrete Wavelet Transform (DWT) for compression and subsequently provides many rich features for efficient storage and decompression. Although compressed images are preferred for archival and communication purposes, their processing becomes difficult due to the overhead of decompression and re-compression operations, which are needed every time the data must be operated on. Therefore, in this research paper, the novel idea of operating directly on JPEG 2000 compressed documents is proposed for extracting text and non-text regions without using any segmentation algorithm. The technique avoids full decompression of the compressed document, in contrast to conventional methods, which fully decompress and then process. Moreover, JPEG 2000 features are explored in this research work to partially and intelligently decompress only selected regions of interest at different resolutions and bit depths to accomplish segmentation-less extraction of text and non-text regions. Finally, the Maximally Stable Extremal Regions (MSER) algorithm is used to extract the layout of segmented text and non-text regions for further analysis. Experiments have been carried out on the standard PRImA Layout Analysis Dataset, leading to promising results and saving computational resources.
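For illustration, the sketch below shows only the final MSER step on an already-decoded page region using OpenCV; the JPEG 2000 partial-decoding stage is specific to the cited work and is not reproduced here, and the file name is a placeholder.

```python
# Sketch of the MSER layout-extraction step only, applied to a decoded region.
import cv2

gray = cv2.imread("page_region.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)  # stable extremal regions + boxes

vis = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
for x, y, w, h in bboxes:
    cv2.rectangle(vis, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imwrite("page_region_mser.png", vis)
```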
... In our primary research area, the use of convolutional neural networks (CNNs) for layout analysis and Optical Character Recognition (OCR) of historical Chinese documents is expanding the quality and scope of available sources (Liebl & Burghardt, 2020; Oliveira et al., 2018). It also places high demands on our ability to re-use electronic resources as training data. ...
It is tempting to assume that FAIR data principles effectively apply globally. In practice, digital research platforms play a central role in ensuring the applicability of these principles to research exchange, where the General Data Protection Regulation (EU) and the Multi-Level Protection Scheme 2.0 (PRC) provide the overarching legal frameworks. For this article, we conduct a systematic review of research into Chinese Republican newspapers as it appears in Chinese academic journal databases. We experimentally compare the results of repeated search runs using different interfaces and with different points of origin. We then analyze our results regarding the practical and technical accessibility conditions. We conclude with an analysis of conceptual mismatches surrounding the classification of items as "full-text", and of a case of total data loss that is nevertheless symptomatic of the limited degree of data re-usability. Our results show structural challenges preventing Findability, Accessibility, Interoperability, and Re-usability from being put into practice. Since these experiments draw upon our Digital Humanities (DH) research, we include a state-of-the-field overview of historical periodicals and digitization research in the PRC. Our research on the one hand addresses DH practitioners interested in digital collections and technical aspects of document processing with a focus on historical Chinese sources. On the other hand, our experience is helpful to researchers irrespective of the topic. Our article is accompanied by a data publication containing the sources and results of our experiments, as well as an online bibliography of the research articles we collected.
... [5] proposed a pipeline for extracting and searching visual content from historic print newspaper scans. A fully convolutional segmentation approach has been applied by [8] and [6] to extract content blocks from newspaper images. ...
Accessing daily news content remains a big challenge for people with print impairment, including those who are blind or have low vision, due to the opacity of printed content and hindrances in online sources. In this paper, we present our approach for digitizing print newspapers into an accessible file format such as HTML. We use an ensemble of instance segmentation and detection frameworks for newspaper layout analysis and then OCR to recognize text elements such as headlines and article text. Additionally, we propose the EdgeMask loss function for the Mask-RCNN framework to improve segmentation mask boundaries and hence the accuracy of the downstream OCR task. Empirically, we show that our proposed loss function reduces the Word Error Rate (WER) of news article text by 32.5%.
... They use a convolutional neural network architecture for this purpose. (Liebl and Burghardt, 2020) employ a transfer learning approach that improves the generalization performance of the previous approach. ...
The task of automated recognition of handwritten texts requires various phases and technologies, both optical and language-related. This article describes an approach for performing this task in a comprehensive manner, using machine learning throughout all phases of the process. In addition to explaining the employed methodology, it describes the process of building and evaluating a model of manuscript recognition for the Spanish language. The original contribution of this article is the training and evaluation of offline HTR models for Spanish-language manuscripts, as well as the evaluation of a platform to perform this task in a complete way. In addition, it details the work being carried out to achieve improvements in the models obtained and to develop new models for different complex corpora that are more difficult for the HTR task.
... dhSegment is a deep learning framework for historical document processing, including pixel-wise segmentation and extraction tasks [19]. Liebl and Burghardt benchmarked 11 different deep learning backbones for the pixel-wise segmentation of historic newspapers, including the separation of layout features such as text and tables [41]. The AIDA collaboration at the University of Nebraska-Lincoln has applied deep learning techniques to newspaper corpora including Chronicling America and the Burney Collection of British Newspapers [44,45,46] for tasks such as poetic content recognition [47,68], as well as visual content recognition using dhSegment [48]. ...
Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic newspapers. Over 16 million pages of historic American newspapers have been digitized for Chronicling America to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons collected as part of the Library of Congress's Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings for fast image similarity querying. We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus and describe the resulting Newspaper Navigator dataset, the largest dataset of extracted visual content from historic newspapers ever produced. The Newspaper Navigator dataset, finetuned visual content recognition model, and all source code are placed in the public domain for unrestricted re-use.
Historical newspapers serve as invaluable resources for understanding past societies and preserving cultural heritage. However, digitizing these newspapers presents challenges due to their complex layouts and vast content. Article segmentation, involving the identification and extraction of individual articles from scanned newspaper images, is crucial for efficient information mining and retrieval. While some rule-based algorithms have been proposed, the applicability of deep neural networks (DNNs) for this task has recently gained attention. In this work, we explore the applicability of transfer learning to segment articles from historical newspaper images. For this, we employed nine pre-trained backbone architectures, specifically selected from the ResNet family, and proposed a bounding-box approximation based article segmentation module designed specifically for the task. Furthermore, we introduced a mean estimated article coverage metric that computes the segmentation capability of a model on an article-level. Experiments were conducted on the NAS dataset (NewsEye Article Separation), ensuring the relevance of our approach to historical data. Our study evaluates the performance of various pre-trained models, achieving a mean estimated article coverage of 0.956, 0.969, and 0.995 on the ONB, NLF, and BNF datasets, respectively. These findings underscore the effectiveness of transfer learning in adapting to historical layout analysis tasks, particularly article segmentation. Moreover, these results reaffirm the significance of transfer learning and pre-trained models as efficient tools for handling complex historical newspaper layouts.
Layout analysis is a crucial stage in the recognition system for newspapers; a good layout analysis yields better recognition results. In this paper, we detect the complex layout of newspapers in the Gurumukhi script using a hybrid approach. First, we propose an algorithm to remove pictures from newspaper images that involves various image preprocessing tasks based on binarization, finding contours, and erosion to remove the graphics from the image. This method also removes pictures from complex non-Manhattan layouts. Finally, we train a deep-learning model based on a convolutional network to detect the columns of text in newspapers. We created a dataset of 500 images labelled with five classes, on which the model was trained, and tested the method on a number of newspapers in the Gurumukhi script. The results show very good accuracy with this hybrid approach to layout detection.
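The following is an illustrative OpenCV sketch of the kind of preprocessing named above (binarization, contour detection, erosion) for masking out large graphic regions. The threshold type, kernel size, and area cut-off are assumptions for demonstration, not the authors' settings.

```python
# Heuristic picture-removal sketch: binarize, erode away thin text strokes,
# then whiten any remaining large blob (assumed to be a picture).
import cv2
import numpy as np

gray = cv2.imread("newspaper_page.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Erosion removes thin text strokes; dense picture areas survive as large blobs.
kernel = np.ones((5, 5), np.uint8)
eroded = cv2.erode(binary, kernel, iterations=2)

contours, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cleaned = gray.copy()
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 0.02 * gray.size:          # treat very large blobs as pictures
        cleaned[y:y + h, x:x + w] = 255    # whiten the picture region
cv2.imwrite("newspaper_page_text_only.png", cleaned)
```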
Thumbnails provide a concise description of a collection of images; we iterate through large datasets to train our deep learning model to capture text-whitened thumbnails. Many court cases and forensic investigations have involved thumbnail pictures on laptops or mobile devices, and millions of thumbnails are often ready for digital forensics experts to export. Machine learning can quickly help identify an investigation's targets, and text or object recognition is a primary solution for document digitization and forensic analysis. Inspired by the recent success of the Transformer in many applications, in this paper we adopt transfer learning as an effective method of achieving excellent performance with a noisy training dataset of thumbnails. This deep learning model investigates the pre-trained models in the TorchVision package, employing tensors and CUDA GPU parallel computing to drive the OCR (Optical Character Recognition) engine. We report the preliminary results of our methods to help digital forensics experts identify their targets robustly and efficiently.
Image downscaling is an essential operation to reduce spatial complexity for various applications and is becoming increasingly important due to the growing number of solutions that rely on memory-intensive approaches, such as applying deep convolutional neural networks to semantic segmentation tasks on large images. Although conventional content-independent image downscaling can efficiently reduce complexity, it is vulnerable to losing perceptual details, which are important to preserve. Alternatively, existing content-aware downscaling severely distorts spatial structure and is not effectively applicable for segmentation tasks involving document images. In this paper, we propose a novel image downscaling approach that combines the strengths of both content-independent and content-aware strategies. The approach limits the sampling space per the content-independent strategy, adaptively relocating such sampled pixel points, and amplifying their intensities based on the local gradient and texture via the content-aware strategy. To demonstrate its effectiveness, we plug our adaptive downscaling method into a deep learning-based document image segmentation pipeline and evaluate the performance improvement. We perform the evaluation on three publicly available historical newspaper digital collections with differences in quality and quantity, comparing our method with one widely used downscaling method, Lanczos. We further demonstrate the robustness of the proposed method by using three different training scenarios: stand-alone, image-pyramid, and augmentation. The results show that training a deep convolutional neural network using images generated by the proposed method outperforms Lanczos, which relies on only content-independent strategies.
Due to the idiosyncrasies of historical document images (HDI), growing attention has been paid over the last decades to proposing robust HDI analysis solutions. Many research studies have shown that Gabor filters are among the low-level descriptors that best characterize texture information in HDI. On the other hand, deep neural networks (DNN) have been successfully used for HDI segmentation. As a consequence, we propose in this paper an HDI segmentation method based on combining Gabor features and DNN. The segmentation method focuses on classifying each document image pixel as either graphic, text, or background. The novelty of the proposed method lies mainly in feeding a DNN with a Gabor-filtered image (obtained by applying specific multichannel Gabor filters) instead of an original image as input. The proposed method is decomposed into three steps: a) filtered image generation using Gabor filters, b) feature learning with a stacked autoencoder, and c) image segmentation with a 2D U-Net. In order to evaluate its performance, experiments are conducted using two different datasets. The results are reported and compared with those of a recent state-of-the-art method. Keywords: Historical document image segmentation, Gabor filters, Deep neural networks, Stacked autoencoder, 2D U-Net architecture
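As a minimal sketch of step a) above, the snippet below builds a small multichannel Gabor filter bank with OpenCV and stacks the responses into a "Gabor image" that could stand in for the raw page as network input. The filter-bank parameters and file name are illustrative assumptions, not those of the cited method.

```python
# Multichannel Gabor filtering sketch: 4 orientations x 2 wavelengths.
import cv2
import numpy as np

gray = cv2.imread("historical_page.png", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0

responses = []
for theta in np.arange(0, np.pi, np.pi / 4):           # 4 orientations
    for lambd in (8.0, 16.0):                          # 2 wavelengths
        kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=4.0, theta=theta,
                                    lambd=lambd, gamma=0.5, psi=0)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))

# Stack the 8 responses into a multichannel image for a segmentation network.
gabor_image = np.stack(responses, axis=-1)
print(gabor_image.shape)  # (H, W, 8)
```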
Print-oriented PDF documents are excellent at preserving the position of text and other objects but are difficult to process. Processable PDF documents will provide solutions to the unique needs of different sectors by paving the way for many innovations, such as searching within documents, linking with different documents, or restructuring in a format that improves the reading experience. In this chapter, a deep learning-based system design is presented that aims to export clean text content, separate all visual elements, and extract rich information from the content without losing the integrated structure of content types. While the F-RCNN model from the Detectron2 library was used to extract the layout, the cosine similarities between the word2vec representations of the texts were used to identify related clips, and transformer language models were used to classify the clip type. The performance values on the 200-sample dataset created by the researchers were 1.87 WER and 2.11 CER for headings and 0.22 WER and 0.21 CER for paragraphs.
Data augmentation is a commonly used technique for increasing both the size and the diversity of labeled training sets by leveraging input transformations that preserve corresponding output labels. In computer vision, image augmentations have become a common implicit regularization technique to combat overfitting in deep learning models and are ubiquitously used to improve performance. While most deep learning frameworks implement basic image transformations, the list is typically limited to some variations of flipping, rotating, scaling, and cropping. Moreover, image processing speed varies in existing image augmentation libraries. We present Albumentations, a fast and flexible open source library for image augmentation with many various image transform operations available that is also an easy-to-use wrapper around other augmentation libraries. We discuss the design principles that drove the implementation of Albumentations and give an overview of the key features and distinct capabilities. Finally, we provide examples of image augmentations for different computer vision tasks and demonstrate that Albumentations is faster than other commonly used image augmentation tools on most image transform operations.
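A short usage example of the Albumentations API described above follows; the particular transforms, probabilities, and crop size are arbitrary illustrative choices.

```python
# Albumentations usage sketch: compose a few common augmentations and apply
# them to a dummy image.
import numpy as np
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.Rotate(limit=10, p=0.5),
    A.RandomCrop(height=256, width=256, p=1.0),
])

image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # dummy image
augmented = transform(image=image)["image"]
print(augmented.shape)  # (256, 256, 3)
```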
Background:
To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has yet been reached on a unified, preferred measure. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics adopted in binary classification tasks. However, these statistical measures can dangerously show overoptimistic, inflated results, especially on imbalanced datasets.
Results:
The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset.
Conclusions:
In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
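The small sketch below contrasts accuracy, F1, and MCC on a synthetic, heavily imbalanced prediction using scikit-learn; the numbers are made up purely to illustrate the overoptimism the authors warn about.

```python
# 95 positives, 5 negatives; a trivial classifier predicts "positive" for all.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = np.array([1] * 95 + [0] * 5)
y_pred = np.ones_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))     # 0.95 (looks great)
print("F1      :", f1_score(y_true, y_pred))           # ~0.974 (looks great)
print("MCC     :", matthews_corrcoef(y_true, y_pred))  # 0.0 (uninformative model)
```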
Recent interest in complex and computationally expensive machine learning models with many hyperparameters, such as automated machine learning (AutoML) frameworks and deep neural networks, has resulted in a resurgence of research on hyperparameter optimization (HPO). In this chapter, we give an overview of the most prominent approaches for HPO. We first discuss blackbox function optimization methods based on model-free methods and Bayesian optimization. Since the high computational demand of many modern machine learning applications renders pure blackbox optimization extremely costly, we next focus on modern multi-fidelity methods that use (much) cheaper variants of the blackbox function to approximately assess the quality of hyperparameter settings. Lastly, we point to open problems and future research directions.
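As a minimal illustration of the model-free blackbox optimization the chapter starts from, the following random-search loop samples two hyperparameters log-uniformly; the objective function and search ranges are toy placeholders, not tied to any specific learner.

```python
# Random-search HPO sketch over two hyperparameters of a fake objective.
import math
import random

def objective(learning_rate: float, weight_decay: float) -> float:
    """Stand-in for a validation loss returned by an expensive training run."""
    return (math.log10(learning_rate) + 3) ** 2 + (math.log10(weight_decay) + 4) ** 2

random.seed(0)
best = None
for _ in range(50):
    config = {
        "learning_rate": 10 ** random.uniform(-5, -1),  # log-uniform sampling
        "weight_decay": 10 ** random.uniform(-6, -2),
    }
    loss = objective(**config)
    if best is None or loss < best[0]:
        best = (loss, config)

print("best loss %.4f with %s" % best)
```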
This work presents an in-depth analysis of the majority of the deep neural networks (DNNs) proposed in the state of the art for image recognition. For each DNN multiple performance indices are observed, such as recognition accuracy, model complexity, computational complexity, memory usage, and inference time. The behavior of such performance indices and some combinations of them are analyzed and discussed. To measure the indices we experiment the use of DNNs on two different computer architectures, a workstation equipped with a NVIDIA Titan X Pascal and an embedded system based on a NVIDIA Jetson TX1 board. This experimentation allows a direct comparison between DNNs running on machines with very different computational capacity. This study is useful for researchers to have a complete view of what solutions have been explored so far and in which research directions are worth exploring in the future; and for practitioners to select the DNN architecture(s) that better fit the resource constraints of practical deployments and applications. To complete this work, all the DNNs, as well as the software used for the analysis, are available online.
Semantic image segmentation, which has become one of the key applications in the image processing and computer vision domain, has been used in multiple fields such as medicine and intelligent transportation. Many benchmark datasets have been released for researchers to verify their algorithms. Semantic segmentation has been studied for many years, and since the emergence of Deep Neural Networks (DNN), it has made tremendous progress. In this paper, we divide semantic image segmentation methods into two categories: traditional and recent DNN-based methods. Firstly, we briefly summarize the traditional methods as well as the datasets released for segmentation; then we comprehensively investigate recent DNN-based methods, described from eight aspects: fully convolutional networks, up-sampling approaches, FCN joint with CRF methods, dilated convolution approaches, progress in backbone networks, pyramid methods, multi-level feature and multi-stage methods, and supervised, weakly-supervised, and unsupervised methods. Finally, a conclusion in this area is drawn.
Many deep neural networks trained on natural images exhibit a curious phenomenon in common: on the first layer they learn features similar to Gabor filters and color blobs. Such first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks. Features must eventually transition from general to specific by the last layer of the network, but this transition has not been studied extensively. In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few surprising results. Transferability is negatively affected by two distinct issues: (1) the specialization of higher layer neurons to their original task at the expense of performance on the target task, which was expected, and (2) optimization difficulties related to splitting networks between co-adapted neurons, which was not expected. In an example network trained on ImageNet, we demonstrate that either of these two issues may dominate, depending on whether features are transferred from the bottom, middle, or top of the network. We also document that the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features. A final surprising result is that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
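A compact sketch of the feature-transfer recipe discussed above: copy the bottom layers of an ImageNet-pretrained network, freeze them, and retrain a new head on the target task. The split point, the 10-class head, and the optimizer settings are arbitrary choices; the weights enum assumes a recent torchvision.

```python
# Transfer-learning sketch: frozen pretrained backbone + new trainable head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze all transferred features, then replace the classifier so only the
# new head trains on the target task.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # hypothetical 10-class target task

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
dummy = torch.randn(4, 3, 224, 224)
logits = model(dummy)
print(logits.shape)  # torch.Size([4, 10])
```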
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
In this work, we consider the evaluation of the semantic segmentation task. We discuss the strengths and limitations of the few existing measures, and propose new ways to evaluate semantic segmentation. First, we argue that a per-image score instead of one computed over the entire dataset brings a lot more insight. Second, we propose to take contours more carefully into account. Based on the conducted experiments, we suggest best practices for the evaluation. Finally, we present a user study we conducted to better understand how the quality of image segmentations is perceived by humans.
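The sketch below contrasts per-image IoU scores with a single dataset-level IoU for a binary class, in the spirit of the per-image argument above; the masks are synthetic.

```python
# Per-image IoU vs dataset-level IoU on synthetic binary masks.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0

rng = np.random.default_rng(0)
preds = [rng.random((64, 64)) > 0.5 for _ in range(3)]
gts = [rng.random((64, 64)) > 0.5 for _ in range(3)]

per_image = [iou(p, g) for p, g in zip(preds, gts)]
dataset_level = iou(np.concatenate([p.ravel() for p in preds]),
                    np.concatenate([g.ravel() for g in gts]))
print("per-image:", [round(s, 3) for s in per_image], "dataset:", round(dataset_level, 3))
```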
Informative benchmarks are crucial for optimizing the page segmentation step of an OCR system, frequently the performance limiting step for overall OCR system performance. We show that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. This paper introduces a vectorial score that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and what page components (lines, blocks, etc.) are affected. Unlike previous schemes, our evaluation method has a canonical representation of ground truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes. We present the results of evaluating widely used segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database and demonstrate that the new evaluation scheme permits the identification of several specific flaws in individual segmentation methods.
The machine learning community has been overwhelmed by a plethora of deep learning based approaches. Many challenging computer vision tasks such as detection, localization, recognition and segmentation of objects in unconstrained environment are being efficiently addressed by various types of deep neural networks like convolutional neural networks, recurrent networks, adversarial networks, autoencoders and so on. While there have been plenty of analytical studies regarding the object detection or recognition domain, many new deep learning techniques have surfaced with respect to image segmentation techniques. This paper approaches these various deep learning techniques of image segmentation from an analytical perspective. The main goal of this work is to provide an intuitive understanding of the major techniques that has made significant contribution to the image segmentation domain. Starting from some of the traditional image segmentation approaches, the paper progresses describing the effect deep learning had on the image segmentation domain. Thereafter, most of the major segmentation algorithms have been logically categorized with paragraphs dedicated to their unique contribution. With an ample amount of intuitive explanations, the reader is expected to have an improved ability to visualize the internal dynamics of these processes.
We propose a high-performance fully convolutional neural network (FCN) for historical handwritten document segmentation that is designed to process a single page in one step. The advantage of this model, besides its speed, is its ability to learn directly from raw pixels instead of using preprocessing steps, e.g., feature computation or superpixel generation. We show that this network yields better results than existing methods on different public data sets. For the evaluation of this model, we introduce a novel metric that is independent of ambiguous ground truth, called Foreground Pixel Accuracy (FgPA). This pixel-based measure only counts foreground pixels in the binarized page; any background pixel is omitted. The major advantage of this metric is that it enables researchers to compare different segmentation methods on their ability to successfully segment text or pictures, and not on their ability to learn and possibly overfit the peculiarities of an ambiguous hand-made ground truth segmentation.
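The snippet below sketches the foreground-only accuracy idea: score label agreement only on pixels that are foreground (ink) in the binarized page and ignore all background pixels. It follows the description above, not necessarily the paper's exact FgPA formulation.

```python
# Foreground-only pixel accuracy sketch.
import numpy as np

def foreground_pixel_accuracy(pred_labels, gt_labels, binarized):
    """binarized: boolean mask, True where the page has ink (foreground)."""
    fg = binarized.astype(bool)
    if fg.sum() == 0:
        return 1.0
    return float((pred_labels[fg] == gt_labels[fg]).mean())

# Tiny example with labels 0 = background, 1 = text, 2 = picture.
gt = np.array([[1, 1, 0], [2, 2, 0]])
pred = np.array([[1, 2, 0], [2, 2, 1]])
ink = np.array([[1, 1, 0], [1, 1, 0]], dtype=bool)   # 4 foreground pixels
print(foreground_pixel_accuracy(pred, gt, ink))       # 0.75
```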
We note that common implementations of adaptive gradient algorithms, such as Adam, limit the potential benefit of weight decay regularization, because the weights do not decay multiplicatively (as would be expected for standard weight decay) but by an additive constant factor. We propose a simple way to resolve this issue by decoupling weight decay and the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). We also demonstrate that longer optimization runs require smaller weight decay values for optimal results and introduce a normalized variant of weight decay to reduce this dependence. Finally, we propose a version of Adam with warm restarts (AdamWR) that has strong anytime performance while achieving state-of-the-art results on CIFAR-10 and ImageNet32x32. Our source code is available at https://github.com/loshchil/AdamW-and-SGDW
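Decoupled weight decay as described above is available in PyTorch as torch.optim.AdamW; a short usage example follows, with illustrative hyperparameter values only.

```python
# Adam applies weight decay through the gradient (an L2 penalty), whereas
# AdamW decays the weights directly, decoupled from the adaptive update.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
adam  = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)   # coupled
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # decoupled

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
adamw.step()  # one decoupled-weight-decay update
```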
Deep-learning has proved in recent years to be a powerful tool for image analysis and is now widely used to segment both 2D and 3D medical images. Deep-learning segmentation frameworks rely not only on the choice of network architecture but also on the choice of loss function. When the segmentation process targets rare observations, a severe class imbalance is likely to occur between candidate labels, thus resulting in sub-optimal performance. In order to mitigate this issue, strategies such as the weighted cross-entropy function, the sensitivity function or the Dice loss function, have been proposed. In this work, we investigate the behavior of these loss functions and their sensitivity to learning rate tuning in the presence of different rates of label imbalance across 2D and 3D segmentation tasks. We also propose to use the class re-balancing properties of the Generalized Dice overlap, a known metric for segmentation assessment, as a robust and accurate deep-learning loss function for unbalanced tasks.
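A compact PyTorch sketch of a Generalized Dice loss with inverse squared-volume class weights, following the description above; this is not claimed to be the authors' reference implementation, and the toy tensors are synthetic.

```python
# Generalized Dice loss sketch: per-class weights are 1 / (class volume)^2.
import torch

def generalized_dice_loss(probs: torch.Tensor, target_onehot: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """probs, target_onehot: (N, C, ...) tensors with matching shapes."""
    dims = (0,) + tuple(range(2, probs.dim()))          # sum over batch + space
    w = 1.0 / (target_onehot.sum(dim=dims) ** 2 + eps)  # per-class weights
    intersect = (w * (probs * target_onehot).sum(dim=dims)).sum()
    denom = (w * (probs + target_onehot).sum(dim=dims)).sum()
    return 1.0 - 2.0 * intersect / (denom + eps)

# Toy 2-class example with a heavily imbalanced foreground.
probs = torch.softmax(torch.randn(2, 2, 32, 32), dim=1)
target = torch.zeros(2, 2, 32, 32)
target[:, 0] = 1.0
target[:, 0, :2, :2] = 0.0
target[:, 1, :2, :2] = 1.0   # tiny foreground class
print(generalized_dice_loss(probs, target).item())
```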
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
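torchvision ships a fully convolutional segmentation model in this spirit; a short, hedged usage example follows (the weights enum requires a recent torchvision version, and the input here is a dummy tensor rather than a normalized real image).

```python
# FCN usage sketch with a ResNet-50 backbone from torchvision.
import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

model = fcn_resnet50(weights=FCN_ResNet50_Weights.DEFAULT).eval()

image = torch.randn(1, 3, 520, 520)      # dummy pre-normalized input
with torch.no_grad():
    out = model(image)["out"]            # (1, 21, 520, 520) per-class scores
pred = out.argmax(dim=1)                 # per-pixel class map
print(pred.shape)                        # torch.Size([1, 520, 520])
```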
M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio, "Transfusion: Understanding Transfer Learning for Medical Imaging," arXiv:1902.07208 [cs, stat], Oct. 2019.
M. Zhang and J. Lucas, "Lookahead Optimizer: K steps forward, 1 step back," in Advances in Neural Information Processing Systems 32 (NIPS 2019), H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Vancouver, Canada: Curran Associates, Inc., 2019, pp. 9597–9608.
M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, CA, USA: PMLR, Jun. 2019, pp. 6105–6114.