HRADIŠ et al.: CNN FOR DIRECT TEXT DEBLURRING 1
Convolutional Neural Networks for Direct
Text Deblurring
Michal Hradiš1
ihradis@fit.vutbr.cz
Jan Kotera2
kotera@utia.cas.cz
Pavel Zemčík1
zemcik@fit.vutbr.cz
Filip Šroubek2
sroubekf@utia.cas.cz
1Faculty of Information Technology
Brno University of Technology
Brno, Czech Republic
2Institute of Information Theory and
Automation
Czech Academy of Sciences
Prague, Czech Republic
Abstract
In this work we address the problem of blind deconvolution and denoising. We fo-
cus on restoration of text documents and we show that this type of highly structured
data can be successfully restored by a convolutional neural network. The networks are
trained to reconstruct high-quality images directly from blurry inputs without assuming
any specific blur and noise models. We demonstrate the performance of the convolutional
networks on a large set of text documents and on a combination of realistic de-focus and
camera shake blur kernels. On this artificial data, the convolutional networks signifi-
cantly outperform existing blind deconvolution methods, including those optimized for
text, in terms of image quality and OCR accuracy. In fact, the networks outperform even
state-of-the-art non-blind methods for anything but the lowest noise levels. The approach
is validated on real photos taken by various devices.
1 Introduction
Taking pictures of text documents using hand-held cameras has become commonplace in
casual situations such as when digitizing receipts, hand-written notes, and public informa-
tion signboards. However, the resulting images are often degraded due to camera shake,
improper focus, image noise, image compression, low resolution, poor lighting, or reflections.
We have selected the restoration of such images as a representative of tasks for which
current blind deconvolution methods are not particularly suited. They are not equipped
to model some of the image degradations, and they do not take full advantage of the available
knowledge of the image content. Moreover, text contains a large number of small details
which need to be preserved to retain legibility. We propose a deblurring method based on a
convolutional network which learns to restore images from data.
In its idealized form, blind deconvolution is defined as the task of finding an original
image x, and possibly a convolution kernel k, from an observed image y which is created as
$$y = x * k + n, \tag{1}$$
© 2015. The copyright of this document resides with its authors.
It may be distributed unchanged freely in print or electronic forms.
Figure 1: Real photos deblurred with a convolutional neural network.
where n is independent additive noise. This is an ill-posed problem with an infinite number
of solutions. The problem may remain ill-posed even in the absence of noise and when the
blur kernel is known (non-blind deconvolution). Fortunately, real images are not completely
random and the knowledge of their statistics can be used to constrain the solution.
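As a minimal sketch of the idealized forward model in Eq. (1), the following snippet blurs a toy image with a kernel and adds Gaussian noise. The checkerboard image, the 3×3 box kernel, the noise level, and the helper names `convolve2d_valid` and `degrade` are illustrative assumptions and are not taken from the experiments in this paper.

```python
import random


def convolve2d_valid(image, kernel):
    """Valid-mode 2-D convolution of a grayscale image (list of rows) with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # True convolution flips the kernel along both axes.
    flipped = [row[::-1] for row in kernel[::-1]]
    return [[sum(image[i + a][j + b] * flipped[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def degrade(image, kernel, noise_std, rng):
    """Simulate y = x * k + n with independent Gaussian noise n."""
    blurred = convolve2d_valid(image, kernel)
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in blurred]


rng = random.Random(0)
# Toy 6x6 checkerboard standing in for a sharp image x (assumption).
x = [[1.0 if (i + j) % 2 == 0 else 0.0 for j in range(6)] for i in range(6)]
# 3x3 box blur whose entries sum to one (assumption).
k = [[1.0 / 9.0] * 3 for _ in range(3)]
y = degrade(x, k, noise_std=0.01, rng=rng)
```

Even in this toy case the blurred checkerboard is nearly flat, which hints at why recovering x from y alone is ill-posed.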
A possible solution to the deconvolution problem is a minimization of a suitable cost
function, e.g.
$$\hat{x} = \arg\min_{x} \|y - x * k\|_2^2 + r(x). \tag{2}$$
Here the data term $\|y - x * k\|_2^2$ forces the solution $\hat{x}$ to be a plausible source of the observed
image. The regularizing prior term $r(x)$ expresses knowledge of the image statistics and
forces the solution to be a “nice” image. Unfortunately, the process of capturing images with
real cameras is much more complex than Eq. (1) suggests. Atmosphere scatters incoming
light (haze, fog). Instead of being space-invariant, real blur depends on the position within
the image, on object distance, on object motion, and it interacts with chromatic aberrations of
the lens. The linear convolution assumption does not hold either. The image can be saturated,
cameras apply color and gamma correction, and colors are interpolated from neighboring
pixels due to the Bayer mask. Many cameras even apply denoising and sharpening filters and
compress the image with lossy JPEG compression.
Some of the aspects of the real imaging process can be incorporated into Eq. (2), for ex-
ample by modification of the data term, but doing so increases computational complexity and
is often too difficult (e.g. space-variant blur [32], saturation [9], or non-Gaussian noise [4]).
It is sometimes not practical to model the imaging process fully and the unmodeled phenomena
are left to be handled as noise.
In our work we propose an alternative approach to deconvolution by directly modeling
the restoration process as a general function
$$\hat{x} = F(y, \theta) \tag{3}$$
with parameters θ which can be learned from data. The function F(y, θ) implicitly incorporates
both the data term and the prior term from Eq. (2). The advantage of this data-driven
approach is that it is relatively easy to model the forward imaging process and to generate
(x, y) pairs. In many applications, it is even possible to get the (x, y) pairs from a real imaging
process. Consequently, learning the restoration function F(y, θ) can be straightforward
even for a complex imaging system, as opposed to the extremely hard task of inverting the
imaging process by hand.
Convolutional neural networks (CNN) are the state-of-the-art function approximators for
computer vision problems and are well suited for the deconvolution task – they can naturally
incorporate inverse filters and, we believe, they possess enough computational capacity to
model the blind deconvolution process, including strong data priors.
We investigate whether CNNs are able to learn an end-to-end mapping for deblurring of text
documents. Text is highly structured and the strong prior should allow images to be deconvolved
locally even in a blind setting, without the need to extract information from a large image
neighborhood.
CNNs were previously used to learn denoising [16], structured noise removal [11], non-
blind deconvolution [26,35], and sub-tasks of blind deconvolution [27]. Our work is the
first to demonstrate that CNNs are able to perform state-of-the-art blind deconvolution (see
Figure 1). Locality, data-driven nature, and the ability to capture strong priors make the
approach robust, efficient, easy to use, and easy to adapt to new data.
2 Related work
The key idea of general modern blind deblurring methods is to address the ill-posedness
of blind deconvolution by a suitable choice of prior (which then forms the regularizer rin
Eq. (2)), for example by using natural image statistics, or by otherwise modifying either the
minimized functional or the optimization process. This started with the work of Fergus et
al. [12], who applied variational Bayes to approximate the posterior. Other authors (e.g.
[1,2,18,20,28,33,34]) maximize the posterior by an alternating blur-image approach and,
in some cases, use rather ad hoc steps to obtain an acceptable solution. Levin et al. [22,23]
showed that marginalizing the posterior with respect to the latent image leads to the correct
solution of the PSF, while correct prior alone might not – therefore these ad hoc steps are in
fact often crucial for some deconvolution methods to work.
Space-variant blind deconvolution is an even more challenging problem, as the PSF also
depends on the position and the problem thus has many more unknowns. In such cases, the
space of possible blurs is often limited, for example to camera rotations. The blur operator
can be then expressed as a linear combination of a small number of base blurs, and the
blind problem is solved in the space spanned by such basis [15,19,32]. Successful image
deblurring is further complicated by e.g. saturated pixels [9], the problem of unknown image
boundary in convolution [1], non-linear post-processing by the camera, and many more. In
short, modeling the whole process is, while perhaps possible, simply too complicated and
even state-of-the-art deblurring methods do not attempt to include all degradation factors at
once.
Images of text are markedly different from images of natural scenes. One of the first
methods specialized for text-image deblurring was [25], where Panci et al. modeled the text
image as a random field and used the Bussgang algorithm to recover the sharp image. Cho
et al. [7] segment the image into background and characters, for which they use different
priors based on properties characteristic of images of text. An even more recent and arguably
state-of-the-art approach is the method of Pan et al. [24], which uses a sparse L0 prior on image
gradients and on pixel intensities. Otherwise, these methods follow the established pipeline
of first estimating the blur kernel and then deconvolving the blurred image using a non-blind
method, which includes all the complications mentioned before.
Neural networks and other learning methods have been used extensively in image restoration.
The most relevant to our work are methods which use CNNs to directly predict high-quality
images (as in Eq. (3)). Xu et al. [35] learn CNNs for space-invariant non-blind
deconvolution. They initialize the first two layers with a separable decomposition of an inverse
filter and then they optimize the full network on artificial data. This network can handle
complex blurs and saturation, but it has to be completely re-trained for each blur kernel. Jain
and Seung [16] learned a small CNN (15,697 parameters) to remove additive white Gaus-
sian noise with unknown energy. Eigen et al. [11] detect and remove rain drops and dirt by
CNN learned on generated data. They report significant quality increase with CNN over a
patch-based network. Dong et al. [10] learn a small 3-layer CNN with Rectified Linear Units
for state-of-the-art single image super resolution. Interesting and related to our work is the
blind deconvolution method by Schuler et al. [27]. They propose to learn a "sharpening"
CNN for blur kernel estimation. In contrast to our work, their network is rather small and
they reconstruct the final image with a standard non-blind method.
3 Convolutional networks for blind deconvolution
We directly predict clean and sharp images from corrupted observed images by a convolu-
tional network as in Eq. (3). The architecture of the networks is inspired by the recently very
successful networks that redefined state-of-the-art in many computer vision tasks including
object and scene classification [29,30,37], object detection [13], and facial recognition [31].
All these networks are derived from the ImageNet classification network by Krizhevsky et
al. [21]. These networks can be reliably trained even when they contain hundreds of millions
of weights [29] and tens of layers [30].
The networks are composed of multiple layers which combine convolutions with element-wise
Rectified Linear Units (ReLU):
$$
\begin{aligned}
F_0(y) &= y \\
F_l(y) &= \max(0, W_l * F_{l-1}(y) + b_l), \quad l = 1, \dots, L-1 \\
F(y) &= W_L * F_{L-1}(y) + b_L
\end{aligned}
\tag{4}
$$
The input and output are both 3-channel RGB images with their values mapped to the interval
[-0.5, 0.5]. Each layer applies c_l convolutions with filters spanning all c_{l-1} channels of the
previous layer. The last layer is linear (without ReLU).
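The layer recursion in Eq. (4) can be illustrated with a small sketch, assuming stride 1, no padding, and filters applied without flipping (as is common in CNN implementations). The function names `conv_layer` and `forward` and the toy two-layer network are hypothetical and serve only to show the structure: ReLU after every layer except the last, which stays linear.

```python
def conv_layer(x, weights, biases, relu):
    """Apply one convolutional layer.

    x:       input feature maps, indexed [c_in][h][w]
    weights: filters, indexed [c_out][c_in][kh][kw]
    biases:  one bias per output channel
    Returns feature maps indexed [c_out][h - kh + 1][w - kw + 1].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    out = []
    for f, b in zip(weights, biases):
        plane = []
        for i in range(h - kh + 1):
            row = []
            for j in range(w - kw + 1):
                # Each filter spans all channels of the previous layer.
                s = b + sum(x[c][i + a][j + d] * f[c][a][d]
                            for c in range(c_in)
                            for a in range(kh)
                            for d in range(kw))
                row.append(max(0.0, s) if relu else s)
            plane.append(row)
        out.append(plane)
    return out


def forward(y, layers):
    """Evaluate F(y): ReLU after every layer except the last one."""
    f = y
    for idx, (w, b) in enumerate(layers):
        f = conv_layer(f, w, b, relu=(idx < len(layers) - 1))
    return f


# Toy example: one input channel, two 1x1 layers (assumed values).
y = [[[0.5, -0.5], [0.25, -0.25]]]
layers = [([[[[-1.0]]]], [0.0]),   # hidden layer with ReLU
          ([[[[2.0]]]], [0.1])]    # linear output layer
out = forward(y, layers)
```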
As in previous works [3,11,26,35], we train the networks by minimizing mean squared
error on a dataset D = (x_i, y_i) of corresponding clean and corrupted image patches:
$$\arg\min_{W,b} \; \frac{1}{2|D|} \sum_{(x_i, y_i) \in D} \|F(y_i) - x_i\|_2^2 + 0.0005\,\|W\|_2^2 \tag{5}$$
The weight decay term $0.0005\,\|W\|_2^2$ is not required for regularization, but previous work
showed that it improves convergence [21]. The optimization method we use is Stochastic
Gradient Descent with momentum. We set the size of clean patches x_i to 16×16 pixels,
which we believe provides a good trade-off between computational efficiency and diversity of
mini-batches (96 patches in each mini-batch). The size of the blurred training patches y_i depends
on the spatial support of a particular network.
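For concreteness, the value of the objective in Eq. (5) for a given set of predictions could be computed as in the following sketch. Patches and weights are represented as flat lists of numbers, and the function name `objective` is a hypothetical helper; the optimization itself (SGD with momentum) is not shown.

```python
def objective(predictions, targets, weights, weight_decay=0.0005):
    """Mean squared error over the dataset plus the weight decay term.

    predictions, targets: lists of flattened patches F(y_i) and x_i
    weights:              flat list of all network weights W
    """
    n = len(predictions)
    data_term = sum((p - t) ** 2
                    for pred, tgt in zip(predictions, targets)
                    for p, t in zip(pred, tgt)) / (2 * n)
    decay_term = weight_decay * sum(w * w for w in weights)
    return data_term + decay_term


# Two toy "patches" of two pixels each and two weights (assumed values).
loss = objective([[1.0, 2.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 0.0]],
                 [10.0, 0.0])
```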
We initialize weights from a uniform distribution with variance equal to 1/n_in, where n_in is
the size of the respective convolutional filter (fan-in). This recipe was derived by the authors of
the Caffe framework [17] from recommendations by Glorot and Bengio [14].
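A uniform distribution on [-a, a] has variance a²/3, so the stated variance of 1/n_in corresponds to a = sqrt(3/n_in). The sketch below draws weights this way and checks the variance empirically. It assumes that the fan-in counts the filter height, width, and input channels; the helper name `init_uniform` is hypothetical.

```python
import math
import random


def init_uniform(n_in, count, rng):
    """Draw `count` weights from U(-a, a) whose variance a^2 / 3 equals 1 / n_in."""
    a = math.sqrt(3.0 / n_in)
    return [rng.uniform(-a, a) for _ in range(count)]


# Assumed fan-in of a 19x19 filter over 3 input channels.
n_in = 19 * 19 * 3
weights = init_uniform(n_in, 100_000, random.Random(0))
mean = sum(weights) / len(weights)
variance = sum((w - mean) ** 2 for w in weights) / len(weights)
# variance * n_in should be close to 1 for a large sample.
```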
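Table 1: CNN architecture – filter size and number of channels for each layer.
| Layer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L15 filter size | 19×19 | 1×1 | 1×1 | 1×1 | 1×1 | 3×3 | 1×1 | 5×5 | 5×5 | 3×3 | 5×5 | 5×5 | 1×1 | 7×7 | 7×7 |
| L15 channels | 128 | 320 | 320 | 320 | 128 | 128 | 512 | 128 | 128 | 128 | 128 | 128 | 256 | 64 | 3 |
| L10 filter size | 23×23 | 1×1 | 1×1 | 1×1 | 1×1 | 3×3 | 1×1 | 5×5 | 3×3 | 5×5 | | | | | |
| L10-S channels | 128 | 320 | 320 | 320 | 128 | 128 | 512 | 48 | 96 | 3 | | | | | |
| L10-M channels | 196 | 400 | 400 | 400 | 156 | 156 | 512 | 56 | 128 | 3 | | | | | |
| L10-L channels | 220 | 512 | 512 | 512 | 196 | 196 | 512 | 64 | 196 | 3 | | | | | |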
4 Results
We tested the approach on the task of blind deconvolution with realistic de-focus and camera-shake
blur kernels on a large set of documents from the CiteSeerX¹ repository. We explored
different network architecture choices, and we compared results to state-of-the-art blind and
non-blind deconvolution methods in terms of image quality and OCR accuracy. We pur-
posely limited the image degradations to shift-invariant blur and additive noise to allow for
fair comparison with the baseline methods, which are not designed to handle other aspects
of the image acquisition process. To validate our approach, we qualitatively evaluated the
created networks on real photos of printed documents. The experiments were conducted
using a modified version of the Caffe deep learning framework [17].
Dataset. We selected scientific publications for the experiments as they contain an inter-
esting mix of different content types (text, equations, tables, images, graphs, and diagrams).
We downloaded over 600k documents from the CiteSeerX repository, from which we randomly
selected 50k files for training and 2k files for validation. We rendered at most 4 random
pages from each document using Ghostscript with resolution uniformly sampled from 240–300 DPI.
We down-sampled the images by a factor of two to a final resolution of 120–150 DPI
(A4 paper at 150 DPI requires 2.2 Mpx resolution). Patches for training (3M) and validation
(35k) were sampled from the rendered pages with a preference for regions with higher total
variation.
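The 2.2 Mpx figure can be checked from the standard A4 dimensions (210 mm × 297 mm) and 25.4 mm per inch:

```latex
\begin{aligned}
\text{width}  &= \frac{210}{25.4}\,\text{in} \times 150\,\text{DPI} \approx 1240\ \text{px},\\
\text{height} &= \frac{297}{25.4}\,\text{in} \times 150\,\text{DPI} \approx 1754\ \text{px},\\
\text{pixels} &\approx 1240 \times 1754 = 2\,174\,960 \approx 2.2\ \text{Mpx}.
\end{aligned}
```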
To make the extracted patches more realistic, we applied small geometric transformations
with bicubic interpolation corresponding to camera rotations in the document plane and deviations
from the perpendicular document axis (normal distributions with standard deviation
of 1° and 4°, respectively). We believe that this amount of variation may correspond to that
of automatically rectified images.
We combined two types of blur – motion blur similar to camera shake and de-focus
blur. The de-focus blur is a uniform anti-aliased disc. The motion blur was generated
by a random walk. The radius of the de-focus blur was uniformly sampled from [0, 4] and
the maximum size of the motion kernel was sampled from [5, 21]. Each image patch was
blurred with a unique kernel. A histogram of kernel sizes is shown in Figure 2 (bottom-right).
Gaussian noise with standard deviation uniformly sampled from [0, 7/255] was added to the
blurred patches which were finally quantized into 256 discrete levels.
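The degradation pipeline described above could be sketched as follows. The text does not specify how the disc is anti-aliased or how the random walk is parameterized, so the supersampling scheme, the four-neighbor walk, the step count, the clipping to [0, 1] before quantization, and all function names are assumptions made for illustration. Combining the two kernels (by convolving them) and blurring the patch are omitted for brevity.

```python
import math
import random


def normalize(kernel):
    """Scale a kernel so that its entries sum to one."""
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def disc_kernel(radius, supersample=4):
    """Anti-aliased uniform disc: a pixel's weight is the number of its
    sub-pixel samples that fall inside the disc (then normalized)."""
    half = int(math.ceil(radius))
    size = 2 * half + 1
    kernel = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            for a in range(supersample):
                for b in range(supersample):
                    dy = i - half - 0.5 + (a + 0.5) / supersample
                    dx = j - half - 0.5 + (b + 0.5) / supersample
                    if dx * dx + dy * dy <= radius * radius:
                        kernel[i][j] += 1.0
    if sum(sum(row) for row in kernel) == 0.0:
        kernel[half][half] = 1.0  # radius close to zero: no de-focus blur
    return normalize(kernel)


def random_walk_kernel(size, steps, rng):
    """Motion blur as normalized visit counts of a random walk on a size x size grid."""
    kernel = [[0.0] * size for _ in range(size)]
    y = x = size // 2
    for _ in range(steps):
        kernel[y][x] += 1.0
        dy, dx = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        y = min(max(y + dy, 0), size - 1)  # keep the walk inside the kernel
        x = min(max(x + dx, 0), size - 1)
    return normalize(kernel)


def add_noise_and_quantize(pixels, noise_std, rng):
    """Add Gaussian noise to intensities in [0, 1], then quantize to 256 levels."""
    out = []
    for v in pixels:
        noisy = min(max(v + rng.gauss(0.0, noise_std), 0.0), 1.0)
        out.append(round(noisy * 255.0) / 255.0)
    return out


rng = random.Random(0)
defocus = disc_kernel(rng.uniform(0.0, 4.0))               # radius from [0, 4]
motion = random_walk_kernel(rng.randint(5, 21), 40, rng)   # size from [5, 21]
noise_std = rng.uniform(0.0, 7.0 / 255.0)                  # std from [0, 7/255]
noisy = add_noise_and_quantize([0.0, 0.25, 0.5, 1.0], noise_std, rng)
```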
CNN architecture. Bigger and deeper networks give better results in computer vision
tasks [30] provided that enough training data and computing power are available. Figure 2 (top-left)
shows Peak Signal to Noise Ratio (PSNR) on the validation set for networks with
different numbers of layers (128 filters in each layer; size of first layer filters 23×23, other
layers 3×3). The deeper networks consistently outperform shallower networks.
¹ http://citeseerx.ist.psu.edu/
[Figure 2 plots. top-left: PSNR (dB) vs. number of layers; top-right: PSNR (dB) vs. first layer filter diameter (px) for L03, L05, and L10-S; bottom-left: PSNR (dB) vs. channel number (S, M, L) for L05 and L10; bottom-right: dataset ratio vs. blur kernel area (px).]
Figure 2: Different CNN architectures. top-left – network depth; top-right – spatial support
size; bottom-left – channel number; bottom-right – distribution of blur-kernel sizes in dataset
The better results of deeper networks could be due to their higher computational capacity
or due to larger spatial support (2px per layer). To gain more insight, we trained multiple
networks with filter sizes in the first layer from 9×9 up to 35×35. These networks have
128 filters in the first layer, 3 filters 5×5 in the last layer (RGB), and 256 filters 1×1 in the
middle layers. These networks with three layers will be referred to as L03 and those with
five layers as L05. Figure 2 (top-right) shows that the spatial support has little effect on
performance compared to the depth of the network. The optimal size of the first layer filters
is relatively small: 13×13 for L03 and 19×19 for L05. The same observations hold for a
10-layer network (L10-S from Table 1).
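The remark that each additional 3×3 layer adds 2px of spatial support follows from the usual receptive-field rule for stacked stride-1 convolutions: every k×k layer widens the support by k − 1. A minimal sketch, with the hypothetical helper `receptive_field`, applied to the depth-experiment networks (first layer 23×23, other layers 3×3):

```python
def receptive_field(filter_sizes):
    """Side length of the input window seen by one output pixel
    for a stack of stride-1, undilated convolutions."""
    return 1 + sum(k - 1 for k in filter_sizes)


for depth in (3, 5, 10):
    sizes = [23] + [3] * (depth - 1)
    print(depth, receptive_field(sizes))
# prints:
# 3 27
# 5 31
# 10 41
```

Under this rule, 1×1 layers add depth without enlarging the support, which is how the L03 and L05 networks separate the two effects.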
Another way to enlarge a network is to increase the number of channels (filters). Such
change affects the amount of information which can pass through the network at the expense
of a quadratic increase in the number of weights and in computational complexity. Figure 2
(bottom-left) shows the effect of changing the number of channels for networks L10 (see Table 1)
and L05 (first layer channels 128 (S), 196 (M), 220 (L); other layers 256 (S), 320 (M),
512 (L)). The reconstruction quality increases only slightly with a higher number of channels,
while the effect of network depth is much stronger.
The largest and deepest network we trained has 15 layers and 2.3M parameters (L15
from Table 1). This network provides the best results (validation PSNR 16.06 dB) still
without over-fitting. Compared to convolutional networks used in computer vision tasks, this
network is still small and computationally efficient – it requires 2.3M multiply-accumulate
operations per pixel. Assuming 50% efficiency, contemporary GPUs, which provide peak
single-precision speeds exceeding 4 Tflops, should be able to process a 1Mpx image in 2s.
The network was trained for more than a week on a single GPU.
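The processing-time estimate in this paragraph can be reproduced with a short calculation, assuming that one multiply-accumulate counts as two floating-point operations (the text does not state the convention):

```latex
\begin{aligned}
\text{work} &= 2.3\times10^{6}\ \tfrac{\text{MAC}}{\text{px}} \times 10^{6}\ \text{px} \times 2\ \tfrac{\text{flop}}{\text{MAC}} = 4.6\times10^{12}\ \text{flop},\\
\text{throughput} &= 0.5 \times 4\times10^{12}\ \tfrac{\text{flop}}{\text{s}} = 2\times10^{12}\ \tfrac{\text{flop}}{\text{s}},\\
t &\approx \frac{4.6\times10^{12}}{2\times10^{12}}\ \text{s} = 2.3\ \text{s}.
\end{aligned}
```

This agrees with the stated figure of roughly 2s for a 1Mpx image.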
[Figure 3 plots. bottom-left: PSNR (dB) vs. noise std. dev.; bottom-right: character error (%) vs. noise std. dev.]
Figure 3: Artificial dataset and results. top – filters from test set; middle – examples from test
set with L15 CNN deblurring results; bottom-left – deblurring image quality; bottom-right –
OCR error on deblurred data
Image quality. We evaluated image restoration quality on one hundred 200×200 image
patches which were prepared the same way as the training and the validation set. These
patches were extracted from unseen documents and the preference for regions with higher
total variation was relaxed to only avoid empty regions. A random subset of the regions
together with all 100 blur filters is shown in Figure 3. PSNR was measured only on the
central 160×160 region.
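The evaluation protocol above (PSNR on the central region only) can be sketched as follows. The standard definition PSNR = 10·log10(peak² / MSE) is used; the peak value of 1.0, the constant test images, and the helper names `center_crop` and `psnr` are assumptions for illustration.

```python
import math


def center_crop(image, size):
    """Return the central size x size region of a 2-D image (list of rows)."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def psnr(reference, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    diffs = [(r - t) ** 2
             for ref_row, res_row in zip(reference, restored)
             for r, t in zip(ref_row, res_row)]
    mse = sum(diffs) / len(diffs)
    return float("inf") if mse == 0.0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy 200x200 images with a constant error of 0.1 (assumed values).
reference = [[0.0] * 200 for _ in range(200)]
restored = [[0.1] * 200 for _ in range(200)]
score = psnr(center_crop(reference, 160), center_crop(restored, 160))
# A constant error of 0.1 gives an MSE of 0.01, i.e. about 20 dB.
```

Cropping the border avoids penalizing methods for boundary effects, where part of the blur support lies outside the patch.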
We selected four baselines. Two are blind deconvolution methods: the standard method of
Xu and Jia [33], and the L0-regularized method by Pan et al. [24], which is designed
specifically for text. In addition, we selected two non-blind methods as an upper bound on the
possible quality of the blind methods: total variation regularized deconvolution [5], and
L0-regularized deconvolution, which is a part of the blind method of Pan et al. [24]. Optimal
parameters of all four baseline methods were selected by grid search for each noise level
separately on an independent set of 25 images.
Figure 4: CNN deblurring of challenging real images. More in supplementary material.
Figure 3(bottom-left) shows results of all methods for different amounts of noise. L15
CNN clearly outperforms both reference blind methods for all noise levels. The non-blind
methods are better at very low noise levels, but their results rapidly degrade with stronger
noise – L15 CNN is better for noise with std. dev. 3 and stronger. Surprisingly, the network
maintains good restoration quality even for noise levels higher than those it was trained
on.
Optical Character Recognition. One of the main reasons for text deblurring is to improve
legibility and OCR accuracy. We evaluated OCR accuracy on 100 manually cropped paragraphs
which were blurred with the same kernels as the test set. The manual cropping was
necessary as OCR software uses language models and needs continuous text for realistic
performance. The 100 paragraphs were selected with no conscious bias except that the width
of the selected paragraphs had to be less than 512px. The paragraphs contain in total 4730
words and 25870 characters of various sizes and fonts². We used ABBYY FineReader 11 to
recognize the text and we report mean Character Error Rate³ after Levenshtein alignment.
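A minimal sketch of such a metric is given below, assuming that the Character Error Rate is the Levenshtein (edit) distance between the reference and recognized text divided by the reference length, with each paragraph's error clipped at 100% before averaging. The function names and the example strings are hypothetical; the exact alignment procedure used in the experiments is not specified here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(reference, recognized):
    """Edit distance relative to the reference length, clipped at 1.0 (100%)."""
    return min(levenshtein(reference, recognized) / len(reference), 1.0)


def mean_cer(pairs):
    """Mean clipped CER over (reference, recognized) paragraph pairs."""
    return sum(character_error_rate(r, h) for r, h in pairs) / len(pairs)


pairs = [("text", "test"),     # one substitution: CER 0.25
         ("ab", "abcdef")]     # four insertions: CER 2.0, clipped to 1.0
result = mean_cer(pairs)       # (0.25 + 1.0) / 2 = 0.625
```

Clipping keeps a few completely failed paragraphs, where the OCR output is much longer than the reference, from dominating the mean.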
Figure 3 (bottom-right) shows results for different noise levels. The OCR results are very
similar to the deblurring image quality. Non-blind methods are the best for very low amounts
of noise, but L15 CNN becomes better for noise level 3 and higher. The blind methods fail
on many images and the respective mean results are therefore quite low.
² See supplementary material for examples.
³ Character Error Rate can rise over 100% – we clipped errors at 100% when computing the mean.
(a) blurred image (b) Xu and Jia [33] (c) L0Deblur [34]
(d) Cho and Lee [8] (e) Zhong et al. [36] (f) Chen et al. [6]
(g) Cho et al. [7] (h) Pan et al. [24] (i) CNN deblurring
Figure 5: A real image from [7]. The reference results are from [24]. The input of L15 CNN
was downsampled by a factor of 3 compared to the other methods.
Real images. To create networks capable of restoring real images, we incorporated additional
transformations into the simulation of the imaging process when creating the blurred
training data. The additional transformations are: color balance, gamma and contrast trans-
formations, and JPEG compression. This way the network effectively learns to reconstruct
the documents with perfect black and white levels. Figure 4 shows that the network was able
to restore even heavily distorted photos with a high amount of noise. The presented examples
are at the limits of the CNN and reconstruction artifacts are visible. The network is not able
to handle blur larger than what it was trained for – in such cases the reconstruction fails
completely. On the other hand, it is able to handle much stronger noise and even effects like
saturation for which it was not trained. Figure 5 shows a comparison with other methods on
an image by Cho et al. [7].
Discussion. The results are extremely promising especially considering that the presented
approach naturally handles space-variant blur and that it processes a 1Mpx image in 5s even
in our non-optimized implementation. L15 CNN is small (9MB) and it would easily fit into
a mobile device.
To some extent, the network performs general deconvolution – it sharpens and denoises
images in documents and it produces typical ringing artefacts in some situations. However,
much of the observed performance is due to its ability to model the text prior which has to
be strong enough to disambiguate very small image patches (spatial support of L15 CNN is
only 50×50px). We observed better reconstruction quality for common words (e.g. "the",
"and") and relatively worse results for rare characters (e.g. calligraphic fonts, Greek alphabet).
However, these effects are evident only for severely blurred inputs.
For computational reasons, we considered only relatively small-resolution images with
correspondingly small characters and blur kernels. With the simple CNN structure we use,
it would be inefficient to process higher-resolution images – a more complex structure (e.g.
the "Inception module" [30]) would be needed.
5 Conclusions
We have demonstrated that convolutional neural networks are able to directly perform blind
deconvolution of documents, and that they achieve restoration quality surpassing state-of-
the-art methods. The proposed method is computationally efficient and it naturally handles
space-variant blur, JPEG compression, and other aspects of the imaging process which are problematic
for traditional deconvolution methods.
In the future, we intend to broaden the domain on which CNN deblurring can be ap-
plied, and we would like to increase the complexity of the network structure to make it more
efficient. A natural extension of the presented approach is to combine the deblurring with
character recognition into a single network. Character labels could help to learn reconstruc-
tion and vice versa. Alternatively, the deblurring task could be used to pre-train an OCR
network.
Acknowledgements
This research is supported by the ARTEMIS joint undertaking under grant agreement no 641439 (ALMARVI),
GACR agency project GA13-29225S. Jan Kotera was also supported by GA UK project
938213/2013, Faculty of Mathematics and Physics, Charles University in Prague.
References
[1] Mariana S.C. Almeida and Mário A.T. Figueiredo. Blind image deblurring with un-
known boundaries using the alternating direction method of multipliers. In ICIP, pages
586–590. Citeseer, 2013.
[2] M.S.C. Almeida and L.B. Almeida. Blind and semi-blind deblurring of natural images.
IEEE Transactions on Image Processing, 19(1):36–52, January 2010. ISSN 1057-7149,
1941-0042. doi: 10.1109/TIP.2009.2031231.
[3] Harold C. Burger, Christian J. Schuler, and Stefan Harmeling. Image denoising: Can
plain neural networks compete with BM3D? In Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, pages 2392–2399,
2012. ISBN 9781467312264. doi: 10.1109/CVPR.2012.6247952.
[4] M. Carlavan and L. Blanc-Feraud. Sparse poisson noisy image deblurring. Image
Processing, IEEE Transactions on, 21(4):1834–1846, April 2012. ISSN 1057-7149.
doi: 10.1109/TIP.2011.2175934.
[5] Stanley H. Chan, Ramsin Khoshabeh, Kristofor B. Gibson, Philip E. Gill, and
Truong Q. Nguyen. An augmented Lagrangian method for total variation video
restoration. IEEE Transactions on Image Processing, 20(11):3097–3111, 2011. ISSN
10577149. doi: 10.1109/TIP.2011.2158229.
[6] Xiaogang Chen, Xiangjian He, Jie Yang, and Qiang Wu. An effective document
image deblurring algorithm. In Proceedings of the IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition, pages 369–376, 2011. ISBN
9781457703942. doi: 10.1109/CVPR.2011.5995568.
[7] Hojin Cho, Jue Wang, and Seungyong Lee. Text image deblurring using text-
specific properties. In ECCV, volume 7576 LNCS, pages 524–537, 2012. ISBN
9783642337147. doi: 10.1007/978-3-642-33715-4_38.
[8] Sunghyun Cho and Seungyong Lee. Fast motion deblurring. ACM Transactions on
Graphics, 28(5):1, 2009. ISSN 07300301. doi: 10.1145/1618452.1618491.
[9] Sunghyun Cho, Jue Wang, and Seungyong Lee. Handling outliers in non-blind image
deconvolution. In Computer Vision (ICCV), 2011 IEEE International Conference on,
pages 495–502, Nov 2011. doi: 10.1109/ICCV.2011.6126280.
[10] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a Deep Con-
volutional Network for Image Super-Resolution. In ECCV, pages 184–199. Springer
International Publishing, 2014.
[11] David Eigen, Dilip Krishnan, and Rob Fergus. Restoring an Image Taken through
a Window Covered with Dirt or Rain. In Computer Vision (ICCV), 2013 IEEE In-
ternational Conference on, pages 633–640, 2013. ISBN 978-1-4799-2840-8. doi:
10.1109/ICCV.2013.84.
[12] Rob Fergus, Barun Singh, Aaron Hertzmann, Sam T. Roweis, and William T. Freeman.
Removing camera shake from a single photograph. ACM Transactions on Graphics
(TOG), 25(3):787–794, 2006.
[13] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierar-
chies for accurate object detection and semantic segmentation. In Computer Vision and
Pattern Recognition, 2014.
[14] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the International Conference on Artificial
Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and
Statistics, 2010.
[15] Ankit Gupta, Neel Joshi, C. Lawrence Zitnick, Michael Cohen, and Brian Curless.
Single image deblurring using motion density functions. In Computer Vision–ECCV
2010, pages 171–184. Springer, 2010.
[16] Viren Jain and Sebastian Seung. Natural Image Denoising with Convolutional Net-
works. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, Advances in
Neural Information Processing Systems 21, pages 769–776. Curran Associates, Inc.,
2009.
[17] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross
Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture
for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[18] Neel Joshi, Richard Szeliski, and David Kriegman. PSF estimation using sharp edge
prediction. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE
Conference on, pages 1–8. IEEE, 2008.
[19] Neel Joshi, Sing Bing Kang, C. Lawrence Zitnick, and Richard Szeliski. Image de-
blurring using inertial measurement sensors. In ACM SIGGRAPH 2010 Papers, SIG-
GRAPH ’10, pages 30:1–30:9, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-
0210-4. doi: 10.1145/1833349.1778767.
[20] Dilip Krishnan, Terence Tay, and Rob Fergus. Blind deconvolution using a normalized
sparsity measure. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on, pages 233–240. IEEE, 2011.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with
Deep Convolutional Neural Networks. Advances In Neural Information Processing
Systems, pages 1–9, 2012.
[22] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Understanding blind deconvolution
algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):
2354–2367, December 2011. ISSN 0162-8828, 2160-9292. doi: 10.1109/TPAMI.
2011.148.
[23] Anat Levin, Yair Weiss, Fredo Durand, and William T. Freeman. Understanding and
evaluating blind deconvolution algorithms. In Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on, pages 1964–1971. IEEE, 2009.
[24] Jinshan Pan, Zhe Hu, Zhixun Su, and Ming-Hsuan Yang. Deblurring Text Images via
L0-Regularized Intensity and Gradient Prior. In 2014 IEEE Conference on Computer
Vision and Pattern Recognition, pages 2901–2908. IEEE, June 2014. ISBN 978-1-
4799-5118-5. doi: 10.1109/CVPR.2014.371.
[25] G. Panci, P. Campisi, S. Colonnese, and G. Scarano. Multichannel blind image decon-
volution using the bussgang algorithm: Spatial and multiresolution approaches. IEEE
Transactions on Image Processing, 12(11):1324–1337, November 2003. ISSN 1057-
7149. doi: 10.1109/TIP.2003.818022.
[26] Christian J. Schuler, Harold Christopher Burger, Stefan Harmeling, and Bernhard
Schölkopf. A Machine Learning Approach for Non-blind Image Deconvolution. In
Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages
1067–1074, 2013. ISSN 1063-6919. doi: 10.1109/CVPR.2013.142.
[27] Christian J Schuler, Michael Hirsch, Stefan Harmeling, and Bernhard Schölkopf.
Learning to Deblur. CoRR, abs/1406.7, 2014.
[28] Qi Shan, Jiaya Jia, and Aseem Agarwala. High-quality motion deblurring from a single
image. In ACM Transactions on Graphics (TOG), volume 27, page 73. ACM, 2008.
[29] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for
Large-Scale Image Recognition. CoRR, abs/1409.1, 2014.
[30] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
Deeper with Convolutions. CoRR, abs/1409.4, 2014.
[31] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. DeepFace: Clos-
ing the Gap to Human-Level Performance in Face Verification. In 2014 IEEE Con-
ference on Computer Vision and Pattern Recognition, pages 1701–1708. IEEE, June
2014. ISBN 978-1-4799-5118-5. doi: 10.1109/CVPR.2014.220.
[32] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurring for shaken
images. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference
on, pages 491–498, June 2010. doi: 10.1109/CVPR.2010.5540175.
[33] Li Xu and Jiaya Jia. Two-phase kernel estimation for robust motion deblurring. In
ECCV, volume 6311 LNCS, pages 157–170, 2010. ISBN 3642155480.
doi: 10.1007/978-3-642-15549-9_12.
[34] Li Xu, Shicheng Zheng, and Jiaya Jia. Unnatural L0 sparse representation for nat-
ural image deblurring. In Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 1107–1114, 2013. ISBN 978-0-7695-
4989-7. doi: 10.1109/CVPR.2013.147.
[35] Li Xu, Jimmy S. J. Ren, Ce Liu, and Jiaya Jia. Deep Convolutional Neural Network for
Image Deconvolution. In Advances in Neural Information Processing Systems (NIPS),
2014.
[36] Lin Zhong, Sunghyun Cho, Dimitris Metaxas, Sylvain Paris, and Jue Wang. Han-
dling noise in single image deblurring using directional filters. In Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
pages 612–619, 2013. ISBN 978-0-7695-4989-7. doi: 10.1109/CVPR.2013.85.
[37] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for
Scene Recognition using Places Database. NIPS, 2014.
... Text Deblurring Datasets Hradiš et al. (2015) collected text documents from the Internet. These are downsampled to 120–150 DPI resolution, and split into training and test sets of sizes 50K and 2K, respectively. ...
... deblur text images via an L0-regularized prior based on intensity and gradient. More recently, deep learning methods, e.g. the method by Hradiš et al. (2015), have been shown to effectively remove out-of-focus and motion blur. adopt a GAN-based model to learn a category-specific prior for the task, designing a multi-class GAN model and a novel loss function for both face and text images. ...
... adopt a GAN-based model to learn a category-specific prior for the task, designing a multi-class GAN model and a novel loss function for both face and text images. Figure 18 shows the deblurring results from several non-deep methods (Chen et al. 2011; Cho et al. 2012; Cho and Lee 2009; Xu and Jia 2010; Xu et al. 2013; Zhong et al. 2013) and the deep learning based method in Hradiš et al. (2015), showing that Hradiš et al. (2015) achieves better deblurring results on text images. ...
Article
Image deblurring is a classic problem in low-level computer vision with the aim to recover a sharp image from a blurred input image. Advances in deep learning have led to significant progress in solving this problem, and a large number of deblurring networks have been proposed. This paper presents a comprehensive and timely survey of recently published deep-learning based image deblurring approaches, aiming to serve the community as a useful literature review. We start by discussing common causes of image blur, introduce benchmark datasets and performance metrics, and summarize different problem formulations. Next, we present a taxonomy of methods using convolutional neural networks (CNN) based on architecture, loss function, and application, offering a detailed review and comparison. In addition, we discuss some domain-specific deblurring applications including face images, text, and stereo image pairs. We conclude by discussing key challenges and future research directions.
... Three different reference datasets are used to create the training, validation, and verification datasets. The first dataset library contains 60,000 images from Hradis et al. [71]. It consists of cropped images from different scientific papers without any artifacts. ...
... The effect of newly introduced components such as blur and Gabor filters is examined based on the reference model of Hradis et al. [71]. Each of the new filter arrays (e.g., Gabor, blur) is added to or removed from the base model to show the performance based on the absence/presence of the respective component (ablation study). ...
... The total number of images used for testing each model for producing the results of Table 2 was 1500 images. Those images were produced based on 100 images from the image dataset of Hradis et al. [71]; the blur types and parameters used are defined in Table 1. Comparison with respect to blur enhancement between our model and different related models from the relevant literature. ...
Article
In this paper, we propose a new convolutional neural network (CNN) architecture for improving document-image quality through decreasing the impact of distortions (i.e., blur, shadows, contrast issues, and noise) contained therein. Indeed, for many document-image processing systems such as OCR (optical character recognition) and document-image classification, the real-world image distortions can significantly degrade the performance of such systems in a way such that they become merely unusable. Therefore, a robust document-image enhancement model is required to preprocess the involved document images. The preprocessor system developed in this paper places “deblurring” and “noise removal and contrast enhancement” in two separate and sequential submodules. In the architecture of those two submodules, three new parts are introduced: (a) the patch-based approach, (b) preprocessing layer involving Gabor and Blur filters, and (c) the approach using residual blocks. Using these last-listed innovations results in a very promising performance when compared to the related works. Indeed, it is demonstrated that even extremely strongly degraded document images that were not previously recognizable by an OCR system can now become well-recognized with a 91.51% character recognition accuracy after the image enhancement preprocessing through our new CNN model.
... assumptions. With the continuous development of deep learning, some deep convolutional neural networks (CNNs) [3,10,15,22,42,48] have been adopted as blur kernel estimators and have shown satisfactory deblurring performance. However, these methods always need two stages to accomplish the image deblurring task. ...
... • Unlike some other methods which simply concatenate the image features of the same scale [10,15,22], we propose a cross-scale feature fusion module (CSFFM) to fuse the features of encoder and decoder from different scales, so that the multi-scale information can be better used to facilitate the deblurring. ...
Preprint
Image deblurring aims to restore the detailed texture information or structures from the blurry images, which has become an indispensable step in many computer-vision tasks. Although various methods have been proposed to deal with the image deblurring problem, most of them treated the blurry image as a whole and neglected the characteristics of different image frequencies. In this paper, we present a new method called multi-scale frequency separation network (MSFS-Net) for image deblurring. MSFS-Net introduces the frequency separation module (FSM) into an encoder-decoder network architecture to capture the low and high-frequency information of image at multiple scales. Then, a simple cycle-consistency strategy and a sophisticated contrastive learning module (CLM) are respectively designed to retain the low-frequency information and recover the high-frequency information during deblurring. At last, the features of different scales are fused by a cross-scale feature fusion module (CSFFM). Extensive experiments on benchmark datasets show that the proposed network achieves state-of-the-art performance.
... With the improvement of computing power of computer hardware, more and more scholars combine deep learning with image restoration. Hradis proposed an image deblurring algorithm based on a deep convolutional neural network [1] and achieved a good deblurring effect, but this method applies only to blurred text images. Chakrabarti proposed an algorithm to accurately estimate the blur kernel [2]. ...
... These methods, tailored to RGB document images, do not exploit the high correlation between spectral bands: they can be applied to each band separately, leading to a higher sensitivity to the presence of noise and a time-consuming restoration process. Methods based on neural networks have also been exploited in the past for text deblurring [8]. These methods, although providing good results on noiseless RGB images, are sensitive to even a small amount of noise (commonly present in bands of HS images corresponding to blue and near-infrared spectral regions). ...
... Efficiency and precision are two main concerns using text recognition methods to obtain the working state of a CNC machine. Text blurring often happens when the mobile robot moves due to camera shake and de-focus, which degrades recognition accuracy [34]. In addition, efficiency is a shortcoming of deep learning-based methods, making it challenging to deploy those methods on mobile devices and lightweight systems [35]. ...
Article
Smart manufacturing uses robots and artificial intelligence techniques to minimize human interventions in manufacturing activities. Inspection of the machine’s working status is critical in manufacturing processes, ensuring that machines work correctly without any collisions and interruptions, e.g., in lights-out manufacturing. However, the current method heavily relies on workers onsite or remotely through the Internet. The existing approaches also include a hard-wired robot working with a computer numerical control (CNC) machine, and the instructions are followed through a pre-programmed path. Currently, there is no autonomous machine tending application that can detect and act upon the operational status of a CNC machine. This study proposes a deep learning-based method for the CNC machine detection and working status recognition through an independent robot system without human intervention. It is noted that there is often more than one machine working in a representative industrial environment. Therefore, the SiameseRPN method is developed to recognize and locate a specific machine from a group of machines. A deep learning-based text recognition method is designed to identify the working status from the human–machine interface (HMI) display.
Article
In this paper, we integrate image gradient priors into generative adversarial networks (GANs) to deal with the dynamic scene deblurring task. Even though image deblurring has progressed significantly, the deep learning-based methods rarely take advantage of image gradient priors. Image gradient priors regularize the image recovery process and serve as a quantitative evaluation metric for evaluating the quality of deblurred images. In contrast to previous methods, the proposed model utilizes a data-driven way to learn image gradients. Under the guidance of image gradient priors, we permeate it throughout the design of network structures and target loss functions. For the network architecture, we develop a GradientNet to compute image gradients via horizontal and vertical directions in parallel rather than adopt traditional edge detection operators. For the loss functions, we propose target loss functions to constrain the network training. The proposed image deblurring strategy discards the tedious steps of solving optimization equations and taking further advantage of learning massive data features through deep learning. Extensive experiments on synthetic datasets and real-world images demonstrate that our model outperforms state-of-the-art (SOTA) methods.
Article
Multiple lenses are used in most modern imaging systems to reduce deviations from the perfect optical imaging, which also results in a significant increase in prices. Computational Imaging Technology (CIT), which combines the traditional optical design and image reconstruction algorithms, has provided many methods for simplifying optical systems in recent years. However, the Field-of-View (FOV) and the relatively simple image degradation model limit the CIT approach. In this work, we present a novel and low-cost CIT approach for large-FOV imaging. Our system consists of a wide-angle optical module with two spherical lenses and a deep learning network for image reconstruction. Aiming at improving image quality, we introduce the Weighted Patch Degradation Model (WPDM) for the simple optical module with a wide range of spatial variants for large FOV and then construct a dataset. In addition, we present a DMPH-SE Network for our reconstruction task. Experiments show that our large-FOV imager could obtain excellent imaging results with a simple optical structure.
Article
Blind image deblurring is an important but challenging problem in image processing. Traditional optimization-based methods typically formulate this task as a maximum-a-posteriori estimation or variational inference problem, whose performance highly relies on handcrafted priors for both the latent image and blur kernel. In contrast, recent deep learning methods generally learn from a large collection of training images. Deep neural networks (DNNs) directly map the blurry image to the clean image or to the blur kernel, paying less attention to the physical degradation process of the blurry image. In this study, we present a deep variational Bayesian framework for blind image deblurring. Under this framework, the posterior of the latent clean image and blur kernel can be jointly estimated in an amortized inference manner with DNNs, and the involved inference DNNs can be trained by fully considering the physical blur model, and the supervision of data driven priors for the clean image and blur kernel, which is naturally led to by the lower bound objective. Comprehensive experiments were conducted to substantiate the effectiveness of the proposed framework. The results show that it can achieve a promising performance with relatively simple networks and incorporate existing deblurring DNNs to enhance their performance.
Article
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. Here, we introduce a new scene-centric database called Places with over 7 million labeled pictures of scenes. We propose new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Using CNN, we learn deep features for scene recognition tasks, and establish new state-of-the-art results on several scene-centric datasets. A visualization of the CNN layers' responses allows us to show differences in the internal representations of object-centric and scene-centric networks.
Conference Paper
The alternating direction method of multipliers (ADMM) is an efficient optimization tool that achieves state-of-the-art speed in several imaging inverse problems, by splitting the underlying problem into simpler, efficiently solvable sub-problems. In deconvolution, one of these sub-problems requires a matrix inversion, which has been shown to be efficiently computable (via the FFT), if the observation operator is circulant, i.e., under periodic boundary conditions. We extend ADMM-based image deconvolution to a more realistic scenario: unknown boundaries. The observation is modeled as the composition of a periodic convolution with a spatial mask that excludes the regions where the periodic convolution is invalid. We show that the resulting algorithms inherit the convergence guarantees of ADMM and illustrate its performance on non-periodic de-blurring under frame-based regularization.
Article
We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show better overall reconstruction quality.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Many fundamental image-related problems involve deconvolution operators. Real blur degradation seldom complies with an ideal linear convolution model due to camera noise, saturation, image compression, to name a few. Instead of perfectly modeling outliers, which is rather challenging from a generative model perspective, we develop a deep convolutional neural network to capture the characteristics of degradation. We note directly applying existing deep neural networks does not produce reasonable results. Our solution is to establish the connection between traditional optimization-based schemes and a neural network architecture where a novel, separable structure is introduced as a reliable support for robust deconvolution against artifacts. Our network contains two submodules, both trained in a supervised manner with proper initialization. They yield decent performance on non-blind image deconvolution compared to previous generative-model based methods.
Article
We propose a simple yet effective L0-regularized prior based on intensity and gradient for text image deblurring. The proposed image prior is motivated by observing distinct properties of text images. Based on this prior, we develop an efficient optimization method to generate reliable intermediate results for kernel estimation. The proposed method does not require any complex filtering strategies to select salient edges which are critical to the state-of-the-art deblurring algorithms. We discuss the relationship with other deblurring algorithms based on edge selection and provide insight on how to select salient edges in a more principled way. In the final latent image restoration step, we develop a simple method to remove artifacts and render better deblurred images. Experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art text image deblurring methods. In addition, we show that the proposed method can be effectively applied to deblur low-illumination images.
Article
Photographs taken in low-light conditions are often blurry as a result of camera shake, i.e. a motion of the camera while its shutter is open. Most existing deblurring methods model the observed blurry image as the convolution of a sharp image with a uniform blur kernel. However, we show that blur from camera shake is in general mostly due to the 3D rotation of the camera, resulting in a blur that can be significantly non-uniform across the image. We propose a new parametrized geometric model of the blurring process in terms of the rotational motion of the camera during exposure. This model is able to capture non-uniform blur in an image due to camera shake using a single global descriptor, and can be substituted into existing deblurring algorithms with only small modifications. To demonstrate its effectiveness, we apply this model to two deblurring problems; first, the case where a single blurry image is available, for which we examine both an approximate marginalization approach and a maximum a posteriori approach, and second, the case where a sharp but noisy image of the scene is available in addition to the blurry image. We show that our approach makes it possible to model and remove a wider class of blurs than previous approaches, including uniform blur as a special case, and demonstrate its effectiveness with experiments on synthetic and real images.
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.