Camera-based Sudoku recognition with
Deep Belief Network
Baptiste Wicht, Jean Hennebert
University of Fribourg
HES-SO, University of Applied Science
Fribourg, Switzerland
Abstract—In this paper, we propose a method to detect and recognize a Sudoku puzzle in images taken with a mobile camera. The lines of the grid are detected with a Hough transform. The grid is then recomposed from the lines. The digit positions are extracted from the grid and, finally, each character is recognized using a Deep Belief Network (DBN). To test our implementation, we collected and made public a dataset of Sudoku images taken with cell phones. Our method proved successful on our dataset, correctly recognizing 87.5% of the puzzles in the test set. Only 0.37% of the cells were incorrectly guessed. The algorithm is capable of handling some alterations of the images that are often present in phone-based images, such as distortion, perspective, shadows, illumination gradients or scaling. On average, our solution is able to produce a result from a Sudoku in less than 100 ms.
Keywords-Camera-based OCR; Deep Belief Network; Text Detection; Text Recognition
I. INTRODUCTION
Deep learning, and more specifically deep belief networks, have been used successfully on some scanner-based digit recognition tasks [1]. An important advantage of such approaches is that little a priori knowledge is injected into the system. Typically, raw inputs (pixels) can be fed to the system, which is able to learn features by itself. The scientific question that we address in this paper is the capacity of such deep learning procedures to handle more complex inputs, for example inputs acquired from camera-based images, which present more noise and distortion than scanner-based images.
For this reason, we address in our work the specific problem of recognizing Sudoku puzzles from newspaper pictures taken with a digital camera. With the ever-increasing number of smartphones, people have instant access to a digital camera. Thus, it is becoming more and more important to make use of these smartphone-based pictures. Moreover, as phones are becoming very powerful, it becomes easier for the applications analyzing these pictures to run directly on them. Solving computer vision problems directly on the phone has the advantage that the user has direct access to the result.
A solution to this problem is presented and thoroughly tested. A Hough transform is used to detect the lines present in the images. From this information, the outer grid of the Sudoku is detected and then split into smaller cells. Once the characters are isolated, the digits are recognized using a Deep Belief Network (DBN). A dataset has also been collected during this project to assess the quality of the system. Although the proposed system has not been tested on a smartphone, everything has been designed so that it can easily be adapted into a smartphone application.
Fig. 1. Image of a Sudoku puzzle from our dataset
The rest of this paper is organized as follows. Section II covers the background of the problem. Section III analyzes the previous work achieved in the different fields covered by this research. Section IV presents the dataset of Sudoku images that has been collected and that is used to evaluate the presented solution. Section V details the techniques used to detect the Sudoku grid and the digits inside it. Section VI describes how digits are recognized from the image. Section VII discusses the overall results of the system as well as its efficiency. Finally, Section VIII concludes this paper and presents some ideas for further improvements of the solution.
II. BACKGROUND
A. Sudoku
The Sudoku puzzle is a famous Japanese game. It is a logic, number-based puzzle. This paper focuses on the standard Sudoku, played on a 9x9 grid. Each cell can either be empty or contain a digit from 1 to 9. The game begins with a partially filled grid and the goal is to fill every row, every column and every 3x3 sub-square with digits, so that each digit from 1 to 9 is present exactly once in each of them. Figure 1 shows a typical example from our dataset.
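The rule above can be made concrete with a small checker. This is only an illustrative sketch: the function name `is_valid_sudoku` is hypothetical, and the use of 0 for an empty cell simply follows the convention of the dataset metadata described later in the paper.

```python
def is_valid_sudoku(grid):
    """Return True if no digit repeats in any row, column or 3x3 sub-square.

    `grid` is a 9x9 list of lists of integers; 0 marks an empty cell.
    """
    def no_repeats(cells):
        digits = [c for c in cells if c != 0]  # empty cells are ignored
        return len(digits) == len(set(digits))

    rows = [list(row) for row in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(no_repeats(unit) for unit in rows + cols + boxes)
```

A partially filled grid passes the check as long as the digits already present do not conflict, which is the state of every puzzle image before it is solved.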
Fig. 2. A Restricted Boltzmann Machine
B. Camera-based OCR
Text detection and recognition in images acquired from scanners have been studied for a long time, and very efficient solutions have been proposed [2]. On the other hand, camera-based computer vision problems remain challenging for several reasons:
• Cameras are of various qualities, and different cameras may produce different pictures of the same scene.
• Focus is rarely perfect and zoom is often of poor quality.
• Light conditions may vary greatly.
• The rotation of the image varies from one shot to another.
Pictures taken from newspapers have other caveats:
• Newspaper pages are never completely flat, resulting in curved images.
• Each newspaper uses different font styles and sizes.
C. Deep Belief Network
Deep Belief Networks (DBNs) were introduced by G. Hinton and R. Salakhutdinov in 2006 [3]. They provide a novel way of training deep neural networks efficiently and effectively. A DBN is a deep neural network composed of several layers of hidden units. There are connections between the layers, but not between units of the same layer. The hidden units typically have binary values, but extensions to other types of units have already been experimented with.
DBNs are typically implemented as a composition of simple networks, such as Restricted Boltzmann Machines (RBMs). Figure 2 shows an example of an RBM. Using an RBM for each layer leads to a fast, unsupervised, layer-by-layer training. For that, contrastive divergence is applied to each layer in turn. To turn the network into a classifier, a fine-tuning strategy can be applied to the whole network to finalize the training [1]. This is comparable to the backpropagation algorithm used to optimize a standard neural network.
When trained in an unsupervised way, a DBN learns to reconstruct its inputs, making it an autoencoder. During learning, the DBN automatically learns features from the raw input. The advantage of these networks is that they are generally fed directly with low-level data such as pixel colors. This leads to simpler systems with few pre-processing steps.
III. RELATED WORK
A. Camera-based OCR
In 2005, Liang et al. published a complete survey of camera-based analysis of text and documents [4]. The various challenges of this problem are studied in detail. The various steps of image processing (text localization, normalization, enhancement, binarization) are analyzed and the different solutions are compared. Although there are many different solutions, the authors show that many problems remain open.
In 2013, Jain et al. thoroughly explored the various challenges raised by mobile-based OCR [5]. The solutions adopted by standard systems to overcome these challenges are analyzed and compared. They focus on the processing steps that allow traditional feature extraction and recognition techniques to be applied afterwards as usual. They show that even though solutions keep getting better, there is still room for improvement.
In 2014, Chen et al. compared several features for a product identification task [6]. In particular, they found that resizing and cropping the image may lead to significant accuracy and performance gains. Moreover, global features based on color generally perform best on this kind of image.
B. Sudoku
In 2012, Adam Van Horn proposed a very simple technique to recognize and solve Sudoku puzzles [7], also based on the Hough transform. The four corners of the Sudoku are detected based on the intersections of the detected lines. The digits are then centered in their cells and passed to an Artificial Neural Network (ANN). From each digit image, 100 features are computed. Blank cells are also classified by the ANN and not detected a priori. An improved backtracking algorithm is used to solve the puzzle. For lack of a complete test set, this method was only tested on a few images.
Simha et al. presented a different technique in 2012 [8]. Adaptive thresholding is applied to the complete image. Then, the components connected to the borders are removed to reduce noise and improve the later character recognition steps. By using another Connected Components algorithm, the largest component area is identified as the grid. Characters inside the grid are then located by labeling the connected components. After that, a virtual grid is computed based on the enclosing box of the grid and each detected digit is assigned to a cell. Finally, the digits are classified using a simple template matching strategy and the Sudoku is solved with a recursive backtracking strategy.
C. Deep Belief Networks
In 2006, Hinton et al. proposed to use a DBN to perform digit recognition on the MNIST dataset [1]. The weights of the network are first initialized using an unsupervised algorithm. They are then fine-tuned using a contrastive form of the wake-sleep algorithm [9]. Their network achieved state-of-the-art performance, outperforming the previous leader, a Support Vector Machine (SVM). Since this breakthrough, interest in Deep Learning architectures has been renewed.
Fig. 3. Detected Lines
In 2007, Ranzato et al. proposed a novel way of learning sparse features for DBNs [10]. Their algorithm, based on the encoder-decoder principle, is called Sparse Encoding Symmetric Machine (SESM). This new algorithm is particularly efficient to train and works without requiring any input processing. Trained networks achieve excellent error rates.
In 2009, Lee et al. presented a new type of DBN, the Convolutional Deep Belief Network (CDBN) [11]. This solution allows the network to scale to larger image sizes, a CDBN being able to classify full-sized natural images. They demonstrated excellent performance on visual recognition tasks. Moreover, their network achieved state-of-the-art results on the MNIST dataset.
Since then, Deep Belief Networks and Deep Architectures in general have been used in several domains (Face Recognition, Reinforcement Learning, Handwritten Character Recognition, etc.). They have proved very successful, often achieving state-of-the-art results.
IV. DATASET
As no free dataset of Sudoku images existed when this project started, it was decided to gather a new dataset to thoroughly test the proposed approach. For this purpose, 160 images have been gathered, from various cell phones and from different local newspapers. The dataset is separated into a training set of 120 images and a test set of 40 images. The images of the current dataset come from seven different phone models from three different manufacturers. At the time of this writing, a second version of the dataset is being developed with new pictures from more recent phones.
Each image is associated with some metadata indicating the content of the cells, 0 indicating an empty cell. The metadata are stored in a simple text file. The pixel resolution and color depth of each image are also kept in the metadata, as well as the brand and model of the cell phone that took the picture.
The dataset is available on GitHub:
wichtounet/sudoku_dataset
V. DETECTION
The system is developed specifically to work on a Sudoku grid containing 9x9 cells. It works in several steps. First, the lines of the Sudoku are detected. Then, the grid is detected using the lines and is split into 81 cells. Finally, at most one digit is detected in each cell.
Fig. 4. Fully detected grid with cell numbers
A. Line Detection
Before detecting the lines, the image is binarized. Since this first binarization only serves to help detect the lines, it does not try to preserve the details of the digits. The image is first converted to grayscale. A median blur is used to remove noise and an adaptive thresholding algorithm is then applied to binarize the image. A second median blur is applied to the binary image to remove further noise. Finally, a morphological dilation is performed on the image to thicken the lines.
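The three building blocks of this binarization can be sketched in pure Python on a small grayscale image stored as a list of rows. This is an illustration under stated assumptions, not the authors' implementation: the window radii and the `offset` value are hypothetical (the paper does not give them), the threshold is a plain local mean, and ink is encoded as 1 and background as 0.

```python
def _window(img, y, x, radius):
    """Values in the square neighbourhood of (y, x), clipped at the borders."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(0, y - radius), min(h, y + radius + 1))
            for i in range(max(0, x - radius), min(w, x + radius + 1))]

def _apply(img, radius, fn):
    """Build a new image where each pixel is fn(pixel value, its window)."""
    return [[fn(img[y][x], _window(img, y, x, radius))
             for x in range(len(img[0]))]
            for y in range(len(img))]

def median_blur(img, radius=1):
    """Replace each pixel by the median of its neighbourhood (removes speckles)."""
    return _apply(img, radius, lambda _, win: sorted(win)[len(win) // 2])

def adaptive_threshold(img, radius=2, offset=10):
    """Mark a pixel as ink (1) if it is darker than its local mean minus offset."""
    return _apply(img, radius,
                  lambda value, win: 1 if value < sum(win) / len(win) - offset else 0)

def dilate(binary, radius=1):
    """Set a pixel to 1 if any neighbour is 1, which thickens the lines."""
    return _apply(binary, radius, lambda _, win: max(win))
```

Because the threshold is computed from the neighbourhood of each pixel rather than from the whole image, a dark grid line is still separated from the paper when the illumination changes gradually across the picture. The second median blur of the paper would reuse `median_blur` on the binary image.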
Once the image is binarized, edges are detected using the Canny algorithm [12]. Line segments are then detected using the Progressive Probabilistic Hough Transform [13] on the detected edges. The probabilistic version of the Hough transform is preferred over the standard form because it is able to detect segments and not only complete lines. The Hough transform is a fairly standard computer vision technique. First conceived to detect lines in images, it has since been extended to detect arbitrary shapes, such as circles or ellipses. Thanks to its voting system, the Hough transform is especially well suited to detecting imperfect shapes.
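The voting idea can be illustrated with the standard Hough transform, which is simpler than the progressive probabilistic variant used in the paper and is swapped in here only for clarity. Each edge point votes for every line x·cos(θ) + y·sin(θ) = ρ that passes through it, and lines are read from the most voted (ρ, θ) bins. The function name, the bin sizes and the `min_votes` threshold are all hypothetical.

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, rho_step=1.0, min_votes=5):
    """Standard Hough transform on a list of (x, y) edge points.

    Returns (votes, rho, theta_in_degrees) tuples, strongest line first.
    """
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Quantize rho so that nearly identical lines share one bin.
            votes[(round(rho / rho_step), t)] += 1
    lines = [(count, r * rho_step, math.degrees(math.pi * t / n_theta))
             for (r, t), count in votes.items() if count >= min_votes]
    return sorted(lines, reverse=True)
```

A line with gaps still collects most of its votes in a single bin, while isolated noise points spread their votes over many bins. This is why the method tolerates broken or partially hidden grid lines.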
Connected Component Analysis is then performed to cluster the segments. Only the biggest cluster of segments is kept. The segments that are approximately on the same line are then merged together to form bigger segments. All the segments are then converted to lines. If too many lines are detected (more than 20), they are filtered:
1) Lines for which there is no other line with a similar angle are removed.
2) Lines that are too far from their neighbours compared to the other lines are removed.
Figure 3 shows the result of this step on a Sudoku image.
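The first filtering rule can be sketched as follows, with each line represented only by its orientation in degrees. The 5 degree tolerance and the function names are hypothetical, since the paper does not say what counts as a relatively equal angle. Orientations are compared modulo 180 degrees so that 179° and 0° are treated as neighbours.

```python
def angle_distance(a, b):
    """Smallest difference between two line orientations, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

def drop_isolated_angles(angles, tolerance=5.0):
    """Keep only the lines that have at least one other line at a similar angle."""
    return [a for i, a in enumerate(angles)
            if any(angle_distance(a, b) <= tolerance
                   for j, b in enumerate(angles) if j != i)]
```

In a Sudoku grid every line has nine near-parallel companions, so a line with a unique orientation is most likely a page edge, a fold or part of a digit.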
B. Grid Detection
Once the lines are detected, all the intersections between them are computed. Intersections very close to each other are merged. If the lines are correctly detected, there will be exactly 100 detected intersections (10 horizontal lines crossing 10 vertical lines). If this is not the case, a Contour Detection algorithm [14] is used to detect the largest contour of the image. The outer points of the contour are then used instead of the intersection points for the next steps.
Fig. 5. Characters detected by the system
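The count of 100 intersections can be reproduced with a short sketch. Lines are given in the (ρ, θ) normal form x·cos(θ) + y·sin(θ) = ρ; the merging strategy (keep the first point of each cluster) and the 5 pixel `min_dist` are assumptions made for illustration, as the paper does not describe how close points are merged.

```python
import math
from itertools import combinations

def intersect(line_a, line_b, eps=1e-9):
    """Intersection of two (rho, theta) lines, or None if they are parallel."""
    (r1, t1), (r2, t2) = line_a, line_b
    det = math.cos(t1) * math.sin(t2) - math.sin(t1) * math.cos(t2)
    if abs(det) < eps:
        return None
    x = (r1 * math.sin(t2) - r2 * math.sin(t1)) / det
    y = (r2 * math.cos(t1) - r1 * math.cos(t2)) / det
    return (x, y)

def grid_intersections(lines, min_dist=5.0):
    """All pairwise intersections, keeping one point per cluster of close points."""
    kept = []
    for line_a, line_b in combinations(lines, 2):
        p = intersect(line_a, line_b)
        if p is not None and all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept
```

With 10 vertical and 10 horizontal lines this yields 100 points, and a duplicated line lying half a pixel away from an existing one does not change the count, which is the situation the merging step is meant to absorb.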
To keep only the external points, the four corners of the grid are computed from the Convex Hull of the detected points. The quadrilateral formed by these four points is considered to be the grid. Each side is split into 9 segments of equal length. A quadrilateral is then computed for each cell from these segments. As the later steps only handle rectangles, the bounding rectangle of each quadrilateral is taken as the final cell.
Figure 4 shows a detected grid on a Sudoku image.
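One possible reading of the splitting step is sketched below. It assumes that matching division points on opposite sides are joined by straight lines, so that each cell corner can be obtained by interpolating between the four grid corners. The function names and the (x_min, y_min, x_max, y_max) rectangle layout are hypothetical.

```python
def lerp(p, q, t):
    """Point at fraction t of the way from p to q."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)

def grid_cells(top_left, top_right, bottom_right, bottom_left, n=9):
    """Bounding rectangles of the n x n cells of a quadrilateral, row by row."""
    def corner(u, v):
        # Walk a fraction u along the top and bottom sides, then a fraction v
        # down the segment joining the two resulting points.
        top = lerp(top_left, top_right, u)
        bottom = lerp(bottom_left, bottom_right, u)
        return lerp(top, bottom, v)

    cells = []
    for row in range(n):
        for col in range(n):
            quad = [corner(col / n, row / n), corner((col + 1) / n, row / n),
                    corner((col + 1) / n, (row + 1) / n), corner(col / n, (row + 1) / n)]
            xs = [p[0] for p in quad]
            ys = [p[1] for p in quad]
            cells.append((min(xs), min(ys), max(xs), max(ys)))
    return cells
```

For a skewed grid the bounding rectangles of neighbouring cells overlap slightly, which is one reason why the later character isolation step has to cope with fragments of grid lines or of adjacent digits.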
C. Character Isolation
Once each cell is detected, it is necessary to determine whether it contains a character. If it does, the position of the character needs to be found precisely. This step works on the binary image. The lines detected in the previous step are removed from the image.
Then, the algorithm works as follows for each cell. The sub-image of the cell is extracted from the binary image. All the contours inside the sub-image are detected. Only the contours of reasonable size are kept. The bounding rectangles of the contours are used for the next steps. Once these rectangles are found, overlapping rectangles are merged together (a character is often detected as two or more distinct parts). If the image was of poor quality or if the grid was not perfectly detected, there will still be several candidates at this point. Heuristics based on the shapes of digits and their sizes are used to filter the candidates. At the end, the rectangle with the biggest area is taken as the best character candidate. The background pixel density is used to determine whether it is really a character and not an empty cell.
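The merging and selection of candidate rectangles can be sketched as follows, with rectangles stored as (x, y, width, height) tuples. The `min_area` threshold stands in for the unspecified size check and is hypothetical; the shape heuristics and the pixel density test of the paper are not reproduced here.

```python
def overlaps(a, b):
    """True if two (x, y, w, h) rectangles share some area."""
    return (a[0] < b[0] + b[2] and b[0] < a[0] + a[2] and
            a[1] < b[1] + b[3] and b[1] < a[1] + a[3])

def union(a, b):
    """Smallest rectangle containing both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)

def merge_overlapping(rects):
    """Repeatedly merge overlapping rectangles until none overlap."""
    rects = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if overlaps(rects[i], rects[j]):
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    changed = True
                    break
            if changed:
                break
    return rects

def best_candidate(rects, min_area=4):
    """Largest merged rectangle, or None when the cell has no candidate."""
    kept = [r for r in rects if r[2] * r[3] >= min_area]
    merged = merge_overlapping(kept)
    return max(merged, key=lambda r: r[2] * r[3], default=None)
```

Returning `None` corresponds to the case where the cell is reported as empty.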
Once the character is properly isolated inside a rectangle, the digit image is centered inside a white square image whose side is equal to the larger of the two dimensions of the rectangle. The square image is resized to 32x32 pixels and binarized using a less destructive binarization than the one that was applied to the whole image: an adaptive thresholding, followed by a simple median blur with a small window size to remove the remaining noise.
Fig. 6. The network used to classify the source images with digit labels. Each layer of the network is an RBM.
At any time during this step, if there are no more candidates or if the last candidate is filtered out by the heuristics, the cell is considered empty and 0 is returned.
Figure 5 shows the detected characters on a Sudoku image.
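The centering and resizing of an isolated digit can be sketched on a grayscale crop stored as a list of rows. Nearest-neighbour resampling is an assumption made to keep the example short (the paper does not name the interpolation method), and the function names are hypothetical.

```python
def center_in_square(digit, background=255):
    """Paste a rectangular crop in the middle of a white square image."""
    h, w = len(digit), len(digit[0])
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    square = [[background] * side for _ in range(side)]
    for y in range(h):
        for x in range(w):
            square[top + y][left + x] = digit[y][x]
    return square

def resize_nearest(square, size=32):
    """Resize a square image to size x size with nearest-neighbour sampling."""
    src = len(square)
    return [[square[y * src // size][x * src // size] for x in range(size)]
            for y in range(size)]
```

Padding to a square before resizing preserves the aspect ratio of the digit, so a narrow "1" is not stretched to fill the whole 32x32 input.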
VI. DIGIT RECOGNITION
Once each digit has been detected, the remaining task is to label each image with a label from 1 to 9 (empty cells have already been identified by the previous steps). Each digit is a binary image of 32x32 pixels.
A DBN is used to recognize the digits. The network has four layers (300 hidden units for the first layer, 300 again for the second, 500 for the third and finally 9 units for the last layer). Each layer is an RBM. The first layers use a logistic sigmoid as activation function, whereas the last layer uses a simple base-e exponential. This network has 551,700 parameters. Figure 6 shows the DBN used for this task.
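The reported parameter count can be checked with a few lines of arithmetic. The check assumes 1024 input units (one per pixel of the 32x32 digit image) and counts only the connection weights between successive layers; under these assumptions the total matches the figure given above, which suggests that biases are not included in it.

```python
# Units per layer: 32x32 input pixels, then the four layers of the DBN.
layer_sizes = [32 * 32, 300, 300, 500, 9]

# Each RBM is fully connected, so it has (inputs x outputs) weights.
weights_per_layer = [a * b for a, b in zip(layer_sizes, layer_sizes[1:])]
total_weights = sum(weights_per_layer)

print(weights_per_layer)  # [307200, 90000, 150000, 4500]
print(total_weights)      # 551700
```

More than half of the weights sit in the first layer, which is the price of feeding raw pixels to the network instead of a compact feature vector.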
The DBN is used as a feature extractor and a classifier at the same time: no hand-crafted features are extracted from the image. The input of the first layer consists of 1024 binary units (32x32), which are set to the binary values of the cell image. The output of the last layer consists of 9 label units, each indicating a digit from 1 to 9. The last layer is not binary, but contains activation probabilities.
A. Training the network
The network shown in Figure 6 is trained on the complete training set. After digit detection and empty cell removal, 3497 digits are present in the training set.
The weights of each RBM are initialized using a Gaussian distribution with a mean of 0.0 and a standard deviation of 0.01. The bias of visible unit $i$ of the first RBM is initialized to $\log\left(p_i / (1 - p_i)\right)$, where $p_i$ is the proportion of training vectors in which unit $i$ is on. All the other biases are initialized to 0.0.
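This initialization can be sketched as follows. The clamping of the proportion with `eps` is an assumption added to avoid taking the logarithm of zero for pixels that are always on or always off; the paper does not say how this case is handled. The function names are hypothetical.

```python
import math
import random

def init_weights(n_visible, n_hidden, rng):
    """Weights drawn from a Gaussian with mean 0.0 and standard deviation 0.01."""
    return [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
            for _ in range(n_visible)]

def init_visible_biases(training_vectors, eps=1e-6):
    """Bias of visible unit i set to log(p_i / (1 - p_i))."""
    n = len(training_vectors)
    biases = []
    for i in range(len(training_vectors[0])):
        p = sum(v[i] for v in training_vectors) / n  # fraction of vectors with unit i on
        p = min(max(p, eps), 1.0 - eps)              # assumption: clamp to avoid log(0)
        biases.append(math.log(p / (1.0 - p)))
    return biases
```

With this choice, a visible unit that receives no input from the hidden layer already turns on with probability p_i, so the weights only have to model deviations from the average digit image.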
Each RBM is first trained in an unsupervised way, one by one, using CD-1 contrastive divergence (a single step of Gibbs sampling). Batches of 10 images are used to train the RBMs. The learning rate has been selected by experiment to be 0.1. The momentum is fixed to 0.5 for the first 6 epochs and is then increased to 0.9 for the remaining epochs. Weight decay is applied to all the weights with a weight cost of 0.0002. 20 epochs of contrastive divergence are performed on each layer. The unsupervised training completed in 280 seconds on a 3.3GHz Intel Haswell processor. The last layer of the network is not trained with contrastive divergence and its weights are left at the values given by the Gaussian initialization.
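A single CD-1 weight update with the hyper-parameters quoted above can be sketched in pure Python. This is a textbook-style illustration, not the authors' C++ code: only the weight matrix is updated (bias updates follow the same pattern and are omitted), the hidden states are sampled once while probabilities are used for the reconstruction, and all function names are hypothetical.

```python
import math
import random

LEARNING_RATE = 0.1
WEIGHT_COST = 0.0002

def momentum_for(epoch):
    """0.5 for the first 6 epochs, 0.9 afterwards."""
    return 0.5 if epoch < 6 else 0.9

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]

def cd1_update(batch, W, b, c, velocity, epoch, rng):
    """One CD-1 step on a mini-batch; W and velocity are updated in place."""
    n_visible, n_hidden = len(b), len(c)
    grad = [[0.0] * n_hidden for _ in range(n_visible)]
    for v0 in batch:
        h0 = hidden_probs(v0, W, c)                              # positive phase
        h0_sample = [1.0 if rng.random() < p else 0.0 for p in h0]
        v1 = visible_probs(h0_sample, W, b)                      # reconstruction
        h1 = hidden_probs(v1, W, c)                              # negative phase
        for i in range(n_visible):
            for j in range(n_hidden):
                grad[i][j] += (v0[i] * h0[j] - v1[i] * h1[j]) / len(batch)
    momentum = momentum_for(epoch)
    for i in range(n_visible):
        for j in range(n_hidden):
            step = LEARNING_RATE * (grad[i][j] - WEIGHT_COST * W[i][j])
            velocity[i][j] = momentum * velocity[i][j] + step
            W[i][j] += velocity[i][j]
```

Training one RBM then amounts to calling `cd1_update` on every batch of 10 images for 20 epochs, before feeding the hidden activations of the trained layer to the next RBM.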
In the second phase of training, the complete network is “fine-tuned” with the labels associated with each sample image. Several methods can be used to optimize a deep network, for instance Stochastic Gradient Descent (SGD), Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) or Conjugate Gradient (CG). In general, CG and L-BFGS methods are faster than SGD methods and may lead to better models. They also have a better parallel potential (on GPUs) and can be distributed on different locally-connected machines. Moreover, CG may perform better than L-BFGS on large neural networks [15].
For these reasons, a CG method has been chosen to “fine-tune” the classifier. A nonlinear Conjugate Gradient optimization method has been implemented ([16], [17] and [18]). The Polak-Ribière variant of the Conjugate Gradient method is used to find the search directions. The step sizes are estimated with a line search using cubic and quadratic polynomial approximations and a Wolfe-Powell stopping criterion. CG can be made more scalable by using minibatch training. In this case, batches of 100 images were chosen. The network is trained for 10 epochs. The complete fine-tuning step took about 40 minutes.
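The search direction update can be illustrated on a small function. The sketch below uses the Polak-Ribière formula (clipped at zero, a common safeguard) but replaces the cubic and quadratic interpolation line search with a much simpler backtracking search based on the sufficient-decrease condition, so it is not equivalent to the optimizer described above. The function name `minimize_cg` and all constants are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def minimize_cg(f, grad, x0, max_iter=2000, tol=1e-6):
    """Nonlinear conjugate gradient with Polak-Ribiere directions."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]                      # start with steepest descent
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        slope = dot(g, d)
        if slope >= 0.0:                       # not a descent direction: restart
            d = [-gi for gi in g]
            slope = -dot(g, g)
        # Backtracking line search: halve the step until f decreases enough.
        alpha, fx = 1.0, f(x)
        while alpha > 1e-12:
            candidate = [xi + alpha * di for xi, di in zip(x, d)]
            if f(candidate) <= fx + 1e-4 * alpha * slope:
                break
            alpha *= 0.5
        x = candidate
        g_new = grad(x)
        # Polak-Ribiere coefficient, clipped at zero.
        diff = [gn - go for gn, go in zip(g_new, g)]
        beta = max(0.0, dot(g_new, diff) / dot(g, g))
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x
```

For the fine-tuning of the network, `f` would be the classification error on a minibatch of 100 images and `x` the flattened weights of all layers.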
B. Testing the network
The network shown in Figure 6 is tested on the complete
test set. After digit detection and empty cell removal, 1156
images are present in the test set.
To test the network, the state of the visible units of the first
layer is set to the binary values of the pixels of the image to
classify. The values of the hidden units are sampled from the
inputs. The next layers are using the activation probabilities
of the previous RBMs as input. Finally, the topmost label unit
with the highest activation probability is taken as the output
of the network.
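The classification step can be sketched as a forward pass through the layers. Two simplifications are assumed here: activation probabilities are propagated through every layer (no sampling of the first hidden layer), and the exponential outputs of the last layer are normalized so that they sum to one. Each layer is given as a hypothetical `(W, bias)` pair, where `W[i][j]` connects input `i` to unit `j`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(inputs, W, bias):
    """Weighted input of every unit of a layer."""
    return [bias[j] + sum(inputs[i] * W[i][j] for i in range(len(inputs)))
            for j in range(len(bias))]

def forward(pixels, layers):
    """Label probabilities for one binary digit image."""
    activations = pixels
    for W, bias in layers[:-1]:
        activations = [sigmoid(z) for z in affine(activations, W, bias)]
    W, bias = layers[-1]
    scores = [math.exp(z) for z in affine(activations, W, bias)]
    total = sum(scores)
    return [s / total for s in scores]

def predict_digit(pixels, layers):
    """Digit (1 to 9) whose label unit has the highest probability."""
    probs = forward(pixels, layers)
    return probs.index(max(probs)) + 1
```

The `+ 1` reflects that the nine label units stand for the digits 1 to 9, since empty cells never reach the network.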
The best trained network achieved an error rate of 0.605% on the 1156 images to classify. Most errors correspond to similar images. Other errors come from an image of very poor quality. In this case, the binarization destroyed information about the digits and made the recognition step harder. As the recognition step depends on the detection step, if digits are not detected or are wrongly detected (displaced or only partially captured), the training and testing of the digit recognizer are impacted. On the complete set, the network achieved an error rate of 0.47%.
TABLE I. Computing time for each step of the complete process (times in µs)
Step                 Min      Max      Mean     Median
Image Loading        1568     70136    2576     1659
Line Detection       27181    52840    31421    30983
Grid Detection       61       16131    1403     70
Digit Detection      9412     12301    10256    10227
Digit Recognition    30847    61237    44869    44127
Total                77835    188232   90238    88333
VII. RESULTS
A. Overall results
On the test set, the complete system has an error rate of 12.5% (five Sudoku images out of 40 were not perfectly recognized). This error rate is computed at the Sudoku level: a puzzle is considered wrong as soon as one of its cells is not classified correctly. If cells are considered individually, the error rate is as low as 0.37% (12 errors on 3240 cells). 58% of the cell errors come from the detector failing to identify an empty cell or to detect a digit. The remaining errors come from wrong classifications by the DBN.
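The two error rates follow directly from the counts given above, as the short computation below shows. The split of the 12 cell errors between the detector and the DBN is inferred from the rounded 58% figure and is therefore approximate.

```python
test_puzzles = 40
failed_puzzles = 5
cells = test_puzzles * 9 * 9          # 81 cells per puzzle
cell_errors = 12

puzzle_error_rate = 100.0 * failed_puzzles / test_puzzles
cell_error_rate = 100.0 * cell_errors / cells

# About 58% of the cell errors are attributed to the detection steps.
detector_errors = round(0.58 * cell_errors)
dbn_errors = cell_errors - detector_errors

print(f"{puzzle_error_rate:.1f}% of puzzles, {cell_error_rate:.2f}% of cells")
print(f"detector: {detector_errors} errors, DBN: {dbn_errors} errors")
```

The gap between the two rates comes from the all-or-nothing criterion at the puzzle level: a single wrong cell out of 81 is enough to count the whole Sudoku as an error.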
Considering the poor quality of some images and the bad conditions in which some other images were taken, these results are rather satisfying. The recognition task is not very complex since all digits are computer-printed. However, the classification depends heavily on the quality of the detection steps and on the quality of the binarization that is applied to the image. As the dataset contains images that were taken in variable conditions, no thresholding method is able to perfectly binarize all the images. This causes some digits to be wrongly classified by the DBN.
B. Performance
Table I shows the computing time for each step of the
complete process. The experiment has been repeated 3 times
and only the experiment with the lowest total time has been
taken into account. All the experiments have been run on a
3.3GHz Intel Haswell processor.
The recognition itself is the most time-consuming task of the process. The classification time depends on the size of the network and on the number of digits in the Sudoku. The DBN implementation itself has already been tuned for efficiency. On the other hand, it is also the task that is the easiest to parallelize, since each cell can be classified independently.
Slightly less time-consuming, the Line Detection task is also quite expensive. The Hough transform is slow, and a lot of processing, transformation and filtering is done on the detected lines. The Digit Detection time is not negligible, but is not as critical as the other two steps. Detection of the grid is generally almost free. The large difference between its mean and its median is due to the fact that, when the lines are not detected correctly, the system falls back to a contour detection algorithm, which is much more expensive than the simple detection of the grid using the lines.
Given that the algorithms were not specially tuned for performance and that there is still considerable room for optimization, the results are quite satisfying. On average, less than 100 ms is necessary to completely detect and recognize a Sudoku inside an image.
VIII. CONCLUSION
We designed and implemented a complete solution to detect and recognize a Sudoku in an image taken with a phone camera. Our system proved to work well with images taken under different conditions (light, blur, shadows, small rotations, distortions, etc.). The system fully recognized 35 Sudokus out of 40 from the test set, the others only having small errors on some digits. The DBN has proved to be a very accurate classifier for the digits, reaching an error rate of 0.605%. The proposed implementation is able to process an image in less than 100 ms on average. We also collected and made available a dataset of 160 images of Sudoku puzzles.
While our method performed well on our dataset, there is still room for improvement in our system:
• Instead of a complex algorithm to detect the grid and the cells, it could be better to directly use a DBN to detect the digits inside the detected grid or to detect the grid itself.
• In cases where Line Detection is not completely accurate and contour detection is used instead, differentiating the grid lines from the digits is not always perfect. The complete algorithm should be unified and improved. Both ways of detecting the grid should work together instead of the second being only used as a fallback.
• Although the system is not slow, there is still room for improvement to make sure the solution works as fast as possible. For instance, digit recognition could be run in parallel on several digits at the same time, or the slow line detection algorithm could be improved.
• Several heuristics are too tightly tied to the dataset images. The dataset will need to be completed with images of higher quality to represent more types of images. It will be necessary to make the heuristics more adaptive to handle these new types of images.
• Since our system specifically handles smartphone pictures, it would make sense to port the system to a smartphone application. In that case, an integrated solver would be a great addition to the system.
• Recognition of Sudokus containing both handwritten and printed digits could be an interesting challenge.
We would like to thank all the people who contributed to the
dataset by sending us Sudoku images taken from their phones,
in particular Patrick Anagnostaras.
The C++ implementation of our recognizer is available on-
line: recognizer
The C++ DBN library, used by the recognizer, is also
available freely:
Both works are available under the terms of the MIT license.
A recent version of the Clang compiler is necessary to
compile these tools. While the project should be able to build
on Windows, it has not been tested under platforms other than
[1] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm
for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554,
Jul. 2006. [Online]. Available:
[2] S. Impedovo, L. Ottaviano, and S. Occhinegro, “Optical character
recognition—a survey,” International Journal of Pattern Recognition and
Artificial Intelligence, vol. 5, no. 01n02, pp. 1–24, 1991.
[3] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
Jul. 2006. [Online]. Available:
[4] J. Liang, D. Doermann, and H. Li, “Camera-based analysis of text and
documents: a survey,” International Journal of Document Analysis and
Recognition (IJDAR), vol. 7, no. 2-3, pp. 84–104, 2005.
[5] A. Jain, A. Dubey, R. Gupta, and N. Jain, “Fundamental challenges to
mobile based ocr,” vol. 2, no. 5, May 2013, pp. 86–101.
[6] K. Chen and J. Hennebert, “Content-Based Image Retrieval with
LIRe and SURF on a Smartphone-Based Product Image Database,” in
6th Mexican Conference on Pattern Recognition (MCPR2014), 2014.
[Online]. Available:
[7] A. Van Horn, “Extraction of sudoku puzzles using the hough transform,”
[8] P. Simha, K. Suraj, and T. Ahobala, “Recognition of numbers and
position using image processing techniques for solving sudoku puzzles,”
in Advances in Engineering, Science and Management (ICAESM), 2012.
IEEE, 2012, pp. 1–5.
[9] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The wake-sleep
algorithm for unsupervised neural networks.” Science, vol. 268, p. 1158,
[10] M. Ranzato, Y.-L. Boureau, and Y. LeCun, “Sparse feature
learning for deep belief networks,” Advances in neural information
processing systems, vol. 20, pp. 1185–1192, 2007.
[11] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional
deep belief networks for scalable unsupervised learning of hierarchical
representations,” in Proceedings of the 26th Annual International
Conference on Machine Learning, ser. ICML ’09. New York,
NY, USA: ACM, 2009, pp. 609–616. [Online]. Available: http:
[12] J. Canny, “A computational approach to edge detection,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–698, Jun. 1986.
[Online]. Available:
[13] J. Matas, C. Galambos, and J. Kittler, “Robust detection of lines using
the progressive probabilistic hough transform,” Comput. Vis. Image
Underst., vol. 78, no. 1, pp. 119–137, Apr. 2000. [Online]. Available:
[14] S. Suzuki and K. Abe, “Topological structural analysis of digitized
binary images by border following,” Computer Vision, Graphics, and
Image Processing, vol. 30, no. 1, pp. 32–46, 1985. [Online]. Available:
[15] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng,
“On optimization methods for deep learning.” in ICML, L. Getoor and
T. Scheffer, Eds. Omnipress, 2011, pp. 265–272. [Online]. Available:
[16] J. R. Shewchuk, “An introduction to the conjugate gradient method
without the agonizing pain,” Pittsburgh, PA, USA, Tech. Rep., 1994.
[17] R. Fletcher and C. M. Reeves, “Function minimization by conjugate
gradients,” The Computer Journal, vol. 7, no. 2, pp. 149–154, Feb.
1964. [Online]. Available:
[18] C. E. Rasmussen, “Minimize a differentiable multivariate function,
implementation in matlab,” 2006. [Online]. Available: http://learning.
... • Chapter 7 Auto-encoders In this final evaluation, the features generated from RBM and CRBM are compared to the features generated by regular auto-encoders on the keyword spotting task. Several variants of models are investigated, from dense auto-encoders to convolutional auto-encoders, as well as deep and stacked 2. 1. Introduction to Machine Learning CHAPTER 2. FUNDAMENTALS 2.8.4 Spike and Slab RBM . . . . . . . . . . . . . . . . . . . . ...
... The first layer is called the input layer, the last layer is the output layer and the intermediate layers are called hidden layers. A example of neural network with three layers is shown in Figure 2. 1. Each neuron has an associated value and this value is either given by the data (for the input layer) or computed from the set of its inputs, given by the previous layer. ...
... Handcrafted features have several disadvantages. 1. They require expert knowledge of the data and of the problem. ...
Full-text available
In this thesis, we propose to use methodologies that automatically learn how to extract relevant features from images. We are especially interested in evaluating how these features compare against handcrafted features. More precisely, we are interested in the unsupervised training that is used for the Restricted Boltzmann Machine (RBM) and Convolutional RBM (CRBM) models. These models relaunched the Deep Learning interest of the last decade. During the time of this thesis, the auto-encoders approach, especially Convolutional Auto-Encoders (CAE) have been used more and more. Therefore, one objective of this thesis is also to compare the CRBM approach with the CAE approach. The scope of this work is defined by several machine learning tasks. The first one, handwritten digit recognition, is analysed to see how much the unsupervised pretraining technique introduced with the Deep Belief Network (DBN) model improves the training of neural networks. The second, detection and recognition of Sudoku in images, is evaluating the efficiency of DBN and Convolutional DBN (CDBN) models for classification of images of poor quality. Finally, features are learned fully unsupervised from images for a keyword spotting task and are compared against well-known handcrafted features. Moreover, the thesis was also oriented around a software engineering axis. Indeed, a complete machine learning framework was developed during this thesis to explore possible optimizations and possible algorithms in order to train the tested models as fast as possible.
... In a previous work, we addressed the problem of recognizing Sudoku puzzles from newspaper pictures taken with digital cameras such as the ones embedded in our smartphones [3]. The Sudoku puzzle is a famous Japanese game. ...
... The digit detection step follows the approach of our previous work [3]. The detection procedure had to be tuned in order to handle handwritten digits with thinner strokes and the fact that there are no empty cells. ...
... Since the feature extractor expects equally-sized squares, the final rectangle is enlarged to a square and resized to 32×32. More information on these steps is available in [3]. ...
... The authors have proposed a model to recognize Sudoku images from a mobile capture and classify them by difficulty. The primary method used was a CNN, and the others were OCR, CDBN and DBN. The model produced an accuracy of 98 percent [5]. ...
... A neural network is a series of computations that attempts to recognize underlying relationships in a set of data through a process that mimics the way the human brain works. As the dataset is limited to two columns (quizzes, solution), there is not much scope for feature selection [5]. ...
Conference Paper
Computer vision has been lionized in the recent trends of the information technology world. Because of its enormous range of applications and features, it has been gaining the limelight for solving complex real-world problems which are beyond human intelligence. Extending this artificial intelligence marvel to everyday life makes it useful and convenient. Sudoku has become a source of amusement in many lives and is ubiquitous in newspapers, magazines and apps. Applying computer vision techniques to solve the Sudoku puzzle is therefore the aim of this paper. We have employed some of the prominent tools such as OpenCV, OCR and TensorFlow. The paper explains the methodology with which we have used these tools to solve the Sudoku puzzle. The methodology involves several stages: image preprocessing and image extraction using OpenCV, OCRing the numerical data from the extracted image using Tesseract, and finally feeding the extracted numerical data to the neural network (TensorFlow) model to get the desired output. The main motivation of our research is to develop a help device for Sudoku players, for example to encourage players to tackle hard Sudoku puzzles or to assist them when they are looking for help. The puzzle is troublesome at various stages, particularly for new players or individuals with insufficient confidence or endurance, and simple for skilled players. The input data comes from the camera or from an image.
... Sudoku is a number game played on a 9×9 grid. The objective is to fill the blank squares with digits 1 to 9. When a Sudoku is completed, each row and column must contain all digits from 1 to 9 exactly once [1][2][3][4][5]. The numbers from 1 to 9 must also appear within each 3×3 box, and no row or column can contain a repeated digit [6][7][28]. ...
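The completion rules quoted above can be checked mechanically. The following is a minimal sketch, assuming the grid is stored as a list of nine lists of nine integers; the function name `is_valid_solution` and the example grid `solved` are illustrative and do not come from any of the cited papers.

```python
def is_valid_solution(grid):
    """Return True if every row, column and 3x3 box holds the digits 1..9 once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    boxes = [
        set(grid[br + r][bc + c] for r in range(3) for c in range(3))
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    # Each of the 27 groups must be exactly the set {1, ..., 9}.
    return all(group == digits for group in rows + cols + boxes)


# Illustrative completed grid built from a shifted pattern (an assumption for
# the example, not a puzzle taken from the dataset).
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_valid_solution(solved))
```

Comparing each group against the full digit set covers both requirements at once: no digit is missing and none is repeated.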
... The scientific question addressed in this research is whether such deep learning systems are able to handle mixed contents, for example recognizing both text and printed digit inputs without separating them into two distinct problems. A previous work addressed the problem of recognizing Sudoku puzzles from newspaper pictures taken with digital cameras such as the ones embedded in our smartphones [3]. The analysis of documents and images with text and digits continues to be an active research topic [4,5,6,7,8]. ...
... Restricted Boltzmann Machines (RBM) [16] have been extensively used to extract features from data sets [17]. Once stacked into Deep Belief Networks (DBN), they are able to extract multi-layer features from images [18], [19]. Convolutional RBMs have proven especially successful on images [18], [20]. ...
... While Restricted Boltzmann Machines (RBM) have originally been used to initialize the weights of a neural network in an unsupervised manner [8], they also have been extensively used to extract features from a dataset [9]. RBMs can also be stacked into Deep Belief Networks (DBN) to extract multi-layer features [12,24]. Convolutional RBMs have proved especially successful to extract features from images [12,25]. ...
Conference Paper
To spot keywords on handwritten documents, we present a hybrid keyword spotting system, based on features extracted with Convolutional Deep Belief Networks and using Dynamic Time Warping for word scoring. Features are learned from word images, in an unsupervised manner, using a sliding window to extract horizontal patches. For two single writer historical data sets, it is shown that the proposed learned feature extractor outperforms two standard sets of features.
The motivation behind the paper is to give a single-shot solution of the Sudoku puzzle using computer vision. This study’s purpose is twofold. The first objective is to recognise the puzzle using a deep belief network, which is very useful for extracting high-level features, and the second is to solve the puzzle using a parallel rule-based technique and an efficient ant colony optimization method. Each of the two methods can solve this NP-complete puzzle, but on their own they lack efficiency, so we serialised the two techniques to solve any puzzle efficiently, in less time and with fewer iterations.
Conference Paper
In this paper, we propose the digital detection and decryption of a sudoku puzzle using vision-based techniques and the subsequent solving of the puzzle using three algorithms: Backtracking, Simulated Annealing and Genetic Algorithm. The proposed method can recognize any sudoku puzzle captured from a digital camera. After appropriate pre-processing, which includes adaptive thresholding, Hough Transform and geometric transformation, the digits are recognized using Optical Character Recognition (OCR) and, based on their pixel locations in the image, stored in the corresponding locations of a 9×9 matrix. The detected puzzles of varying complexity levels are then solved using the three algorithms and the results are compared and contrasted, indicating the relative efficiencies of the three techniques in accurately solving Sudoku puzzles. Simulated Annealing performed the best amongst the three algorithms, whereas the Genetic Algorithm performed the worst in the comparison.
Conference Paper
We present the evaluation of a product identification task using the LIRe system and SURF (Speeded-Up Robust Features) for content-based image retrieval (CBIR). The evaluation is performed on the Fribourg Product Image Database (FPID) that contains more than 3’000 pictures of consumer products taken using mobile phone cameras in realistic conditions. Using the evaluation protocol proposed with FPID, we explore the performance of different preprocessing and feature extraction. We observe that by using SURF, we can improve significantly the performance on this task. Image resizing and Lucene indexing are used in order to speed up CBIR task with SURF. We also show the benefit of using simple preprocessing of the images such as a proportional cropping of the images. The experiments demonstrate the effectiveness of the proposed method for the product identification task.
This paper describes a computational approach to edge detection. The success of the approach depends on the definition of a comprehensive set of goals for the computation of edge points. These goals must be precise enough to delimit the desired behavior of the detector while making minimal assumptions about the form of the solution. We define detection and localization criteria for a class of edges, and present mathematical forms for these criteria as functionals on the operator impulse response. A third criterion is then added to ensure that the detector has only one response to a single edge. We use the criteria in numerical optimization to derive detectors for several common image features, including step edges. On specializing the analysis to step edges, we find that there is a natural uncertainty principle between detection and localization performance, which are the two main goals. With this principle we derive a single operator shape which is optimal at any scale. The optimal detector has a simple approximate implementation in which edges are marked at maxima in gradient magnitude of a Gaussian-smoothed image. We extend this simple detector using operators of several widths to cope with different signal-to-noise ratios in the image. We present a general method, called feature synthesis, for the fine-to-coarse integration of information from operators at different scales. Finally we show that step edge detector performance improves considerably as the operator point spread function is extended along the edge.
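The "simple approximate implementation" mentioned in this abstract can be illustrated in one dimension. The sketch below is a simplification under stated assumptions and is not the detector from the paper: it works on a 1D signal only, uses a forward difference as the gradient, replicates border samples, and applies a single fixed threshold instead of multiple operator widths or feature synthesis. The names `gaussian_kernel`, `smooth`, `detect_edges`, `sigma` and `threshold` are illustrative.

```python
import math


def gaussian_kernel(sigma):
    """Normalized Gaussian weights covering about three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve the signal with a Gaussian, replicating the border samples."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def detect_edges(signal, sigma=1.0, threshold=0.1):
    """Mark local maxima of the gradient magnitude of the smoothed signal.

    An index i in the result means an edge between samples i and i + 1.
    """
    s = smooth(signal, sigma)
    mag = [abs(s[i + 1] - s[i]) for i in range(len(s) - 1)]
    edges = []
    for i in range(1, len(mag) - 1):
        if mag[i] >= threshold and mag[i] > mag[i - 1] and mag[i] >= mag[i + 1]:
            edges.append(i)
    return edges


# A single step between samples 9 and 10.
print(detect_edges([0.0] * 10 + [1.0] * 10))
```

Keeping only local maxima, rather than every sample above the threshold, reflects the abstract's third criterion that a single edge should produce a single response.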
Conference Paper
In this paper we propose a method of detecting and recognizing the elements of a Sudoku Puzzle and providing a digital copy of the solution for it using MATLAB. The method involves a vision-based sudoku solver. The solver is capable of solving a sudoku directly from an image captured from any digital camera. After applying appropriate pre-processing to the acquired image we use efficient area calculation techniques to recognize the enclosing box of the puzzle. A virtual grid is then created to identify the digit positions. Template matching is used as a method for digit recognition. The actual solution is computed using a backtracking algorithm. Experiments conducted on various types of sudoku questions demonstrate the efficiency and robustness of our proposed approaches in real-world scenarios. The algorithm is found to be capable of handling cases of translation, perspective, illumination gradient, scaling, and background clutter.
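The abstract states that the solution is computed with a backtracking algorithm but gives no details. The following is a minimal sketch of a generic backtracking Sudoku solver, assuming the recognized puzzle is a 9×9 list of lists with 0 marking an empty cell; it is written in Python rather than MATLAB, and the names `can_place` and `solve` are illustrative, not taken from the paper.

```python
def can_place(grid, r, c, d):
    """Check that digit d does not already appear in row r, column c or the box."""
    if any(grid[r][j] == d for j in range(9)):
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[br + i][bc + j] != d for i in range(3) for j in range(3))


def solve(grid):
    """Fill the grid in place; return True if a complete solution was found."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if can_place(grid, r, c, d):
                        grid[r][c] = d          # tentative choice
                        if solve(grid):
                            return True
                        grid[r][c] = 0          # undo and try the next digit
                return False                    # no digit fits: backtrack
    return True                                 # no empty cell left


# Illustrative puzzle: a pattern grid with every third column blanked
# (an assumption for the example, not a puzzle from the paper).
full = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [[0 if c % 3 == 0 else full[r][c] for c in range(9)] for r in range(9)]
if solve(puzzle):
    for row in puzzle:
        print(row)
```

The solver takes the first empty cell in reading order. Choosing the most constrained cell first is a common refinement, but it is not needed to illustrate the principle.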
In order to highlight the interesting problems and actual results on the state of the art in optical character recognition (OCR), this paper describes and compares preprocessing, feature extraction and postprocessing techniques for commercial reading machines. Problems related to handwritten and printed character recognition are pointed out, and the functions and operations of the major components of an OCR system are described. Historical background on the development of character recognition is briefly given and the working of an optical scanner is explained. The specifications of several recognition systems that are commercially available are reported and compared.
A quadratically convergent gradient method for locating an unconstrained local minimum of a function of several variables is described. Particular advantages are its simplicity and its modest demands on storage, space for only three vectors being required. An ALGOL procedure is presented, and the paper includes a discussion of results obtained by its use on various test functions.
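The low storage requirement mentioned in this abstract is visible in a conjugate gradient iteration, which keeps only the current point, the gradient and the search direction. The sketch below is a hedged illustration of a Fletcher-Reeves style update and is not the ALGOL procedure from the paper. The backtracking (Armijo) line search, the restart every `n` iterations, and the names `fletcher_reeves`, `max_iter` and `tol` are assumptions made for the example.

```python
def fletcher_reeves(f, grad, x0, max_iter=2000, tol=1e-8):
    """Minimize f from x0 using conjugate directions with a Fletcher-Reeves beta."""
    x = list(x0)
    n = len(x)
    g = grad(x)
    d = [-gi for gi in g]                       # start along steepest descent
    for k in range(max_iter):
        gg = sum(gi * gi for gi in g)
        if gg ** 0.5 < tol:
            break
        slope = sum(gi * di for gi, di in zip(g, d))
        if k % n == 0 or slope >= 0:            # assumed safeguard: restart
            d = [-gi for gi in g]
            slope = -gg
        # Assumed line search: halve the step until sufficient decrease holds.
        t, fx = 1.0, f(x)
        while t > 1e-16 and f([xi + t * di for xi, di in zip(x, d)]) > fx + 0.25 * t * slope:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x)
        beta = sum(gi * gi for gi in g_new) / gg  # Fletcher-Reeves ratio
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x


# Illustrative test function with its minimum at (1, -2).
f = lambda p: (p[0] - 1) ** 2 + 10 * (p[1] + 2) ** 2
grad = lambda p: [2 * (p[0] - 1), 20 * (p[1] + 2)]
print(fletcher_reeves(f, grad, [0.0, 0.0]))
```

With an exact line search the method terminates on a quadratic in at most `n` steps. The inexact search used here gives up that property in exchange for a shorter example.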
The increasing availability of high-performance, low-priced, portable digital imaging devices has created a tremendous opportunity for supplementing traditional scanning for document image acquisition. Digital cameras attached to cellular phones, PDAs, or wearable computers, and standalone image or video devices are highly mobile and easy to use; they can capture images of thick books, historical manuscripts too fragile to touch, and text in scenes, making them much more versatile than desktop scanners. Should robust solutions to the analysis of documents captured with such devices become available, there will clearly be a demand in many domains. Traditional scanner-based document analysis techniques provide us with a good reference and starting point, but they cannot be used directly on camera-captured images. Camera-captured images can suffer from low resolution, blur, and perspective distortion, as well as complex layout and interaction of the content and background. In this paper we present a survey of application domains, technical challenges, and solutions for the analysis of documents captured by digital cameras. We begin by describing typical imaging devices and the imaging process. We discuss document analysis from a single camera-captured image as well as multiple frames and highlight some sample applications under development and feasible ideas for future development.