A Hand-Drawn Barcode
Daniel Klöck
Dipl.-Inf., BTU Cottbus (2009)
Submitted to the Department of Computer Science
in partial fulfillment of the requirements for the degree of
Master in Artificial Intelligence and Deep Learning
at the
October 2020
©Daniel Klöck, MMXX. All rights reserved.
The author hereby grants to UAH permission to reproduce and to
distribute publicly paper and electronic copies of this thesis document
in whole or in part in any medium now known or hereafter created.
Department of Computer Science
October 18, 2020
Accepted by.........................................................
José Ignacio Olmeda Martos
Chairman, Department Committee on Graduate Theses
I hereby confirm that this thesis was written independently by myself without the
use of any sources beyond those cited, and all passages and ideas taken from other
sources are cited accordingly.
A Hand-Drawn Barcode
Daniel Klöck
Submitted to the Department of Computer Science
on October 18, 2020, in partial fulfillment of the
requirements for the degree of
Master in Artificial Intelligence and Deep Learning
By studying how characters from different alphabets are written, an adequate set of
substructures that can be drawn swiftly and effortlessly is identified. A way to
compose a hand-drawn barcode from these substructures is then presented, optimizing
information density to increase the amount of contained data while remaining easy
and fast to draw. A recognition procedure is defined, and different models for barcode
detection and substructure classification are presented and evaluated. Possible value
encoding error sources are examined, and the recognition procedure is reviewed and
tested for its accuracy. Finally, the probability of a successful recognition is studied
and improved by choosing a suitable forward error correction method.
I am extremely grateful to my beloved wife Aleksandra Kucharczuk-Klöck for her
care and support.
My sincere thanks also go to all members of the UAH’s Master in Artificial
Intelligence and Deep Learning course of 2019-2020 for all the help and discussions.
Especially to Ming Lei, Fabrice Aubert, Gianni Santinelli, Genís Virgili Sánchez,
Micheline Pollock, Irene van den Broek, Jesús Chávez and Brian Naranja.
Finally, I thank everyone that helped generate the hand-drawn barcode dataset.
Especially, Sandie Klöck, Maja Kucharczuk and Jorge Gangoso Klöck.
Contents

1 Introduction
2 Identifying Symbols and Structure
  2.1 Defining Drawing Complexity
    2.1.1 Simplicity by Similarity
    2.1.2 Simplicity by Speed
  2.2 Exploring Symbols
  2.3 A Barcode Proposal
3 Detecting the Barcode and Extracting its Value
  3.1 Detecting the Barcode and its Parts
    3.1.1 Faster R-CNN ResNet50 V1 640x640
    3.1.2 CenterNet HourGlass104 512x512
    3.1.3 EfficientDet D2 768x768
    3.1.4 Evaluation
  3.2 Calculate the Rotation and Extract the Bars
  3.3 Classifying the Bars
4 Value Encoding and Decoding
  4.1 Bit Order
  4.2 Error Sources
  4.3 Error Detection and Correction
    4.3.1 Error Detection and Correction using Linear Codes
    4.3.2 Evaluation of Error Correction by Exploiting Model Confidence
5 Future Lines of Research
A Additional Information
List of Figures

2-1 Omniglot characters with their Omniglot ids sorted by median time spent writing.
2-2 Median speed compared to median number of strokes of characters.
2-3 All possible single substructure bars and their value while representing a decimal number between 0 and 2^20 − 1.
2-4 A bar that contains all possible substructures.
3-1 Faster R-CNN structure. Image from [1].
3-2 Example results of the barcode and parts detection model using “Faster R-CNN ResNet50 V1 640x640” on validation images.
3-3 Architecture of CenterNet. Image from [2].
3-4 Architecture of EfficientDet. Image from [3].
3-5 Result examples of extracted and rotated bars.
3-6 Bar with flipped counterparts for data augmentation.
3-7 Image of a bar with a corrected substructure.
4-1 Order of the bits in a bar.
List of Tables

3.1 Evaluation of experimented barcode detection models.
3.2 Reached accuracy with lowest validation loss on substructure classification models.
3.3 Confusion values of classified substructures using EfficientNetB2.
4.1 Error rates of popular 1D-barcodes (with 95% confidence) [4].
4.2 Linear code choice recommendations for different numbers of bars.
Chapter 1

Introduction
Many researchers have worked on improving the recognition accuracy of mathematical
symbols [5], digits [6, 7], sketches [8], text [9], patterns [10] and other symbols [11,
12]. However, little research has been published to determine what impact different
patterns or sets of symbols could have on the recognition difficulty. This makes the
task of sketching the right set of symbols for later detection very complicated, more
so if the person does not have in-depth knowledge of the system that will be used
for recognition. Currently, if hand-drawn symbols need to be recognized, there is no
guide as to which symbols to use.
In Section 2.1 the identification of symbols with low drawing complexity will be
made possible by declaring what it means for a symbol to be easy to draw. Then,
in Section 2.2, the writing style used for characters of the Omniglot dataset [13] will
be examined. Subsequently, by using the definitions from Section 2.1, symbols that
are quick and easy to draw will be detected. In Section 2.3, those symbols will be
organized in a structure, creating a novel barcode that can be hand-drawn. That
structure will optimize information density and minimize hand-drawing complexity.
In Chapter 3, a recognition procedure for this kind of barcode will be presented.
This procedure will need an object detection model that can detect the barcode and its
parts. That model will be chosen, trained and evaluated in Section 3.1. The procedure
will also require a multi-label bar classification model, which will be selected from a
list of modern classification models and evaluated in Section 3.3.
In Chapter 4, encoding, decoding, sources of error as well as expected accuracies
will be discussed. Section 4.3 will recommend adequate forward error correction as
well as error detection mechanisms to improve the accuracy by using redundancy bits.
Note that Appendix A contains information about how to acquire source code,
configuration files and images related to this thesis.
Chapter 2
Identifying Symbols and Structure
2.1 Defining Drawing Complexity
As a first step in the search for a hand-drawn barcode, drawing complexity was
explored to make sure that every part of the barcode can be easily and accurately
drawn by anyone. Unfortunately, to my knowledge, there is no published research
about finding or comparing the complexity of drawing symbols. To overcome this
lack of previous work, two axioms will be formulated that will make it possible to
discern symbols that are easy to draw from those that are more complex.
2.1.1 Simplicity by Similarity
Axiom 1 (Axiom of simplicity by similarity). A symbol is easier to draw if it is
usually replicated with more accuracy.

Axiom 1 means that if the same symbol is drawn several times by one or multiple
persons, the similarity of the resulting images will be higher with an easy symbol than
with a complex symbol. To use this approach, an image similarity measure must be
selected. There are several possibilities, but the assumptions taken by the techniques
are critical and may lead to erroneous results when calculating drawing complexity.
For example, we cannot assume that we are analysing variations of the same image.
Some options for defining the image similarity would be the Image Euclidean Distance
(IMED) [14], the Structural Similarity Index (SSIM) [15] or the Modified Hausdorff
Distance [16]. Another option would be using the entropy of the image [17], which
may be based on histogram values [18]. However, in this case, pixel positions may
play an important role and should not be ignored.
2.1.2 Simplicity by Speed
The second method is based on the following premise:

Axiom 2 (Axiom of simplicity by speed). A symbol is easier to draw if it is usually
drawn in a shorter time.
Using this method not only gives more reliable results, since no other assumptions
or definitions are needed, but also yields symbols that reduce the time needed to
draw the barcode, which is itself a desirable property. Due to its simplicity and
direct relation to drawing speed, Axiom 2 was used to search for a set of symbols
that would be easy enough for everyone.
2.2 Exploring Symbols
To explore the writing speed of symbols from different alphabets, the Omniglot
dataset [13] was used. Its Github page [19] describes its content as follows:
“It contains 1623 different handwritten characters from 50 different alphabets. Each
of the 1623 characters was drawn online via Amazon’s Mechanical Turk by 20 differ-
ent people. Each image is paired with stroke data, a sequence of [x, y, t] coordinates
with time (t) in milliseconds.”
By sorting the characters of all alphabets by median writing speed, the symbols
that are easiest and fastest to draw could be found. Note that, taking into consideration
Axiom 1, these will also be the simplest symbols to replicate accurately, thus
increasing the probability of a correct recognition.
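This ranking can be sketched in a few lines of Python. The nested-list stroke layout below mirrors the Omniglot [x, y, t] stroke format; taking the drawing time as the last minus the first timestamp is an assumption of this sketch (pauses between strokes are counted as drawing time).

```python
def drawing_time_ms(strokes):
    """Total pen time for one drawing: last timestamp minus first.

    `strokes` is a list of strokes, each a list of [x, y, t] points
    with t in milliseconds (the Omniglot stroke format).
    """
    timestamps = [t for stroke in strokes for (_, _, t) in stroke]
    return max(timestamps) - min(timestamps)

def median(values):
    """Median of a list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def rank_by_median_time(drawings_per_symbol):
    """Sort symbol ids by the median drawing time over their samples,
    fastest first."""
    medians = {sym: median([drawing_time_ms(d) for d in drawings])
               for sym, drawings in drawings_per_symbol.items()}
    return sorted(medians, key=medians.get)
```

Applied to the full dataset, the head of the returned ranking corresponds to the symbols shown in Figure 2-1.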
In Figure 2-1, the 100 symbols that were drawn the fastest are shown. As expected,
the top symbols are geometric primitives, i.e. the dot, the line (with different rotations),
different curves and the circle, followed by shapes that could be seen as combinations
of those primitives, such as two lines, ‘:’ (two dots) and ‘!’ (line and dot).
Figure 2-1: Omniglot characters with their Omniglot ids sorted by median time spent writing.
If we also take into account that we tend to misperceive curvature, direction
and length due to the nature of our eye movements [20], curves and structures where
direction and length are important (such as would be the case when drawing a barcode
based on modules similar to code128-based symbology [21]) should be discarded as candidates.
By comparing the median speed to the median number of strokes of the characters
(see Figure 2-2), a clear tendency can be observed that shows that the more strokes
a symbol needs, the more time it will take to draw it.
Knowing these findings, it seems that a set of symbols that are easy and fast to
draw should consist of symbols based on circles, dots and lines with as few occurrences
of them as possible.
Figure 2-2: Median speed compared to median number of strokes of characters.

Further, a symbol should consist of substructures, since we can increase the number
of objects it can represent exponentially by additional strokes instead of linearly
by finding a new symbol. For example, a symbol that may or may not contain 15
strokes can represent 2^15 = 32,768 different objects; finding that amount of symbols
that are easy and fast to draw would be much more complicated.
2.3 A Barcode Proposal
I propose a hand-drawn barcode whose primary goals are the optimization of drawing
speed, accuracy and information density. The barcode will be drawn on a straight
horizontal line starting and ending with upwards facing lines. This will be enough to
understand the direction of the barcode, since it will always be read left to right. It
will contain a number of vertical lines (that will be called “bars”) that serve as the
ground structure for sketching the symbols that represent the data. Each bar is made
of maximally 10 additional lines.
Each bar may or may not contain any subset of 20 different substructures, which
amounts to 20 bits per bar. This means that a single bar can represent 2^20 =
1,048,576 different objects when no error correction or detection is included. In
Figure 2-3 you can see these 20 possible substructures with their values when the
barcode is used to represent a decimal number between 0 and 1,048,575.
Figure 2-3: All possible single substructure bars and their value while representing a
decimal number between 0 and 2^20 − 1.
These substructures are combined to create new values; for example, Figure 2-4
shows the symbol when all substructures are present. Note how lines on opposite
sides of the bar can be drawn with a single stroke, resulting in fewer needed geometric
primitives. Further, since substructure lines are either at one end, touching the
horizontal line, or at the middle of the vertical lines, misperceived length should not
be an issue.
When more than the number of objects that a single bar can represent is needed,
more bars can be attached, resulting in a barcode that can represent up to 2^(bars·20)
different objects. Note that a dynamic barcode that can contain between 1 and n_b
bars can represent up to Σ_{x=1}^{n_b} 2^(x·20) different objects if you distinguish
between numbers with a different number of leading zeroes, i.e. a barcode that contains
a specific value with one bar is different from a barcode that contains the same value
with 2 bars.
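The two capacity formulas above can be sketched directly; the function and constant names below are illustrative, not from the thesis code:

```python
BITS_PER_BAR = 20

def fixed_capacity(bars):
    """Number of distinct values a barcode with exactly `bars` bars
    can encode: 2^(bars * 20)."""
    return 2 ** (bars * BITS_PER_BAR)

def dynamic_capacity(max_bars):
    """Capacity when 1..max_bars bars are allowed and leading zeroes
    distinguish codes: the sum over x of 2^(x * 20)."""
    return sum(2 ** (x * BITS_PER_BAR) for x in range(1, max_bars + 1))
```

For instance, a single bar gives 2^20 = 1,048,576 objects, matching the figure quoted above.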
Tests with grid structures (2-dimensional bar positioning) have also been made,
but were discarded because the user had to draw the grid with the correct size before
adding the substructures, while with this structure, line sizes can be corrected a
posteriori.
Figure 2-4: A bar that contains all possible substructures.

Using dot substructures was also discarded due to user feedback: they were perceived
as difficult to draw, since the proper position was not understood. As [22] shows, we
read connected structures faster than unconnected ones, which may suggest that, for
us humans, it is harder to recognize the relation between unconnected structures,
thus rendering it more complicated to find the correct location for dots within a
symbol. Another disadvantage of dots is that it is not possible to draw several of
them with one stroke, which is a line feature that this barcode exploits.
Chapter 3
Detecting the Barcode and
Extracting its Value
To be able to train an object detection model, a dataset of hand-drawn barcodes
with sufficient samples had to be generated first. To collect the data, a “Hand-drawn
Barcode User Study” webpage [23] was implemented, using Python [24], Flask [25],
Skeleton [26] and Cloudinary [27] as main technologies and it was hosted on Heroku
[28]. On this webpage, any user willing to help with the data generation would get a
random barcode displayed, which had to be copied with pen and paper.
As a result, between August 27th and 31st, 149 hand-drawn barcodes were uploaded
from 23 unique IPs. After removing images where the full barcode was not visible,
144 images remained, containing exactly one barcode each. The website also named
the uploaded file with the code that the depicted bars describe (e.g. a barcode that
contains the numbers 5023 and 101 in 2 bars would be named “5023_101”) to make
the classification of the bars easier. Finally, all images were manually annotated with
bounding boxes for contained barcode start symbol, barcode end symbol, bars and
complete barcodes using “RectLabel for object detection” [29].
Once a barcode has been generated and hand-drawn by a user, the next task
is to recognize its encoded value. Even if a one-stage architecture could probably
yield good results if more samples were available, a multi-stage architecture was
chosen because it made it possible to use sample augmentation without distortion
and to give more homogeneous input images to the bar classifier. The chosen
architecture would need to implement the following execution steps:
1. Predict the bounding box of the full barcode, the single bars, as well as the
starting and ending symbols.
2. Calculate the rotation of the barcode by analyzing the relative position of the
starting and ending symbols.
3. Extract the bars sorted by distance to the starting symbol and rotate them to
have a vertical position where the starting symbol would be to the left.
4. Classify the bars with a multi-label system where each of the bar’s bits corresponds
to one class.
5. Calculate the value represented by the barcode using the decoding function.
The sections of this chapter will describe and evaluate the models and algorithms
that execute steps 1 through 4, as well as the methods followed to find the right
candidates. The next chapter will propose an encoding and decoding technique.
3.1 Detecting the Barcode and its Parts
As seen in the introduction of this chapter, the first step to determine the value that
is encoded in a barcode is to detect the barcode and its parts. To make the prediction
as accurate as possible, 3 object detection models were trained and evaluated using
the annotated barcode images to predict the position of the full barcode, the start of
the barcode symbol, the end of the barcode symbol and the bars. The chosen models
were “Faster R-CNN ResNet50 V1 640x640”, “CenterNet HourGlass104 512x512”
and “EfficientDet D2 768x768” from the Tensorflow 2 Object Detection Model Zoo
[30]. All models were pre-trained on the COCO dataset [31].
The configurations of these models were changed to search for 4 label classes, the
TPU capability was deactivated and the data augmentation options were set to ‘ran-
dom_rgb_to_gray’, ‘random_adjust_brightness’, ‘random_adjust_contrast’, ‘ran-
dom_adjust_hue’, ‘random_adjust_saturation’ and ‘random_distort_color’. Finally,
the ‘fine_tune_checkpoint_type’ property was set to ‘detection’ (used when loading
models pre-trained on other detection tasks) for all models except “CenterNet Hour-
Glass104 512x512”, which needs the special setting ‘fine_tune’ (used when loading
the entire CenterNet feature extractor pre-trained on other tasks).
All models were locally trained on a GeForce GTX 1070 (Compute Capability
6.1) using the Tensorflow 2 Object Detection API with CUDA 10.1 for Windows 10
(cuda_10.1.243_426.00_win10), Protoc 3.13.0-win64 and CuDNN 7.6.5. The evalu-
ation was executed in parallel using the CPU.
3.1.1 Faster R-CNN ResNet50 V1 640x640
The “Faster R-CNN ResNet50 V1 640x640” model is based on the popular Faster
R-CNN [1] architecture, which is an object detection system composed of a deep fully
convolutional network that proposes regions and a Fast R-CNN detector [32], which
in turn is based on R-CNN [33].

Figure 3-1: Faster R-CNN structure. Image from [1].

The Fast R-CNN architecture produces a convolutional feature map by processing the
input image with convolutional and max-pooling layers. The
region proposal network uses the feature maps to generate
an output of rectangular object proposals with an objectness
score. For each region proposal, a feature vector is extracted
from each feature map by a region of interest pooling layer.
These feature vectors are then passed to fully connected lay-
ers to produce two outputs, the position of the bounding box
and the softmax probability that the region of interest con-
tains one of the target classes. This concrete implementation
of the model uses an input image with a resolution of 640 by
640 RGB pixels. As the backbone, a 50 layer deep residual network [34] has been used
as opposed to the original ZF-NET [35] and VGG-NET [36] (which used to be pre-
trained on ImageNet [37]) [38]. The advantage of ResNet over VGG is that it is deeper,
which means that it has more capacity to learn what is needed. Further, ResNet uses
residual connections and batch normalization, which had not been invented when VGG
was first released. According to the Tensorflow documentation, this model reached a
COCO mAP¹ of 29.3 and a mean inference time of 53 milliseconds.

¹ Mean Average Precision, an accuracy metric for object detectors.

Figure 3-2: Example results of the barcode and parts detection model using “Faster
R-CNN ResNet50 V1 640x640” on validation images.

3.1.2 CenterNet HourGlass104 512x512

One property that distinguishes Faster R-CNN from CenterNet [2] is that the latter
is a one-stage object detector. Another difference is that it detects each object as a
triplet instead of a pair, using a keypoint estimator to find center points and regress
to all other object properties. The center point based approach is said to be end-to-end
differentiable, simpler, faster, and more accurate than corresponding bounding box
based detectors [39]. The idea behind this model is that if a predicted bounding box
has a high IoU² with the ground truth, it is also highly probable for the center point
to be in the center region of the bounding box, and vice versa, which enables a more
efficient way of searching for objects by using their center points.
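For reference, the IoU metric mentioned here can be sketched for axis-aligned boxes; the (x1, y1, x2, y2) corner representation is an assumption of this sketch, not a detail from the thesis code:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2):
    area of overlap divided by area of union."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union else 0.0
```

Identical boxes give 1.0, disjoint boxes 0.0; the mAP thresholds reported later (e.g. @.50IoU, @.75IoU) are cut-offs on this value.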
As seen in Figure 3-3, the main part of this architecture consists of two modules
named cascade corner pooling and center pooling, which play the roles of enriching
information collected by the top-left and bottom-right corners and providing more
recognizable information at the central regions.
Figure 3-3: Architecture of CenterNet. Image from [2].
As the backbone, HourGlass-104 was used, which yielded the best keypoint es-
timation performance in the evaluation done in [39]. This specific implementation
uses an input image with a resolution of 512 by 512 RGB pixels. According to the
Tensorflow documentation, this model reached a COCO mAP of 41.9 and a mean
time of running inference of 70 milliseconds.
3.1.3 EfficientDet D2 768x768

EfficientDet D2 employs an EfficientNet [40] B2 network as the backbone, a Weighted
Bi-directional Feature Pyramid Network (also known as BiFPN) with 112 channels
and 5 layers, 3 box/class layers and an expected input size of 768 by 768 RGB
pixels. EfficientDet’s BiFPN incorporates multi-level feature fusion, allowing data to
flow in both directions, top-down and bottom-up, while using regular and efficient
connections. EfficientNet is a convolutional network, developed by Google’s Brain
team, that seeks to optimize downstream performance given free range over depth,
width and resolution while staying within the constraints of target memory and target
FLOPs [41].

According to the Tensorflow documentation, this model reached a COCO mAP
of 41.8 and a mean inference time of 67 milliseconds.

² Intersection over Union, an evaluation metric used to measure the accuracy of an object
detector, calculated as the area of overlap divided by the area of union of the predicted and the
ground truth bounding boxes.
Figure 3-4: Architecture of EfficientDet. Image from [3].
3.1.4 Evaluation
After training the models until a clear overfitting pattern emerged, the step with
the lowest total loss on the evaluation dataset was taken, since it should generalize
better than any other. In Table 3.1 it can be seen that, on that step, the Faster
R-CNN model reached the best mean Average Precision for all explored IoU thresholds.
For the first version of a barcode detector, accuracy seemed more important than
speed and thus, speed was ignored as long as it was within the limit of usable software,
which was the case for all of these models.
Model Total Loss mAP@.50:.95IoU mAP@.50IoU mAP@.75IoU
Faster R-CNN 0.5845 0.68 0.98 0.7619
CenterNet 1.714 0.573 0.9504 0.6086
EfficientDet 0.3475 0.631 0.9681 0.7066
Table 3.1: Evaluation of experimented barcode detection models.
Note that the total losses cannot be compared between models, since the loss
functions are different from each other and thus, have different meanings.
Some models that were part of the initial proposal of this thesis did not become
candidates for different reasons: e.g. SSD models use fixed aspect ratios, which would
probably not work very well with barcodes that have a dynamic width but a rather
static height, and YOLO trades accuracy for speed.
3.2 Calculate the Rotation and Extract the Bars
After the model detects the barcode and its parts, the angle of the barcode is calcu-
lated as:
angle = atan2(y_e − y_s, x_e − x_s)

where:
y_e = the y coordinate of the center of the ending symbol’s bounding box,
y_s = the y coordinate of the center of the starting symbol’s bounding box,
x_e = the x coordinate of the center of the ending symbol’s bounding box,
x_s = the x coordinate of the center of the starting symbol’s bounding box.
Or in other words, the angle of the barcode is calculated as the angle between
the starting and ending symbol center points. Note that an angle of 0 corresponds
to a perfectly aligned image where no rotation is needed, which means that the start
symbol can be found to the left of the barcode and the ending symbol to the right at
the same height.
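This calculation can be sketched as follows; the (x1, y1, x2, y2) box representation and the helper names are assumptions of this sketch:

```python
import math

def box_center(box):
    """Center of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def barcode_angle(start_box, end_box):
    """Angle (in radians) from the start symbol's center to the end
    symbol's center; 0 means the start symbol already lies to the left
    of the end symbol at the same height, so no rotation is needed."""
    sx, sy = box_center(start_box)
    ex, ey = box_center(end_box)
    return math.atan2(ey - sy, ex - sx)
```

Each extracted bar image is then rotated by the negated angle to bring the barcode into the canonical left-to-right orientation.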
Then the bars are extracted, meaning that new images are created for the bounding
box areas that correspond to bars of the input image. Finally, the extracted images
are rotated by the barcode’s angle and ordered by distance to the starting symbol.
This will ensure that the classification model receives homogeneous images, namely
bars with the same rotation and only the part of the image that is needed for the
classification, which should increase the accuracy. Additionally, the images can be
flipped vertically, horizontally and both to produce new bar images.

Figure 3-5: Result examples of extracted and rotated bars.
These new images would contain different bits than the original image and can be
used for training or evaluation. However, the original and the augmented images were
kept together either in the training or in the validation dataset to reduce correlation
between these datasets.
Figure 3-6: Bar with flipped counterparts for data augmentation. (a) Extracted bar.
(b) Horizontally flipped, representing 861200. (c) Vertically flipped, representing
16775. (d) Horizontally and vertically flipped, representing 33355.
3.3 Classifying the Bars
Once the bars are ordered and rotated, they can be classified to find which bits are
active. The basic idea is to give each substructure a class label and use a classifier to
predict which of those classes are represented in the bar image. This means that the
classifier model has to solve a multi-label classification problem.
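As an illustration, the multi-label prediction can be turned into a bar value by thresholding the classifier's 20 per-substructure sigmoid scores; the 0.5 threshold and the little-endian score ordering are assumptions of this sketch:

```python
def bar_value_from_scores(scores, threshold=0.5):
    """Convert 20 per-substructure sigmoid scores into the bar's integer
    value: bit i (little-endian) is set when score i crosses the threshold."""
    value = 0
    for bit, score in enumerate(scores):
        if score >= threshold:
            value |= 1 << bit
    return value
```

A bar whose 20 scores all exceed the threshold thus decodes to 2^20 − 1, the all-substructures bar of Figure 2-4.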
Model              Training Accuracy   Validation Accuracy
VGG16              0.9758              0.8683
VGG16*             0.9718              0.9420
VGG19              0.9663              0.8555
VGG19*             0.9719              0.9322
EfficientNetB1     0.9264              0.8842
EfficientNetB1*    0.9759              0.9528
EfficientNetB1**   0.9637              0.9255
EfficientNetB1†    0.9881              0.9615
EfficientNetB2     0.9281              0.8703
EfficientNetB2*    0.9771              0.9463
EfficientNetB2†    0.9928              0.9723
EfficientNetB3     0.9000              0.8430
ResNet50           0.9224              0.8050
ResNet50*          0.9798              0.9345
ResNet101          0.8606              0.8045
ResNet101*         0.9041              0.8805
DenseNet121        0.7657              0.7115

* Removing the last block of layers
** Removing the last 2 blocks of layers
† Training the last block of layers

Table 3.2: Reached accuracy with lowest validation loss on substructure classification
models.
For experimentation, multiple classifier models based on the pre-trained models
from tensorflow’s keras section, specifically keras.applications, were used. In all
cases, the input shape has been changed to 450 by 100 RGB to make it more similar
to the output shape of the detected bars. Additionally, the top³ has been dropped,
since a new output format is needed. To make up for the removed top, a flattening
layer and a fully connected layer with a sigmoid activation function ending in 20
output nodes (one for each substructure in a bar) have been appended.

³ The top refers to the flattening and fully connected layers stacked on top of the models.
Some of the experiments only trained the top dense layer of the model. Others,
additionally allowed training the last block of layers of the pre-trained model to train
the bigger features for barcodes. A third kind of experiments removed the last block
of layers altogether and appended the new top to the previous block of layers to make
sure that the dense network was not influenced by the bigger features of the image
dataset used to pre-train the network, since they probably would be very different.
All models were compiled using binary cross-entropy as loss function and Adam [42]
as optimizer.
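For reference, the binary cross-entropy loss used here averages a per-bit log loss over the 20 sigmoid outputs; the clipping constant in this pure-Python sketch is an implementation detail assumed for numerical stability, not a value taken from the thesis:

```python
import math

def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy over the substructure outputs.

    `targets` are 0/1 bit labels, `predictions` are sigmoid outputs in
    (0, 1); predictions are clipped to avoid log(0).
    """
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)
```

Because each of the 20 bits is penalized independently, the loss naturally fits the multi-label setting, unlike the categorical cross-entropy used for single-label classification.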
From the initial dataset of 144 images, 495 bars were automatically extracted,
augmented to 1980 bars⁴ and labelled with their corresponding value of active bits.
Those images were split into 1780 training images and 200 validation images. Ad-
ditionally, the training images have been augmented with an ImageDataGenerator
allowing to add a subset of the following transformations: channel shift, brightness
shift, shear angle, rotation, zoom.
Figure 3-7: Image of a bar with a corrected substructure.

Table 3.2 shows the reached accuracy on the training and validation images in
the step with the lowest loss on the validation data, since it should generalize best.
As shown, the pre-trained EfficientNetB2 model that allowed training the last block
of layers had the best accuracy on the validation data, reaching 97.23%.

Table 3.3 shows the confusion values of the single substructures as well as the
summed results over the 20 substructures for the 200 validation images. It can be
seen that all bits have similar confusion percentages and are behaving as expected.
Further, it could be observed that the model adapted to unforeseen user behaviour.
For example, the model learned how users fix incorrectly drawn lines, such as the one
a user corrected in the bar seen in Figure 3-7.
⁴ Using the technique described in Section 3.2.
True Positive False Positive True Negative False Negative
Bit 1 117 4 79 0
Bit 2 112 6 77 2
Bit 3 101 6 91 2
Bit 4 101 3 94 2
Bit 5 89 7 104 0
Bit 6 87 4 107 2
Bit 7 91 1 102 6
Bit 8 95 3 100 2
Bit 9 100 3 96 1
Bit 10 99 7 92 2
Bit 11 98 3 96 3
Bit 12 101 5 94 0
Bit 13 94 0 103 3
Bit 14 95 3 100 2
Bit 15 86 3 108 3
Bit 16 88 3 108 1
Bit 17 102 2 95 1
Bit 18 101 2 95 2
Bit 19 115 3 80 2
Bit 20 114 1 82 3
Summed 1986 69 1903 42
Table 3.3: Confusion values of classified substructures using EfficientNetB2.
Chapter 4
Value Encoding and Decoding
As mentioned in earlier chapters, each substructure of a bar represents a bit: if
the substructure is present in the bar, the bit is 1; if not, it is 0.

Due to the low data density compared to digital barcodes and the variable clas-
sification error, I suggest developing customized encoding and decoding mechanisms
depending on the application and the quality of the models.

In this chapter, some possible encoding, decoding and error correction techniques
will be described.
4.1 Bit Order
One definition for the bit order of the substructures can be seen in Figure 4-1.

Figure 4-1: Order of the bits in a bar.

The number next to each substructure represents the index of the bit (using
little-endian format). For example, if only substructure 19 is active, the bar
represents the binary number 0b1000 0000 0000 0000 0000. If a second bar exists
to the right, the first bar's value is shifted left by 20 places. For example, if
a second bar were added to the right of the bar from the previous example and it
had all of its bits active, the new value would be
0b1000 0000 0000 0000 0000 1111 1111 1111 1111 1111. This mechanism allows us to
convert barcodes into noisy bit streams with blocks of 20 bits. The stream is
noisy because of possible erroneous substructure classifications.
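The bit-order and shifting scheme can be sketched in code. The helper name and the set-of-active-indices input format below are my own illustration, not part of the thesis code:

```python
def bars_to_int(bars):
    """Convert a list of bars (leftmost first) into a single integer.

    Each bar is a set of active substructure indices (0-19, little-endian
    within the bar). Each additional bar to the right shifts the value of
    the bars before it 20 places toward the most significant bits.
    """
    value = 0
    for bar in bars:
        bar_value = sum(1 << bit for bit in bar)
        value = (value << 20) | bar_value
    return value

# One bar with only substructure 19 active:
assert bars_to_int([{19}]) == 0b1000_0000_0000_0000_0000

# A second bar to the right with all 20 bits active:
assert bars_to_int([{19}, set(range(20))]) == (
    0b1000_0000_0000_0000_0000_1111_1111_1111_1111_1111
)
```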
4.2 Error Sources
Since the barcode can be transformed into a noisy bit stream with blocks of 20 bits,
typical error detection and correction techniques can be used. However, we have some
additional knowledge about the channel that may help us make better decisions about
where an error may have originated and how to correct or detect it, not only through
adequate coding but also by implementing an appropriate usability flow. We can also
calculate the needed error detection and correction capabilities by studying the
error sources.
There are three sources of errors: the first is a possible human error while drawing
the barcode, the second a wrong detection of the barcode parts by the object
detection model (for example, by missing a bar), and the third a wrong classification
of a substructure.
To address human error, I would suggest adding a validation feature, that is,
letting the user run a detection on their own to make sure the barcode was drawn
correctly whenever possible. However, take into account that in some situations this
may not be possible (e.g. the user may not have a camera). Therefore, I would
recommend always adding error correction or detection capabilities based on the
application's need for accuracy. Section 4.3 will give more information on how to
construct such a mechanism.
The second possible source of error, the barcode detection model, could only be
trained with very few samples (144 barcodes) in this first version. This led to an
error rate of about 2%, even when choosing an IoU threshold of 0.5 (as seen in
Subsection 3.1.4). This error rate is much higher than that of typical digital
barcodes, which reach accuracies of 1 error in 394 thousand even in the worst-case
scenario of the simplest barcode types (see Table 4.1). However, by choosing the
right angle and position of the camera, the rate can probably be improved.
Therefore, I would suggest, at least until more training data is available, using a
fixed number of bars and only accepting a detection when all needed pieces have been
recognized.
Barcode Type Worst Case Accuracy Best Case Accuracy
Code 128 1 error in 2.8 million 1 error in 37 million
Code 39 1 error in 2.5 million 1 error in 34 million
UPC or more 1 error in 394 thousand 1 error in 800 thousand
Table 4.1: Error rates of popular 1D-barcodes (with 95% confidence) [4].
The last source of error is a wrong classification of a substructure. The
substructure classification model yielded a probability of correct classification of
0.9723, which means that the probability of error is 0.0277. This implies that the
probability of at least one error occurring in a bar is approximately 42.98%¹. This
number is far too high to be in the range of useful barcodes. Therefore, using a
coding mechanism to correct and detect errors is crucial.
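The per-bar error probability follows directly from the per-substructure accuracy. A quick check (not thesis code):

```python
p_correct = 0.9723      # per-substructure classification accuracy
bits_per_bar = 20

# Probability of at least one misclassified substructure in one bar:
p_bar_error = 1 - p_correct ** bits_per_bar   # approx. 0.4298

# With 6 bars (120 substructures), the chance of reading every
# substructure correctly drops to roughly 3.43%:
p_all_correct_6_bars = p_correct ** (6 * bits_per_bar)
```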
4.3 Error Detection and Correction
Improving the angle and position of the camera and using a fixed number of bars can
help reduce the errors originating from the object detection model. Human error and
wrong bar substructure classification can be overcome, up to a selected minimal
accuracy, by using forward error correction techniques. Two approaches will be
explored: error correction through linear codes with a minimal Hamming distance, and
error correction by flipping the least-confident bits.
The basic idea of error correcting codes is to reduce the number of accepted code
words, maximizing the distance between words. A standard distance between words
is the Hamming distance [43], which in the case of binary words is defined as the
number of bits that have to be flipped to get from one word to another. For example,
¹ The complement of the probability of no errors occurring in 20 bits: 1 − 0.9723²⁰ ≈ 0.4298
you could get from word 001 to 100 by flipping 2 bits (the first and the last). Thus,
it has a Hamming distance of 2.
A property of a set of allowed words with a minimal Hamming distance d_H (that
is, there are no two words in the set with a Hamming distance smaller than d_H) is
that d_H − 1 errors can be detected and ⌊(d_H − 1)/2⌋ errors corrected.
For example, if we only allow the recognition of the words 001 and 100, since the
minimal Hamming distance is 2, we will be able to detect if an error of
2 − 1 = 1 bit flip happened, and we would be able to fix ⌊(2 − 1)/2⌋ = 0 errors.
Imagine that the word 101 were recognized instead: we would be able to detect an
error, since it is not in the set of allowed words, but we would not be able to fix
it, since the probability of the intended word being 001 would be the same as that
of 100, the Hamming distance to both being equal. However, if we change the allowed
words to 000 and 111, the Hamming distance increases to 3 and we are able to detect
3 − 1 = 2 errors and even fix ⌊(3 − 1)/2⌋ = 1 error. If we now detected the word
101, we would know there was an error, since the word is not part of our set of
allowed words, and we could fix it by transforming it to the allowed word with the
lowest Hamming distance to the received word, i.e. 111.
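The worked example translates directly into code. This is a small illustration of Hamming distance and nearest-codeword decoding, not thesis code:

```python
def hamming_distance(a, b):
    """Number of bit positions in which two binary words differ."""
    return bin(a ^ b).count("1")

def decode_nearest(word, codebook):
    """Map a received word to the allowed code word at minimal distance."""
    return min(codebook, key=lambda c: hamming_distance(word, c))

# Codebook {001, 100}: minimal distance 2 -> detect 1 error, correct 0.
assert hamming_distance(0b001, 0b100) == 2

# Codebook {000, 111}: minimal distance 3 -> detect 2 errors, correct 1,
# so a received 101 is decoded to 111.
assert decode_nearest(0b101, [0b000, 0b111]) == 0b111
```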
4.3.1 Error Detection and Correction using Linear Codes
If such a set of allowed words is generated by multiplication of the data bits with
a generative matrix, we say that we are using a linear code. Mathematically, linear
codes can be constructed as a subspace of a vector space with any number of
elements. When generating linear codes for a binary system, Galois fields of 2
elements are usually used. These fields, usually written as GF(2), have the
properties of closure, commutativity, associativity, identity, inverse and
distributivity [44] and define the sum as the logical XOR operation and the
multiplication as the logical AND operation, which makes the implementation in
hardware efficient [45].
The allowed word set can be constructed by calculating:

    b^T = x^T · Ḡ    (4.1)

where Ḡ is the generative matrix, x the data words and b the generated allowed
code words [45]. Note that finding an optimal Ḡ for a given code length and data
length is not always trivial, but there are published collections of best known matrices
for given lengths (such as [46]). Also, [47] shows the maximum Hamming distance
that can be achieved for given code and data lengths.
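As an illustration of encoding over GF(2), the following sketch uses the classic [7,4] Hamming code (minimal distance 3). This particular Ḡ is a textbook example, not the matrix used in the thesis:

```python
# Generator matrix of the classic [7,4] Hamming code in standard form
# G = [I_k | P]: k = 4 data bits, n = 7 code bits, minimal distance 3.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(data_bits):
    """b^T = x^T · G over GF(2): products are AND, sums are XOR (mod 2)."""
    n = len(G[0])
    return [
        sum(x & row[j] for x, row in zip(data_bits, G)) % 2
        for j in range(n)
    ]

# The first k bits of the code word are the data bits themselves,
# followed by n - k parity bits:
codeword = encode([1, 0, 1, 1])
assert codeword == [1, 0, 1, 1, 0, 1, 0]
```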
Table 4.2 shows recommendations for different numbers of bars and expected
accuracy. The probability of wrong correction has been based on the binomial
probability mass function and calculated as the probability that more errors occur
than the code can correct:

    P_wrong = 1 − Σ_{x=0}^{⌊(d_H−1)/2⌋} C(n, x) · p^x · (1 − p)^(n−x),   p = 0.0277    (4.2)

where d_H is the Hamming distance, n the number of bits (20 · number of bars),
C(n, x) the binomial coefficient and p the probability of incorrect classification
of a substructure. The probability of an undetected error has been calculated
analogously, as the probability that more errors occur than the code can detect:

    P_undetected = 1 − Σ_{x=0}^{d_H−1} C(n, x) · p^x · (1 − p)^(n−x),   p = 0.0277    (4.3)
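These binomial sums can be evaluated with a short script. This is a sketch of the calculation as I read it, and it approximately reproduces the one-bar, d_H = 5 wrong-correction entry of Table 4.2:

```python
from math import comb

P_SUB_ERROR = 0.0277  # probability of incorrect substructure classification

def p_at_most(n, k, p=P_SUB_ERROR):
    """Binomial probability of at most k errors among n bits."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

def p_wrong_correction(n, d_h):
    """Probability of more errors than the code can correct: > floor((d_H-1)/2)."""
    return 1 - p_at_most(n, (d_h - 1) // 2)

def p_undetected_error(n, d_h):
    """Probability of more errors than the code can detect: > d_H - 1."""
    return 1 - p_at_most(n, d_h - 1)

# One bar (n = 20) with minimal Hamming distance 5 -> roughly the
# 1.703e-2 wrong-correction entry of Table 4.2:
one_bar = p_wrong_correction(20, 5)
```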
Note that a set of words may have different Hamming distances depending on the
region of the vector subspace where the decoded value was placed, and thus different
probabilities of correct error fixing or detection. Therefore, some of the items in
Table 4.2 show a range of Hamming distances and probabilities instead of single
values.
Obviously, increases in the accuracy of correct decoding, such as from the initial
approximate 57% to 99.8% for one bar or from 3.43% to 99.95% for six bars, come at a
price.
Instead of using all bits for data transfer, we now have to use some of them as
redundancy bits to increase the Hamming distance and be able to correct or detect
errors. The rightmost column of Table 4.2 shows how many bits we have left for
data transfer. For example, if you need 6 bars for your application and want to use
error correction to achieve a maximal decoding error of 1 out of 2000, you would have
57 bits for data transfer (you would use the row with the Hamming distance 21-28,
since the wrong correction probability is at least 5.753·10⁻⁴). This would reduce the
number of objects you can distinguish from 2^120 to 2^57.
Even if an error rate of 1 out of 2000 cannot compare to the error rates digital
barcodes can achieve, it still surpasses the accuracy of human data entry operators,
which is about 1 error per 300 keystrokes [48] (note that the value of a barcode
would usually require several keystrokes).
Number of Bars | Hamming Distance | Wrong Correction Probability | Undetected Error Probability | Data Bits
1 | 5       | 1.703·10⁻²              | 1.784·10⁻⁴                | 11
1 | 6       | 1.703·10⁻²              | 1.252·10⁻⁵                | 10
1 | 7       | 1.998·10⁻³              | 7·10⁻⁷                    | 9
2 | 7 - 8   | 2.436·10⁻²              | 1.044·10⁻⁴ - 1.205·10⁻⁵   | 24
2 | 8       | 2.436·10⁻²              | 1.205·10⁻⁵                | 23
2 | 9 - 10  | 4.786·10⁻³              | 1.204·10⁻⁶ - 1.051·10⁻⁷   | 20
3 | 9 - 11  | 2.53·10⁻² - 6.297·10⁻³  | 3.952·10⁻⁵ - 7.202·10⁻⁷   | 24
3 | 10 - 12 | 2.53·10⁻² - 6.297·10⁻³  | 5.642·10⁻⁶ - 8.276·10⁻⁸   | 23
3 | 11 - 12 | 6.297·10⁻³              | 7.201·10⁻⁷ - 8.276·10⁻⁸   | 20
4 | 11 - 15 | 2.38·10⁻² - 1.707·10⁻³  | 1.324·10⁻⁵ - 5.236·10⁻⁹   | 47
4 | 13 - 18 | 6.821·10⁻³ - 3.775·10⁻⁴ | 3.134·10⁻⁷ - 6.318·10⁻¹²  | 42
4 | 16 - 19 | 1.71·10⁻³ - 7.453·10⁻⁵  | 6.004·10⁻¹⁰ - 5.801·10⁻¹³ | 40
4 | 17 - 21 | 1.707·10⁻³ - 1.324·10⁻⁵ | 6.386·10⁻¹¹ - 3.331·10⁻¹⁶ | 36
5 | 17 - 22 | 1.892·10⁻³ - 1.079·10⁻⁴ | 2.471·10⁻⁹ - 4.741·10⁻¹⁴  | 51
5 | 19 - 24 | 4.753·10⁻⁴ - 2.232·10⁻⁵ | 3.921·10⁻¹¹ - 4.299·10⁻¹⁶ | 49
6 | 19 - 25 | 1.939·10⁻³ - 3.116·10⁻⁵ | 1.018·10⁻⁹ - 3.725·10⁻¹⁵  | 64
6 | 21 - 28 | 5.348·10⁻⁴ - 6.653·10⁻⁶ | 1.951·10⁻¹¹ - 3.585·10⁻¹⁸ | 57
7 | 21 - 30 | 1.894·10⁻³ - 9.062·10⁻⁶ | 3.862·10⁻¹⁰ - 3.041·10⁻¹⁸ | 72
7 | 23 - 32 | 5.613·10⁻⁴ - 1.981·10⁻⁶ | 8.543·10⁻¹² - 2.958·10⁻²⁰ | 69
8 | 24 - 36 | 1.792·10⁻³ - 5.903·10⁻⁷ | 2.239·10⁻¹¹ - 2.426·10⁻²² | 80
8 | 25 - 38 | 5.623·10⁻⁴ - 1.239·10⁻⁷ | 3.436·10⁻¹² - 2.122·10⁻²⁴ | 75
Table 4.2: Linear code choice recommendations for different numbers of bars.
The standard form of the generative matrix is:

    Ḡ = [I_k | P̄]    (4.4)

where I_k is the identity matrix of size k × k and P̄ is of size k × (n − k), with k
being the length of a data word and n the length of an encoded word. Equation 4.4
can be used to calculate a so-called parity check matrix H̄, which fulfils

    b^T · H̄^T = 0    (4.5)

for all b^T produced with Equation 4.1, and is different from 0 for all other words.
This also implies that Ḡ · H̄^T = 0, which means that H̄ must be of the form

    H̄ = (P̄^T | I_{n−k})    (4.6)

Note that the formula actually should be H̄ = (−P̄^T | I_{n−k}), but thanks to the
properties of the logical XOR, the minus sign can be ignored. Once we have H̄, a
recognized word b* can be tested for errors with:

    s^T = b*^T · H̄^T = (b*_1, b*_2, ..., b*_k, b*_{k+1}, b*_{k+2}, ..., b*_n) · H̄^T    (4.7)

The vector s is called a syndrome and is zero if a correct word was recognized
(due to Equation 4.5). The vector b* is composed of the word that should have been
recognized, b, plus an error vector e (which is 0 if the recognition was successful):

    b* = b + e    (4.8)

Putting everything together, we get that

    s^T = (b^T + e^T) · H̄^T = b^T · H̄^T + e^T · H̄^T = e^T · H̄^T    (4.9)

which can be used to find the source of the error and correct it [45, 49]. One simple
method would be to create a table with all possible errors and their syndromes and
check it when an erroneous word has been received. Using Equation 4.9 and knowing
the error that generates such a syndrome, all erroneous bits in the recognized word
can simply be flipped to obtain the intended word.
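The syndrome-table method can be sketched for the illustrative [7,4] Hamming code used earlier (P rows 110, 101, 011, 111); again this is a textbook example, not the code from the thesis:

```python
# Parity check matrix H = (P^T | I_{n-k}) for the [7,4] Hamming code
# whose generator has P rows 110, 101, 011, 111.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def syndrome(word):
    """s^T = b*^T · H^T over GF(2)."""
    return tuple(sum(b & h for b, h in zip(word, row)) % 2 for row in H)

# Table mapping each single-bit error pattern's syndrome to the bit index.
SYNDROME_TABLE = {
    syndrome([1 if j == i else 0 for j in range(7)]): i
    for i in range(7)
}

def correct(received):
    """Flip the erroneous bit (if any) indicated by the syndrome."""
    s = syndrome(received)
    if any(s):
        i = SYNDROME_TABLE[s]
        received = received[:i] + [1 - received[i]] + received[i + 1:]
    return received

# A valid code word has syndrome zero; a single bit flip is repaired:
assert correct([1, 0, 1, 1, 0, 1, 0]) == [1, 0, 1, 1, 0, 1, 0]
assert correct([1, 0, 0, 1, 0, 1, 0]) == [1, 0, 1, 1, 0, 1, 0]
```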
4.3.2 Evaluation of Error Correction by Exploiting Model Confidence
In some cases, it might be suitable to use the model's confidence to try to correct
errors. The idea would be to use an error detection technique, for example a linear
code, to detect the number of errors that have occurred and then flip that many
bits, selecting the ones where the substructure classification model has the lowest
confidence.
This technique could drastically increase the number of data bits that can be
used: note that a Hamming distance of n + 1 would suffice to fix n errors, instead
of 2·n + 1. However, using the validation dataset, I have found that if one error
occurs, the probability of fixing it correctly is only 61%. If a second error were
to occur, the probability of fixing them would decrease even further, reaching an
approximate probability of successful error correction of 5.5%. Therefore, with the
current substructure classification model, this technique is not advised.
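For reference, the flipping step itself is simple to sketch. The interface below is hypothetical, since the thesis does not specify one:

```python
def flip_least_confident(bits, confidences, n_errors):
    """Flip the n detected erroneous bits where the classifier was least sure.

    bits: list of 0/1 values for one or more bars.
    confidences: per-bit model confidence in [0, 1].
    n_errors: number of errors reported by the error-detecting code.
    """
    order = sorted(range(len(bits)), key=lambda i: confidences[i])
    fixed = list(bits)
    for i in order[:n_errors]:
        fixed[i] = 1 - fixed[i]
    return fixed

# The second bit has by far the lowest confidence, so it gets flipped:
assert flip_least_confident([1, 0, 1, 1], [0.99, 0.52, 0.97, 0.95], 1) == [1, 1, 1, 1]
```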
Chapter 5
Future Lines of Research
In this thesis a novel hand-drawn barcode has been presented. To develop the
proposed structure, the Omniglot dataset was explored using the simplicity-by-speed
axiom. The outcome of the exploration, together with numerical and perceptual
reasons, led to a barcode subdivided into bars of 20 substructures each. Further
research or user studies about human drawing capabilities could help create an
improved hand-drawn barcode structure with higher data density, for example by
creating alternative substructures, thus changing the base of the represented
number. Another possibility would be to build a model that predicts the complexity
of a barcode and use it to refine the current barcode structure.
In Chapter 3, a procedure to detect and classify the barcode’s substructures has
been demonstrated and evaluated. The training of the object detection and classifi-
cation models has been limited to a relatively small number of samples. Once more
training samples have been collected, an improvement of the precision of the models
is to be expected. This would lead to a reduction of the necessary redundant bits
which have been recommended in Chapter 4.
The presented hand-drawn barcode, reading mechanism and coding techniques
should not be seen as a final version, but as a first step with great improvement
potential.
Appendix A
Additional Information
A repository that contains images, source code and configuration files related to this
thesis can be found at:
[1] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks, 2015.
[2] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and
Qi Tian. Centernet: Keypoint triplets for object detection, 2019.
[3] Mingxing Tan, Ruoming Pang, and Quoc V. Le. Efficientdet: Scalable and
efficient object detection, 2019.
[4] Fritz J. and Dolores H. Russ. Executive summary: Code 16k and code
49 data integrity test.
OSU-Data-Integrity-Linear.pdf (Accessed 2020-09-26).
[5] Hongyu Wang and Guangcun Shan. Recognizing handwritten mathematical
expressions as latex sequences using a multiscale robust neural network, 2020.
[6] Adam Byerly, Tatiana Kalganova, and Ian Dear. A branching and merging
convolutional network with homogeneous filter capsules, 2020.
[7] Abdul Mueed Hafiz and Ghulam Mohiuddin Bhat. Reinforcement learning based
handwritten digit recognition with two-state q-learning, 2020.
[8] Peng Xu, Yongye Huang, Tongtong Yuan, Tao Xiang, Timothy M. Hospedales,
Yi-Zhe Song, and Liang Wang. On learning semantic representations for million-
scale free-hand sketches, 2020.
[9] Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. Bertweet: A pre-trained
language model for english tweets, 2020.
[10] Ikuro Sato, Hiroki Nishimura, and Kensuke Yokoi. Apac: Augmented pattern
classification with neural networks, 2015.
[11] Alireza Rezvanifar, Melissa Cote, and Alexandra Branzan Albu. Symbol spotting
on digital architectural floor plans using a deep learning-based framework, 2020.
[12] William Adorno III, Angela Yi, Marcel Durieux, and Donald Brown. Hand-drawn
symbol recognition of surgical flowsheet graphs with deep image segmentation,
[13] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-
level concept learning through probabilistic program induction. Science,
350(6266):1332–1338, 2015.
[14] Liwei Wang, Yan Zhang, and Jufu Feng. On the euclidean distance of images.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1334–
1339, 2005.
[15] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality
assessment: from error visibility to structural similarity. IEEE Transactions on
Image Processing, 13(4):600–612, 2004.
[16] M.-P. Dubuisson and A. K. Jain. A modified Hausdorff distance for object match-
ing. In Proceedings of 12th International Conference on Pattern Recognition,
volume 1, pages 566–568, 1994.
[17] C. E. Shannon. A mathematical theory of communication. The Bell System
Technical Journal, 27(3):379–423, 1948.
[18] Mohammed Aljanabi, Zahir Hussain, and Songfeng Lu. An entropy-histogram
approach for image similarity and face recognition. Mathematical Problems in
Engineering, 2018:1–18, 07 2018.
[19] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Omniglot
data set for one-shot learning.
(Accessed 2020-07-28).
[20] Veijo Virsu. Tendencies to eye movement, and misperception of curvature, di-
rection, and length. Perception and Psychophysics, 9:65–72, 01 1971.
[21] Barcode Island. Code 128 symbology.
code128.phtml (Accessed 2020-08-02).
[22] Deia Ganayim. Visual processing of connected and unconnected letters and words
in arabic. Cognitive Linguistic Studies, 2:205–238, 01 2015.
[23] Daniel Klöck. Hand-drawn barcode user study. https:// (Accessed 2020-09-18).
[24] (Accessed 2020-09-18).
[25] Flask: A lightweight wsgi web application framework. https:// (Accessed 2020-09-18).
[26] Skeleton: Responsive css boilerplate. (Accessed
[27] Cloudinary: Image and video upload, storage, optimization and cdn. https:
// (Accessed 2020-09-18).
[28] Heroku: Cloud application platform. (Accessed
[29] Rectlabel for object detection. (Accessed 2020-09-18).
[30] Tensorflow 2 detection model zoo.
(Accessed 2020-09-18).
[31] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B.
Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR,
abs/1405.0312, 2014.
[32] Ross Girshick. Fast r-cnn, 2015.
[33] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation, 2013.
[34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learn-
ing for image recognition, 2015.
[35] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional
networks, 2013.
[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition, 2014.
[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet:
A large-scale hierarchical image database. In 2009 IEEE conference on computer
vision and pattern recognition, pages 248–255. Ieee, 2009.
[38] Faster r-cnn: Down the rabbit hole of modern ob-
ject detection.
(Accessed 2020-09-26).
[39] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points, 2019.
[40] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for con-
volutional neural networks, 2019.
[41] A thorough breakdown of efficientdet for ob-
ject detection.
(Accessed 2020-09-26).
[42] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion, 2014.
[43] R. W. Hamming. Error detecting and error correcting codes. The Bell System
Technical Journal, 29(2):147–160, 1950.
[44] Suayb S. Arslan. Finite fields and linear codes.
classnotes2.pdf (Accessed 2020-10-12).
[45] Prof. Dr.-Ing. Gerald Oberschmidt. Grundlagen der Übertragungstechnik,
kapitel 5: Datensicherung und kodierung.
uebertragung.pdf (Accessed 2020-10-12).
[46] Markus Grassl. Searching for linear codes with large minimum distance. In Wieb
Bosma and John Cannon, editors, Discovering Mathematics with Magma — Re-
ducing the Abstract to the Concrete, volume 19 of Algorithms and Computation
in Mathematics, pages 287–313. Springer, Heidelberg, 2006.
[47] Markus Grassl. Bounds on the minimum distance of linear codes and quantum
codes. Online available at, 2007. Accessed on 2020-
[48] Barcode reading and accuracy.
reading_and_accuracy. (Accessed 2020-10-12).
[49] Richard E. Blahut. Linear Block Codes, page 49–66. Cambridge University Press,