Optical character recognition applied on receipts
printed in Macedonian Language
Martin Gjoreski2, Gorjan Zajkovski2, Aleksandar
Bogatinov2, Gjorgji Madjarov1,2, Dejan Gjorgjevikj1,2
1Department of Computer Science and Engineering
2Faculty of Computer Science and Engineering
Skopje, Macedonia
Hristijan Gjoreski
Department of Intelligent Systems, Jožef Stefan Institute
Jožef Stefan International Postgraduate School
Ljubljana, Slovenia
Abstract—The paper presents an approach to Optical
Character Recognition (OCR) applied on receipts printed in
Macedonian language. The OCR engine recognizes the
characters of the receipt and extracts some useful information,
such as: the name of the market, the names of the products
purchased, the prices of the products, the total amount of money
spent, and also the date and the time of the purchase. We used
the publicly available OCR framework Tesseract, which was
trained on pictures of receipts printed in Macedonian language.
The results showed that it can recognize the characters with 93%
accuracy. Additionally, we used another approach that uses the
original Tesseract to extract the features out of the picture and
the final classification was performed with k-nearest neighbors
classifier using dynamic time warping as a distance metric. Even
though the accuracy achieved with the modified approach was
6 percentage points lower than that of the original approach, it
serves as a proof of concept that we plan to research further in future
publications. The additional analysis of the results showed that
the accuracy is higher for the words which are prescribed for
each receipt, such as the date and the time of the purchase and
the total amount of money spent.
Keywords—OCR; receipt digitalization; Tesseract; DTW
I. INTRODUCTION AND RELATED WORK
Optical Character Recognition (OCR) is the conversion of
photographed or scanned images, which contain printed or
typewritten text, into machine-readable characters (text). The
basic idea dates back to 1929, when the first OCR patent was
obtained by Tauschek [1]. It was based on template matching
using optics and mechanics. After the first commercial
computer (UNIVAC I) was installed in 1951, the era of converting
images of text into computer-readable text began. In 1956
the first approach to convert images of text into computer-readable
text was presented [2]. At that time hardware and
software imposed strong limitations, so the OCR approaches were
based on template matching and simple algebraic operations.
Since then a lot of research has been done on OCR, and with the
advancement of technology more complex OCR
approaches have been developed. Today OCR is done in a much more
intelligent way, but it also requires more computational power,
which can be a problem for smartphone implementations.
OCR can be used in common industries and applications
including date tracking on pharmaceutical or food packaging,
sorting mail at post offices and other document handling
applications, reading serial numbers in automotive or
electronics applications, passport processing, secure document
processing (checks, financial documents, bills), postal tracking,
publishing, consumer goods packaging (batch codes, lot codes,
expiration dates), and clinical applications. Dedicated OCR readers
and software can also be used, as well as smart cameras and vision
systems with additional capabilities such as barcode reading
and product inspection.
In recent years, numerous OCR-based smartphone
applications were also introduced. A successful example
is Google's Goggles application [3], which has
more than 10 million downloads. Besides the OCR functionality,
it offers several others, such as image search, text translation and
barcode scanning. Its OCR engine can analyze text in several
languages, but not the Macedonian language.
Additionally, the implementation of the OCR engine is not on
the smartphone itself, but on a server and therefore it requires
internet connection in order to perform an OCR action.
Recently, Google made a free public API for its OCR engine
available [4], which resulted in numerous
smartphone OCR-based applications. However, these can only
be used with an internet connection and, furthermore, the API does
not provide support for the Macedonian language. Finally,
there are some examples of OCR-based applications that claim
to support the Macedonian language, e.g., Translang [5].
However, none of them supports an OCR for Cyrillic script,
which is the official script of the Macedonian language. In this
paper we present an application of OCR on receipts printed in
Macedonian language by using the open source OCR engine
called Tesseract [6].
The remainder of this paper is organized as follows. The second
section presents the methodology used for the process of OCR.
Then, in the Experimental Results section, the recognition
accuracy is presented. Finally, the conclusion and a brief
discussion about the approach and the results are given.
II. METHODOLOGY
Figure 1 shows the whole process of the OCR. First, the
user takes a photo of a receipt that he/she received from a
market. Then, the OCR engine recognizes the characters
printed on the receipt and extracts some useful
information from it, e.g., the name of the market, the
names of the products purchased by the user, the prices of the
products, the total amount of money spent, and also the date
and the time of the purchase. For the process of the OCR, the
open source OCR engine called Tesseract [6] is used.
Figure 1. The OCR process applied on a receipt printed in Macedonian
language (Cyrillic script).
A. Tesseract
Creating an OCR engine is a challenging research task and
requires great knowledge in image processing, feature
extraction and machine learning. However, there are several
open source projects that provide an OCR framework and are
widely used in the creation of OCR-related applications. In
order not to reinvent the wheel and to save development time,
in this study we decided to use a freely available OCR
framework. After studying several frameworks, we decided to
use Tesseract. Tesseract is an OCR engine that was developed
by HP between 1984 and 1994 to run in a desktop scanner,
but was never used in an HP product [7]. Since then it has
undergone many improvements. In 2005 it became open source
and has been managed by Google ever since. The last stable
version (V3.02) was released in 2012 and V3.03 is expected to
be released in 2014. Tesseract is written in C and C++, but it
also has Android and iOS wrappers, which make it useful for
smartphone applications.
1) Tesseract Architecture
The first approach tested in the process of character
recognition is the original Tesseract engine. Tesseract has a
traditional step-by-step pipeline architecture (shown in Figure
2). First, image preprocessing is done with adaptive
thresholding, which produces a binary image. Then
connected component analysis is performed to provide character
outlines. Next, techniques for character chopping and character
association are used to organize the outlines into words. In the
end, two-pass word recognition is done using clustering and
classification methods. For the final decision about the
recognized word, Tesseract consults both a language
dictionary and a user-defined dictionary. The word with the smallest
distance is provided as the output. This is just a brief overview of
the Tesseract architecture; more details can be found in the
authors' literature [7].
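The first stage of this pipeline, adaptive thresholding, can be illustrated with a minimal sketch. Note that this is a generic local-mean binarization, not Tesseract's exact implementation; the neighborhood radius, the offset, and the sample image are chosen purely for illustration.

```python
# Illustrative local-mean adaptive thresholding (not Tesseract's exact method).
# A pixel darker than the mean of its local neighborhood (minus a small
# offset) becomes foreground (1); everything else becomes background (0).

def adaptive_threshold(image, radius=1, offset=10):
    """Binarize a 2D list of grayscale values (0-255)."""
    h, w = len(image), len(image[0])
    binary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the local neighborhood, clipped at the image borders.
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mean = sum(window) / len(window)
            binary[y][x] = 1 if image[y][x] < mean - offset else 0
    return binary

# A tiny synthetic "image": a dark stroke on a light background.
img = [[200, 200, 200, 200],
       [200,  40,  40, 200],
       [200,  40,  40, 200],
       [200, 200, 200, 200]]
print(adaptive_threshold(img))  # -> the dark 2x2 block becomes foreground
```

Because the threshold is computed per pixel rather than globally, this kind of binarization copes with the uneven lighting typical of photographed receipts.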
2) Training Tesseract
For the training phase, Tesseract needs a photograph (tiff or
pdf file) of a text written in the same language as the one that it
is trying to recognize. For each character from the learning text
Tesseract extracts 4 different feature vectors. Then it uses
clustering technique to construct a model for each character
and those models are later used in the classification phase for
decision of which character should be recognized.
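The idea of building a per-character model from multiple training instances and classifying by the nearest model can be sketched as follows. This is a deliberate simplification: the feature vectors here are made up and fixed-length, and the "model" is a single centroid, whereas Tesseract's actual features and clustering are considerably richer.

```python
# Illustrative nearest-prototype classification: average the training
# instances of each character into a centroid "model", then classify a new
# feature vector by the closest centroid. Fixed-length, made-up feature
# vectors are used for simplicity.
from math import dist  # Euclidean distance between two points (Python 3.8+)

def build_models(training):
    """training: {char: [feature_vector, ...]} -> {char: centroid}."""
    models = {}
    for char, vectors in training.items():
        n = len(vectors)
        models[char] = tuple(sum(col) / n for col in zip(*vectors))
    return models

def classify(models, vector):
    """Return the character whose centroid is nearest to `vector`."""
    return min(models, key=lambda c: dist(models[c], vector))

training = {
    "а": [(1.0, 0.1), (0.9, 0.2), (1.1, 0.0)],
    "б": [(0.0, 1.0), (0.1, 0.9), (0.0, 1.1)],
}
models = build_models(training)
print(classify(models, (0.95, 0.15)))  # closest to the centroid of "а"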
Figure 2. Tesseract OCR engine architecture.
For preparing the training text, several different approaches
were tested regarding the font of the training text, the size of
the characters in the training text and the content of the training
text. Tests were done for each of the three problems. In the first
approach the training text was written with a font that was
made of quality photographs of single characters. In the second
approach the training text was written with a font that is similar
to the font of the receipts. For the size of the characters in the
training text tests were done with different font sizes starting
from 16px to 48px. Regarding the content of the training text
two different approaches were tested. In the first approach,
for each character that the model is trying to recognize there
were 10 to 25 different instances, depending on the frequency of
the character in the Macedonian language. For example, the
vowels had 20-25 instances, while special or very infrequent
characters, such as H or Z, had 10-15. In the second approach,
the training text consisted of 1300-1500 randomly sampled
words from different receipts.
The tests showed that the engine is most accurate when the size
of the letters in the training text is similar to that of the text on the
photographed receipts; in this case, the size used was 40px.
It was also concluded that better results can be achieved if the
training text consists of randomly sampled words from
different receipts. After all the testing, the training text used for
further analysis consisted of 1300-1500 randomly sampled words
from different receipts, was written with a font similar to that of
the receipts, and had characters of size 40px.
B. Tesseract-DTW
For the process of character recognition we also tried
another approach that uses Dynamic Time Warping (DTW) [8]
and K-Nearest Neighbors (KNN) classifier [9]. This approach,
Tesseract-DTW (shown in Figure 3), uses the original
Tesseract only for feature extraction; the final classification is
performed by the KNN classifier using DTW as a distance
metric. The DTW metric was chosen because the size of each
feature vector extracted by Tesseract varies and is not the
same for each character. Please note that applying a standard
classifier such as decision tree, SVM, etc., was not an option
because of the varying size of the feature vectors.
1) DTW
DTW, also known as dynamic programming matching, is a
well-known technique for finding an optimal alignment between
two given sequences [8]. It finds an optimal match between
two sequences of feature vectors by allowing stretching and
compression of sections of the sequences. DTW was first
used by Sakoe and Chiba [10] to compare different speech
patterns in automatic speech recognition. In fields such as data
mining and information retrieval, DTW has been successfully
applied to automatically cope with time deformations and
different speeds associated with time-dependent data. It has
also been used successfully for both online [11] and offline [12]
signature verification.
Figure 3. Tesseract-DTW architecture.
2) DTW distance
To calculate the distance between two vectors X1 = (x11,
x12, ..., x1i) and X2 = (x21, x22, ..., x2j), DTW needs a local
cost measure, sometimes also referred to as a local distance
measure. In this study the Euclidean distance is used as the cost
measure, see equation (1). By evaluating the local cost measure
for each pair of elements of the sequences X1 and X2, the cost
matrix M is calculated, see equation (2). The goal is to find an
alignment between X1 and X2 having minimal overall cost. For
calculating the minimal overall cost, three conditions must be
satisfied: the boundary condition, the monotonicity condition and
the step size condition. The minimal overall cost is the output of
the DTW algorithm, shown in equation (3).
Cost(x1i, x2j) = Euclid(x1i, x2j)    (1)

M[i][j] = Cost(x1i, x2j)    (2)

DTWdist(X1, X2) = M[1][1] + Smin + M[i][j]    (3)

where Smin = min(M[k+1][t], M[k][t+1], M[k+1][t+1]), with
k ∈ {1, 2, ..., i−2} and t ∈ {1, 2, ..., j−2}.
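The minimal overall cost described by equations (1)-(3) is usually computed with the standard DTW dynamic-programming recurrence, which can be sketched as follows. For illustration, the sequences here are scalar and the absolute difference stands in for the local cost; the paper's actual features are vectors with the Euclidean local cost.

```python
# Sketch of the DTW distance of equations (1)-(3): build the local cost
# matrix M, then accumulate the minimal-cost alignment with the standard
# dynamic-programming recurrence. Scalar sequences with the absolute
# difference as the local cost are simplifications for illustration.

def dtw_distance(seq1, seq2):
    n, m = len(seq1), len(seq2)
    # M[i][j] = local cost of aligning seq1[i] with seq2[j]  -- equation (2)
    M = [[abs(a - b) for b in seq2] for a in seq1]
    # D[i][j] = minimal accumulated cost of an alignment ending at (i, j)
    D = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best = 0.0                       # boundary condition
            elif i == 0:
                best = D[0][j - 1]
            elif j == 0:
                best = D[i - 1][0]
            else:                                # step-size condition
                best = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            D[i][j] = M[i][j] + best
    return D[n - 1][m - 1]                       # minimal overall cost, eq. (3)

# Two sequences with the same shape at different "speeds" align at zero cost.
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3]))  # -> 0.0
```

Monotonicity is enforced implicitly: the recurrence only ever steps forward in i, j, or both, so the warping path can stretch or compress sections but never go back.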
3) Evaluating Tesseract-DTW
For evaluating the Tesseract-DTW approach, 6 photographs
of different receipts were used: 5 of them as training
samples and 1 as a test sample. This was repeated 6 times, so that
each receipt was used once as a test sample.
First, each character of the training receipts is labeled. Then
feature extraction is done using Tesseract. After the feature
extraction, each character of the training receipts is described
with 4 feature vectors (4). X and Z have variable size (5),
while Y and W have constant size (6).
C1 = (X1, Y1, Z1, W1) (4)
X1 = (x1, x2, ..., xm), Z1= (z1, z2, ..., zj) (5)
Y1 = (y1, y2, y3), W1 = (w1, w2, w3) (6)
In the classification phase the KNN classifier was used. To
calculate the distance between two characters C1 and C2, a
combination of DTW and the Euclidean distance is used.
DTW is used for calculating the distance between the
vectors with varying size (7), and the Euclidean distance for
the vectors with fixed size (8). After the DTW and Euclidean
distances are calculated between the corresponding vectors of
the two characters, the final distance between the two characters
is calculated with the Euclidean distance over the four distances
(d1, d2, d3, d4) obtained in the previous step (9). The
character with the smallest distance to the test character is
chosen as the output of the classifier.
d1 = DTWdist(X1, X2), d3 = DTWdist(Z1, Z2)    (7)

d2 = Euclid(Y1, Y2), d4 = Euclid(W1, W2)    (8)

Distance(C1, C2) = Euclid(d1, d2, d3, d4)    (9)
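The combined distance of equations (7)-(9) and the nearest-neighbor decision can be sketched as follows. The feature values in the example are made up purely for illustration, and the DTW helper is a minimal scalar version with the absolute difference as the local cost.

```python
# Sketch of the combined character distance of equations (7)-(9): DTW for the
# variable-size vectors X and Z, Euclidean distance for the fixed-size vectors
# Y and W, and a final Euclidean combination of the four partial distances.
from math import dist, hypot

def dtw(s1, s2):
    """Minimal DTW distance with the absolute difference as local cost."""
    D = {(-1, -1): 0.0}
    for i in range(len(s1)):
        for j in range(len(s2)):
            step = min(D.get((i - 1, j), float("inf")),
                       D.get((i, j - 1), float("inf")),
                       D.get((i - 1, j - 1), float("inf")))
            D[i, j] = abs(s1[i] - s2[j]) + step
    return D[len(s1) - 1, len(s2) - 1]

def char_distance(c1, c2):
    """c1, c2: (X, Y, Z, W) feature-vector tuples, as in equation (4)."""
    x1, y1, z1, w1 = c1
    x2, y2, z2, w2 = c2
    d1, d3 = dtw(x1, x2), dtw(z1, z2)    # variable size -> DTW, eq. (7)
    d2, d4 = dist(y1, y2), dist(w1, w2)  # fixed size -> Euclidean, eq. (8)
    return hypot(d1, d2, d3, d4)         # final combination, eq. (9)

def nearest_neighbor(train, test_char):
    """train: list of (label, features); return the label of the closest one."""
    return min(train, key=lambda s: char_distance(s[1], test_char))[0]

# Made-up labeled training characters and one test character.
train = [("a", ((1, 2, 3), (0.5, 0.5, 0.5), (3, 2, 1), (0.1, 0.1, 0.1))),
         ("b", ((5, 5), (0.9, 0.9, 0.9), (0, 0, 0, 0), (0.8, 0.8, 0.8)))]
test_char = ((1, 2, 2, 3), (0.5, 0.6, 0.5), (3, 2, 2, 1), (0.1, 0.2, 0.1))
print(nearest_neighbor(train, test_char))  # -> a
```

Note that DTW lets the variable-size vectors of the test character (length 4) be compared directly against training vectors of length 3, which is exactly why a fixed-length classifier such as an SVM could not be applied here.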
III. EXPERIMENTAL RESULTS
Figure 4 shows an accuracy comparison for the two
approaches used for character recognition, Tesseract and
Tesseract-DTW. The comparison is performed using the
number of correctly recognized characters from 6
photographed receipts. One can note that only for the third
photograph is Tesseract-DTW better than the original
Tesseract; in all other cases the original Tesseract approach is
better. On average, Tesseract is better by 6 percentage
points. Compared by execution time, Tesseract was also
better than Tesseract-DTW, which was in a way expected
given the complexity of DTW and the use of the so-called
"lazy" (instance-based) KNN classifier.
Figure 4. Accuracy of correctly recognized characters using Tesseract
and Tesseract-DTW.
IV. DISCUSSION AND CONCLUSION
The paper presented an approach of OCR for receipts
printed in Macedonian language. The main OCR engine that
was used is Tesseract. In the process of character recognition
two approaches were tested. In the first approach the original
Tesseract was tested. Tests showed that Tesseract is most
accurate when the training text consists of randomly sampled
words from different receipts and is written with a font and
size similar to the characters that we are trying to recognize. In the
second approach, a modified version of Tesseract was used
(Tesseract-DTW). In this approach the feature extraction was
again performed by Tesseract, but the final classification was
done with a KNN classifier using DTW as the distance metric.
Tests showed that the first approach, using the
original Tesseract engine, outperformed the second approach by
6 percentage points. Further analysis showed that the accuracy
is higher for the numbers and words which are prescribed for
each receipt, such as the date and the time of the purchase and
the total amount of money spent. This was in a way expected,
because the classifier has more examples to train on: these
words are present in each receipt, and the numbers are limited
to only 10 characters. On the other hand, the names of the
products are more difficult to recognize, mainly because some
of them are not Macedonian words. In general, the more data is
used for training, the better the model should be. In the future we
plan to collect many more data samples by providing a free
smartphone application.
To the best of our knowledge, this is the first attempt to
apply OCR on receipts printed in Macedonian language using
the Cyrillic script, and moreover the first attempt to modify the
original Tesseract by applying the KNN algorithm with the
DTW distance metric. Even though the modified version of
Tesseract achieved slightly worse results, it is promising,
and we plan to improve it further in future work. We are
also considering an approach that combines both methods,
e.g. by using meta-learning, to eventually improve the
recognition accuracy.
ACKNOWLEDGMENT
The authors would like to thank the developers of the
Tesseract OCR framework for making it freely available to
the research and developer community.
REFERENCES
[1] G. Tauschek, "Reading machine," U.S. Patent 2,026,329, Dec. 1935.
[2] S. Mori, C. Y. Suen, K. Yamamoto, "Historical review of OCR research and development," Proceedings of the IEEE, vol. 80, no. 7, Jul. 1992.
[3] Google's Goggles application. https://play.google.com/store/apps/details?id=com.google.android.apps.unveil
[4] Google's API for OCR. https://developers.google.com/google-apps/documents-list/#uploading_documents_using_optical_character_recognition_ocr
[5] Translang application. https://play.google.com/store/apps/details?id=icactive.app.translang
[6] Tesseract-ocr. Mar. 2012. URL: http://code.google.com/p/tesseract-ocr/.
[7] R. Smith, "An overview of the Tesseract OCR engine," in Document Analysis and Recognition (ICDAR), 2007.
[8] M. Müller, Information Retrieval for Music and Motion, Springer, 2007.
[9] D. Aha, D. Kibler, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37-66, 1991.
[10] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, no. 1, pp. 43-49, 1978.
[11] Y. Qiao, X. Wang, C. Xu, "Learning Mahalanobis distance for DTW based online signature verification," in Proc. IEEE International Conference on Information and Automation (ICIA), June 2011.
[12] A. Piyush Shanker, A. N. Rajagopalan, "Off-line signature verification using DTW," Pattern Recognition Letters, vol. 28, no. 12, September 2007.