OCR as a Service: An Experimental Evaluation
of Google Docs OCR, Tesseract, ABBYY
FineReader, and Transym
Ahmad P. Tafti1, Ahmadreza Baghaie2, Mehdi Assefi3, Hamid R. Arabnia3,
Zeyun Yu4, and Peggy Peissig1
1Biomedical Informatics Research Center, Marshfield Clinic Research Foundation,
Marshfield, WI 54449, USA
{pahlavantafti.ahmad,peissig.peggy}@mcrf.mfldclin.edu
2Department of Electrical Engineering, University of Wisconsin-Milwaukee,
Milwaukee, WI 53211, USA
3Department of Computer Science, University of Georgia, Athens, GA 30602, USA
4Department of Computer Science, University of Wisconsin-Milwaukee,
Milwaukee, WI 53211, USA
Abstract. Optical character recognition (OCR), a classic machine learning challenge, has been a longstanding topic in a variety of applications in the healthcare, education, insurance, and legal industries, converting different types of electronic documents, such as scanned documents, digital images, and PDF files, into fully editable and searchable text data. The rapid generation of digital images on a daily basis makes OCR an imperative and foundational tool for data analysis. With the help of OCR systems, we have been able to save a considerable amount of effort in creating, processing, and saving electronic documents, adapting them to different purposes. A number of OCR platforms are now available which, aside from lending theoretical contributions to other practical fields, have demonstrated successful applications in real-world problems. In this work, several qualitative and quantitative experimental evaluations have been performed using four well-known OCR services: Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. We analyze the accuracy and reliability of the OCR packages employing a dataset of 1227 images from 15 different categories. Furthermore, we review state-of-the-art OCR applications in healthcare informatics. The present evaluation is expected to advance OCR research, provide new insights and considerations to the research area, and assist researchers in determining which service is ideal for optical character recognition in an accurate and efficient manner.
1 Introduction
Optical character recognition (OCR) has been a very practical research area
in many scientific disciplines, including machine learning [1–3], computer vision
[4–6], natural language processing (NLP) [7–9], and biomedical informatics [10–12].
This computational technology has been utilized in converting scanned,
hand-written, or PDF files into an editable text format (e.g., text file or MS
Word/Excel file) for further processing tasks [13,14]. OCR has contributed to
significant process improvement in many different real world applications in
healthcare, finance, insurance, and education. For example, in healthcare there
has been a need to deal with vast amounts of patient forms (e.g., insurance
forms). In order to analyze the information in such forms, it is critical to input
the patient data in a standardized format into a database so it can be accessed
later for analysis. Using OCR systems, we are able to automatically extract
information from the forms and enter it into databases, so that every patient’s
data is immediately recorded. OCR really simplifies the process by turning those
documents into easily editable and searchable text data. In the software engineering sense, "Software as a Service" (SaaS), an architectural model behind centralized computing, has emerged as both a design pattern and a delivery model in which software can be accessed through human-oriented and application-oriented interfaces [15–19]. Human users access a SaaS system through a web browser, while applications utilize the service through APIs (application programming interfaces).
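To make the application-oriented access path concrete, the following minimal sketch shows how a client program might consume an OCR service exposed as SaaS over HTTP. The endpoint, authentication scheme, and JSON response fields are purely hypothetical, not those of any service evaluated in this paper.

```python
# Hypothetical OCR-as-a-Service client: the URL, auth header, and JSON
# response shape below are illustrative assumptions, not a real vendor API.
import requests

API_URL = "https://api.example-ocr.com/v1/recognize"  # hypothetical endpoint

def recognize(image_path: str, api_key: str) -> str:
    """Send an image to a (hypothetical) OCR web service and return its text."""
    with open(image_path, "rb") as f:
        resp = requests.post(
            API_URL,
            files={"image": f},
            headers={"Authorization": f"Bearer {api_key}"},
        )
    resp.raise_for_status()
    return resp.json().get("text", "")

print(recognize("invoice.png", "<API_KEY>"))
```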
To date, several attempts have been made to design and develop OCR ser-
vices and/or packages, such as Google Docs OCR [20], Tesseract [21,22], ABBYY
FineReader [23,24], Transym [25], Online OCR [26], and Free OCR [27]. Based
on core functionalities, including recognition accuracy, performance, multilin-
gual support, open-source implementation, delivery as a software development
kit (SDK), high availability, and rating in the OCR community [28,29], the
present contribution is mainly focused on the experimental evaluation of Google
Docs OCR, Tesseract, ABBYY FineReader, and Transym. The current work
is expected to provide better insights to the OCR study, and address several
capabilities for possible future enhancements.
The rest of the paper is arranged as follows. The Google Docs OCR, Tesseract, ABBYY FineReader, and Transym OCR systems will be introduced in Sect. 2. In Sect. 3 we review, from an application perspective, the state-of-the-art OCR systems in healthcare informatics. Experimental validations, including the dataset, testbed, and results, will be reported in Sect. 4. Section 5 provides discussion and concludes the work.
2 OCR Toolsets
OCR toolsets and their underlying algorithms not only focus on text and char-
acter recognition in a reliable manner, but may also address: (1) Layout analysis
in which they can detect and understand different items in an image (e.g., text,
tables, barcodes), (2) Support of various alphabets, including English, Greek,
and Persian, and (3) Support of different types of input images (e.g., TIFF,
JPEG, PNG, PDF) and capabilities to export text data in different output for-
mats. The basis of OCR methods dates back to 1914 when Goldberg designed
a machine that was able to read characters and turn them into standard tele-
graph code [13]. With the emergence of computerized systems, many artificial
intelligence researchers have tried to tackle the problem of OCR complexity to
build efficient OCR systems capable of working in accurate and real-time fashion
(e.g., [2,30–33]). Although there are many OCR methods and toolsets available
now in the literature, here we limit the work to a comparative study of four
well-known OCR toolsets namely “Google Docs OCR”, “Tesseract”, “ABBYY
FineReader”, and “Transym”.
Google Docs OCR [20] is an easy-to-use and highly available OCR service
offered by Google within the Google Drive service [34]. We can convert different
types of image data into editable text data using Google Drive. Once we upload an image or a PDF file to Google Drive, we can start the OCR conversion by right-clicking on the file and selecting the "Open with Google Docs" item; the image then appears inside a Google Doc document with the extracted text right below it.
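The same conversion can be driven programmatically. Below is a hedged sketch using the Google Drive v3 API through the google-api-python-client library: uploading an image with the Google Docs MIME type asks Drive to OCR it, and the resulting document can then be exported as plain text. Credential setup is omitted, and the parameter names (e.g., ocrLanguage) should be checked against the current Drive API documentation.

```python
# Sketch: OCR via the Google Drive v3 API (google-api-python-client).
# Assumes `creds` holds valid OAuth2 credentials obtained beforehand.
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

drive = build("drive", "v3", credentials=creds)

media = MediaFileUpload("scanned_form.jpg", mimetype="image/jpeg")
doc = drive.files().create(
    body={"name": "scanned_form_ocr",
          "mimeType": "application/vnd.google-apps.document"},  # triggers OCR
    media_body=media,
    ocrLanguage="en",  # recognition-language hint
    fields="id",
).execute()

# Export the OCR'd Google Doc as plain text.
text = drive.files().export(fileId=doc["id"], mimeType="text/plain").execute()
print(text.decode("utf-8"))
```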
Tesseract was originally developed by HP as an open-source OCR toolset released under the Apache License [35], and is available for different operating system platforms, such as Mac OS X, Linux, and Windows. Since 2006, Tesseract development has been maintained by Google [36], and it is among the top OCR systems used worldwide [29]. The Tesseract algorithm first uses adaptive thresholding strategies [37] to convert the input image into a binary one. It then utilizes connected component analysis to extract character outlines, which are turned into blobs: regions of an image that differ in properties such as color or intensity from the surrounding pixels [36]. Blobs are organized into text lines, which are examined for a uniform text size and divided into words using fuzzy spaces [36]. Text recognition then proceeds as a two-stage process. In the first stage, the algorithm attempts to recognize each word in the text; every satisfactory word is passed to an adaptive classifier as training data. In the second stage, the adaptive classifier helps to recognize the remaining text more reliably [36].
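As a hedged illustration of this pipeline, the sketch below reproduces the first step (binarization by adaptive thresholding) with OpenCV and then hands the result to Tesseract through the pytesseract wrapper. Both libraries and a system Tesseract installation are assumed, and the thresholding parameters are illustrative rather than Tesseract's internal settings.

```python
# Sketch: adaptive-thresholding binarization followed by Tesseract OCR.
# Requires opencv-python, pytesseract, and a local Tesseract installation.
import cv2
import pytesseract

image = cv2.imread("scanned_page.png")          # hypothetical input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarize with a local (adaptive) threshold: 31x31 window, offset 10.
binary = cv2.adaptiveThreshold(gray, 255,
                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 10)

# Tesseract performs its own layout analysis and two-pass recognition.
print(pytesseract.image_to_string(binary))
```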
ABBYY FineReader as an advanced OCR software system has been
designed and developed by an international company, namely “ABBYY” [23]
to provide high level OCR services. It has been improving the main functionali-
ties of optical character recognition for many years, providing promising results
in text retrieval from digital images [28]. The underlying algorithms of ABBYY FineReader have not been disclosed to the research community, probably because it is a commercial software product and the package is not available as open-source code. Researchers and developers can access ABBYY FineReader OCR in two different ways: (1) through the ABBYY FineReader SDK, which is available at https://www.abbyy.com/resp/promo/ocr-sdk/, and (2) through a web browser, over the Internet at https://finereaderonline.com/en-us/Tasks/Create.
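For the programmatic route, ABBYY also offered a Cloud OCR SDK with a REST interface at the time of writing. The sketch below shows the general shape of a submission call; the host, endpoint, parameters, and the task-polling step that follows are assumptions to be verified against ABBYY's own documentation.

```python
# Sketch (assumed API shape): submit an image to ABBYY's Cloud OCR SDK.
# The service is documented to return XML containing a task id, which is
# then polled (getTaskStatus) until a result download URL becomes available.
import requests

APP_ID, APP_PASSWORD = "<app-id>", "<app-password>"   # issued by ABBYY

with open("contract_page.png", "rb") as f:
    resp = requests.post(
        "https://cloud.ocrsdk.com/processImage",      # assumed endpoint
        params={"language": "English", "exportFormat": "txt"},
        data=f.read(),
        auth=(APP_ID, APP_PASSWORD),                  # HTTP Basic auth
    )
resp.raise_for_status()
print(resp.text)  # XML task description; poll for the finished result next
```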
Transym is another OCR software package that assists research and devel-
opment communities in extracting accurate information from digital documents,
particularly scanned and digital images. The source code of Transym and its underlying algorithms are not available, but it has been delivered as an SDK that provides a high-level API, and it also ships as a software package with a light GUI (graphical user interface) that can be easily installed and used. The Transym OCR package, along with some sample code, is available at http://www.transym.com/download.htm.
3 Applications in Healthcare Informatics
There have been limited studies surrounding the application of OCR within
healthcare. Generally, the studies are divided into two major approaches:
(1) Prospective data collection using forms that are specifically designed to cap-
ture hand printed data for OCR processing, and (2) Retrospective OCR data
extraction using scanned historical paper documents or image forms [38]. There
are several innovative examples of prospective OCR data capture at point-of-
care. Titlestad [39] created a special OCR form to register new cancer patients
into a large cancer registry. The OCR forms captured basic patient demographics
and cancer codes. More recently, OCR was introduced to capture data on antiretroviral treatment, drug switches, and tolerability for human immunodeficiency
virus (HIV-1) patients [40]. This application enabled clinical staff to better man-
age the care of the HIV patient because the data could be tracked from visit to
visit. Lee et al. [40] used OCR to minimize the transcription effort of radiologists
when creating radiology reports. The Region of Interest (ROI) values (includ-
ing area, mean, standard deviation, maximum and minimum) were limited to
view on the computed tomography (CT) console or image analysis workstation.
This image was then stored in a Picture Archiving and Communicating System
(PACS). Radiologists would review the PAC images on the screen and then type
the ROI measurements into a radiology report. OCR was used to automatically
capture the ROI and measurements to place it on the clip board so it could
be copied into the radiology report. Finally, Hawker et al. [41] used a set of
cameras to capture the patient name when processing lab samples. OCR was
used to interpret the patient name on incoming biological samples and then the
name was compared to the laboratory information system for validity. The OCR
mislabeling identification process outperformed the normal quality assurance
process.
The majority of retrospective OCR studies have focused on retrieving medical
data for research use. Peissig et al. [42] used OCR to extract cataract subtypes
and severity from handwritten ophthalmology forms to enrich existing electronic
health record data for a large genome-wide association study. This application
extracted data from existing clinical forms that were not designed for OCR use
with high accuracy rates. Fenz et al. [43] developed a pipeline that processed
paper-based medical records using the open-source OCR engine Tesseract to
extract synonyms and formal specifications of personal and medical data ele-
ments. The pipeline was applied on a large scale to health system documents
and the output then used to identify representative research samples. Finally,
OCR was applied to photographed printed medical records to detect diagnosis
codes, medical tests and medications enabling the creation of structured personal
health records. This study applied OCR to a real-world situation and addressed
image quality problems and complex content by pre-processing and using mul-
tiple OCR engine synthesis [44].
4 Experimental Validations
To validate the accuracy, reliability, and performance of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym, several experiments on real as well as synthetic data were performed. In Sect. 4.1 we discuss the experimental setup, including the proposed dataset along with the testbed and its configuration. In Sect. 4.2 the qualitative OCR visualization results achieved from the OCR packages/services are reported. Subsequently, in Sect. 4.3 we examine the accuracy and reliability of the OCR systems and perform a quantitative comparative study. Section 4.3 also presents and compares a set of quality attributes that the OCR systems offer to the research community.
4.1 Experimental Setup: The Dataset and Testbed
We have gathered 1227 images from 15 categories, including: (1) Digital Images,
(2) Machine-written characters, (3) Machine-written digits, (4) Hand-written
characters, (5) Hand-written digits, (6) Barcodes, (7) Black and white images,
(8) Multi-oriented text strings, (9) Skewed images, (10) License plate numbers,
(11) PDF files including electronic forms, (12) Digital receipts, (13) Noisy images,
(14) Blurred images, and (15) Multilingual text images. Figure 1 shows an example from each category listed here. Except for the PDF files (dataset No. 11), all images were taken at different resolutions using multiple formats, such as JPEG, TIFF, and PNG. The dataset attributes are summarized in Table 1. Each dataset comes with ground-truth information, including a list of the characters present in the images. For all experiments reported here, we used the 64-bit MS Windows 8 operating system on a personal computer with a 3.00 GHz Intel dual-core CPU, 4 MB cache, and 6 GB RAM. To communicate with Google Docs OCR [20], we employed Mozilla Firefox Version 48.0.1 (https://www.mozilla.org/).
4.2 Qualitative OCR Visualization
Using different images from the dataset described in Sect. 4.1, we examined the qualitative visualization of the OCR systems. Figure 2 shows some sample results of extracting text data from digital images.
4.3 Comparative Study
Here, we further analyzed and compared the accuracy and reliability of the
Google Docs OCR, Tesseract, ABBYY FineReader, and Transym using the
dataset reported in Table 1. A detailed comparative study is reported in Table 2.
Table 1. Dataset attributes. The first column shows the image categories. The number and type of the images are shown in the second column. CG, BW, and BWC stand for color and gray-scale, black & white, and black & white and color images, respectively.
Image category Images Formats
Digital images 131 CG TIFF, JPEG, GIF, PNG
Machine-written characters 47 CG TIFF, JPEG, GIF, PNG
Machine-written digits 28 CG TIFF, JPEG, GIF, PNG
Hand-written characters 49 BW TIFF, JPEG, GIF, PNG
Hand-written digits 28 BW TIFF, JPEG, GIF, PNG
Barcodes 224 BWC TIFF, JPEG, GIF, PNG
Black and white images 101 BW TIFF, JPEG, PNG
Multi-oriented text strings 27 CG TIFF, JPEG, PNG
Skewed images 93 CG JPEG, PNG
License plate numbers 204 CG JPEG, PNG
PDF files 14 CG PDF
Digital receipts 108 CG JPEG, PNG
Noisy images 24 CG JPEG, PNG
Blurred images 31 CG JPEG, PNG
Multilingual text images 118 CG TIFF, JPEG, PNG
Fig. 1. Sample images from each category of the proposed dataset. The dataset includes
1227 digital images in 15 different categories.
Fig. 2. The qualitative visualization of the four OCR systems using some sample images
from the dataset.
A comparative examination of color as well as gray-scale images, with or without applying low-level image processing tasks (e.g., contrast/brightness enhancement), is shown in Fig. 3. To calculate the accuracy for each OCR system discussed in the current work, we divided the number of characters correctly extracted from a dataset by the number of characters existing in that dataset, using Eq. (1), where n denotes the number of images in the dataset. We then averaged over the datasets to obtain the total accuracy for each individual OCR system.

\[
\text{Accuracy} = \frac{\sum_{k=1}^{n} (\text{number of correctly extracted characters in image } k)}{\sum_{k=1}^{n} (\text{number of total characters in image } k)} \times 100 \qquad (1)
\]
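A minimal sketch of Eq. (1) as code follows; the per-image counts are hypothetical, for illustration only.

```python
def accuracy(correct_per_image, total_per_image):
    """Eq. (1): total correctly extracted characters over total ground-truth
    characters across the n images of a dataset, as a percentage."""
    return 100.0 * sum(correct_per_image) / sum(total_per_image)

# Hypothetical counts for a three-image dataset:
correct = [87, 121, 46]   # correctly extracted characters per image
total = [100, 130, 50]    # ground-truth characters per image
print(f"Accuracy: {accuracy(correct, total):.2f}%")  # Accuracy: 90.71%
```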
Table 2 shows that Google Docs OCR and ABBYY FineReader produced more promising results on the stated dataset, and the population standard deviations of accuracy obtained by those two are more consistent across the dataset.

Table 2. A comparative study of the OCR systems. In this table we report analysis results obtained from the 15 image categories, examining the ability of the OCR systems to correctly extract characters from images. The percentages denote accuracy (Eq. 1).

Image category               Existing characters   Google Docs OCR   Tesseract        ABBYY FineReader   Transym
Digital images               1834                  1613 (87.95%)     1539 (83.91%)    1528 (83.31%)      1463 (79.77%)
Machine-written characters   703                   569 (80.94%)      549 (78.09%)     574 (81.65%)       554 (78.81%)
Machine-written digits       211                   191 (90.52%)      193 (91.47%)     193 (91.47%)       194 (91.94%)
Hand-written characters      2036                  1254 (61.59%)     984 (48.33%)     1204 (59.14%)      960 (47.15%)
Hand-written digits          43                    29 (67.44%)       11 (25.58%)      25 (58.14%)        10 (23.26%)
Barcodes                     867                   841 (97%)         844 (97.35%)     832 (95.96%)       845 (97.47%)
Black and white images       71                    69 (97.19%)       69 (97.19%)      65 (91.55%)        61 (85.92%)
Multi-oriented text strings  106                   68 (64.15%)       30 (28.3%)       75 (70.75%)        23 (21.7%)
Skewed images                96                    38 (39.58%)       31 (32.3%)       36 (37.5%)         27 (28.13%)
License plate numbers        1953                  1871 (95.8%)      1812 (92.78%)    1894 (96.98%)      1732 (88.68%)
PDF files                    15693                 15409 (98.19%)    14121 (89.98%)   15376 (97.98%)     14133 (90%)
Digital receipts             3672                  3256 (88.67%)     3341 (90.99%)    3302 (89.92%)      3077 (83.8%)
Noisy images                 337                   179 (53.12%)      161 (47.77%)     184 (54.6%)        169 (50.15%)
Blurred images               461                   259 (56.18%)      263 (57.05%)     282 (61.17%)       277 (60.09%)
Multilingual text images     3597                  2831 (78.7%)      2474 (68.78%)    2799 (77.81%)      1740 (48.37%)
Standard deviation                                 σ = 18.19         σ = 25.56        σ = 18.02          σ = 25.79

In addition to the experiments illustrated in Table 2, we divided the dataset into two parts, color and gray-scale images. Using color images, we obtained 74%, 64%, 71%, and 59% accuracy for Google Docs OCR, Tesseract, ABBYY FineReader, and Transym, respectively. After performing low-level image processing tasks, including brightness and contrast enhancement, we obtained 75%, 64%, 75%, and 62% accuracy (Fig. 3). Using gray-scale images, we obtained 77%, 71%, 78%, and 68% accuracy for Google Docs OCR, Tesseract, ABBYY FineReader, and Transym, respectively. After performing low-level image processing tasks, such as brightness and contrast enhancement, we achieved 81%, 72%, 79%, and 70% accuracy (Fig. 3).
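As a hedged sketch of the low-level preprocessing referred to above, the following OpenCV snippet converts an image to gray-scale and applies a simple linear brightness/contrast adjustment before the image is handed to an OCR engine; the gain and offset values are illustrative, not the exact settings used in our experiments.

```python
# Sketch: gray-scale conversion plus linear contrast/brightness enhancement.
# enhanced_pixel = alpha * pixel + beta  (alpha: contrast, beta: brightness).
import cv2

img = cv2.imread("sample.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
enhanced = cv2.convertScaleAbs(gray, alpha=1.3, beta=20)  # illustrative values
cv2.imwrite("sample_enhanced.png", enhanced)              # feed this to OCR
```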
Fig. 3. A comparative study of the OCR systems using color and gray-scale images,
with or without applying low-level image processing tasks (e.g., contrast/brightness
enhancement). (Color figure online)
Table 3. A comparative study of quality attributes of the OCR systems.

Quality attribute        Google Docs OCR   Tesseract                  ABBYY FineReader           Transym
Open-source              No                Yes                        No                         No
Available online         Yes               No                         Yes                        No
Available as an SDK      No                Yes                        Yes                        Yes
Available as a service   Yes               Could be                   No                         No
Multilingual support     Yes               Yes                        Yes                        Yes
Free                     Yes               Yes                        No                         No
Operating systems        Any               Linux, Mac OS X, Windows   Linux, Mac OS X, Windows   Windows
Table 3 summarizes a comparative analysis of a set of quality attributes delivered by the OCR systems.
5 Discussion and Conclusion
We performed a qualitative and quantitative comparative study of four optical
character recognition services, including Google Docs OCR, Tesseract, ABBYY
FineReader, and Transym using a dataset containing 1227 images in 15 different
categories. In addition to experimentally evaluating the OCR systems, we also
reviewed OCR applications in the field of healthcare informatics. Based on our
experimental evaluations using the stated dataset, and without employing advanced image processing procedures (e.g., denoising, image registration), Google Docs OCR and ABBYY FineReader produced more promising results, and their population standard deviations of accuracy remained consistent across the different types of images in the dataset. As we have seen in the experiments, the quality of the input images has a crucial impact on OCR output. For example, all of the examined OCR systems struggled with skewed, blurred, and noisy images. The remedy can be sought in taking advanced low-level and medium-level image processing routines into account. We believe that the proposed dataset comes with a reasonable distribution of image types, but testing large-scale datasets comprising hundreds of thousands of digital images is still needed. As a classic machine learning problem, OCR is not only about character recognition itself, but also about learning how to be more accurate from the data of interest. OCR is a challenging research topic that spans a variety of functionalities, such as layout analysis, support for different alphabet and digit styles, and well-formed binarisation to separate text data from the image background. As part of our future work, an attempt will be made to evaluate further OCR services using large-scale datasets, incorporating more rigorous statistical analysis of accuracy and reliability. We will take advantage of advanced image processing algorithms and examine the benefit of their use towards developing more accurate and efficient optical character recognition systems.
Acknowledgement. The authors of the paper wish to thank Anne Nikolai at Marsh-
field Clinic Research Foundation for her valuable contributions in manuscript prepa-
ration. We also thank two anonymous reviewers for their useful comments on the
manuscript.
References
1. Lin, H.-Y., Hsu, C.-Y.: Optical character recognition with fast training neural
network. In: 2016 IEEE International Conference on Industrial Technology (ICIT),
pp. 1458–1461. IEEE (2016)
2. Patil, V.V., Sanap, R.V., Kharate, R.B.: Optical character recognition using arti-
ficial neural network. Int. J. Eng. Res. Gen. Sci. 3(1), 7 (2015)
3. Spitsyn, V.G., Bolotova, Y.A., Phan, N.H., Bui, T.T.T.: Using a Haar wavelet
transform, principal component analysis and neural networks for OCR in the pres-
ence of impulse noise. Comput. Opt. 40(2), 249–257 (2016)
4. Bunke, H., Caelli, T.: Hidden Markov Models: Applications in Computer Vision,
vol. 45. World Scientific, River Edge (2001)
5. Gupta, M.R., Jacobson, N.P., Garcia, E.K.: OCR binarization and image pre-
processing for searching historical documents. Pattern Recogn. 40(2), 389–397
(2007)
6. Jadhav, P., Kelkar, P., Patil, K., Thorat, S.: Smart traffic control system using
image processing (2016)
7. Afli, H., Qiu, Z., Way, A., Sheridan, P.: Using SMT for OCR error correction
of historical texts. In: Proceedings of LREC-2016, Portorož, Slovenia (2016, to
appear)
8. Kolak, O., Byrne, W., Resnik, P.: A generative probabilistic OCR model for
NLP applications. In: Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language
Technology, vol. 1, pp. 55–62. Association for Computational Linguistics (2003)
9. Kolak, O., Resnik, P.: OCR post-processing for low density languages. In: Proceed-
ings of the Conference on Human Language Technology and Empirical Methods
in Natural Language Processing, pp. 867–874. Association for Computational Lin-
guistics (2005)
10. Deselaers, T., Müller, H., Clough, P., Ney, H., Lehmann, T.M.: The CLEF 2005
automatic medical image annotation task. Int. J. Comput. Vis. 74(1), 51–58 (2007)
11. Kaggal, V.C., Elayavilli, R.K., Mehrabi, S., Joshua, J.P., Sohn, S., Wang, Y., Li,
D., Rastegar, M.M., Murphy, S.P., Ross, J.L., et al.: Toward a learning health-care
system-knowledge delivery at the point of care empowered by big data and NLP.
Biomed. Inf. Insights 8(Suppl1), 13 (2016)
12. Pomares-Quimbaya, A., Gonzalez, R.A., Quintero, S., Muñoz, O.M., Bohórquez,
W.R., García, O.M., Londoño, D.: A review of existing applications and techniques
for narrative text analysis in electronic medical records (2016)
13. Herbert, H.F.: The History of OCR, Optical Character Recognition. Recognition
Technologies Users Association, Manchester Center (1982)
14. Tappert, C.C., Suen, C.Y., Wakahara, T.: The state of the art in online handwriting
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(8), 787–808 (1990)
15. Assefi, M., Liu, G., Wittie, M.P., Izurieta, C.: An experimental evaluation of Apple Siri and Google speech recognition. In: Proceedings of the 2015 ISCA SEDE (2015)
16. Assefi, M., Wittie, M., Knight, A.: Impact of network performance on cloud speech
recognition. In: 2015 24th International Conference on Computer Communication
and Networks (ICCCN), pp. 1–6. IEEE (2015)
17. Hatch, R.: SaaS Architecture, Adoption and Monetization of SaaS Projects using
Best Practice Service Strategy, Service Design, Service Transition, Service Oper-
ation and Continual Service Improvement Processes. Emereo Pty Ltd., London
(2008)
18. Tafti, A.P., Hassannia, H., Piziak, D., Yu, Z.: SeLibCV: a service library for com-
puter vision researchers. In: Bebis, G., et al. (eds.) ISVC 2015. LNCS, vol. 9475,
pp. 542–553. Springer, Heidelberg (2015). doi:10.1007/978-3-319-27863-6_50
19. Xiaolan, X., Wenjun, W., Wang, Y., Yuchuan, W.: Software crowdsourcing for
developing software-as-a-service. Front. Comput. Sci. 9(4), 554–565 (2015)
20. Google Docs (2012). http://docs.google.com
21. Tesseract OCR (2016). https://github.com/tesseract-ocr
22. Tesseract.js, a pure JavaScript version of the Tesseract OCR engine (2016). http://tesseract.projectnaptha.com/
23. ABBYY OCR (2016). https://www.abbyy.com/
24. ABBYY OCR online (2016). https://finereaderonline.com/en-us/Tasks/Create
25. Transym (2016). http://www.transym.com/
26. Online OCR (2016). http://www.onlineocr.net/
27. Free OCR (2016). http://www.free-ocr.com/
28. Mendelson, E.: Abbyy finereader 12 professional. Technical report, PC Magazine
(2014)
29. Rice, S.V., Jenkins, F.R., Nartker, T.A.: The fourth annual test of OCR accuracy. Technical Report 95 (1995)
30. Bautista, C.M., Dy, C.A., Mañalac, M.I., Orbe, R.A., Cordel, M.: Convolutional
neural network for vehicle detection in low resolution traffic videos. In: 2016 IEEE
Region 10 Symposium (TENSYMP), pp. 277–281. IEEE (2016)
31. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444
(2015)
32. Shah, P., Karamchandani, S., Nadkar, T., Gulechha, N., Koli, K., Lad, K.: OCR-
based chassis-number recognition using artificial neural networks. In: 2009 IEEE
International Conference on Vehicular Electronics and Safety (ICVES), pp. 31–34.
IEEE (2009)
33. Ye, Q., Doermann, D.: Text detection and recognition in imagery: a survey. IEEE
Trans. Pattern Anal. Mach. Intell. 37(7), 1480–1500 (2015)
34. Google Drive (2012). http://drive.google.com
35. Apache license, version 2.0 (2004). http://www.apache.org/licenses/LICENSE-2.0
36. Smith, R.: An overview of the Tesseract OCR engine (2007)
37. Bradley, D., Roth, G.: Adaptive thresholding using the integral image. J. Graph.
GPU Game Tools 12(2), 13–21 (2007)
38. Rasmussen, L.V., Peissig, P.L., McCarty, C.A., Starren, J.: Development of an
optical character recognition pipeline for handwritten form fields from an electronic
health record. J. Am. Med. Inf. Assoc. 19(e1), e90–e95 (2012)
39. Titlestad, G.: Use of document image processing in cancer registration: how and
why? Medinfo. MEDINFO 8, 462 (1994)
40. Bussmann, H., Wester, C.W., Ndwapi, N., Vanderwarker, C., Gaolathe, T., Tirelo,
G., Avalos, A., Moffat, H., Marlink, R.G.: Hybrid data capture for monitoring
patients on highly active antiretroviral therapy (HAART) in urban Botswana. Bull.
World Health Org. 84(2), 127–131 (2006)
41. Hawker, C.D., McCarthy, W., Cleveland, D., Messinger, B.L.: Invention and vali-
dation of an automated camera system that uses optical character recognition to
identify patient name mislabeled samples. Clin. Chem. 60(3), 463–470 (2014)
42. Peissig, P.L., Rasmussen, L.V., Berg, R.L., Linneman, J.G., McCarty, C.A.,
Waudby, C., Chen, L., Denny, J.C., Wilke, R.A., Pathak, J., et al.: Importance
of multi-modal approaches to effectively identify cataract cases from electronic
health records. J. Am. Med. Inform. Assoc. 19(2), 225–234 (2012)
43. Fenz, S., Heurix, J., Neubauer, T.: Recognition and privacy preservation of paper-
based health records. Stud. Health Technol. Inf. 180, 751–755 (2012)
44. Li, X., Hu, G., Teng, X., Xie, G.: Building structured personal health records from
photographs of printed medical records. In: AMIA Annual Symposium Proceed-
ings, vol. 2015, p. 833. American Medical Informatics Association (2015)