A Bank Information Extraction System Based
on Named Entity Recognition with CRFs from
Noisy Customer Order Texts in Turkish
Erdem Emekligil, Secil Arslan, and Onur Agin
R&D and Special Projects Department,
Yapi Kredi Technology, Istanbul, Turkey
Email: {erdem.emekligil, secil.arslan, onur.agin}@ykteknoloji.com.tr
Abstract. Each day, hundreds of thousands of customer transactions arrive at the bank's operation center via the fax channel. The information required to complete each transaction (money transfer, salary payment, tax payment etc.) is extracted manually by operators from the image of the customer order. Our information extraction system uses CRFs (Conditional Random Fields) to obtain the required named entities for each transaction type from the noisy text of customer orders. The difficulty of the problem arises from the fact that every customer order has a different format, the image resolution of the orders is so low that the OCR-ed (Optical Character Recognition) texts are highly noisy, and Turkish is still challenging for natural language processing techniques due to the structure of the language. This paper discusses the difficulties of our problem domain and provides details of the methodology developed for extracting entities such as client name, organization name, bank account number, IBAN, amount, currency and explanation.
Keywords: Named Entity Recognition, Turkish, Conditional Random
Fields, Noisy Text, Banking Applications
1 Introduction
Natural Language Processing (NLP) systems have been employed for many different tasks in Information Retrieval (IR). The specific task of extracting predefined types of data from documents is referred to as Named Entity Recognition (NER).
Each banking process, such as money transfers or tax payments, is performed on a maker-checker basis. The maker is responsible for entering the required data to initiate the customer order, whereas the checker gives the final approval to prevent fraud at the data entry step. Customer orders are mostly received via the fax channel, which means each order document arrives in image format. Due to the low resolution settings of fax machines, OCR of the customer order images generates noisy text that challenges even state-of-the-art NER systems.
In this paper, we present a fast and robust information extraction system that harnesses NER to automatically extract the necessary entities for banking processes from OCR-ed Turkish customer order documents, hence minimizing the effort and time spent on data entry.
There have been many applications of Turkish NER in the literature. Tatar and Cicekli [7] extracted different features from text through automatic rule learning. Their method achieved a 91.08% F-score on their TurkIE Turkish news data1. Yeniterzi [5] exploited morphology by capturing syntactic and contextual properties of tokens and reported an 88.94% F-score on Tur et al.'s Turkish general news data [11] using CRFs. Seker and Eryigit [4] also used CRFs along with morphological and lexical features, and additionally made use of large-scale person and location gazetteers. Their state-of-the-art framework reports a 91.94% F-score on Tur et al.'s data.
The Turkish newspaper domain is well studied and the results are close to those obtained for English. However, in both English and Turkish, results on unstructured, noisy data such as Twitter are relatively poor. Since what makes NER on customer orders difficult is the noisy, unstructured text, our problem is to some degree similar to NER on Twitter data. Yamada et al. [9] report the best performing method of the ACL 2015 Twitter NER shared task2 with a 56.41% F-score on English Twitter data, exploiting open knowledge bases such as DBpedia and Freebase. Celikkaya et al. [12] trained a CRF model on news data and achieved a 19% F-score on their Turkish tweet test data3. Kucuk and Steinberger [8] reported a diacritics-based expansion of the lexical resources of their base NER system, which results in a 38.01% PLO (person, location, organization) overall F-score on Celikkaya et al.'s data and a 48.13% PLO overall F-score on their own Turkish Twitter data. Eken and Tantug [10] exploited gazetteers along with some basic features. They used Celikkaya et al.'s Turkish Twitter data for training and their own annotated Turkish Twitter data for testing, reporting a 64.03% F-score.
The remainder of this paper is structured as follows: in Section 2, we explain
the problems that differentiate our system from other NER systems. We provide
details of our approach in Section 3 and share our experimental test results in
Section 4. Finally, we conclude by giving our remarks in Section 5.
2 Problem Definition
In this section, we give brief information on three main challenges:
– Turkish as a language with complex morphology and low resources
– Noisy text generated by OCR systems
– Lack of consensus on document types in banking systems
2.1 Turkish
Turkish is a highly agglutinative and morphologically rich language. In agglutinative languages, each affix gives a different meaning to the word. For
1 The TurkIE dataset contains approximately 55K tokens from news articles on terrorism, drawn from both online and print news sources.
2 ACL 2015 Workshop on Noisy User-generated Text (W-NUT).
3 Celikkaya et al.'s data contains approximately 5K tweets with about 50K tokens.
instance, after adding the plural suffix “-lar” to the word “kitap” (book), it becomes “kitaplar” (books). Furthermore, by appending the possessive suffix “-ım”, the word “kitaplarım” (my books) gains a possessive meaning. Common NLP techniques that have been developed for English, such as stemming and tokenization, do not perform as well for Turkish. For instance, since Turkish is an agglutinative language, standard stemming techniques lose important information that is embedded in the morphology: the word “hesabımıza” (to our account) loses its possessive information after it is converted to “hesap” (account) by stemming. In contrast to English, which has the constituent order SVO (Subject-Verb-Object), Turkish has SOV order. Moreover, inverted sentences that change the SOV order are often used in Turkish, which makes the NER task on Turkish even harder. For example, the sentences “Ali kitap oku” and “Kitap oku Ali” have the same meaning, “Ali, read the book”.
2.2 Noisy Text
To convert the fax documents to text, an Optical Character Recognition (OCR) tool is used. However, the customer order images often have low resolution and the OCR is error-prone, so the produced text may deviate from its original form. For instance, as shown in Figure 1, the word “nezdindeki” (in care of) is recognized as “nezrimdeki”, which is not a valid Turkish word. Statistically, this word often appears before account numbers and therefore carries valuable information about where account numbers are located. As discussed in the following sections, “nezdindeki” is used as a feature along with many other words, and losing such words in OCR has a large impact on system performance.
Fig. 1. Sample fax document and its OCR result. The word “nezdindeki” and its OCR result are underlined. Private client information is masked for security reasons.
2.3 Document Types
In the banking domain, the customer transaction order documents that come via the fax channel are created by the customers themselves. Since customers can create their own documents and no template is provided by the bank, the resulting documents can be very diverse. The documents might therefore contain multiple transactions, which increases the entity counts on a page significantly. In addition, spelling and capitalization errors made by customers affect the system, since the capitalization information of words is used as a feature.
Fig. 2. Sample blurred transaction documents received by fax and their annotated entities. Each entity type is annotated with a different color. The documents are made unreadable by blurring to protect private customer information, but the structure of the documents and the relations between different entity types can still be seen. a) Document containing a table-like structure. b) Document containing a tuple structure. c) Unstructured document.
Although the language used in the documents is Turkish, it is not as well formatted as newspaper text. While some documents contain parts that resemble tables (Figure 2.a), others may contain a header-entity tuple structure (Figure 2.b) such as “Alıcı adı: Mehmet” (Receiver name: Mehmet), “Alıcı IBAN: TR01 2345 6789 ...” (Receiver IBAN: TR01 2345 6789 ...). As can be seen in Figure 2, the frequencies of the entity types may differ from document to document, and no explicit relation can be found between these entity types.
3 Proposed Methodology
In this section, we briefly explain Conditional Random Fields (CRFs) and then describe the features we use with the CRF, along with its output named entity types.
3.1 Conditional Random Fields
Conditional Random Fields (CRFs) [1] are conditional undirected probabilistic graphical models that are heavily used in Named Entity Recognition [14,15], Part-of-Speech (POS) tagging [16] and many other tasks. We are interested in linear-chain CRFs, which combine the strengths of both Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs): they avoid the feature restriction problem of HMMs and the label bias problem of MEMMs.
The conditional distribution $p(y|x)$ of the label sequence $y$ given the input sequence $x$ is defined in linear-chain CRFs as:

$$p(y|x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t) \quad (1)$$

where the normalization factor $Z$ and the local function $\Psi_t$ are defined as:

$$Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\Big( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \Big) \quad (2)$$

$$\Psi_t(y_t, y_{t-1}, x_t) = \exp\Big( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \Big). \quad (3)$$

The parameter vector $\theta$ is learned via optimization algorithms such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) or Stochastic Gradient Descent (SGD). Each feature function $f_k$ is a non-zero function that represents features of state-observation pairs and of $y_{t-1}$ to $y_t$ state transitions [2].
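Our experiments use the Mallet framework (Section 4); purely as an illustration of how a linear-chain CRF of this form can be trained in practice, the following minimal sketch uses the open-source sklearn-crfsuite library with a toy sentence and hypothetical feature names, not our actual feature set or label scheme.

    # Minimal linear-chain CRF training sketch with sklearn-crfsuite (not the
    # Mallet setup used in our experiments). Tokens, features and labels are
    # a toy example only.
    import sklearn_crfsuite

    # One training "sentence": a parallel list of per-token feature dicts and
    # BIO-style labels (an assumption; the exact label scheme may differ).
    X_train = [[
        {"lower5": "alıcı", "shape": "Ss", "ends_colon": False},
        {"lower5": "adı:",  "shape": "sp", "ends_colon": True},
        {"lower5": "mehme", "shape": "Ss", "ends_colon": False},
    ]]
    y_train = [["O", "O", "B-CLIENT"]]

    # L-BFGS optimization with L1/L2 regularization learns the weights theta
    # of Equations (2) and (3).
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, y_train)

    print(crf.predict(X_train))  # e.g. [['O', 'O', 'B-CLIENT']]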
3.2 Selected Features
We have extracted features from most frequent 1-skip-2-grams, 2-skip-2-grams
and 3-skip-2-grams [13] along with most frequent 1-grams and 2-grams. While
N-grams denote most frequent phrases, skip-grams catch boundary words of
entities. In NER, some entities are usually wrapped between same/similar words
and to extract features from these patterns, we have used the most frequent skip-
grams. Known stemmers for Turkish4cannot handle noisy text and cause loss
of important information. Therefore, we have decided to do stemming by taking
the first five characters of words only while extracting N-gram and skip-gram
features.
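As an illustration of this step, the sketch below collects frequent skip-bigrams over five-character-truncated tokens; the counting scheme, the cut-off and the example sentence are assumptions rather than our exact implementation.

    # Sketch: first-five-character "stemming" and skip-bigram counting.
    # The cut-off (100) and the example sentence are illustrative assumptions.
    from collections import Counter
    from itertools import combinations

    def stem5(token):
        # Crude stem: keep only the first five characters, lowercased.
        return token[:5].lower()

    def skip_bigrams(tokens, max_skip):
        # Ordered token pairs with at most `max_skip` tokens between them
        # (max_skip=3 also covers plain, 1-skip and 2-skip bigrams).
        for i, j in combinations(range(len(tokens)), 2):
            if j - i - 1 <= max_skip:
                yield (tokens[i], tokens[j])

    sentence = "Hesabımızdan aşağıdaki hesaba havale yapınız".split()
    stems = [stem5(t) for t in sentence]

    counts = Counter(skip_bigrams(stems, max_skip=3))
    frequent_patterns = [pair for pair, _ in counts.most_common(100)]
    print(frequent_patterns[:3])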
Turkish person names are useful not only for the CLIENT entity type but also for ORG entities, since some Turkish organization names contain person names. Therefore, a person name gazetteer5 with 12,828 Turkish first names is used for extracting features.
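A boolean gazetteer lookup of this kind can be sketched as follows; the inline name set is only a stand-in, since the actual gazetteer is provided by the bank.

    # Sketch: boolean gazetteer feature. The inline name set stands in for the
    # bank-provided gazetteer of 12,828 Turkish first names.
    PERSON_NAMES = {"mehmet", "ayşe", "ali"}   # in practice: load the gazetteer

    def name_feature(token):
        # Proper Turkish lowercasing (dotted/dotless i) is ignored for simplicity.
        return {"IN PERSON GAZETTEER": token.lower() in PERSON_NAMES}

    print(name_feature("Mehmet"))   # {'IN PERSON GAZETTEER': True}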
In cases like Figure 2.b, entities come after header words (tuples) that end with a colon character. Moreover, amounts may be written in parentheses and may contain dot and comma characters. These are some of the many reasons why we extract features from punctuation characters. For example, if a word
4 Zemberek, https://github.com/ahmetaa/zemberek-nlp
5 The gazetteer is provided by the bank.
begins with a parenthesis and ends with a colon, the boolean features “BEGIN PARENTHESES” and “END COLON” will be extracted.
Whether a word begins or ends with a capital letter or contains numbers is valuable information for named entity recognition. To capture it, a shape character is extracted for each character of each word: “d” for digits, “p” for punctuation, “s” for lowercase letters and “S” for uppercase letters. After this pre-processing step, consecutive identical shape characters are collapsed and concatenated to form one complete feature. For instance, the feature “dpdpdpS” is extracted from the word “1,300.00-EUR”.
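The two feature extractors just described can be sketched as below; the feature name strings are illustrative, not the exact ones used in our system.

    # Sketch: punctuation and collapsed word-shape features for a single token.
    # Feature name strings are illustrative, not our exact ones.
    import string

    def shape(token):
        # Map each character to d/p/s/S and collapse consecutive repeats,
        # e.g. "1,300.00-EUR" -> "dpdpdpS", "TR01" -> "Sd".
        out = []
        for ch in token:
            if ch.isdigit():
                c = "d"
            elif ch in string.punctuation:
                c = "p"
            elif ch.isupper():
                c = "S"
            else:
                c = "s"
            if not out or out[-1] != c:
                out.append(c)
        return "".join(out)

    def punct_features(token):
        feats = {}
        if token and token[0] in string.punctuation:
            feats["BEGIN " + token[0]] = True   # e.g. BEGIN ( for "(1.500,00"
        if token and token[-1] in string.punctuation:
            feats["END " + token[-1]] = True    # e.g. END : for "IBAN:"
        return feats

    print(shape("1,300.00-EUR"), punct_features("hesabına:"))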
In summary, we have selected a subset of the features studied for NER in the literature [6]:
– Most frequent N-grams
– Most frequent skip-grams
– Person names
– Punctuation information
– Word shape information
POS tagging accuracy degrades significantly on the unstructured documents described in Section 2.3. Furthermore, due to the noisy text, the first or last n characters of a word are unreliable. For these reasons, fixed-length suffixes and prefixes, as well as POS tags, are not used as features.
3.3 Named Entity Types
In our study, the named entity types are more customized than ENAMEX (person, organization, location) and NUMEX (money and percentage). To clarify the problem, these entity types are described in Table 1.
4 Experimental Results
We created training data by annotating 365 real transaction order documents received by the bank during a period of two months. The training data was annotated by two different trained annotators, whose results were inspected and merged by an expert. The test data was gathered over a completely different time interval and annotated by another trained annotator, but checked by the same expert. Details of this manually annotated data are given in Table 2.
We used the open source Mallet6 framework for our experiments with Hidden Markov Models (HMMs) and CRFs. As a first experiment, we compared HMMs with CRFs and observed that CRFs without any features still provide better accuracy than HMMs with all features employed. Therefore, we used CRFs in the further evaluations.
6 Available from http://mallet.cs.umass.edu/
Table 1. Descriptions of labels

ACCOUNT NUMBER: Ranges between 6 and 9 characters, but is mostly given with 7-8 characters. In some cases it is concatenated to the bank branch code with a hyphen, or contains space characters, which makes it harder to extract (“332-12345678” or “123 45 678”).
AMOUNT: May be numeric, completely alphabetic or both, as in “1.500,00 TL (binbeşyüztürklirası)”, where the alphabetic part is written as a single concatenated word.
CLIENT: Corresponds to person names. However, documents may also contain irrelevant person names (bank employee names etc.), therefore not all person names are considered CLIENT.
CURRENCY: Currency types of different countries, which may also be written as abbreviations. “Euro” can be written as “Avro” (in Turkish), “EUR” or “€”.
EXPLANATION: Composed of multiple alphanumeric words. Since it does not have any special form, this entity is the hardest one to extract.
IBAN: 15 to 34 alphanumeric characters depending on the country. Since it begins with two uppercase characters, it is relatively easy to detect with a proper feature set.
ORGANIZATION: Organization names, just as in ENAMEX. They may end with suffixes such as “a.ş”, “ltd.” or “şti.”, which makes them easier to predict.
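As noted for the IBAN label in Table 1, a simple pattern-based feature already flags most IBAN-like spans. The sketch below shows one possible such feature; the regular expression is an assumption for illustration and is not part of our feature set.

    # Sketch: a boolean "looks like an IBAN" feature. The pattern is an
    # assumption: two uppercase letters, two check digits, then 3-8 further
    # groups of 1-4 alphanumeric characters, optionally space-separated.
    import re

    IBAN_LIKE = re.compile(r"^[A-Z]{2}\d{2}(?: ?[A-Z0-9]{1,4}){3,8}$")

    def iban_feature(span):
        return {"LOOKS LIKE IBAN": bool(IBAN_LIKE.match(span))}

    print(iban_feature("TR01 2345 6789 1234 5678 9012 34"))  # True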
Table 2. Counts of entities in our training and test data. “O” denotes words that are not entities; these are not added to the total.
Label Training Test
ACC 457 199
AMOUNT 736 391
CLIENT 227 66
CURR 542 264
EXPL 428 349
IBAN 1517 1061
ORG 3302 2028
O 27037 15574
Total 7209 4358
When the previous and next tokens (and their features, if available) within a [-3,3] window are included as features (+window) for each word, the overall CRF performance increases by 21 points. We take this method as our baseline and gradually add the features described in Section 3.2; a sketch of the windowing is given after this paragraph. As expected, the punctuation-specific features (+punc) especially increase the F1 score of the AMOUNT entity. Moreover, including the word shape features (+wordshape) tremendously improves the IBAN entity, since the features “Sd”, “d”, “d”, “d” are extracted from IBANs such as “TR01 2345 6789 1234”. The person name features (+names) are included in the feature list to increase CLIENT detection performance; as a result, the F-score of the CLIENT entity rises to 40.85%. Finally, with the n-gram and skip-gram features (+gram), the CRF gains the ability to predict EXPL entities. However, since predicting EXPL entities is hard due to their free form, only 21% of these entities are correctly extracted.
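The windowing step can be sketched as follows, under the assumption that neighbour features are copied into each token's feature set with a positional prefix; the exact mechanism in our system is not spelled out above, so treat this as one plausible realization.

    # Sketch: copy each neighbour's features into the current token's feature
    # dict with a positional prefix, for a [-3, 3] window. The prefix scheme
    # is an assumption.
    def add_window_features(sent_feats, window=3):
        # sent_feats: list of per-token feature dicts for one document.
        enriched = []
        for i, feats in enumerate(sent_feats):
            combined = dict(feats)
            for offset in range(-window, window + 1):
                j = i + offset
                if offset == 0 or not 0 <= j < len(sent_feats):
                    continue
                for name, value in sent_feats[j].items():
                    combined["%+d:%s" % (offset, name)] = value
            enriched.append(combined)
        return enriched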
Table 3. F1 scores of CRF and HMM. “+” means that the feature set is added to the model in the row above.
Features ACC AMOUNT CLIENT CURR EXPL IBAN ORG AVG
HMM 0 21.88 0 27.41 0 8.21 28.55 19.76
+all 0 46.65 0 65.88 0 6.78 29.26 23.45
CRF 5.85 52.56 0 86.39 5.03 47.92 32.72 40.26
+window 71.06 58.4 0 83.95 4.42 75.14 56.86 61.3
+punc 72.88 68.17 0 84.68 3.68 73.91 59.16 62.91
+wordshape 87.74 70.92 0 83.67 5.9 91.73 60.78 69.51
+names 87.19 70.83 40.85 83.27 5.8 91.95 61.61 70.17
+gram 86.33 72.23 36.73 86.51 21.05 89.62 67.18 72.77
Up to this point, the reported F1 measures have used the CoNLL (Conference on Computational Natural Language Learning) metric, in which a prediction is considered a true positive only if all words of the entity are predicted correctly [3]. However, we also measured F1 scores using word-by-word predictions (RAW) and can conclude that segmentation errors are responsible for a drop of about 7 points in the F1 measure, as given in Table 4.
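To make the difference between the two scoring schemes concrete, the sketch below contrasts entity-level (CoNLL-style) and token-level scoring on a toy example, using the open-source seqeval and scikit-learn packages; our own evaluation code may differ in detail, for example in how “O” tokens are handled.

    # Sketch: entity-level (CoNLL-style) vs. token-level F1 on toy BIO tags.
    # seqeval scores whole entities; scikit-learn scores tokens independently.
    from seqeval.metrics import f1_score as entity_f1
    from sklearn.metrics import f1_score as token_f1

    y_true = [["B-IBAN", "I-IBAN", "I-IBAN", "O", "B-AMOUNT"]]
    y_pred = [["B-IBAN", "I-IBAN", "O",      "O", "B-AMOUNT"]]

    # Entity level: the partially recovered IBAN counts as a miss -> F1 = 0.5.
    print("CoNLL-style F1:", entity_f1(y_true, y_pred))

    # Token level: score only entity labels, ignoring "O" (one assumption of
    # how a word-by-word "RAW" score could be computed).
    flat_true = [t for sent in y_true for t in sent]
    flat_pred = [t for sent in y_pred for t in sent]
    entity_labels = ["B-IBAN", "I-IBAN", "B-AMOUNT"]
    print("token-level F1:", token_f1(flat_true, flat_pred,
                                      labels=entity_labels, average="micro"))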
5 Conclusions and Future Work
In this work, we have described our methodology for tackling the NER problem in noisy Turkish banking documents. In our method, we train CRFs with n-gram, skip-gram, punctuation and word shape features, along with features from a person name gazetteer. We compared the F-scores (CoNLL metric) of each entity type as the proposed features were added step by step, in order to see their impact on the detection of specific entity types. Our final model achieves a 72.77% F-score, which is significantly lower than the scores of models trained on news data. We had
Table 4. F1 scores of our method with all features, on 5-fold cross-validation and on the test set.
Cross-validation Test Set
Label RAW CoNLL RAW CoNLL
ACC 91.04 90.63 86.86 86.33
AMOUNT 83.45 81.67 79.14 72.23
CLIENT 61.06 61.06 42.18 36.73
CURR 87.87 87.66 86.51 86.51
EXPL 65.99 56.49 28.23 21.05
IBAN 95.16 90.68 96.82 89.62
ORG 82.05 73.23 74.32 67.18
AVG 84.72 78.89 79.11 72.77
expected the results to be worse than in the news domain due to the noisy text produced by the OCR system and the unstructured sentences.
This study has been implemented as an information extraction system integrated into the core banking system. About a thousand maker (data entry) and checker (data validation) human operators currently work on completing roughly 100 thousand customer orders arriving at the bank each day. Our system automates the maker operators' data entry effort, minimizes the time required to enter data and eliminates possible typing errors. After the system integration, 20% of the manual workforce was saved. This was accomplished by digitalizing 53% of the process workflow, replacing the operators' manual work with the system. As a result, the overall cycle time of the target processes is reduced significantly. For instance, the book-to-book money transfer cycle time decreased from 45 minutes to 13 minutes, and the EFT cycle time decreased from 31 minutes to 19 minutes. On a yearly basis, 53% digitalization means 3.2 million transactions completed without human workforce, since roughly 6.5 million transactions arrive each year.
In the future, text normalization methods might be applied to reduce the
negative effects of noise. Also, ensemble methods like bagging and boosting can
be tested to improve classification performance. In addition to the person name gazetteer discussed in Section 3.2, an organization name gazetteer might be used to increase ORG entity prediction performance.
Some documents contain multiple orders in a single sentence (compound orders), such as a money transfer followed by a currency exchange. These kinds of documents require further processing, since entity extraction alone is not enough and the relations between entities should also be extracted. As future work, we aim to extract these kinds of relations between entities.
References
1. Lafferty, J., McCallum, A., Pereira, F. C.: Conditional random fields: Probabilistic
models for segmenting and labeling sequence data (2001).
2. Sutton, C., McCallum, A.: An introduction to conditional random fields. Machine
Learning, 4(4), 267-373 (2011).
3. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification.
Lingvisticae Investigationes, 30(1), 3-26 (2007).
4. Seker, G. A., Eryigit, G.: Initial Explorations on using CRFs for Turkish Named
Entity Recognition. COLING, 2459-2474 (2012).
5. Yeniterzi, R.: Exploiting morphology in Turkish named entity recognition system.
Proceedings of the ACL 2011 Student Session. Association for Computational Lin-
guistics, 105-110 (2011).
6. Tkachenko, M., Simanovsky, A.: Named entity recognition: Exploring features.
KONVENS, 118-127 (2012).
7. Tatar, S., Cicekli, I.: Automatic rule learning exploiting morphological features for
named entity recognition in Turkish. Journal of Information Science, 37(2), 137-151
(2011).
8. Kucuk, D., Steinberger, R.: Experiments to improve named entity recognition on
Turkish tweets. Proceedings of the 5th Workshop on Language Analysis for Social
Media, 71-78 (2014).
9. Yamada, I., Takeda, H., Takefuji, Y.: Enhancing Named Entity Recognition in Twit-
ter Messages Using Entity Linking. ACL-IJCNLP, 136 (2015).
10. Eken, B., Tantug, C.: Recognizing named entities in Turkish tweets. Proceedings
of the Fourth International Conference on Software Engineering and Applications,
Dubai (2015).
11. Tur, G., Hakkani-Tur, D., Oflazer, K.: A statistical information extraction system
for Turkish. Natural Language Engineering, 9(02), 181-210 (2003).
12. Celikkaya, G., Torunoglu, D., Eryigit, G.: Named entity recognition on real data: a
preliminary investigation for Turkish. 7th International Conference on Application
of Information and Communication Technologies (AICT), 1-5 (2013).
13. Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram
modelling. Proceedings of the 5th international Conference on Language Resources
and Evaluation (LREC-2006), 1-4 (2006).
14. Settles, B.: Biomedical named entity recognition using conditional random fields
and rich feature sets. Proceedings of the International Joint Workshop on Natural
Language Processing in Biomedicine and its Applications, 104-107 (2004).
15. Klinger, R., Friedrich, C. M., Fluck, J., Hofmann-Apitius, M.: Named entity recog-
nition with combinations of conditional random fields. Proceedings of the second
biocreative challenge evaluation workshop (2007).
16. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. Proceedings
of the 2003 Conference of the North American Chapter of the Association for Com-
putational Linguistics on Human Language Technology-Volume 1, 134-141 (2003).