A Bank Information Extraction System Based on Named Entity Recognition with CRFs from Noisy Customer Order Texts in Turkish
Erdem Emekligil, Secil Arslan, and Onur Agin
R&D and Special Projects Department, Yapi Kredi Technology, Istanbul, Turkey
Email: {erdem.emekligil, secil.arslan, onur.agin}@ykteknoloji.com.tr
Abstract. Each day, hundreds of thousands of customer transactions arrive at the bank's operation center via the fax channel. The information required to complete each transaction (money transfer, salary payment, tax payment etc.) is extracted manually by operators from the images of customer orders. Our information extraction system uses CRFs (Conditional Random Fields) to obtain the named entities required for each transaction type from the noisy text of customer orders. The difficulty of the problem arises from the facts that every customer order has a different format, that the image resolution of the orders is so low that the OCR-ed (Optical Character Recognition) texts are highly noisy, and that Turkish is still challenging for natural language processing techniques due to the structure of the language. This paper describes the difficulties of our problem domain and provides details of the methodology developed for extracting entities such as client name, organization name, bank account number, IBAN number, amount, currency and explanation.
Keywords: Named Entity Recognition, Turkish, Conditional Random
Fields, Noisy Text, Banking Applications
1 Introduction
Natural Language Processing (NLP) systems have been employed for many different tasks in Information Retrieval (IR). The specific task of extracting predefined types of data from documents is referred to as Named Entity Recognition (NER).
Each banking process, such as money transfers, tax payments etc., is performed on a maker-checker basis. The maker is responsible for entering the data required to initiate a customer order, whereas the checker has the final approval responsibility, to prevent fraud at the data entry step. Customer orders are mostly received via the fax channel, which means that each order document is in image format. Due to the low resolution settings of fax machines, OCR on customer order images generates noisy text and challenges even state-of-the-art NER systems. In this paper, we present a fast and robust information extraction system that harnesses NER to automatically extract the entities necessary for banking processes from OCR-ed Turkish customer order documents, and hence minimizes the effort and time consumed by data entry.
There have been many applications of Turkish NER in the literature. Tatar and Cicekli [7] extracted different features from text via automatic rule learning. Their method achieved a 91.08% F-score on their TurkIE Turkish news data¹. Yeniterzi [5] exploits morphology by capturing syntactic and contextual properties of tokens and reports an 88.94% F-score on Tur et al.'s Turkish general news data [11] using CRFs. Seker and Eryigit [4] also used CRFs, along with morphological and lexical features. Moreover, they made use of large-scale person and location gazetteers. Their state-of-the-art framework reports a 91.94% F-score on Tur et al.'s data.
¹ The TurkIE dataset contains approximately 55K tokens from news articles on terrorism, from both online and print news sources.
The Turkish newspaper domain is well studied and its results are close to those for English. However, in both English and Turkish, results on unstructured, noisy data such as Twitter are comparatively poor. Since what makes NER on customer orders difficult is noisy, unstructured text, our problem is, to some degree, similar to NER on Twitter data. Yamada et al. [9] report the best performing method of the ACL 2015 Twitter NER shared task², with a 56.41% F-score on English Twitter data, exploiting open knowledge bases such as DBpedia and Freebase. Celikkaya et al. [12] trained a CRF model on news data and achieved a 19% F-score on their Turkish tweet test data³. Kucuk and Steinberger [8] reported a diacritics-based expansion of the lexical resources of their base NER system that yields a 38.01% PLO (person, location, organization) overall F-score on Celikkaya et al.'s data and a 48.13% PLO overall F-score on their own Turkish Twitter data. Eken and Tantug [10] exploited gazetteers along with some basic features. They used Celikkaya et al.'s Turkish Twitter data for training and their own annotated Turkish Twitter data for testing, on which they report a 64.03% F-score.
² ACL 2015 Workshop on Noisy User-generated Text (W-NUT).
³ Celikkaya et al.'s data contains approximately 5K tweets, with about 50K tokens.
The remainder of this paper is structured as follows: in Section 2, we explain the problems that differentiate our system from other NER systems. We provide the details of our approach in Section 3 and share our experimental results in Section 4. Finally, we conclude with our remarks in Section 5.
2 Problem Definition
In this section, we give brief information on three main challenges:
- Turkish as a language with complex morphology and low resources
- Noisy text generated by OCR systems
- Lack of consensus on document types in banking systems
2.1 Turkish
Turkish is a highly agglutinative and morphologically rich language. In agglutinative languages, each affix adds a different meaning to the word. For instance, after adding the plural suffix "-lar" to the word "kitap" (book), it becomes "kitaplar" (books). Furthermore, appending the possessive suffix "-ım" gives the word "kitaplarım" (my books) a possessive meaning. Common NLP techniques developed for English, such as stemming and tokenization, do not perform as well for Turkish. For instance, since Turkish is an agglutinative language, known stemming techniques lose important information that is embedded in the morphology: the word "hesabımıza" (to our account) loses its possessive information after it is stemmed to "hesap" (account). In contrast to English, which has the constituent order SVO (Subject-Verb-Object), Turkish has the order SOV. Moreover, inverted sentences that change the SOV order are often used in Turkish, which makes the NER task on Turkish even harder. For example, the sentences "Ali kitap oku" and "Kitap oku Ali" have the same meaning, "Ali, read the book".
2.2 Noisy Text
To convert the fax documents to text, an Optical Character Recognition (OCR) tool is used. However, the customer order document images often have low resolution, and the OCR itself is error-prone, so it may produce text that deviates from its original form. For instance, as shown in Figure 1, the word "nezdindeki" (in care of) is recognized as "nezrimdeki", which is not a valid Turkish word. Statistically speaking, this word is often used before account numbers and therefore carries valuable information about the whereabouts of account numbers. As we discuss in the following sections, "nezdindeki" is used as a feature along with many other words, and losing such words during OCR has a huge impact on system performance.
Fig. 1. Sample fax document and its OCR result. The word "nezdindeki" and its OCR result are underlined. Private client information is masked for security reasons.
2.3 Document Types
In the banking domain, the customer transaction order documents that arrive via the fax channel are created by the customers themselves. Since customers can create their own documents and no template is provided by the bank, the resulting documents can be very diverse. A document might therefore contain multiple transactions, which increases the entity counts on the page significantly. In addition, spelling and capitalization errors made by customers affect the system, since the capitalization information of words is used as a feature.
Fig. 2. Sample blurred transaction documents received by fax, with their annotated entities. Each entity type is annotated in a different color. The documents are made unreadable by blurring to protect private customer information, but the structure of the documents and the relations between different entity types can still be seen clearly. a) Document containing a table-like structure. b) Document containing a tuple structure. c) Unstructured document.
Although the language used in the documents is Turkish, it may not be as well formatted as newspaper text. While some documents contain parts that resemble tables (Figure 2.a), others may contain a header-entity tuple structure (Figure 2.b), such as "Alıcı adı: Mehmet" (Receiver name: Mehmet) or "Alıcı IBAN: TR01 2345 6789 ..." (Receiver IBAN: TR01 2345 6789 ...). As can be seen in Figure 2, the frequencies of entity types may differ from document to document, and no explicit relation can be found between these entity types.
3 Proposed Methodology
In this section, we briefly explain Conditional Random Fields (CRFs) and then describe the features we use with the CRF, along with the named entities it outputs.
3.1 Conditional Random Fields
Conditional Random Fields (CRFs) [1] are undirected conditional probabilistic graphical models that are heavily used in Named Entity Recognition [14,15], Part-of-Speech (POS) tagging [16] and many other tasks. We are interested in linear-chain CRFs, which solve the feature restriction problem of the Hidden Markov Model (HMM) and the label bias problem of the Maximum Entropy Markov Model (MEMM) by combining the good sides of both models.
The conditional distribution p(y \mid x) of labels y, given inputs x, is defined in linear-chain CRFs as

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t)   (1)

where the normalization factor Z and the local function \Psi_t are defined as

Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\Big( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \Big)   (2)

\Psi_t(y_t, y_{t-1}, x_t) = \exp\Big( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \Big).   (3)

The parameter vector \theta is learned via optimization algorithms such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) or Stochastic Gradient Descent (SGD). Each feature function f_k is a non-zero function that represents features of state-observation pairs and of the y_{t-1} to y_t state transitions [2].
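For concreteness, the following minimal sketch shows how such a linear-chain CRF can be trained with the open-source sklearn-crfsuite library. This is an illustration only: our experiments use the Mallet framework (Section 4), and the token features and toy sentence below are hypothetical placeholders rather than our actual feature set.

    # Minimal linear-chain CRF sketch using sklearn-crfsuite
    # (pip install sklearn-crfsuite). A stand-in for the Mallet-based
    # setup used in the experiments; features here are placeholders.
    import sklearn_crfsuite

    def token_features(tokens, i):
        word = tokens[i]
        return {
            "lower": word.lower(),
            "stem5": word.lower()[:5],        # 5-character stem (Section 3.2)
            "is_upper": word.isupper(),
            "has_digit": any(c.isdigit() for c in word),
        }

    def sentence_features(tokens):
        return [token_features(tokens, i) for i in range(len(tokens))]

    # One toy training sentence with IOB labels (hypothetical data).
    X_train = [sentence_features(["Alıcı", "IBAN", ":", "TR01", "2345"])]
    y_train = [["O", "O", "O", "B-IBAN", "I-IBAN"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_train, y_train)                 # learns the theta of Eqs. (2)-(3)
    print(crf.predict(X_train))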
3.2 Selected Features
We extract features from the most frequent 1-skip-2-grams, 2-skip-2-grams and 3-skip-2-grams [13], along with the most frequent 1-grams and 2-grams. While the n-grams capture the most frequent phrases, the skip-grams catch the boundary words of entities: in NER, some entities are usually wrapped between the same or similar words, and the most frequent skip-grams let us extract features from these patterns. Known stemmers for Turkish⁴ cannot handle noisy text and cause the loss of important information. Therefore, we perform stemming by simply taking the first five characters of each word when extracting the n-gram and skip-gram features.
⁴ Zemberek, https://github.com/ahmetaa/zemberek-nlp
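As an illustration of this step, the sketch below (our own; the sample sentence is hypothetical) collects frequency counts of n-grams and k-skip-2-grams over 5-character stems:

    # Sketch: most frequent n-grams and k-skip-2-grams over 5-character stems.
    from collections import Counter
    from itertools import combinations

    def stem5(token):
        return token.lower()[:5]  # crude stemming: keep the first five characters

    def ngrams(stems, n):
        return [tuple(stems[i:i + n]) for i in range(len(stems) - n + 1)]

    def skip2grams(stems, k):
        # 2-grams with at least 1 and at most k skipped tokens in between
        # (contiguous 2-grams are already counted by ngrams(stems, 2))
        return [(stems[i], stems[j])
                for i, j in combinations(range(len(stems)), 2)
                if 1 <= j - i - 1 <= k]

    tokens = "havale tutarı 1.500,00 TL hesabımıza".split()  # hypothetical order text
    stems = [stem5(t) for t in tokens]
    families = {
        "1gram": ngrams(stems, 1), "2gram": ngrams(stems, 2),
        "1skip2": skip2grams(stems, 1), "2skip2": skip2grams(stems, 2),
        "3skip2": skip2grams(stems, 3),
    }
    for name, grams in families.items():
        print(name, Counter(grams).most_common(3))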
Turkish person names are useful not only for the CLIENT entity type but also for ORG entities, since some Turkish organization names contain person names. Therefore, a person name gazetteer⁵ with 12,828 Turkish first names is used for extracting features.
⁵ The gazetteer is provided by the bank.
In cases like Figure 2.b, entities come after words (tuples) that end with a colon character. Moreover, amounts may be written in parentheses and may contain dot and comma characters. These are some of the many reasons why we decided to extract features from punctuation characters. For example, if a word begins with a parenthesis and ends with a colon, the boolean features "BEGIN PARENTHESES" and "END COLON" will be extracted.
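A minimal sketch of these boolean punctuation features follows; the helper names are our own, with only the feature names "BEGIN PARENTHESES" and "END COLON" taken from the description above:

    # Sketch of the punctuation-based boolean features described above.
    import string

    def punct_name(ch):
        names = {"(": "PARENTHESES", ")": "PARENTHESES", ":": "COLON",
                 ",": "COMMA", ".": "DOT", "-": "HYPHEN"}
        return names.get(ch, "PUNCT")

    def punctuation_features(word):
        feats = {}
        if word and word[0] in string.punctuation:
            feats["BEGIN_" + punct_name(word[0])] = True
        if word and word[-1] in string.punctuation:
            feats["END_" + punct_name(word[-1])] = True
        feats["HAS_DOT"] = "." in word      # dots and commas occur in amounts
        feats["HAS_COMMA"] = "," in word
        return feats

    print(punctuation_features("(1.500,00):"))  # BEGIN_PARENTHESES, END_COLON, ...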
Whether a word begins or ends with a capital letter, or contains numbers, is valuable information for named entity recognition. To obtain it, a shape feature is extracted for each character of each word: "d" for a digit, "p" for punctuation, "s" for a lowercase letter and "S" for an uppercase letter. After this pre-processing step, consecutive identical shape features are collapsed and the result is concatenated into one complete feature. For instance, the feature "dpdpdpS" is extracted from the word "1,300.00-EUR".
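The following short sketch reproduces this shape feature (our own illustration of the mapping described above):

    # Sketch of the word-shape feature: map each character to d/p/s/S and
    # collapse consecutive repeats, reproducing "dpdpdpS" for "1,300.00-EUR".
    def shape(word):
        classes = []
        for ch in word:
            if ch.isdigit():
                c = "d"
            elif ch.isalpha():
                c = "S" if ch.isupper() else "s"
            else:
                c = "p"  # punctuation and anything else
            if not classes or classes[-1] != c:
                classes.append(c)  # keep only one symbol per run
        return "".join(classes)

    print(shape("1,300.00-EUR"))  # -> dpdpdpS
    print(shape("TR01"))          # -> Sd, cf. the IBAN discussion in Section 4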
As discussed so far, we have decided on a subset of the features studied for NER in the literature [6]:
- Most frequent n-grams
- Most frequent skip-grams
- Person names
- Punctuation information
- Word shape information
POS tagging accuracy degrades significantly on the unstructured documents mentioned in Section 2.3. Furthermore, due to the noisy text, the first or last n characters of a word are untrustworthy. For these reasons, fixed-length suffixes and prefixes, as well as POS tags, are not used as features.
3.3 Named Entity Types
In our study, the named entities are more customized than ENAMEX (person, organization, location) and NUMEX (money and percentage). To understand the problem better, these entity types are explained in Table 1.
4 Experimental Results
We created our training data by annotating 365 real transaction order documents received by the bank over a period of two months. The training data was annotated by two trained annotators, and their results were inspected and merged by an expert. The test data was gathered in a completely different time interval and annotated by another trained annotator, but controlled by the same expert. Details of this manually annotated data are given in Table 2.
We used the open-source Mallet⁶ framework for our experiments with the Hidden Markov Model (HMM) and CRFs. As a first experiment, we compared the HMM with CRFs without features and observed that CRFs provided better results in terms of accuracy even when the HMM was given all features. We therefore used CRFs in all further evaluations.
⁶ Available from http://mallet.cs.umass.edu/
Table 1. Descriptions of labels
ACCOUNT NUMBER: Ranges between 6 and 9 characters, but is mostly given with 7-8 characters. In some cases it is concatenated to the bank branch code with a hyphen character or contains space characters, which makes it harder to extract ("332-12345678" or "123 45 678").
AMOUNT: The amount can be numeric, completely alphabetic, or both, as in "1.500,00 TL (binbeşyüztürklirası)", which has its alphabetic part written concatenated.
CLIENT: Corresponds to person names. However, documents may also contain irrelevant person names (a bank employee's name etc.), so not all person names are considered CLIENT.
CURRENCY: Currency types of different countries; these can also be written as abbreviations. "Euro" can be written as "Avro" (in Turkish), "EUR" or "€".
EXPLANATION: Composed of multiple alphanumeric words. Since it does not have any special form, this entity is the hardest one to extract.
IBAN: 15 to 34 alphanumeric characters depending on the country. Since it begins with two uppercase characters, it is relatively easy to detect with a proper feature set.
ORGANIZATION: Organization names, just as in ENAMEX. They may end with suffixes such as "a.ş.", "ltd." or "şti.", which makes them easier to predict.
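As the table notes, the IBAN's regular surface form makes it comparatively easy to capture. Purely as an illustration (not a component of the described system), a rough regular-expression candidate detector might look like:

    import re

    # Rough IBAN candidate pattern: two uppercase letters, two digits, then
    # 11-30 more alphanumeric characters, optionally separated by spaces.
    # A sketch only; real IBAN validation also checks country-specific
    # lengths and the checksum digits.
    IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]){11,30}\b")

    print(IBAN_RE.findall("Alıcı IBAN: TR01 2345 6789 1234 5678 90"))
    # -> ['TR01 2345 6789 1234 5678 90']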
Table 2. Counts of entities in our training and test data. "O" denotes words that are not entities; these are not added to the totals.
Label     Training   Test
ACC            457    199
AMOUNT         736    391
CLIENT         227     66
CURR           542    264
EXPL           428    349
IBAN          1517   1061
ORG           3302   2028
O            27037  15574
Total         7209   4358
When the previous and next tokens (and their features, if available) within a [-3,3] window are included as features (+window) for each word, overall CRF performance increases by 21%. We take this method as our baseline and gradually add the features described in Section 3.2. As expected, the punctuation-specific features (+punc) especially increased the F1 score of the AMOUNT entity. Moreover, including the word shape features (+wordshape) tremendously improved the IBAN entity, since the features "Sd", "d", "d", "d" will be extracted, respectively, from IBANs such as "TR01 2345 6789 1234". The person name features (+names) were included in the feature list to increase CLIENT detection performance; as a result, the F-score of the CLIENT entity increased to 40.85%. Finally, with the n-gram and skip-gram features (+gram), the CRF gains the ability to predict EXPL entities. However, since EXPL entities are hard to predict due to their free form, only 21% of these entities are correctly extracted.
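A sketch of the +window expansion follows (our own illustration; the real window features depend on the full feature set of Section 3.2):

    # Sketch of the [-3, 3] window expansion (+window): each token's feature
    # dict is augmented with its neighbours' features, prefixed by the offset.
    def add_window(sent_feats, window=3):
        expanded = []
        for i, feats in enumerate(sent_feats):
            out = dict(feats)
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(sent_feats):
                    for name, value in sent_feats[j].items():
                        out[f"{off:+d}:{name}"] = value
            expanded.append(out)
        return expanded

    print(add_window([{"w": "Alıcı"}, {"w": "IBAN"}, {"w": ":"}])[1])
    # -> {'w': 'IBAN', '-1:w': 'Alıcı', '+1:w': ':'}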
Table 3. F1 scores of CRF and HMM. "+" indicates that the feature is added to the model in the row above.
Features     ACC    AMOUNT  CLIENT  CURR   EXPL   IBAN   ORG    AVG
HMM           0.00   21.88    0.00  27.41   0.00   8.21  28.55  19.76
+all          0.00   46.65    0.00  65.88   0.00   6.78  29.26  23.45
CRF           5.85   52.56    0.00  86.39   5.03  47.92  32.72  40.26
+window      71.06   58.40    0.00  83.95   4.42  75.14  56.86  61.30
+punc        72.88   68.17    0.00  84.68   3.68  73.91  59.16  62.91
+wordshape   87.74   70.92    0.00  83.67   5.90  91.73  60.78  69.51
+names       87.19   70.83   40.85  83.27   5.80  91.95  61.61  70.17
+gram        86.33   72.23   36.73  86.51  21.05  89.62  67.18  72.77
Up to this point, the given F1 measures were in the CoNLL (Conference on Computational Natural Language Learning) metric. In the CoNLL metric, a prediction is considered a true positive only if all words of the entity are predicted correctly [3]. However, we also measured F1 scores using word-by-word predictions (RAW), and we can conclude that segmentation is responsible for the roughly 7-point drop in the F1 measure, as given in Table 4.
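To illustrate the difference between the two metrics, the following toy sketch (using the seqeval library, which implements CoNLL-style exact entity matching; the word-by-word computation is a simple stand-in for our RAW metric, not its exact definition) shows how a single missed boundary token zeroes the entity-level score while the token-level score stays high:

    # Toy comparison of CoNLL (entity-level, exact span match) vs RAW
    # (word-by-word) F1. Requires: pip install seqeval
    from seqeval.metrics import f1_score as conll_f1

    y_true = [["B-IBAN", "I-IBAN", "I-IBAN", "O"]]
    y_pred = [["B-IBAN", "I-IBAN", "O",      "O"]]  # last IBAN token missed

    print("CoNLL F1:", conll_f1(y_true, y_pred))  # 0.0 -- span not matched exactly

    # Word-by-word agreement on entity tokens (a simplified RAW stand-in)
    tp = sum(t == p != "O" for t, p in zip(y_true[0], y_pred[0]))
    prec = tp / sum(p != "O" for p in y_pred[0])
    rec = tp / sum(t != "O" for t in y_true[0])
    print("RAW F1:", 2 * prec * rec / (prec + rec))  # 0.8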
5 Conclusions and Future Work
In this work, we described our methodology for tackling the NER problem in noisy Turkish banking documents. In our method, we trained CRFs with n-gram, skip-gram, punctuation and shape features, along with features derived from a person name gazetteer. We compared the F-scores (CoNLL metric) of each entity type with the proposed features added step by step, in order to see their impact on the detection of specific entity types. Our final model achieved a 72.77% F-score, which is significantly lower than that of models trained on news data.
Table 4. F1 scores of our method with all features, on 5-fold cross-validation and on the test set.
            Cross-validation       Test set
Label       RAW      CoNLL        RAW      CoNLL
ACC         91.04    90.63        86.86    86.33
AMOUNT      83.45    81.67        79.14    72.23
CLIENT      61.06    61.06        42.18    36.73
CURR        87.87    87.66        86.51    86.51
EXPL        65.99    56.49        28.23    21.05
IBAN        95.16    90.68        96.82    89.62
ORG         82.05    73.23        74.32    67.18
AVG         84.72    78.89        79.11    72.77
We expected the result to be worse than in the news domain, due to the noisy text produced by the OCR system and the unstructured sentences.
This study has been implemented as an information extraction system that is integrated into core banking. There are about a thousand maker (data entry) and checker (data validation) human operators currently working to complete the roughly 100 thousand customer orders arriving at the bank each day. Our system automates the maker operators' data entry effort, minimizes the time required to enter the data and eliminates possible typing errors. After the system integration, 20% of the manual workforce was saved. This was accomplished by digitalizing 53% of the process workflow, replacing the operators' workforce with the system. As a result, the overall cycle time of the target processes is reduced significantly. For instance, the book-to-book money transfer cycle time decreased from 45 minutes to 13 minutes, and the EFT cycle time decreased from 31 minutes to 19 minutes. On a yearly basis, 53% digitalization means about 3.2 million transactions completed without human workforce, since roughly 6.5 million transactions arrive each year.
In the future, text normalization methods might be applied to reduce the negative effects of the noise. Ensemble methods such as bagging and boosting could also be tested to improve classification performance. In addition to the person name gazetteer discussed in Section 3.2, an organization name gazetteer might be used to increase ORG entity prediction performance.
Some documents contain multiple orders in a single sentence (compound orders), such as a money transfer followed by a currency exchange. These kinds of documents require further processing, since entity extraction alone is not enough and the relations between entities should also be extracted. As future work, we aim to extract such relations between entities.
References
1. Lafferty, J., McCallum, A., Pereira, F. C.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (ICML) (2001).
2. Sutton, C., McCallum, A.: An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4), 267-373 (2011).
3. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3-26 (2007).
4. Seker, G. A., Eryigit, G.: Initial explorations on using CRFs for Turkish named entity recognition. COLING, 2459-2474 (2012).
5. Yeniterzi, R.: Exploiting morphology in Turkish named entity recognition system. Proceedings of the ACL 2011 Student Session, Association for Computational Linguistics, 105-110 (2011).
6. Tkachenko, M., Simanovsky, A.: Named entity recognition: Exploring features. KONVENS, 118-127 (2012).
7. Tatar, S., Cicekli, I.: Automatic rule learning exploiting morphological features for named entity recognition in Turkish. Journal of Information Science, 37(2), 137-151 (2011).
8. Kucuk, D., Steinberger, R.: Experiments to improve named entity recognition on Turkish tweets. Proceedings of the 5th Workshop on Language Analysis for Social Media, 71-78 (2014).
9. Yamada, I., Takeda, H., Takefuji, Y.: Enhancing named entity recognition in Twitter messages using entity linking. ACL-IJCNLP, 136 (2015).
10. Eken, B., Tantug, C.: Recognizing named entities in Turkish tweets. Proceedings of the Fourth International Conference on Software Engineering and Applications, Dubai (2015).
11. Tur, G., Hakkani-Tur, D., Oflazer, K.: A statistical information extraction system for Turkish. Natural Language Engineering, 9(2), 181-210 (2003).
12. Celikkaya, G., Torunoglu, D., Eryigit, G.: Named entity recognition on real data: A preliminary investigation for Turkish. 7th International Conference on Application of Information and Communication Technologies (AICT), 1-5 (2013).
13. Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram modelling. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), 1-4 (2006).
14. Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, 104-107 (2004).
15. Klinger, R., Friedrich, C. M., Fluck, J., Hofmann-Apitius, M.: Named entity recognition with combinations of conditional random fields. Proceedings of the Second BioCreative Challenge Evaluation Workshop (2007).
16. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, 134-141 (2003).