Enhancement of String Matching Queries on Albanian Names for
Kosovo Civil Registry
BLERIM REXHA
VALON RAÇA
AGNI DIKA
Faculty of Electrical and Computer Engineering
University of Prishtina
Kodra e Diellit p.n., 10000 Prishtina
KOSOVO
blerim.rexha@uni-pr.edu, valon.raca@uni-pr.edu, agni.dika@uni-pr.edu
http://www.uni-pr.edu/
Abstract: - The Civil Registry Information System serves as an essential data source for all e-government
services. Searching for a citizen’s data in the Civil Registry Database is usually done by providing unique
keywords such as the name. Because some Albanian consonants are pronounced similarly, for example gj [ɟ] and
xh [d͡ʒ], problems arise in finding the data of citizens whose names are pronounced similarly but spelled
differently. This paper presents a novel approach to string matching on Albanian names. Levenshtein distance,
American Soundex and a modified Soundex are compared on a database of 271,000 citizens of Prishtina
municipality. The modified Soundex algorithm accommodates basic rules of Albanian pronunciation, and its
accuracy and efficiency are better than those of Levenshtein distance and American Soundex.
Key-Words: - String matching, Levenshtein, Soundex, Civil Registry
1 Introduction
Until recently, the Kosovo Civil Registry (KCR) was implemented as a distributed system comprising numerous
standalone databases, one for each official entitled to register the civil status of citizens of the Republic
of Kosovo. This solution caused many technical problems for civil registry officials, especially when
searching for a person by name or surname.
The Albanian language (Kosovo’s most widely spoken language and its primary official language) has a 36-letter
alphabet in which many letters have similar pronunciation, making it difficult even for native speakers to
determine the exact spelling. This problem also applies to personal names and surnames.
2 Levenshtein distance and Soundex
Edit distance, commonly referred to as Levenshtein distance, and the Soundex algorithm are among the most
widely used algorithms for string matching.
Levenshtein distance is a metric that counts the single-character operations (insertion, deletion or
substitution) needed to transform one string into another [1]. The distance between two strings is the minimum
number of such operations that must be applied to the first string to obtain the second.
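As an illustration of the operation counting described above, the following is a minimal Python sketch of the
standard dynamic-programming computation of Levenshtein distance. The function name levenshtein is
illustrative and is not taken from the KCR system.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for variant in ("Xhemal", "Gjemajl", "Xhemajlije"):
        print(variant, levenshtein("Xhemajl", variant))
```

With this sketch, Xhemal is at distance 1 from Xhemajl (one deletion), Gjemajl at distance 2 (two
substitutions) and Xhemajlije at distance 3 (three insertions).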
Soundex is a phonetic algorithm: it tries to match strings that are similar in pronunciation. American
Soundex, a modified version of the original Soundex, encodes each string as a 4-character code consisting of
the first letter of the string followed by three digits assigned according to the rough pronunciation of
English letters [2]. However, Soundex is based on English phonetics and does not serve well for string
matching in other languages.
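The 4-character encoding can be sketched in Python as follows. This is a minimal sketch that follows the
commonly published American Soundex rules; the function name american_soundex is illustrative, and the
handling of non-English letters (treated here like vowels, i.e. left uncoded) is an assumption of the sketch.

```python
# Digit groups of American Soundex; vowels, h, w and y carry no code.
CODES = {}
for digit, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items():
    for letter in letters:
        CODES[letter] = digit


def american_soundex(name: str) -> str:
    """Return the first letter followed by three digits, zero-padded."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            result += digit
        previous = digit  # an uncoded letter (vowel) resets the comparison
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Xhemajl", "Xhemal"):
        print(name, american_soundex(name))
```

Under these rules Xhemajl is coded X524 while the variant Xhemal is coded X540, so the two spellings fall into
different groups. Implementations differ in details, for example in whether the first letter is merged with a
following letter of the same code, so the codes produced by a particular database function may not match this
sketch exactly.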
3 Civil Registry in Kosovo
The Government of Kosovo implemented a new version of the Civil Registry Database (CRD) in 2008 [3], phasing
out all previous standalone databases (developed in Microsoft Access), which primarily served to produce
certificates and did not maintain a comprehensive civil registry.
The new version comprises a central database and decentralized databases at the municipal level, which
replicate with the central database every night [4]. The schema is organized in three tiers: fundamental
citizen data are held in the first tier, civil facts are stored in the second tier as determined by the Law on
Civil Registers [5], and the third tier contains historical and system administration data. The current CRD
also comes with a web portal where citizens may apply online for certificates, and it is considered to be
among the most accomplished e-services that the Kosovo government offers online [6].
While Kosovo has around 2.1 million inhabitants [7], the data migration process, from around 500 standalone
databases from all civil registration offices, produced a database of 3.5 million citizen records. After an
anti-redundancy process, followed by a subjective verification of records by a national committee, this shrank
to around 2 million records. However, data migration and duplicate removal only partially solved the problem
of searching for persons in the CRD. The remaining problem arises when a search for a registered individual’s
data in the CRD returns no matching records.
3.1 Using Levenshtein distance in KCR for string matching
Levenshtein distance is the algorithm currently used in the KCR Information System to search for registered
citizens whose personal data cannot be found with a simple text-search query because the name was misspelled,
either during registration or during the search.
Table 1 presents query results for variants of the name Xhemajl [d͡ʒemajl], a common name in Kosovo
transliterated from the Arabic name Jamal.
Table 1. Variants of name Xhemajl, for various distances of Levenshtein algorithm
a) D(1), Xhemajl
Xhemaijl VS
Xhemail VS
Xhemaj VS
Xhemajl O
Xhemajli VS
Xhemal VS
b) D(2), Xhemajl
Çemajl S       Xhelal D       Xhemail VS     Xhemajlie VS
Gjemajl VS     Xhema S        Xhemaj VS      Xhemal VS
Kemajl S       Xhemahil VS    Xhemajl O      Xhemali VS
Qemajl S       Xhemaijl VS    Xhemajli VS    Xhemil S
c) D(3), Xhemajl
Çemajl S       Kemal S        Xhalal D       Xhemajlie VS     Xhemko D
Çemal S        Kemall S       Xhela D        Xhemajlije VS    Xhemush D
Dimajl D       Qemail S       Xhelal D       Xhemal VS        Xhenata D
Gjemail VS     Qemajl S       Xhelil D       Xhemali VS       Xhenej D
Gjemajl VS     Qemal S        Xhem D         Xhemë S          Xhesa D
Gjemajll VS    Qemall S       Xhema S        Xhemi S          Xheva D
Gjemal VS      Shehal D       Xhemahil VS    Xhemil S         Xhevad D
Hema D         Shema D        Xhemaijl VS    Xhemila S        Xhevaile S
Imajl D        Shemsje D      Xhemail VS     Xhemile S        Xhevair D
Ismajl D       Shenaj D       Xhemaj VS      Xhemilja S       Xhevat D
Kemail S       Sheval D       Xhemajl O      Xhemille S       Xhezair D
Kemajl S       Smajl D        Xhemajli VS    Xhemka D         Zhelal D
Table 1 is divided into three parts. Part a) lists all names in the database that can be obtained from the
name Xhemajl by inserting, deleting or substituting a single letter, i.e. at Levenshtein distance one. Parts b)
and c) list the results of transformations requiring up to two and up to three operations, respectively.
Variants marked VS in Table 1 (set in bold in the original layout) are the very similar variants of the name
being searched (best results). Other variants are categorized as similar (S): they are listed because of their
similarity to the given name (good results) but are not perfect matches. The category different (D) includes
names that are morphologically close to the given name but are bad matches for it (bad results). The given
name itself is marked O.
The database used for this paper contains 271,000 citizens with 15,839 unique names. Manual verification of
the database for variants of the name Xhemajl yields 13 entries, not including the given name. A search for
the given name Xhemajl returned 5 of these variants at Levenshtein distance one, 9 at distance two, and all 13
relevant variants at distance three.
Levenshtein distance returns a good set of results for variants of an Albanian name, but it also returns names
that manual processing would not list as similar to the given name. This is a drawback, because it may produce
a large number of non-relevant variants. In the case above, distance three produced 46 variants that are
entirely different names or not very similar to the given name. Because of these results, KCR applies searches
with Levenshtein distance two, which leaves a valuable subset of relevant results out of the final list.
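The trade-off described above can be expressed with the standard retrieval measures of precision (the share
of returned names that are relevant) and recall (the share of relevant names that are returned). These terms
are not used elsewhere in this paper; the Python sketch below is only an illustration. It assumes that only
the very similar (VS) variants count as relevant and uses counts read from Table 1, excluding the given name
itself.

```python
RELEVANT_IN_DATABASE = 13  # manually verified variants of Xhemajl

# Levenshtein distance -> (names returned, of which very similar), read from Table 1
TABLE_1_COUNTS = {1: (5, 5), 2: (15, 9), 3: (59, 13)}


def precision_recall(returned: int, relevant_returned: int, relevant_total: int):
    """Return (precision, recall) for one result set."""
    return relevant_returned / returned, relevant_returned / relevant_total


if __name__ == "__main__":
    for distance, (returned, relevant) in TABLE_1_COUNTS.items():
        precision, recall = precision_recall(returned, relevant, RELEVANT_IN_DATABASE)
        print(f"D({distance}): precision={precision:.2f} recall={recall:.2f}")
```

Under these assumptions, moving from distance two to distance three raises recall from about 69% to 100%,
while precision drops from 60% to about 22%.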
Figure 1 (left) depicts the results returned by the Levenshtein distance algorithm for each of the distances
presented in this section.
Fig. 1. Graphical representation of the distribution of the results for all compared algorithms
3.2 Using American Soundex in KCR for string matching
American Soundex is widely used for string matching based on the pronunciation of English words, and it is a
built-in function of the most popular database management systems. American Soundex is not a suitable solution
for string matching in Albanian, even though it gives better results than Levenshtein distance for names that
are misspelled primarily in their vowels, because the Soundex algorithm does not encode vowels. American
Soundex is also considered in this paper for its efficiency and very good execution performance, even on very
large databases.
Table 2 presents the results for the given names Xhemajl, Fëllënëza and Lulëzim. Xhemajl is a name whose
consonants are most often misspelled, while in Fëllënëza it is the vowels that are most often misspelled. A
similarity search with American Soundex produced 4 out of 13 relevant variants for the name Xhemajl in the
experimental database; 11 out of 11 variants for the name Fëllënëza, although the list also includes 2
different names; and a mixed set, with roughly the same number of relevant and non-relevant results, for the
name Lulëzim.
Table 2. Result sets for variants produced by American Soundex for Albanian names
a) Xhemajl       b) Fëllënëza                  c) Lulëzim
Xhansel D        Falanik D      Fëllenza VS    Liljana D      Lulzim VS
Xhemaijl VS      Fellanza VS    Fëllënza VS    Llokman D      Lulzime VS
Xhemajl O        Fëllanza VS    Fëllënzë VS    Llukman D      Lulzon VS
Xhemajli VS      Fëllanzë VS    Fllanza VS     Lulegzim S     Lulzum S
Xhemajlie VS     Fëllënëza O    Fllanzë VS     Lulëzim O
Xhemajlije VS    Fëllënxa VS    Fllënza VS     Luljzim VS
                 Fellenza VS    Flonja VS      Lulxona S
Figure 1 (center) depicts the results returned by American Soundex for string matching on Albanian names,
generalized from the experimental results. Roughly half of the returned results are erroneous or bad matches,
while a good portion of the very similar variants of the given name are missing from the result set (small
circles outside the big shaded circle).
4 Albanian Soundex
The Levenshtein algorithm fails to deliver good results for Albanian names such as Fëllënëza, which contains
the vowel ë that is usually unstressed. American Soundex fails to deliver good results for Albanian names such
as Xhemajl, which contains consonants that are often misspelled because of their similar pronunciation.
This paper presents a novel approach that returns better results than Levenshtein distance and American
Soundex for string matching on Albanian names by adjusting the American Soundex coding rules to support
Albanian phonetics. This modified Soundex algorithm is named ALSoundex.
4.1 ALSoundex coding rules
The Albanian alphabet contains 36 letters, of which 7 are vowels and 29 are consonants. Of the 29 consonants,
9 are two-character letters. Unlike English, where the same letter may be pronounced differently, Albanian has
a fixed sound for each letter, regardless of its position in the word or the surrounding characters. The
problem with string search on names and surnames derives from consonants that are pronounced similarly, such
as gj [ɟ] and xh [d͡ʒ] or q [c] and ç [t͡ʃ], and from the vowel ë [ə] (Indo-European schwa), which is sometimes
unstressed and therefore not written [8]. These problems are much more common among speakers of the Gheg
dialect, the dialect of Albanian widely spoken in Kosovo. Table 3 presents the Albanian letters as categorized
in ALSoundex alongside the English letters in American Soundex.
Table 3. American Soundex vs. ALSoundex coding
Code   American Soundex (AS)      ALSoundex (ALS)
1      B, F, P, V                 B, F, P, V
2      C, G, J, K, Q, S, X, Z     C, S, X, Z
3      D, T                       G, K
4      L                          D, T
5      M, N                       L
6      R                          M, N
7      -                          R
8      -                          Gj [ɟ], Xh [d͡ʒ], Sh [ʃ], Zh [ʒ]
9      -                          Q [c], Ç [t͡ʃ]
ALSoundex inherits some of the rules of American Soundex, including ignoring vowels in the coding, although
they were coded in the original Soundex [9]. The American Soundex rules for short names (too few characters to
code) and for ignoring consecutive letters [10] are also applied.
A rule that is applied in American Soundex but discarded in ALSoundex is the one that ignores adjacent letters
that have the same code [2].
The modified rules are: coding the word with five characters; coding the first letter as a number for names
starting with g (gj), x (xh), q or ç; ignoring the consonants H and J, because they are used with other
characters to form two-character letters; and ignoring Rr, Ll, Th, Dh and Nj.
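The coding rules above can be sketched in Python as follows. This is one possible reading of the rules, not
the implementation used in KCR, and the function name alsoundex is illustrative. The sketch assumes that
(1) consecutive identical letters such as ll or rr are coded once, (2) Th, Dh and Nj contribute only the code
of their first letter, because H and J are ignored, (3) an initial Sh or Zh keeps the letter S or Z, and
(4) a plain initial G or X is coded with its Table 3 digit.

```python
# Single-letter codes from the ALSoundex column of Table 3.
SINGLE = {}
for digit, letters in {
    "1": "bfpv", "2": "csxz", "3": "gk", "4": "dt",
    "5": "l", "6": "mn", "7": "r", "9": "qç",
}.items():
    for letter in letters:
        SINGLE[letter] = digit

DIGRAPHS_CODE_8 = ("gj", "xh", "sh", "zh")  # two-character letters coded 8
NUMERIC_INITIALS = set("gxqç")              # initials coded as digits (assumption 4)


def alsoundex(name: str, length: int = 5) -> str:
    """Return a five-character ALSoundex-style code, zero-padded."""
    s = "".join(ch for ch in name.lower() if ch.isalpha())
    if not s:
        return ""
    # First symbol: a digit for gj, xh, g, x, q, ç; otherwise the letter itself.
    if s[:2] in ("gj", "xh"):
        code, i = "8", 2
    elif s[0] in NUMERIC_INITIALS:
        code, i = SINGLE[s[0]], 1
    else:
        code, i = s[0].upper(), 1
    previous = s[i - 1]
    while i < len(s) and len(code) < length:
        if s[i:i + 2] in DIGRAPHS_CODE_8:
            code += "8"
            previous = s[i + 1]
            i += 2
            continue
        ch = s[i]
        # Vowels, h and j have no entry in SINGLE and are skipped;
        # a repeated letter is coded once (assumption 1).
        if ch != previous and ch in SINGLE:
            code += SINGLE[ch]
        previous = ch
        i += 1
    return code.ljust(length, "0")


if __name__ == "__main__":
    for name in ("Xhemajl", "Gjemajl", "Fëllënëza", "Lulëzim"):
        print(name, alsoundex(name))
```

With these assumptions, Xhemajl and Gjemajl both receive the code 86500, Fëllënëza receives F5620 and Lulëzim
receives L5260.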
4.2 Experimental results
Searching for similarities to the name Xhemajl (encoded as 86500) with ALSoundex returns a result set of 24
variants. Of these, 13 are very similar variants of the name; that is, all variants of this name existing in
the database are included in the result set. The other 11 variants are similar to the given name, and no bad
matches are included in the result set. American Soundex, by contrast, groups the very similar variants of the
given name into four disjoint sets, with only four variants in the same set as the original name, as presented
in Table 4 under a).
Table 4. Result sets for variants produced by ALSoundex for Albanian names
a) ALSoundex(‘Xhemajl’)
Variant       Imp   ALS     AS
Gjemajl       VS    86500   G252
Gjemail       VS    86500   G254
Gjemal        VS    86500   G254
Gjemali       VS    86500   G254
Gjemil        S     86500   G254
Gjemila       S     86500   G254
Gjemile       S     86500   G254
Gjemilja      S     86500   G254
Xhemaijl      VS    86500   X524
Xhemajl       O     86500   X524
Xhemajli      VS    86500   X524
Xhemajlie     VS    86500   X524
Xhemajlije    VS    86500   X524
Xhemahil      VS    86500   X540
Xhemail       VS    86500   X540
Xhemal        VS    86500   X540
Xhemali       VS    86500   X540
Xhemil        S     86500   X540
Xhemila       S     86500   X540
Xhemile       S     86500   X540
Xhemille      S     86500   X540
Xhmile        S     86500   X540
Xhamilja      S     86500   X542
Xhemilja      S     86500   X542
b) ALSoundex(‘Fëllënëza’)
Variant       Imp   ALS     AS
Fellanza      VS    F5620   F452
Fëllanza      VS    F5620   F452
Fëllanzë      VS    F5620   F452
Fëllënëza     O     F5620   F452
Fëllënxa      VS    F5620   F452
Fellenza      VS    F5620   F452
Fëllenza      VS    F5620   F452
Fëllënza      VS    F5620   F452
Fëllënzë      VS    F5620   F452
Fllanza       VS    F5620   F452
Fllanzë       VS    F5620   F452
Fllënza       VS    F5620   F452
c) ALSoundex(‘Lulëzim’)
Variant       Imp   ALS     AS
Lulëzim       O     L5260   L425
Luljzim       VS    L5260   L425
Lulxona       S     L5260   L425
Lulzim        VS    L5260   L425
Lulzime       VS    L5260   L425
Lulzon        VS    L5260   L425
Lulzum        S     L5260   L425
ALSoundex returns a perfect result set for the name Fëllënëza, as presented in Table 4 under b), giving 12
very similar variants out of 12 verified variants of the name in the database. The ALSoundex result set for
this name is a subset of the American Soundex set that discards the bad matches returned by American Soundex.
For the given name Lulëzim, as presented in Table 4 under c), ALSoundex returned 4 very similar variants out
of 4 verified variants of the name in the database. In this case, 2 similar variants and no bad matches were
returned by ALSoundex.
Based on the experimental results, ALSoundex result sets contain on average 78% very similar results, as a
percentage of the variants listed, while the other 22% of the results are mainly similar variants and, in rare
cases, different names or bad matches. Bad matches are most likely when searching for names that contain no
more than 2 consonants.
From another perspective, the very similar variants returned by ALSoundex cover on average 94% of the manually
verified very similar variants in the database. As presented in Figure 1 (right), almost no relevant results
fall outside the result set returned by ALSoundex.
5 Conclusion
The proposed solution for string matching on Albanian names, the ALSoundex algorithm, was tested against the
Prishtina municipality database. The returned result sets are more complete, i.e. they contain all variants of
citizen names for a given name. A comparison of the result sets returned for the name Xhemajl by all three
algorithms, (i) Levenshtein D(2), (ii) American Soundex and (iii) ALSoundex, is presented in Figure 2. The
small black circles, representing the best matches, are all members of the ALSoundex set.
Fig. 2. Distribution of the results for the name Xhemajl: intersection of result sets
The ALSoundex algorithm is also more efficient than Levenshtein distance. On the Prishtina municipality
database, which contains 271,000 citizen records, the result set for the name Xhemajl is retrieved within one
second, whereas Levenshtein distance returns its result within 47 seconds.
The result set returned by ALSoundex is more relevant for a given name than the Levenshtein result set, and of
far higher quality than the American Soundex result set.
References:
1. Black, P. E. 1999. Levenshtein distance. Dictionary of Algorithms and Data Structures. U.S. National
Institute of Standards and Technology. Available from: http://www.nist.gov/dads/HTML/Levenshtein.html
2. U.S. National Archives. The Soundex Indexing System. Available from:
http://www.archives.gov/research/census/soundex.html
3. Government of Kosovo. 2008. Strategy on e-Governance 2009-2015. Available from:
http://map.ks-gov.net/en/page.aspx?id=160
4. Rexha, B., Raça, V. and Arifaj, M. 2008. Kosovo Civil Registry System Specification.
5. Assembly of Republic of Kosovo. 2004. Law on Civil Registers. Available from:
http://www.assembly-kosova.org/common/docs/ligjet/2004_46_en.pdf
6. United Nations Development Program. 2010. eGovernance and ICT Usage Report for South East Europe -
2nd Edition. Available from: http://www.undp.ba/index.aspx?PID=36&RID=107
7. Statistical Office of Kosovo. 2011. Key Indicators of Population. Available from:
http://esk.rks-gov.net/eng/
8. Kostallari, A., Domi, M., Qabej, E. and Lafe, E. 1974. Drejtshkrimi i Gjuhes Shqipe (Albanian Language
Grammar).
9. Russell, R. C. 1918. U. S. Patent 1261167. Available from:
http://v3.espacenet.com/publicationDetails/biblio?CC=US&NR=1261167&KC=&FT=E
10. Mokotoff, G. 1997. Soundexing and Genealogy. Available from: http://www.avotaynu.com/soundex.htm