Akkadian Word Segmentation
Timo Homburg M.Sc., Dr. Christian Chiarcos
Institute for Computer Science
Goethe University, Robert-Mayer-Str. 10, 60325 Frankfurt am Main, Germany
timo.homburg@gmx.de, chiarcos@em.uni-frankfurt.de
Abstract
We present experiments on word segmentation for Akkadian cuneiform, an ancient writing system and a language used for about 3 millennia in the ancient Near East. To the best of our knowledge, this is the first study of this kind applied to either the Akkadian language or the cuneiform writing system. As a logosyllabic writing system, cuneiform structurally resembles Eastern Asian writing systems, so we employ word segmentation algorithms originally developed for Chinese and Japanese. We describe results of rule-based algorithms, dictionary-based algorithms, and statistical and machine learning approaches. Our results indicate promising directions for cuneiform word segmentation that can enable and improve natural language processing in this area.
Keywords: Assyriology, Cuneiform, Akkadian, Chinese, Word Segmentation, Machine Learning
1. Introduction
Word segmentation is the most elementary task in natural language processing of written language. In most alphabetical writing systems, this task is commonly referred to as tokenization and can be easily solved through the interpretation of orthographical markers for word and sentence boundaries, e.g., white spaces. Where these are lacking, however, word segmentation is a challenging task, a classical – and successfully addressed – problem in logographic writing systems like Chinese and logosyllabic writing systems like Japanese.
Here, we describe experiments on cuneiform, a writing system developed in the 4th millennium BCE in Mesopotamia and subsequently applied to various Semitic, Indo-European and isolate languages in the region. As a logosyllabic writing system, it shares important structural characteristics with Chinese and Japanese (Ikeda, 2007), so we evaluate word segmentation methods successfully applied to these languages. However, these languages are unrelated to those of the Ancient Near East, so future research will focus on developing aspects specific to languages with cuneiform writing.
As a writing system, cuneiform poses a number of unique challenges:

- The same character can be read as a logograph or as a syllabic sign, e.g., as the logograph GURUŠ 'young man' or with one of its phonological readings as a syllabic sign.

- As a syllabic sign, a single character can have multiple different readings, grounded, e.g., in the possible Sumerian pronunciation(s) of the logograph or in the pronunciation of its Akkadian translations: the same sign may be read as dan/tan (from Akk. dannu 'strong, powerful'), kal (from Sum. kal 'rare, valuable' and kalag 'strong'), rib (from Sum. rib 'outstanding, strong'), etc. (Tinney et al., 2006; Lauffenburger, n.d.; Borger, 2004).

- CVC syllables (e.g., dan) can be written as a pair of CV-VC characters (da-an) or with a single CVC character (dan) (Gelb, 1957, p. 8, Da-an--ri vs. Dan-r-ri).
We primarily consider the Akkadian language, the dominant language of the Ancient Near East from the 3rd to the 1st millennium BCE. Originally spoken in Mesopotamia, it became the lingua franca of the Near East during the 2nd millennium BCE, with an extensive body of material comparable only to corpus languages such as Classical Latin or Ancient Greek. With a considerable amount of cuneiform clay tablets not yet deciphered, and new ones being continuously excavated, the automated processing of the Akkadian language is of tremendous importance. Previous research on automated digitization focused on producing 3D scans of tablets (Sect. 2.), with Optical Character Recognition (OCR) being a logical next step in the development. Successful cuneiform OCR, however, needs to be accompanied by knowledge-rich NLP methods for the contextual disambiguation of characters: one of the key characteristics of cuneiform is that a character can be read as a logograph, as a determinative, or as a syllabic sign (with different phonemic values). The interpretation of a character is thus heavily dependent on its context. Word segmentation approaches may thus be a key component of any approach to cuneiform OCR.
Akkadian is the oldest attested Semitic language and has thus occasionally been considered in experiments on NLP for Semitic languages, mostly focusing on (rule-based) morphological analysis. To the best of our knowledge, the present paper describes the first study of word segmentation in Akkadian cuneiform. It thus provides a primary point of orientation for any subsequent experiments on cuneiform word segmentation and will be of utmost importance to future experiments on cuneiform OCR and Akkadian NLP.
2. State of the Art
We distinguish three types of word segmentation algorithms:

rule-based: segmentation rules derived from grammar
dictionary-based: segmentation by lookup in a (statically enhanced) dictionary
statistical/machine learning: data-driven segmentation as learnt from segmented corpora
As shown in several SIGHAN Bakeoffs in the last decade (Sproat and Emerson, 2003), machine learning and dictionary-based approaches like MaxMatch (Chen and Liu, 1992) produce reasonable results for Chinese, while rule-based methods are commonly used as a baseline (Palmer and Burger, 1997). For Japanese, however, rule-based algorithms like Tango (Ando and Lee, 2000) proved to be more successful. This is partially due to the morphological richness of Japanese as compared to Chinese.
As a point of orientation for subsequent studies on cuneiform, we evaluate selected approaches from these classes on Akkadian. Neither the Akkadian language nor cuneiform as a writing system has been addressed in this respect before.
Along with other cuneiform languages, Akkadian has a considerable research history in NLP. For the greatest part, existing approaches are concerned with rule-based morphological analyzers, e.g., Kataja and Koskenniemi (1988), Barthélemy (1998), Macks (2002), Barthélemy (2009), and Khait (accepted) for Akkadian, or Tablan et al. (2006) for Sumerian. As for data-driven morphological tools, the state of the art in the field is represented by the lemmatizer of the Open Richly Annotated Cuneiform Corpus (ORACC),¹ which supports manual morphological annotation for Akkadian, Sumerian and (to a limited degree) Hittite with a lookup functionality in the annotated corpus. Such example-based approaches can be extended to automatically transfer morphological rules through phonological equivalences, as demonstrated by Snyder et al. (2010) for the projection of Hebrew morphology and lexicon to Ugaritic, another Semitic cuneiform language. As for higher levels of linguistic analysis, we are not aware of any tools for syntactic or semantic annotation of Akkadian; however, the latter has been considered for administrative texts from the Sumerian period, whose highly conventionalized structure can be exploited for concept classification (Jaworski, 2008).
Aside from linguistic analysis, another aspect of cuneiform languages that has recently aroused interest are approaches focusing on the material side of cuneiform writing, i.e., scanning and digitizing clay tablets (Subodh et al., 2003; Cohen et al., 2004), reconstructing tablets and texts by automatically combining their fragments (Collins et al., accepted; Tyndall, 2012), and, recently, initial steps towards cuneiform OCR (Mara et al., 2010). As this line of research is flourishing mostly in the field of computer graphics, the obvious gap between both lines of research lies in the absence of any studies concerned with the transition from the (identified) sign to its linguistic interpretation, a challenging task, as mentioned before.
With our paper, we describe the first experiments in this direction, with a specific focus on segmenting character sequences into words as a core component for future approaches on transliteration.

¹ http://oracc.museum.upenn.edu/doc/builder/linganno/
3. Experimental Setup
3.1. Corpus Data
We use corpora from three different periods and dialects, namely Old Babylonian, Middle Babylonian and Neo-Assyrian, from the Cuneiform Digital Library Initiative (CDLI),² representing most of the available texts (clay tablets) of the given periods of time. The corpora were randomly split in an 80:20 ratio for training and testing purposes (on a per-tablet, not a per-line basis). For the experiments, we trained our segmentation algorithms on each of these language stages and performed evaluations on each language stage respectively. For reasons of space, we only report results for the Middle Babylonian training corpus and evaluation against the Middle Babylonian test corpus in detail. Further experiments showing robust performance across different language stages are presented graphically. Additionally, we present results of classifiers trained on the corpus data of one epoch and applied to other epochs of the same language, to give an impression of the performance of the algorithms on related data.
The CDLI ATF format contains metadata, a (word-segmented) transliteration, and (optionally) a translation. CDLI data always represents cuneiform in lines as found on the clay tablets. To minimize ambiguity, Akkadian writers tried to avoid incomplete words at the end of a line, so the tablets themselves provide initial data on word segmentation. From the transliteration, we restored the original UTF-8 characters on the basis of a sign list that we compiled from various resources. Non-restorable characters were ignored and thus are not represented in the resulting texts. This data represents our gold standard. It should be noted that the mapping to UTF-8 cannot be trivially reversed because of the highly ambiguous phonological and ideographic meanings of characters.
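A minimal sketch of this restoration step is shown below, assuming a sign list loaded as a dict from transliterated sign values to Unicode cuneiform characters. The SIGN_LIST entries and the simplified tokenization are illustrative assumptions, not the actual sign list compiled for the paper.

```python
# Hypothetical excerpt of a sign list; a real sign list maps thousands
# of transliteration values to cuneiform signs.
SIGN_LIST = {
    "da": "\U00012055",
    "an": "\U0001202D",
}

def restore_line(atf_line):
    """Convert one word-segmented transliteration line to UTF-8 cuneiform,
    keeping the gold word boundaries as spaces."""
    words = []
    for word in atf_line.split():
        # Syllable values within a word are joined by "-" in ATF.
        signs = [SIGN_LIST[v] for v in word.lower().split("-") if v in SIGN_LIST]
        if signs:  # non-restorable values are ignored, as described above
            words.append("".join(signs))
    return " ".join(words)
```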
After conversion and the removal of whitespaces, segmentation algorithms of three categories have been applied. Figure 1 shows the segmentation process.

Figure 1: Classification process
3.2. Baseline
As our baseline, we adopted the Character-As-Word algorithm (Palmer and Burger, 1997, p. 1), common as a baseline for Chinese, in the form of an Average-Word-Length (AVG) algorithm. The average word length L was determined empirically by analyzing the training corpora as follows:

L = \frac{\sum_{i=1}^{\#(\mathrm{text})} \mathrm{len}(w_i)}{\#(\mathrm{text})}

with text being the entire corpus, len(w_i) being the length of a word, and #(text) representing the number of words in the entire corpus. Accordingly, a sequence of characters is split after every L-th character.

² http://cdli.ucla.edu/
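A minimal sketch of this baseline, assuming the training corpus is given as a list of gold-segmented words; rounding L to the nearest integer (with a floor of 1) is our assumption, as the paper does not state how L is discretized.

```python
def average_word_length(training_words):
    """Empirical average word length L over the training corpus."""
    return max(1, round(sum(len(w) for w in training_words) / len(training_words)))

def avg_segment(line, L):
    """Split an unsegmented line after every L-th character."""
    return [line[i:i + L] for i in range(0, len(line), L)]
```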
3.3. Rule-based Algorithms
A first, simple bigram method inserts a word break between two characters c₁ and c₂ if a word break between c₁ and c₂ is more frequent than a non-break in the training corpus (see the sketch after this paragraph). Similarly, a unigram segmentation between two characters can be achieved by classifying every individual character according to the I(O)BES scheme – (preferred) intermediate (I), beginning (B), end (E) and single (S) characters – in the training corpus. The prefix/suffix algorithm collects usages of every character in every word found in an already segmented training corpus: for every character found in a word, its classification in the I(O)BES scheme is collected, and frequencies are counted for I, B, E and S, respectively. For each pair of characters, the algorithm tests whether the probability of being E or S exceeds the probability of being I or B, thus indicating a separation. In this way, the prefix/suffix algorithm includes word-level information to achieve the segmentation.
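A sketch of the bigram rule, assuming training lines are given as lists of gold words; the unigram and prefix/suffix variants follow the same counting idea with per-character I(O)BES tallies.

```python
from collections import Counter

def train_bigram(segmented_lines):
    """Count, per character pair, how often a break vs. a non-break occurs."""
    breaks, joins = Counter(), Counter()
    for words in segmented_lines:
        chars = "".join(words)
        boundary_after, pos = set(), 0
        for word in words[:-1]:
            pos += len(word)
            boundary_after.add(pos - 1)  # last character of each word
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i + 1])
            (breaks if i in boundary_after else joins)[pair] += 1
    return breaks, joins

def bigram_segment(chars, breaks, joins):
    """Insert a break wherever it was more frequent than a non-break."""
    if not chars:
        return []
    out = [chars[0]]
    for i in range(len(chars) - 1):
        pair = (chars[i], chars[i + 1])
        if breaks[pair] > joins[pair]:
            out.append(" ")
        out.append(chars[i + 1])
    return "".join(out).split()
```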
Furthermore, we apply the Tango algorithm (Ando and Lee, 2000), which uses a scoring system to determine possible segmentations in a sliding-window mechanism. The threshold parameter and the window size have been empirically chosen to fit the needs of the Akkadian language.
Finally, we compare these methods with a random segmentation function which decides for every character whether a segmentation should occur after it. The randomized function is initialized with a random seed of the size of the corresponding character count of each line. Though substantially worse than the average-length baseline, it outperforms the bigram method.
3.4. Dictionary-based Algorithms
As dictionary-based approaches can, despite their simplicity, achieve considerable segmentation performance in languages like Chinese or Japanese, we apply the commonly used Minimum WordCount Matching algorithm, a modified version of the LCU Matching algorithm (Pengyu et al., 2014), the MaxMatch algorithm (Chen and Liu, 1992), and a modified version of the MaxMatch algorithm (Islam et al., 2007).³ The dictionaries used in these approaches have been generated from the provided training data that serves as a basis for the other approaches as well. External dictionary resources have not been considered. It is important to note that we did not apply any new word detection, which could be used to extend the dictionary using data from the test corpus. This may be a topic for further refinement and could be achieved by applying one of many statistical approaches.

³ We employ the basic version of (Chen and Liu, 1992); no additional neighbor checking is performed.
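A minimal sketch of forward maximum matching in the spirit of the basic MaxMatch variant used here; the dictionary as a plain set of words harvested from the training data and the max_len cap on word length are our assumptions.

```python
def max_match(chars, dictionary, max_len=6):
    """Greedy longest-match segmentation against a word dictionary."""
    words, i = [], 0
    while i < len(chars):
        # Try the longest candidate first; a single unknown character
        # always matches, so the loop is guaranteed to advance.
        for j in range(min(len(chars), i + max_len), i, -1):
            if chars[i:j] in dictionary or j == i + 1:
                words.append(chars[i:j])
                i = j
                break
    return words
```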
3.5. Statistics & ML
The commonly used Maximum Probability Matching algorithm trivially maximizes the occurrence probability of a word sequence (here, a line on a tablet) by matching words against a frequency-tagged dictionary and returning the most probable word segmentation, represented as a sequence of boundary positions s = (s_1, \ldots, s_{|s|}) with 1 \le s_j \le n, for a character sequence (i.e., line) c_1, \ldots, c_n:

\mathrm{Seg}_{\mathrm{MaxProb}}(c_1, \ldots, c_n) = \operatorname*{arg\,max}_s \prod_{j=1}^{|s|-1} P_{\mathrm{dict}}(c_{s_j}, \ldots, c_{s_{j+1}})

We normalize the dictionary probability P_{dict} to values greater than 0 to account for out-of-vocabulary words.
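The arg max above can be computed exactly by dynamic programming over boundary positions; the following sketch assumes prob() returns the smoothed dictionary probability (always greater than 0, so out-of-vocabulary substrings stay scorable), and the max_len cap is our assumption.

```python
import math

def max_prob_segment(chars, prob, max_len=6):
    """Viterbi-style search for the most probable segmentation of a line."""
    n = len(chars)
    best = [0.0] + [-math.inf] * n   # best log-probability of chars[:i]
    back = [0] * (n + 1)             # start position of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + math.log(prob(chars[j:i]))
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], n                 # backtrace the best boundary sequence
    while i > 0:
        words.append(chars[back[i]:i])
        i = back[i]
    return list(reversed(words))
```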
More advanced approaches we studied here include instance-based and clustering algorithms (kNN, kMeans), decision trees (C4.5), NaiveBayes and MaxEnt, sequence labelling models (Hidden Markov Models, HMM, and Conditional Random Fields, CRF), as well as Support Vector Machines (SVM) and Multi-Layer Perceptrons (MLP). For most algorithms, we relied on the implementations provided by WEKA 3.7,⁴ for HMMs on the HMMWeka extension,⁵ for CRFs on the MALLET⁶ CRF SimpleTagger, and for SVMs on libsvm⁷ with a polynomial kernel.
Table 1 enumerates the feature sets we used in the experiments. For the purpose of our experiments, these were adopted without modification from the literature on Far Eastern languages and writing systems (see references for details) and directly applied to Akkadian cuneiform. The motivation is to establish a sound basis for the future development of cuneiform-specific algorithms, for which we expect substantial refinements if language-specific features for Akkadian are explicitly taken into account in statistical approaches.
For all data-driven classifiers, we simplified the target classification of the I(O)BES scheme to a binary distinction of Class 0 (i.e., I, B: no segmentation after the current character) and Class 1 (i.e., E, S: segmentation after the current character). All classifiers were trained on data sets of 10,000 instances in order to assess their performance under resource-poor circumstances. In addition, training on the full training set of 100,000 instances was performed successfully for most algorithms (and is reported below), with the exception of HMM and CRF, whose training timed out.
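A sketch of this binary reduction, assuming each training line is available as a list of gold words; real instances would additionally carry the context-window features of Table 1.

```python
def binary_instances(words):
    """words: list of gold words of one line. Returns one (index, char,
    label) triple per character, where label 1 means a word boundary
    follows the character (E/S) and 0 means none (I/B)."""
    chars = "".join(words)
    boundary_after, pos = set(), 0
    for word in words:
        pos += len(word)
        boundary_after.add(pos - 1)
    return [(i, c, 1 if i in boundary_after else 0)
            for i, c in enumerate(chars)]
```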
⁴ http://www.cs.waikato.ac.nz/ml/weka/
⁵ http://doc.gold.ac.uk/~mas02mg/software/hmmweka/
⁶ http://mallet.cs.umass.edu/
⁷ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
feature set                                              algorithms
BASE      base feature set (Low et al., 2005)            C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
EXT       extended feature set (Low et al., 2005)        C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
MAXENT    MaxEnt feature set (Raman, 2006)               C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
PERC      perceptron feature set (Song and Sarkar, 2008) C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
RED-BIG   reduced bigram feature set (Papageorgiou, 1994) CRF, HMM

Table 1: Feature sets
3.6. Ensemble Combination
Finally, we implemented simple ensemble combination architectures as a meta classifier, using

(a) a simple (unweighted) majority vote, tested for all possible combinations of the individual segmentation algorithms described above,
(b) C4.5 meta classification, and
(c) SVM meta classification.

Remarkably, the more elaborate C4.5 and SVM versions of the meta classifier did not outperform the best majority configuration (i.e., all classifiers except SVM), whose results are reported in Tab. 2.
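The unweighted majority vote of variant (a) reduces to a per-character decision such as the following sketch, applied to the Class 0/1 outputs of the individual segmenters; breaking ties towards "no boundary" is our assumption.

```python
def majority_vote(votes):
    """votes: list of 0/1 boundary decisions, one per segmenter."""
    return 1 if sum(votes) * 2 > len(votes) else 0
```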
3.7. Evaluation Metrics
We primarily evaluate against the following conventional
metrics:
Boundary evaluation addresses character-based segmentation per boundary (Palmer and Burger, 1997, p. 176), i.e., precision and recall of predicted and observed boundaries following a given character:

rec_b = \frac{\#\,\text{correctly predicted boundaries}}{\#\,\text{gold boundaries}} \qquad prec_b = \frac{\#\,\text{correctly predicted boundaries}}{\#\,\text{predicted boundaries}}
Word boundary evaluation evaluates completely segmented words (Palmer and Burger, 1997, p. 176), a metric especially relevant for any future practical application by Assyriologists or philologists:

rec_w = \frac{\#\,\text{correctly predicted words}}{\#\,\text{gold words}} \qquad prec_w = \frac{\#\,\text{correctly predicted words}}{\#\,\text{predicted words}}
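Both metrics can be computed from sets of boundary positions; a sketch for the boundary variant follows (the word-level variant is analogous, comparing (start, end) spans of whole words instead of single positions).

```python
def boundary_prf(gold, pred):
    """gold, pred: sets of character positions after which a break occurs."""
    correct = len(gold & pred)
    prec = correct / len(pred) if pred else 0.0
    rec = correct / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```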
As these metrics do not differentiate between near and far misses of boundaries at all, we also employ sliding-window-based metrics:
WindowDiff aims to avoid penalizing near-matching boundaries too restrictively. With window size k = \frac{N}{2 \cdot \#\mathrm{segments}}, reference segmentation R, a total number of N content units, and computed boundaries C, it measures the error of a segmentation as follows:

\mathrm{WDiff} = \frac{1}{N-k} \sum_{i=0}^{N-k} \mathbf{1}\left( |R_{i,i+k} - C_{i,i+k}| > 0 \right)

where R_{i,i+k} and C_{i,i+k} denote the number of reference and computed boundaries, respectively, between positions i and i+k.
WinPR is an established metric in the word segmentation community. It calculates precision and recall building on WindowDiff, defining

prec_{wp} = \frac{tp_{wp}}{tp_{wp} + fp_{wp}} \qquad rec_{wp} = \frac{tp_{wp}}{tp_{wp} + fn_{wp}}

with the following definitions (Scaiano and Inkpen, 2012):

tp_{wp} = \sum_{i=1-k}^{N} \min(R_{i,i+k}, C_{i,i+k})

tn_{wp} = -k(k-1) + \sum_{i=1-k}^{N} \left( k - \max(R_{i,i+k}, C_{i,i+k}) \right)

fp_{wp} = \sum_{i=1-k}^{N} \max(0,\, C_{i,i+k} - R_{i,i+k})

fn_{wp} = \sum_{i=1-k}^{N} \max(0,\, R_{i,i+k} - C_{i,i+k})
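A sketch of the sliding-window counting shared by these metrics, with R and C represented as 0/1 boundary indicator lists; the WinPR counts substitute the min/max comparisons above for the inequality test.

```python
def window_diff(ref, comp, k):
    """ref, comp: 0/1 boundary indicator lists of equal length N."""
    def count(ind, i):  # number of boundaries inside the window [i, i+k)
        return sum(ind[i:i + k])
    n = len(ref)
    errors = sum(1 for i in range(n - k + 1)
                 if count(ref, i) != count(comp, i))
    return errors / (n - k + 1)
```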
The PK metric is a standard metric in the field of text segmentation (Pevzner and Hearst, 2002).
4. Results
Table 2 and Fig. 3 provide results that are representative of the experiments conducted. For reasons of space, only the best-performing combinations of features and data-driven (statistical/ML) methods are reported. Also, while all combinations of cuneiform corpora have been tested, we report only results obtained by training the tools on the (training section of the) Middle Babylonian corpus and testing on the (test section of the) Middle Babylonian corpus. The general pattern, however, remains the same for all cuneiform corpora considered, both within a language stage (training and test corpus for the same language stage, Fig. 2) and across language stages (e.g., Old Babylonian tools tested on the Middle Babylonian corpus).
For the example of the Middle Babylonian tools, the scores of the best-performing configurations obtained on the Old Babylonian and Neo-Assyrian test sets are reported in Tab. 3. Unsurprisingly, the scores are generally worse than for Middle Babylonian; yet, they still outperform the baseline. Remarkably, Neo-Assyrian boundary F-scores actually seem to improve over Middle Babylonian. This may be due to the fact that the Neo-Assyrian corpus is more homogeneous than the Middle Babylonian corpus, as the latter contains much material written by non-native speakers.
Method                          Bound F-Score   Word F-Score   PK Score   WinPR F-Score
baseline (AVG length [2])           42.77           21.19        13.77        42.92
Bigram                              14.90            7.22        47.72        20.84
Pref/Suff                           34.59           10.17        23.08        34.86
Random                              23.43           13.64        45.04        22.31
Tango                               49.22           14.88        13.93        35.32
MaxMatch                            65.02           65.05         9.03        57.47
MaxMatchCombined                    73.91           58.48         8.69        60.80
LCUMatching                         68.14           44.78         8.92        51.30
MinWCMatch                          72.82           59.76         8.17        59.63
MaxProb                             59.67           37.26        10.24        49.58
C4.5 (EXT 10K)                      42.66           15.33        23.12        34.01
CRF (EXT 10K)                       40.13           15.31        13.06        27.10
HMM (RED-BIG 10K)                   23.10           12.98        21.64        35.55
kMeans (BASE 10K)                   37.79           14.17        24.11        33.77
kNN (EXT 100K)                      49.48           14.51        13.59        37.53
MaxEnt (EXT 10K)                    46.91           14.97        15.82        36.59
NBayes (EXT 100K)                   49.49           14.52        13.60        37.56
Percep (MAXENT 10K)                 49.51           14.51        13.59        37.51
SVM (EXT 10K)                       49.51           14.51        13.59        37.51
META (w/o SVM, majority vote)       49.42           14.31        13.59        37.47

Table 2: Middle Babylonian tools on Middle Babylonian test set
As for rule-based methods, only Tango (Ando and Lee, 2000) outperformed the baseline, whereas dictionary-based algorithms performed clearly better. Dictionary-based approaches produced the best-performing classifications, with scores, depending on the metric, between 60% and nearly 80%.
Machine learning algorithms as applied here seem to face problems with Akkadian cuneiform: the feature sets used for segmenting Chinese yield significantly worse results for Akkadian, and even the best-performing algorithms hardly outperform the baseline. This indicates that a feature set specific to Akkadian needs to be developed in future research.

Figure 2: Within-language test error for (a) Old Babylonian tools on the Old Babylonian test set, and (b) Neo-Assyrian tools on the Neo-Assyrian test set (boundary evaluation)
It should be noted that this picture did not drastically change when different training and test corpora for Akkadian language stages were employed.
A note on transliteration and morphosyntax. As mentioned before, retransforming cuneiform Unicode characters into the correct transliteration is far from easy, as every Unicode character may represent an ideograph, several different syllables, etc. In the process of word segmentation, we also conducted initial experiments on transliteration. By mapping every character to its most frequent transliteration according to the corpus data, we were able to establish a baseline that correctly transliterates up to 40% of the characters per corpus. In future research, this needs to be further improved by statistical, context-aware classifiers and a tighter integration of the word segmentation and transliteration tasks. At the same time, word segmentation for Akkadian can certainly benefit from integrating higher levels of NLP, e.g., POS tagging, as this may be important for lexical disambiguation. Taken together, this calls for a uniform architecture capable of handling word segmentation, transliteration and morphosyntactic analysis in a single task. For such an integrated system, our experiments with the segmentation module may serve as a baseline.
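A sketch of this most-frequent-reading baseline; the extraction of (character, transliteration) pairs from the aligned corpus and the "?" fallback for unseen characters are our assumptions.

```python
from collections import Counter, defaultdict

def train_transliterator(pairs):
    """pairs: iterable of (cuneiform_char, transliteration) tuples."""
    counts = defaultdict(Counter)
    for char, value in pairs:
        counts[char][value] += 1
    # Keep only the most frequent transliteration per character.
    return {char: c.most_common(1)[0][0] for char, c in counts.items()}

def transliterate(chars, table):
    return [table.get(c, "?") for c in chars]
```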
Method                     Bound F-Score   Word F-Score   PK Score   WinPR F-Score   language (alg.)
baseline (AVG length [2])      39.25           17.45        21.37        39.81       OBab
                               41.63           17.06        11.12        36.13       NAss
rule-based                     33.80            9.11        28.23        31.53       OBab (Prefix/Suffix)
                               51.93           16.65        11.14        36.44       NAss (Tango)
dictionary-based               62.73           53.12         2.43        56.10       OBab (MinWCMatch)
                               52.95           24.51        10.97        38.46       NAss (MinWCMatch)
statistical/ML                 39.56            9.56        21.26        31.91       OBab (NaiveBayes, EXT 10K)
                               52.31           16.64        10.95        36.47       NAss (NaiveBayes, PERC 10K)

Table 3: Across-language test error: Middle Babylonian tools on O(ld) Bab(ylonian) and N(eo-)Ass(yrian) test sets
Figure 3: Middle Babylonian tools (80%) on Middle Babylonian test set (20%) (boundary evaluation)
5. Summary and Outlook
Our paper presents the first experiments on word segmentation for Akkadian or cuneiform. It provides insights into what to expect when applying established word segmentation algorithms to the Akkadian language, and we showed that, for Akkadian corpora of the dimensions available, dictionary-based approaches produce the best results in segmenting Akkadian texts.
In general, our results for Akkadian are substantially worse than those for Chinese and Japanese, and currently do not live up to the needs of philologists. However, given the high degree of ambiguity in the writing system, this result is unsurprising and calls for intensified research efforts in this regard. Our experiments suggest directions for future research and provide a point of orientation for any future approach in this direction.
A key result is that further research on the linguistic characteristics of Akkadian and other cuneiform languages is required, and – likely – only this will improve results to production-ready quality.
Strategies to improve segmentation performance include

(1) extending dictionary-based algorithms with a new word detection component – yet, this is methodologically problematic if the test corpus is used to improve segmentation –,
(2) combining dictionary-based and rule-based approaches and extending the latter with a morphological component, and
(3) combining morphology-oriented rule-based systems with statistical and machine learning algorithms to benefit from both the context-awareness of data-driven methods and the high precision of rule-based morphological analysis.
The most promising (and the most challenging) extension in this regard is the development of an integrated system that provides uniform handling for word segmentation, transliteration and morphosyntactic annotation. Akkadian is a morphologically complex language with a highly ambiguous writing system. Unlike Chinese, it shows heavy interference between morphology and word segmentation, and unlike Japanese, it does not have a 1:1 correspondence between syllabic signs and phonemic values. In this regard, any future word segmentation algorithm for cuneiform will have to be an integrated approach, and thus be very different from existing approaches for either Chinese or Japanese.
Remark to reviewers
Upon acceptance of this paper, code and data will be published under open licenses (Apache license and CC-BY). In addition to the algorithms described here, this includes an interactive GUI for visualizing and analyzing segmentation results, a cuneiform Input Method Engine, and a near-exhaustive list of UTF-8 cuneiform characters and their readings. Providing a link would reveal author identity.
6. Bibliographical References
Ando, R. K. and Lee, L. (2000). Mostly-unsupervised statistical segmentation of Japanese: Applications to Kanji. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, pages 241–248, Stroudsburg, PA, USA. Association for Computational Linguistics.
Barthélemy, F. (1998). A morphological analyzer for Akkadian verbal forms with a model of phonetic transformations. In Proceedings of the ACL-1998 Workshop on Computational Approaches to Semitic Languages, Montréal.
Barthélemy, F. (2009). The Karamel system and Semitic languages: Structured multi-tiered morphology. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 10–18, Athens, Greece.
Borger, R. (2004). Mesopotamisches Zeichenlexikon.
Ugarit-Verlag.
Chen, K.-J. and Liu, S.-H. (1992). Word identification for Mandarin Chinese sentences. In Proceedings of the 14th Conference on Computational Linguistics – Volume 1, COLING '92, pages 101–107, Stroudsburg, PA, USA. Association for Computational Linguistics.
Cohen, J. D., Duncan, D., Snyder, D., Cooper, J., Kumar, S., Hahn, D., Chen, Y., Purnomo, B., and Graettinger, J. (2004). iClay: Digitizing cuneiform. In Proceedings of the 5th International Conference on Virtual Reality, Archaeology and Intelligent Cultural Heritage (VAST-2004), pages 135–143, Aire-la-Ville, Switzerland. Eurographics Association.
Collins, T., Woolley, S., Gehlken, E., Lewis, A., Munoz, L. H., and Ch'ng, E. (accepted). Automated reconstruction of virtual fragmented cuneiform tablets. Electronics Letters (IET).
Gelb, I. (1957). Glossary of Old Akkadian. University of
Chicago Press, Chicago, Illinois.
Ikeda, J. (2007). Early Japanese and early Akkadian writing systems: A contrastive survey of kunogenesis. In Proceedings of Origins of Early Writing Systems, Peking University, Beijing.
Islam, M. A., Inkpen, D., and Kiringa, I. (2007). A generalized approach to word segmentation using maximum length descending frequency and entropy rate. In Computational Linguistics and Intelligent Text Processing, pages 175–185. Springer.
Jaworski, W. (2008). Contents modelling of Neo-Sumerian Ur III economic text corpus. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 369–376, Manchester, UK, August. Coling 2008 Organizing Committee.
Kataja, L. and Koskenniemi, K. (1988). Finite-state description of Semitic morphology: A case study of Ancient Akkadian. In Proceedings of COLING 1988.
Khait, I. (accepted). Cuneiform Labs: Annotating Akkadian corpora. In Rencontre Assyriologique Internationale (RAI-2015), Geneva and Bern, Switzerland, June 22–26, 2015.
Lauffenburger, O. (n.d.). Akkadian dictionary. www.assyrianlanguages.org/akkadian.
Low, J. K., Ng, H. T., and Guo, W. (2005). A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 161–164.
Macks, A. (2002). Parsing Akkadian verbs with Prolog. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Philadelphia, Pennsylvania.
Mara, H., Krömker, S., Jakob, S., and Breuckmann, B. (2010). GigaMesh and Gilgamesh: 3D multiscale integral invariant cuneiform character extraction. In Alessandro Artusi, et al., editors, VAST: International Symposium on Virtual Reality, Archaeology and Intelligent Cultural Heritage. The Eurographics Association.
Palmer, D. and Burger, J. (1997). Chinese word segmentation and information retrieval. In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pages 175–178.
Papageorgiou, C. P. (1994). Japanese word segmentation by hidden Markov model. In Proceedings of the Workshop on Human Language Technology, pages 283–288. Association for Computational Linguistics.
Pengyu, L., Jingchuan, P., Du Mingming, L. X., and Lijun, J. (2014). A lexicon-corpus-based unsupervised Chinese word segmentation approach. International Journal on Smart Sensing and Intelligent Systems, 7(1).
Pevzner, L. and Hearst, M. A. (2002). A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.
Raman, A. (2006). A dictionary-augmented maximum entropy tagging approach to Chinese word segmentation.
Scaiano, M. and Inkpen, D. (2012). Getting more from segmentation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 362–366. Association for Computational Linguistics.
Snyder, B., Barzilay, R., and Knight, K. (2010). A statistical model for lost language decipherment. In Proceedings of ACL-2010, Uppsala, Sweden.
Song, D. and Sarkar, A. (2008). Training a perceptron with global and local features for Chinese word segmentation. In IJCNLP, pages 143–146. Citeseer.
Sproat, R. and Emerson, T. (2003). The first international Chinese word segmentation bakeoff. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing – Volume 17, pages 133–143. Association for Computational Linguistics.
Subodh, K., Snyder, D., Duncan, D., Cohen, J., and Cooper, J. (2003). Digital preservation of ancient cuneiform tablets using 3D-scanning. In Proceedings of the 4th International Conference on 3-D Digital Imaging and Modeling (3DIM-2003), pages 326–333. IEEE.
Tinney, S. et al. (2006). The Pennsylvania Sumerian Dictionary. http://psd.museum.upenn.edu/epsd1/.
Tyndall, S. (2012). Toward automatically assembling Hittite-language cuneiform tablet fragments into larger texts. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 243–247. Association for Computational Linguistics.
Tablan, V., Peters, W., Maynard, D., and Cunningham, H. (2006). Creating tools for morphological analysis of Sumerian. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), pages 1762–1765, Genova, Italy.