All content in this area was uploaded by Timo Homburg on Feb 23, 2019
Akkadian Word Segmentation
Timo Homburg M.Sc., Dr. Christian Chiarcos
Institute for Computer Science
Goethe University, Robert-Mayer-Str. 10, 60325 Frankfurt am Main, Germany
timo.homburg@gmx.de, chiarcos@em.uni-frankfurt.de
Abstract
We present experiments on word segmentation for Akkadian cuneiform, an ancient writing system and a language used for about 3
millennia in the ancient Near East. To the best of our knowledge, this is the first study of this kind applied to either the Akkadian language
or the cuneiform writing system. As a logosyllabic writing system, cuneiform structurally resembles East Asian writing systems,
so we employ word segmentation algorithms originally developed for Chinese and Japanese. We describe results of rule-based
algorithms, dictionary-based algorithms, and statistical and machine learning approaches. Our results indicate promising directions
for cuneiform word segmentation that can enable and improve natural language processing in this area.
Keywords: Assyriology, Cuneiform, Akkadian, Chinese, Word Segmentation, Machine Learning
1. Introduction
Word segmentation is the most elementary task in natural
language processing of written language. In most alpha-
betical writing systems, this task is commonly referred to
as tokenization and can be easily solved through the inter-
pretation of orthographical markers for word and sentence
boundaries, e.g., white spaces. Where these are lacking,
however, word segmentation is a challenging task, a classi-
cal – and successfully addressed – problem in logographic
writing systems like Chinese and logosyllabic writing sys-
tems like Japanese.
Here, we describe experiments on cuneiform, a writing system
developed in the 4th millennium BCE in Mesopotamia and
subsequently applied to various Semitic, Indo-European and
isolate languages in the region. As a logosyllabic writing
system, it shares important structural characteristics with
Chinese and Japanese (Ikeda, 2007), so that we evaluate
word segmentation methods successfully applied to these
languages. However, these languages are unrelated to those
of the Ancient Near East, so that future research will focus
on developing aspects specific to languages with cuneiform
writing.
As a writing system, cuneiform poses a number of unique
challenges:
•The same character can be read as a logograph or as a
syllable, e.g., as the logograph GURUŠ ‘young man’ or with
its phonological reading as a syllabic sign.
•As a syllabic sign, a single character can have multiple
different readings, e.g., grounded in the possible Sumerian
pronunciation(s) of the logograph or in the pronunciation of
its Akkadian translations. The same sign may be read as
dan/tan (from Akk. dannu ‘strong, powerful’), kal (from Sum.
kal ‘rare, valuable’ and kalag ‘strong’), rib (from Sum. rib
‘outstanding, strong’), etc. (Tinney et al., 2006;
Lauffenburger, n.d.; Borger, 2004).
•CVC syllables (e.g., dan) can be written as a pair of CV-VC
characters (da-an) or with a single CVC character (dan)
(Gelb, 1957, p.8, Da-an--ri vs. Dan-r-ri).
We primarily consider the Akkadian language, the domi-
nant language of the Ancient Near East from the 3rd to the
1st millennium BCE. Originally spoken in Mesopotamia, it
became the lingua franca in the Near East during the 2nd
m. BCE, with an extensive body of material comparable
only to corpus languages such as Classical Latin or An-
cient Greek. With a considerable amount of cuneiform clay
tablets not yet deciphered, and new ones being continu-
ously excavated, the automated processing of the Akkadian
language is thus of tremendous importance. Previous research
on automated digitization focused on producing 3D scans of
tablets (Sect. 2), with Optical Character Recognition (OCR)
being a logical next step in the development. Successful
cuneiform OCR, however, needs to be accompanied by
knowledge-rich NLP methods for the contextual disambiguation
of characters: One of the key characteristics of cuneiform is
that a character can be read as a logograph, as a
determinative, or as a syllabic sign (with different phonemic
values). The reading of a character is thus heavily dependent
on its context. Word segmentation approaches may thus be a key
component of any approach to cuneiform OCR.
Akkadian is the oldest attested Semitic language, and has
thus occasionally been considered in experiments on NLP
for Semitic languages, but mostly with a focus on (rule-based)
morphological analysis. To the best of our knowledge, the
present paper describes the first study of word segmentation
in Akkadian cuneiform. It thus provides a primary point of
orientation for any subsequent experiments on cuneiform
word segmentation and will be of utmost importance to fu-
ture experiments on cuneiform OCR and Akkadian NLP.
2. State of the Art
We distinguish three types of word segmentation algorithms:
•rule-based: segmentation rules derived from grammar
•dictionary-based: segmentation by lookup in a (statistically
enhanced) dictionary
•statistical/machine learning: data-driven segmentation as
learnt from segmented corpora
As shown in several SIGHAN Bakeoffs in the last decade
(Sproat and Emerson, 2003), machine learning and
dictionary-based approaches like MaxMatch (Chen and Liu, 1992)
produce reasonable results for Chinese, while rule-based
methods are commonly used as a baseline (Palmer and
Burger, 1997). In Japanese, however, rule-based algorithms
like Tango (Ando and Lee, 2000) proved to be more suc-
cessful. This is partially due to the morphological richness
of Japanese as compared to Chinese.
As a point of orientation for subsequent studies on
cuneiform, we evaluate selected approaches from these
classes in their performance on Akkadian. Neither the
Akkadian language nor cuneiform as a writing system has
been addressed in this respect before.
Along with other cuneiform languages, Akkadian has a
considerable research history in NLP. For the greatest part,
existing approaches are concerned with rule-based
morphological analyzers, e.g., Kataja and Koskenniemi (1988),
Barthélemy (1998), Macks (2002), Barthélemy (2009), Khait
(accepted) for Akkadian, or Tablan et al. (2006) for Sumerian.
As for data-driven morphological tools, the state of the art
in the field is represented by the Lemmatizer of the Open
Richly Annotated Cuneiform Corpus (ORACC),¹ which supports
manual morphological annotation for Akkadian, Sumerian and
(to a limited degree) Hittite with a lookup functionality in
the annotated corpus.
Such example-based approaches can be extended to auto-
matically transfer morphological rules through phonologi-
cal equivalences, as demonstrated by Snyder et al. (2010)
for the projection of Hebrew morphology and lexicon to
Ugaritic, another Semitic cuneiform language. As for
higher levels of linguistic analysis, we are not aware of any
tools for syntactic or semantic annotation for Akkadian;
however, semantic annotation has been considered for
administrative texts from the Sumerian period, whose highly
conventionalized structure can be exploited for concept
classification (Jaworski, 2008).
Aside from linguistic analysis, another aspect of cuneiform
languages that has recently attracted interest is the
material side of cuneiform writing, i.e., scanning and
digitizing clay tablets (Subodh et al., 2003; Cohen et al.,
2004) and reconstructing tablets and tales by automatically
combining their fragments (Collins et al., accepted; Tyndall,
2012). Recently, initial steps towards cuneiform OCR have
also been undertaken (Mara et al., 2010). As this line of
research is flourishing mostly in the field of computer
graphics, the obvious gap between both lines of research lies
in the absence of any studies concerned with the transition
from the (identified) sign to its linguistic interpretation,
a challenging task, as mentioned before.
With our paper, we describe the first experiments in this
direction, with a specific focus on segmenting character
sequences into words as a core component for future
approaches to transliteration.
¹ http://oracc.museum.upenn.edu/doc/builder/linganno/
3. Experimental Setup
3.1. Corpus Data
We use corpora from three different periods and dialects,
namely Old Babylonian, Middle Babylonian and Neo-Assyrian,
from the Cuneiform Digital Library Initiative (CDLI),²
representing most of the available texts (clay tablets) of
the given periods. The corpora were randomly split in an
80:20 ratio for training and testing purposes (on a
per-tablet, not a per-line basis). For the experiments, we
trained our segmentation algorithms on each of these language
stages and evaluated them on each language stage. For reasons
of space, we only report results for the Middle Babylonian
training corpus evaluated against the Middle Babylonian test
corpus in detail. Further experiments showing robust
performance across different language stages are presented
graphically. Additionally, we present results of classifiers
trained on corpus data of one epoch and applied to other
epochs of the same language to get an impression of the
performance of the algorithms on related data.
The CDLI ATF format contains metadata, a (word-
segmented) transliteration, and (optionally) a translation.
CDLI data always represents cuneiform in lines as found
on the clay tablets. To minimize ambiguity, Akkadian writers
tried to avoid incomplete words at the end of a line, so the
tablets themselves provide initial data on word segmentation.
From the transliteration, we restored the original cuneiform
characters in UTF-8 on the basis of a sign list that we
compiled from various resources. Non-restorable characters
were ignored and thus are not represented in the resulting
texts. This data represents our gold standard. It should be
noted that the mapping to UTF-8 cannot be trivially reversed
because of the highly ambiguous phonological and ideographic
meaning of characters.
After conversion and the removal of whitespace, segmentation
algorithms of three categories were applied. Figure 1 shows
the segmentation process.
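The conversion from a word-segmented line to an unsegmented character sequence plus gold boundaries can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `to_gold` is hypothetical, Latin letters stand in for cuneiform signs, and the sign-list lookup from transliteration to characters is assumed to have happened already.

```python
def to_gold(segmented_line: str) -> tuple[str, list[int]]:
    """Strip whitespace from a segmented line and record gold boundaries.

    Returns the unsegmented character sequence and the sorted list of
    offsets after which a word ends (the end of the line is included).
    """
    words = segmented_line.split()
    chars = "".join(words)
    boundaries = []
    offset = 0
    for word in words:
        offset += len(word)          # position right after this word
        boundaries.append(offset)
    return chars, boundaries


# Toy line: three "words" written with placeholder signs A..F.
chars, boundaries = to_gold("AB C DEF")
print(chars)        # ABCDEF
print(boundaries)   # [2, 3, 6]
```

Keeping boundaries as character offsets makes it straightforward to compare a predicted segmentation against the gold standard later on.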
Figure 1: Classification process
² http://cdli.ucla.edu/
3.2. Baseline
As our baseline we adopted the Character-As-Word algorithm
(Palmer and Burger, 1997, p.1) common as a baseline in
Chinese, in the form of an Average-Word-Length (AVG)
algorithm. The average word length L was determined
empirically by analyzing the training corpora as follows:
\[ L = \frac{\sum_{i=1}^{\#(text)} len(w_i)}{\#(text)} \]
with text being the entire corpus, len(w_i) being the length
of a word and #(text) representing the number of words in the
entire corpus. Accordingly, a sequence of characters will be
split after every L-th character.
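A minimal Python sketch of this baseline follows. The function names are hypothetical, and rounding the average to the nearest integer (with a minimum of 1) is an assumption; the paper only states that the average length was determined empirically.

```python
def average_word_length(training_lines: list[str]) -> int:
    """Average word length over a whitespace-segmented training corpus."""
    words = [w for line in training_lines for w in line.split()]
    mean = sum(len(w) for w in words) / len(words)
    return max(1, round(mean))       # assumption: round to a usable integer


def avg_segment(chars: str, length: int) -> list[str]:
    """Split an unsegmented character sequence after every `length` characters."""
    return [chars[i:i + length] for i in range(0, len(chars), length)]


train = ["AB C DEF", "GH IJ"]        # word lengths 2, 1, 3, 2, 2
L = average_word_length(train)
print(L)                             # 2
print(avg_segment("ABCDEFG", L))     # ['AB', 'CD', 'EF', 'G']
```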
3.3. Rule-based Algorithms
A first simple bigram method inserts a word break between
two characters c1 and c2 if a word break between c1 and c2
is more frequent than a non-break in the training corpus.
Similarly, a unigram segmentation between two characters
can be achieved by classifying every individual character
according to the I(O)BES scheme – (preferred) intermediate
(I), beginning (B), end (E) and single (S) characters – in
the training corpus. The prefix/suffix algorithm collects the
usages of every character in every word found in an already
segmented training corpus. For every character found in a
word, its classification in the I(O)BES scheme is recorded
and frequencies are counted for I, B, E and S respectively.
For each pair of characters, the algorithm tests whether the
probability of being E or S exceeds the probability of being
I or B, which indicates a separation. The prefix/suffix
algorithm thus includes word-level information to achieve the
segmentation.
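The bigram rule described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, and character pairs never seen in training are treated as non-breaks.

```python
from collections import Counter


def train_bigram(training_lines: list[str]) -> tuple[Counter, Counter]:
    """Count, per character pair, how often it is split or kept together."""
    breaks, joins = Counter(), Counter()
    for line in training_lines:
        words = line.split()
        for word in words:
            for a, b in zip(word, word[1:]):
                joins[(a, b)] += 1                  # pair inside one word
        for left, right in zip(words, words[1:]):
            breaks[(left[-1], right[0])] += 1       # pair across a word break
    return breaks, joins


def bigram_segment(chars: str, breaks: Counter, joins: Counter) -> list[str]:
    """Insert a break wherever a break was more frequent than a non-break."""
    if not chars:
        return []
    words, current = [], chars[0]
    for a, b in zip(chars, chars[1:]):
        if breaks[(a, b)] > joins[(a, b)]:
            words.append(current)
            current = b
        else:
            current += b
    words.append(current)
    return words


breaks, joins = train_bigram(["AB CD", "AB C", "B CD"])
print(bigram_segment("ABCD", breaks, joins))   # ['AB', 'CD']
```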
Furthermore, we apply the Tango algorithm (Ando and Lee,
2000) which uses a scoring system to determine possible
segmentations in a sliding window mechanism. The thresh-
old parameter and the window size have been empirically
chosen to fit the needs of the Akkadian language.
Finally, we compare these methods with a random segmentation
function which decides for every character whether a
segmentation should occur after it. The random number
generator is seeded with the character count of the
respective line. Though substantially worse than the
average-length baseline, it outperforms the bigram method.
3.4. Dictionary-based Algorithms
As dictionary-based approaches can, despite their simplicity,
achieve considerable segmentation performance in languages
like Chinese or Japanese, we apply the commonly used Minimum
WordCount Matching algorithm, a modified version of the LCU
Matching algorithm (Pengyu et al., 2014), the MaxMatch
algorithm (Chen and Liu, 1992), and a modified version of the
MaxMatch algorithm (Islam et al., 2007).³ The dictionaries
used in these approaches were generated from the same
training data that serves as a basis for the other
approaches. External dictionary resources were not
considered. It is important to note that we did not apply any
new-word detection, which could be used to extend the
dictionary using data from the test corpus. This may be a
topic for further refinement and could be achieved by
applying one of many statistical approaches.
³ We employ the basic version of (Chen and Liu, 1992). No
additional neighbor checking is performed.
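As an illustration of dictionary lookup, the following Python sketch shows a plain greedy forward maximum matching over a dictionary built from the training data. It is a simplified stand-in rather than the exact variant evaluated here; the function names are hypothetical, and falling back to a single character when no entry matches is an assumption.

```python
def build_dictionary(training_lines: list[str]) -> set[str]:
    """Collect every word form seen in the segmented training corpus."""
    return {w for line in training_lines for w in line.split()}


def max_match(chars: str, dictionary: set[str]) -> list[str]:
    """Greedy forward maximum matching: always take the longest known word."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(chars):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(chars) - i), 0, -1):
            candidate = chars[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)   # size 1 is the unknown-word fallback
                i += size
                break
    return words


dictionary = build_dictionary(["AB CDE", "A BC"])
print(max_match("ABCDEA", dictionary))    # ['AB', 'CDE', 'A']
```

Because the dictionary comes only from the training split, any word that occurs solely in the test data is broken into shorter known words or single characters.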
3.5. Statistics & ML
The commonly used Maximum Probability Matching algorithm
trivially maximizes the occurrence probability of a word
sequence (here, a line on a tablet) by matching words against
a frequency-tagged dictionary and returning the most probable
word segmentation
$\vec{s} \in \{ (i \mid 1 < i < n) \in \mathbb{N}^{x} \mid x < n \}$
for a character sequence (i.e., line) $c_1, \ldots, c_n$:
\[ SegMaxProb(c_1, \ldots, c_n) = \arg\max_{\vec{s}} \prod_{j=1}^{|\vec{s}|-1} P_{dict}(c_j, \ldots, c_{j+1}) \]
We normalize the dictionary probability $P_{dict}$ to values
greater than 0 to account for out-of-vocabulary words.
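One common way to compute such a maximum-probability segmentation is dynamic programming over all split points, sketched below in Python. The function names, the relative-frequency estimate and the fixed floor `oov_prob` for unknown single characters are assumptions made for illustration; the paper does not specify how the normalization was done.

```python
import math
from collections import Counter


def train_probs(training_lines: list[str]) -> dict[str, float]:
    """Relative word frequencies from the segmented training corpus."""
    counts = Counter(w for line in training_lines for w in line.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def max_prob_segment(chars: str, probs: dict[str, float],
                     oov_prob: float = 1e-6) -> list[str]:
    """Return the segmentation with the highest product of word probabilities."""
    max_len = max((len(w) for w in probs), default=1)
    n = len(chars)
    best = [-math.inf] * (n + 1)   # best log-probability of chars[:end]
    back = [0] * (n + 1)           # start offset of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = chars[start:end]
            if word in probs:
                p = probs[word]
            elif len(word) == 1:
                p = oov_prob       # assumed floor for unknown characters
            else:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    words, end = [], n
    while end > 0:                 # follow back-pointers from the line end
        words.append(chars[back[end]:end])
        end = back[end]
    return words[::-1]


probs = train_probs(["AB C", "AB C", "A BC"])
print(max_prob_segment("ABC", probs))   # ['AB', 'C']
```

In the toy example, "AB C" scores 1/3 * 1/3 and beats "A BC" with 1/6 * 1/6, so the more frequent words win.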
More advanced approaches we studied here include
instance-based and clustering algorithms (kNN, kMeans),
decision trees (C4.5), NaiveBayes and MaxEnt, sequence
labelling models (Hidden Markov Models, HMM, and Conditional
Random Fields, CRF), as well as Support Vector Machines (SVM)
and Multi-Layer Perceptrons (MLP). For most algorithms, we
relied on the implementation provided by WEKA 3.7,⁴ for HMMs
on the HMMWeka extension,⁵ for CRFs on the MALLET⁶ CRF
SimpleTagger, and for SVMs on libsvm⁷ with a polynomial
kernel.
Table 1 enumerates the feature sets we used in the exper-
iments. For the purpose of our experiments, these were
adopted without modification from the literature on Far
Eastern languages and writing systems (see references for
details) and directly applied to Akkadian cuneiform. The
motivation is to establish a sound basis for the future de-
velopment of cuneiform-specific algorithms, for which we
expect substantial refinements if language-specific features
for Akkadian are explicitly taken into account in statistical
approaches.
For all data-driven classifiers, we simplified the target
classification of the I(O)BES scheme to a binary distinction
of Class 0 (i.e., I or B: no segmentation after the current
character) and Class 1 (i.e., E or S: segmentation after the
current character). All classifiers were trained on data sets
of 10,000 instances in order to assess their performance
under resource-poor circumstances. In addition, training
against the full training set of 100,000 instances was
performed successfully for most algorithms (and is reported
below), with the exception of HMM and CRF, whose training
timed out.
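The reduction of positional tags to the binary target can be made concrete with a small Python sketch. The helper names are hypothetical and the O tag is left out, since only I, B, E and S enter the binary mapping described above.

```python
def ibes_tags(words: list[str]) -> list[str]:
    """Tag each character by its position in its word: B, I, E or S."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                                  # single-character word
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def to_binary(tags: list[str]) -> list[int]:
    """Class 1 (E, S): segment after the character; Class 0 (I, B): do not."""
    return [1 if tag in ("E", "S") else 0 for tag in tags]


tags = ibes_tags(["AB", "C", "DEF"])
print(tags)              # ['B', 'E', 'S', 'B', 'I', 'E']
print(to_binary(tags))   # [0, 1, 1, 0, 0, 1]
```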
⁴ http://www.cs.waikato.ac.nz/ml/weka/
⁵ http://doc.gold.ac.uk/~mas02mg/software/hmmweka/
⁶ http://mallet.cs.umass.edu/
⁷ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Feature set  Description                                      Algorithms
BASE         base feature set (Low et al., 2005)              C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
EXT          extended feature set (Low et al., 2005)          C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
MAXENT       MaxEnt feature set (Raman, 2006)                 C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
PERC         perceptron feature set (Song and Sarkar, 2008)   C4.5, MaxEntropy, NaiveBayes, kNN, kMeans, MLP, SVM
RED-BIG      reduced bigram feature set (Papageorgiou, 1994)  CRF, HMM
Table 1: Feature sets
3.6. Ensemble Combination
Finally, we implemented simple ensemble combination
architectures as a meta classifier, using
(a) a simple (unweighted) majority vote, tested for all
possible combinations of the individual segmentation
algorithms described above,
(b) C4.5 meta classification, and
(c) SVM meta classification.
Remarkably, the more elaborate C4.5 and SVM versions of the
meta classifier did not outperform the best majority
configuration (i.e., all classifiers except SVM), whose
results are reported in Tab. 2.
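An unweighted majority vote over binary boundary decisions can be sketched as follows. The function name is hypothetical, and resolving ties as "no boundary" is an assumption that the paper does not state.

```python
def majority_vote(predictions: list[list[int]]) -> list[int]:
    """Combine per-character 0/1 decisions of several classifiers.

    predictions[c][i] is classifier c's decision for character i.
    A boundary is predicted only if a strict majority votes for it.
    """
    n_classifiers = len(predictions)
    return [1 if 2 * sum(votes) > n_classifiers else 0
            for votes in zip(*predictions)]


votes = [
    [0, 1, 1, 0],   # classifier 1
    [0, 1, 0, 0],   # classifier 2
    [1, 1, 1, 0],   # classifier 3
]
print(majority_vote(votes))   # [0, 1, 1, 0]
```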
3.7. Evaluation Metrics
We primarily evaluate against the following conventional
metrics:
Boundary evaluation addresses character-based segmentation
per boundary (Palmer and Burger, 1997, p.176), i.e.,
precision and recall of predicted and observed boundaries
following a given character:
\[ rec_b = \frac{\#\,\text{correctly predicted boundaries}}{\#\,\text{gold boundaries}} \qquad prec_b = \frac{\#\,\text{correctly predicted boundaries}}{\#\,\text{predicted boundaries}} \]
Word boundary evaluation evaluates completely segmented
words (Palmer and Burger, 1997, p.176), a metric especially
relevant for any future practical application by
Assyriologists or philologists:
\[ rec_w = \frac{\#\,\text{correctly predicted words}}{\#\,\text{gold words}} \qquad prec_w = \frac{\#\,\text{correctly predicted words}}{\#\,\text{predicted words}} \]
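Both metrics reduce to set comparisons, as the following Python sketch shows. The helper names are hypothetical; counting the line-final boundary and combining precision and recall into a balanced F-score are assumptions made for the illustration.

```python
def boundaries(words: list[str]) -> set[int]:
    """Character offsets after which a word ends."""
    out, pos = set(), 0
    for word in words:
        pos += len(word)
        out.add(pos)
    return out


def spans(words: list[str]) -> set[tuple[int, int]]:
    """(start, end) offsets of every word; equal spans are equal words."""
    out, pos = set(), 0
    for word in words:
        out.add((pos, pos + len(word)))
        pos += len(word)
    return out


def prf(gold: set, predicted: set) -> tuple[float, float, float]:
    """Precision, recall and F-score of predicted items against gold items."""
    correct = len(gold & predicted)
    prec = correct / len(predicted) if predicted else 0.0
    rec = correct / len(gold) if gold else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f


gold = ["AB", "C", "DEF"]
pred = ["AB", "CD", "EF"]
print(prf(boundaries(gold), boundaries(pred)))   # boundary-level scores
print(prf(spans(gold), spans(pred)))             # word-level scores
```

In this toy case two of three boundaries are correct but only one of three words, which mirrors the gap between boundary and word F-scores reported in the results.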
As these metrics do not differentiate between near and far
misses of boundaries, we also employ sliding-window-based
metrics:
WindowDiff aims to avoid penalizing near-matching boundaries
too restrictively. With window size
$k = \frac{N}{2 \cdot \#segments}$, reference segmentation R,
a total number of N content units and computed boundaries C,
where $R_{i,i+k}$ and $C_{i,i+k}$ denote the number of
boundaries between positions i and i+k, it measures the
correctness of a segmentation as follows:
\[ WDiff = \frac{1}{N-k} \sum_{i=0}^{N-k} \left( |R_{i,i+k} - C_{i,i+k}| > 0 \right) \]
WinPR is an established metric in the word segmentation
community. It calculates precision and recall on the basis of
WindowDiff by defining
\[ prec_{wp} = \frac{tp_{wp}}{tp_{wp} + fp_{wp}} \qquad rec_{wp} = \frac{tp_{wp}}{tp_{wp} + fn_{wp}} \]
with the following definitions (Scaiano and Inkpen, 2012):
\[ tp_{wp} = \sum_{i=1-k}^{N} \min(R_{i,i+k}, C_{i,i+k}) \]
\[ tn_{wp} = -k(k-1) + \sum_{i=1-k}^{N} \left( k - \max(R_{i,i+k}, C_{i,i+k}) \right) \]
\[ fp_{wp} = \sum_{i=1-k}^{N} \max(0, C_{i,i+k} - R_{i,i+k}) \]
\[ fn_{wp} = \sum_{i=1-k}^{N} \max(0, R_{i,i+k} - C_{i,i+k}) \]
PK metric is a standard metric in the field of text segmen-
tation (Pevzner and Hearst, 2002).
4. Results
Table 2 and Fig. 3 provide results which are representative
of the experiments conducted. For reasons of space, only
the best-performing combinations of features and data-
driven (statistical/ML) methods are reported. Also, while
all combinations of cuneiform corpora have been tested,
we report only results obtained by training the tools on
the (training section of the) Middle Babylonian corpus and
tested on the (test section of the) Middle Babylonian cor-
pus. The general pattern, however, remains the same for all
cuneiform corpora considered, both within a language stage
(training and test corpus for the same language stage, Fig.
2), but also across language stages (e.g., Old Babylonian
tools tested on Middle Babylonian corpus).
For the Middle Babylonian tools, the scores of the
best-performing configurations obtained on the Old Babylonian
and Neo-Assyrian test sets are reported in Tab. 3.
Unsurprisingly, the scores are generally worse than for
Middle Babylonian, yet they still outperform the baseline.
Remarkably, Neo-Assyrian boundary F-scores actually seem to
improve over Middle Babylonian. This may be due to the fact
that the Neo-Assyrian corpus is more homogeneous than the
Middle Babylonian corpus, as the latter contains much
material written by non-native speakers.
Method                         Bound F-Score  Word F-Score  PK Score  WinPR F-Score
baseline (AVG length [2])      42.77          21.19         13.77     42.92
Bigram                         14.90           7.22         47.72     20.84
Pref/Suff                      34.59          10.17         23.08     34.86
Random                         23.43          13.64         45.04     22.31
Tango                          49.22          14.88         13.93     35.32
MaxMatch                       65.02          65.05          9.03     57.47
MaxMatchCombined               73.91          58.48          8.69     60.80
LCUMatching                    68.14          44.78          8.92     51.30
MinWCMatch                     72.82          59.76          8.17     59.63
MaxProb                        59.67          37.26         10.24     49.58
C4.5 (EXT 10K)                 42.66          15.33         23.12     34.01
CRF (EXT 10K)                  40.13          15.31         13.06     27.10
HMM (RED-BIG 10K)              23.10          12.98         21.64     35.55
kMeans (BASE 10K)              37.79          14.17         24.11     33.77
kNN (EXT 100K)                 49.48          14.51         13.59     37.53
MaxEnt (EXT 10K)               46.91          14.97         15.82     36.59
NBayes (EXT 100K)              49.49          14.52         13.60     37.56
Percep (MAXENT 10K)            49.51          14.51         13.59     37.51
SVM (EXT 10K)                  49.51          14.51         13.59     37.51
META (w/o SVM, majority vote)  49.42          14.31         13.59     37.47
Table 2: Middle Babylonian tools on Middle Babylonian test set
As for rule-based methods, only Tango (Ando and Lee, 2000)
outperformed the baseline, whereas dictionary-based
algorithms performed clearly better. Dictionary-based
approaches produced the best-performing segmentation, with
scores between 60% and nearly 80% depending on the metric.
Machine learning algorithms as applied here seem to face
problems in Akkadian cuneiform: The feature sets used for
segmenting Chinese seem to provide a significantly worse
result in Akkadian, and even the best-performing algorithms
hardly outperform the baseline. This indicates that a feature
set specific to Akkadian needs to be developed in future
research.
It should be noted that this picture did not drastically
change when different training and test corpora for Akkadian
language stages were employed.
Figure 2: Within-language test error for (a) Old Babylonian
tools on Old Babylonian test set, and (b) Neo-Assyrian tools
on Neo-Assyrian test set (boundary evaluation)
A note on transliteration and morphosyntax. As mentioned
before, retransforming cuneiform Unicode characters into the
correct transliteration is far from easy, as every Unicode
character may represent an ideograph, several different
syllables, etc. In the process of word segmentation, we also
conducted initial experiments on transliteration. By mapping
every character to its most frequent transliteration in the
corpus data, we were able to establish a baseline that
correctly transliterates up to 40% of the characters per
corpus. In future research, this needs to be further improved
by statistical, context-aware classifiers and tighter
integration of the word segmentation and transliteration
tasks. At the same time, word segmentation for Akkadian can
certainly benefit from integrating higher levels of NLP,
e.g., POS tagging, as this may be important for lexical
disambiguation. Taken together, this calls for a uniform
architecture capable of handling word segmentation,
transliteration and morphosyntactic analysis in a single
task. For such an integrated system, our experiments with the
segmentation module may serve as a baseline.
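The most-frequent-reading baseline for transliteration can be illustrated with a short Python sketch. The function names, the sign labels "KAL" and "AN" used as stand-ins for cuneiform characters, the toy counts and the placeholder "x" for unseen signs are all assumptions for the example, not corpus data.

```python
from collections import Counter, defaultdict


def train_transliteration(pairs: list[tuple[str, str]]) -> dict[str, str]:
    """Map every sign to the reading it most often has in the training data."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for sign, reading in pairs:
        counts[sign][reading] += 1
    return {sign: c.most_common(1)[0][0] for sign, c in counts.items()}


def transliterate(signs: list[str], table: dict[str, str],
                  unknown: str = "x") -> list[str]:
    """Context-free transliteration; unseen signs get a placeholder."""
    return [table.get(sign, unknown) for sign in signs]


# Toy (sign, reading) observations.
observed = [("KAL", "dan"), ("KAL", "dan"), ("KAL", "kal"), ("AN", "an")]
table = train_transliteration(observed)
print(transliterate(["KAL", "AN", "UD"], table))   # ['dan', 'an', 'x']
```

Since the lookup ignores context, every occurrence of an ambiguous sign receives the same reading, which is why context-aware classifiers are the natural next step.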
Method                     Bound F-Score  Word F-Score  PK Score  WinPR F-Score  language (alg.)
baseline (AVG length [2])  39.25          17.45         21.37     39.81          OBab
                           41.63          17.06         11.12     36.13          NAss
rule-based                 33.80           9.11         28.23     31.53          OBab (Prefix/Suffix)
                           51.93          16.65         11.14     36.44          NAss (Tango)
dictionary-based           62.73          53.12          2.43     56.10          OBab (MinWCMatch)
                           52.95          24.51         10.97     38.46          NAss (MinWCMatch)
statistical/ML             39.56           9.56         21.26     31.91          OBab (NaiveBayes, EXT 10K)
                           52.31          16.64         10.95     36.47          NAss (NaiveBayes, PERC 10K)
Table 3: Across-language test error: Middle Babylonian tools on O(ld) Bab(ylonian) and N(eo-)Ass(yrian) test sets
Figure 3: Middle Babylonian tools (80%) on Middle Babylonian test set(20%) (boundary evaluation)
5. Summary and Outlook
Our paper presents the first experiments on word segmentation
for Akkadian or cuneiform. It provides insights into what to
expect when applying established word segmentation algorithms
to the Akkadian language, and we showed that for Akkadian
corpora of the dimensions available, dictionary-based
approaches produce the best results in segmenting Akkadian
texts.
In general, our results for Akkadian are substantially worse
than those for Chinese and Japanese, and currently do not
live up to the needs of philologists. However, given the
high degree of ambiguity in the writing system, this result
is unsurprising, and calls for intensified research efforts in
this regard. Our experiments suggest directions for future
research and provide a point of orientation for any future
approach in this direction.
A key result is that further research on linguistic charac-
teristics of Akkadian and other cuneiform languages is
required, and – likely – only this will improve results to
production-ready quality.
Strategies to improve segmentation performance include
(1) extending dictionary-based algorithms with a new
word detection component – yet, this is methodolog-
ically problematic if the test corpus is used to improve
segmentation –,
(2) combining dictionary-based and rule-based approaches
and extending the latter with a morphological compo-
nent, and
(3) combining morphology-oriented rule-based systems
with statistical and machine learning algorithms to ben-
efit from both the context-awareness of data-driven
methods and the high precision of rule-based morpho-
logical analysis.
The most promising (and the most challenging) extension
in this regard is the development of an integrated sys-
tem that provides uniform handling for word segmentation,
transliteration and morphosyntactic annotation. Akkadian
is a morphologically complex language with a highly am-
biguous writing system. Unlike Chinese, it shows heavy
interference between morphology and word segmentation,
and unlike Japanese, it does not have a 1:1 correspon-
dence between syllabic signs and phonemic values. In
this regard, any future word segmentation algorithm for
cuneiform will have to be an integrated approach, and thus
be very different from existing approaches for either Chi-
nese or Japanese.
Remark to reviewers
Upon acceptance of this paper, code and data will be pub-
lished under open licenses (Apache license and CC-BY).
In addition to the algorithms described here, this includes
an interactive GUI for visualizing and analyzing segmen-
tation results, a cuneiform Input Method Engine, and a
near-exhaustive list of UTF8 cuneiform characters and their
readings. Providing a link would reveal author identity.
6. Bibliographical References
Ando, R. K. and Lee, L. (2000). Mostly-unsupervised sta-
tistical segmentation of japanese: Applications to kanji.
In Proceedings of the 1st North American Chapter of the
Association for Computational Linguistics Conference,
NAACL 2000, pages 241–248, Stroudsburg, PA, USA.
Association for Computational Linguistics.
Barthélemy, F. (1998). A morphological analyzer for Akkadian
verbal forms with a model of phonetic transformations. In
Proceedings of the ACL-1998 Workshop on Computational
Approaches to Semitic Languages, Montréal.
Barthélemy, F. (2009). The Karamel System and Semitic
languages: Structured multi-tiered morphology. In
Proceedings of the EACL 2009 Workshop on Computational
Approaches to Semitic Languages, pages 10–18, Athens,
Greece.
Borger, R. (2004). Mesopotamisches Zeichenlexikon.
Ugarit-Verlag.
Chen, K.-J. and Liu, S.-H. (1992). Word identification for
mandarin chinese sentences. In Proceedings of the 14th
Conference on Computational Linguistics - Volume 1,
COLING ’92, pages 101–107, Stroudsburg, PA, USA.
Association for Computational Linguistics.
Cohen, J. D., Duncan, D., Snyder, D., Cooper, J., Ku-
mar, S., Hahn, D., Chen, Y., Purnomo, B., and Graet-
tinger, J. (2004). iClay: Digitizing Cuneiform. In Pro-
ceedings of the 5th International Conference on Virtual
Reality, Archaeology and Intelligent Cultural Heritage
(VAST-2004), pages 135–143, Aire-la-Ville, Switzerland.
Eurographics Association.
Collins, T., Woolley, S., Gehlken, E., Lewis, A., Munoz,
L. H., and Ch’ng, E. (accepted). Automated reconstruc-
tion of virtual fragmented cuneiform tablets. Electronics
Letters (IET).
Gelb, I. (1957). Glossary of Old Akkadian. University of
Chicago Press, Chicago, Illinois.
Ikeda, J. (2007). Early Japanese and early Akkadian writ-
ing systems. a contrastive survey of Kunogenesis. In
Proceedings of Origins of Early Writing Systems, Peking
University, Beijing.
Islam, M. A., Inkpen, D., and Kiringa, I. (2007). A gen-
eralized approach to word segmentation using maximum
length descending frequency and entropy rate. In Com-
putational Linguistics and Intelligent Text Processing,
pages 175–185. Springer.
Jaworski, W. (2008). Contents modelling of neo-sumerian
ur III economic text corpus. In Proceedings of the 22nd
International Conference on Computational Linguistics
(Coling 2008), pages 369–376, Manchester, UK, August.
Coling 2008 Organizing Committee.
Kataja, L. and Koskenniemi, K. (1988). Finite-state de-
scription of Semitic morphology: A case study of An-
cient Akkadian. In Proceedings of COLING 1988.
Khait, I. (accepted). Cuneiform Labs: Annotating Akka-
dian corpora. In Rencontre Assyriologique Interna-
tionale (RAI-2015), Geneva and Bern, Switzerland, June
22-26, 2015.
Lauffenburger, O. (n.d.). Akkadian dictionary.
www.assyrianlanguages.org/akkadian.
Low, J. K., Ng, H. T., and Guo, W. (2005). A maximum
entropy approach to chinese word segmentation. In
Proceedings of the Fourth SIGHAN Workshop on Chinese
Language Processing, pages 161–164.
Macks, A. (2002). Parsing Akkadian Verbs with Prolog.
In Proceedings of the ACL-02 Workshop on Computa-
tional Approaches to Semitic Languages, Philadelphia,
Pennsylvania.
Mara, H., Krömker, S., Jakob, S., and Breuckmann, B.
(2010). GigaMesh and Gilgamesh 3D Multiscale In-
tegral Invariant Cuneiform Character Extraction. In
Alessandro Artusi, et al., editors, VAST: International
Symposium on Virtual Reality, Archaeology and Intelli-
gent Cultural Heritage. The Eurographics Association.
Palmer, D. and Burger, J. (1997). Chinese word segmen-
tation and information retrieval. In AAAI Spring Sym-
posium on Cross-Language Text and Speech Retrieval,
pages 175–178.
Papageorgiou, C. P. (1994). Japanese word segmentation
by hidden markov model. In Proceedings of the work-
shop on Human Language Technology, pages 283–288.
Association for Computational Linguistics.
Pengyu, L., Jingchuan, P., Du Mingming, L. X., and Lijun,
J. (2014). A lexicon-corpus-based unsupervised chinese
word segmentation approach. International Journal On
Smart Sensing And Intelligent Systems, 7(1).
Pevzner, L. and Hearst, M. A. (2002). A critique and im-
provement of an evaluation metric for text segmentation.
Computational Linguistics, 28(1):19–36.
Raman, A. (2006). A dictionary-augmented maximum en-
tropy tagging approach to chinese word segmentation.
Scaiano, M. and Inkpen, D. (2012). Getting more from
segmentation evaluation. In Proceedings of the 2012
Conference of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Language
Technologies, pages 362–366. Association for Computa-
tional Linguistics.
Snyder, B., Barzilay, R., and Knight, K. (2010). A statis-
tical model for lost language decipherment. In Proceed-
ings of ACL-2010, Uppsala, Sweden.
Song, D. and Sarkar, A. (2008). Training a perceptron with
global and local features for chinese word segmentation.
In IJCNLP, pages 143–146. Citeseer.
Sproat, R. and Emerson, T. (2003). The first interna-
tional chinese word segmentation bakeoff. In Proceed-
ings of the second SIGHAN workshop on Chinese lan-
guage processing-Volume 17, pages 133–143. Associa-
tion for Computational Linguistics.
Subodh, K., Snyder, D., Duncan, D., Cohen, J., and Cooper,
J. (2003). Digital preservation of ancient cuneiform
tablets using 3D-scanning. In Proceedings of the 4th
International Conference on 3-D Digital Imaging and
Modeling (3DIM-2003), pages 326–333. IEEE.
Tinney, S. et al. (2006). The Pennsylvania Sumerian
dictionary. http://psd.museum.upenn.edu/epsd1/.
Tyndall, S. (2012). Toward automatically assembling
Hittite-language cuneiform tablet fragments into larger
texts. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics, pages 243–
247. Association for Computational Linguistics.
Tablan, V., Peters, W., Maynard, D., and Cunningham, H.
(2006). Creating tools for morphological analysis of
Sumerian. In Proceedings of the 5th International Conference
on Language Resources and Evaluation (LREC-2006),
pages 1762–1765, Genova, Italy.