ABSTRACT: This paper presents a supervised machine learning approach to incrementally learn and segment affixes using generic background knowledge. We used a Prolog script to split affixes from Amharic words for further morphological analysis. Amharic, a Semitic language, has very complex inflectional and derivational verb morphology, with many possible prefixes and suffixes which are used to show various grammatical features. Further segmentation of the affixes into valid morphemes is a challenge addressed in this paper. The paper demonstrates how incremental, easy-to-complex examples can be used to learn such language constructs. The experiment revealed that affixes could be further segmented into valid prefixes and suffixes using a generic and robust string manipulation script, with the help of an intelligent teacher who presents examples in increasing order of complexity, allowing the system to gradually build its knowledge. The system performs the segmentation with 0.94 precision and 0.97 recall.
24th International Conference on Computational Linguistics (COLING-2012), Mumbai, India; 12/2012
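To put the two figures reported in the abstract above on a single scale, the standard F1 measure (the harmonic mean of precision and recall) can be computed from them. The sketch below is only worked arithmetic on the reported numbers; the function name `f1_score` is an illustrative assumption, and the paper itself is not stated to report F1.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
f1 = f1_score(0.94, 0.97)
print(round(f1, 3))
```

With 0.94 precision and 0.97 recall this comes to roughly 0.955.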
ABSTRACT: This paper presents a supervised machine learning approach to morphological analysis of Amharic verbs. We use Inductive Logic Programming (ILP), implemented in CLOG. CLOG learns rules as a first order predicate decision list. Amharic, an under-resourced African language, has very complex inflectional and derivational verb morphology, with four and five possible prefixes and suffixes respectively. While the affixes are used to show various grammatical features, this paper addresses only subject prefixes and suffixes. The training data used to learn the morphological rules are manually prepared according to the structure of the background predicates used for the learning process. The training resulted in 108 stem extraction and 19 root template extraction rules from the examples provided. After combining the various rules generated, the program has been tested using a test set containing 1,784 Amharic verbs. An accuracy of 86.99% has been achieved, encouraging further application of the method for complex Amharic verbs and other parts of speech.
Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL8/AfLaT2012), Istanbul, Turkey; 05/2012
ABSTRACT: We propose an unsupervised training method to guide the learning of Malay derivational morphology from a set of morphological segmentations produced by a naïve morphological analyzer. Using a morphology-based language model, we first estimate the probability of a given segmentation. We train the model with EM to find the segmentation that maximizes the probability of each morpheme. We extract the set of affix patterns produced by our algorithm and evaluate them against two references: a list of affix patterns extracted from our hand-segmented derivational wordlist and a derivational history produced by a stemmer.
Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP 2011, Chiang Mai, Thailand; 11/2011
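The scoring step mentioned in the Malay morphology abstract above (estimating the probability of a candidate segmentation and preferring the most probable one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' model: it uses a simple unigram morpheme model, omits the EM training entirely, and the probability table, the `floor` value, and all function names are hypothetical.

```python
import math

# Hypothetical morpheme probabilities, invented purely for illustration.
MORPHEME_PROB = {
    "makan": 0.02,
    "an": 0.05,
    "makanan": 0.0005,
    "ma": 0.001,
    "kanan": 0.002,
}


def segmentation_log_prob(morphemes, probs, floor=1e-9):
    """Sum of log probabilities of the morphemes (unigram assumption).

    Morphemes missing from the table get a small floor probability.
    """
    return sum(math.log(probs.get(m, floor)) for m in morphemes)


def best_segmentation(candidates, probs):
    """Return the candidate segmentation with the highest log probability."""
    return max(candidates, key=lambda seg: segmentation_log_prob(seg, probs))


# Candidate segmentations of one word, as a naive analyzer might propose.
candidates = [["makanan"], ["makan", "an"], ["ma", "kanan"]]
best = best_segmentation(candidates, MORPHEME_PROB)
print(best)
```

In an EM setting, the probability table would be re-estimated from the expected counts of morphemes across all candidate segmentations and the scoring repeated; here the table is simply fixed by hand.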
ABSTRACT: Despite its linguistic complexity, the Horn of Africa region includes several major languages with more than 5 million speakers, some crossing the borders of multiple countries. All of these languages have official status in regions or nations and are crucial for development; yet computational resources for the languages remain limited or non-existent. Since these languages are complex morphologically, software for morphological analysis and generation is a necessary first step toward nearly all other applications. This paper describes a resource for morphological analysis and generation for three of the most important languages in the Horn of Africa: Amharic, Tigrinya, and Oromo.
ABSTRACT: Developing a correct Grapheme-to-Phoneme (GTP) conversion method is a central problem in text-to-speech synthesis. In particular, deriving phonological features that are not shown in the orthography is challenging. In Amharic, geminates and epenthetic vowels are crucial for proper pronunciation, but neither is shown in the orthography. This paper describes an architecture, a preprocessing morphological analyzer integrated into an Amharic Text to Speech (AmhTTS) system, to convert Amharic Unicode text into a phonemic specification of pronunciation. The study mainly focused on disambiguating gemination and vowel epenthesis, which are the significant problems in developing an Amharic TTS system. The evaluation test on 666 words shows that the analyzer assigns geminates correctly (100%). Our approach is suitable for languages like Amharic with rich morphology and can be customized to other languages.
ABSTRACT: Extensible Dependency Grammar (XDG; Debusmann, 2007) is a flexible, modular dependency grammar framework in which sentence analyses consist of multigraphs and processing takes the form of constraint satisfaction. This paper shows how XDG lends itself to grammar-driven machine translation and introduces the machinery necessary for synchronous XDG. Since the approach relies on a shared semantics, it resembles interlingua MT. It differs in that there are no separate analysis and generation phases. Rather, translation consists of the simultaneous analysis and generation of a single source-target sentence.
ABSTRACT: Computer-assisted language learning is by now so common around the world as to be something of a default, and the teaching of the indigenous languages of the Americas is already benefiting from the new technology. Intelligent computer-assisted language learning relies on software that has relatively sophisticated models of the target language and/or the learner. An example is the use of a program that has an explicit model of some aspect of the grammar of the target language and can analyze or generate words or sentences. Many indigenous languages of the Americas are characterized by complex morphology, and morphology must play a significant role in the instruction of these languages. This paper describes how morphological analyzers and generators can handle the complex morphology of languages such as K'iche' and Quechua and discusses a potential application of this technology to the teaching of such languages. Computer-Assisted Language Learning: In recent years, computers have become so important in language teaching that it is hard to imagine a class without them. Students use computers to do exercises practicing what they have learned in class, they access documents from the Internet, they interact with other learners or with native speakers of the target language on the Internet, and they write papers with word processing software that may be specially adapted to second language learning. The field of computer-assisted language learning (CALL) has its own conferences and its own journals: CALICO Journal, Computer Assisted Language Learning, and Language Learning and Technology. Computers are even a part of the language curriculum in relatively impoverished parts of the world, including regions where indigenous languages are taught as
ABSTRACT: Resource-poor languages may suffer from a lack of any of the basic resources that are fundamental to computational linguistics, including an adequate digital lexicon. Given the relatively small corpus of texts that exists for such languages, extending the lexicon presents a challenge. Languages with complex morphology present a special case, however, because individual words in these languages provide a great deal of information about the grammatical properties of the roots that they are based on. Given a morphological analyzer, it is even possible to extract novel roots from words. In this paper, we look at the case of Tigrinya, a Semitic language with limited lexical resources for which a morphological analyzer is available. It is shown that this analyzer applied to the list of more than 200,000 Tigrinya words that is extracted by a web crawler can extend the lexicon in two ways, by adding new roots and by inferring some of the derivational constraints that apply to known roots.
Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta; 01/2010
ABSTRACT: The purpose of language is to encode information, so that it can be communicated. Both the producer and the comprehender of a communication want the encoding to be simple. However, they have competing concerns as well. The producer desires conciseness and the comprehender desires fidelity. This paper argues that the Minimum Description Length Principle (MDL) captures these two pressures on language. A genetic algorithm is used to evolve languages that take the form of finite-state transducers, using MDL as a fitness metric. The languages that emerge are shown to have the ability to generalize beyond their initial training scope, suggesting that when selecting to satisfy MDL one is implicitly selecting for compositional languages.
ABSTRACT: There has been little work on computational grammars for Amharic or other Ethio-Semitic languages and their use for parsing and generation. This paper introduces a grammar for a fragment of Amharic within the Extensible Dependency Grammar (XDG) framework of Debusmann. A language such as Amharic presents special challenges for the design of a dependency grammar because of the complex morphology and agreement constraints. The paper describes how a morphological analyzer for the language can be integrated into the grammar, introduces empty nodes as a solution to the problem of null subjects and objects, and extends the agreement principle of XDG in several ways to handle verb agreement with objects as well as subjects and the constraints governing relative clause verbs. It is shown that XDG's multiple dimensions lend themselves to a new approach to relative clauses in the language. The introduced extensions to XDG are also applicable to other Ethio-Semitic languages.
ABSTRACT: The ontological distinction between discrete individuated objects and continuous substances, and the way this distinction is expressed in different languages, has been a fertile area for examining the relation between language and thought. In this paper we combine simulations and a cross-linguistic word learning task as a way to gain insight into the nature of the learning mechanisms involved in word learning. First, we look at the effect of the different correlational structures on novel generalizations with two kinds of learning tasks implemented in neural networks: prediction and correlation. Second, we look at English- and Spanish-speaking 2-3-year-olds' novel noun generalizations, and find that count/mass syntax has a stronger effect on Spanish- than on English-speaking children's novel noun generalizations, consistent with the predicting networks. The results suggest that it is not just the correlational structure of different linguistic cues that will determine how they are learned, but the specific learning mechanism and task in which they are involved.
Language and Cognition 10/2009; 1(2):197-217. DOI:10.1515/LANGCOG.2009.010
ABSTRACT: This paper presents an application of finite state transducers weighted with feature structure descriptions, following Amtrup (2003), to the morphology of the Semitic language Tigrinya. It is shown that feature-structure weights provide an efficient way of handling the templatic morphology that characterizes Semitic verb stems as well as the long-distance dependencies characterizing the complex Tigrinya verb morphotactics. A relatively complete computational implementation of Tigrinya verb morphology is described.
EACL 2009, 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, March 30 - April 3, 2009, Athens, Greece; 01/2009
ABSTRACT: The embodiment hypothesis is the idea that intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity. We offer six lessons for developing embodied intelligent agents suggested by research in developmental psychology. We argue that starting as a baby grounded in a physical, social, and linguistic world is crucial to the development of the flexible and inventive intelligence that characterizes humankind.
Artificial Life 12/2005; 11(1-2):13-29. DOI:10.1162/1064546053278973
ABSTRACT: Human language exhibits mainly arbitrary relationships between the forms and meanings of words. Why would this be so? In this paper I argue that arbitrariness becomes necessary as the number of words increases. I also discuss the effectiveness of competitive learning for acquiring lexicons that are arbitrary in this sense. Finally, I consider some implications of this perspective for arbitrariness and iconicity in language acquisition.
ABSTRACT: Introduction: We are interested in modeling an aspect of human rhythm perception and production called beat induction. Roughly, beat induction consists of finding the downbeats in a metrical signal. The most common example of beat induction in human performance is foot tapping to music, and one way to state our goal is that we want to build feet which can tap to the radio as well as people do. However, we are interested in more than the end-state of adult rhythmical behavior; we also focus on the path that people take in perfecting this skill. We build embodied models, ones with actual robotic components that interact with and constrain computational components. This commitment to a physical model of beat induction might seem wrong-headed. After all, beat induction is a perceptual phenomenon in adults, not necessarily involving motor control at all. But we believe that the interactions between body and brain in developing infants and toddlers cannot
ABSTRACT: Children generalize nouns in ways that are consistent with the referent's ontological and/or grammatical kind. In other words, children generalize a new noun based on both the perceptual properties of the referent and the linguistic properties of the noun (Soja, Carey, & Spelke, 1991; Jones & Smith, 1998; Soja, 1992; Smith, 1995). Cross-linguistic studies have shown that systematic differences in the structures of different languages are reflected in children's novel noun generalizations (Imai & Gentner, 1997; Gathercole & Min, 1997; Yoshida & Smith, 1999). One of the differences that has been studied is the ontological object/substance distinction as it relates to the syntactic count/mass distinction. In this paper we look at the effect of mass/count syntax and perceptual cues concerning solidity on English- and Spanish-speaking children's generalization of new nouns. The task used to study this is the Novel Noun Extension Task. In this task, the child is shown an exemplar and the exemplar is labeled. The child is then asked what other things, matching the exemplar on different dimensions, can be called by the same name. Previous research has shown that children extend the name of a solid object to other objects of the same shape and the name of a nonsolid substance to other shapes made out of the same material. (Soja et al
ABSTRACT: Most theories of language processing and acquisition make the assumption that perception and comprehension are related to production, but few have anything to say about how. This paper describes a performance-oriented connectionist model of the acquisition of morphology in which production builds on representations which develop during the learning of word recognition. Using artificial language stimuli embodying simple suffixation, prefixation, and template rules, I demonstrate that the model generalizes to novel combinations of roots and inflections for both word recognition and production. I argue that the capacity of connectionist networks to develop intermediate distributed representations which not only enable the solving of the task at hand but also facilitate another task offers a plausible account of how comprehension and production come to share phonological knowledge as words are learned. Introduction: Language learners must acquire both the ability to comprehend language and...
ABSTRACT: This paper describes a modular connectionist model of the acquisition of receptive inflectional morphology. The model takes inputs in the form of phones one at a time and outputs the associated roots and inflections. In its simplest version, the network consists of separate simple recurrent subnetworks for root and inflection identification; both networks take the phone sequence as inputs. It is shown that the performance of the two separate modular networks is superior to a single network responsible for both root and inflection identification. In a more elaborate version of the model, the network learns to use separate hidden-layer modules to solve the separate tasks of root and inflection identification.