Book

Formalizing Natural Languages: The NooJ Approach

Authors: Max Silberztein

Abstract

This book is at the very heart of linguistics. It provides the theoretical and methodological framework needed to create a successful linguistic project. The author gives linguists tools to help them formalize natural languages and to aid in building software able to automatically process texts written in natural language (Natural Language Processing, or NLP). Computers are a vital tool for this, since characterizing a phenomenon with mathematical rules amounts to formalizing it. NooJ – a linguistic development environment developed by the author – is described and applied to practical examples of NLP.
... However, nominal ellipsis has not been as meticulously studied as verbal ellipsis has. To verify the computational representation of this syntactic phenomenon, two tools were created in NooJ (Silberztein, 2005, 2016): an electronic dictionary and a computational grammar. Both were evaluated in a 423,000-word corpus from the medical domain. ...
... Although a computational component is presented, the focus of attention is the grammatical structure and how it is a suitable instrument for solving some of the problems of the automatic identification of these and similar constructions. For this purpose, NooJ (Silberztein, 2005, 2016) was used, a software whose objective is to describe exhaustively all the sentences of a language by means of the automatic recognition of written texts. This formalization is applied to a corpus of texts written in natural language belonging to the medical domain. ...
Article
Full-text available
This work studies nominal ellipsis in Spanish from the general postulates of Generative Grammar (Chomsky, 1965, 1995) and the Theory of Partial Identity (Saab, 2008, 2019) and, building on that account, represents its scope in a computational grammar capable not only of identifying these constructions but also of restoring the corresponding elided nominal element. Although a computational component is presented, the focus of attention is the grammatical structure and how it is a suitable instrument for solving some of the problems of the automatic identification of these and similar constructions. Nominal ellipsis, however, has not been studied as thoroughly as verbal ellipsis. To test the computational representation of this syntactic phenomenon, two tools were created in NooJ (Silberztein, 2005, 2016): an electronic dictionary and a computational grammar. Both were evaluated on a 423,000-word corpus from the medical domain. It is important to mention that this computational grammar was not trained or tuned on any previous corpus. A manual sweep identified 355 ellipses, of which a total of 335 were detected. The results obtained were: 99.11% precision, 94.36% recall, and 96.68% F-measure.
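As a quick arithmetic check of the figures reported above (355 manually identified ellipses, 335 detected, 99.11% precision, 94.36% recall, 96.68% F-measure), the following minimal Python sketch recomputes the three metrics; the false-positive count of 3 is an inference from the reported precision, not a figure stated in the abstract.

```python
# Evaluation figures reported for the nominal-ellipsis grammar (medical corpus).
gold = 355               # ellipses found in the manual sweep
detected_correct = 335   # true positives reported
false_positives = 3      # inferred: 335 / (335 + 3) is roughly 0.9911

precision = detected_correct / (detected_correct + false_positives)
recall = detected_correct / gold
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.2%}")   # ~99.1 %
print(f"recall    = {recall:.2%}")      # ~94.4 %
print(f"F-measure = {f_measure:.2%}")   # ~96.7 %
```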
... One key feature of NooJ is that all linguistic descriptions are reversible, i.e., both a parser (to recognize sentences) and a generator (to produce sentences) can use them. In this manner, by combining a parser and a generator and applying them to a syntactic grammar, we can build a system that takes a sentence as input and produces all the sentences that share the same lexical material as the original sentence (Silberztein, 2016). We take the sentence: La misericordia vive in te (Mercy lives in you). ...
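The idea of combining a parser and a generator over the same grammar can be illustrated with a toy sketch (this is not NooJ itself): given the lexical material of one sentence and a couple of hypothetical patterns, it enumerates the word orders the patterns accept. The mini-lexicon and the patterns below are illustrative assumptions, not the grammar used by the cited work.

```python
from itertools import permutations

# Toy POS lexicon for the example sentence "La misericordia vive in te" (illustrative only).
LEXICON = {
    "la": "DET", "misericordia": "N", "vive": "V", "in": "PREP", "te": "PRO",
}

# Two hypothetical sentence patterns (roughly N0 V PP and its inversion).
PATTERNS = [("DET", "N", "V", "PREP", "PRO"),
            ("PREP", "PRO", "V", "DET", "N")]   # e.g. "in te vive la misericordia"

def generate(words):
    """Yield every permutation of the lexical material that matches a pattern."""
    for order in permutations(words):
        tags = tuple(LEXICON[w] for w in order)
        if tags in PATTERNS:
            yield " ".join(order)

print(list(generate(["la", "misericordia", "vive", "in", "te"])))
```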
... Silberztein (2016), p. 1 ...
Article
Full-text available
The diversity of languages in the digital realm has led to the creation of new communication models which incorporate formal, synthetic, and approved-by-areas codes for smart communication. The emblem of our society is the Mood virtual communicator: this replaces words and sentences with algorithms and acronyms. Our course is heading inexorably towards digital codes that transmit pre-established semantics generated by a fixed sentence. Thus, it makes sense to discuss communication styles. And what does education consist of? Although we can take sides with fixed and standardized synthetic communication, we also call for respect for traditional and cultural values in individual societies. Still, we call into question the methodologies of study and reference models which are currently used for virtual communication. The disciplines of theoretical linguistics, computational linguistics, mathematics, quantum physics, sociology, psychology, pedagogical theories, and other related subjects are all included in Social Sciences and Digital Media. This work aims to present an unwavering logical progression that transitions from normative grammar to formal grammar, and then to examine the features of a poetic text, such as Dante Alighieri's Divine Comedy, from several perspectives.
... To this end, we will first collect and study these expressions. We will then create their lexicon-grammar tables, which we will implement within a linguistic platform such as NooJ by transforming them into dictionaries [22] and building their corresponding syntactic grammars [23], [24]. This framework will enable the identification and analysis of these expressions in texts and corpora and can be used in natural language processing applications, including automatic translation. ...
... Initially, each table must undergo conversion into a NooJ dictionary. Subsequently, a syntactic grammar for each table needs to be constructed, utilizing the linguistic knowledge embedded in the table to identify sentences [23], [24]. It is important to note that both the dictionary and the syntactic grammar should share the same name and be located in the "Lexical Analysis" folder within the NooJ platform. ...
Article
Full-text available
Frozen expressions hold significant importance in the field of natural language processing, attracting considerable attention from researchers across various languages in recent years. The Arabic language, in particular, boasts a wealth of frozen expressions inherited from the pre-Islamic and early Islamic periods, with persistent usage to the present day. This linguistic richness has motivated researchers to systematically collect, classify, and elucidate these expressions. Various classifications have emerged, addressing aspects such as continuity, discontinuity, allowance for variations, and restriction from variations. Our aim is to produce lexicon-grammar tables of discontinuous Arabic frozen expressions and implement them. Our approach involves the meticulous collection and study of these expressions, followed by the transformation of their lexicon-grammar tables into dictionaries and syntactic grammars within the NooJ platform. This methodology allows us to recognize and annotate these expressions in texts and corpora, even when they exhibit discontinuity. Such recognition has the potential to address several challenges in automatic natural language processing, including the area of automatic translation.
... In a parallel effort for the French language, Mathieu [7]−[10] constructed lexicon-grammar tables specifically for psychological verbs to establish a feeling system that would facilitate the recognition of these verbs in texts. Subsequently, Silberztein [11], [12] continued Gross's work: he created additional tables to complete the French lexicon and also integrated the theory of lexicon-grammar tables into the NooJ platform, allowing it to take advantage of this innovative approach. In a related context, Tolone [13]−[16] made an effort to collect existing works on lexicon-grammar tables for the French language. ...
... NooJ is a linguistic development environment that allows the modeling of linguistic knowledge through electronic dictionaries and grammars to make it exploitable by the machine. To integrate the lexicon-grammar tables into this platform, Silberztein [12] proposed to transform them into dictionaries and syntactic grammars, as illustrated in Figure 2. His idea is to build, for each table, a dictionary and a syntactic grammar, both having the same name, preferably the name of the table to which they refer. This dictionary and this syntactic grammar must be created in the same folder on the NooJ platform, namely "Lexical Analysis". ...
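A minimal sketch of the table-to-dictionary step described above, assuming a hypothetical CSV layout (first column = lemma, remaining columns = +/- properties) and a simplified NooJ-style .dic line format; the actual conversion used by the cited authors may differ.

```python
import csv

def table_to_dic(table_path, dic_path, pos="V"):
    """Turn each row of a lexicon-grammar table into one dictionary line.

    Assumed CSV layout: first column holds the lemma, the remaining columns hold
    properties whose cells contain '+' or '-'.  Output line: lemma,POS+Prop1+Prop2...
    """
    with open(table_path, newline="", encoding="utf-8") as src, \
         open(dic_path, "w", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        for row in reader:
            lemma = row[reader.fieldnames[0]]
            props = [col for col in reader.fieldnames[1:] if row[col].strip() == "+"]
            dst.write(lemma + "," + pos + "".join("+" + p for p in props) + "\n")

# table_to_dic("Table_36.csv", "Table_36.dic")  # dictionary and grammar share the table's name
```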
Article
Full-text available
The lexicon-grammar approach is a very important linguistic approach in automatic natural language processing (NLP). It allows for the description of the lexicon of the language through readable and intuitive tables for human manual editing. However, the automatic use of the lexicon-grammar tables in the automatic NLP platforms remains difficult, given the incompatibility between the codes used to represent the properties in the lexicon-grammar tables and those used to represent the properties in the automatic NLP platforms. In this work, we present our method of standardizing the lexicon-grammar tables for the French language, since they constitute very rich lexical, syntactic, and semantic linguistic resources. First, we standardize their properties so that they can be compatible with those used in the NLP platforms. Then, to implement the standardized tables, we used a linguistic platform such as NooJ. For that, we describe the process of integrating these tables into this platform through the automatic generation of the dictionaries from these tables. Finally, to test the efficiency of the generated dictionaries, we create for some of them syntactic grammars that take into account all the grammatical, syntactic, and semantic information contained in the dictionaries.
... generation [4,8,30,31]. Various approaches towards automated inflection have been used to deal with particular aspects of inflection [7,33] in predefined languages [11,12,24,18,23] or in an unspecified inflected language [10,27]. ...
... Despite substantial recent progress in the field [9,3,18,27,28,32], automatic inflection and automatic text generation still represent a problem of formidable computational complexity for many natural languages in the world. Most state-of-the-art approaches make use of extensive manually annotated corpora that currently exist for all major languages [26]. ...
Preprint
Full-text available
We present a set of deterministic algorithms for Russian inflection and automated text synthesis. These algorithms are implemented in a publicly available web service, www.passare.ru. This service provides functions for inflection of single words, word matching, and synthesis of grammatically correct Russian text. Selected code and datasets are available at https://github.com/passare-ru/PassareFunctions/. The performance of the inflectional functions has been tested against OpenCorpora, an annotated corpus of Russian, compared with that of other solutions, and used for estimating the morphological variability and complexity of different parts of speech in Russian.
... These resources are grouped into four modules: the Annotator, the Guesser, the Improver, and the Disambiguator (see Prihantoro 2021: forthcoming). SANTI-morf is implemented using NooJ 4 (Silberztein 2003; Silberztein 2016), a finite-state based text analyser program. ...
Conference Paper
Foreign terms in our acoustic data for various dialects in Indonesia, such as Javanese, Sundanese, Batak, and Minangnese, have their own unique pronunciation patterns when they are notated in a pronunciation lexicon. Based on the 2010 population census data by BPS (Badan Pusat Statistik; Statistics Indonesia), the Javanese people comprise 40 percent of the total population (Statistik, B. P., 2011). In this paper, we discuss the development of a speech corpus to examine the pronunciation patterns of foreign terms by Indonesians. It turned out that the number of Javanese speakers also made up a similar proportion in our speakers' data. We propose a lexicon development method for ASR (automatic speech recognition) modeling for medical dictation by mapping the pronunciation patterns of foreign terms. We mapped the pronunciation patterns of medical technical terms based on the recorded data of 122 speakers with various dialects. We identified speakers with Javanese dialects and made a custom lexicon file consisting of pronunciation data for the standard Indonesian and Javanese dialects. The experiment results show that the ASR model built with a combined standard Indonesian and Javanese dialect lexicon has better accuracy than the ASR model built with a common Indonesian dialect lexicon. We hope that the proposed method can be used to build a lexicon for an ASR model intended for a multi-dialect community.
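The custom lexicon construction described above can be sketched as a simple merge of two pronunciation lexicons, assuming the common one-entry-per-line "WORD<TAB>PHONEMES" format; the file names and the decision to keep several variant pronunciations per word are illustrative assumptions, not the authors' exact procedure.

```python
from collections import defaultdict

def load_lexicon(path):
    """Read 'word<TAB>phoneme sequence' lines into {word: set of pronunciations}."""
    lex = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *prons = line.rstrip("\n").split("\t")
            if prons:
                lex[word].add(prons[0])
    return lex

def merge_lexicons(standard_path, dialect_path, out_path):
    """Keep standard-Indonesian pronunciations and add Javanese variants as extra entries."""
    merged = load_lexicon(standard_path)
    for word, prons in load_lexicon(dialect_path).items():
        merged[word] |= prons
    with open(out_path, "w", encoding="utf-8") as out:
        for word in sorted(merged):
            for pron in sorted(merged[word]):
                out.write(f"{word}\t{pron}\n")

# merge_lexicons("lexicon_id.txt", "lexicon_jv.txt", "lexicon_combined.txt")
```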
... In order to determine how we express that something is within the domain of someone's responsibility and/or that someone has failed to do something or has done something wrong, we use the NooJ platform (Silberztein, 2016) to create rules designed to detect linguistic constructions used for this purpose. More precisely, we use NooJ for the construction of a set of rules that primarily aim to detect the usage of the Croatian lexemes odgovornost [responsibility] and krivnja [guilt] in this corpus. ...
... (a) mora nečiju krivnju dokazati [literally: must someone's guilt prove]; (b) govori o ukidanju krivnji [literally: talks about suppressing guilt]; (c) krivaca za to nema [literally: culprits for that aren't] ...
Chapter
This paper deals with the analysis of political discourse in Croatia, more precisely, it aims to determine how dissatisfaction is expressed with the attitudes represented by political rivals. We focus on the detection of linguistic means used to show disagreement with decisions or actions taken by parties or individuals considered political and/or ideological opponents. We are particularly interested in the means used by speakers to indicate that someone has failed to do something that is under his/her responsibility and is, therefore, guilty of this omission. In other words, we want to determine how the concept of responsibility is lexicalized, how it is signaled that there is a failure in someone’s responsibility, and, finally, that someone is therefore to be blamed for that omission or even transgression. For this purpose, we use a large corpus of texts, with over 127 million tokens, consisting of transcripts of plenary debates from the Croatian Parliament since 2003. We use NooJ for the construction of a set of rules that aim to detect the usage of the Croatian lexemes odgovornost [responsibility] and krivnja [guilt] in this corpus. Since Croatian is rich in terms of word formation, a set of rules is designed to capture the usage of derived words morphologically related to these nouns. In data analysis, we take into account the political orientation of MPs, i.e. their affiliation with left, right, or centrist parties, the usage of various linguistic constructions/frames related to responsibility and guilt as well as periods in which they were used.
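A rough, hedged sketch of the kind of rule set described above, approximated here with Python regular expressions over plain text rather than NooJ grammars; the stems and the handful of derived forms of odgovornost and krivnja are illustrative, not the authors' full morphological inventory.

```python
import re

# Illustrative stems covering some derivations of odgovornost [responsibility]
# and krivnja [guilt]; a real NooJ grammar would use the morphological dictionary instead.
PATTERNS = {
    "responsibility": re.compile(r"\b(ne)?odgovor(nost\w*|n\w*|an)\b", re.IGNORECASE),
    "guilt": re.compile(r"\b(krivnj\w*|krivic\w*|kriv(ac|c\w*)?|okrivlj\w*)\b", re.IGNORECASE),
}

def annotate(sentence):
    """Return (concept, matched form) pairs found in one sentence."""
    hits = []
    for concept, pattern in PATTERNS.items():
        hits += [(concept, m.group(0)) for m in pattern.finditer(sentence)]
    return hits

print(annotate("Tko snosi odgovornost, taj mora nečiju krivnju dokazati."))
```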
... The CETEHIPL focuses on the pedagogical application of the NooJ tool created by Silberztein [4]. The central ideas of this proposal are developed in our book Aprendo con NooJ [3]. ...
... This research is an extension of our previous work, primarily through the augmentation of our grammatical analyses [1]. We achieved this by creating 40 syntactic grammars within the NooJ platform [35]. This extension follows the principles of the expansive simple Arabic sentence parsing methodology introduced by Bourahma et al. [18]. ...
Article
Full-text available
Complex Arabic sentences, especially those containing Arabic psychological verbs, follow a common underlying structure characterized by two essential components: the predicate and the subject. In addition, there are two optional elements: the head and the complement. These sentences, rooted in basic noun phrases (NPs), can be expanded within the predicate, subject, or complement, resulting in compound structures. This study aims to develop a syntactic analyzer for parsing complex sentences containing Arabic psychological verbs. To achieve this, we will use the dictionary generated from the lexicon-grammar table of Arabic psychological verbs, which contains all lexical, syntactic, semantic, and transformational information related to these verbs. Then, we will extend an existing analyzer to recognize and label all grammatical structures within complex sentences containing Arabic psychological verbs. Finally, we will evaluate the efficiency of this analyzer through tests on different texts and corpora.
... NooJ is a linguistic development platform that is used for the purpose of formalizing natural languages [19]. The software offers a range of resources for constructing, evaluating, and managing highly structured representations of natural languages. ...
Article
Full-text available
This paper outlines the implementation of a spell checker for the Arabic language, leveraging the capabilities of NooJ and its functionality, specifically noojapply. In this paper, we shall proceed to provide clear definitions and comprehensive descriptions of several categories of spelling errors. Next, we will provide a comprehensive introduction to the NooJ platform and its command-line utility, noojapply. In the subsequent section, we shall outline the four main phases of our spell checker prototype. We intend to develop a local grammar in NooJ for the purpose of error detection. Afterwards, a morphological grammar and a local grammar will be created in NooJ with the aim of providing an exhaustive list of possible corrections. Following that, a revised algorithm will be employed to arrange these candidates in descending order of ranking. Subsequently, a web user interface will be developed to visually represent our research efforts. Finally, we will proceed to showcase a series of tests and evaluations conducted on our prototype, Al Mudaqiq.
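The candidate-ranking phase can be illustrated with a generic similarity-based ranking; this is only a stand-in for the revised algorithm mentioned in the abstract, and the misspelled form and candidate list below are hypothetical.

```python
from difflib import SequenceMatcher

def rank_candidates(misspelled, candidates):
    """Order correction candidates by descending similarity to the misspelled form.

    SequenceMatcher is used here as a simple stand-in for the paper's revised ranking algorithm.
    """
    scored = [(SequenceMatcher(None, misspelled, c).ratio(), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True)]

# Hypothetical candidates, as if produced by the morphological and local grammars:
print(rank_candidates("كتتب", ["كتب", "كتاب", "كاتب", "مكتب"]))
```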
... In [16], the authors introduced a QA system called QASAL implemented with the NooJ linguistic engine [17] to extract the main target of a question using its lexical patterns. Their system includes a morphological analyzer, an automatic annotator, a linguistic research tool, and an electronic dictionary. ...
Chapter
Recently, deep learning-based contextualized word representations have made substantial advancements in enhancing the efficiency of various natural language processing (NLP) applications. However, only limited efforts have been dedicated to employing these representations for the development of Arabic open-domain question-answering (QA) systems, which are an indispensable component of conversational agents such as ChatGPT. In this study, we address this gap by delving into the Bert architecture to create a pre-trained Arabic Bert model. Furthermore, we assess the performance of this model in constructing a QA system by comparing its performance with that of a multilingual Bert model. The experimental results show that our AraQA_Bert_SL model, fine-tuned on the weights of a single-language pre-trained model, outperforms existing systems, boasting an F1 score of 90.6% and a pRR score of 93.7%. This achievement surpasses the performance of the AraQA_Bert_ML model, which relies on a multilingual pre-trained model. Notably, our approach significantly reduces the computational costs associated with the process of Bert fine-tuning.
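A minimal sketch of querying a fine-tuned extractive QA model with the Hugging Face transformers pipeline; the model identifier is a placeholder standing in for the authors' fine-tuned AraQA_Bert_SL checkpoint, which is not published here, and the question/context pair is invented for illustration.

```python
from transformers import pipeline

# Placeholder identifier: substitute the actual fine-tuned Arabic QA checkpoint.
qa = pipeline("question-answering", model="path/to/araqa-bert-sl")

result = qa(
    question="ما هي عاصمة المغرب؟",                 # "What is the capital of Morocco?"
    context="الرباط هي عاصمة المملكة المغربية.",     # "Rabat is the capital of the Kingdom of Morocco."
)
print(result["answer"], result["score"])
```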
... An important feature of NooJ is that all linguistic descriptions are reversible, i.e. both a parser (to recognize sentences) and a generator (to produce sentences) can use them. In this manner, in line with (Silberztein 2012, 2016), we can show and build, by combining a parser and a generator and applying them to a syntactic grammar, a system that takes a sentence as input and produces all the sentences that share the same lexical material as the original expression. The second one can be implemented in NooJ via the following grammar: this chart uses three variables, $N0, $V, and $N1. ...
Article
Full-text available
Smart mobility has a positive impact on the lives of our cities through new technological solutions: an economy driven by an immense amount of data that allows us to engage in new business, and a new cultural perspective and a new education approach where knowledge merges and generates new knowledge osmotically. We explore teaching strategies and excellent skills which can meet the needs of higher education, such as the latest generation of technologies used in business, social and personal life, as well as in institutions and colleges of higher education. Technical fields like scientific and distributed informatics and computational linguistics take on new connotations as knowledge is reformulated and disciplines acquire new meanings. The research team aims to study the characteristics of LSP language communication systems, and to describe and scientifically validate these characteristics through conjectures and refutations of assumptions proposed by linguists, computer specialists and technologists. The phases of the research are divided into several chapters: a first phase of textual production analysis, descriptions, and comparisons, with a focus on new text production techniques; a second phase of data processing in NLG, covering environments and processing techniques; and a third phase of validation and emotional techniques in the advertising text. Keywords: scientific computer science, computational sciences, high-performance computing, text production techniques, data processing in NLG.
... The ATN formalism can be used to describe fairly complicated and deep syntactic dependencies, especially for a recursive system, in a relatively intuitive way that is easy to implement. The use of finite-state technology has been the subject of study in several research works, such as the NooJ platform (Silberztein, M., 2016), (Hadad, A.; Benghezala, H.; Ghenima, M., 2007), (Bataineh, B.M.; Bataineh, E.A., 2009). One distinguishes an initial state and a set of final states. Unlike finite-state automata, the label of an arc can be either a symbol of the terminal alphabet (lexical classes: article, noun, verb, etc.), ...
... On the contrary, English is the most common language across the globe as a medium of sharing information among citizens, not only for administrative work but also for exchanging sentiments, emotions, ideas, and actions over global media (social platforms). In the vast sector of Information Technology, English is preferred over other natural languages since English text encoded with standard ASCII symbols is relatively simple for computers to process [59]. However, those who are less conversant with English find it difficult to cope and often need proper translation/interpretation for clarity at every step. ...
Article
Full-text available
A Machine Translation System (MTS) serves as an effective tool for communication by translating text or speech from one language to another. Recently, neural machine translation (NMT) has become popular for its performance and cost-effectiveness. However, NMT systems are restricted in translating low-resource languages, as they require huge quantities of data to learn useful mappings across languages. The need for an efficient translation system becomes obvious in a large multilingual environment like India. Indian Languages (ILs) are still treated as low-resource languages due to the unavailability of corpora. In order to address such asymmetric nature, a multilingual neural machine translation (MNMT) system evolves as an ideal approach in this direction. MNMT translates many languages using a single model, which is extremely useful in terms of the training process and lowering online maintenance costs. It is also helpful for improving low-resource translation. In this paper, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, i.e. English-Indic (one-to-many) and Indic-English (many-to-one), with a shared encoder-decoder containing 15 language pairs (30 translation directions). Since most IL pairs have a scanty amount of parallel corpora, not sufficient for training any machine translation model, we explore various augmentation strategies to improve overall translation quality through the proposed model. A state-of-the-art transformer architecture is used to realize the proposed model. In addition, the paper addresses the use of language relationships (in terms of dialect, script, etc.), particularly the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results also show the advantage of backtranslation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model emerges as more efficient than the baseline model in terms of the evaluation metric, i.e. the BLEU (BiLingual Evaluation Understudy) score, for a set of ILs.
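Since the abstract above reports quality in BLEU, here is a minimal sketch of scoring a set of hypotheses against references with the sacrebleu package; the two toy sentences are invented for illustration and do not come from the cited work.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream, aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```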
... The Chomsky-Schützenberger hierarchy (Chomsky, 1956, 1963; Chomsky & Schützenberger, 1959) includes Regular Grammars (RGs), Context-free Grammars (CFGs), Context-Sensitive Grammars (CSGs), and Unrestricted Grammars (UGs). NooJ is a lexicon-driven, rule-based system that can handle all of these generative grammars (Silberztein, 2003, 2016). Two empirical trials have been used to gauge the system's performance. ...
Article
Full-text available
This project looks at Arabic word generation from a computational angle. It focuses on the computational production and analysis of morphological Arabic nouns. The work begins with a stem-based descriptive analysis of Arabic noun morphology that fulfills both the computational formalization and the linguistic description. It includes a thorough discussion of both inflectional and derivational systems. The spelling of Arabic nouns is also covered, as well as morphotactics and morphophonemics. The work then offers a computer implementation of Arabic nouns built on a rule-based computational morphological methodology. The overall system is constructed using the NooJ toolkit, which supports both finite-state automata (FSA) and pushdown automata (PDA). Three elements make up the morphological generation and analysis system: a lexicon, morphotactics, and rules. The lexicon component catalogs lexical elements (indivisible words and affixes), the morphotactics component specifies ordering restrictions for morphemes, and the rules component converts lexical representations into surface representations and vice versa. Other rules, such as orthographic, morphophonemic, and morphological rules, are also stored as two-level rules. The core editable lexicon of lemmas used as input by the system is drawn from three sources: the Buckwalter Arabic morphological analyzer lexicon, the Arramooz machine-readable dictionary, and the Alghani Azzahir dictionary. A complete annotated vocabulary of inflected noun forms (combined into a single finite-state transducer (FST)) is the system's output. The lexicon that was developed is then put to use in morphological analysis. The study then offers the system's evaluation. Accuracy, precision, and recall are three widely used metrics to assess the system's performance. Two empirical experiments were conducted as part of the evaluation task. The first experiment evaluates the system on diacritized Arabic words: accuracy, precision, and recall when employing diacritized Arabic words are 90.4%, 98.3%, and 88.9%, respectively. The system is tested in a second experiment using undiacritized words; the outcomes of this experiment were 94.7% accuracy, 96.7% precision, and 91.6% recall. Additionally, the averages for the two tests have been determined: the average performance values are 92.55% accuracy, 97.5% precision, and 90.25% recall. Overall, the results are encouraging and demonstrate the system's ability to deal with both diacritized and undiacritized Arabic texts. This system can analyze Arabic text corpora in depth and tag nouns according to their morphological characteristics. It breaks the word under analysis into three pieces (the stem, proclitics/prefixes, and suffixes/enclitics) and assigns each one a specific morphological feature tag, or possibly many tags if the portion in question has numerous clitics or affixes. Many applications of natural language processing, including parsing, lemmatization, stemming, part-of-speech (POS) tagging, corpus annotation, word sense disambiguation, machine translation, information retrieval, text generation, spelling checkers, etc., depend on computational morphology. It is made up of morphological generation and analysis paradigms.
According to a set of features, morphological generation attempts to construct every feasible derived and inflected form of a given lemma. On the other hand, morphological analysis is the process of dissecting a word into its component morphemes and giving each morpheme linguistic tags or qualities.
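The three-way split into proclitics/prefixes, stem, and suffixes/enclitics can be sketched as a naive dictionary-driven segmenter; the clitic inventories and the tiny stem lexicon below are illustrative, and a real system would also apply the orthographic and morphophonemic rules described above.

```python
# Illustrative clitic inventories and stem lexicon (unvocalized forms).
PROCLITICS = ["وال", "بال", "كال", "فال", "ال", "و", "ف", "ب", "ك", "ل"]
ENCLITICS = ["كما", "هما", "هم", "ها", "ه", "ك", "ي", "نا"]
STEMS = {"كتاب": "N", "مدرسة": "N", "كتب": "V"}

def segment(word):
    """Return (proclitic, stem, enclitic, tag) analyses licensed by the toy lexicon."""
    analyses = []
    for pro in [""] + PROCLITICS:
        if not word.startswith(pro):
            continue
        rest = word[len(pro):]
        for enc in [""] + ENCLITICS:
            if enc and not rest.endswith(enc):
                continue
            stem = rest[:len(rest) - len(enc)] if enc else rest
            if stem in STEMS:
                analyses.append((pro, stem, enc, STEMS[stem]))
    return analyses

print(segment("والكتاب"))   # expected: [('وال', 'كتاب', '', 'N')]
```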
... • Nooj [159] is a linguistic development environment that allows the mathematical description of different linguistic phenomena at the orthographical, lexical, morphological, syntactic and semantic levels, for any natural language. NooJ's linguistic engine supports the four types of grammars of the Chomsky hierarchy to facilitate the creation of fine-grained text annotators for domain-specific information extraction. ...
Thesis
Full-text available
Agriculture is entering the digital age through data (which opens up precision agriculture) or knowledge (which opens up new decision support tools). Modern technologies and IoT devices have been applied to improve agricultural processes. One application scenario is plant monitoring using sensors and data analysis techniques. However, most existing solutions based on specific devices and imaging technologies require a financial investment, which is inaccessible to small farmers. Furthermore, the lack of farmer input into data collection and decision-making in these solutions raises trust issues between farmers and smart farming technologies. On the other hand, textual data in agriculture, e.g. exchanges among farmers on social networks, can be a source of knowledge. This knowledge has great value when it is formalised, contextualised and integrated with other data. Crowdsensing is a sensing paradigm that allows ordinary people to contribute with data that their mobile devices equipped with sensors collect or generate. Farmers' observations reflect their knowledge and experience in plant health monitoring. Driven by the increasing connectivity of farmers and the emergence of online farming communities, this thesis proposes: (1) to use Twitter as an open crowdsensing platform to acquire people's perceptions of crop health so that we can include farmer participation in agricultural knowledge reconstruction; (2) to use pre-trained language models as an implicit and domain-specific knowledge base that integrates heterogeneous texts and supports information extraction from text. https://www.theses.fr/2022REIMS025/document
Article
In this work, we propose a Distributional Semantic resource enriched with linguistic and lexical information extracted from electronic dictionaries. This resource is designed to bridge the gap between the continuous semantic values represented by distributional vectors and the discrete descriptions provided by general semantics theory. Recently, many researchers have focused on the connection between embeddings and a comprehensive theory of semantics and meaning. This often involves translating the representation of word meanings in Distributional Models into a set of discrete, manually constructed properties, such as semantic primitives or features, using neural decoding techniques. Our approach introduces an alternative strategy based on linguistic data. We have developed a collection of domain-specific co-occurrence matrices derived from two sources: a list of Italian nouns classified into four semantic traits and 20 concrete noun sub-categories and Italian verbs classified by their semantic classes. In these matrices, the co-occurrence values for each word are calculated exclusively with a defined set of words relevant to a particular lexical domain. The resource includes 21 domain-specific matrices, one comprehensive matrix, and a Graphical User Interface. Our model facilitates the generation of reasoned semantic descriptions of concepts by selecting matrices directly associated with concrete conceptual knowledge, such as a matrix based on location nouns and the concept of animal habitats. We assessed the utility of the resource through two experiments, achieving promising outcomes in both the automatic classification of animal nouns and the extraction of animal features.
Article
Background: The linguistic pursuit of describing natural languages stands as a commendable scientific endeavor, regardless of immediate software application prospects. It transcends mere documentation of possible sentences to establish connections between sentences derived from transformations. Methods: Amid the dominance of Large Language Models (LLMs) in research and technology, which offer intriguing advancements in text generation, the approaches presented in this article confront challenges like opacity, limited human intervention, and adaptation difficulties inherent in LLMs. The alternative or complementary approaches highlighted here focus on the theoretical and methodological challenges of describing linguistic transformations and are firmly rooted in the field of linguistics, the science of language. We propose two solutions to address the problem of language transformations: (i) the procedural approach, which involves representing each transformation with a transducer, and (ii) the declarative method, which entails capturing all potential transformations in a single neutral grammar. Results: These approaches simplify the generation of complex sentences from elementary ones and vice versa. Conclusion: This work has benefited from research exchanges within the Multi3Generation COST Action (CA18231), and the resources produced can contribute to enhancing any language generation system.
Article
We present a set of deterministic algorithms for Russian inflection and automated text synthesis. These algorithms are implemented in a publicly available web service, www.passare.ru. This service provides functions for inflection of single words, word matching, and synthesis of grammatically correct Russian text. Selected code and datasets are available at https://github.com/passare-ru/PassareFunctions/. Performance of the inflectional functions has been tested against the annotated corpus of Russian language OpenCorpora, compared with that of other solutions, and used for estimating the morphological variability and complexity of different parts of speech in Russian.
Article
The article emphasizes the critical importance of language generation today, particularly focusing on three key aspects: Multitasking, Multilinguality, and Multimodality, which are pivotal for the Natural Language Generation community. It delves into the activities conducted within the Multi3Generation COST Action (CA18231) and discusses current trends and future perspectives in language generation.
Conference Paper
The article describes the resources for the automatic extraction of phraseological units in Belarusian within the research of syntagmatic delimitation of Belarusian prosody using NooJ. It comprises the dictionary of Belarusian phrasemes in NooJ format and 12 syntactic grammars for automatically searching different types of frozen expressions (phrasemes, nominal, adverbial, verbal, adjectival and mixed frozen expressions). Their implementation is essential for the computerized search of different types of syntagms for automatic speech delimitation to improve applications with voice accompaniment in the Belarusian language.
Conference Paper
This study presents a corpus approach to verbs of perception for the Croatian language, given that there has not been a single corpus-based study of Croatian that considers a large number of verb lemmas expressing perception. A total of 86 verbs were selected from the Croatian Morphological Lexicon and divided into five semantic subgroups: sight, hearing, taste, smell and touch. These verbs are processed using NooJ by adding the semantic tag +prcp to mark the semantic category of a perception verb and its semantic subgroup +viz [sight], +sluh [hearing], +okus [taste], +miris [smell] and +dodir [touch]. The verbs were next explored within three different domains (a corpus of medical texts, a corpus of parliamentary texts and a corpus of children's literature) to learn more about their syntactic and semantic features. The research corpus consisted of 214,387 forms of perception verbs in context. Corpus entries were manually validated, and linguistic information (syntactic complements and meanings of perception verbs) was assigned manually. In the semantic annotation, information was added to each form of a perception verb concerning whether it expresses a prototypical, physical meaning or a metaphorical meaning. In the syntactic processing, the types of predicate complements of perception verbs were annotated. The analysis showed eight different categories of predicate complements. This study includes many more perception verb lemmas than previous research, which is why it brings some new results, especially in semantic analysis.
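The tagging scheme described above (+prcp plus a subgroup tag) can be sketched as a few NooJ-style dictionary lines generated from a plain Python mapping; only a handful of the 86 verbs are shown, and the inflectional paradigm name is a placeholder, not the project's actual resource.

```python
# A few of the 86 verbs, mapped to their perception subgroup (illustrative subset).
PERCEPTION_VERBS = {
    "vidjeti": "viz",      # sight
    "čuti": "sluh",        # hearing
    "kušati": "okus",      # taste
    "mirisati": "miris",   # smell
    "dodirnuti": "dodir",  # touch
}

def dic_lines(verbs, flx="PLACEHOLDER"):
    """Emit NooJ-style dictionary lines: lemma,V+prcp+<subgroup>+FLX=<paradigm>."""
    return [f"{lemma},V+prcp+{sub}+FLX={flx}" for lemma, sub in verbs.items()]

print("\n".join(dic_lines(PERCEPTION_VERBS)))
```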
Conference Paper
NooJ represents a powerful tool for discourse analysis, and promises great developments in the interpretation of articulate concepts across texts. In spite of the software’s ability to integrate syntactic operations in text parsing, social scientists seldom use NooJ for large discourse analysis procedures, leaving NooJ-powered discourse analysis protocols largely underdeveloped. This study chiefly aims to establish an early discourse analysis canon for NooJ analysts to use in social science research, in particular, sociolinguistics. We identified migrant self-entrepreneurship as a subject of study, and investigated its conceptual development across seven separate interviews with aspiring migrant entrepreneurs. Our data collection primarily consisted of semi-structured interviews, which we conducted in anticipation of NooJ-powered co-occurrence analysis. This article offers insights into research best practices we recommend researchers to employ when conducting co-occurrence analysis, with special regard to congruence patterns, overlapping expressions, divergence patterns and low occurrences. In conclusion, this study comes as a demonstration that NooJ is a perfect fit for a form of scientific text interpretation that goes beyond well-established thematic analysis techniques in terms of representational insightfulness and analytical depth. Importantly, NooJ-powered discourse analysis makes it possible to aid qualitative methods with quantitative assessments, ushering promising developments in computational sociolinguistics. We see this publication as a possible first step for NooJ-powered analysis to new sociolinguistic pursuits and for its affiliated scientific community to study social phenomena with novel depth.
Chapter
We have constructed a set of Ukrainian linguistic resources that has allowed us to build various NLP applications for Ukrainian, including information retrieval and extraction, morphological, syntactic, semantic and statistical analysis, spell checking, and machine translation. Our goal was to develop a reliable tool that would allow students and teachers of the Ukrainian language to explore simple texts, as well as to allow researchers in the social sciences to analyze their own corpora of Ukrainian texts. We will first review the various existing NLP software applications that can process Ukrainian texts, their functionalities, and their performance. We then describe the linguistic resources we have developed, and finally compare the results produced by both approaches.
Chapter
To describe the infinite set of sentences expressed in a Natural language, one needs to define the finite set of its atomic units, i.e., its vocabulary, and the rules that combine these atomic units to construct sentences, i.e., its grammar. However, separating the vocabulary from the grammar is not straightforward; one crucial problem is defining multiword units. Here, I present three reproducible criteria to characterize them and thus separate the vocabulary from the grammar in an operational way.
Chapter
This presentation shows how a lexicon grammar dictionary of English phrasal verbs (PV) can be transformed into an electronic dictionary, in order to accurately identify PV in large corpora within the linguistic development environment, NooJ. The NooJ program is an alternative to statistical methods commonly used in NLP: all PV are listed in a dictionary and then located by means of a PV grammar in both continuous and discontinuous format. Results are then refined with a series of dictionaries, disambiguating grammars, filters, and other linguistics resources. The main advantage of such a program is that all PV can be identified, not just collocations of higher-than-normal frequency.
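The continuous/discontinuous matching described above can be approximated, outside NooJ, with a regular expression that tolerates a short span of words between verb and particle; the tiny PV inventory and the three-word gap limit are illustrative assumptions, not the dictionary or grammar of the cited work.

```python
import re

# Tiny illustrative PV inventory: (verb form alternation, particle).
PHRASAL_VERBS = [("pick|picks|picked|picking", "up"),
                 ("turn|turns|turned|turning", "off")]

def find_phrasal_verbs(text, max_gap=3):
    """Find continuous and discontinuous occurrences, allowing up to max_gap words in between."""
    hits = []
    for verbs, particle in PHRASAL_VERBS:
        pattern = re.compile(
            rf"\b({verbs})\b(?:\s+\w+){{0,{max_gap}}}\s+\b{particle}\b",
            re.IGNORECASE,
        )
        hits += [m.group(0) for m in pattern.finditer(text)]
    return hits

print(find_phrasal_verbs("She picked the heavy box up and turned off the light."))
```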
Chapter
I present a method to formalize the morphology of Quechua nouns, verbs, and other Part Of Speech (POS) categories to develop Natural Language Processing (NLP) applications. First, I constructed an electronic corpus comprising several digitalized texts and electronic dictionaries. After a detailed inventory of all Quechua suffixes, I classified them into specific sets corresponding to their POS category. Next, I formalized their grammatical behavior separately, using elementary matrices. The resulting tables describe valid combinations of two, three, and four suffixes. Finally, I formalized the inflection and derivation of each POS category.
Chapter
Nowadays, most Natural Language Processing software applications use empirical “black box” methods associated with training corpora to analyze texts written in natural languages. To analyze a sequence of text, they look for similar sequences in a corpus, select among them the most similar one according to some statistical measurement or some neural-network-based optimization state, and then bring forth its analysis as the new sequence analysis. Here, I first show that the limited size of the corpora used and their questionable quality explain why most NLP applications produce unreliable results. Next, I examine the principles which are at the basis of corpus-based methods and uncover their linguistic naiveté. I finally dispute the scientific validity of empirical approaches. I propose solutions to various problems that are based on the use of carefully handcrafted linguistic methods and resources.
Chapter
It is necessary to review the performance of Machine Translation software when it processes low-resource languages. Our case study will be the language used in Tango songs. Tango is a challenging subject: its lyrics feature customary beliefs, social forms, and idiosyncratic expressions typical of Argentinian culture from the late nineteenth to the mid-twentieth century, with Lunfardo as the predominant sublanguage. We address the problem of translating Tango song lyrics into English. We look at the translations produced by Google Translator and DeepL and compare them with translations produced by accessing handcrafted linguistic resources specifically developed for Rioplatense Spanish.
Chapter
Tagging systems developed using a data-driven approach are often considered superior to those produced using a linguistic approach [Brill (A Simple Rule-Based Part of Speech Tagger. Applied Natural Language Processing Conference, 1992, p.152)]. The creation of dictionaries and grammars (resources typically used in a linguistic approach) is considered costly compared to the creation of a training corpus (a resource typically used in a data-driven approach) [Silberztein (Formalizing Natural Languages: The NooJ Approach, 2016, p.22)]. In this contribution, I argue that such a view needs to be reconsidered. Focusing on MWE, I will show that some data-driven systems which rely on training corpora may produce inaccurate results, leading to incorrect automatic POS tagging, syntactic parsing and machine translation. I also show that such errors can be prevented using dictionaries and grammars for systems developed using a linguistic approach, which is principally in line with Silberztein’s (Formalizing Natural Languages: The NooJ Approach, 2016) view.
Conference Paper
In the context of the healthcare and pharmaceutical industry, social media users currently generate large amounts of subjective data. These data take the form of patient reviews and social media discussions about health and pharmaceutical services, specific drugs, or treatments. Because of their volume, these textual data are challenging to process and analyze without NLP technology. Sentiment analysis, as an NLP technique, helps pharmaceutical companies and healthcare providers gain new insights from the data, make timely informed decisions, and improve drug development and treatment strategies, and thus achieve improved outcomes and a good quality of life for people. Our current research addresses mining Facebook customer reviews on pharmaceutical services. It adopts an NLP approach to evaluate the sentiment in customer reviews regarding the service given by a particular pharmacy. The research uses a linguistic-based approach with the NooJ platform for sentiment classification of pharmaceutical data according to the polarity of the reviews. NooJ is a linguistic processing platform supporting multiple languages, designed to analyze and process natural language texts for various linguistic tasks, and it provides a range of linguistic tools for analyzing and studying various linguistic phenomena. The study begins by collecting a corpus of social media posts related to the pharmaceutical services of an enterprise. The collected data are then preprocessed, followed by building a NooJ electronic lexicon. Next, a local grammar is built to make the platform identify the linguistic structures embedding sentiments. A linguistic analysis is then applied to the corpus, and the local grammar is applied to identify positive and negative sentiments. Our linguistic approach yields encouraging and promising results, which can help in understanding public sentiment towards pharmaceutical products and services and in making informed decisions for the benefit of the customer and the pharmacy. Identifying areas for improvement and enhancing the quality of services are the ultimate goals for healthcare providers, and both can be served by analyzing social media posts.
Article
Full-text available
In this paper, and in relation to the construction of electronic dictionaries for NooJ, we will deal with the tagging of Italian compound words, and with how it differs from that of simple words as regards methods, functions, and purposes. We will especially focus our attention on the tagging of technical-scientific compound words, demonstrating how this operation, in NooJ, represents a crucial tool for both information extraction and automatic knowledge management and representation. Furthermore, with the intention of producing a complete analysis, we will provide the definitions of simple word and compound word, from both a formal and a linguistic point of view. As for the linguistic examination, we will adopt two different approaches. For the first one, we will use the analytic methods of Zellig S. Harris, who first set out, in structuralist terms and in relation to English, the study of the composition of different morphemes in more complex linguistic units, hence also of word groups or phrases. For the second one, we will make extensive reference to the methodological framework of language formalization described by Maurice Gross's Lexicon-Grammar, and to its subsequent adaptations to the Italian language. Finally, as we will see, it will be of fundamental importance for us to differentiate the definitions that we will give here of compound words from the more generic and less precise one of multiword expressions (MWE). To justify this differentiation, we will provide not only formal indications, but also lexical, morphosyntactic and semantic ones.
Article
The order of clitics (CLs) in Romance languages has been studied in depth by many scholars since Perlmutter's first approach. This author points out that person and case features impose a filter that results in the following pattern in Spanish: Se - II - I - III(dative) - III(accusative). This pattern has been revisited because it is not restrictive enough: it gives rise to impossible sequences. In this context, the goal of this work is to define a pattern of morphosyntactic features for River Plate Spanish clitics, and an ordering pattern, with the purpose of developing a computational model for an automatic analysis of this phenomenon. This kind of analysis is justified because the algorithm output shows the degree of success of the descriptive proposal. To this end, we use NooJ, a linguistic development environment that has diverse tools. The computational modeling involves two stages: (i) the creation of an electronic dictionary and (ii) the creation of a computational grammar. The algorithm developed can generate all correct sentences formed by Nominative Pronoun + CL + CL + Verb, Non-Finite Verb-CL-CL and Imperative Verb-CL-CL, and it can recognize these kinds of expressions in written text, assigning the correct semantic labels. With these results, we conclude that our descriptive proposal is adequate for the analysis of sequences of clitics in River Plate Spanish.
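A minimal sketch of checking a clitic sequence against an ordering pattern of the kind discussed above (se before 2nd person, before 1st person, before 3rd dative, before 3rd accusative); the form-to-slot mapping is a simplified illustration, not the authors' full feature system, and it encodes only the ordering filter, which, as the abstract notes, over-generates on its own.

```python
# Simplified slot indices: se < II < I < III-dative < III-accusative.
SLOT = {
    "se": 0,
    "te": 1, "os": 1,                       # 2nd person
    "me": 2, "nos": 2,                      # 1st person
    "le": 3, "les": 3,                      # 3rd person dative
    "lo": 4, "la": 4, "los": 4, "las": 4,   # 3rd person accusative
}

def well_ordered(clitics):
    """True if the clitic sequence respects the ordering pattern."""
    slots = [SLOT[c] for c in clitics]
    return all(a < b for a, b in zip(slots, slots[1:]))

print(well_ordered(["se", "me", "lo"]))   # True, e.g. "se me lo ..."
print(well_ordered(["me", "se"]))         # False: se must come first
```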
Book
Full-text available
Book of abstracts of the XVI International NooJ Conference 2022. URI: http://hdl.handle.net/2133/24913
Chapter
Paronomasia is defined as the substitution of one lexical item for another, based on partial homophony. While this phenomenon has been the subject of several studies focused on stylistic aspects [1, 2] - and, to a lesser extent, those of grammar [3] - computational linguistics approaches are practically nonexistent. Previous barriers include difficulties in automatically assigning correct labels (both semantic and morphological) due to ambiguity in these expressions. To this end, this paper analyzes a group of Colombian Spanish nouns and their corresponding paronomastic variants from the Generative Lexicon Theory [4, 5] (GLT) with a view toward computational modelling and automatic recognition by means of NooJ [6]. The methodology consists of the following steps: (i) selection of a list of Colombian Spanish nouns included in Varela’s work [7]; (ii) creation of semantic structures (SS) according to GLT; and (iii) elaboration of productive morphological and constrained syntactic grammars [8]. The resources are tested on a set of sentences, yielding promising results to further investigate the automatic analysis of paronomasia from this perspective.
Chapter
The article describes the syntactic grammar for automatic text segmentation into syntagms in Belarusian by means of NooJ. It is based on the principle of defining sequences of linguistic elements associated with certain semantic relationships and aimed at searching structural and semantic components of utterances and delimiting them into accentual units. Its implementation is essential for improving the synthetic speech generated by Belarusian text-to-speech systems using prepared syntactic grammars in NooJ.
Chapter
In 1954, with his article entitled Transfer Grammar (published in “The International Journal of American Linguistics”, Vol. 20, No. 4, pp. 259–270, University of Chicago Press), Zellig S. Harris was the first linguist to approach the nascent Automatic Translation (AT) from the point of view of structuralist and formal linguistics. This article, written in the pivotal period for the first AT attempts in the US, outlines a translation method that wants to: Formally measure the difference between languages, in terms of grammatical structures; Define the point of minimum difference (or maximum similarity) between any type of language pair; Define the difference between the languages as the number and content of grammatical instructions needed to generate the utterances of one language from the utterances of the other. At the time, the purposes of Harris’s article were therefore extremely innovative, since they considered translation as a process in which meaning transfers could only be achieved based on morphosyntactic analyses and evaluations. Moreover, it is worth stressing that at the time the first AT experiments performed word-for-word translations, without taking into account (not even statistically) the contexts in which the words co-occurred. As is known, this method proved to be unsuccessful, as regards the quality, time and costs of the translations made automatically. In 1966, this led ALPAC [1] to end AT research in the US, and cut off the flow of funding to it. Keywords: NooJ, NooJ automatic translation, Lexicon-Grammar, Transfer Grammar, Structural linguistics, Distributional linguistics
Chapter
The abundant presence of prefabricated sequences attracts the attention of linguists. Phraseological units are often problematic for foreign learners, both receptively and productively. In addition, they think that the number of phraseological units is greater in FOS (French for specific purposes) than in general French. In the field of electrical energy, the abundant presence of phrasemes [verb + nominal group or not + preposition + nominal group] poses many problems for non-native engineer-learners. The modeling is done based on the observation of our corpus and of dictionaries of electricity. By using three grammars (rational grammar, algebraic grammar and contextual grammar), NooJ allows us to extract 1389 phraseme sequences for teaching. Also, a disambiguation can be set up to reject useless phrasemes by using the +EXCLUDE operator. Finally, for the selection of phrasemes for teaching, three criteria (frequency, fixation and a pragmatic criterion) seem relevant to us for the identification or classification of phrasemes from the professional field for foreign learners.
Chapter
SANTI-morf is a new morphological annotation system for Indonesian, implemented using NooJ [1, 2]. SANTI-morf is designed using a multi-module pipeline architecture. The modules are the Annotator, the Improver, the Disambiguator, and the Guesser. The Guesser, as its name suggests, provides best guesses for words the Annotator fails to analyze. Due to the complexities of Indonesian morphology, multiple layers of rules are created to guess the morphological structures of unknown polymorphemic and monomorphemic words. These rules are incorporated into five morphological grammars, which are applied in a pipeline based on their priorities. In each grammar, there are two layers of rules. The first-layer rules are prioritized and thus end with a +UNAMB operator. The second-layer rules only apply when the first-layer rules fail to find any match; thus, these rules are constructed without a +UNAMB operator. Reflecting on the complexity of this experiment, I suggest an alternative way to set priorities, which I simulate in this paper. I argue that using the proposed alternative, NooJ users can organize rules with multiple priorities in just one grammar file.
Chapter
The CETEHIPL (Centro de Estudios de Tecnología Educativa y Herramientas Informáticas de Procesamiento del Lenguaje) has been working on the pedagogical application of computer tools to language teaching [8]. Today we took a small turn towards discourse analysis and chose to analyze a recurring topic in post-pandemic Argentina: insecurity. Here we intended to record what impact insecurity had and still has on the linguistic domain. We built a corpus of journalistic texts published in December 2021 in the main newspapers in Rosario, Santa Fe, Argentina. We focused our attention on expressions referring to the victim, to the role of the State, and to the perpetrator. We created tags to account for terms referring to the discourse of insecurity and included some lexical items provided by Lunfardo, a Rioplatense slang originally created by immigrants, which later became a colloquial and informal language variety still in use in our country. We tackled this issue of insecurity with the Rioplatense Spanish resources developed by the IES_UNR team with NooJ. To complete our analysis, we developed grammars to show how the impact of insecurity is made visible from a syntactic viewpoint.
Chapter
In the case of some languages, such as English, when a complex sentence consists of a main clause and a subordinate clause, these two clauses are joined together by either a subordinate ‘completive’ conjunction (that, so that), a circumstantial conjunction (when, before that, while) or a relative pronoun (who, that), e.g. “Because Mom said so, I talked to María”. Subordination in Quechua is induced by a morpho-syntactic marker applied to the dependent verb of the sentence (e.g. chayta mamai niptin, Mariata rimarqani / it’s because my mother said that, I talked to María), where the suffix -ptin marks the causative circumstance ‘because’. The case of participial relative clauses was partially studied by W.F.H. Adelaar. In this paper, I complement his work by proposing new methods to formalize clause subordination based on the verbal suffixes {-pti, -spa, -stin}. I have constructed specific grammars to obtain paraphrases of sentences containing an adverbial subordinate clause. A few instances of transformations are also presented to illustrate how Quechua sentences containing a dependent clause can be translated into French.
Article
Our project is to describe French using a finite vocabulary and a finite set of grammar rules. As far as noun phrases are concerned, we have to decide which noun phrases are to be treated by grammar rules, and which ones should be put in a lexicon. We present four criteria that have been used to build the electronic dictionary of compound nouns (DELAC), which contains over 100,000 entries. The first two criteria (non-compositional meaning and institutionalized terms) yield purely lexicalized entries; the last two criteria (distributional restrictions and exceptional transformational analysis) concern certain noun phrases which could probably be treated by grammar rules; we prefer to put them in a lexicon for methodological reasons.
Article
NooJ associates each text with a Text Annotation Structure, in which each recognized linguistic unit is represented by an annotation. Annotations store the position of the text units to be represented, their length, and linguistic information. NooJ can represent and process complex annotations, such as those that represent units inside word forms, as well as those that are discontinuous. We demonstrate how to use NooJ's morphological, lexical, and syntactic tools to formalize and process these complex annotations.
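As an illustration of the kind of information an annotation carries (position, length, linguistic information), here is a minimal Python data structure; it mimics the description above and is not NooJ's internal format.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One annotation in a text annotation structure: where the unit is, and what it is."""
    start: int                                   # character offset of the annotated unit
    length: int                                  # number of characters covered
    info: dict = field(default_factory=dict)     # e.g. {"lemma": "can", "POS": "V"}

# Two annotations for units inside the single word form "cannot".
text = "cannot"
annotations = [
    Annotation(0, 3, {"lemma": "can", "POS": "V"}),
    Annotation(3, 3, {"lemma": "not", "POS": "ADV"}),
]
print(annotations)
```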