
Morpho-Syntactic Properties of Bulgarian Verbal Idiomatic Expressions


Abstract

The paper focuses on Bulgarian verbal idioms and the problems of representing their structure. The work aims at the identification, recognition and translation of Bulgarian idiomatic expressions for the purposes of word sense disambiguation (WSD) and machine translation. The main goal is to provide a framework that combines internal idiomatic structure with inflectional and semantic information and is sufficient to describe the wide variety of idiomatic expressions.
... The phrase even lends itself particularly well to idiomatic usage given that "tongue" is another word for "language" in many world languages. Table 1 shows a sample of world languages with idiomatic usage of the literal phrase, "to swallow one's tongue" [4][5][6][7][8]. Intriguingly, many of these idiomatic usages of "swallowing one's tongue" are things that may happen to a patient during a seizure: choking, becoming mute, or even dying. ...
Article
Objective There is a harmful myth that persists in modern culture that one should place objects into a seizing person’s mouth to prevent “swallowing the tongue.” Despite expert guidelines against this, the idea remains alive in popular media and public belief. We aimed to investigate the myth’s origins and discredit it. Methods A medical and popular literature review was conducted for the allusions to “swallowing one’s tongue” and practice recommendations for and against placing objects into a seizing person’s mouth. Current prevalence of these beliefs and relevant anatomy and physiology were summarised. Results The first English language allusions to placing objects in a patient’s mouth occurred in the mid-19th century, and the first allusions to swallowing one’s tongue during a seizure occurred in the late 19th century. By the mid-20th century, it was clear that some were recommending against the practice of placing objects in a patient’s mouth to prevent harm. Relatively recent popular literature and film continue to portray incorrect seizure first aid through at least 2013. There is ample modern literature confirming the anatomical impossibility of swallowing one’s tongue and confirming the potential harm of putting objects in a patient’s mouth. Conclusion One cannot swallow their tongue during a seizure. Foreign objects should not be placed into a seizing person’s mouth. We must continue to disseminate these ideas to our patients and colleagues. As neurologists, we have an obligation to champion safe practices for our patients, especially when popular media and culture continue to propagate dangerous ones.
... Recognition of idioms, especially discontinuous types, is very important for translation projects. NooJ, as an NLP tool of our choice, has already proved very efficient in dealing with different types of MWUs (Bekavac & Tadić, 2008; Todorova, 2008; Machonis, 2010, 2012; Gavriilidou et al., 2012; Vietri, 2012) and the results justify our selection. ...
Chapter
The correct interpretation of Multiword Units (MWUs) is crucial to many applications in Natural Language Processing but is a challenging and complex task. In recent years, the computational treatment of MWUs has received considerable attention but there is much more to be done before we can claim that NLP and Machine Translation (MT) systems process MWUs successfully. This volume provides a general overview of the field with particular reference to Machine Translation and Translation Technology and focuses on languages such as English, Basque, French, Romanian, German, Dutch and Croatian, among others. The chapters of the volume illustrate a variety of topics that address this challenge, such as the use of rule-based approaches, compound splitting techniques, MWU identification methodologies in multilingual applications, and MWU alignment issues.
Conference Paper
This paper tackles the computational problems of Croatian verbal idioms. The Croatian language has a very rich phraseme structure, as described in Matešić (1982), Menac (2007) and Menac-Mihalić (2007), as well as many others. This work is one of the few attempts at computational analysis of idioms in the Croatian language as multi-word units. We used a rule-based approach and NooJ syntactic grammars in order to recognize any verb-based idiom (of the ~1500 analyzed) in any syntactic position. The Croatian Dictionary of Idioms (Menac et al. 2003) was used for the initial list, which was supplemented with new additions during the training phase. The grammars were tested on corpora constructed specifically for this work, and the results were used to calculate recall, precision and f-measure for our grammars. With the final results of recall < 98 %, precision < 96 % and f-measure < 97 %, we consider this a successful attempt at the recognition of verb-based idioms in the Croatian language.
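The recall, precision and f-measure named in the abstract above are standard evaluation measures. As a minimal sketch, assuming the usual definitions over counts of correct, spurious and missed matches, they can be computed as follows. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall, F1) computed from raw match counts."""
    predicted = true_positives + false_positives  # everything the grammar matched
    actual = true_positives + false_negatives     # everything it should have matched
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 8 correct matches, 2 spurious matches, 2 missed idioms.
p, r, f = precision_recall_f1(8, 2, 2)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The zero checks keep the function defined when a grammar matches nothing or the corpus contains no idioms.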
Article
A large proportion of text is made up of a variety of multi-word units (MWUs). One type of MWU is 'idioms'. While previously linguists have established criteria to define an idiom, the criteria have often been general so as to apply to the wide-ranging MWUs found in this category, and have been a description of them rather than a definition. We present a more restrictive definition of idiom in the form of a test which divides MWUs into 'core idioms', 'figuratives', and 'ONCEs'. The result of applying the test is that the majority of idioms would be put into the 'figuratives' category. While 'figuratives' also present problems for the EFL/ESL learners, the more narrowly defined 'core idioms' are the most difficult set of MWUs for learners to come to terms with and are therefore the motivation for redefining idioms.
Article
THE DANISH IDIOM DICTIONARY has been criticized by Ken Farø for using a less than optimal theory in describing what a phraseme really is. A closer examination of that theory, however, shows that this is not the case, owing to the hitherto neglected role of the common, non-academic user who has little need for an in-depth classification. The current article describes how THE DANISH IDIOM DICTIONARY puts the simplified theory into use and how a new version currently under way improves on the theory by focusing on the user's needs, thereby reducing the necessity for complicated classifications. THE DANISH IDIOM DICTIONARY, including its search capabilities, is described in full.
Article
The Bulgarian Part-of-Speech (POS) and Word-Sense (WS) Tagged Corpora are derived from the "Brown" Corpus of Bulgarian, automatically annotated respectively with POS and WS tags and manually disambiguated with the annotation application Chooser. The adopted methodology for constructing and preprocessing the source corpora is briefly described. The paper also presents the annotation criteria underlying respectively the POS and WS selection process. At the present stage 217 210 tokens (single words, punctuation marks and numbers) are POS annotated and 50 368 words (single words and multi-word expressions) are WS annotated. The chief intended application of the Bulgarian Tagged Corpora is to serve as a test and/or training dataset for POS and WS disambiguation with the further aim of developing a Bulgarian-English bi-directional machine translation system.
Article
We propose a general lexicalization model which accounts for how lexical units are selected and introduced in linguistic utterances during language generation. This model aims at “naturalness” by being based on actual lexical knowledge used in speech; consequently, it should be compatible with standard patterns of behavior shown by humans when they speak (flexibility in computing both content and form of linguistic utterances, prototypical types of mistakes and backtracking, etc.). The main advantage of our model, once implemented in automatic language generation, is that it takes into account fundamental differences that exist between lexical units, with regard to why and how they are used in texts. This is achieved by means of a stratificational approach to lexicalization, where each type of lexical unit is introduced at a proper level of representation, according to the role it plays in the enunciation. Section 1 offers a general characterization of the approach and makes explicit its main assumptions. Sections 2 to 4 successively examine the three levels of transition implied by the stratificational structuring of the model. Section 5 concludes with an examination of its relevance to the design of text generation systems. Keywords: language/text generation, lexicalization, lexical choice, Meaning-Text theory.
Article
The Bulgarian Sense Tagged Corpus is derived from the "Brown" Corpus of Bulgarian and annotated with word senses from the Bulgarian WordNet. The paper gives a brief account of the already available and currently developed language resources and tools which enabled the compilation and annotation of the Bulgarian Sense Tagged Corpus. We briefly describe the adopted methodology for constructing and preprocessing the source corpus of 63 440 words: all words were lemmatised, PoS-tagged and linked to the corresponding sets of senses in the Bulgarian WordNet. The paper also presents the annotation criteria underlying the sense selection process and outlines the general directions of expansion and modification of the Bulgarian WordNet. At the present stage 45 562 words (single words and multi-word expressions) are semantically annotated. The chief intended application of the Bulgarian Sense Tagged Corpus is to serve as a test and/or training dataset for word sense disambiguation with the further aim of developing a Bulgarian-English bi-directional machine translation system.
Article
This paper discusses the approach to multiword expressions being adopted in the LinGO English Resource Grammar (http://lingo.stanford.edu), a broad-scale bidirectional grammar of English in the HPSG framework. We discuss how the lexicon of multiword expressions is encoded in a database and describe the implications for building a reusable lexical resource.
Article
We develop a new approach to learning phrase translations from parallel corpora, and show that it performs with very high coverage and accuracy in choosing French translations of English named-entity phrases in a test corpus of software manuals. Analysis of a subset of our results suggests that the method should also perform well on more general phrase translation tasks.