
Abstract

Statistical measures of word frequency are used in psycholinguistic research to characterize the psychological organization of the mental lexicon and the processes of retrieving, understanding, and learning words. More recently, researchers have calculated statistics from corpora to gain insights into the processing of morphology, building on previous work on Serbian by A. Kostić and colleagues. One such statistical measure, inflectional entropy, has been shown to explain processing costs in word recognition experiments. The inflectional entropy of a word form is the amount of information carried by that inflected form, relative to the statistical distribution of its inflectional paradigm. In this work, we investigate whether it is possible to calculate measures like inflectional entropy for Slovak using the Slovak National Corpus (SNK). This would allow us to compare Slovak with other Slavic languages such as Serbian. The results will be useful for a wide variety of psycholinguistic investigations of the comprehension or production of Slovak.
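As a rough illustration of the kind of measure described above, the following sketch computes two quantities from frequency counts of the forms in one inflectional paradigm: the information carried by a single form (its surprisal, -log2 p) and the Shannon entropy of the whole paradigm (-Σ p log2 p). This is a minimal sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, the counts are invented rather than taken from the SNK, and any weighting of forms by the number of syntactic functions they serve is left out.

```python
import math
from typing import Dict


def paradigm_probabilities(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw corpus counts of a paradigm's forms into relative frequencies."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("paradigm has no observed tokens")
    return {form: count / total for form, count in counts.items()}


def form_information(counts: Dict[str, int], form: str) -> float:
    """Information (in bits) carried by one form, relative to its paradigm."""
    p = paradigm_probabilities(counts)[form]
    # A form that never occurs has probability 0 and unbounded surprisal.
    return math.inf if p == 0 else -math.log2(p)


def paradigm_entropy(counts: Dict[str, int]) -> float:
    """Shannon entropy (in bits) of the distribution over the paradigm's forms."""
    probs = paradigm_probabilities(counts)
    # Zero-probability forms contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical counts for a four-slot paradigm (labels and numbers are invented).
uniform = {"slot_a": 25, "slot_b": 25, "slot_c": 25, "slot_d": 25}
skewed = {"slot_a": 50, "slot_b": 25, "slot_c": 25}

print(paradigm_entropy(uniform))            # 2.0 bits: four equally likely forms
print(paradigm_entropy(skewed))             # 1.5 bits: one form dominates
print(form_information(skewed, "slot_a"))   # 1.0 bit: the frequent form is less informative
```

A paradigm whose forms are used equally often has the highest entropy, while a paradigm dominated by one form has lower entropy, and the dominant form itself carries little information. In a real corpus study, forms that are spelled identically across several cases would need to be disambiguated or pooled before counting, which this sketch does not attempt.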
Article
The Enabling Grids for E-SciencE (EGEE) is the world's largest operating grid infrastructure, serving thousands of multi-science users worldwide with robust, reliable and secure grid services. In order to integrate a large number of institutions, often with different setups and configurations, the EGEE middleware, gLite, must cope with a broad set of local requirements, such as the site's Local Resource Management System (LRMS). This paper presents the work carried out to integrate Sun Grid Engine (SGE), Sun's open source resource management software, with the EGEE middleware, which broadens the range of systems gLite supports and eases the integration of new clusters into the EGEE infrastructure.
Article
The article discusses some observations from the joint work of Polish and Bulgarian research groups on the digital Bulgarian-Polish and Polish-Ukrainian dictionaries, as well as the projected multilingual (initially: Bulgarian-Polish-Ukrainian) dictionary. The researchers are currently working on a parallel corpus containing texts in Bulgarian and Polish, distributed over the Internet, in which the translation correspondence is one-to-one. They are also developing a comparable corpus that includes texts in Bulgarian and Polish (excerpts from newspapers, literary works, Internet textual documents), with the text sizes being comparable across the two languages. The two corpora, parallel and comparable, form the first Bulgarian-Polish corpus, which will be prepared in CES format, manually or using ad hoc tools, and will be annotated at the "paragraph" and "sentence" levels, according to international text annotation standards. This bilingual corpus will provide a sample of the vocabulary to be included in an initial experimental version of the Bulgarian-Polish digital dictionary. Bi- and multilingual digital dictionaries are subject to more limitations, and they require all the more that the description of the language-specific features of the headword in each entry be simple and, at the same time, more comprehensive. The fact that a lexical form in any language may have several meanings that do not overlap across the compared languages also has to be addressed. Great difficulties have to be overcome for a dictionary to satisfy the needs of a translator, a language researcher or an everyday user.

Ludmila Dimitrova, Violetta Koseska-Toszewa

Introduction

This article discusses some observations from the joint work of Polish and Bulgarian research groups on the digital (in electronic form) Bulgarian-Polish and Polish-Ukrainian dictionaries, as well as the projected multilingual (initially: Bulgarian-Polish-Ukrainian) dictionary. The Bulgarian and Polish researchers are currently working on a parallel corpus (containing literary works and texts of documents in Bulgarian and Polish in digital form, with a one-to-one translation correspondence), and in addition are developing a comparable corpus that includes texts in Bulgarian and Polish (excerpts from newspapers, literary works, Internet textual documents, with the text sizes being comparable across the two languages). These two corpora, parallel and comparable, form the first Bulgarian-Polish corpus, which will be annotated according to the digital language resource annotation standards. The bilingual Bulgarian-Polish corpus will provide a sample of the vocabulary, which is to be included in an initial experimental version of the Bulgarian-Polish dictionary.

Languages selection

The three languages, Bulgarian, Polish and Ukrainian, have been chosen for the following reasons: 1) there are no digital dictionaries for these languages, 2) there are no parallel corpora for these languages, and 3) each language represents one of the three Slavic language families: Bulgarian belongs to the South-Slavic, Polish to the West-Slavic, and Ukrainian to the East-Slavic language family. The differences between the three families, such as some systemic phonetic features, could be presented via algorithms.
Article
The accurate recognition of modal information is vital for the correct interpretation of statements. In this paper, we report on the collection of a list of words and phrases that express modal information in biomedical texts, and propose a categorisation scheme according to the type of information conveyed. We have performed a small pilot study through the annotation of 202 MEDLINE abstracts according to our proposed scheme. Our initial results suggest that modality in biomedical statements can be predicted fairly reliably through the presence of particular lexical items, together with a small amount of contextual information.
Article
The notion of loanword assimilation is operationalized in a number of different ways, focusing on both linguistic and social aspects. The indices of integration thus constructed are applied to a set of lexical data elicited from Puerto Rican children and adults from East Harlem, New York. The results of this survey are analyzed statistically using the method of principal components. We interpret the output in terms of the social and linguistic trajectory of words during the borrowing and integration process. Of particular importance are the relatively close relationship between increase in usage frequencies and the processes of phonological integration, the transient nature of inconsistencies in gender assignment, and the fates of competing lexical items for a single referent.

The lexical stock of languages may contain a considerable proportion of words borrowed from one or more other languages. The historical record, together with methods of historical and comparative linguistics, can help us infer which words were borrowed, from what language, and approximately when. On the synchronic level, however, making such inferences can be more difficult, particularly because there is no unequivocal way of deciding when a lexical item from one language, used during discourse in another language, whether by a single speaker, or repeatedly in a community, should be considered a loanword. It may constitute all or part of a code-switch, which is a phenomenon quite distinct from borrowing. It may be a manifestation of incomplete acquisition of one of a bilingual's two languages. It might be a momentary lapse of the type often classified as 'interference'. Or we might want to characterize it as still another kind of result of language contact. It has been claimed that 'from synchronic examination [i.e. without comparative or etymological evidence] no loans are discoverable or describable' (Fries and Pike 1949; see also Haugen 1950a; Weinreich
Article
BalkaNet aims at building a multilingual lexical database consisting of WordNets in several Central and Eastern European languages. Although it will be built in a similar way to EuroWordNet, new features will be implemented, ranging from structuring the Inter-Lingual-Index, to ensure the linking of conceptual equivalences across WordNets, to the development of an inter-networked WordNet Management, so that each partner retains full responsibility for, and independence of, their local WordNet while still being able to view other WordNets and check their compatibility.
Conference Paper
We present a new English→Czech machine translation system combining linguistically motivated layers of language description (as defined in the Prague Dependency Treebank annotation scenario) with statistical NLP approaches.
Article
Over the last few years, language technology has moved rapidly from 'applied research' to 'engineering', and from small-scale to large-scale engineering. Applications such as advanced text mining systems are feasible, but very resource-intensive, while research seeking to address the underlying language processing questions faces very real practical and methodological limitations. The e-Science vision, and the creation of the e-Science Grid, promises the level of integrated large-scale technological support required to sustain this important and successful new technology area. In this paper, we discuss the foundations for the deployment of text mining and other language technology on the Grid: the protocols and tools required to build distributed large-scale language technology systems, meeting the needs of users, application builders and researchers.