
Sokratis Sofianopoulos
- PhD
- Developer at Institute for Language and Speech Processing
About
32 Publications
1,817 Reads
137 Citations
Current institution
Additional affiliations
March 2005 - present
Publications (32)
We describe the development and capabilities of Meltemi 7B, the first open Large Language Model for the Greek language. Meltemi 7B has 7 billion parameters and is trained on a 40 billion token Greek corpus. For the development of Meltemi 7B, we adapt Mistral, by continuous pretraining on the Greek Corpus. Meltemi 7B contains up-to-date information...
Visual tracking and attribute estimation related to age or gender information of multiple person entities in a scene are mature research topics with the advent of deep learning techniques. However, when it comes to indoor images such as video sequences of retail consumers, data are not always adequate or accurate enough to essentially train effecti...
This paper presents a collection of parallel corpora generated by exploiting the COVID-19 related dataset of metadata created with the Europe Media Monitor (EMM) / Medical Information System (MediSys) processing chain of news articles. We describe how we constructed comparable monolingual corpora of news articles related to the current pandemic and...
Following the detailed description of the PRESEMT Machine Translation system and the report on its performance, the current chapter focuses on the system’s portability. Portability is a term intended to signify the process of integrating a new language pair into the system. This involves reviewing all the necessary system modules and resources and...
The topic of the current chapter is the evaluation of the performance of PRESEMT both per se as well as in comparison with other MT systems, the performance relating to the translation quality being achieved. While it is possible to employ humans for this task (subjective evaluation), who assess an MT system in terms of fluency (i.e. grammaticality...
This chapter performs a review of the research work discussed in the previous chapters of the present volume. This review represents a summary of the outcomes of the research within the PRESEMT project. As a logical outcome, a set of key directions is identified for future work in order to further improve the MT methodology. A brief report of the m...
This chapter presents in detail the main translation process of PRESEMT, delving deeper into the core of the system and its inner workings.
This chapter introduces the general design characteristics of PRESEMT and provides a detailed description of all resources required as well as all pre-processing steps needed, such as corpora processing and model creation.
This chapter describes a number of improvements performed on the basic PRESEMT system. These improvements are aimed at specific modules of the system in an effort to achieve gains in the translation accuracy, for which alternative implementations have been suggested. These extensions concern different modules of the PRESEMT architecture. The first...
This chapter contains a general introduction to the topic of the present book. It presents the current challenges of Machine Translation (MT), in particular for languages where only a limited amount of specialised resources is readily available. To that end, a comprehensive review of the state-of-the-art in MT is performed. Focus is placed on relat...
This book provides a unified view on a new methodology for Machine Translation (MT). This methodology extracts information from widely available resources (extensive monolingual corpora) while only assuming the existence of a very limited parallel corpus, thus having a unique starting point to Statistical Machine Translation (SMT). In this book, a...
The present chapter reviews the development of a hybrid Machine Translation (MT) methodology, which is readily portable to new language pairs. This MT methodology (which has been developed within the PRESEMT project) is based on sampling mainly monolingual corpora, with very limited use of parallel corpora, thus supporting portability to new langua...
This paper reports on a first prototype implementation for combining and extending a data infrastructure with linguistic processing services, bringing language datasets and basic language processing services together in a unified platform thus boosting the organic growth of data and facilitating language technology research and development. The MET...
The present article investigates the fusion of different language models to improve translation accuracy. A hybrid MT system recently developed in the European Commission-funded PRESEMT project, which combines example-based MT and Statistical MT principles, is used as a starting point. In this article, the syntactically-defined phrasal language models...
The current paper evaluates the performance of the PRESEMT methodology, which facilitates the creation of machine translation (MT) systems for different language pairs. This methodology aims to develop a hybrid MT system that extracts translation information from large, predominantly monolingual corpora, using pattern recognition techniques. PRESEM...
The current paper presents a language-independent methodology, which facilitates the creation of machine translation (MT) systems for various language pairs. This methodology is implemented in the PRESEMT hybrid MT system. PRESEMT has the lowest possible requirements on specialised resources and tools, given that for many languages (especially less...
This document contains a brief presentation of the PRESEMT project, which aims at the development of a novel language-independent methodology for the creation of a flexible and adaptable MT system.
In this article, aspects regarding the optimisation of machine translation systems via evolutionary computation algorithms are examined. The article focuses on pattern-recognition based machine translation systems that use large monolingual corpora in the target language from which statistical information is extracted. The research reported here...
The present article introduces a phrase-alignment approach that involves the processing of a small bilingual corpus in order to extract suitable structural information. This is used in the PRESEMT project, whose aim is the quick development of phrase-based Machine Translation (MT) systems for new language pairs. A main bottleneck of such systems is...
In this paper, an automated method is proposed for optimising the real-valued parameters of a hybrid Machine Translation (MT) system that employs pattern recognition techniques together with extensive monolingual corpora in the target language from which statistical information is extracted. The absence of a parallel corpus prohibits the use of the...
The main aim of this article is to present the prototype hybrid Machine Translation (MT) system METIS. METIS is interesting in two ways. As regards MT for low and middle density languages, METIS relies on relatively cheap resources: monolingual corpora of the target language (TL), flat bilingual lexica and basic NLP tools (taggers, lemmatizers, chu...
METIS-II was an EU-FET MT project running from October 2004 to September 2007, which aimed at translating free text input without resorting to parallel corpora. The idea was to use “basic” linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. The METIS-II project has four par...
In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required and in which no full parser or extensive rule sets are needed. We describe the evaluation on a developme...
In this paper, we explain why we have adopted pattern matching for MT purposes and why we have embedded it into a hybrid approach. "Patterns" here are understood as independent meaningful sub-sentential segments received in a systematic way. We describe the nature and size of the patterns used as well as the comparison algorithm developed. We d...
The innovative feature of the system presented in this paper is the use of pattern-matching techniques to retrieve translations resulting in a flexible, language-independent approach, which employs a limited amount of explicit a priori linguistic knowledge. Furthermore, while all state-of-the-art corpus-based approaches to Machine Translation (MT)...
METIS-II, the MT system presented in this paper, does not view translation as a transfer process between a source language (SL) and a target one (TL), but rather as a matching procedure of patterns within a language pair. More specifically, translation is considered to be an assignment problem, i.e. a problem of discovering each time the best ma...
In this paper an innovative approach is presented for MT, which is based on pattern matching techniques, relies on extensive target language monolingual corpora and employs a series of similarity weights between the source and the target language. Our system is based on the notion of 'patterns', which are viewed as 'models' of target language s...
In the present article, a hybrid approach is proposed for implementing a machine translation system using a large monolingual corpus coupled with a bilingual lexicon and basic NLP tools. In the first phase of the METIS system, a source language (SL) sentence, after being tagged, lemmatised and translated by a flat lemma-to-lemma lexicon, was ma...