National Report on Language Technology in Greece

Authors:
  • Institute for Language and Speech Processing, Athena R.C.
Maria Gavrilidou / Penny Labropoulou / Stelios Piperidis
1. Introduction
The present document reports on Language Technology (LT) related activities in Greece.
After a general introduction to LT and its benefits, it presents the evolution of the field in
Greece and the current state of affairs, with an extensive account of the Language
Resources and Technologies developed for Greek. The report concludes with a presentation
of ongoing infrastructural initiatives operating at the European level with the
participation of Greek institutes.
2. Language Technology in use
Language Technology (LT, also referred to as Human Language Technology, HLT) cov-
ers a wide range of software components, data, tools and technologies, techniques and
applications aimed at processing human natural language. Typical examples of such
tools are tokenizers and sentence splitters, morphological analyzers, part of speech tag-
gers and lemmatizers, syntactic analyzers, etc. The term Language Resources (LRs) de-
notes language data in digital format, usually of considerable size, for use by any type
of research and development targeting linguistic study and language technology
applications, as well as by all fields where language plays a critical role. Typical examples of
LRs are spoken, written or multimodal/multimedia corpora, lexica, grammars, termino-
logical thesauri or glossaries, ontologies etc. Nowadays, the term is extended to cover
basic language processing tools used for the collection, preparation, annotation, man-
agement and deployment of LRs.
The change of perspective from the native speaker's intuition to original data and the
analysis of language in actual use was a landmark in linguistic research. Language data
collection had started as a tendency in the '50s, but was led to success by the dramatic
improvements in hardware technology and the advent of the web. Today, in addition to
the constant need for general language and domain-specific data, the focus has shifted
from the data as such to the technologies that help analyze large volumes of data quickly
and efficiently.
LT is a valuable aid in many fields where research is based on language material, whether
language is the object of research or the means that carries information for the research.
Even simple procedures, such as compiling the list of words of a given text and comparing
it with the word list of a general language corpus, may lead to insightful observations
that traditional methodologies would miss. LT reduces the time needed for the initial
processing of the research material, thus leaving researchers more time for the qualitative
and interpretative processing of the data. Additionally, the use of LT facilitates access to
secondary material, such as literature on the research subject (e.g. intelligent full-text
search aiming to locate specific sections of interest).
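The word-list comparison mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy texts are invented, the tokenizer is a plain regular expression, and the score is a simple smoothed ratio of relative frequencies rather than any particular published keyness measure.

```python
import re
from collections import Counter


def word_list(text):
    """Return a frequency list of the lowercased word tokens in a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def keyness(target, reference, top=5):
    """Rank target words by relative frequency against a reference list.

    The +1 terms are a crude smoothing so that words missing from the
    reference list do not cause a division by zero.
    """
    t_total = sum(target.values())
    r_total = sum(reference.values())
    scores = {}
    for word, count in target.items():
        t_rel = count / t_total
        r_rel = (reference.get(word, 0) + 1) / (r_total + 1)
        scores[word] = t_rel / r_rel
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data, invented for illustration only.
study_text = "the corpus tools annotate the corpus and the lexicon"
general_text = "the cat sat on the mat and the dog sat on the rug"

target = word_list(study_text)
reference = word_list(general_text)
print(keyness(target, reference, top=3))
```

Words that are frequent in the studied text but rare in the reference list rise to the top, while function words shared by both texts sink. A real study would use a large general language corpus as the reference and a statistically grounded score.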
Most importantly, LT can play a crucial role serving the needs of laymen in all aspects
of their everyday life, as it enables communication across languages, and increases ac-
cess to information and knowledge for users of any language. For instance, the use of LT
in the access of language resources offers many advantages to the public: natural lan-
guage queries are friendlier to the lay user than specialized interfaces; machine transla-
tion systems integrated in search engines produce a rough translation (“gisting trans-
lation”) allowing the users to have an idea about the content of a foreign language text,
although they are usually unable to convey a complete understanding of it. It is also clear,
however, that not all LT tools and applications are mature enough to provide high-level
services in a user-friendly way.
3. Historical overview of LT in Greece
Greece has a thirty-year tradition in LT research and development, starting with the
EUROTRA project in the mid-80's. EUROTRA was a very ambitious EU-funded project
aiming to create a fully automatic, high-quality translation system for all of the originally
seven and, later, nine European official languages. Although the project did not succeed
in fulfilling its goal, its main legacy (apart from the lexica and grammars produced)
lies in the creation and training of groups of LT experts in all the involved countries.
At approximately the same time, EU-funded projects inaugurated speech processing
research in Greece, focusing on speech synthesis at first.
The decade 1990-2000 saw a critical increase in the amount of public funds invested in
LT in the country, besides the EU funds. Several national programmes resulted in the
creation of resources, tools and infrastructure as well as small and medium-scale
applications in the field of language and speech processing. The results included text and
speech databases, speech processing tools, Natural Language Processing tools, Machine
Translation tools and systems, but also multimedia, LT-aware educational material for
the teaching of Greek as mother tongue and foreign language. During the same years,
infrastructural programmes catered for the introduction of this educational material in
primary and secondary schools. Programmes with dedicated funding for resource crea-
tion resulted in the production of lexicographic material, such as computational lexica
for HLT, mono-/bi-/multi-lingual multimedia dictionaries for human users, pedagogical
dictionaries for Greek, terminological resources for various domains etc. A few medium-
and large-scale EU infrastructural projects have also contributed to the development of
monolingual resources (corpora and computational lexica) with common specifications
for all EU official languages.
Through national funding, the development of the Hellenic National Corpus (HNC,
http://hnc.ilsp.gr) was made possible. HNC, accessible through the web via an interface
designed for non-expert users, boosted research on linguistics, lexicology and lexicog-
raphy as well as education.
These pioneer endeavours of the 90's have inspired the construction of new textual,
speech, but also multimodal/multimedia resources, for general language as well as for
specialized domains. Regarding general language text corpora, for instance, two
endeavours, which saw the light in subsequent years (the first by the Centre for the Greek
Language and the second by the University of Athens) made available more Greek
language resources. Relevant initiatives and results are presented in the following section.
During the next decade, 2000-2010, the national programmes mainly addressed the wider
sector of Information and Telecommunications Technologies, although specific activities
targeted to LT have also been launched. Their objective was the development and en-
hancement of the LT infrastructure (e.g. creation of digital corpora and computational
lexica) as well as applications in the general framework of human-machine interaction
(e.g. voice-enabled dialogue systems for information extraction, intelligent human-ma-
chine interfaces, authoring aids, optical character recognition for manuscripts, automatic
subtitling of multimedia content, multimedia search etc.). Obviously, the monolingual
dimension was prominent in the nationally funded projects; a few bi-/multilingual re-
sources and applications have also been produced, with English as the second language
primarily. Additionally, bilateral cooperation programmes for the Balkan region have
resulted in the creation of a set of resources and applications incorporating also lan-
guages such as Bulgarian, Serbian, Romanian, Albanian, etc.
Recently, Greece has faced new challenges, as the volume of digital (textual and
multimedia/multimodal) content has increased rapidly. Many nationally funded digitization
projects, aiming at the preservation and promotion of Greek cultural heritage, have
created new requirements for the use of LT and, thus, new impetus for related R&D. The
results of these projects are (or will be) available over the Internet in the form of digital
libraries. Researchers now have access to all types of data through their computers, but
the amount of information available is so large and so dispersed that, without the
appropriate tools, it becomes unmanageable. Furthermore, the use of language resources and
tools is not extensive, not because of their quality, but because they are difficult to locate
and sometimes even more difficult to use. In order to perform a specific task (e.g. to use a
summarization tool, a morphological or syntactic analyzer, a speech synthesis tool etc.)
the users have to know the exact tool needed and the organization or the person they
need to contact to get the appropriate license, to review the terms of use, to download
the tool or the data, to convert the format of the data to render it interoperable with the
tool, to learn how to use it and so on. This situation can discourage the most dedicated
researcher, let alone the ones that are not digitally literate and/or the general public that
wishes to have access to digital cultural collections. Cultural informatics is a domain
currently attracting LT interest.
As far as EU-funding on LT is concerned, the majority of projects currently on-going in
Greece cater for application-oriented research in machine translation, information ex-
traction, data mining, semantic web-based technologies, cognitive systems etc. Greek is
not necessarily the focus of these projects, but one of the languages used as test-bed for
the applications. As for data resources, the focus is on multilingual lexica and, more re-
cently, on conceptual resources (e.g. ontologies, semantic lexica) as well as corpora;
these resources are mainly domain-specific, given that most of the projects target small
and medium-scale applications.
4. Current LT activities in Greece
Constant funding through national and EU sources as outlined in the previous section
has resulted in a steady, increasing and dynamic evolution of LT research and develop-
ment in Greece. Thus, human resources employed in the LT area have increased over
the last few years, with the advent of new research groups and units in universities and
research organizations dedicated to LT research. The private sector has also invested in
LT; a small (but significant for the size of the country) number of private companies,
some of which are spin-offs of research centres, are active in the field, in areas such as
speech recognition and synthesis, machine translation, media monitoring, ePublishing,
eLearning and intelligent content analysis.
A key parameter for the progress of LT has been its introduction in higher education:
over the years it has been introduced in the form of modules in the curricula of under-
and postgraduate studies in universities, in linguistics and technological departments
alike (obviously taught from different perspectives); in addition, a post-graduate two-year
interdepartmental course, summer schools and seminars dedicated to LT methodologies
and applications constitute an important asset in the dissemination of LT know-how.
LT for a less-widely spoken language like Greek poses additional challenges. Notably,
whereas some of the research and development work carried out in Greece is based on
English data-sets and/or uses language-independent algorithms, the majority of the re-
search endeavours has focused on Greek, attempting to model linguistic phenomena, to
create Greek training data and to develop language-sensitive applications. This is reflected
in the high number of research groups who are active in the country as well as abroad,
trying to tackle language processing problems from the morphographemic and phonetic
level to technological solutions for access to information and content.
In fact, LT research and development concerning the Greek language has spread over the
years in a multitude of areas. Taking a closer look at the way it has evolved in Greece,
we can discern the main driving forces: the LT domain per se (engaged in Natural Lan-
guage Processing and speech related research), research in theoretical linguistics (mainly
focusing on the analysis of written and oral language), the use of LRs and LTs in language
learning and, more recently, applications for the cultural domain. It is under this prism
that we can explain the range and variety of research activities in which Greek LT re-
searchers are involved.
As evidenced from the following summary, the community has moved on from the more
“traditional” word/sentence-based research to new challenges (web content, various mo-
dalities, emotional language etc.). The following should be seen as points of interest
rather than a full synopsis of all research activities of the LT community in Greece.
Important progress has been made in the LR building domain. The processes of manual
collection, typing and/or OCRing, conversions from typeset material for the construction
of corpora, manual selection and encoding for the construction of general language and
domain-specific lexica etc. are complemented and increasingly replaced by new methods
and techniques. (Semi-)automatic tools catering for knowledge acquisition from various
sources (texts, images, video etc.) are exploited for LR construction where possible: for
instance, lemma and term extraction from mono- and bi-/multilingual text corpora,
ontology building from textual content, web crawling methods used for spotting candidate
texts for the construction of monolingual and bi-/multilingual corpora (both parallel and
comparable), and new OCRing methods for manuscripts. As a consequence, manual effort
is more efficiently spent on the more challenging tasks (e.g. annotation with semantic and
pragmatic information). Moreover, most of these techniques and methods are integrated
in LT applications and systems serving end-user needs (e.g. keyword extractors for the
automatic construction of indexes and thesauri to be used in accessing cultural
collections).
As far as speech is concerned, both speech recognition and analysis are the objects of ex-
tensive research. Current interests of the community include voice interactive systems,
speech-only user interfaces, speech synthesis from documents and web content, emo-
tional speech synthesis, implementation of prosodic features etc., going even beyond
speech to research on sound and music.
In the wider areas of text mining, information extraction and knowledge acquisition, the
focus is on cross-lingual information retrieval, sentiment analysis, textual entailment and
processing, automatic text categorization, text genre detection (including web genres),
authorship attribution, spam filtering, multimedia information processing (image/video
and/or audio processing for information extraction, automatic metadata extraction and
fusion from various modalities), exploitation of cognitive modeling techniques, etc.
Natural language generation activities currently include research in document summa-
rization, image-based summarization, user-adaptive management and presentation of in-
formation, monolingual and multilingual subtitling, question answering systems, spoken
dialogue interaction etc.
Machine translation research addresses both aids for human translation (e.g. translation
memories) and fully automatic machine translation (e.g. corpus-based machine transla-
tion approaches exploiting mono- and bi/multilingual corpora).
Developing assistive technologies for disabled persons (with visual and/or hearing
impairments but also with learning difficulties) is the objective of several research groups
in the country.
Finally, research into the use of LT for the benefit of the specialized public but also of the
broad public is ongoing: for instance, in educational software and applications, authoring
aids (e.g. spelling and style checkers, controlled language applications), eGovernment
applications etc.
5. LRTs for the Greek language
As a result of the research efforts described above, there is a significant number of LRs
for Modern Greek; most of these are available for educational and research purposes.
More specifically:[1]
[1] This section presents a synopsis of results from various surveys on LRT for Greek, the most recent of
which has been conducted in the framework of the preparatory stage for the Greek counterpart of the
CLARIN project (cf. section 6). The results of this survey can be found at www.clarin.gr/clarinmaps (site
in Greek, accessed 29/3/2011).
  • as far as textual data are concerned:
    - there are three general language corpora of considerable size, namely: (a) the
      Hellenic National Corpus (HNC, http://hnc.ilsp.gr/), which was compiled in the
      early 90's but continues to be enriched; it currently includes 47 million words
      solely of written texts from various sources and it can be accessed via a web
      interface; (b) the Corpus of Greek Texts (CGT, http://sek.edu.gr/index.php?en)
      comprising around 30 million words, including transcribed oral texts; the corpus
      is available for downloading; and (c) the newspaper corpora of the Centre for the
      Greek Language, of a total of 10 million words, made available through the Portal
      for the Greek Language
      (http://www.greek-language.gr/greekLang/modern_greek/tools/corpora/index.html);
    - domain-specific corpora of small and medium size, an important proportion of
      which are bi-/multilingual (with English as the most frequent other language), are
      also available via the internet and/or distributed by the creators, covering a wide
      range of domains (e.g. biomedicine, health, tourism, press, literature, academic
      speech etc.);
    - dialectal material that has been collected and transcribed in the framework of
      linguistic research activities can be regarded as a special form of specialized corpus;
    - an important number of cultural text collections has become available following a
      digitization programme funded by the Greek state over the last few years. Although
      most of these texts have been digitized as images and necessitate OCR processing
      in order to be fully processable by LT tools, the accompanying metadata
      descriptions can benefit from LT.
  • linguistically annotated resources include aligned bi-/multilingual text corpora, aligned
    transcriptions of audio data and text data annotated with various types of linguistic
    information; the latter include morpho-syntactically tagged corpora, some of which
    are manually disambiguated and validated, a treebank and various corpora annotated
    with semantic information (e.g. ontological class, named entities, event type etc.);
    obviously, the deeper level annotations are manually performed while
    morpho-syntactic tagging is usually automatic;
  • most recently, a small but increasingly significant number of multimedia/multimodal
    resources has been produced; most of these resources, mainly video with accompanying
    audio and/or text equivalents, have been annotated with various types of
    modality-dependent information (e.g. speaker turn, gesture annotation etc.), while their
    textual counterparts are also linguistically processed (e.g. morpho-syntactically,
    semantically tagged);
  • as far as lexical/conceptual resources are concerned, there are a few bi-/trilingual lexica
    of small and medium size intended both for computer and human use, three large
    monolingual morphological computational lexica, various small-size computational
    lexica endowed with syntactic and semantic information, usually developed for
    specific applications (e.g. ontologies, lists of acronyms and named entities, lexica with
    event types, semantic classes etc.) and a number of terminological/domain-specific
    lexical resources (e.g. for biomedicine, science etc.);
  • available LTs can be classified in two broad categories:
    - tools and software components that can be used to manage and process resources
      (e.g. grammar/lexicon authoring tools, annotators etc.): here, we include
      morpho-syntactic taggers, chunkers, dependency parsers, lemmatizers and stemmers,
      manual annotation aids for text and multimodal/multimedia resources, named entity
      recognizers, text aligners for bilingual texts etc.; most of these are available for
      academic research and can be accessed via the internet and/or by permission of the
      creators; some of these tools address the Greek language, either employing a
      lexical/corpus resource of Greek or having been developed by the use of statistical
      techniques on Greek training data; the use of these tools is primarily intended for
      LT research and applications but it can also be extended to serve needs of end users
      with appropriate tuning/customization (e.g. lemmatizers deployed to facilitate
      lemma-based search, named entity recognizers to mark person and place names etc.);
    - LT applications/technologies/systems that can be used for the benefit of the end
      user: here we include authoring aids (e.g. spelling and syntactic checkers), speech
      recognizers, speech synthesizers, statistical information extraction tools, term
      extractors, speech transcribers, language detectors, summarizers, machine translation
      tools, etc.
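As an illustration of the lemma-based search mentioned among the end-user uses above, the following is a minimal sketch. It assumes a hypothetical toy lexicon of four inflected Greek forms; it does not reproduce any of the actual lemmatizers surveyed here, which rely on full morphological lexica or on statistical models trained on Greek data.

```python
import re
import unicodedata


def norm(text):
    """Normalize to NFC and lowercase so accented forms compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


# Hypothetical toy lexicon mapping inflected forms to their lemmas.
_RAW_LEMMAS = {
    "παιδιά": "παιδί",
    "παιδιού": "παιδί",
    "έγραψα": "γράφω",
    "γράφει": "γράφω",
}
LEMMAS = {norm(form): norm(lemma) for form, lemma in _RAW_LEMMAS.items()}


def lemmatize(token):
    """Look the token up in the toy lexicon; unknown tokens map to themselves."""
    token = norm(token)
    return LEMMAS.get(token, token)


def lemma_search(query, documents):
    """Return the indices of documents containing any form of the query's lemma."""
    wanted = lemmatize(query)
    hits = []
    for index, doc in enumerate(documents):
        tokens = re.findall(r"\w+", norm(doc))
        if any(lemmatize(token) == wanted for token in tokens):
            hits.append(index)
    return hits


# Invented example sentences.
documents = [
    "Τα παιδιά παίζουν",
    "Έγραψα ένα γράμμα",
    "Ο καιρός είναι καλός",
]
print(lemma_search("παιδιού", documents))
print(lemma_search("γράφει", documents))
```

The query and the documents are reduced to lemmas before comparison, so a search for one inflected form also retrieves texts containing a different form of the same word.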
A signicant set of LRTs catering for Greek Sign Language (multimedia lexica, corpora,
terminological resources etc.) has been compiled during the last decade.
Finally, important digital text resources but also tools and systems (OCRing tools,
morphosyntactic taggers etc.) for older variants of Greek (ancient, medieval, early modern
Greek etc.) are at the heart of research projects in Greece as well as abroad (cf. Perseus,
http://www.perseus.tufts.edu/hopper/ and TITUS,
http://titus.fkidg1.uni-frankfurt.de/framee.htm?/search/query.htm#Etabelle, two large
repositories including ancient Greek resources).
6. Current initiatives for the promotion of LT
In the previous sections, we have given an overview of the LT field in Greece and the
LRs that exist for the Greek language. However, although the field has clearly progressed
a great deal in recent years, the impact and the significance of LT for research
but also for everyday life has not actually reached crucial audiences, that is, researchers at
large, the broad public and, last but not least, the policy makers. The main drawbacks are:
  • fragmented scenery as regards the availability of LRs:
    - although most of these are supposedly available for research and/or educational
      purposes, they are mainly distributed through the creators themselves and quite
      often they are badly “advertised” (i.e. dissemination of their existence is at best
      limited to specialized conferences); interested users have to search the internet in
      various web sites and/or communicate with all LT institutes to find the resources
      they need;
    - moreover, access and usage rights are not always clear, so, even when they find
      them, users are not sure if they can indeed use these LRs;
    - finally, technical issues also need to be tackled before they are used: some resources
      can only be accessed through specific tools that users do not have; in other cases, the
      operation of the tools is scarcely documented and/or too difficult to be understood
      by LT illiterate people; or, even in cases where resources and relevant processing
      tools are both available, they are not compatible and require some customization.
    The infrastructure that puts resources together and sustains them is still largely
    missing; interoperability of resources, tools and frameworks at the organizational, legal
    and technical levels has recently come to be recognized as perhaps the most pressing
    current need for language processing research.
  • lack of, or need to improve, specific tools and datasets: although most of the basic
    processing tools and data resources have been developed, there is still need for
    extensions, enrichment and/or improvements thereof and development of new ones,
    especially for higher level processing (e.g. semantic annotation, discourse processing,
    sentiment analysis, etc.); recording of existing tools and resources in surveys like the one
    presented here is the first step towards the solution of this problem; however,
    identification of the gaps and their prioritization in accordance with user needs must be
    carried out in a well-organized way, as must the attraction of the funds that will
    support their development.
Bridging the gap between the LRT community and the research community at large is
the task of certain initiatives that have been launched lately at the European and at na-
tional levels. The European projects META-NET and CLARIN have the aim to prepare
the ground and to provide the necessary infrastructure that will offer services based on
LT to the research community and to the public. FLaReNet, on the other hand, has a
different scope from the other two: it addresses the policy makers, its results being mainly
recommendations based on extensive analysis of the field according to several parameters.
More specically, META-NET (A Network of Excellence forging the Multilingual Eu-
rope Technology Alliance, www.meta-net.eu) is a Network of Excellence that brings to-
gether researchers, commercial technology providers, private and corporate LT users,
language professionals and other information society stakeholders. It constitutes a con-
certed, substantial, continent-wide effort in LT research and engineering which aims to
create an open distributed facility for the sharing and exchange of resources and to build
bridges to relevant neighbouring technology fields, as well as to prepare the strategic
research agenda of the field for the years to come.
META-NET is supporting these goals by pursuing three lines of action:
  • fostering a dynamic and influential community around a shared vision and strategic
    research agenda (META-VISION),
  • creating an open distributed facility for the sharing and exchange of resources
    (META-SHARE),
  • building bridges to relevant neighbouring technology fields (META-RESEARCH).
META-SHARE is a sustainable network of repositories of language data, tools and related
web services documented with high-quality metadata, aggregated in central inventories
allowing for uniform search and access to resources. It targets existing but also new and
emerging language data, tools and systems required for building and evaluating new
technologies, products and services. In this respect, reuse, combination, repurposing and
re-engineering of language data and tools play a crucial role. META-SHARE will
eventually be an important component of an LT marketplace for HLT researchers and
developers, language professionals (translators, interpreters, content and software
localization experts, etc.), as well as for industrial players, especially SMEs, catering for
the full development cycle of HLT, from research through to innovative products and services.
META-SHARE will start by integrating nodes and centres represented by the partners of
the META-NET consortium. It will gradually be extended to encompass additional
nodes/centres and provide more functionality, with the goal of becoming as widely
distributed an infrastructure as possible.
Similar to META-NET but catering for the Social Sciences and Humanities research-
ers, the European project CLARIN (Common Language Resources and Technology
Infrastructure, www.clarin.eu) is structured as a network of organizations that offer LRT
for all European languages. It is a research infrastructure that aims to make LRs and
LTs available through web services to researchers with little or no technical experience;
services include all aspects of resource creation and use (technical, legal, administra-
tive etc.).
When nalized, the infrastructure will constitute a platform on which
LRT providers will be able to upload their resources and their technologies, to de-
scribe them according to a common metadata schema, to get help on legal issues such
as licensing or property rights;
LRT consumers (Social Sciences and Humanities researchers, users, developers, etc.)
will prot from unied access to data and tools which physically might exist in diffe-
rent distributed repositories and will be able to: harvest metadata in the process of
LRTs identication; browse samples or whole resources' sign usage licenses' save the
resources on their computers; to run a tool and save the results of the process etc.
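The idea of unified access over distributed repositories described by a common metadata schema can be sketched as follows. This is a hypothetical illustration only: the record fields, repository names and functions are invented here and do not correspond to the actual CLARIN metadata schema or services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical minimal metadata record (not the CLARIN schema)."""
    name: str
    resource_type: str  # e.g. "corpus", "lexicon", "tool"
    language: str       # language code, e.g. "el" for Greek
    licence: str
    repository: str     # which repository physically holds the resource


def harvest(repositories):
    """Collect the metadata of all repositories into one inventory."""
    inventory = []
    for repo_name, records in repositories.items():
        for fields in records:
            inventory.append(ResourceRecord(repository=repo_name, **fields))
    return inventory


def search(inventory, language=None, resource_type=None):
    """Filter the inventory; a criterion left as None is ignored."""
    return [
        record for record in inventory
        if (language is None or record.language == language)
        and (resource_type is None or record.resource_type == resource_type)
    ]


# Invented repositories and resources, for illustration only.
repositories = {
    "repo-a": [
        {"name": "news-corpus", "resource_type": "corpus",
         "language": "el", "licence": "research-only"},
    ],
    "repo-b": [
        {"name": "tagger", "resource_type": "tool",
         "language": "el", "licence": "academic"},
        {"name": "lexicon-x", "resource_type": "lexicon",
         "language": "bg", "licence": "academic"},
    ],
}

inventory = harvest(repositories)
for record in search(inventory, language="el"):
    print(record.name, record.repository, record.licence)
```

Because every provider describes its resources with the same fields, a consumer can query one inventory without knowing which repository holds a given resource.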
The participation of Greece in this network will cater for the integration in the infra-
structure of LRs and tools developed for the Greek language. Given that CLARIN will
serve as a dynamic, constantly updated atlas of LRTs, it will constitute a valuable tool
that will register the gaps that need to be tackled in what concerns the Greek language
and that will evaluate the performance of the data and technologies offered for Greek
in the domain of Social Sciences and Humanities.
In the framework of the national project, CLARIN-EL, a charting of the field has been
initiated, which has recorded user needs and current practices, information on existing
resources, tools, LRT organizations and research teams;[2] the national network has also
been drafted. The vision of CLARIN-EL is to gather the resources and technologies that
have been developed for Greek in one virtual repository and to transform them into web
services which are characterized by interoperability, stability, accessibility and
extensibility and which will be available to the users.
[2] The results of the CLARIN-EL survey have fed the current report.
The mission of the third initiative that aims at the unification of the LRT scenery, FLaReNet
(Fostering Language Resources Network, www.flarenet.eu), is to identify priorities as well
as long-term strategic objectives and provide consensual recommendations in the form
of a plan of action for the EC, national organizations and industry. Its outcomes are
essentially of a directive nature, aimed at policy makers at all levels. FLaReNet analyses the
sector along various dimensions: technical, scientific but also organizational, economic,
political and legal. It aims to bring together major experts from different areas, reach
consensus, make the community aware of the results and disseminate them in a fine-grained,
pervasive way. Work in FLaReNet is inherently collaborative.
These concerted actions aim to help LRT move beyond the boundaries of the scientific
domain and percolate through other domains, including everyday life. They aspire to
introduce the benefits of LRT use to the researcher but also to the layman, whose work,
whether scientific or not, may be facilitated and accelerated and its quality enhanced.
The active participation of Greek LT research institutes in these initiatives is of
paramount importance to the progress of the field in the country.