
Olivier Kraif
Professor at Grenoble Alpes University
Professor at Université Grenoble Alpes - Lidilem laboratory
Département I3L et Sciences du langage
About
105 Publications
30,736 Reads
792 Citations
Introduction
NLP tools for corpus linguistics
Additional affiliations
September 2002 - March 2017
September 2002 - present
Publications (105)
The aim of this article is to offer a conceptual and methodological reflection on the scope within which phraseological motifs are realized. To do so, we rely on a corpus-driven approach, using data extracted with the Recurrent Lexico-Syntactic Trees method (Arbres Lexico-syntaxiques Récurrents, ALR), which made it possible to show that the...
At present, the development and inventory of terminologies intended to create and populate terminological databases rely mainly on tool-assisted exploration of corpora, which allows the morphosyntactic, semantic and phraseological dimensions of terms to be observed, described and identified, thanks to...
This contribution proposes new advances toward meeting one of the major challenges that the class of complex prepositions poses to the linguistics research community: the possibility of drawing up a list of them. With a view to developing a fully automated method for extracting candidates belonging to this...
In this exploratory study, we focus on Prefabricated Interaction Sentences (Phrases Préfabriquées des Interactions, e.g. c’est clair; je te jure; on dirait). After defining this type of sentence, we assess the extent to which the Orféo treebank can be used to extract and characterize these items. The results of the qualitative analysis show that...
This collective volume brings together twenty-four scientific contributions selected from the presentations given at the eleventh conference of the Lexicologie, Terminologie, Traduction network. The conference, entitled « Lexique(s) et genre(s) textuel(s) : approches sur corpus » ("Lexicon(s) and textual genre(s): corpus-based approaches"), took place in Grenoble, France, from 25 to 28 September 2018. Lexical studies have...
In this paper we propose a multi-task sequence prediction system, based on recurrent neural networks and used to annotate a Tunisian Arabizi corpus on multiple levels. The annotations performed are text classification, tokenization, PoS tagging and encoding of Tunisian Arabizi into CODA* Arabic orthography. The system is trained to predict all the...
In this paper, we propose a corpus-driven study of “fiction words”, a concept introduced by Angenot (Poétique 33: 74–89, 1978) which denotes a type of lexical coinage specific to the science fiction genre and that furnishes an interesting key to interpreting the peculiarities of the science fiction imaginary. More precisely, the article presents a...
Exploring lexicosyntactic distributions of word and expressions using the LEXICOSCOPE
In this article, we show how to take advantage of corpora annotated with syntactic dependencies: we aim at extracting collocations that summarize the lexico-syntactic contexts of words, as well as working at a more general level on expressions, or even constructions...
The study analyzes textual motifs specific to French-language crime novels and to so-called "blanche" (general literary) fiction. These motifs are selected according to morphosyntactic and semantic criteria and to criteria of specificity and dispersion: we focus specifically on the motifs built around the noun porte ("door") for the crime novel, and...
Recent works in spoken language translation (SLT) have attempted to build end-to-end speech-to-text translation without using source language transcription during learning or decoding. However, while large quantities of parallel texts (such as Europarl, OpenSubtitles) are available for training machine translation systems, there are no large (100h)...
In this paper, we test whether classifications of phraseological units based on recurring-tree and n-gram methods can be used to distinguish genres of the novel from one another. Our results confirm that both methods are relevant for expressions relating to space and time in our corpora.
The present study investigates "core collocations," i.e. frequent or available (i.e. essential for accomplishing basic communicative tasks) word combinations consisting of two lexemes which yield a significant (collocational) relation and which represent the most basic co-occurrences of a word. We extracted from the web-crawled French-language corp...
Drawing on a large corpus of crime novels, the study sets out to identify the linguistic properties and textual functions of the expression scène de crime ("crime scene"); the method combines the tools of corpus linguistics with work in stylistics. The aim is to show how the structures in which this expression scène de crime appears...
The Phraseotext Project focuses on the comparison of phraseological phenomena through different literary subgenres. In this context, this paper presents the results of a preliminary experiment conducted on a Latin corpus. First, we built a dependency-parsed corpus and developed a methodology to study combinatorial profiles using textometric tools. Th...
This paper presents the concepts of “core vocabulary” and “core collocations” and discusses implications for the treatment of collocations in monolingual learner phraseological dictionaries. In the first section, we give an account of what the above concepts refer to by drawing on previous research. In the second part, we present the findings from...
This article presents the results of a preliminary study carried out as part of the PHRASEOTEXT project, which aims to link the phraseological, stylistic and discursive levels through the comparative analysis of corpora representing different literary subgenres. Its objective is both methodological, with an original approach based on...
Scientific writing is characterized by a sociolect with specific features, and in particular phraseological features. This study is dedicated to semantico-phraseological routines, known in French as patrons ("patterns"), tournures or clichés, through which writers become part of a discourse community. We first define the linguistic properties of...
Scientific writing is characterized by a sociolect with specific linguistic properties, particularly at the phraseological level. Here we focus on semantico-rhetorical routines, sometimes called patrons, tournures or motifs, through which writers position themselves within a "discourse community". After...
Collocations have long been recorded by lexicographers but the theorization of this kind of phraseological unit is quite recent. In line with the lexicological “continental” tradition of collocation, we propose a set of semantic and syntactic criteria to clearly delimit this phenomenon. We then show that, in the semantic field of emotions, collocatio...
This article discusses the features of the Lexicoscope, an architecture dedicated to treebank exploration. After reviewing some similar tools, we show how the use of complex expressions (corresponding to syntactic trees) in concordancing as well as extracting word sketches, can be very useful for the study of collocations and phraseology. We also p...
In this paper, we present the preliminary results of research conducted as part of the PHRASEOTEXT project. The objectives of this project are to address the interrelations of phraseological, stylistic and discursive levels through the comparative analysis of corpora that represent various literary subgenres. Our goal is twofold, both methodologi...
Dossier d'Habilitation à diriger des recherches - Synthèse
Following the pioneering work of Gougenheim and his team in the 1950s, pedagogical frequency lists have received much attention in France and elsewhere. However, research has mainly focused on single lexical items, whereas the role played by high-frequency phraseological units, i.e. units functioning as independent lexico-grammatical chunks, has be...
This article presents a methodology developed in the EMOLEX project (www.emolex.eu) to analyse the lexis of emotion in five European languages. For this purpose, multilingual corpora were set up with several query interfaces, analytical tools and specifically designed applications that exploit the results of the linguistic analyses carried out duri...
While the field of technology-enhanced language learning (TELL) is undeniably thriving, most technology-enhanced language tools are still relatively crude. One reason for this is that the field is disconnected from research in natural language processing (NLP) and corpus linguistics (CL), two fields which could greatly improve the effectiveness...
This article deals with the development of a new approach to exploring lexico-syntactic combinatorics, with a view to characterizing the semantic values of the units under study. The approach was implemented through the development of a tool named EmoConc, which makes it possible to study the combinatorics of pivots (or mots pôles, "node words")...
In this article, we present an approach aimed at characterizing and categorizing the verbal collocates of a given class of nouns (here, affect nouns), considered through a predetermined syntactic relation (here, the verb-object relation). We hypothesize that the semantic properties of the units are reflected...
Use of vocabulary by language learners has been extensively investigated within the broader area of lexical acquisition, second language teaching, and language assessment. In recent decades there has been a shift from a view of vocabulary knowledge as the ability to produce single words to a view of vocabulary knowledge as the ability to use words...
In the first part of this article, we explore the background of computer-assisted learning from its beginnings in the early XIXth century and the first teaching machines, founded on theories of learning, at the start of the XXth century. With the arrival of the computer, it became possible to offer language learners different types of language ac...
This paper focuses on the development of specific tools designed for the observation and comparison of combinatorial profiles of the lexicon in the semantic field of ‘emotions’, across various languages (German, French, English, Spanish and Russian). We first present our theoretical background, related to Firth’s view of a syntagmatic structuratio...
As part of the Franco-German Emolex project, devoted to the contrastive study of the combinatorics of the emotion lexicon in five languages, we developed tools and methods for extracting, visualizing and comparing combinatorial profiles for simple and complex expressions. Here we present the architecture...
The NLP community has developed many corpora with rich annotations but these resources are not easily accessible to researchers with little computer expertise. If the NLP community is eager to make available annotated corpora to a wider audience of non-specialists, it is imperative to design and develop user-friendly interfaces, which is not a trivi...
This article takes stock of the notion of the concordance as a privileged object for exploring and observing text corpora. From the first medieval concordances to today's tools, we will see that there is remarkable stability in the very principle of the object: to shed mutual light on occurrences and...
This paper describes the processing of a bilingual corpus for EFL learners. We focus on the identification of semi-frozen expressions specific to academic writing, using techniques such as POS tagging and finite-state transducers: we show how, by coupling this identification with bilingual alignment, it is possible to develop interesting material...
In this paper, we study how single-word term extraction and bilingual lexical alignment can be used and combined to assist terminologists when they compile bilingual specialized dictionaries. Two specific tools, namely a term extractor called TermoStat and a sentence and lexical aligner called Alinea, are tested in a specific project the aim of w...
This paper presents a "didactic triangulation" strategy to cope with the problem of reliability of NLP applications for computer-assisted language learning (CALL) systems. It is based on the implementation of basic but well-mastered NLP techniques and puts the emphasis on an adapted gearing between computable linguistic clues and didactic feature...
We dedicate this article to the memory of our young colleague Robert Barr, who passed away suddenly and prematurely in April 2010. Abstract: This article presents the Scientext project, which produced a corpus of varied scientific writing and software tools for carrying out a linguistic study of positioning and reasoning...
Corpus linguistics offers a complete renewal of research methodology in linguistics. The sheer size of the corpora now in use allows a new approach to the concept of frequency of occurrence and reveals regularities that traditional methods could not reach. The corpus also provides...
Research in the field of named entity recognition and alignment is of strong interest for various applications of natural language processing, such as cross-lingual information retrieval, document management, question-answering systems, data mining, etc. In the processing of Arabic, however, the task is particularly difficult and few resource...
This paper presents the design of a multilingual concordancer, ConcQuest, which attempts to give simple access to complex technologies of NLP. Through this presentation, we try to delineate some possible solutions, in order to cope with the inherent difficulties of formal linguistic representations. These solutions involve different aspects: simpli...
Learner corpora, electronic collections of spoken or written data from foreign language learners, offer unparalleled access to many hitherto uncovered aspects of learner language, particularly in their error-tagged format. This article aims to demonstrate the role that the learner corpus can play in CALL, particularly when used in conjunction with...
This report constitutes Deliverable 39.5.1 of JEIRP “Digital Language Learning: an Integrated Perspective” devoted to designing a model of Digital Language Learning (DLL). The objective of this report is to present three case studies which demonstrate that it is both possible and desirable to integrate Natural Language Processing (NLP) and compute...
With the development of the Web, parallel multilingual corpora in specialized domains are increasingly accessible: a large number of texts are available online, whether technical documentation (Open Source projects, the OPUS corpus), legal or institutional documents (UN, Acquis communautaire, Hansard, etc.), ...
This paper describes the ARCADE II project, concerned with the evaluation of parallel text alignment systems. The ARCADE II project aims at exploring the techniques of multilingual text alignment through a fine evaluation of the existing techniques and the development of new alignment methods. The evaluation campaign consists of two tracks devoted...
This article focuses on the development of natural language processing (NLP) tools for computer-assisted language learning (CALL). After identifying the limitations inherent in CALL tools that lack NLP components, we describe the general framework of the MIRTO project, a platform for creating...
The MIRTO project aims at designing a pedagogical platform using Natural Language Processing (NLP) technologies, meant to be used by language teachers. More than a mark of quality, NLP is, in our view, a prerequisite for language learning software to be able to teach language as such. MIRTO tries to work out this approach, while offering...
This article focuses on the development of Natural Language Processing (NLP) tools for Computer Assisted Language Learning (CALL). After identifying the inherent limitations of NLP-free tools, we describe the general framework of Mirto, an NLP-based authoring platform under construction in our laboratory, and organized into four distinct layers: fu...
Various kinds of information can be used to align parallel texts at word level: co-occurrence frequencies, position difference, part-of-speech, graphic resemblance, etc. This paper proposes a simple method to combine these clues in an efficient way. The association score is computed from the probabilities of pairing two units under the null hypothesis, assumin...
Textual alignment consists in pairing segments (e.g. sentences or phrases) that are translational equivalents across corpora of translations. An interesting application of textual alignment is the automatic extraction of bilingual lexicons. As has been pointed out during previous evaluation campaigns, such as Arcade, lexical alignment remains probl...
Abstract: The development of aligned multilingual corpora has made it possible to formalize and systematize an original form of observation, namely translation spotting. First, we show how this kind of spotting provides useful criteria for identifying multiword units, for...
Alignement lexical dans les corpus multilingues
under the direction of Jean Véronis