Thomas François’s research while affiliated with Catholic University of Louvain and other places


Publications (21)


Figure 1: Examples of relevant 'gapfill' (1) and 'multiple-choice' (2) questions.
Figure 2: Collection procedure for the examined corpora
Generating Contexts for ESP Vocabulary Exercises with LLMs
  • Conference Paper

October 2024 · 15 Reads · Amandine Dumont · [...] · Thomas François

The current paper addresses the need for language students and teachers to have access to a large number of pedagogically sound contexts for vocabulary acquisition and testing. We investigate the automatic derivation of contexts for a vocabulary list of English for Specific Purposes (ESP). The contexts are generated by contemporary Large Language Models (namely, Mistral-7B-Instruct and Gemini 1.0 Pro) in zero-shot and few-shot settings, or retrieved from a web-crawled repository of domain-relevant websites. The resulting contexts are compared to a professionally crafted reference corpus based on their textual characteristics (length, morphosyntactic, lexico-semantic, and discourse-related). In addition, we annotated the automatically derived contexts regarding their direct applicability, comprehensibility, and domain relevance. The 'Gemini, zero-shot' contexts are rated most highly by human annotators in terms of pedagogical usability, while the 'Mistral, few-shot' contexts are globally closest to the reference based on textual characteristics.



Graded resources for learning and teaching foreign languages: An overview

September 2024 · 12 Reads

ITL - International Journal of Applied Linguistics

Innovative resources for teaching vocabulary have emerged in recent decades, among them the so-called ‘graded resources’, i.e. lexicons or inventories in which linguistic forms are associated with a difficulty level, with a target reader in mind. The idea of ‘grading’ in education is not new and has evolved over time: vocabulary or text scales (grades) have been approached as an educational tool with the means available in each period. In this survey, our aim is to show how current approaches to graded resources for foreign language learning are rooted in the tradition of building frequency lists for education. We then synthesize the body of work that has been undertaken to design graded resources, highlighting the different methodologies applied as well as existing resources.


PolylexFLE: A MWE database for French L2 language learners

April 2024 · 11 Reads · 1 Citation

ITL - International Journal of Applied Linguistics

Knowledge of multiword expressions (MWEs) is key in the process of learning a foreign language, but its teaching remains hindered by the lack of lists of expressions connected to pedagogical aims. In this paper, we present an extended version of the PolylexFLE database, containing 4,525 French MWEs of three types: idioms, collocations, and fixed expressions. In order to propose exercises following the difficulty scale of the Common European Framework of Reference for Languages (CEFR), we used a mixed approach (manual and automatic) to annotate 1,186 expressions according to the CEFR levels. The paper focuses mostly on the automatic procedure, which first identifies the expressions from the PolylexFLE database (and their variants) in a corpus of pedagogical texts (with CEFR labels) using a pattern-based system. In a second step, their distribution in this corpus is estimated and transformed into a single CEFR level. Finally, the proposed automatic approach is evaluated by 52 learners of French as a foreign language.


Figure 1: Box-plot of feature correlations by family.
Number of essays according to the gender, per (CEFR) level
TCFLE-8: a Corpus of Learner Written Productions for French as a Foreign Language and its Application to Automated Essay Scoring

January 2023 · 24 Reads · 1 Citation

Automated Essay Scoring (AES) aims to automatically assess the quality of essays. Automation enables large-scale assessment and improves consistency, reliability, and standardization. These characteristics are of particular relevance in the context of language certification exams. However, a major bottleneck in the development of AES systems is the availability of corpora, which are scarce, especially for languages other than English. In this paper, we aim to foster the development of AES for French by providing the TCFLE-8 corpus, a corpus of 6.5k essays collected in the context of the Test de Connaissance du Français (TCF - French Knowledge Test) certification exam. We report the strict quality procedure that led to the scoring of each essay by at least two raters according to the CEFR levels and to the creation of a balanced corpus. In addition, we describe how linguistic properties of the essays relate to the learners’ proficiency in TCFLE-8. We also advance the state-of-the-art performance for the AES task in French by experimenting with two strong baselines (i.e. RoBERTa and a feature-based model). Finally, we discuss the challenges of AES using TCFLE-8.



Towards a Verb Profile: distribution of verbal tenses in FFL textbooks and in learner productions

Morphological inflection is known to be difficult to master for L2 learners. In this paper, we examine the use of inflection in the verbal tense system among learners of French, and contrast it with its use in French as a foreign language (FFL) textbooks. The objectives of our study are threefold: 1) to establish the distribution of verbal tenses in French textbooks automatically, in order to obtain the first fully empirical and extensive resource on French verbal tenses; 2) to objectively describe the use of verbal tenses by learners of different CEFR levels; 3) to identify the tenses that learners struggle with. By describing the learners’ use of the tenses, we found that they had difficulty with the past perfect indicative, even at advanced levels. The proposed Verb Profile summarizes which tenses should be understood at which level, and as such can guide teachers and learners, as well as help pinpoint tenses on which learners are underperforming.



Revisiting simplification in corpus-based translation studies: Insights from readability research

September 2022 · 35 Reads

Meta Journal des traducteurs

Ever since the publication of Laviosa’s (1998a; 1998b) pioneering work, the study of lexico-syntactic simplification has held centre stage in corpus translation research concerned with the typical features of translated texts. The simplification hypothesis states that translated texts are simpler than non-translated texts. The convergence hypothesis, also discussed by Laviosa (1998a; 1998b) but less so in follow-up studies, is that translated texts are more homogeneous than original texts, that is, they display less variance. To date, simplification has mostly been operationalised in corpus-based translation studies (CBTS) as type-token ratio, lexical density, core vocabulary coverage, list head coverage and average sentence length. Relying on these parameters, previous research has produced mixed results, with simplification varying across translation modalities, language pairs and registers. The present article sets out to revisit the simplification and convergence hypotheses through the lens of NLP-informed readability research. In particular, we rely on a larger set of simplification indicators and make use of multivariate statistical techniques. We present a simplification study of Europarl corpus data in French translated from English and in non-translated French. The results show that translated French is simpler than original French, both lexically and syntactically. We also find evidence of convergence: translators smooth out cross-speaker lexical heterogeneity in translated parliamentary proceedings.


Figure 1: Examples of possible medical translations into Arasaac pictographs.
Figure 2: Examples of system's outputs in Arasaac.
Investigating the Medical Coverage of a Translation System into Pictographs for Patients with an Intellectual Disability

May 2022 · 142 Reads · 2 Citations

Communication between physicians and patients can lead to misunderstandings, especially for people with disabilities. An automatic system that translates natural language into a pictographic language is one solution that could help overcome this issue. In this preliminary study, we present the French version of a translation system using the Arasaac pictographs, and we investigate the strategies used by speech therapists to translate into pictographs. We also evaluate the medical coverage of this tool for translating physician questions and patient instructions.


Citations (11)


... Additionally, text simplification, among other use cases, is vital for improving comprehension among the general public in the context of legal and administrative texts. Such texts often serve as communication means between institutions and target audiences with different reading skills [2]. ...

Reference:

Automatic Simplification of Lithuanian Administrative Texts
AMesure: A Web Platform to Assist the Clear Writing of Administrative Texts
  • Citing Conference Paper
  • January 2020

... It offers normalized word frequencies by CEFR competence level and includes MWEs 1 . PolylexFLE: Tailored to MWEs in French and aiming to facilitate second language acquisition (Todirascu et al., 2024). It contains 4,525 MWEs and their CEFR competence levels and focuses on verbal MWEs 2 . ...

PolylexFLE: A MWE database for French L2 language learners
  • Citing Article
  • April 2024

ITL - International Journal of Applied Linguistics

... Regarding disambiguation in French, (Norré et al., 2023) examined the translation of French texts into pictographs. They used several language models, including CamemBERT (Martin et al., 2019), FlauBERT (Le et al., 2020), DrBERT (Labrak et al., 2023), and CamemBERT-bio (Touchent et al., 2023), to generate sentence embedding vectors. ...

Word Sense Disambiguation for Automatic Translation of Medical Dialogues into Pictographs

... Even though work on AES started around 55 years ago, it is still an active area of research to this day (e.g. Beigman Klebanov and Madnani, 2020; Wilkens et al., 2023; Lagutina et al., 2023). ...

TCFLE-8: a Corpus of Learner Written Productions for French as a Foreign Language and its Application to Automated Essay Scoring

... This is likely due to the additional selection and ranking steps implemented by Whistely et al. (2022) and the lack thereof in the LS system provided by North et al. (2022a) (Section 3.2). Wilkens et al. (2022) likewise experimented with a range of monolingual transformers for SG. They employed an ensemble of BERT-like models with three distinct masking strategies: 1) copy, 2) query expansion, and 3) paraphrase. ...

CENTAL at TSAR-2022 Shared Task: How Does Context Impact BERT-Generated Substitutions for Lexical Simplification?
  • Citing Conference Paper
  • January 2022

... Recent automatic SS systems mostly leverage data-driven and deep learning methods (Feblowitz and Kauchak, 2013; Xu et al., 2016; Aharoni and Goldberg, 2018; Alva-Manchego et al., 2017). Previous studies have primarily focused on English, based on datasets such as ASSET (Alva-Manchego et al., 2020), ASSET ann (Cardon et al., 2022) and Newsela (Xu et al., 2015). Less attention has been given to other languages such as Chinese. ...

Linguistic Corpus Annotation for Automatic Text Simplification Evaluation
  • Citing Conference Paper
  • January 2022

... Speech-to-text Translation (ST) is a technology that converts speech input in one language into text output in another language [1,2,3,4]. This technology has numerous real-world applications, such as simultaneous translation for international conferences [4,5,6], enhancing accessibility [7,8] and communication across language barriers [9,10]. Additionally, ST can be integrated with Text-to-Speech Synthesis (TTS) [11,12,13] to form a core component of cascaded Speech-to-Speech Translation systems [14,15], which are particularly useful in video dubbing applications [16,17]. ...

Investigating the Medical Coverage of a Translation System into Pictographs for Patients with an Intellectual Disability

... While recent studies have started to explore interpretability in SSL models broadly [43]-[45], research specifically tailored to pathological speech is still quite limited [37], [40], [41]. In this line, although there is an ongoing debate regarding the reliability of attention mechanisms for interpretability [46]-[48], attention-based model ...

Is Attention Explanation? An Introduction to the Debate

... In addressing the complexity of language in ICFs, it is evident that despite the efforts of clinical translational scientists, these documents often remain mired in technical jargon and complex language structures. This issue underscores the need for employing literacy checkers and advanced technologies, such as artificial intelligence (AI), to enhance the readability of ICFs [31][32][33][34]. AI-driven language simplification tools offer a promising solution by analyzing and revising text to make it more accessible to nonspecialist audiences. ...

Simplification of literary and scientific texts to improve reading fluency and comprehension in beginning readers of French

Applied Psycholinguistics