Antoine Doucet
La Rochelle Université · Laboratoire Informatique, Image et Interaction

PhD

About

227 Publications
32,295 Reads
2,416 Citations
Introduction
Antoine Doucet currently works at the Laboratoire Informatique, Image et Interaction, University of La Rochelle.

Publications (227)
Conference Paper
Full-text available
Query auto-completion (QAC) is one of the most recognizable and widely used services of modern search engines. Its goal is to assist a user in the process of query formulation. Current QAC systems are mainly reactive. They respond to the present request using past knowledge. Specifically, they mostly rely on query log analysis [11, 10, 12] or cor...
Article
Full-text available
In the age of big data, automatic methods for creating summaries of documents are becoming increasingly important. In this paper we propose a novel, unsupervised method for (multi-)document summarization. In an unsupervised and language-independent fashion, this approach relies on the strength of word associations in the set of documents to be summarized...
Conference Paper
Full-text available
This paper proposes a new approach for automatically dating a photograph, based solely on its content. Building on recent advances in computer vision, the images are first described by a set of features. Then, the age group of every image is predicted by a classifier trained with annotated data. The key strength of our approach -- which makes it pe...
Conference Paper
Full-text available
In this paper, we introduce a multilingual epidemiological news surveillance system. Its main contribution is its ability to extract epidemic events in any language, hence succeeding where state-of-the-art surveillance systems usually fail: the objective of reactivity. Most systems indeed focus on a selected list of languages, deemed important...
Article
Full-text available
In this paper we discuss the problem of discovering interesting word sequences in the light of two traditions: sequential pattern mining (from data mining) and collocations discovery (from computational linguistics). Smadja (1993) defines a collocation as "a recurrent combination of words that co-occur more often than chance and that correspond to...
Article
Full-text available
Automatic term extraction (ATE) is a natural language processing task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. In this paper, we treat ATE as a sequence-labeling task and explore the efficacy of XLMR in evaluating cross-lingual and multilingual learning against monoling...
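As a rough illustration of the sequence-labeling framing described above (not the paper's actual setup), the following minimal Python sketch loads XLM-RoBERTa as a token classifier with a hypothetical B/I/O term tag set; in practice the model would first be fine-tuned on annotated domain-specific corpora.

    # Minimal sketch: automatic term extraction framed as BIO sequence labeling
    # with XLM-RoBERTa. The tag set and the untrained classification head are
    # illustrative assumptions, not the paper's actual configuration.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-TERM", "I-TERM"]  # hypothetical tag set for candidate terms
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-base", num_labels=len(labels)
    )  # would normally be fine-tuned on annotated domain corpora

    sentence = "Convolutional neural networks dominate image classification."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
    predictions = logits.argmax(dim=-1)[0]     # one label id per subword token

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    for token, label_id in zip(tokens, predictions):
        print(token, labels[int(label_id)])    # untrained head -> arbitrary tags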
Article
Historical document processing (HDP) corresponds to the task of converting the physically bound form of historical archives into a web-based, centrally digitized form for their conservation, preservation, and ubiquitous access. Besides the conservation of these invaluable historical collections, the key agenda is to make these geographically...
Conference Paper
In this paper, we address the challenge of document image analysis for historical index table documents with handwritten records. Demographic studies can gain insight from the use of automatic document analysis in such documents through the study of population movements. To evaluate the efficacy of automatic layout analysis tools, we release the PA...
Conference Paper
The digitization of historical documents is a critical task for preserving cultural heritage and making vast amounts of information accessible to the wider public. One of the challenges in this process is separating individual articles from old newspaper images, which is significant for text analysis and information retrieval. In this work, we pres...
Conference Paper
The digitization of historical newspapers is a crucial task for preserving cultural heritage and making it accessible for various natural language processing and information retrieval tasks. One of the key challenges in digitizing old newspapers is article separation, which consists of identifying and extracting individual articles from scanned new...
Article
Full-text available
The ubiquity of social networks and the unprecedented growth in web data have generated an ample resource of information for researchers as well as for market analysts to generate user-oriented recommendations. While many social recommender systems have been designed for individual users, some have been proposed for a group of users intended to han...
Article
Microblogging site Twitter (re-branded to X since July 2023) is one of the most influential online social media websites, offering a platform for the masses to communicate, express their opinions, and share information on a wide range of subjects and products, resulting in the creation of a large amount of unstructured data. This has attract...
Chapter
This paper provides an overview of the DocILE 2023 Competition, its tasks, participant submissions, the competition results and possible future research directions. This first edition of the competition focused on two Information Extraction tasks, Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR). Both of these task...
Chapter
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been...
Chapter
Pre-trained language models have been widely successful, particularly in settings with sufficient training data. However, achieving similar results in low-resource multilingual settings and specialized domains, such as epidemic surveillance, remains challenging. In this paper, we propose hypotheses regarding the factors that could impact the perfor...
Chapter
The widespread use of unsecured digital documents by companies and administrations as supporting documents makes them vulnerable to forgeries. Moreover, image editing software and the capabilities they offer complicate the tasks of digital image forensics. Nevertheless, research in this field struggles with the lack of publicly available realistic...
Chapter
In this paper, we tackle the task of document fraud detection. We consider that this task can be addressed with natural language processing techniques. We treat it as a regression-based approach, by taking advantage of a pre-trained language model in order to represent the textual content, and by enriching the representation with domain-specific on...
Chapter
Information Extraction plays a key role in the automation of auditing processes for administrative documents. However, variety in layout and language always makes this a challenging task. On the other hand, large public training datasets for administrative documents such as invoices are hard to find. In this work, we use Graph Attention Ne...
Article
After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is...
Conference Paper
Full-text available
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has bee...
Article
In this article, we introduce an Open Education Resource (OER) on digital historical research with historical newspapers (the URL will be given with the camera-ready version of this paper), intended to give students the means to understand the risks involved in working with large collections of digitised documents, as well as the keys to benefit fro...
Preprint
Full-text available
Large language models (LLMs) have been leveraged for several years now, obtaining state-of-the-art performance in recognizing entities from modern documents. For the last few months, the conversational agent ChatGPT has "prompted" a lot of interest in the scientific community and public due to its capacity of generating plausible-sounding answers....
Article
Full-text available
Research on graph representation learning (a.k.a. embedding) has received great attention in recent years and shows effective results for various types of networks. Nevertheless, few initiatives have been focused on the particular case of embeddings for bipartite graphs. In this paper, we first define the graph embedding problem in the case of bipa...
Chapter
Full-text available
In this paper, we address the detection of named entities in multilingual historical collections. We argue that, besides the multiple challenges that depend on the quality of digitization (e.g., misspellings and linguistic errors), historical documents can pose another challenge due to the fact that such collections are distributed over a long enou...
Chapter
In many documents, like receipts or invoices, textual information is constrained by the space and organization of the document. The document information has no natural language context, and expressions are often abbreviated to respect the graphical layout, both at word level and phrase level. In order to analyze the semantic content of these types...
Chapter
Due to the availability of cost-effective scanners, printers, and image processing software, document fraud is, unfortunately, quite common nowadays. The main challenges of this task are the lack of freely available annotated data and the predominance of approaches based mainly on computer vision. We consider that relying on the textual content of for...
Preprint
Full-text available
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been...
Preprint
Archive collections are nowadays mostly available through search engine interfaces, which allow a user to retrieve documents by issuing queries. The study of these collections may be, however, impaired by some aspects of search engines, such as the overwhelming number of documents returned or the lack of contextual knowledge provided. New methods...
Article
Full-text available
Due to the expanding rate of scientific publications, it has become a necessity to summarize scientific documents to allow researchers to keep track of recent developments. In this paper, we formulate the scientific document summarization problem in a multi-view clustering (MVC) framework. Two views of the scientific documents, semantic and syntact...
Preprint
Full-text available
Identifying and exploring emerging trends in the news is becoming more essential than ever with many changes occurring worldwide due to the global health crises. However, most of the recent research has focused mainly on detecting trends in social media, thus, benefiting from social features (e.g. likes and retweets on Twitter) which helped the tas...
Preprint
Full-text available
Automatic term extraction (ATE) is a Natural Language Processing (NLP) task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. As units of knowledge in a specific field of expertise, extracted terms are not only beneficial for several terminographical tasks, but also support and...
Preprint
Full-text available
Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformers-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ab...
Chapter
To prevent historical knowledge from fading, research in event detection could facilitate access to digitized collections. In this paper, we propose a method for annotating multilingual historical documents for event detection in an unsupervised manner by leveraging entities and semantic notions of event types. We automatically annotate the documents...
Conference Paper
Full-text available
Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformers-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ab...
Conference Paper
This paper studies the dynamics between how the representation of terms changes through time and its potential emergence as a trending topic in the future. Previous research focused on contrasting directly two of the most recent representations of detected keywords to form a basis for predicting emerging topics. We, thus, propose the Term Context E...
Chapter
Automatic term extraction (ATE) is a natural language processing task that eases the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. In this paper, we experiment with XLM-RoBERTa to evaluate the abilities of cross-lingual and multilingual versus monolingual learning in the cross-domain ATE t...
Conference Paper
Full-text available
Natural Language Premise Selection (NLPS) is a mathematical Natural Language Processing (NLP) task that retrieves a set of applicable relevant premises to support the end-user finding the proof for a particular statement. This paper evaluates the impact of Transformer-based contextual information and different fundamental similarity scores toward N...
Conference Paper
Full-text available
Automatic term extraction (ATE) is a popular research task that eases the time and effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. In this paper, we treat terminology extraction as a sequence-labeling task and experiment with a Transformer-based model XLM-RoBERTa to evaluate the performance...
Article
Full-text available
Event detection is a crucial task for natural language processing and it involves the identification of instances of specified types of events in text and their classification into event types. The detection of events from digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a pan...
Chapter
Tracking news stories in documents is a way to deal with the large amount of information that surrounds us every day, to reduce the noise and to detect emergent topics in the news. Since the Covid-19 outbreak, the world has faced a new problem: the infodemic. News article titles are massively shared on social networks and the analysis of trends and growing...
Chapter
This paper presents an overview of the second edition of HIPE (Identifying Historical People, Places and other Entities), a shared task on named entity recognition and linking in multilingual historical documents. Following the success of the first CLEF-HIPE-2020 evaluation lab, HIPE-2022 confronts systems with the challenges of dealing with more l...
Preprint
Full-text available
This paper summarizes the joint participation of the Trading Central Labs and the L3i laboratory of the University of La Rochelle on both sub-tasks of the Shared Task FinSim-4 evaluation campaign. The first sub-task aims to enrich the 'Fortia ESG taxonomy' with new lexicon entries while the second one aims to classify sentences to either 'sustainab...
Article
Full-text available
Digital libraries have a key role in cultural heritage as they provide access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances...
Chapter
Results of digitisation projects sometimes suffer from the limitations of optical character recognition software which is mainly designed for modern texts. Prior work has examined the impact of OCR errors on information retrieval (IR) and downstream natural language processing (NLP) tasks. However, questions remain open regarding the actual readabi...
Chapter
We present the HIPE-2022 shared task on named entity processing in multilingual historical documents. Following the success of the first CLEF-HIPE-2020 evaluation lab, this edition confronts systems with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation tag sets. HIPE-2022 is part o...
Chapter
In this paper, we approach a recent and under-researched paradigm for the task of event detection (ED) by casting it as a question-answering (QA) problem with the possibility of multiple answers and the support of entities. The extraction of event triggers is, thus, transformed into the task of identifying answer spans from a context, while also fo...
Article
Full-text available
Named entities (NEs) are among the most relevant type of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through the...
Conference Paper
Full-text available
In this paper, we present a collection of five flexible background linking models created for the News Track in TREC 2021 that generate ranked lists of articles to provide contextual information. The collection is based on the use of sentence embeddings indexes, created with Sentence BERT and Open Distro for ElasticSearch. For each model, we explor...
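As a loose sketch of the idea behind such background linking models (the model name, toy texts, and in-memory ranking below are assumptions standing in for the Sentence BERT and Open Distro for ElasticSearch indexes used in the paper), candidate articles can be ranked by embedding similarity to the query article:

    # Minimal sketch: rank candidate background articles by cosine similarity
    # of sentence embeddings to the query article. Model name and texts are
    # illustrative; the TREC systems used embedding indexes in ElasticSearch.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT model

    query_article = "New satellite data show Arctic sea ice at a record low."
    candidates = [
        "Scientists report accelerating ice loss in Greenland.",
        "Local elections draw record turnout in several states.",
        "Climate models underestimated polar warming, study finds.",
    ]

    query_emb = model.encode(query_article, convert_to_tensor=True)
    cand_embs = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, cand_embs)[0]  # similarity to each candidate

    for text, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
        print(f"{score:.3f}  {text}")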
Chapter
Full-text available
The role of Human Resources (HR) in an organization is significant, and the world of education plays an essential role in producing and educating qualified human resources. In this paper, 20 students were evaluated using Simple Additive Weighting (SAW) and the Analytic Hierarchy Process (AHP), which applied six criteria...
Preprint
Full-text available
Named entity recognition (NER) is an information extraction technique that aims to locate and classify named entities (e.g., organizations, locations,...) within a document into predefined categories. Correctly identifying these phrases plays a significant role in simplifying information access. However, it remains a difficult task because named en...
Chapter
Named entity recognition (NER) is an information extraction technique that aims to locate and classify named entities (e.g., organizations, locations, ...) within a document into predefined categories. Correctly identifying these phrases plays a significant role in simplifying information access. However, it remains a difficult task because named e...
Chapter
Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools to analyse digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen muc...
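For context only, here is a minimal plain-LDA baseline with gensim, the kind of model whose topics fragment when OCR noise splits the vocabulary; the embedding-aware topic models discussed in the chapter are not implemented here, and the toy corpus and parameters are assumptions.

    # Minimal sketch: a plain LDA baseline (gensim) on a toy corpus where
    # OCR-noise variants ("harb0ur", "carg0") split the vocabulary and thus
    # weaken the topics. Corpus, topic count and passes are illustrative.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [
        ["ship", "harbour", "cargo", "voyage"],
        ["parliament", "law", "vote", "minister"],
        ["ship", "harb0ur", "carg0", "voyage"],  # OCR-noisy duplicate of doc 1
    ]

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2, passes=10, random_state=0)
    for topic_id, topic in lda.print_topics(num_words=4):
        print(topic_id, topic)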
Chapter
In this paper, we focus on epidemic event extraction in multilingual and low-resource settings. The task of extracting epidemic events is defined as the detection of disease names and locations in a document. We experiment with a multilingual dataset comprising news articles from the medical domain with diverse morphological structures (Chinese, En...
Preprint
Full-text available
After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is...
Chapter
In this paper, we present a dataset and a baseline evaluation for multilingual epidemic event extraction. We experiment with a multilingual news dataset which we annotate at the token level, a common tagging scheme utilized in event extraction systems. We approach the task of extracting epidemic events by first detecting the relevant documents from...
Chapter
In this paper, we present an efficient and accurate method to represent events from numerous public sources, such as Wikidata or more specific knowledge bases. We focus on events happening in the real world, such as festivals or assassinations. Our method merges knowledge from Wikidata and Wikipedia article summaries to gather entities involved in...
Chapter
The present paper is focused on information extraction from key fields of invoices using two different methods based on sequence labeling. Invoices are semi-structured documents in which data can be located based on the context. Common information extraction systems are model-driven, using heuristics and lists of trigger words curated by domain exp...
Article
Full-text available
Melanoma, one of the most dangerous types of skin cancer, results in a very high mortality rate. Early detection and resection are two key points for a successful cure. Recent research has used artificial intelligence to classify melanoma and nevus and to compare the assessment of these algorithms to that of dermatologists. However, training neu...
Article
Full-text available
The growth of information technology goes hand in hand with the use of algorithms. One of the most well-known algorithms is the Memetic Algorithm (MA). MA is part of the family of evolutionary algorithms and has been applied to highly complex computational challenges. MA can be applied in many fields of research, such as optimization, scheduling, prediction, i...
Article
Full-text available
This article considers the interdisciplinary opportunities and challenges of working with digital cultural heritage, such as digitized historical newspapers, and proposes an integrated digital hermeneutics workflow to combine purely disciplinary research approaches from computer science, humanities, and library work. Common interests and motivation...
Article
Full-text available
The fingerprint is one kind of biometric. This unique biometric data has to be processed efficiently and securely. The problem gets more complicated as the data grows. This work processes fingerprint image data with a memetic algorithm, a simple and reliable algorithm. In order to achieve the best result, we run this algorithm in a parallel envir...
Article
Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digi...
Preprint
Full-text available
In this paper, we propose a recent and under-researched paradigm for the task of event detection (ED) by casting it as a question-answering (QA) problem with the possibility of multiple answers and the support of entities. The extraction of event triggers is, thus, transformed into the task of identifying answer spans from a context, while also foc...
Preprint
Full-text available
This paper summarizes the participation of the Laboratoire Informatique, Image et Interaction (L3i laboratory) of the University of La Rochelle in the Recognizing Ultra Fine-grained Entities (RUFES) track within the Text Analysis Conference (TAC) series of evaluation workshops. Our participation relies on two neural-based models, one based on a pre...
Chapter
Breast cancer subtypes, which play a significant role in breast cancer prognosis and targeted therapy selection, can be identified with gene expression profiling. It is also beneficial for personalized treatment to know bio-markers that impact the development of cancer cells from studying gene expression. Therefore, this study uses recursive featur...
Conference Paper
We present a collection of Named Entity Recognition (NER) systems for six Slavic languages: Bulgarian, Czech, Polish, Slovenian, Russian and Ukrainian. These NER systems have been trained using different BERT models and Frustratingly Easy Domain Adaptation (FEDA). FEDA allows us to create NER systems using multiple datasets without having to worry...
Conference Paper
We explore three different methods for improving Named Entity Recognition (NER) systems based on BERT, each responding to one of three potential issues: the processing of uppercase tokens, the detection of entity boundaries and low generalization. Specifically, we first explore the marking of uppercase tokens for providing extra casing information....
Article
We investigate the effectiveness of a successful model in Visual-Question-Answering (VQA) problems as the core component in a cross-modal retrieval system that can accept images or text as queries, in order to retrieve relevant data from a multimodal document collection. To this end, we adapt the VQA model for deep multimodal learning to combine vi...
Chapter
Event detection involves the identification of instances of specified types of events in text and their classification into event types. In this paper, we approach the event detection task as a relation extraction task. In this context, we assume that the clues brought by the entities participating in an event are important and could improve the pe...
Article
Full-text available
Personal identification has become one of the most important concerns in our society with regard to access control, crime and forensic identification, banking, and computer systems. The fingerprint is the most widely used biometric feature due to its uniqueness, universality and stability. The fingerprint is widely used as a security feature for forensic recogn...
Article
Full-text available
Swarm intelligence refers to meta-heuristic algorithms inspired by the natural behavior of groups of animals (such as dragonflies, ants, or ducks) striving for survival. One of them is the Dragonfly Algorithm, which has been used to solve real-world nonlinear problems in engineering. In this paper, we review the Dragonfly...
Conference Paper
Named entities (NEs) are among the most relevant type of information that can be used to efficiently index and retrieve digital documents. Furthermore, the use of Entity Linking (EL) to disambiguate and relate NEs to knowledge bases, provides supplementary information which can be useful to differentiate ambiguous elements such as geographical loca...
Chapter
In recent decades, a huge number of documents have been digitised, before undergoing optical character recognition (OCR) to extract their textual content. This step is crucial for indexing the documents and making the resulting collections accessible. However, the fact that documents are indexed through their OCRed content is posing a number of p...
