Mickaël Coustaty

  • PhD
  • Professor (Associate) at La Rochelle Université

About

202
Publications
57,068
Reads
2,047
Citations
Introduction
- Document Analysis based on Computer Vision and Natural Language Processing
- Digital trust in information exchange with the development of fraud detection techniques and document securing systems
- Segmentation and Classification of large document flow
- Pattern Recognition
Current institution
La Rochelle Université
Current position
  • Professor (Associate)

Publications (202)
Article
Full-text available
Conservation of marine ecosystems can be improved through a better understanding of ecosystem functioning, particularly the cryptic underwater behaviours and interactions of marine predators. Image‐based bio‐logging devices (including images, videos and active acoustic) are increasingly used to monitor wildlife movements, foraging behaviours and th...
Preprint
Full-text available
Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST,...
Article
Full-text available
This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions,...
Article
Full-text available
Historical census records convey information that is key to perform genealogical research and demographic studies. Given the large number of documents of this type that exist, it is crucial to research methods that allow the automatic extraction of information from this type of document. In this work, we present a new corpus of this kind, comprisin...
Preprint
Full-text available
Extracting tables from documents is a critical task across various industries, especially on business documents like invoices and reports. Existing systems based on DEtection TRansformer (DETR) such as TAble TRansformer (TATR), offer solutions for Table Detection (TD) and Table Structure Recognition (TSR) but face challenges with diverse table form...
Preprint
Developing effective scene text detection and recognition models hinges on extensive training data, which can be both laborious and costly to obtain, especially for low-resourced languages. Conventional methods tailored for Latin characters often falter with non-Latin scripts due to challenges like character stacking, diacritics, and variable chara...
Article
Historical document processing (HDP) corresponds to the task of converting the physical-bind form of historical archives into a web-based centrally digitized form for their conservation, preservation, and ubiquitous access. Besides the conservation of these invaluable historical collections, the key agenda is to make these geographically...
Conference Paper
Historical newspapers serve as invaluable resources for understanding past societies and preserving cultural heritage. However, digitizing these newspapers presents challenges due to their complex layouts and vast content. Article segmentation, involving the identification and extraction of individual articles from scanned newspaper images, is cruc...
Preprint
Full-text available
Optical Character Recognition (OCR) continues to face accuracy challenges that impact subsequent applications. To address these errors, we explore the utility of OCR confidence scores for enhancing post-OCR error detection. Our study involves analyzing the correlation between confidence scores and error rates across different OCR systems. We develo...
Article
Full-text available
This research work proposes a novel protocol for rehearsal-based incremental learning models for the classification of business document streams using deep learning and, in particular, transformer-based natural language processing techniques. When implementing a rehearsal-based incremental classification model, the questions raised most often for p...
Conference Paper
This article concerns the accessibility of historical press collections. One of the main challenges in making the content accessible is extracting individual articles from images of digitized pages, so that the documents can be exploited at the appropriate granularity. We evaluate the NewsEye Article Separation (NAS) dataset,...
Article
Full-text available
Identity document (ID) verification is crucial in fostering trust in the digital realm, especially with the increasing shift of transactions to online platforms. Our research, building upon our previous work (Al-Ghadi et al. 2023), delves deeper into ID verification by focusing on guilloche patterns. We present two innovative ID verification models...
Conference Paper
In this paper, we address the challenge of document image analysis for historical index table documents with handwritten records. Demographic studies can gain insight from the use of automatic document analysis in such documents through the study of population movements. To evaluate the efficacy of automatic layout analysis tools, we release the PA...
Conference Paper
The digitization of historical documents is a critical task for preserving cultural heritage and making vast amounts of information accessible to the wider public. One of the challenges in this process is separating individual articles from old newspaper images, which is significant for text analysis and information retrieval. In this work, we pres...
Conference Paper
The digitization of historical newspapers is a crucial task for preserving cultural heritage and making it accessible for various natural language processing and information retrieval tasks. One of the key challenges in digitizing old newspapers is article separation, which consists of identifying and extracting individual articles from scanned new...
Conference Paper
Full-text available
In cases of digital enrolment via mobile and online services, identity document (ID) verification is critical to efficiently detect forgery and therefore build user trust in the digital world. In this paper, we propose a copy-move public dataset, called FMIDV (forged mobile ID video dataset) containing forged IDs with respect to guilloche pattern...
Article
Full-text available
Automatic document authentication is a complex task. The aim is to prove that the document at hand is not a fraudulent one. This can be achieved through a fingerprint that is based on the document’s content. To this end, it is necessary to analyze and describe the different constituent elements of the document: graphics, text, tables, as well as th...
Chapter
Post-OCR processing has significantly improved over the past few years. However, these have been primarily beneficial for texts consisting of natural, alphabetical words, as opposed to documents of numerical nature such as invoices, payslips, medical certificates, etc. To evaluate the OCR post-processing difficulty of these datasets, we propose a m...
Chapter
This paper provides an overview of the DocILE 2023 Competition, its tasks, participant submissions, the competition results and possible future research directions. This first edition of the competition focused on two Information Extraction tasks, Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR). Both of these task...
Chapter
Full-text available
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been...
Chapter
Along with the innovation and development of society, millions of documents are generated daily, and new types of documents related to new activities and services appear regularly. In the workflow for processing these documents, the first step is to classify the received documents to assign them to the relevant departments or staff. Therefore, two...
Chapter
Full-text available
This paper presents the results of the ICDAR 2023 competition on Document UnderstanDing of Everything. DUDE introduces a new dataset comprising 5K visually-rich documents (VRDs) with 40K questions with novelties related to types of questions, answers, and document layouts based on multi-industry, multi-domain, and multi-page VRDs of various origi...
Chapter
Information Extraction plays a key role in the automation of auditing processes in administrative documents. However, variety in layout and language is always a challenging task. On the other hand, large volumes of public training datasets related to administrative documents such as invoices are rare to find. In this work, we use Graph Attention Ne...
Chapter
Pre-trained models have proven their efficiency. Despite their good performance, these models require a lot of data and resources to achieve state-of-the-art results. In this paper, we propose KAP, a pre-trained model adapted to the domain specificity of corporate documents. KAP takes into account the domain specificity of corporate documents and pr...
Chapter
Key-value extraction is a challenging task in document AI, particularly in business documents such as invoices. Accurately extracting key-value pairs from such documents is crucial for downstream tasks like accounting, analytics, and decision-making. In this paper, we propose a method for grouping and linking key-value pairs in business documents u...
Chapter
In this paper, we propose a strategy to train a CNN to detect document manipulations in JPEG documents under a data scarcity scenario. When it comes to scanned PDF documents, it is common that the document consists of a JPEG image encapsulated into a PDF. Indeed, if the document before tampering was a JPEG image, its manipulation will lead to double co...
Preprint
Full-text available
Post-OCR processing has significantly improved over the past few years. However, these have been primarily beneficial for texts consisting of natural, alphabetical words, as opposed to documents of numerical nature such as invoices, payslips, medical certificates, etc. To evaluate the OCR post-processing difficulty of these datasets, we propose a m...
Conference Paper
Full-text available
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has bee...
Preprint
Full-text available
We call on the Document AI (DocAI) community to reevaluate current methodologies and embrace the challenge of creating more practically-oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset with novelties rel...
Preprint
In the recent past, complex deep neural networks have received huge interest in various document understanding tasks such as document image classification and document retrieval. As many document types have a distinct visual style, learning only visual features with deep CNNs to classify document images has encountered the problem of low inter-cla...
Preprint
Full-text available
The massive use of digital documents, driven by the strong trend toward paperless initiatives, has forced some companies to find ways to process thousands of documents per day automatically. To achieve this, they use automatic information retrieval (IR), which lets them quickly extract useful information from large datasets. In order to have effective...
Preprint
Full-text available
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been...
Article
Full-text available
Few-shot models have gained considerable popularity in the past few years. This is mostly because these models can structure the representation space (classes) using very few examples for each class. Such models are usually trained on a wide range of different classes and their examples, which allows them to form...
Article
Full-text available
Digital libraries have a key role in cultural heritage as they provide access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances...
Preprint
Full-text available
Multimodal learning from document data has achieved great success lately as it allows to pre-train semantically meaningful features as a prior into a learnable downstream approach. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-mod...
Chapter
Information extraction from corporate documents is attracting growing research interest, both for its economic value and as a scientific challenge. To extract this information, the use of textual and visual content becomes unavoidable to understand the inherent information of the image. The information to be extracted is most often fixed beforehand...
Chapter
Results of digitisation projects sometimes suffer from the limitations of optical character recognition software which is mainly designed for modern texts. Prior work has examined the impact of OCR errors on information retrieval (IR) and downstream natural language processing (NLP) tasks. However, questions remain open regarding the actual readabi...
Article
Full-text available
Named entities (NEs) are among the most relevant type of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through the...
Chapter
Full-text available
This paper proposes a hashing approach for character authentication and retrieval based on the combination of a convolutional neural network (CNN) and the iterative quantization (ITQ) algorithm. This hashing approach is made up of two steps: feature extraction and hash construction. The feature extraction step involves the reduction of high-dimensi...
Article
Full-text available
In this paper, we make use of the 2-dimensional data obtained through t-distributed Stochastic Neighbor Embedding (t-SNE) applied to high-dimensional data of Urdu handwritten characters and numerals. The instances of the dataset used for experimental work are classified in multiple classes depending on the shape similarity. We performed three tasks in...
Chapter
Full-text available
The corporate document classification process may rely on a textual approach considered separately from image features. Conversely, some methods use only the visual content of documents while ignoring the semantic information. This semantic information corresponds to an important part of corporate documents which make some classes of document impos...
Chapter
The present paper is focused on information extraction from key fields of invoices using two different methods based on sequence labeling. Invoices are semi-structured documents in which data can be located based on the context. Common information extraction systems are model-driven, using heuristics and lists of trigger words curated by domain exp...
Chapter
In the context of imbalanced classification, deep neural networks suffer from the lack of samples provided by poorly represented classes. They cannot sufficiently train their weights with a statistically reliable set. All solutions in the state of the art that could offer better performance for those classes sacrifice in return a huge part of their precisio...
Article
Full-text available
In the recent past, complex deep neural networks have received huge interest in various document understanding tasks such as document image classification and document retrieval. As many document types have a distinct visual style, learning only visual features with deep CNNs to classify document images has encountered the problem of low inter-clas...
Article
Full-text available
Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digi...
Article
We investigate the effectiveness of a successful model in Visual-Question-Answering (VQA) problems as the core component in a cross-modal retrieval system that can accept images or text as queries, in order to retrieve relevant data from a multimodal document collection. To this end, we adapt the VQA model for deep multimodal learning to combine vi...
Article
Full-text available
With progress in technological innovation, business organizations have come to prefer online trading over traditional ways of trading. Online stores let businessmen offer a greater variety of products without needing big warehouses. At the same time, online shopping also saves customers time and lets them enjoy buying-at-home ex...
Article
Full-text available
Graph-based methods have been widely used by the document image analysis and recognition community, as the different objects and the content in document images are best represented by this powerful structural representation. The design of novel computational tools for processing these graph-based structural representations has always remained a hot top...
Chapter
The amount of multimedia data on personal computers and the Internet has increased, and with it the demand for finding a particular image or a collection of images. This urges researchers to propose new, sophisticated methods to retrieve the information one desires. In the case of, the legacy approach cannot grow up with the rap...
Conference Paper
Named entities (NEs) are among the most relevant type of information that can be used to efficiently index and retrieve digital documents. Furthermore, the use of Entity Linking (EL) to disambiguate and relate NEs to knowledge bases, provides supplementary information which can be useful to differentiate ambiguous elements such as geographical loca...
Chapter
Scene Text VQA has been recently proposed as a new challenging task in the context of multimodal content description. The aim is to teach traditional VQA models to read text contained in natural images by performing a semantic analysis between the visual content and the textual information contained in associated questions to give the correct answe...
Chapter
In digital libraries, the accessibility of digitized documents is directly related to the way they are indexed. Named entities are one of the main entry points used to search and retrieve digital documents. However, most digitized documents are indexed through their OCRed version and OCR errors may hinder their accessibility. This paper aims to qua...
Chapter
Ancient printed documents are an infinite source of knowledge, but digital uses are usually complicated due to the age and the quality of the print. The Linguistic Atlas of France (ALF) maps are composed of printed phonetic words used to locate how words were pronounced over the country. Those words were printed using the Rousselot-Gillieron alphab...
Chapter
Full-text available
Separation of foreground text from noisy or textured background is an important preprocessing step for many document image processing problems. In this work, we focus on decorated background removal and the extraction of textual components from French university diplomas. As far as we know, this is the very first attempt to resolve this kind of probl...
Chapter
One major drawback of state of the art Neural Networks (NN)-based approaches for document classification purposes is the large number of training samples required to obtain an efficient classification. The minimum required number is around one thousand annotated documents for each class. In many cases it is very difficult, if not impossible, to gat...
Article
Full-text available
Handwritten text recognition in Urdu script has been an active research area. Significant progress has been made in this interesting and challenging field in the last few years. In this study, the authors present a comprehensive survey of a number of offline and online handwritten text recognition systems for Urdu script writt...
Preprint
Full-text available
One major drawback of state of the art Neural Networks (NN)-based approaches for document classification purposes is the large number of training samples required to obtain an efficient classification. The minimum required number is around one thousand annotated documents for each class. In many cases it is very difficult, if not impossible, to gat...
Chapter
Full-text available
The tourism industry could be one of the largest sources of revenue for any country. After the emergence of Web 2.0, it is also one of the largest data-intensive industries in the world. Tourism‐rich countries often use Tourism Information Systems (TIS) for management of tourism‐related data. These systems are used on several levels of tourism...
Article
Full-text available
In this paper, we propose an approach to interactively propagate annotations representing the historians’ knowledge on a database of lettrine images manually populated by historians (with annotations). Based on a novel document indexing processing scheme which combines the use of the Zipf law and the use of bag of patterns, our approach extends the...
Article
Full-text available
Our study is a scientometric analysis of research publications of science and social science disciplines in Pakistan during 2009–2018. The study examines 2000 published articles belonging to 50 research scholars of different disciplines. This analysis is conducted on three different levels: researcher level, field level and domain level. In this pa...
Preprint
There has been much work in the literature on the generation of various kinds of images, such as handwritten characters (MNIST dataset), scene images (CIFAR-10 dataset), various object images (ImageNet dataset), road signboard images (SVHN dataset), etc. Unfortunately, there has been a very limited amount of work done in the domain of document image pr...
Chapter
Segmentation techniques based on community detection algorithms generally have an over-segmentation problem. This paper therefore proposes a new algorithm to agglomerate near-homogeneous regions based on texture and color features. More specifically, our strategy relies on the use of a community detection on graphs algorithm (used as a clustering approac...
