Laure Berti-Equille

Institute of Research for Development | IRD · 228 - Space for Development (ESPACE-DEV)

PhD in Computer Science, HDR (French Degree)

About

195
Publications
54,783
Reads
2,174
Citations
Citations since 2016
71 Research Items
1408 Citations
Introduction
Data quality problems include duplicates, outliers, inconsistencies, missing values, and conflicting or obsolete data. All data types (numeric, categorical, (semi-)structured or free text, geo- and multimedia) and all application domains are affected by these problems. My research focuses on exploring and assessing the quality of data (i.e., detecting and quantifying data anomalies and inconsistencies that can co-exist in any dataset) and, most importantly, on truth discovery.
Additional affiliations
March 2014 - present
Qatar Computing Research Institute
Position
  • Senior Researcher
January 2012 - March 2014
Institute of Research for Development
Position
  • Managing Director
January 2011 - January 2012
Institute of Research for Development
Position
  • Managing Director

Publications

Publications (195)
Article
Full-text available
The challenges of Reproducibility and Replicability (R & R) in computer science experiments have become a focus of attention in the last decade, as efforts to adhere to good research practices have increased. However, experiments using Deep Learning (DL) remain difficult to reproduce due to the complexity of the techniques used. Challenges such as...
Poster
Full-text available
The challenges of Reproducibility and Replicability (R&R) have become a focus of attention in order to promote open and accessible research. Therefore, efforts have been made to develop good practices for R&R in the area of computer science. Nevertheless, Deep Learning (DL) based experiments remain difficult to reproduce by others due to the comple...
Poster
Full-text available
In computer science, there are more and more efforts to improve reproducibility. However, it is still difficult to reproduce the experiments of other scientists, and even more difficult when it comes to Deep Learning (DL). Making a DL research experiment reproducible requires a lot of work to document, verify, and make the system usable. These chal...
Preprint
The automatic discovery of functional dependencies (FDs) has been widely studied as one of the hardest problems in data profiling. Existing approaches have focused on making the FD computation efficient while inspecting single relations at a time. In this paper, for the first time, we address the problem of inferring FDs for multiple relations as the...
Chapter
One of the major issues in predicting poverty with satellite images is the lack of fine-grained and reliable poverty indicators. To address this problem, various methodologies were proposed recently. Most recent approaches use a proxy (e.g., nighttime light) as additional information to mitigate the problem of sparse data. They consist in buil...
Book
ICDE: International Conference on Data Engineering, Chania, GRC, 19/04/2021 - 22/04/2021
Book
SIGMOD/PODS '21: International Conference on Management of Data, Online, CHN, 12/12/2025 - 12/12/2030
Preprint
Full-text available
In this paper, we study the problem of discovering join FDs, i.e., functional dependencies (FDs) that hold on multiple joined tables. We leverage logical inference, selective mining, and sampling and show that we can discover most of the exact join FDs from the single tables participating in the join and avoid the full computation of the join resul...
Chapter
Data cleaning and data preparation have been long-standing challenges in data science to avoid incorrect results, biases, and misleading conclusions obtained from “dirty” data. For a given dataset and data analytics task, a plethora of data preprocessing techniques and alternative data cleaning strategies are available, but they may lead to dramati...
Preprint
Outlier detection is a fundamental task in data mining and has many applications including detecting errors in databases. While there has been extensive prior work on methods for outlier detection, modern datasets often have sizes that are beyond the ability of commonly used methods to process the data within a reasonable time. To overcome this iss...
Conference Paper
Data cleaning and preparation has been a long-standing challenge in data science to avoid incorrect results and misleading conclusions obtained from dirty data. For a given dataset and a given machine learning-based task, a plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outp...
Conference Paper
With the success of machine learning (ML) techniques, ML has already shown tremendous potential to impact the foundations, algorithms, and models of several data management tasks, such as error detection, data quality assessment, data cleaning, and data integration. In Knowledge Graphs, part of the data preparation and cleaning processes, such...
Article
Full-text available
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely...
Article
In this article we present two different approaches for automatic remote sensing image interpretation which are based on a multi-paradigm collaborative framework which uses classification in order to guide the segmentation process. The first approach applies sequentially many one-vs-all class extractors in a manner inspired by cascading techniques...
Chapter
In this article, we present a collaborative framework for joint segmentation and classification. The framework is guided by and aware of the quality of each segment at every stage; it allows the consideration of both homogeneity-based criteria and implicit semantic criteria to extract the objects belonging to a given thematic class. We apply...
Conference Paper
A large number of Distributed Reflective Denial-of-Service (DRDoS) attacks are launched every day, and our understanding of the modus operandi of their perpetrators is still very limited, as we are submerged with so much Big Data to analyze and do not have reliable and complete ways to validate our findings. In this paper, we propose a first analytic pipel...
Conference Paper
Full-text available
In an effort to curb air pollution, the city of Delhi (India), known to be one of the most populated, polluted, and congested cities in the world, has run a trial experiment in two phases of 15 days each. During the experiment, most four-wheeled vehicles were constrained to move on alternate days based on whether their plate numbers ended wi...
Conference Paper
In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. But, bad data quality causing incorrect analysis results and wrong interpretations may induce misleading conclusions and inadequate decisions. To ensure the validity of the results, avoid bias and data misuse, i...
Presentation
Full-text available
The presence of pollutants such as heavy metals, pesticides, pharmaceuticals and personal care products (PPCPs), or halogenated organic compounds in surface waters is a major concern due to their environmental and human impact. The uncontrolled release of such contaminants and the scarce information on their content in Mexican surface waters increase the...
Conference Paper
Full-text available
Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for the experts to define such FDs. In addition,...
Article
Image segmentation and classification tasks are closely related in remote sensing image analysis. Collaborative methods enable interaction between segmentation and classification approaches in order to improve their results simultaneously. In this article we present a collaborative framework that is gen...
Article
In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. But, bad data quality causing incorrect analysis results and wrong interpretations may induce misleading conclusions and inadequate decisions. To ensure the validity of the results, avoid bias and data misuse,...
Article
Full-text available
Recent studies have shown that the use of a priori knowledge can significantly improve the results of unsupervised classification. However, capturing and formatting such knowledge as constraints is not only very expensive, requiring the sustained involvement of an expert, but also very difficult because some valuable information can be lost whe...
Conference Paper
Full-text available
In this paper, we present a new approach combining topological unsupervised learning with ontology-based reasoning to achieve both: (i) automatic interpretation of clustering, and (ii) scaling ontology reasoning over large datasets. The value of such an approach lies in the use of expert knowledge to automate cluster labeling and gives them high...
Poster
Full-text available
Water quality monitoring is a regular practice to assess the presence of pollutants in the water. The importance of monitoring is justified by the need to know the current state of aquatic ecosystems to design appropriate conservative and protective actions (Serrano Balderas et al., 2015). Data from water quality monitoring may be prone to have var...
Conference Paper
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases system, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for mul...
Presentation
Full-text available
Tutorial at ICDE 2016, Helsinki, May 17th, 2016
Conference Paper
Diabetes is a leading health problem in the developed world. The recent surge of wealth in Qatar has made it one of the most vulnerable nations to diabetes and related diseases. Recent technological advances in 1H nuclear magnetic resonance (NMR) spectroscopy techniques for metabolomics profiling offer a great opportunity for biomarkers discovery t...
Conference Paper
Social networks and the Web in general are characterized by multiple information sources often claiming conflicting data values. Data veracity is hard to estimate, especially when there is no prior knowledge about the sources or the claims in time-dependent scenarios (e.g., crisis situations) where initially very few observers can report first infor...
Chapter
Full-text available
In this chapter we discuss some open issues related to two typologies of information sources that nowadays are particularly significant, namely, Web data and Big Data.
Conference Paper
Full-text available
Data Forensics with Analytics, or DAFNA for short, is an ambitious project initiated by the Data Analytics Research Group in Qatar Computing Research Institute, Hamad Bin Khalifa University. Its main goal is to provide effective algorithms and tools for determining the veracity of structured information when it originates from multiple sources. The...
Conference Paper
Full-text available
Doha is one of the fastest growing cities of the world with a population that has increased by nearly 40% in the last five years. There are two significant trends that are relevant to our proposal. First, the government of Qatar is actively engaged in embracing the use of fine-grained data to “sense” the city for maintaining current services and fu...
Article
Full-text available
Climate change has received extensive attention from public opinion in the last couple of years, after being considered for decades as an exclusively scientific debate. Governments and world-wide organizations such as the United Nations are working more than ever on raising and maintaining public awareness toward this global issue. In the present...
Conference Paper
Full-text available
In this article we present CoSC, a collaborative framework for the segmentation and classification of remote sensing images that extracts the objects of a given thematic class. The collaboration process is guided by data quality, assessed by homogeneity criteria as well as criteria implicitly related to the sem...
Conference Paper
Full-text available
The use of a priori knowledge can greatly improve unsupervised classification. Injecting this knowledge in the form of constraints on the data is among the most effective techniques in the literature. However, generating constraints is very costly and requires the intervention of an expert; the semanti...
Article
Full-text available
The use of a priori knowledge can greatly improve unsupervised classification. Injecting this knowledge in the form of constraints on the data is among the most effective techniques in the literature. However, generating constraints is very costly and requires the intervention of an expert; the semanti...
Article
In this paper, we present a new approach combining topological unsupervised learning with ontology-based reasoning to achieve both: (i) automatic interpretation of clustering, and (ii) scaling ontology reasoning over large datasets. The value of such an approach lies in the use of expert knowledge to automate cluster labeling and gives them high l...
Book
Full-text available
On the Web, a massive amount of user-generated content is available through various channels (e.g., texts, tweets, Web tables, databases, multimedia-sharing platforms, etc.). Conflicting information, rumors, erroneous and fake content can be easily spread across multiple sources, making it hard to distinguish between what is true and what is not. T...
Article
Full-text available
Object-based image analysis (OBIA) has been widely adopted as a common paradigm to deal with very high-resolution remote sensing images. Nevertheless, OBIA methods strongly depend on the results of image segmentation. Many segmentation quality metrics have been proposed. Supervised metrics give accurate quality estimation but require a ground-truth...