About
409
Publications
38,316
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
23,993
Citations
Publications
Publications (409)
Perspectives on the role and responsibility of the data-management research community in designing, developing, using, and overseeing automated decision systems.
The data revolution continues to transform every sector of science, industry, and government. Due to the incredible impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly—in accordance with laws and ethical norms. In this article, we discuss three recent regulatory fr...
The data revolution continues to transform every sector of science, industry and government. Due to the incredible impact of data-driven technology on society, we are becoming increasingly aware of the imperative to use data and algorithms responsibly -- in accordance with laws and ethical norms. In this article we discuss three recent regulatory f...
We pursue an investigation of data-driven collaborative workflows. In the model, peers can access and update local data, causing side-effects on other peers' data. In this paper, we study means of explaining to a peer her local view of a global run, both at runtime and statically. We consider the notion of "scenario for a given peer" that is a subr...
Issues of responsible data analysis and use are coming to the forefront of the discourse in data science research and practice, with most significant efforts to date on the part of the data mining, machine learning, and security and privacy communities. In these fields, the research has been focused on analyzing the fairness, accountability and tra...
In April 2016, a community of researchers working in the area of Principles of Data Management (PDM) joined in a workshop at the Dagstuhl Castle in Germany. The workshop was organized jointly by the Executive Committee of the ACM Symposium on Principles of Database Systems (PODS) and the Council of the International Conference on Database Theory (I...
The typical Internet user has data spread over several devices and across several online systems. We demonstrate an open-source system for integrating user's data from different sources into a single Knowledge Base. Our system integrates data of different kinds into a coherent whole, starting with email messages, calendar, contacts, and location hi...
We designed a system to infer multimodal itineraries traveled by a user from a combination of smartphone sensor data (e.g., GPS, Wi-Fi, accelerometer) and knowledge of the transport network infrastructure (e.g., road and rail maps, public transportation timetables). The system uses a Transportation network that captures the set of possible paths of...
We study the problem of, given a corpus of XML documents and its schema, finding an optimal (generative) probabilistic model, where optimality here means maximizing the likelihood of the particular corpus to be generated. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic mo...
The management of Web users' personal information is increasingly distributed across a broad array of applications and systems, including online social networks and cloud-based services. Users wish to share data using these systems, but avoiding the risks of unintended disclosures or unauthorized access by applications has become a major challenge....
Avec la participation de François Bancilhon (Data publica) François Bourdoncle (Dassault systèmes) Stephan Clemencon (Telecom ParisTech) Colin de la Higuera (U. Nantes, SIF) Gilbert Saporta (CNAM) Francoise Soulie-‐Fogelman (Kxen) François Bourdoncle et Paul Hermelin ont été nommés « chefs de file » de la filière Big Data française. Leur mission e...
We study deduction in the presence of inconsistencies. Following previous works, we capture deduction via datalog programs and inconsistencies through violations of functional dependencies (FDs). We study and compare two semantics for datalog with FDs: the first, of a logical nature, is based on inferring facts one at a time, while never violating...
Le Web contient une masse impressionnante de données, plus ou moins explicites et plus ou moins accessibles aux machines. Nous discutons ici des grandes tendances pour le management de ces données : l’extraction de connaissances du Web, l’enrichissement des connaissances par la communauté des internautes, leur représentation sous forme logique, et...
We survey recent work on the specification of an access control mechanism in
a collaborative environment. The work is presented in the context of the
WebdamLog language, an extension of datalog to a distributed context. We
discuss a fine-grained access control mechanism for intentional data based on
provenance as well as a control mechanism for del...
We introduce and study a model of collaborative data-driven workflows. In a local-as-view style, each peer has a partial view of a global instance that remains purely virtual. Local updates have side effects on other peers' data, defined via the global instance. We also assume that the peers provide (an abstraction of) their specifications, so that...
We present the WebdamLog system for managing distributed data on the Web in a
peer-to-peer manner. We demonstrate the main features of the system through an
application called Wepic for sharing pictures between attendees of the sigmod
conference. Using Wepic, the attendees will be able to share, download, rate
and annotate pictures in a highly dece...
We study the use of WebdamLog, a declarative high-level lan- guage in the
style of datalog, to support the distribution of both data and knowledge (i.e.,
programs) over a network of au- tonomous peers. The main novelty of WebdamLog
compared to datalog is its use of delegation, that is, the ability for a peer
to communicate a program to another peer...
The analysis of datalog programs over relational structures has been studied in depth, most notably the problem of containment. The analysis problems that have been considered were shown to be undecidable with the exception of (i) containment of arbitrary programs in nonrecursive ones, (ii) containment of monadic programs, and (iii) emptiness. In t...
The Internet and World Wide Web have revolutionized access to information. Users now store information across multiple platforms from personal computers, to smartphones, to websites such as Youtube and Picasa. As a consequence, data management concepts, methods, and techniques are increasingly focused on distribution concerns. Now that information...
Editing an XML document manually is a complicated task. While many XML editors exist in the market, we argue that some important functionalities are missing in all of them. Our goal is to makes the editing task simpler and faster. We present ALEX (Auto-completion Learning Editor for XML), an editor that assists the users by providing intelligent au...
We address the problem of comparing the expressiveness of workflow specification formalisms using a notion of view of a workflow. Views allow to compare widely different workflow systems by mapping them to a common representation capturing the observables relevant to the comparison. Using this framework, we compare the expressiveness of several wor...
The Webdam ERC grant is a five-year project that started in December 2008. The goal is to develop a formal model for Web data management that would open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal f...
This paper addresses the challenges faced by everyday Web users, who interact with inherently heterogeneous and distributed information. Managing such data is currently beyond the skills of casual users. We describe ongoing work that has as its goal the development of foundations for declarative distributed data management. In this approach, we see...
We study highly expressive query languages for unordered data trees, using as formal vehicles Active XML and extensions of languages in the while family. All languages may be seen as adding some form of control on top of a set of basic pattern queries. The results highlight the impact and interplay of different factors: the expressive power of basi...
Sources of data uncertainty and imprecision are numerous. A way to handle this uncertainty is to associate probabilistic annotations to data. Many such probabilistic database models have been proposed, both in the relational and in the semi-structured setting. The latter is particularly well adapted to the management of uncertain data coming from a...
One of the main challenges that the Semantic Web faces is the integration of
a growing number of independently designed ontologies. In this work, we present
PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not
only instances, but also relations and classes. Alignments at the instance
level cross-fertilize with alignments a...
In this paper, we study watermarking methods to prove the ownership of an ontology. Different from existing approaches, we propose to watermark not by altering existing statements, but by removing them. Thereby, our approach does not introduce false statements into the ontology. We show how ownership of ontologies can be established with provably t...
We study the problem of, given a corpus of XML documents and its schema, finding an optimal probabilistic model (optimality meaning maximizing the likelihood of the corpus to be generated). We present an efficient algorithm for finding the best probabilistic model, in absence of constraints. We further study the problem in presence of integrity con...
While classic data management focuses on the data itself, research on Business Processes considers also the context in which this data is generated and manipulated, namely the processes, the users, and the goals that this data serves. This allows ...
There is a new trend to use Datalog-style rule-based languages to specify modern distributed applications, notably on the Web. We introduce here such a language for a distributed data model where peers exchange messages (i.e. logical facts) as well as rules. The model is formally defined and its interest for distributed data management is illustrat...
We present PARIS, an approach for the automatic alignment of ontologies.
PARIS aligns not only instances, but also relations and classes. Alignments at
the instance-level cross-fertilize with alignments at the schema-level.
Thereby, our system provides a truly holistic solution to the problem of
ontology alignment. The heart of the approach is prob...
Preference queries incorporate the notion of binary preference relation into relational database querying. Instead of returning all the answers, such queries return only the best answers, according to a given preference ...
Distributed data management systems consist of peers that store, exchange and process data in order to collaboratively achieve a common goal, such as evaluate some query. We study the equivalence of such systems. We model a distributed system by a collection of Active XML documents, i.e., trees augmented with function calls for performing tasks suc...
We address the problem of comparing the expressiveness of workflow specification formalisms using a notion of view of a workflow. Views allow to compare widely different workflow systems by mapping them to a common representation capturing the observables relevant to the comparison. Using this framework, we compare the expressiveness of several wor...
We study the problem of, given a corpus of XML documents and its schema, finding an optimal (generative) probabilistic model, where optimality here means maximizing the likelihood of the particular corpus to be generated. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic mo...
A distributed XML document is an XML document that spans several machines. We assume that a distribution design of the document tree is given, consisting of an XML kernel-documentT[f1,…,fn] where some leaves are “docking points” for external resources providing XML subtrees (f1,…,fn, standing, e.g., for Web services or peers at remote locations). T...
A distributed XML document is an XML document that spans several machines. We assume that a distribution design of the document tree is given, consisting of an XML kernel-document T[f1,...,fn] where some leaves are "docking points" for external resources providing XML subtrees (f1,...,fn, standing, e.g., for Web services or peers at remote location...
The workflow models have been essentially operation-centric for many years, ignoring almost completely the data aspects. Recently, a new paradigm of data-centric workflows, called business arti- facts, has been introduced by Nigam and Caswell. We follow this approach and propose a model where artifacts are XML documents that evolve in time due to int...
Sources of data uncertainty and imprecision are numerous. A way to handle this uncertainty is to associate probabilistic annotations to data. Many such probabilistic database models have been proposed, both in the relational and in the semi-structured setting. The latter is particularly well adapted to the management of uncertain data coming from a...
The emergence of Web 2.0 and social network applications has enabled more and more users to share sensitive information over the Web. The information they manipulate has many facets: personal data (e.g., pictures, movies, music, contacts, emails), social data (e.g., annotations, recommendations, contacts), localization information (e.g., bookmarks)...
We consider a set of views stating possibly conflicting facts. Negative facts in the views may come, e.g., from functional dependencies in the underlying database schema. We want to predict the truth values of the facts. Beyond simple methods such as voting (typically rather accurate), we explore techniques based on ``corroboration'', i.e., taking...
Active XML is a high-level specification language tailored to data-intensive, distributed, dynamic Web services. Active XML is based on XML documents with embedded function calls. The state of a document evolves depending on the result of internal function calls (local computations) or external ones (interactions with users or other services). Func...
Various known models of probabilistic XML can be represented as instantiations of the abstract notion of p-documents. In addition to ordinary nodes, p-documents have distributionalnodesthatspecifythepossibleworldsand their probabilistic distribution. Particular families of p-doc- uments are determined by the types of distributional nodes that can b...
We introduce in this paper a class of constraints for describing how an XML document can evolve, namely XML update constraints. For these constraints, we study the implication problem, giving algorithms and complexity results for constraints of varying expressive power. Besides classical constraint implication, we also consider an instance-based ap...
Developer communities built around software products, like the SAP Community Network, provide a knowledge base for reocurring problems and their solutions. Due to the large amount of content maintained in such communities, e.g., in forums, finding relevant solutions is a major challenge beyond the scope of common keyword-based search engines. In fa...
Towards a data-centric workflow approach, we introduce an artifact model to capture data and workflow management activities in distributed settings. The model is built on Active XML, i.e., XML trees including Web service calls. We argue that the model captures the essential features of business artifacts as described informally in (1) or discussed...
Many Web applications are based on dynamic interactions between Web components exchanging flows of information. Such a situa- tion arises for instance in mashup systems (22) or when monitoring distributed autonomous systems (6). This is a challenging prob- lem that has generated recently a lot of attention; see Web 2.0 (38). For capturing interacti...
A distributed XML document is an XML document that spans several machines or Web repositories. We assume that a distribution design of the document tree is given, providing an XML tree some of whose leaves are "docking points", to which XML subtrees can be attached. These subtrees may be provided and controlled by peers at remote locations, or may...
A mashup is a Web application that integrates data, computation and GUI provided by several systems into a unique tool. The concept originated from the understanding that the number of applications available on the Web and the need for combining them to meet user requirements, are growing very rapidly. This demo presents MatchUp, a system that supp...
Many Web applications are based on dynamic interactions between Web components exchanging ows of information. Such a situation arises for instance in mashup systems or when monitoring distributed autonomous systems. Our work is in this challenging context that has generated recently a lot of attention; see Web 2.0. We introduce the axlog for- mal m...
evolving data (1). The model captures both the flow of control (workflow) of the application and the evolution of the relevant data (data cycle); see (2) for a brief survey. In the same spirit, we propose a new artifact model building upon Active XML (AXML for short), an extension of XML with embedded service calls (3). The services are hosted by a...
Information ubiquity has created a large crowd of users (most notably scientists), who could employ DBMS technology to share and search their data more effectively. Still, this user base prefers to keep its data in files that can be easily managed by applications such as spreadsheets, rather than deal with the complexity and rigidity of modern data...
Les sources d'incertitude et d'imprécision des données sont nombreuses. Une ma-nière de gérer cette incertitude est d'associer aux données des annotations probabi-listes. De nombreux modèles de bases de données probabilistes ont ainsi été proposés, dans les cadres relationnel et semi-structuré. Ce dernier est particulièrement adapté à la gestion de...