Khalid Belhajjame

Paris Dauphine University, Paris, Île-de-France, France

Publications (67) · 37.44 Total impact

  •
    ABSTRACT: Scientific workflows are a popular mechanism for specifying and automating data-driven in silico experiments. A significant aspect of their value lies in their potential to be reused. Once shared, workflows become useful building blocks that can be combined or modified for developing new experiments. However, previous studies have shown that storing workflow specifications alone is not sufficient to ensure that they can be successfully reused, without being able to understand what the workflows aim to achieve or to re-enact them. To gain an understanding of the workflow, and how it may be used and repurposed for their needs, scientists require access to additional resources such as annotations describing the workflow, datasets used and produced by the workflow, and provenance traces recording workflow executions.
    Journal of Web Semantics 02/2015; DOI:10.1016/j.websem.2015.01.003 · 1.38 Impact Factor
  •
    ABSTRACT: Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.
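    To make the comment-annotation mechanism concrete, here is a minimal sketch of an ordinary Python script carrying YesWorkflow-style tags. The tag names (@begin, @end, @in, @out) follow the YesWorkflow conventions as I recall them, and the script itself is invented for illustration; YesWorkflow's extractor would recover the module and dataflow structure from these comments.

        # simulate.py -- an ordinary Python script; the YesWorkflow tags live in
        # comments, so the script runs unchanged with or without YesWorkflow installed.

        # @begin simulate_workflow
        # @in  n_samples
        # @out summary_file

        import random
        import statistics

        # @begin generate_samples
        # @in  n_samples
        # @out samples
        def generate_samples(n_samples):
            return [random.gauss(0.0, 1.0) for _ in range(n_samples)]
        # @end generate_samples

        # @begin summarize
        # @in  samples
        # @out summary_file
        def summarize(samples, summary_file="summary.txt"):
            with open(summary_file, "w") as f:
                f.write(f"mean={statistics.mean(samples):.3f}\n")
        # @end summarize

        if __name__ == "__main__":
            summarize(generate_samples(1000))
        # @end simulate_workflow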
  •
    ABSTRACT: Background: One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows. Results: We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined Best Practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as "which particular data was input to a particular workflow to test a particular hypothesis?", and "which particular conclusions were drawn from a particular workflow?". Conclusions: Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well. Availability: The Research Object is available at http://www.myexperiment.org/packs/428. The Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro.
    Journal of Biomedical Semantics 09/2014; 5(41). DOI:10.1186/2041-1480-5-41
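    The quoted competency question can be answered mechanically once the experiment is described in RDF. The sketch below is illustrative only: it uses rdflib, the standard PROV-O usage relation, and a made-up ex:testsHypothesis property standing in for the annotation vocabulary actually used in the paper.

        from rdflib import Graph, Namespace

        PROV = Namespace("http://www.w3.org/ns/prov#")
        EX = Namespace("http://example.org/")   # hypothetical annotation terms

        g = Graph()

        # Toy facts standing in for the RO's annotations and provenance trace.
        g.add((EX.workflow_run_1, EX.testsHypothesis, EX.metabolite_variation_hypothesis))
        g.add((EX.workflow_run_1, PROV.used, EX.genotype_dataset))

        # "Which data was input to a workflow run testing a given hypothesis?"
        q = """
        SELECT ?data WHERE {
            ?run ex:testsHypothesis ex:metabolite_variation_hypothesis ;
                 prov:used ?data .
        }
        """
        for row in g.query(q, initNs={"prov": PROV, "ex": EX}):
            print(row[0])   # -> http://example.org/genotype_dataset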
  • Khalid Belhajjame, Suzanne M. Embury, Norman W. Paton
    ABSTRACT: Semantic annotation of web services has been proposed as a solution to the problem of discovering services to fit a particular need, and reusing them appropriately. While there exist tools that assist human users in the annotation task, e.g., Radiant and Meteor-S, no semantic annotation proposal considers the problem of verifying the accuracy of the resulting annotations. Early evidence from workflow compatibility checking suggests that the proportion of annotations that are inaccurate is high, and yet no tools exist to help annotators to test the results of their work systematically before they are deployed for public use. In this paper, we adapt techniques from conventional software testing to the verification of semantic annotations for web service input and output parameters. We present an algorithm for the testing process, and discuss ways in which manual effort from the annotator during testing can be reduced. We also present two adequacy criteria for specifying test cases used as input for the testing process. These criteria are based on structural coverage of the domain ontology used for annotation. The results of an evaluation exercise, based on a collection of annotations for bioinformatics web services, show that defects can be successfully detected by the technique.
    IEEE Transactions on Services Computing 07/2014; 7(3). DOI:10.1109/TSC.2013.4 · 1.99 Impact Factor
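    One of the adequacy criteria mentioned above is structural coverage of the domain ontology used for annotation. The following sketch is my own illustration (not the paper's algorithm) of the simplest such measure: the fraction of ontology classes exercised by at least one test case.

        # Hypothetical, hard-coded example: a tiny class hierarchy and a set of
        # test cases, each labelled with the ontology class of the value it feeds
        # to the annotated service parameter.

        ontology_classes = {
            "Sequence", "ProteinSequence", "DNASequence", "Report", "BlastReport",
        }

        test_cases = [
            {"input_class": "ProteinSequence", "value": "MKTAYIAKQR"},
            {"input_class": "BlastReport",     "value": "<report/>"},
        ]

        covered = {t["input_class"] for t in test_cases} & ontology_classes
        coverage = len(covered) / len(ontology_classes)

        print(f"class coverage: {coverage:.0%}")           # 40% here
        print("uncovered:", sorted(ontology_classes - covered))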
  •
    ABSTRACT: It has become widely recognized that user feedback can play a fundamental role in facilitating information integration tasks, e.g., the construction of integration schema and the specification of schema mappings. While promising, existing proposals make the assumption that the users providing feedback expect the same results from the integration system. In practice, however, different users may anticipate different results, due, e.g., to their preferences or application of interest, in which case the feedback they provide may be conflicting, thereby deteriorating the quality of the services provided by the integration system. In this paper, we present clustering strategies for grouping information integration users into groups of users with similar expectations as to the results delivered by the integration system. As well as grouping information integration users, we show that clustering results can be used as inputs to a wide range of functionalities that are relevant in the context of crowd-driven information integration. Specifically, we show that clustering can be used to identify feedback of relevance to a given user by exploiting the feedback provided by other users in the same cluster. We report on evaluation exercises that assess the effectiveness of the clustering strategies we propose, and showcase the benefits community- and crowd-driven information integration can derive from clustering.
    Distributed and Parallel Databases 03/2014; 33(1):1-35. DOI:10.1007/s10619-014-7160-z · 1.00 Impact Factor
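    As a rough illustration of the clustering idea (not the specific strategies evaluated in the paper), users can be represented by vectors of their feedback on shared query results and grouped with an off-the-shelf algorithm; the sketch assumes NumPy and scikit-learn, and the feedback values are invented.

        import numpy as np
        from sklearn.cluster import KMeans

        # Rows: users; columns: feedback on the same five result tuples.
        # +1 = "I expect this result", -1 = "I do not", 0 = no feedback given.
        feedback = np.array([
            [ 1,  1, -1,  0,  1],
            [ 1,  1, -1,  1,  1],
            [-1, -1,  1,  1,  0],
            [-1,  0,  1,  1, -1],
        ])

        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(feedback)
        print(km.labels_)   # e.g. users 0/1 in one cluster, users 2/3 in the other

        # Feedback from other members of a user's cluster can then be reused to
        # annotate mappings or results on that user's behalf.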
  •
    ABSTRACT: Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVerse and FigShare. Such repositories provide users with the means to share and publish datasets that were used or generated in research investigations. While these repositories have proven their usefulness, interpreting and reusing evidence for most research results is a challenging task. Additional contextual descriptions are needed to understand how those results were generated and/or the circumstances under which they were concluded. Because of this, scientists are calling for models that go beyond the publication of datasets to systematically capture the life cycle of scientific investigations and provide a single entry point to access the information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc. In this paper we present the Research Object (RO) suite of ontologies, which provide a structured container to encapsulate research data and methods along with essential metadata descriptions. Research Objects are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results. The ontologies we present have been designed in the light of requirements that we gathered from life scientists. They have been built upon existing popular vocabularies to facilitate interoperability. Furthermore, we have developed tools to support the creation and sharing of Research Objects, thereby promoting and facilitating their adoption.
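    A minimal sketch of the "structured container" idea using rdflib. Only ORE aggregation and DC Terms are used here; the RO ontologies described above define richer terms for annotations, workflows and provenance, so treat this as a simplification rather than the model itself.

        from rdflib import Graph, Namespace, URIRef, Literal
        from rdflib.namespace import RDF, DCTERMS

        ORE = Namespace("http://www.openarchives.org/ore/terms/")

        g = Graph()
        g.bind("ore", ORE)
        g.bind("dcterms", DCTERMS)

        ro = URIRef("http://example.org/ro/metabolite-study")   # hypothetical IRI
        g.add((ro, RDF.type, ORE.Aggregation))
        g.add((ro, DCTERMS.creator, Literal("K. Belhajjame")))
        g.add((ro, DCTERMS.description,
               Literal("Research Object bundling a workflow, its inputs and results")))

        # The aggregated resources: a dataset, a workflow definition, a result file.
        for part in ("data/input.csv", "workflow/analysis.t2flow", "results/out.csv"):
            g.add((ro, ORE.aggregates, URIRef(f"http://example.org/ro/{part}")))

        print(g.serialize(format="turtle"))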
  •
    ABSTRACT: Provenance is a critical ingredient for establishing trust of published scientific content. This is true whether we are considering a data set, a computational workflow, a peer-reviewed publication or a simple scientific claim with supportive evidence. Existing vocabularies such as DC Terms and the W3C PROV-O are domain-independent and general-purpose, and they allow and encourage extensions to cover more specific needs. We identify the specific need for identifying or distinguishing between the various roles assumed by agents manipulating digital artifacts, such as author, contributor and curator. We present the Provenance, Authoring and Versioning ontology (PAV): a lightweight ontology for capturing just enough descriptions essential for tracking the provenance, authoring and versioning of web resources. We argue that such descriptions are essential for digital scientific content. PAV distinguishes between contributors, authors and curators of content and creators of representations, in addition to the provenance of originating resources that have been accessed, transformed and consumed. We explore five projects (and communities) that have adopted PAV, illustrating their usage through concrete examples. Moreover, we present mappings that show how PAV extends the PROV-O ontology to support broader interoperability. The authors strived to keep PAV lightweight and compact by including only those terms that have been demonstrated to be pragmatically useful in existing applications, and by recommending terms from existing ontologies when plausible. We analyze and compare PAV with related approaches, namely Provenance Vocabulary, DC Terms and BIBFRAME. We identify similarities and analyze their differences with PAV, outlining strengths and weaknesses of our proposed model. We specify SKOS mappings that align PAV with DC Terms.
    Journal of Biomedical Semantics 11/2013; 4(37). DOI:10.1186/2041-1480-4-37
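    A small illustration of the author/contributor/curator distinction using rdflib. The property names follow the PAV namespace (http://purl.org/pav/) as I recall it from the documentation; the resource and people IRIs are invented.

        from rdflib import Graph, Namespace, Literal
        from rdflib.namespace import XSD

        PAV = Namespace("http://purl.org/pav/")
        EX = Namespace("http://example.org/")

        g = Graph()
        g.bind("pav", PAV)

        doc = EX["dataset/v2"]
        g.add((doc, PAV.authoredBy,      EX["people/alice"]))   # wrote the content
        g.add((doc, PAV.curatedBy,       EX["people/bob"]))     # checked and maintained it
        g.add((doc, PAV.createdBy,       EX["people/carol"]))   # produced this representation
        g.add((doc, PAV.version,         Literal("2.0")))
        g.add((doc, PAV.previousVersion, EX["dataset/v1"]))
        g.add((doc, PAV.createdOn,
               Literal("2013-11-01T00:00:00Z", datatype=XSD.dateTime)))

        print(g.serialize(format="turtle"))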
  •
    ABSTRACT: Dataspace management systems (DSMSs) hold the promise of pay-as-you-go data integration. We describe a comprehensive model of DSMS functionality using an algebraic style. We begin by characterizing a dataspace life cycle and highlighting opportunities for both automation and user-driven improvement techniques. Building on the observation that many of the techniques developed in model management are of use in data integration contexts as well, we briefly introduce the model management area and explain how previous work on both data integration and model management needs extending if the full dataspace life cycle is to be supported. We show that many model management operators already enable important functionality (e.g., the merging of schemas, the composition of mappings, etc.) and formulate these capabilities in an algebraic structure, thereby giving rise to the notion of the core functionality of a DSMS as a many-sorted algebra. Given this view, we show how core tasks in the dataspace life cycle can be enacted by means of algebraic programs. An extended case study illustrates how such algebraic programs capture a challenging, practical scenario.
    Advanced Query Processing, 11/2013: pages 305-341; ISBN: 978-3-642-28322-2
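    To make the "many-sorted algebra" idea concrete, the hypothetical sketch below types two of the sorts (schemas and mappings) and composes toy operators into a small algebraic program that bootstraps an integration schema. The operator names and signatures are my own shorthand, not those defined in the chapter.

        from dataclasses import dataclass

        # Sorts of the algebra (drastically simplified).
        @dataclass(frozen=True)
        class Schema:
            name: str
            attributes: frozenset

        @dataclass(frozen=True)
        class Mapping:
            source: Schema
            target: Schema
            pairs: frozenset           # (source_attr, target_attr) correspondences

        # Operators over the sorts.
        def match(s: Schema, t: Schema) -> Mapping:
            """Guess correspondences by attribute-name equality (toy matcher)."""
            return Mapping(s, t, frozenset((a, a) for a in s.attributes & t.attributes))

        def merge(s: Schema, t: Schema, m: Mapping) -> Schema:
            """Merge two schemas; the toy version just unions attribute names,
            whereas a real operator would use m to guide renaming and unification."""
            return Schema(f"{s.name}+{t.name}", s.attributes | t.attributes)

        # An "algebraic program": bootstrap an integration schema from two sources.
        s1 = Schema("protein_db", frozenset({"id", "sequence", "species"}))
        s2 = Schema("gene_db", frozenset({"id", "symbol", "species"}))
        integration_schema = merge(s1, s2, match(s1, s2))
        print(integration_schema)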
  •
    ABSTRACT: Reusing and repurposing scientific workflows for novel scientific experiments is nowadays facilitated by workflow repositories. Such repositories allow scientists to find existing workflows and re-execute them. However, workflow input parameters often need to be adjusted to the research problem at hand. Adapting these parameters may become a daunting task due to the infinite combinations of their values in a wide range of applications. Thus, a scientist may preferably use an automated optimization mechanism to adjust the workflow set-up and improve the result. Currently, automated optimizations must be started from scratch as optimization meta-data are not stored together with workflow provenance data. This important meta-data is lost and can neither be reused nor assessed by other researchers. In this paper we present a novel approach to capture optimization meta-data by extending the Research Object model and reusing the W3C standards. We validate our proposal through a real-world use case taken from the biodiversity domain, and discuss the exploitation of our solution in the context of existing e-Science infrastructures.
    Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science; 11/2013
  • Pinar Alper, Carole A. Goble, Khalid Belhajjame
    ABSTRACT: Thanks to the proliferation of computational techniques and the availability of datasets, data-intensive research has become commonplace in science. Sharing and re-use of datasets is key to scientific progress. A critical requirement for enabling data re-use is for data to be accompanied by lineage metadata that describes the context in which data is produced, the source datasets from which it was derived and the tooling or settings involved in its generation. By and large, this metadata is provided through a manual curation process, which is tedious, repetitive and time consuming. In this paper, we explore the problem of curating data artifacts generated from scientific workflows, which have become an established method for organizing computational data analyses. Most workflow systems can be instrumented to gather provenance, i.e., lineage, information about the data artifacts generated as a result of their execution. While this form of raw provenance provides elaborate information on localized lineage traced during a run in the form of data derivation or activity causality relations, it is of little use when one needs to report on lineage in a broader scientific context. And, consequently, datasets resulting from workflow-based analyses also require manual curation prior to their publishing. We argue that by making the analysis process explicit, workflow-based investigations provide an opportunity for semi-automating data curation. In this paper we introduce a novel approach that semi-automates curation through a special kind of workflow, which we call a Labeling Workflow. Using 1) the description of a scientific workflow, 2) a set of semantic annotations characterizing the data processing in workflows, and 3) a library of label handling functions, we devise a Labeling Workflow, which can be executed over raw provenance in order to curate the data artifacts it refers to. We semi-formally describe the elements of our solution, and showcase its usefulness using an example from biodiversity.
  •
    ABSTRACT: While workflow technology has gained momentum in the last decade as a means for specifying and enacting computational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data intensive activities that are observed in workflows (data oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc.
    Future Generation Computer Systems 09/2013; 36. DOI:10.1016/j.future.2013.09.018 · 2.64 Impact Factor
  •
    ABSTRACT: One aspect of the vision of dataspaces has been articulated as providing various benefits of classical data integration with reduced up-front costs. In this paper, we present techniques that aim to support schema mapping specification through interaction with end users in a pay-as-you-go fashion. In particular, we show how schema mappings, which are obtained automatically using existing matching and mapping generation techniques, can be annotated with metrics estimating their fitness to user requirements using feedback on query results obtained from end users. Using the annotations computed on the basis of user feedback, and given user requirements in terms of precision and recall, we present a method for selecting the set of mappings that produce results meeting the stated requirements. In doing so, we cast mapping selection as an optimization problem. Feedback may reveal that the quality of schema mappings is poor. We show how mapping annotations can be used to support the derivation of better quality mappings from existing mappings through refinement. An evolutionary algorithm is used to efficiently and effectively explore the large space of mappings that can be obtained through refinement. User feedback can also be used to annotate the results of the queries that the user poses against an integration schema. We show how estimates for precision and recall can be computed for such queries. We also investigate the problem of propagating feedback about the results of (integration) queries down to the mappings used to populate the base relations in the integration schema.
    Information Systems 07/2013; 38(5):656–687. DOI:10.1016/j.is.2013.01.006 · 1.24 Impact Factor
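    A toy version of the selection step described above (my own illustration, not the paper's formulation): each mapping carries precision and recall estimates derived from feedback, and the smallest subset whose aggregated estimates meet the user's thresholds is chosen by exhaustive search. The aggregation model is invented for the example.

        from itertools import combinations

        # (mapping id, estimated precision, estimated recall contribution);
        # the numbers are invented, as is the aggregation model below.
        mappings = [
            ("m1", 0.95, 0.40),
            ("m2", 0.70, 0.50),
            ("m3", 0.90, 0.35),
        ]

        def aggregate(subset):
            """Toy model: recall contributions add up (capped at 1.0); precision is
            the recall-weighted average of the members' precisions."""
            recall = min(1.0, sum(r for _, _, r in subset))
            precision = sum(p * r for _, p, r in subset) / sum(r for _, _, r in subset)
            return precision, recall

        def select(mappings, min_precision, min_recall):
            best, best_size = None, None
            for k in range(1, len(mappings) + 1):
                for subset in combinations(mappings, k):
                    p, r = aggregate(subset)
                    if p >= min_precision and r >= min_recall:
                        if best is None or k < best_size:
                            best, best_size = subset, k
            return best

        chosen = select(mappings, min_precision=0.85, min_recall=0.70)
        print([m for m, _, _ in chosen] if chosen else "no subset meets the thresholds")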
  •
    ABSTRACT: The Taverna workflow tool suite (http://www.taverna.org.uk) is designed to combine distributed Web Services and/or local tools into complex analysis pipelines. These pipelines can be executed on local desktop machines or through larger infrastructure (such as supercomputers, Grids or cloud environments), using the Taverna Server. In bioinformatics, Taverna workflows are typically used in the areas of high-throughput omics analyses (for example, proteomics or transcriptomics), or for evidence gathering methods involving text mining or data mining. Through Taverna, scientists have access to several thousand different tools and resources that are freely available from a large range of life science institutions. Once constructed, the workflows are reusable, executable bioinformatics protocols that can be shared, reused and repurposed. A repository of public workflows is available at http://www.myexperiment.org. This article provides an update to the Taverna tool suite, highlighting new features and developments in the workbench and the Taverna Server.
    Nucleic Acids Research 05/2013; 41(Web Server issue). DOI:10.1093/nar/gkt328 · 8.81 Impact Factor
  •
    ABSTRACT: This paper presents an extension to the W3C PROV provenance model, aimed at representing process structure. Although the modelling of process structure is out of the scope of the PROV specification, it is beneficial when capturing and analyzing the provenance of data that is produced by programs or other formally encoded processes. In the paper, we motivate the need for such an extended model in the context of an ongoing large data federation and preservation project, DataONE, where provenance traces of scientific workflow runs are captured and stored alongside the data products. We introduce new provenance relations for modelling process structure along with their usage patterns, and present sample queries that demonstrate their benefit.
    Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance; 04/2013
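    For readers unfamiliar with PROV, the sketch below shows the kind of lineage query that the standard PROV-O vocabulary already supports, using rdflib; the process-structure relations introduced in the paper are not reproduced here, since their exact names are defined in the paper itself.

        from rdflib import Graph, Namespace

        PROV = Namespace("http://www.w3.org/ns/prov#")
        EX = Namespace("http://example.org/run1/")   # invented workflow-run IRIs

        g = Graph()

        # Toy trace: an activity consumed one dataset and generated another.
        g.add((EX.align_sequences, PROV.used, EX.raw_reads))
        g.add((EX.alignment, PROV.wasGeneratedBy, EX.align_sequences))
        g.add((EX.alignment, PROV.wasDerivedFrom, EX.raw_reads))

        # "Which inputs does this data product (transitively) derive from?"
        q = """
        SELECT ?source WHERE {
            ex:alignment prov:wasDerivedFrom+ ?source .
        }
        """
        for row in g.query(q, initNs={"prov": PROV, "ex": EX}):
            print(row[0])   # -> http://example.org/run1/raw_reads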
  •
    ABSTRACT: Many scientists are using workflows to systematically design and run computational experiments. Once the workflow is executed, the scientist may want to publish the dataset generated as a result, to be, e.g., reused by other scientists as input to their experiments. In doing so, the scientist needs to curate such a dataset by specifying metadata information that describes it, e.g., its derivation history, origins and ownership. To assist the scientist in this task, we explore in this paper the use of provenance traces collected by workflow management systems when enacting workflows. Specifically, we identify the shortcomings of such raw provenance traces in supporting the data publishing task, and propose an approach whereby distilled, yet more informative, provenance traces that are fit for the data publishing task can be derived.
    Proceedings of the Joint EDBT/ICDT 2013 Workshops; 03/2013
  • Paolo Missier, Khalid Belhajjame, James Cheney
    ABSTRACT: Provenance, a form of structured metadata designed to record the origin or source of information, can be instrumental in deciding whether information is to be trusted, how it can be integrated with other diverse information sources, and how to establish attribution of information to authors throughout its history. The PROV set of specifications, produced by the World Wide Web Consortium (W3C), is designed to promote the publication of provenance information on the Web, and offers a basis for interoperability across diverse provenance management systems. The PROV provenance model is deliberately generic and domain-agnostic, but extension mechanisms are available and can be exploited for modelling specific domains. This tutorial provides an account of these specifications. Starting from intuitive and informal examples that present idiomatic provenance patterns, it progressively introduces the relational model of provenance along with the constraints model for validation of provenance documents, and concludes with example applications that show the extension points in use.
    Proceedings of EDBT 2013 (Tutorial); 01/2013
  • Pinar Alper, Khalid Belhajjame, Carole Goble, Pinar Karagoz
    ABSTRACT: Scientific workflows have become the workhorse of Big Data analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit-trail (a.k.a. provenance) meta-data for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and consequently the complexity of data-trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.
    2013 IEEE International Congress on Big Data (BigData Congress); 01/2013
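    A toy rendering of one possible reduction primitive (my own formulation, not the paper's primitives): collapse nodes annotated as data-adaptation steps by bridging their neighbours, so that only the scientifically significant steps remain. The sketch assumes networkx and invented node names.

        import networkx as nx

        # A small workflow graph; the node attribute "role" plays the part of the
        # semantic annotations mentioned in the abstract (names are invented).
        wf = nx.DiGraph()
        wf.add_node("fetch_sequences", role="retrieval")
        wf.add_node("strip_headers",   role="adaptation")
        wf.add_node("align",           role="analysis")
        wf.add_node("format_output",   role="adaptation")
        wf.add_node("build_tree",      role="analysis")
        wf.add_edges_from([("fetch_sequences", "strip_headers"),
                           ("strip_headers", "align"),
                           ("align", "format_output"),
                           ("format_output", "build_tree")])

        def eliminate(graph, role):
            """Reduction primitive: drop nodes with the given role, bridging edges."""
            g = graph.copy()
            for n in [n for n, d in g.nodes(data=True) if d.get("role") == role]:
                for p in list(g.predecessors(n)):
                    for s in list(g.successors(n)):
                        g.add_edge(p, s)
                g.remove_node(n)
            return g

        summary = eliminate(wf, "adaptation")
        print(sorted(summary.edges()))   # fetch_sequences -> align -> build_tree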
  •
    ABSTRACT: We describe a corpus of provenance traces that we have collected by executing 120 real world scientific workflows. The workflows are from two different workflow systems: Taverna [5] and Wings [3], and 12 different application domains (see Figure 1). Table 1 provides a summary of this PROV-corpus.
    Proceedings of the Joint EDBT/ICDT 2013 Workshops; 01/2013