Detecting distant homologies in protozoan metabolic pathways using scientific workflows.
ABSTRACT: Bioinformatics experiments are typically composed of pipelines of programs that manipulate enormous quantities of data. An interesting approach to managing those experiments is to use workflow management systems (WfMS). In this work we discuss WfMS features that support genome homology workflows and present some relevant issues for typical genomic experiments. Our evaluation used the Kepler WfMS to manage a real genomic pipeline, named OrthoSearch, originally defined as a Perl script. We show a case study detecting distant homologies in trypanosomatid metabolic pathways. Our results reinforce the benefits of WfMS over scripting languages and point out challenges for WfMS in distributed environments.
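The core benefit the abstract claims for a WfMS over a Perl script is that pipeline steps and their dependencies become explicit, so the engine can derive execution order, restart from failures, and track provenance. A minimal sketch of that idea, using only the standard library (the task names are hypothetical and do not come from OrthoSearch):

```python
# Sketch: a homology pipeline expressed as an explicit task graph,
# the abstraction a WfMS such as Kepler provides over an ad hoc script.
# Task and function names here are illustrative, not OrthoSearch's.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
tasks = {
    "fetch_sequences": set(),
    "build_profiles": {"fetch_sequences"},
    "hmm_search": {"build_profiles"},
    "report_homologs": {"hmm_search"},
}

def run(name):
    # Placeholder for invoking the real external program at this step.
    return f"ran {name}"

# A WfMS-style engine derives a valid execution order from the graph,
# rather than relying on the textual order of a script.
order = list(TopologicalSorter(tasks).static_order())
results = [run(task) for task in order]
```

Because the dependencies are data rather than control flow, the same graph could be scheduled in parallel or resumed mid-run, which is exactly what a script hard-codes away.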
Source available from: Ewa Deelman
ABSTRACT: Scientific workflows are being developed in many domains as a useful paradigm for managing complex scientific computations. In our work, we are challenged with efficiently generating and validating workflows that contain large numbers (hundreds to thousands) of individual computations to be executed over distributed environments. This paper describes a new approach to workflow creation that uses semantic representations to compactly describe complex scientific applications in a data-independent manner, then automatically generates workflows of computations for given data sets, and finally maps them to available computing resources. The semantic representations are used to automatically generate descriptions for each of the thousands of new data products. We interleave the creation of the workflow with its execution, which allows intermediate execution data products to influence the generation of subsequent portions of the workflow. We have implemented this approach in Wings, a workflow creation system that combines semantic representations with planning techniques. We have used Wings to create workflows of thousands of computations, which are submitted to the Pegasus mapping system for execution over distributed computing environments. We show results on an earthquake simulation workflow that was automatically created with a total of 24,135 jobs and that executed for a total of 1.9 CPU years. Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada; 01/2007
ABSTRACT: Web services represent a significant advance towards data integration, especially in areas where integration is a major problem, such as Bioinformatics. However, besides platform and data interoperability, e-scientists need to monitor their in silico experiments. These researchers usually work in teams, frequently repeating experiments and adjusting their parameters to obtain more sensitive results or performance improvements for the problem they are addressing. In many cases, the record of those experiments is kept in a manual, standalone way, which gives rise to flaws or redundant effort in the experiment knowledge-acquisition process. To avoid these issues, e-scientists need to compare the parameters used with their corresponding responses. Logging facilities have long been used to record information describing the activity of Web servers and other applications; however, current logging approaches do not capture Web service responses. This paper presents a Web services logging architecture, based on chains of intermediaries, that captures comprehensive service-usage information. This architecture can be applied to improve Bioinformatics experiments by providing feedback on service behavior under parameter tuning as well as on quality of service. We explore a particular Bioinformatics scenario using a publicly available Web service based on one of the most widely used programs in genomics analysis today.
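The intermediary idea this abstract describes, recording both the parameters sent to a service and the response it returned, can be sketched as a generic wrapper around any service call. This is a hedged illustration of the pattern only; the names (`logging_intermediary`, `blast_like_search`) and the in-memory log are hypothetical, not the paper's API:

```python
# Sketch of a chain-of-intermediaries logger: every call through the
# wrapper records its parameters, response, and timing, so researchers
# can later compare parameter settings against the responses obtained.
import functools
import time

log = []  # in a real deployment this would be persistent storage

def logging_intermediary(service):
    @functools.wraps(service)
    def wrapper(**params):
        start = time.perf_counter()
        response = service(**params)
        log.append({
            "service": service.__name__,
            "params": params,
            "response": response,
            "seconds": time.perf_counter() - start,
        })
        return response
    return wrapper

@logging_intermediary
def blast_like_search(query, e_value):
    # Stands in for a real Web service endpoint (e.g. a BLAST search);
    # a stricter e-value cutoff yields fewer, more sensitive hits here.
    return {"hits": 3 if e_value < 1e-3 else 7}

blast_like_search(query="ACGT", e_value=1e-5)
```

Chaining several such intermediaries (one per concern: timing, quality of service, provenance) composes naturally with decorators, which mirrors the architecture's layered design.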
ABSTRACT: Biological knowledge is inherently complex and so cannot readily be integrated into existing databases of molecular (for example, sequence) data. An ontology is a formal way of representing knowledge in which concepts are described both by their meaning and by their relationships to each other. Unique identifiers associated with each concept in biological ontologies (bio-ontologies) can be used for linking to and querying molecular databases. This article reviews the principal bio-ontologies and the current issues in their design and development, including the ability to query across databases and the problems of constructing ontologies that describe complex knowledge, such as phenotypes. Nature Reviews Genetics 04/2004; 5(3):213-22.
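The key mechanism this abstract names, stable concept identifiers acting as join keys across otherwise unrelated databases, can be illustrated with a toy example. The two GO identifiers below are real Gene Ontology terms, but the "databases" and the `query_by_concept` helper are invented for illustration:

```python
# Sketch: ontology concept identifiers (Gene Ontology style) used as
# join keys to query across independent molecular databases.
ontology = {
    "GO:0006096": "glycolytic process",
    "GO:0005739": "mitochondrion",
}

# Two independent toy "databases" annotated with the same identifiers.
sequence_db = {"seqA": ["GO:0006096"], "seqB": ["GO:0005739"]}
phenotype_db = {"slow-growth": ["GO:0006096"]}

def query_by_concept(concept_id):
    """Collect everything linked to one ontology concept, across databases."""
    return {
        "term": ontology[concept_id],
        "sequences": [s for s, ids in sequence_db.items() if concept_id in ids],
        "phenotypes": [p for p, ids in phenotype_db.items() if concept_id in ids],
    }

result = query_by_concept("GO:0006096")
```

Because both databases annotate records with the same identifier rather than free-text term names, the cross-database query needs no string matching, which is precisely why the review emphasizes unique identifiers.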