Paolo Di Tommaso

  • Developer at Centre for Genomic Regulation

About

48
Publications
16,440
Reads
7,478
Citations
Current institution
Centre for Genomic Regulation
Current position
  • Developer
Additional affiliations
January 2012 - present
January 2010 - December 2013
Pompeu Fabra University

Publications (48)
Preprint
Full-text available
The term scientific workflow has evolved over the last two decades to encompass a broad range of compositions of interdependent compute tasks and data movements. It has also become an umbrella term for processing in modern scientific applications. Today, many scientific applications can be considered as workflows made of multiple dependent steps, a...
Preprint
Full-text available
The computational complexity of many key bioinformatics problems has resulted in numerous alternative heuristic solutions, where no single approach consistently outperforms all others. This creates difficulties for users trying to identify the most suitable tool for their dataset and for developers managing and evaluating alternative methods. As da...
Preprint
Full-text available
Standardised analysis pipelines are an important part of FAIR bioinformatics research. Over the last decade, there has been a notable shift from point-and-click pipeline solutions such as Galaxy towards command-line solutions such as Nextflow and Snakemake. We report on recent developments in the nf-core and Nextflow frameworks that have led to wid...
Article
Full-text available
Containers are gaining popularity in life science research as they provide a solution for encompassing dependencies of provisioned tools, simplify software installations for end users and offer a form of isolation between processes. Scientific workflows are ideal for chaining containers into data analysis pipelines to aid in creating reproducible a...
Chapter
Many fields of biology rely on the inference of accurate multiple sequence alignments (MSA) of biological sequences. Unfortunately, the problem of assembling an MSA is NP-complete, thus limiting computation to approximate solutions using heuristics. The progressive algorithm is one of the most popular frameworks for the computation of MSAs...
Preprint
Containers are gaining popularity in life science research as they provide a solution for encompassing dependencies of provisioned tools, simplify software installations for end users and offer a form of isolation between processes. Scientific workflows are ideal for chaining containers into data analysis pipelines to aid in creating reproducible a...
Article
Full-text available
Multiple sequence alignments (MSAs) are used for structural¹,² and evolutionary predictions¹,², but the complexity of aligning large datasets requires the use of approximate solutions³, including the progressive algorithm⁴. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, fro...
Chapter
Full-text available
Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Due to large volumes, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can n...
Preprint
Full-text available
The standardization, portability, and reproducibility of analysis pipelines is a renowned problem within the bioinformatics community. Bioinformatic analysis pipelines are often designed for execution on-premise, and this inevitably leads to a level of customisation and integration that is only applicable to the local infrastructure. More notably,...
Article
Full-text available
Motivation: Most evolutionary analyses are based on pre-estimated multiple sequence alignment. Wong et al. established the existence of an uncertainty induced by multiple sequence alignment when reconstructing phylogenies. They were able to show that in many cases different aligners produce different phylogenies, with no simple objective criterion...
Article
Full-text available
Motivation Most evolutionary analyses are based on pre-estimated multiple sequence alignment. Wong et al. established the existence of an uncertainty induced by multiple sequence alignment when reconstructing phylogenies. They were able to show that in many cases different aligners produce different phylogenies, with no simple objective criterion s...
Preprint
Full-text available
Inferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem imposes the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the mo...
Preprint
Full-text available
Containers are gaining popularity in life science research as they encompass all dependencies of provisioned tools and simplify software installations for end users, as well as offering a form of isolation between processes. Scientific workflows are ideal to chain containers into data analysis pipelines to sustain reproducible science. In this ma...
Preprint
Containers are gaining popularity in life science research as they encompass all dependencies of provisioned tools and simplify software installations for end users, as well as offering a form of isolation between processes. Scientific workflows are ideal to chain containers into data analysis pipelines to sustain reproducible science. In this ma...
Article
Phylogenetic reconstructions are essential in genomics data analyses and depend on accurate multiple sequence alignment (MSA) models. We show that all currently available large-scale progressive multiple alignment methods are numerically unstable when dealing with amino-acid sequences. They produce significantly different output when changing seque...
Preprint
Full-text available
Reproducibility has become one of biology’s most pressing issues. This impasse has been fuelled by the combined reliance on increasingly complex data analysis methods and the exponential growth of biological datasets. Nextflow is a pipeline orchestration tool that has been designed to ease deployment and guarantee reproducibility across platforms....
Preprint
Full-text available
Reproducibility has become one of biology’s most pressing issues. This impasse has been fuelled by the combined reliance on increasingly complex data analysis methods and the exponential growth of biological datasets. Nextflow is a pipeline orchestration tool that has been designed to ease deployment and guarantee reproducibility across platforms....
Article
Reproducing routine bioinformatics analysis is challenging owing to a combination of factors that are hard to control. Nextflow is a flow management framework that uses container technology to ensure efficient deployment and reproducibility of computational analysis pipelines. Third party pipelines can be ported into Nextflow with minimum re-coding. We...
Article
Full-text available
The PSI/TM-Coffee web server performs multiple sequence alignment (MSA) of proteins by combining homology extension with a consistency based alignment approach. Homology extension is performed with Position Specific Iterative (PSI) BLAST searches against a choice of redundant and non-redundant databases. The main novelty of this server is to allow...
Code
This repository contains the Companion pipeline source code, input data and results produced for the "Reproducible in-silico omics analyses across clouds and clusters" paper. The content of this repository is available on GitHub at the following link https://github.com/cbcrg/companion/tree/nbt-docker
Code
Source code and result data of the RAxML experiment for the "Reproducible in-silico omics analyses across clouds and clusters" paper. The content of this repository is available on GitHub at the following link https://github.com/cbcrg/raxml-nf/tree/nbt-v1.0
Code
This repository contains the Kallisto-NF pipeline source code and results produced for the "Reproducible in-silico omics analyses across clouds and clusters" paper. The content of this repository is available on GitHub at the following link https://github.com/cbcrg/kallisto-nf-reproduce/tree/nbt-v1.0
Article
Full-text available
Genomic pipelines consist of several pieces of third party software and, because of their experimental nature, frequent changes and updates are commonly necessary thus raising serious deployment and reproducibility issues. Docker containers are emerging as a possible solution for many of these problems, as they allow the packaging of pipelines in a...
Article
Full-text available
Genomic pipelines consist of several pieces of third party software and, because of their experimental nature, frequent changes and updates are commonly necessary, thus raising serious distribution and reproducibility issues. Docker container technology offers an ideal solution, as it allows the packaging of pipelines in an isolated and self-contained...
Article
Full-text available
Genomic pipelines consist of several pieces of third party software and, because of their experimental nature, frequent changes and updates are commonly necessary, thus raising serious distribution and reproducibility issues. Docker container technology offers an ideal solution, as it allows the packaging of pipelines in an isolated and self-contained...
Article
Full-text available
Genomic pipelines consist of several pieces of third party software and, because of their experimental nature, frequent changes and updates are commonly necessary, thus raising serious distribution and reproducibility issues. Docker container technology offers an ideal solution, as it allows the packaging of pipelines in an isolated and self-contained...
Article
Full-text available
This article introduces the Transitive Consistency Score (TCS) web server; a service making it possible to estimate the local reliability of protein multiple sequence alignments (MSAs) using the TCS index. The evaluation can be used to identify the aligned positions most likely to contain structurally analogous residues and also most likely to supp...
Article
Full-text available
This article introduces the SARA-Coffee web server; a service allowing the online computation of 3D structure based multiple RNA sequence alignments. The server makes it possible to combine sequences with and without known 3D structures. Given a set of sequences SARA-Coffee outputs a multiple sequence alignment along with a reliability index for ev...
Article
Full-text available
Multiple sequence alignment (MSA) is a key modeling procedure when analyzing biological sequences. Homology and evolutionary modeling are the most common applications of MSAs. Both are known to be sensitive to the underlying MSA accuracy. In this work, we show how this problem can be partly overcome using the transitive consistency score (TCS), an...
Article
T-Coffee, for Tree-based consistency objective function for alignment evaluation, is a versatile multiple sequence alignment (MSA) method suitable for aligning virtually any type of biological sequences. T-Coffee provides more than a simple sequence aligner; rather it is a framework in which alternative alignment methods and/or extra information (i...
Article
Full-text available
This article introduces the T-RMSD web server (tree-based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein f...
Data
For BAliBASE 2, the authors did not publish the XML file allowing automated use of these blocks. The location of each block is available only in the HTML files, where it is indicated by uppercase characters (e.g., http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE2/ref7/test/msl_ref7.html). We have generated the XML following the original BAliBASE annotation.
Article
Full-text available
Transmembrane proteins (TMPs) constitute about 20-30% of all protein coding genes. The relative lack of experimental structure has so far made it hard to develop specific alignment methods and the current state of the art (PRALINE™) only manages to recapitulate 50% of the positions in the reference alignments available from the BAliBASE2-ref7. We s...
Data
The core region of BAliBASE 2. For BAliBASE 2, the authors did not publish the XML file allowing automated use of these blocks. The location of each block is available only in the HTML files, where it is indicated by uppercase characters (e.g., http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE2/ref7/test/msl_ref7.html). We have generated the XML following the original BA...
Data
The performance of each TMP family by individual database. "Default" means T-Coffee without homology extension. The others are PSI-Coffee searching against the corresponding databases. The construction of the databases is explained in the "Methods" section.
Article
Full-text available
T-Coffee (Tree-based consistency objective function for alignment evaluation) is a versatile multiple sequence alignment (MSA) method suitable for aligning most types of biological sequences. The main strength of T-Coffee is its ability to combine third party aligners and to integrate structural (or homology) information when building MSAs. The ser...
Article
Full-text available
AMPA is a web application for assessing the antimicrobial domains of proteins, with a focus on the design of new antimicrobial drugs. The application provides fast discovery of antimicrobial patterns in proteins that can be used to develop new peptide-based drugs against pathogens. Results are shown in a user-friendly graphical interface and can be...
Article
Full-text available
This article introduces a new interface for T-Coffee, a consistency-based multiple sequence alignment program. This interface provides easy and intuitive access to the most popular functionality of the package. These include the default T-Coffee mode for protein and nucleic acid sequences, the M-Coffee mode that allows combining the output of an...
Article
We present the first parallel implementation of the T-Coffee consistency-based multiple aligner. We benchmark it on the Amazon Elastic Cloud (EC2) and show that the parallelization procedure is reasonably effective. We also conclude that for a web server with moderate usage (10K hits/month) the cloud provides a cost-effective alternative to in-hous...
Article
Full-text available
We present the first parallel implementation of the T-Coffee consistency-based multiple aligner. We benchmark it on the Amazon Elastic Cloud (EC2) and show that the parallelization procedure is reasonably effective. We also conclude that for a web server with moderate usage (10K hits/month) the cloud provides a cost-effective alternative to in-hous...
Article
Domain-specific visual languages are often employed to specify both significant configurations and behaviours of systems of interest for the users. Moreover, behavioural diagrams can be developed for different components of a system, possibly employing different families of diagrams for each subsystem. At an abstract level, these diagrams express s...
Conference Paper
Domain specific visual languages express significant system configurations and behaviours. They are mainly used to express some form of system transformation, characterised by its pre-and post-conditions and by an execution policy. We propose an approach to management of transitions, independent from the adopted diagrammatic notation, and describe...
Article
Full-text available
Languages for specifying the behaviour of agents, and more generally of reactive systems, often use graphical notations to express the significant configurations of the agent's state and the admissible transformations of those configurations. The expression of these two aspects can be delegated to different types of diagrams, or be incorporated...
Article
Domain specific visual languages are generally used to specify significant configurations and behaviours of systems of interest for the users and diagrams of different types can be used to specify different components of a single system. At an abstract level, all these diagrams express some form of transformation of the system, which can be characteris...
