
Nicola Soranzo
Earlham Institute | TGAC · Data Infrastructure & Algorithms
PhD
63
Publications
11,698
Reads
5,401
Citations
Introduction
Nicola Soranzo currently works at the Earlham Institute as Galaxy Platform Development officer. Nicola does research in Genetics, Bioinformatics and Systems Biology.
Additional affiliations
September 2009 - December 2014
Publications (63)
There is an ongoing explosion of scientific datasets being generated, brought on by recent technological advances in many areas of the natural sciences. As a result, the life sciences have become increasingly computational in nature, and bioinformatics has taken on a central role in research studies. However, basic computational skills, data analys...
Properly and effectively managing reference datasets is an important task for many bioinformatics analyses. Refgenie is a reference asset management system that allows users to easily organize, retrieve, and share such datasets. Here, we describe the integration of refgenie into the Galaxy platform. Server administrators are able to configure Galax...
Galaxy is a mature, browser accessible workbench for scientific computing. It enables scientists to share, analyze and visualize their own data, with minimal technical impediments. A thriving global community continues to use, maintain and contribute to the project, with support from multiple national infrastructure providers that enable freely acc...
There are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For over a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. In order to streamline the process...
Background: Amplicon sequencing is an established and cost-efficient method for profiling microbiomes. However, many available tools to process this data require both bioinformatics skills and high computational power to process big datasets. Furthermore, there are only a few tools that allow for long read amplicon data analysis. To bridge this gap,...
A complete RNA-Seq analysis involves the use of several different tools, with substantial software and computational requirements. The Galaxy platform simplifies the execution of such bioinformatics analyses by embedding the needed tools in its web interface, while also providing reproducibility. Here, we describe how to perform a reference-based R...
Background
The vast ecosystem of single-cell RNA-sequencing tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more...
Properly and effectively managing reference datasets is an important task for many bioinformatics analyses. Refgenie is a reference asset management system that allows users to easily organize, retrieve, and share such datasets. Here, we describe the integration of refgenie into the Galaxy platform. Server administrators are able to configure Galaxy to m...
Background: The vast ecosystem of single-cell RNA-seq tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more toward...
Background:
It is not a trivial step to move from single-cell RNA-sequencing (scRNA-seq) data production to data analysis. There is a lack of intuitive training materials and easy-to-use analysis tools, and researchers can find it difficult to master the basics of scRNA-seq quality control and the later analysis.
Results:
We have developed a ran...
Background
It is not a trivial step to move from single-cell RNA-seq (scRNA-seq) data production to data analysis. There is a lack of intuitive training materials and easy-to-use analysis tools, and researchers can find it difficult to master the basics of scRNA-seq quality control and analysis.
Results
We have developed a range of easy-to-use scr...
Background
Phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of genes and gene families, including the identification of ancestral gene duplication events as well as regions under positive or purifying selection within lineages. Gene family and orthogroup characterisation enables the identific...
Many areas of research suffer from poor reproducibility, particularly in computationally intensive domains where results rely on a series of complex methodological decisions that are not well captured by traditional publication approaches. Various guidelines have emerged for achieving reproducibility, but implementation of these practices remains d...
The primary problem with the explosion of biomedical datasets is not the data, not computational resources, and not the required storage space, but the general lack of trained and skilled researchers to manipulate and analyze these data. Eliminating this problem requires development of comprehensive educational resources. Here we present a communit...
Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three k...
The primary problem with the explosion of biomedical datasets is not the data itself, not computational resources, and not the required storage space, but the general lack of trained and skilled researchers to manipulate and analyze these data. Eliminating this problem requires development of comprehensive educational resources. Here we present a c...
Background
Gene duplication is a major factor contributing to evolutionary novelty, and the contraction or expansion of gene families has often been associated with morphological, physiological, and environmental adaptations. The study of homologous genes helps us to understand the evolution of gene families. It plays a vital role in finding ancest...
We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight, multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000 software packages, which is continuously maintained, updated, and extended by a growing global community of more than 200 co...
The phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of gene families and plays a vital role in finding ancestral gene duplication events as well as identifying regions that are under positive selection within species. The Ensembl GeneTrees pipeline generates gene trees based on coding sequen...
Background: Bioinformaticians routinely use multiple software tools and data sources in their day-to-day work, and have been guided in their choices by a number of cataloguing initiatives. The ELIXIR Tools and Data Services Registry (bio.tools) aims to provide a central information point, independent of any specific scientific scope within bioinfor...
Background
Gene duplication is a major factor contributing to evolutionary novelty, and the contraction or expansion of gene families has often been associated with morphological, physiological and environmental adaptations. The study of homologous genes helps us to understand the evolution of gene families. It plays a vital role in finding ancestr...
High-throughput data production technologies, particularly ‘next-generation’ DNA sequencing, have ushered in widespread and disruptive changes to biomedical research. Making sense of the large datasets produced by these technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has le...
GeneSeqToFamily is an open-source Galaxy workflow based on the Ensembl GeneTrees pipeline. The workflow helps users to run their analyses without using the command-line while still providing the flexibility to tailor the analysis by changing configurations and tools if necessary. It also allows users to subsequently visualise these gene families us...
The study of homologous genes enables the tracing back of conserved functionality through evolution and finds relationships among species. There are many tools to visualise syntenic information among species, representing gene order and orientation, but they do not provide details about structural diversity within genes and between gene families. A...
Command-line utilities to assist in developing tools for the Galaxy Project. http://galaxyproject.org
Obesity is linked to type 2 diabetes (T2D) and cardiovascular diseases; however, the underlying molecular mechanisms remain unclear. We aimed to identify obesity-associated molecular features that may contribute to obesity-related diseases. Using circulating monocytes from 1,264 Multi-Ethnic Study of Atherosclerosis participants, we quantified the...
The NCBI BLAST suite has become ubiquitous in modern molecular biology and is used for small tasks such as checking capillary sequencing results of single PCR products, genome annotation or even larger scale pan-genome analyses. For early adopters of the Galaxy web-based biomedical data analysis platform, integrating BLAST into Galaxy was a natural...
Background: The NCBI BLAST suite has become ubiquitous in modern molecular biology, used for small tasks like checking capillary sequencing results of single PCR products through to genome annotation or even larger scale pan-genome analyses. For early adopters of the Galaxy web-based biomedical data analysis platform, integrating BLAST was a natura...
Background:
Transcriptomic studies hold great potential towards understanding the human aging process. Previous transcriptomic studies have identified many genes with age-associated expression levels; however, small sample sizes and mixed cell types often make these results difficult to interpret.
Results:
Using transcriptomic profiles in CD14+...
In this work we present a strategy to integrate Hadoop-based applications into the Galaxy platform along with an extensible implementation of this adapter and related utilities. The strategy is based on the idea of introducing a new Galaxy datatype that provides a layer of indirection, thus relaxing the requirement to place data on a Galaxy-accessi...
BioBlend.objects is a new component of the BioBlend package, adding an object-oriented interface for the Galaxy REST-based application programming interface. It improves support for metacomputing on Galaxy entities by providing higher-level functionality and allowing users to more easily create programs to explore, query and create Galaxy datasets...
End-to-end NGS microbiology data analysis requires a diversity of tools covering bacterial resequencing, de novo assembly, scaffolding, bacterial RNA-Seq, gene annotation and metagenomics. However, the construction of computational pipelines that use different software packages is difficult due to a lack of interoperability, reproducibility, and tr...
Accurate estimation of parameters of biochemical models is required to characterize the dynamics of molecular processes. This problem is intimately linked to identifying the most informative experiments for accomplishing such tasks. While significant progress has been made, effective experimental strategies for parameter identification and for dist...
In this chapter, the in silico systems genetics dataset, used as a benchmark in the rest of the book, is described in detail, in particular regarding its simulation by SysGenSIM. Moreover, the algorithms underlying the generation of the gene expression data and the genotype values are fully illustrated.
As the rate of samples to process increases, manually performing and tracking operations becomes increasingly difficult, costly and error-prone, while processing the massive amounts of data poses significant computational challenges. We will present how combining scientific workflow applications (Galaxy) with state-of-the-art processing technologie...
Reconstructing gene regulatory networks from high-throughput data is a long-standing challenge. Through the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we performed a comprehensive blind assessment of over 30 network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae and in silico micr...
Given a large-scale biological network represented as an influence graph, in this article we investigate possible decompositions of the network aimed at highlighting specific dynamical properties.
The first decomposition we study consists in finding a maximal directed acyclic subgraph of the network, which dynamically corresponds to searching for a...
SysGenSIM is a software package to simulate Systems Genetics (SG) experiments in model organisms, for the purpose of evaluating and comparing statistical and computational methods and their implementations for analyses of SG data [e.g. methods for expression quantitative trait loci (eQTL) mapping and network inference]. SysGenSIM allows the user to...
Reverse-engineering gene networks from expression profiles is a difficult problem for which a multitude of techniques have been developed over the last decade. The yearly organized DREAM challenges allow for a fair evaluation and unbiased comparison of these methods.
We propose an inference algorithm that combines confidence matrices, computed as t...
In this paper we propose three different graph-theoretical decompositions of large-scale biological networks, all three aiming at highlighting specific dynamical properties of the system. The first consists in finding a maximal directed acyclic subgraph in the network, which dynamically corresponds to searching for the maximal open-loop subsystem...
The authors use ideas from graph theory in order to determine how distant a given biological network is from being monotone. On the signed graph representing the system, the minimal number of sign inconsistencies (i.e. the distance to monotonicity) is shown to be equal to the minimal number of fundamental cycles having a negative sign. Suitable ope...
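To make the link between sign inconsistencies and negative fundamental cycles concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it fixes one breadth-first spanning forest of an undirected signed graph, labels the nodes so that every tree edge is sign-consistent, and counts the non-tree edges that remain inconsistent. Each such edge closes a fundamental cycle with negative sign. The function name and the edge-list format are assumptions made for illustration, and the count for a single spanning forest is only an upper bound on the distance to monotonicity, since the minimum is taken over all choices of tree.

```python
from collections import deque


def count_negative_fundamental_cycles(n_nodes, signed_edges):
    """Count negative fundamental cycles with respect to one BFS spanning forest.

    signed_edges is a list of (u, v, sign) tuples describing an undirected
    signed graph on nodes 0..n_nodes-1, with sign equal to +1 or -1.
    (Hypothetical helper, written for illustration only.)
    """
    adjacency = {node: [] for node in range(n_nodes)}
    for index, (u, v, sign) in enumerate(signed_edges):
        adjacency[u].append((v, sign, index))
        adjacency[v].append((u, sign, index))

    label = {}          # node -> +1 or -1, chosen so tree edges are consistent
    tree_edges = set()  # indices of the edges used by the spanning forest
    for root in range(n_nodes):
        if root in label:
            continue
        label[root] = 1
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for neighbour, sign, index in adjacency[node]:
                if neighbour not in label:
                    label[neighbour] = label[node] * sign
                    tree_edges.add(index)
                    queue.append(neighbour)

    # A non-tree edge is inconsistent when its sign disagrees with the labels
    # of its endpoints, i.e. when its fundamental cycle has a negative sign.
    return sum(
        1
        for index, (u, v, sign) in enumerate(signed_edges)
        if index not in tree_edges and label[u] * label[v] != sign
    )
```

For example, a triangle with exactly one negative edge yields one negative fundamental cycle, while a triangle with two negative edges yields none, because the product of the signs around the cycle is positive.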
ERNEST Reaction Network Equilibria Study Toolbox is a MATLAB package which, by checking various different criteria on the structure of a chemical reaction network, can exclude the multistationarity of the corresponding reaction system. The results obtained are independent of the rate constants of the reactions, and can be used for model discriminat...
In yeast, genome-wide periodic patterns associated with energy-metabolic oscillations have been shown recently for both short (approx. 40 min) and long (approx. 300 min) periods.
The dynamical regulation due to mRNA stability is found to be an important aspect of the genome-wide coordination of the long-period yeast metabolic cycle. It is shown tha...
The gene expression response of yeast to various types of stresses/perturbations shows a common pattern for the vast majority of genes, characterized by a quick transient peak followed by a return to the basal level (adaptation). In order to model this transient and the consequent adaptation, we use the idea of integral feedback (the integral repre...
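The transient-then-adaptation behaviour described above can be illustrated with a generic integral feedback loop. The sketch below is not the model from the paper, whose details are truncated here. It assumes a hypothetical two-variable linear system in which an expression level `x` is driven by an input and an integrator state `z` accumulates the deviation of `x` from its basal level. All parameter names and values are illustrative assumptions, and the equations are integrated with a simple forward Euler scheme.

```python
def simulate_integral_feedback(step_input=2.0, baseline_input=1.0, set_point=1.0,
                               decay_rate=1.0, integral_gain=1.0,
                               time_step=0.01, n_steps=4000):
    """Simulate the response of an integral feedback loop to a step in input.

    Assumed (illustrative) dynamics:
        dx/dt = input - decay_rate * x - z
        dz/dt = integral_gain * (x - set_point)
    The system starts at the steady state for baseline_input, then the input
    is switched to step_input at time zero.
    """
    x = set_point
    z = baseline_input - decay_rate * set_point  # steady-state integrator value
    trajectory = [x]
    for _ in range(n_steps):
        dx = step_input - decay_rate * x - z
        dz = integral_gain * (x - set_point)
        x += time_step * dx
        z += time_step * dz
        trajectory.append(x)
    return trajectory


trajectory = simulate_integral_feedback()
peak_level = max(trajectory)   # transient peak above the basal level
final_level = trajectory[-1]   # relaxes back towards set_point
```

Because the integrator only stops changing when `x` equals `set_point`, the steady state of `x` does not depend on the size of the input step, which is the adaptation property the abstract refers to.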
The concept of reverse engineering a gene network, i.e., of inferring a genome-wide graph of putative gene-gene interactions from compendia of high throughput microarray data has been extensively used in the last few years to deduce/integrate/validate various types of "physical" networks of interactions among genes or gene products.
This paper give...
In the past years devising methods for discovering gene regulatory mechanisms at a genome-wide level has become a fundamental topic in the field of systems biology. The aim is to infer gene-gene interactions in an increasingly sophisticated and reliable way through the continuous improvement of reverse engineering algorithms exploiting microarray d...
The concept of reverse engineering a gene network, i.e., of inferring a genome-wide graph of putative gene-gene interactions from high throughput microarray data has been used extensively in the last years to deduce/integrate/validate various types of "physical" networks of interactions among genes or gene products. This paper investigates which...
In this work we compare the predictive power of some of the most popular algorithms used for gene network inference, seen as an unsupervised graph learning problem. The data, generated by an artificial model of a gene regulatory network, are taken in different conditions, like at equilibrium or during a time course, and different numbers of samples...
Inferring a gene regulatory network exclusively from microarray expression profiles is a difficult but important task. The aim of this work is to compare the predictive power of some of the most popular algorithms in different conditions (like data taken at equilibrium or time courses) and on both synthetic and real microarray data. We are in parti...