John Chilton’s research while affiliated with Pennsylvania State University and other places


Publications (37)


The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update
  • Article

May 2024 · 253 Reads · 79 Citations

Gareth Price · [...] · Rand Zoabi

Figure 2. Automation pipeline for Bioconda packages, BioContainers, Galaxy tools, and workflows.
Figure 3. An example GitHub pull request created by the Planemo autoupdate bot, updating a workflow hosted on the IWC.
The Planemo toolkit for developing, deploying, and executing scientific data analyses in Galaxy and beyond

February 2023 · 52 Reads · 13 Citations

Genome Research

There are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For more than a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. To streamline the process of integrating tools and constructing workflows as much as possible, we have developed Planemo, a software development kit for tool and workflow developers and Galaxy power users. Here we outline Planemo's implementation and describe its broad range of functionality for designing, testing, and executing Galaxy tools, workflows, and training material. In addition, we discuss the philosophy underlying Galaxy tool and workflow development, and how Planemo encourages the use of development best practices, such as test-driven development, by its users, including those who are not professional software developers.


Figure 1. Usage of the usegalaxy servers in Australia (AU), the European Union (EU) and the United States (US). Large compute infrastructure is available to anyone, for free, without any configuration and it spans the world (more below). User acquisition, user retention, and user activity are captured. A dip in usage captured at the right-hand side of some diagrams is cyclical, due to the end of the calendar year. A significant increase in the number of monthly jobs in the EU is due to the start of analyzing SARS-CoV-2 data (more below).
Figure 2. Categorization of the type of tools executed by users across the three most popular usegalaxy servers.
Figure 3. A sample workflow report, showing tSNE and UMAP plots of single cell expression data, automatically generated and formatted based on the outputs of a workflow.
Figure 4. (A) The Galaxy-ML toolkit provides all the tools necessary to define a learner, train it, evaluate it, and visualize its performance. (B) A Galaxy workflow to create a learner using a pipeline, perform hyperparameter search and visualize the results.
The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update

April 2022 · 526 Reads · 745 Citations

Nucleic Acids Research

Galaxy is a mature, browser accessible workbench for scientific computing. It enables scientists to share, analyze and visualize their own data, with minimal technical impediments. A thriving global community continues to use, maintain and contribute to the project, with support from multiple national infrastructure providers that enable freely accessible analysis and training services. The Galaxy Training Network supports free, self-directed, virtual training with >230 integrated tutorials. Project engagement metrics have continued to grow over the last 2 years, including source code contributions, publications, software packages wrapped as tools, registered users and their daily analysis jobs, and new independent specialized servers. Key Galaxy technical developments include an improved user interface for launching large-scale analyses with many files, interactive tools for exploratory data analysis, and a complete suite of machine learning tools. Important scientific developments enabled by Galaxy include Vertebrate Genome Project (VGP) assembly workflows and global SARS-CoV-2 collaborations.


Planemo: a command-line toolkit for developing, deploying, and executing scientific data analyses

March 2022 · 74 Reads

There are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For over a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. In order to streamline the process of integrating tools and constructing workflows as much as possible, we have developed Planemo, a software development kit for tool and workflow developers and Galaxy power users. Here we outline Planemo's implementation and describe its broad range of functionality for designing, testing and executing Galaxy tools, workflows and training material. In addition, we discuss the philosophy underlying Galaxy tool and workflow development, and how Planemo encourages the use of development best practices, such as test-driven development, by its users, including those who are not professional software developers. Planemo is a mature project widely used within the Galaxy community which has been downloaded over 80,000 times.


Figure 1. Inverting the model for data sharing (Left) In the traditional model, project data (shown in purple, orange, and green) are copied to multiple sites where they are accessed by users on institutional computing clusters. Under this model, each institution must establish its own data center, and collaboration is achieved primarily through copying files between data centers. (Right) In the inverted model, users connect to a cloud-enabled resource such as the AnVIL to remotely access and analyze the data without copying. In this model, users virtually access a unified data center, allowing for deeper collaboration and sharing of the results.
Figure 2. Overview of the AnVIL ecosystem
Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space

January 2022 · 260 Reads · 123 Citations

Cell Genomics

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement while also adding security measures for active threat detection and monitoring and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consists of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.



Figure 1: Excerpt from a large microbiome bioinformatics CWL workflow [23]. This part of the workflow aims to match the workflow inputs of genomic sequences to provided sequence-models, which are dispatched to four sub-workflows (e.g., find_16S_matches); the sub-workflows are not detailed in the figure. The sub-workflow outputs are then collated to identify unique sequence hits, then provided as overall workflow outputs. Arrows define the dataflow between tasks and imply their partial ordering, depicted here as layers of tasks that may execute concurrently. Workflow steps (e.g., mask_rRNA_and_tRNA) execute command line tools, shown here with indicators for their different programming languages (e.g., [Py] for Python, [C] for the C language). (Workflow adapted from https://w3id.org/cwl/view/git/7bb76f33bf40b5cd2604001cac46f967a209c47f/workflows/rna-selector.cwl)
Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language

May 2021 · 157 Reads · 2 Citations

A widely used standard for portable multilingual data analysis pipelines would enable considerable benefits to scholarly publication reuse, research/industry collaboration, regulatory cost control, and to the environment. Published research that used multiple computer languages for its analysis pipelines would include a complete and reusable description of that analysis that is runnable on a diverse set of computing environments. Researchers would be able to collaborate more easily and reuse these pipelines, adding or exchanging components regardless of programming language used; collaborations with and within the industry would be easier; approval of new medical interventions that rely on such pipelines would be faster. Time would be saved and environmental impact would also be reduced, as these descriptions contain enough information for advanced optimization without user intervention. Workflows are widely used in data analysis pipelines, enabling innovation and decision-making for modern society. In many domains the analysis components are numerous and written in multiple different computer languages by third parties. However, without a standard for reusable and portable multilingual workflows, reusing published multilingual workflows, collaborating on open problems, and optimizing their execution would be severely hampered. Moreover, only a standard for multilingual data analysis pipelines that was widely used would enable considerable benefits to research-industry collaboration, regulatory cost control, and to preserving the environment. Prior to the start of the CWL project, there was no standard for describing multilingual analysis pipelines in a portable and reusable manner. Even today, although there exist hundreds of single-vendor and other single-source systems that run workflows, none is a general, community-driven, and consensus-built standard.


GalaxyCloudRunner: enhancing scalable computing for Galaxy

October 2020 · 6 Reads · 1 Citation

Bioinformatics

Motivation: The existence of more than 100 public Galaxy servers with service quotas is indicative of the need for an increased availability of compute resources for Galaxy to use. The GalaxyCloudRunner enables a Galaxy server to easily expand its available compute capacity by sending user jobs to cloud resources. User jobs are routed to the acquired resources based on a set of configurable rules and the resources can be dynamically acquired from any of 4 popular cloud providers (AWS, Azure, GCP, or OpenStack) in an automated fashion. Availability: GalaxyCloudRunner is implemented in Python and leverages Docker containers. The source code is MIT licensed and available at https://github.com/cloudve/galaxycloudrunner. The documentation is available at http://gcr.cloudve.org/.


Distribution of nucleotide changes across SARS-CoV-2 genome
AF, minor allele frequency; POS, position; SARS-CoV-2, severe acute respiratory syndrome coronavirus 2.
Amino acid alignment of spike glycoprotein regions HR1 (A) and HR2 (B). The site of the Lys⁹²¹Gln substitution observed by us in a SARS-CoV-2 isolate is highlighted with a black rectangle in panel A. Its corresponding salt bridge partner is highlighted with a black rectangle in panel B. SARS-CoV-2, severe acute respiratory syndrome coronavirus 2.
Location of potential recombination breakpoints along the S gene (GARD analysis)
Analysis of branch-specific positive diversifying selection (aBSREL) along the branch leading to SARS-CoV-2 (MN988688)
SARS-CoV-2, severe acute respiratory syndrome coronavirus 2.
Methods used for the analysis of primary SARS-CoV-2 data
No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics

August 2020 · 129 Reads · 29 Citations

The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.


GalaxyCloudRunner: enhancing scalable computing for Galaxy

May 2020 · 36 Reads · 1 Citation

The existence of more than 100 public Galaxy servers with service quotas is indicative of the need for an increased availability of compute resources for Galaxy to use. The GalaxyCloudRunner enables a Galaxy server to easily expand its available compute capacity by sending user jobs to cloud resources. User jobs are routed to the acquired resources based on a set of configurable rules and the resources can be dynamically acquired from any of 4 popular cloud providers (AWS, Azure, GCP, or OpenStack) in an automated fashion. Availability and implementation: GalaxyCloudRunner is implemented in Python and leverages Docker containers. The source code is MIT licensed and available at https://github.com/cloudve/galaxycloudrunner. The documentation is available at http://gcr.cloudve.org/. Contact: Enis Afgan (enis.afgan@jhu.edu). Supplementary information: None.


Citations (26)


... All the analyses were conducted on the European Galaxy server (https://usegalaxy.eu; accessed on 6 October 2023) [37]. ...

Reference:

New Insights into the Sex Chromosome Evolution of the Common Barker Frog Species Complex (Anura, Leptodactylidae) Inferred from Its Satellite DNA Content
The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update

... The connection of WorkflowHub to the LifeMonitor service 41, through the LifeMonitor GitHub app, allows workflow function and status to be reported to maintainers and users through regular automated tests driven by continuous integration (CI) based monitoring (e.g. Planemo automated workflow testing using Galaxy [50]). In these cases, WorkflowHub will also include a badge that shows if the tests are passing or failing. ...

The Planemo toolkit for developing, deploying, and executing scientific data analyses in Galaxy and beyond

Genome Research

... This literature thus presents a contrast between discussions that promote awareness, adoption, and implementation of established standards, versus authors presenting and promoting a new standardization mechanism that may not have been used or implemented anywhere except by the authors (Crusoe et al., 2022). Both can lead to standardization downstream, depending on many factors, as discussed further in the next section. ...

Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language

Communications of the ACM

... Differentially expressed genes between NAS 0-3 and NAS 4-6 (all fibrosis stage ≤ 1) of the cohort 1 datasets were identified with edgeR (v3.36.0 + galaxy 5) [47] on the Galaxy Australia Bioinformatics Platform (https://usegalaxy.org.au/) accessed on 22 May 2025 [48]. Genes with very low expression (counts per million < 1 in a minimum of 3 samples) were excluded. ...

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update

Nucleic Acids Research

... Ongoing initiatives aim to enhance the accessibility of these bioinformatic software tools while also promoting the reproducibility of genomic analyses. Galaxy, KBase, AnVIL, Anvi'o, and QIIME2 are excellent examples of web-based tools, computing environments, or software ecosystems that are actively maintained by a large community of scientists [7][8][9][10][11]. Many instances of Galaxy are freely available worldwide, providing easy access to thousands of specialized bioinformatics tools, regardless of the user's level of computer training. ...

Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space

Cell Genomics

... As a result, container technologies like Docker and Singularity are becoming increasingly used within the community as tools to quickly and reliably deploy bioinformatics software 22,23. In addition, tool or workflow definition standards and workflow engines are becoming more widely used within many pipeline and software stacks [24][25][26][27][28]. As such, we have developed an implementation of the eCLIP bioinformatics pipeline that leverages these technologies and standards to improve portability and reproducibility of our eCLIP data analysis methods. ...

Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language

... Changes will preserve across successive boots for non-volatile storage mediums such as USB sticks, ideal in deployment scenarios with infrequent or absent internet access. The annotation components will additionally be merged into the Bioconda [7] bioinformatic software distribution for the benefit of the wider bioinformatic community. ...

Practical computational reproducibility in the life sciences

... Next, we further evaluated these assemblies by looking at their 'informational' content. We detected differences between assemblies by estimating the full-length transcript 'coverage' of the different assembled transcripts, or as we prefer to call them, transfragments, when compared to the Uniprot_Sprot protein database with Blastx [14,15]. We selected Uniprot_Sprot because this is a high quality database [16][17][18][19]. ...

NCBI BLAST+ integrated into Galaxy

... Raw metabarcoding sequence data was analyzed using the eDNA Flow pipeline (Mousavi-Derazmahalleh et al. 2021), where data were demultiplexed and trimmed. Demultiplexed sequences were then processed in the Galaxy Europe platform (Batut et al. 2017) using the Mothur package (v 1.33; Schloss et al. 2009). Briefly, sequences at least 110 bp in length with no ambiguous bases and no more than 12 homopolymers were retained. ...

Community-driven data analysis training for biology

... In the past few years several toolboxes have been released in an effort to address such challenges with using Galaxy [14][15][16][17][18][19]. Yet, these toolkits are often designed to analyse only one specific dimension of transcriptome diversity, and/or not fully automated and require some prior knowledge of R command line script [20]. ...

Enhancing pre-defined workflows with ad hoc analytics using Galaxy, Docker and Jupyter