
Searching the Sequence Read Archive using Jetstream and Wrangler

Kyle Levi
San Diego State University - Biological and Medical
Informatics Program
San Diego, California, United States of America
klevi@sdsu.edu
Mats Rynge
USC Information Sciences Institute
Marina del Rey, California, United States of America
rynge@isi.edu
Eroma Abeysinghe
Indiana University - Science Gateways Research Center
Bloomington, Indiana, United States of America
eabeysin@iu.edu
Robert A. Edwards
San Diego State University - Department of Computer
Science
San Diego, California, United States of America
redwards@sdsu.edu
ABSTRACT
The Sequence Read Archive (SRA), the world’s largest database of sequences, hosts approximately 10 petabases (10^16 bp) of sequence data and is growing at the alarming rate of 10 TB per day. Yet this rich trove of data is inaccessible to most researchers: searching through the SRA requires large storage and computing facilities that are beyond the capacity of most laboratories. Enabling scientists to analyze existing sequence data will provide insight into ecology, medicine, and industrial applications. In this project we specifically focus on metagenomic sequences (whole community data sets from different environments). We are developing a set of tools to enable biologists to mine the metagenomes in the SRA using the NSF-funded cloud computing resources, Jetstream and Wrangler. We have developed a proof-of-principle pipeline to demonstrate the feasibility of the approach. We are leveraging our existing infrastructure to enable all scientists to access the SRA metagenomes regardless of their computational ability and are working to create a stable pipeline with a science gateway portal that is accessible to all researchers.
CCS CONCEPTS
Applied computing → Bioinformatics; Molecular sequence analysis; Computational genomics.
KEYWORDS
Sequence Read Archive, SRA, Metagenomics, Jetstream, Wrangler,
Bacteriophage, Apache Airavata, SciGaP, Credential Store, Search
SRA, SRA Gateway, Metagenomics Discovery Challenge
ACM Reference Format:
Kyle Levi, Mats Rynge, Eroma Abeysinghe, and Robert A. Edwards. 2018. Searching the Sequence Read Archive using Jetstream and Wrangler. In PEARC ’18: Practice and Experience in Advanced Research Computing, July 22–26, 2018, Pittsburgh, PA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3219104.3229278
PEARC ’18, July 22–26, 2018, Pittsburgh, PA, USA
©2018 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6446-1/18/07.
1 INTRODUCTION
A rapid drop in the cost of sequencing DNA, from roughly $10,000 per Mbp in October 2001 to less than $0.01 today, has fueled the rapid growth of the SRA from 10^10 bases in March 2007 to 10^16 today, with no sign of slowing [4]. This mass of data is the result of an international collaboration between the DNA Data Bank of Japan, the European Bioinformatics Institute, and the National Center for Biotechnology Information [5]. Although this collaboration has created the world’s largest database of raw sequencing data, efforts to analyze the data have lagged far behind its growth, leaving a trove of unanalyzed biological data and the opportunity for big data experiments that would be prohibitively expensive even with lower sequencing costs.
Data in the SRA is organized into studies, each of which contains one or more samples. Each sample has one or more experiments, and each experiment has one or more runs. At the time of writing there were 2,445,782 experiments and 2,776,555 runs in the SRA (approximately 80% of experiments have only a single run). The genomics community established this database to enable sharing of the data, but the computational barrier to searching it leaves the data separated from the people most qualified to analyze it.
Though there are many types of genomic data within the SRA, this project focuses on metagenomic datasets because these contain many different organisms and can address a wider range of questions than single-organism runs. There are two popular metagenomics approaches. The first is amplicon sequencing, in which a single piece of DNA (e.g. the 16S gene from bacteria or the COX1 gene from eukaryotes) is amplified from the mixed DNA of many organisms and sequenced en masse. These studies provide a taxonomic profile of an environment and are less computationally demanding, but they only provide information about which organisms are present and are incapable of detecting any viruses that may be in the sample. The second type of study, whole genome shotgun (WGS) metagenomics, sequences random samples of the genomes in the environment, usually without amplification, and results in a mixture of all the DNA in the sample [10]. The analysis of those samples is computationally intensive but provides a detailed view of the organisms in the environment and the biochemistry being performed by those organisms [10]. These studies give a holistic understanding of bacterial and viral communities that was never before possible.
This work is licensed under a Creative Commons Attribution International 4.0 License.
Over the last decade, metagenomics sequencing has focused on understanding the role of microbes in the environment and reconstructing genomes out of environmental sequences. With the growth of the SRA, we can begin to approach metagenomics in a new way. Instead of asking what genomes or metabolisms are found in a particular environment, we can ask what environments contain a gene, protein, genome or metabolism of interest using the abundance of random sequences from diverse environments in the SRA. This massive volume of data can also be used to identify genes that are conserved across environments, or environments that are hotspots of microbial or gene diversification.
However, this kind of computational approach to microbial ecology requires large compute and storage capabilities, which are beyond the reach of most biologists. The WGS projects in the SRA are accumulating at roughly 3,000 runs per month (averaged from June 2016 to June 2017), and the combined WGS data sets exceed 100 TB of data [4].
2 EXPERIMENTAL AND COMPUTATIONAL DETAILS
2.1 Searching the SRA Examples
2.1.1 Investigating a newly discovered bacteriophage. In 2014 Edwards and colleagues published the description of a 100 kb bacteriophage, a virus that infects Bacteroides, named crAssphage [9]. This phage had previously been found in approximately half of the human intestinal metagenomics samples tested (n = 59). To expand this search, a heuristic approach was developed that screens 100,000 reads from each run, and it was used to search all WGS metagenome data sets in the SRA for crAssphage. The full details of the approach can be found at https://github.com/linsalrob/SearchSRA.
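As a rough illustration of this screening step, the sketch below builds the two command lines such a screen might use: `bowtie2-build` to index the query genome, and `bowtie2` with `--upto` to align only a fixed number of reads from a run. The file names, the thread count, and the choice of taking the first 100,000 reads rather than a random sample are assumptions made for this example; the authors' actual commands are in the repository linked above.

```python
from typing import List

READS_TO_SCREEN = 100_000  # reads screened per run, as stated in the text


def index_command(reference_fasta: str, index_base: str) -> List[str]:
    """Command that builds a Bowtie2 index for the query genome."""
    return ["bowtie2-build", reference_fasta, index_base]


def screen_command(index_base: str, fastq: str, sam_out: str,
                   threads: int = 1) -> List[str]:
    """Command that aligns only the first READS_TO_SCREEN reads of a run."""
    return [
        "bowtie2",
        "-x", index_base,                # index produced by bowtie2-build
        "-U", fastq,                     # unpaired reads (hypothetical file name)
        "--upto", str(READS_TO_SCREEN),  # stop after this many reads
        "--no-unal",                     # omit reads that did not align
        "-p", str(threads),              # number of alignment threads
        "-S", sam_out,                   # SAM output file
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(index_command("query_genome.fasta", "query_genome")))
    print(" ".join(screen_command("query_genome", "example_run.fastq",
                                  "example_run.sam")))
```

The commands are returned as argument lists so that they could be passed to `subprocess.run` without shell quoting; whether the real pipeline invokes Bowtie2 this way is not stated in the text.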
The virus was found to be present in 10,260 runs, as shown in Fig. 1. This figure demonstrates that some regions of the crAssphage genome are highly conserved (darker blue) while other regions are less well conserved (lighter blue). In particular, there are two genes that appear to be missing in most crAssphage genomes (the two lighter bands at 30 kb). The presence or absence of these genes suggests a fundamental process in the evolution of this phage, which would not have been identified without the ability to investigate across many unique SRA datasets.
Figure 1: Coverage of the crAssphage genome (x axis, position in bp) in 10,260 metagenomes (y axis). The coverage is based on log10(counts) as shown in the scale bar at right.
2.1.2 Methane Cycle Proteins. In a similar scan, the SRA was searched for two enzymes that are critical elements of the biological methane cycle: particulate methane monooxygenase (PMO) and methyl-coenzyme M reductase (MCR) [7]. RAPSearch2 [20] was used to compare 221 PMO and MCR protein sequences to the nucleotide sequences in the SRA. Using ten Jetstream [17][19] compute instances, all SRA metagenomes were searched for the protein sequences in two days. As expected, many of the metagenomes did not have any similar sequences, but 9,149 metagenomes had at least one similarity with an expected value of 10^-5 or lower. Mapping those sequences against the PMO and MCR sequences identified variants of those enzymes. As evidence of the validity of this computational approach, the two runs with the highest number of hits were from samples where the investigators had specifically hunted for MCR sequences by PCR (SRA runs SRR398144 and SRR2046417).
2.2 XSEDE Resources
Computational resources used for the searches come from two XSEDE resources: Jetstream [17] and Wrangler [11]. Both are co-located at Indiana University (IU) and the Texas Advanced Computing Center (TACC), but they are two very different types of resources. The former is an OpenStack-based compute cloud, while the latter is a data analysis system with a large flash-based storage system. The resources were chosen because Jetstream can provide the elasticity required for a portal with a varying workload (VMs can be added automatically to meet active user requests), and Wrangler provides the I/O required to search the large amount of SRA data.
2.2.1 Cloud Autoscaling with HTCondor. Inside the Jetstream cluster, user search requests are handled by HTCondor [18], a high throughput computing system well suited to workloads of this type: large numbers of applications that each use one or a few threads. When a user submits a new search, the list of runs to be searched is checked against the SRA runs available on Wrangler. Missing runs are logged, while available SRA IDs are grouped together into HTCondor jobs. The jobs are added to a DAGMan workflow [8] (Fig. 2), with a top-level job indexing the reference genome and a final local job packaging the results for delivery to the user. The DAGMan workflow is submitted to the HTCondor queue on the service virtual machine [1][2].
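The workflow shape described above (one indexing job, a fan-out of search jobs, one packaging job) can be sketched as follows. This is a minimal illustration, not the authors' code: the batch size, the job names, and the submit file names (`index.sub`, `search_N.sub`, `package.sub`) are hypothetical. Only the `JOB` and `PARENT ... CHILD` lines follow standard DAGMan input file syntax.

```python
from typing import Iterable, List, Set, Tuple


def split_runs(requested: Iterable[str],
               available: Set[str]) -> Tuple[List[str], List[str]]:
    """Separate requested run IDs into those staged on disk and those missing."""
    found: List[str] = []
    missing: List[str] = []
    for run_id in requested:
        (found if run_id in available else missing).append(run_id)
    return found, missing


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Group run IDs into fixed-size batches, one batch per search job."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_dag(n_search_jobs: int) -> str:
    """Return DAGMan text for: index -> search_0..search_(N-1) -> package."""
    search_names = [f"search_{i}" for i in range(n_search_jobs)]
    lines = ["JOB index index.sub"]
    lines += [f"JOB {name} {name}.sub" for name in search_names]
    lines.append("JOB package package.sub")
    if search_names:
        joined = " ".join(search_names)
        lines.append(f"PARENT index CHILD {joined}")
        lines.append(f"PARENT {joined} CHILD package")
    else:
        # Nothing to search: packaging follows indexing directly.
        lines.append("PARENT index CHILD package")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder run IDs; a batch size of 2 is an arbitrary choice.
    found, missing = split_runs(["RUN_A", "RUN_B", "RUN_C"], {"RUN_A", "RUN_C"})
    print("missing:", missing)
    print(build_dag(len(chunk(found, 2))))
```

Running the final job in the local universe, as the text describes, would be configured in its submit file rather than in the DAG file itself.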
Figure 2: Structure of the HTCondor DAGMan graph.
To serve submitted searches, Jetstream virtual machines are auto-scaled based on demand. The auto-scaler is implemented using OpenStack’s shade library, a simple high-level Python module for interacting with OpenStack-based clouds [3]. Every 5 minutes, a cron job checks for pending jobs in the HTCondor queue. A decision to scale up is based on 3 metrics: the number of existing virtual machines running jobs, the total number of pending jobs, and the number of jobs that have been in the queue for more than 15 minutes.
To scale up quickly for new searches, the auto-scaler is slightly more aggressive when starting from an empty cluster: it considers only the total number of pending jobs. Once there are 3 or more virtual machines running, the auto-scaler switches to considering only the "old" jobs. The reason is to avoid oversubscribing: if jobs go into the queue and are served quickly, there is likely no need for additional resources. Currently, the total number of virtual machines is limited to 20, but the auto-scaler is under constant development, and the scaling decision logic and limits will likely change as more users are added to the system.
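The scale-up rule described above can be written down as a small decision function. The sketch below encodes only what the text states (the 3-VM switch point, the 15-minute "old" job threshold, and the 20-VM cap). The number of jobs one VM is expected to absorb (`JOBS_PER_VM`) and the exact rounding are assumptions, since the text does not say how many VMs are started per cycle.

```python
from dataclasses import dataclass

MAX_VMS = 20           # current cap stated in the text
WARM_CLUSTER_VMS = 3   # at or above this size, only "old" jobs are considered
JOBS_PER_VM = 4        # assumption: jobs one new VM is expected to absorb


@dataclass
class QueueState:
    running_vms: int       # VMs currently running jobs
    pending_jobs: int      # all idle jobs in the HTCondor queue
    old_pending_jobs: int  # idle jobs that have waited more than 15 minutes


def vms_to_start(state: QueueState) -> int:
    """Return how many new VMs to request in one 5-minute cycle."""
    if state.running_vms < WARM_CLUSTER_VMS:
        # Cold or nearly empty cluster: react to every pending job.
        backlog = state.pending_jobs
    else:
        # Warm cluster: react only to jobs that have been waiting a while.
        backlog = state.old_pending_jobs
    wanted = -(-backlog // JOBS_PER_VM)  # ceiling division
    headroom = max(MAX_VMS - state.running_vms, 0)
    return min(wanted, headroom)
```

For example, with these assumed constants an empty cluster with 10 pending jobs would request 3 VMs, while a cluster already running 3 VMs with the same 10 jobs, none of them older than 15 minutes, would request none.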
2.2.2 Wrangler Integration. Metagenomes that are classified as WGS metagenomes (regardless of the environment from which they come) are mirrored to a local system and staged for comparisons. Each month, the incremental new data is downloaded to Wrangler and used to provide direct access to the data for Jetstream users.
In parallel with the mirroring, the 100,000 reads used in the pre-screening comparison are extracted and staged, integrating the new datasets into the existing pipeline. The entire data set will be saved for subsequent comparisons as required. Over the course of this project this process will be automated so that all data is updated monthly. This automatic pipeline will be released in the Common Workflow Language, so that others may automatically mirror components of the SRA.
The Wrangler directory is mounted directly by the Jetstream virtual machines using a dedicated OpenStack network and NFS. Each virtual machine has two network interfaces: one for general communication between the virtual machines and the internet, and one for the special address space and route required for communication with Wrangler. The latter was configured by the Jetstream and Wrangler administrators. When a virtual machine boots and the two networks are attached, the default route and hostname must come from the main network connection, and the default route provided by the Wrangler network must be ignored. This was accomplished with custom "enter" and "exit" scripts for the DHCP client on the virtual machines. For example, in the "enter" script, the default route is ignored based on the DHCP-provided IP address. Both the auto-scaler and the search workflow code are available on GitHub [1][2].
Figure 3: Searching SRA Gateway, Apache Airavata Middleware & Computational Resources.
2.3 Science Gateway and Jetstream
2.3.1 Searching SRA Gateway with Apache Airavata. The Searching SRA science gateway uses the Science Gateway Platform as a Service (SciGaP) [16] (https://scigap.org/).
The SciGaP platform provides gateway services via Apache Airavata [13] middleware. The Searching SRA gateway relies on the hosted Apache Airavata middleware for its core features: user identity, accounts, authorization, and the ability to access the XSEDE cloud computational resources Jetstream and Wrangler for computations and user data management. Fig. 3 depicts the components of Apache Airavata and its functional interactions with the gateway instance and computing resource. The hosted Apache Airavata (SciGaP platform) used for the Searching SRA gateway is multi-tenanted and manages multiple science gateways.
2.3.2 User Accounts & Gateway Access. Newly created Searching SRA gateway user accounts require administrator approval before the user can access the Searching SRA software. When a user creates an account, it is placed in an "access pending" state; once the gateway administrator approves the account, the user becomes a "gateway-user" who can launch Searching SRA jobs. The gateway’s roles also support users with administrative privileges; restricted administrative privileges give users read-only access to administrator views [14]. All gateway users except those in the pending state can submit jobs on the XSEDE Jetstream cluster. The admin user is authorized to control the metadata for accessing Jetstream and running the Searching SRA application, to manage users, and to monitor and access all user experiment information. For authentication and authorization, the Searching SRA gateway uses Keycloak (https://www.keycloak.org/) [6], an open source identity and access management solution.
Figure 4: Interface and required fields for submitting a scan.
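The account states and roles described in Section 2.3.2 can be summarized as a small policy table. The sketch below is purely illustrative and does not use the Keycloak or Apache Airavata APIs. The labels "access pending" and "gateway-user" come from the text, while the two administrative role labels and all function names are hypothetical.

```python
from enum import Enum


class Role(Enum):
    ACCESS_PENDING = "access pending"    # state named in the text
    GATEWAY_USER = "gateway-user"        # role named in the text
    ADMIN_READ_ONLY = "admin-read-only"  # hypothetical label
    ADMIN = "admin"                      # hypothetical label


def can_submit_jobs(role: Role) -> bool:
    """Every account except a pending one may submit Jetstream jobs."""
    return role is not Role.ACCESS_PENDING


def can_view_admin_pages(role: Role) -> bool:
    """Full and restricted administrators can see administrator views."""
    return role in (Role.ADMIN, Role.ADMIN_READ_ONLY)


def can_manage_users(role: Role) -> bool:
    """Only full administrators can change users or resource metadata."""
    return role is Role.ADMIN


def approve(role: Role) -> Role:
    """Administrator approval moves a pending account to gateway-user."""
    return Role.GATEWAY_USER if role is Role.ACCESS_PENDING else role
```

In the deployed gateway these checks are enforced by the identity and middleware services, so this table is only a reading aid for the access rules.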
2.3.3 Searching Against SRA. Searching SRA gateway users and gateway administrators can create, execute, monitor, share, and manage computational experiments. Experiments are created in the gateway to submit search jobs to the Jetstream cluster. Fig. 4 depicts the interface for experiment creation. Using this interface, gateway users can upload search IDs or select a search IDs file available in Jetstream to submit search jobs to Jetstream. The interface is kept deliberately simple: users are only required to provide the required data files. The gateway decides where the computation runs as well as its properties in terms of the nodes, CPUs, and wall-time required. Once the search against the SRA is completed, users can download output data directly from Jetstream through the gateway. Users also have the option of sharing their work with other gateway users. As part of managing their experiments, users can cancel running experiments, clone existing experiments, and execute new experiments on Jetstream. Experiments can be searched using the "Experiment Browse" interfaces, and searches can be filtered by creation date, application, experiment name, and description. Experiments are grouped into Projects, which can be shared with other users in the same way as experiments [15].
2.3.4 Monitoring Job Progress. Once an experiment is created and launched in the gateway and the corresponding job is submitted to Jetstream, both the owner of the experiment (a gateway user) and any gateway administrator can monitor its status. The experiment status can be monitored in two ways. One option is for gateway users to provide their email address during experiment creation to receive messages at job start and end. The other option is to view the status in the "Experiment Summary" interface (Fig. 5). Once the experiment is launched, the Experiment Summary interface is automatically refreshed to show the real-time status of the job submitted to Jetstream. Regular users can monitor experiments that they own and experiments shared with them by other gateway users. Gateway administrators can monitor all gateway experiments using the Experiment Statistics page (Fig. 6) in the Admin Dashboard. This interface allows the gateway administrator to view the status of all experiments and job submissions.
Figure 5: Experiment Summary fields shown to users.
2.3.5 Gateway Administration. The Admin Dashboard is the workspace for gateway administrators. All the administrator features mentioned earlier are available through the Admin Dashboard. In addition, the dashboard provides a notification feature, extensive user interfaces for managing the gateway configurations required for connectivity to compute and storage resources, and tools for managing credentials through the Credential Store [12] for secure communication with compute resources. Gateway notifications carry messages related to gateway operations, application availability, and news related to Jetstream and Wrangler. In the Searching SRA gateway, administrators need to configure the information required to connect to Jetstream to submit search jobs. These configurations include the Jetstream login name, scratch location, preferred job submission protocol, and allocation project number. Similarly, the Credential Store is used to generate an SSH credential token and key pair for communication with compute and storage resources.
Figure 6: Dashboard for Admin Users.
Figure 7: Average time spent on each task for 87,702 SRA
metagenomes sampled for 10,000, 100,000, and 1,000,000
reads.
3 RESULTS AND DISCUSSION
3.1 Timing Results
3.1.1 Extracting vs scanning metagenomes. Without the high capacity storage provided by Wrangler, extracting SRA datasets becomes too time consuming to sample thousands or even hundreds of data sets. When runs are downloaded from the SRA, they are initially in a sparse file format (.sra) and must be extracted to the more common FASTQ format before they can be searched. Even ignoring the time required to download the data sets (which can exceed 100 TB), the majority of the time spent analysing each run goes to extracting the data sets into the FASTQ format. Fig. 7 compares the average time spent per run on 3 different tasks: extracting the FASTQ file, searching for crAssphage with Bowtie2, and searching for P2 phage with Bowtie2. Scans for two different phages were included to compare the effect of genome size on Bowtie2 search time. Despite crAssphage having a genome almost three times as large as P2 (97 kb vs 33 kb), the search times differed by less than 2%.
File extraction is by a large margin the most expensive task, consuming 95%, 92%, and 87% of the total time per run for 10,000, 100,000, and 1,000,000 reads respectively, when searching for two organisms. In a storage-limited approach, these data sets are downloaded, extracted, scanned, and deleted in a process that can take days to weeks and offers no reusability. By storing the extracted data sets on Wrangler, search times are reduced to a fraction of that total, and further time is saved because the extracted data are shared between researchers. All datasets have been pre-processed to eliminate the need for on-demand extraction of the sequences.
Figure 8: Typical CPU usage for worker VM during a search.
Figure 9: Typical read rate for worker VM during a search.
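To put the extraction shares from Section 3.1.1 (95%, 92%, and 87%) in perspective, the short calculation below estimates the per-run speedup from skipping extraction alone. It assumes that the remaining search time is unchanged and, like Fig. 7, ignores download time, so it is a lower bound on the benefit of pre-staged data rather than a measured result. The function name is hypothetical.

```python
def speedup_without_extraction(extraction_share: float) -> float:
    """Speedup per run if the extraction step is removed entirely.

    extraction_share is the fraction of per-run time spent on extraction,
    so the time that remains is (1 - extraction_share).
    """
    if not 0.0 <= extraction_share < 1.0:
        raise ValueError("extraction_share must be in [0, 1)")
    return 1.0 / (1.0 - extraction_share)


# Shares reported in the text for each sampling depth.
for reads, share in [(10_000, 0.95), (100_000, 0.92), (1_000_000, 0.87)]:
    factor = speedup_without_extraction(share)
    print(f"{reads:>9,} reads: about {factor:.1f}x faster per run")
```

Under these assumptions the gain is largest for the shallowest sampling depth, because extraction accounts for a greater share of the total time when fewer reads are searched.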
3.1.2 SRA Gateway Searches. Fig. 8 and Fig. 9 show typical CPU metrics and network I/O for a worker VM during a search. In this instance, 86.5% of the CPU time is spent in user space (running Bowtie2) and 7.6% in system space (mostly waiting for data from the Wrangler filesystem).
The network graph shows a consistent read rate, with an average of 27.7 megabytes per second. These per-VM numbers stay consistent as the auto-scaler adds and removes virtual machines. We have not yet identified a bottleneck in the number of virtual machines we can scale to; the scaling is currently constrained mostly by the allocation and the default per-user VM limits in Jetstream.
3.2 Heuristic Error
SRA datasets can range in size from megabytes to tens of gigabytes, and it is common for a file to contain few or no reads that map to the organism(s) being searched for. For this reason, 100,000 reads are sampled from each dataset to determine whether a sufficient quantity of an organism is present. This sampling introduces a source of false negatives (not finding an organism in a sample where it is present). The probability of a false negative increases with the total number of reads in the data set and decreases as the percentage of reads belonging to the organism grows. This means that the data sets containing the smallest proportion of the search organism are the most likely to be false negatives. In this setting, false positives are impossible given the stringent matching requirements of Bowtie 2.
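A simple model makes the false negative risk concrete. The sketch below assumes that the sampled reads are independent random draws and that a single mapped read counts as a detection. Neither assumption is stated in the text, which refers only to "a sufficient quantity", so the numbers are illustrative. Under this model the chance of missing an organism that makes up a fraction p of the reads is (1 - p)^n for n sampled reads.

```python
import math


def false_negative_probability(organism_fraction: float,
                               sampled_reads: int = 100_000) -> float:
    """P(no sampled read comes from the organism), assuming independent draws."""
    if not 0.0 <= organism_fraction <= 1.0:
        raise ValueError("organism_fraction must be in [0, 1]")
    if organism_fraction == 1.0:
        return 0.0
    # (1 - p) ** n, computed in log space to stay accurate for tiny p.
    return math.exp(sampled_reads * math.log1p(-organism_fraction))


if __name__ == "__main__":
    for fraction in (1e-3, 1e-4, 1e-5, 1e-6):
        p_miss = false_negative_probability(fraction)
        print(f"organism fraction {fraction:.0e}: P(miss) = {p_miss:.3g}")
```

For a fixed number of organism reads, a larger data set means a smaller fraction p, which is how total data set size enters the false negative rate in this model.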
3.3 Training Future Bioinformaticians
Bioinformatics is a rapidly expanding field, and there is a strong need to train the next generation of researchers and industry professionals, but most classes lack the resources to introduce students to modern bioinformatic techniques. In keeping with the increasing demand, San Diego State University recently introduced a course aimed at teaching bioinformatics techniques to undergraduate biology students using the freely available data from the SRA and a generous allocation on Jetstream. The first iteration of the 16-week course included 7 weeks of training with Unix and programs related to searching the SRA, 3 weeks of training on available NCBI resources, and 6 weeks to conduct an experiment related to antibiotic resistance or viruses. Future versions of this course aim to reduce the amount of Unix and Bash training that biology students must undergo and to increase the time spent on data analysis and interpretation, a goal that will benefit immensely from the SRA Gateway.
3.4 Conclusion
The SRA Gateway, while still a work in progress, has already begun demonstrating its usefulness to students and researchers alike by solving two of the largest challenges when working with the SRA. First, the web interface removes the need for users to be experienced with Unix systems and commands before accessing SRA data, and it gives researchers from all backgrounds access to efficient parallel computing infrastructure to conduct experiments without requiring explicit knowledge of the infrastructure itself. This interface also extends to students the opportunity to explore bioinformatics by conducting novel experiments with real world data. Second, by having the SRA datasets downloaded, extracted, preprocessed, and hosted on Wrangler, search times are reduced by over 99%. This advantage saves researchers' time as well as computing resources, through efficient job scheduling with HTCondor and Jetstream instance autoscaling based on user demand. The final challenge when working with the SRA is data analysis. Presently, results from SRA searches are returned to users as a downloadable compressed folder of Binary Alignment Map (BAM) files; however, there are plans to introduce general data analysis to the pipeline using Python. The SRA is constantly accumulating new data, and this vastly underutilized resource holds information relevant to many diverse areas of biology and medicine. It is the goal of the SRA Gateway to begin analyzing this plethora of data by removing the computational barriers between researchers and the information contained in the SRA.
ACKNOWLEDGMENTS
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562.
Development of the Apache Airavata middleware used to build the science gateway is supported by NSF award #1339774.
XSEDE resources used include Jetstream and Wrangler at Indiana University through allocation TG-MCB170036. We thank Eroma Abeysinghe and Mats Rynge for their assistance with gateway and system integration, which was made possible through the XSEDE Extended Collaborative Support Service (ECSS).
REFERENCES
[1]
2018. jetstream-search-sra: Search the SRA using the jetstream sra cluster
setup. https://github.com/linsalrob/jetstream-search-sra original-date: 2018-01-
05T15:56:45Z.
[2]
2018. jetstream-sra-cluster-setup: This module sets up an autoscaled Jetstream
cluster for SRA work, based on HTCondor, the OpenStack Shade library, and
SaltStack. https://github.com/linsalrob/jetstream-sra-cluster- setup original-
date: 2018-01-05T15:52:04Z.
[3]
2018. shade: Client library for OpenStack containing Infra business logic. https:
//github.com/openstack-infra/shade original-date: 2015-01-07T21:07:08Z.
[4]
2018. SRA Documentation. https://www.ncbi.nlm.nih.gov/sra/docs/sragrowth/
[5]
Jamie Alnasir and Hugh P. Shanahan. 2015. Investigation into the annotation of
protocol sequencing steps in the sequence read archive. GigaScience 4, 1 (09 May
2015), 23. https://doi.org/10.1186/s13742-015- 0064-7
[6]
Marcus A. Christie, Anuj Bhandar, Supun Nakandala, Suresh Marru, Eroma
Abeysinghe, Sudhakar Pamidighantam, and Marlon E. Pierce. 2017. Using
Keycloak for Gateway Authentication and Authorization. (2017). https:
//doi.org/10.6084/m9.gshare.5483557.v1
[7]
Ralf Conrad. 2009. The global methane cycle: recent advances in understanding
the microbial processes involved. Environmental Microbiology Reports 1, 5 (2009),
285–292. https://doi.org/10.1111/j.1758-2229.2009.00038.x
[8]
Peter Couvares, Tevk Kosar, Alain Roy, Je Weber, and Kent Wenger. 2007.
Workow Management in Condor. Springer London, London, 357–375. https:
//doi.org/10.1007/978-1- 84628-757- 2_22
[9]
Bas E. Dutilh, Noriko Cassman, Katelyn McNair, Savannah E. Sanchez, Genivaldo
G. Z. Silva, Lance Boling, Jeremy J. Barr, Daan R. Speth, Victor Seguritan, Ramy K.
Aziz, Ben Felts, Elizabeth A. Dinsdale, John L. Mokili, and Robert A. Edwards.
2014. A highly abundant bacteriophage discovered in the unknown sequences
of human faecal metagenomes. Nature Communications 5 (2014), 4498. https:
//doi.org/10.1038/ncomms5498
[10]
Robert A. Edwards and Forest Rohwer. 2005. Viral metagenomics. Nature Reviews
Microbiology 3, 6 (2005), 504–510. https://doi.org/10.1038/nrmicro1163
[11]
N. Ganey, C. Jordan, T. Minyard, and D. Stanzione. 2014. Building Wrangler:
A transformational data intensive resource for the open science community.
In 2014 IEEE International Conference on Big Data (Big Data). 20–22. https:
//doi.org/10.1109/BigData.2014.7004480
[12]
T. A. Kanewala, S. Marru, J. Basney, and M. Pierce. 2014. A Credential Store for
Multi-tenant Science Gateways. In 2014 14th IEEE/ACM International Symposium
on Cluster, Cloud and Grid Computing (2014-05). 445–454. https://doi.org/10.
1109/CCGrid.2014.95
[13]
Suresh Marru, Lahiru Gunathilake, Chathura Herath, Patanachai Tangchaisin,
Marlon Pierce, Chris Mattmann, Raminder Singh, Thilina Gunarathne, Eran
Chinthaka, Ross Gardler, Aleksander Slominski, Ate Douma, Srinath Perera, and
Sanjiva Weerawarana. 2011. Apache Airavata: A Framework for Distributed
Applications and Computational Workows. In Proceedings of the 2011 ACM
Workshop on Gateway Computing Environments (2011) (GCE ’11). ACM, 21–28.
https://doi.org/10.1145/2110486.2110490
[14]
S. Nakandala, H. Gunasinghe, S. Marru, and M. Pierce. 2016. Apache Airavata
security manager: Authentication and authorization implementations for a multi-
tenant escience framework. In 2016 IEEE 12th International Conference on e-Science
(e-Science) (2016-10). 287–292. https://doi.org/10.1109/eScience.2016.7870911
[15]
Supun Nakandala, Suresh Marru, Marlon Piece, Sudhakar Pamidighantam, Ken-
neth Yoshimoto, Terri Schwartz, Subhashini Sivagnanam, Amit Majumdar, and
Mark A. Miller. 2017. Apache Airavata Sharing Service: A Tool for Enabling User
Collaboration in Science Gateways. In Proceedings of the Practice and Experience
in Advanced Research Computing 2017 on Sustainability, Success and Impact (2017)
(PEARC17). ACM, 20:1–20:8. https://doi.org/10.1145/3093338.3093359
[16]
Marlon Pierce, Suresh Marru, Borries Demeler, Amitava Majumdar, and Mark
Miller. 2013. Science Gateway Operational Sustainability: Adopting a Platform-
as-a-Service Approach. https://doi.org/10.6084/m9.gshare.790760.v1
6
Searching the Sequence Read Archive using Jetstream and Wrangler PEARC ’18, July 22–26, 2018, Pisburgh, PA, USA
[17]
Craig A. Stewart, George Turner, Matthew Vaughn, Niall I. Ganey, Timothy M.
Cockerill, Ian Foster, David Hancock, Nirav Merchant, Edwin Skidmore, Daniel
Stanzione, James Taylor, and Steven Tuecke. 2015. Jetstream: a self-provisioned,
scalable science and engineering cloud environment. ACM Press, 1–8. https:
//doi.org/10.1145/2792745.2792774
[18]
Douglas Thain, Todd Tannenbaum,and Miron Livny. 2005. Distributed computing
in practice: the Condor experience. Concurrency and Computation: Practice and
Experience 17, 2 (2005), 323–356. https://doi.org/10.1002/cpe.938
[19]
J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood,
S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr.
2014. XSEDE: Accelerating Scientic Discovery. Computing in Science Engineering
16, 5 (Sep 2014), 62–74. https://doi.org/10.1109/MCSE.2014.80
[20]
Yongan Zhao, Haixu Tang, and Yuzhen Ye. 2012. RAPSearch2: a fast and
memory-efficient protein similarity search tool for next-generation sequencing data.
Bioinformatics (Oxford, England) 28, 1 (2012), 125–126.
https://doi.org/10.1093/bioinformatics/btr595