About
374 Publications
31,603 Reads
34,855 Citations
Publications (374)
The 2024 Cyberinfrastructure for Major Facilities (CI4MFs) Workshop, organized by U.S. National Science Foundation (NSF) CI Compass, the NSF Cyberinfrastructure Center of Excellence, brought together cyberinfrastructure (CI) professionals from the NSF major and midscale research facilities along with participants from the broader CI ecosystem to di...
Background
The PubMed archive contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The...
Background
The PubMed database contains more than 34 million articles; consequently, it is becoming increasingly difficult for a biomedical researcher to keep up-to-date with different knowledge domains. Computationally efficient and interpretable tools are needed to help researchers find and understand associations between biomedical concepts. The...
Many important scientific discoveries require lengthy experimental processes of trial and error and could benefit from intelligent prioritization based on deep domain understanding. While exponential growth in the scientific literature makes it difficult to keep current in even a single domain, that same rapid growth in literature also presents an...
Quantum machine learning could possibly become a valuable alternative to classical machine learning for applications in high energy physics by offering computational speedups. In this study, we employ a support vector machine with a quantum kernel estimator (QSVM-Kernel method) to a recent LHC flagship physics analysis: $t\bar{t}H$ (Higgs boson production...
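A minimal sketch of the kernel-method setup described in this abstract, using scikit-learn's precomputed-kernel SVM; the RBF kernel below is a purely classical stand-in for the quantum kernel estimator, and the toy data are illustrative, not from the analysis:

```python
# Illustrative sketch: an SVM trained on a precomputed kernel matrix, as one
# would do when the kernel entries come from a (simulated) quantum kernel
# estimator. rbf_kernel is a classical placeholder, not the quantum circuit.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)  # toy signal/background features
X_test, y_test = rng.normal(size=(50, 10)), rng.integers(0, 2, 50)

K_train = rbf_kernel(X_train, X_train)   # kernel matrix K(x_i, x_j) on training events
K_test = rbf_kernel(X_test, X_train)     # kernel between test and training events

clf = SVC(kernel="precomputed")          # SVM that consumes the externally estimated kernel
clf.fit(K_train, y_train)
print("toy accuracy:", clf.score(K_test, y_test))
```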
During the first observation run the LIGO collaboration needed to offload some of its most intense CPU workflows from its dedicated computing sites to opportunistic resources. The Open Science Grid enabled LIGO's PyCBC, RIFT and BayesWave workflows to run seamlessly on a combination of owned and opportunistic resources. One of the challenges is e...
Quantum machine learning could possibly become a valuable alternative to classical machine learning for applications in High Energy Physics by offering computational speed-ups. In this study, we employ a support vector machine with a quantum kernel estimator (QSVM-Kernel method) to a recent LHC flagship physics analysis: $t\bar{t}H$ (Higgs boson pr...
One of the major objectives of the experimental programs at the LHC is the discovery of new physics. This requires the identification of rare signals in immense backgrounds. Using machine learning algorithms greatly enhances our ability to achieve this objective. With the progress of quantum technologies, quantum machine learning could become a pow...
Literature-based discovery (LBD) uncovers undiscovered public knowledge by linking terms A to C via a B intermediate. Existing LBD systems are limited to processing certain A, B, and C terms, and many are not maintained. We present SKiM (Serial KinderMiner), a generalized LBD system for processing any combination of A, Bs, and Cs. We evaluate SKiM via...
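A toy sketch of the A-to-C-via-B linking idea behind LBD systems such as SKiM, using document co-occurrence over a tiny hypothetical corpus (the terms echo Swanson's classic fish oil/Raynaud example; the scoring is not SKiM's actual statistics):

```python
# Toy literature-based discovery: find C terms linked to an A term through
# shared B intermediates, using document co-occurrence as the link signal.
from collections import defaultdict
from itertools import combinations

# Hypothetical mini-corpus: each "abstract" is a set of terms.
corpus = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "raynaud syndrome"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "raynaud syndrome"},
]

cooccur = defaultdict(int)
for doc in corpus:
    for t1, t2 in combinations(sorted(doc), 2):
        cooccur[(t1, t2)] += 1
        cooccur[(t2, t1)] += 1

def linked(a, b):
    return cooccur[(a, b)] > 0

a_term = "fish oil"
b_terms = ["blood viscosity", "platelet aggregation"]
c_terms = ["raynaud syndrome"]

# Report A-B-C chains where A co-occurs with B and B co-occurs with C,
# but A and C never co-occur directly (the "undiscovered" link).
for b in b_terms:
    for c in c_terms:
        if linked(a_term, b) and linked(b, c) and not linked(a_term, c):
            print(f"candidate link: {a_term} -> {b} -> {c}")
```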
Mechanisms for remote execution of computational tasks enable a distributed system to effectively utilize all available resources. This ability is essential to attaining the objectives of high availability, system reliability, and graceful degradation, and it directly contributes to flexibility, adaptability, and incremental growth. As part of a nationa...
Translational research (TR) has been extensively used in the health science domain, where results from laboratory research are translated to human studies and where evidence-based practices are adopted in real-world settings to reach broad communities. In computer science, much research stops at the result publication and dissemination stage withou...
In view of the increasing computing needs for the HL-LHC era, the LHC experiments are exploring new ways to access, integrate and use non-Grid compute resources. Accessing and making efficient use of Cloud and High Performance Computing (HPC) resources present a diversity of challenges for the CMS experiment. In particular, network limitations at t...
Since 2001 the Pegasus Workflow Management System has evolved into a robust and scalable system that automates the execution of a number of complex applications running on a variety of heterogeneous, distributed high-throughput, and high-performance computing environments. Pegasus was built on the principle of separation between the workflow descri...
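As a generic illustration of separating the workflow description (tasks and dependencies) from its execution environment, the sketch below declares an abstract DAG and derives a run order from it; this is not the Pegasus API, only the underlying idea:

```python
# Generic sketch of an abstract workflow: tasks plus dependencies, declared
# independently of any concrete execution site. A planner/engine would later
# map each task to real resources.
from graphlib import TopologicalSorter

# Hypothetical three-step pipeline: preprocess -> analyze -> summarize.
workflow = {
    "preprocess": set(),         # no prerequisites
    "analyze": {"preprocess"},   # needs preprocess output
    "summarize": {"analyze"},    # needs analyze output
}

def run(task):
    print(f"executing {task} (placeholder for a real job submission)")

# The abstract description says nothing about clusters or clouds; the run
# order below is derived purely from the declared dependencies.
for task in TopologicalSorter(workflow).static_order():
    run(task)
```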
The growth of the biological nuclear magnetic resonance (NMR) field and the development of new experimental technology have mandated the revision and enlargement of the NMR-STAR ontology used to represent experiments, spectral and derived data, and supporting metadata. We present here a brief description of the NMR-STAR ontology and software tools...
Without significant changes to data organization, management, and access (DOMA), HEP experiments will find scientific output limited by how fast data can be accessed and digested by computational resources. In this white paper we discuss challenges in DOMA that HEP experiments, such as the HL-LHC, will face as well as potential ways to address them...
The user of a computing facility must make a critical decision when submitting jobs for execution: how many resources (such as cores, memory, and disk) should be requested for each job? If the request is too small, the job may fail due to resource exhaustion; if the request is too big, the job may succeed, but resources will be wasted. This decision...
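One simple request-sizing policy consistent with the trade-off described above is sketched below, assuming historical peak-usage samples are available; the percentile and retry factor are illustrative choices, not the paper's algorithm:

```python
# Illustrative request-sizing policy: ask for a high percentile of observed
# peak memory, and grow the request if a job is killed for exceeding it.
import statistics

def initial_request(peak_samples_mb, quantile=0.95):
    """First request: a high quantile of historical peak usage."""
    qs = statistics.quantiles(peak_samples_mb, n=100)
    return qs[int(quantile * 100) - 1]

def next_request(previous_request_mb, growth=2.0):
    """On a resource-exhaustion failure, retry with a larger request."""
    return previous_request_mb * growth

history = [900, 1100, 1050, 980, 2400, 1000]  # hypothetical peak memory (MB) of past jobs
req = initial_request(history)
print("first request (MB):", round(req))
print("retry request (MB):", round(next_request(req)))
```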
GeoDeepDive combines library science, computer science, and geoscience to dive into repositories of published text, tables, and figures and return valuable information.
Advances in computation have been enabling many recent advances in biomolecular applications of NMR. Due to the wide diversity of applications of NMR, the number and variety of software packages for processing and analyzing NMR data is quite large, with labs relying on dozens, if not hundreds of software packages. Discovery, acquisition, installati...
Background
The nuclear magnetic resonance (NMR) spectroscopic data for biological macromolecules archived at the BioMagResBank (BMRB) provide a rich resource of biophysical information at atomic resolution. The NMR data archived in NMR-STAR ASCII format have been implemented in a relational database. However, it is still fairly difficult for users...
High throughput computing (HTC) has aided the scientific community in the analysis of vast amounts of data and computational jobs in distributed environments. To manage these large workloads, several systems have been developed to efficiently allocate and provide access to distributed resources. Many of these systems rely on job characteristics est...
Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underload machines, which results in wasted opportunities. This paper explo...
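A minimal sketch of such a monitoring-and-enforcement loop, with a placeholder measurement function and made-up per-job limits (a real system would read cgroup or psutil-style counters):

```python
# Illustrative monitor/enforce loop: periodically compare a job's measured
# usage against its declared limits and act when a limit is exceeded.
import time

LIMITS = {"memory_mb": 2048, "disk_mb": 10240}   # hypothetical per-job limits

def measure(job_id):
    """Placeholder for a real measurement backend."""
    return {"memory_mb": 1500, "disk_mb": 11000}

def enforce(job_id, usage):
    for resource, limit in LIMITS.items():
        if usage[resource] > limit:
            print(f"job {job_id}: {resource} {usage[resource]} > {limit}, evicting")
            return False    # signal that the job should be killed/resubmitted
    return True

def monitor(job_id, interval_s=10, rounds=3):
    for _ in range(rounds):
        if not enforce(job_id, measure(job_id)):
            break
        time.sleep(interval_s)

monitor("job-001", interval_s=0)   # interval 0 keeps the toy example fast
```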
Estimates of task runtime, disk space usage, and memory consumption are commonly used by scheduling and resource provisioning algorithms to support efficient and reliable workflow executions. Such algorithms often assume that accurate estimates are available, but such estimates are difficult to generate in practice. In this work, we first profile...
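One common way to generate such estimates, sketched below, is to fit a simple model to previously profiled runs and predict for new inputs; the linear relation with input size and the safety margin are assumptions for illustration, not results from this work:

```python
# Illustrative estimator: predict task runtime from input size using a
# least-squares fit to previously profiled executions.
import numpy as np

# Hypothetical profiling data: (input size in GB, observed runtime in s).
sizes = np.array([1.0, 2.0, 4.0, 8.0])
runtimes = np.array([120.0, 230.0, 470.0, 950.0])

slope, intercept = np.polyfit(sizes, runtimes, deg=1)

def estimate_runtime(size_gb, safety=1.2):
    """Predicted runtime with a safety margin to reduce underestimation."""
    return safety * (slope * size_gb + intercept)

print("estimated runtime for a 6 GB input:", round(estimate_runtime(6.0)), "s")
```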
Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types. Here, we develop and validate the quality of a ma...
Modern science often requires the execution of large-scale, multi-stage simulation and data analysis pipelines to enable the study of complex systems. The amount of computation and data involved in these pipelines requires scalable workflow management systems that are able to reliably and efficiently coordinate and automate data movement and task e...
Many aspects of macroevolutionary theory and our knowledge of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Although major features in the macroevolutionary history of life, notably long-term patterns of biodiversity, are similar across compilations, critical assessments of synthe...
Estimates of task characteristics such as runtime, disk space, and memory consumption are commonly used by scheduling algorithms and resource provisioning techniques to provide successful and efficient workflow executions. These methods assume that accurate estimates are available, but in production systems it is hard to compute such estimates wi...
Automated image acquisition, a custom analysis algorithm, and a distributed computing resource were used to add time as a third dimension to a quantitative trait locus (QTL) map for plant root gravitropism, a model growth response to an environmental cue. Digital images of Arabidopsis thaliana seedling roots from two independently reared sets of 16...
We introduce Harmony, a system for extracting the multiprocessor scheduling policies from commodity operating systems. Harmony can be used to unearth many aspects of multiprocessor scheduling policy, including the nuanced behaviors of core scheduling mechanisms and policies. We demonstrate the effectiveness of Harmony by applying it to the analysis...
As it enters adolescence the Open Science Grid (OSG) is bringing a maturing fabric of Distributed High Throughput Computing (DHTC) services that supports an expanding HEP community to an increasingly diverse spectrum of domain scientists. Working closely with researchers on campuses throughout the US and in collaboration with national cyberinfrastr...
Cloud computing is steadily gaining traction both in commercial and research worlds, and there seems to be significant potential for the HEP community as well. However, most of the tools used in the HEP community are tailored to the current computing model, which is based on grid computing. One such tool is glideinWMS, a pilot-based workload managem...
PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical for PanDA operation. The ATLAS Computing Facility (...
Condor is being used extensively in the HEP environment. It is the batch system of choice for many compute farms, including several WLCG Tier 1s, Tier 2s and Tier 3s. It is also the building block of one of the Grid pilot infrastructures, namely glideinWMS. As with any software, Condor does not scale indefinitely with the number of users and/or the...
The mission of the National Science Foundation (NSF) Advisory Committee on Cyberinfrastructure (ACCI) is to advise the NSF as a whole on matters related to vision and strategy regarding cyberinfrastructure (CI) [1]. In early 2009 the ACCI charged six task forces with making recommendations to the NSF in strategic areas of cyberinfrastructure: Campu...
The grid community has been migrating towards service-oriented architectures as a means of exposing and interacting with computational resources across organizational boundaries. The adoption of Web Service standards provides us with an increased level of manageability, extensibility and interoperability between loosely coupled services that is cruci...
Variation in genome structure is an important source of human genetic polymorphism: It affects a large proportion of the genome and has a variety of phenotypic consequences relevant to health and disease. In spite of this, human genome structure variation is incompletely characterized due to a lack of approaches for discovering a broad range of str...
This article describes the Open Science Grid, a large distributed computational infrastructure in the United States which supports many different high-throughput scientific applications, and partners (federates) with other infrastructures nationally and internationally to form multi-domain integrated distributed systems for science. The Open Scienc...
A number of recent enhancements to the Condor batch system have been stimulated by the challenges of LHC computing. The result is a more robust, scalable, and flexible computing platform. One product of this effort is the Condor Job Router, which serves as a high-throughput scheduler for feeding multiple (e.g. grid) queues from a single input job q...
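A rough illustration of the routing idea (a single input queue feeding several destination queues): the toy policy below picks the destination with the shortest backlog, which is a stand-in rather than the Job Router's actual logic:

```python
# Toy router: drain a single input queue and route each job to whichever
# destination queue currently has the fewest pending jobs.
from collections import deque

input_queue = deque(f"job-{i}" for i in range(10))
destinations = {"grid-site-A": [], "grid-site-B": [], "local-pool": []}

def least_loaded(dests):
    return min(dests, key=lambda name: len(dests[name]))

while input_queue:
    job = input_queue.popleft()
    target = least_loaded(destinations)
    destinations[target].append(job)   # in reality: transform and resubmit the job

for name, jobs in destinations.items():
    print(name, "->", jobs)
```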
FPC gap estimations based on map alignments between optical maps and the in silico maps of FPC contig sequence pseudomolecules and B73 RefGen_v1 reference chromosomes.
(0.12 MB PDF)
Ordering FPC contig sequence pseudomolecules based on the map alignments between optical maps and the in silico maps of the FPC contig sequence pseudomolecules.
(0.10 MB PDF)
Discordances in the well-aligned map segments between optical maps and the in silico maps of the B73 RefGen_v1 reference chromosomes.
(0.25 MB PDF)
Using optical contigs to estimate sizes of gaps between adjacent FPC contigs. (A) Restriction map view (box = restriction fragment) of FPC contigs aligned to the AGP Chr 6 pseudomolecule; the gap is highlighted. (B) Cartoon showing the alignments of FPC contigs ctg280 and ctg281 to the Chr6 pseudomolecule. A gap of unknown size is illustrated, with gree...
About 85% of the maize genome consists of highly repetitive sequences that are interspersed by low-copy, gene-coding sequences. The maize community has dealt with this genomic complexity by the construction of an integrated genetic and physical map (iMap), but this resource alone was not sufficient for ensuring the quality of the current sequence b...
In large distributed systems, where shared resources are owned by distinct entities, there is a need to reflect resource ownership in resource allocation. An appropriate resource management system should guarantee that resource owners have access to a share of resources proportional to the share they provide. In order to achieve that, some policie...
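A small sketch of the proportional-share idea stated above: each owner's allocation is its contributed fraction of the currently available slots (owner names and numbers are made up):

```python
# Illustrative proportional-share allocation: owners receive slots in
# proportion to the resources they contributed to the shared pool.
contributions = {"group-A": 400, "group-B": 100, "group-C": 500}  # cores contributed
total_slots = 800                                                  # slots currently free

total_contrib = sum(contributions.values())
allocation = {
    owner: round(total_slots * share / total_contrib)
    for owner, share in contributions.items()
}
print(allocation)   # {'group-A': 320, 'group-B': 80, 'group-C': 400}
```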
Utility grids such as the Amazon EC2 and Amazon S3 clouds offer computational and storage resources that can be used on-demand for a fee by compute- and data-intensive applications. The cost of running an application on such a cloud depends on the compute, storage and communication resources it will provision and consume. Different execution plans...
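A back-of-the-envelope sketch of the kind of cost comparison the abstract describes, with hypothetical per-unit prices and plan parameters (not Amazon's actual rates):

```python
# Illustrative cost model: total cost = compute + storage + data transfer,
# evaluated for two hypothetical execution plans of the same application.
PRICE = {"cpu_hour": 0.10, "gb_month_storage": 0.023, "gb_transfer": 0.09}

def plan_cost(cpu_hours, gb_stored_months, gb_transferred):
    return (cpu_hours * PRICE["cpu_hour"]
            + gb_stored_months * PRICE["gb_month_storage"]
            + gb_transferred * PRICE["gb_transfer"])

# Plan 1: recompute intermediate data; Plan 2: store and reuse it.
print("plan 1 ($):", round(plan_cost(cpu_hours=500, gb_stored_months=10, gb_transferred=50), 2))
print("plan 2 ($):", round(plan_cost(cpu_hours=300, gb_stored_months=200, gb_transferred=50), 2))
```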
In this paper, we develop data-driven strategies for batch computing schedulers. Current CPU-centric batch schedulers ignore the data needs within workloads and execute them by linking them transparently and directly to their needed data. When scheduled on remote computational resources, this elegant solution of direct data access can incur an orde...
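A toy sketch of the data-driven placement idea: prefer a node that already holds a job's input and fall back to remote data access otherwise (the node names and cache map are hypothetical):

```python
# Toy data-aware placement: schedule each job on a node that already caches
# its input dataset; otherwise pick any node and pay the remote-access cost.
cache = {"node1": {"dataset-A"}, "node2": {"dataset-B"}, "node3": set()}
jobs = [("job1", "dataset-A"), ("job2", "dataset-B"), ("job3", "dataset-C")]

for job, dataset in jobs:
    local = [n for n, held in cache.items() if dataset in held]
    node = local[0] if local else "node3"          # fallback: remote/direct access
    mode = "local data" if local else "remote data access"
    print(f"{job} -> {node} ({mode})")
```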
Distributed computing, and in particular Grid computing, enables physicists to use thousands of CPU days worth of computing every day, by submitting thousands of compute jobs. Unfortunately, a small fraction of such jobs regularly fail; the reasons vary from disk and network problems to bugs in the user code. A subset of these failures result in jo...
SCMSWeb-Condor interface development is an international collaborative effort among some PRAGMA member institutions (ThaiGrid, Thailand; KISTI, Korea; SDSC, USA) and Condor team at University of Wisconsin. It aims at utilizing the rich information collected by the grid monitoring system - SCMSWeb, to enable Condor-G to provide more intelligent grid...
Utility grids such as the Amazon EC2 cloud and Amazon S3 offer computational and storage resources that can be used on-demand for a fee by compute and data-intensive applications. The cost of running an application on such a cloud depends on the compute, storage and communication resources it will provision and consume. Different execution plans of...
All predicted loci associated with putative binding sites for LexA, sigma-54, or Fur
(0.54 MB XLS)
A summary of the numbers and types of loci identified in each of the 932 replicons included in the SIPHT search
(0.29 MB XLS)
All predicted loci with homology to previously predicted and/or confirmed ncRNAs in our database
(5.43 MB XLS)
All predicted loci lacking homology to but sharing conserved synteny with previously predicted and/or confirmed regulatory RNAs in our database
(2.98 MB XLS)
The Center for Enabling Distributed Petascale Science is developing services to enable researchers to manage large, distributed datasets. The center's projects focus on three areas: tools for reliable placement of data, issues involving failure detection and failure diagnosis in distributed systems, and scalable services that process requests to acce...
The Open Science Grid (OSG) includes work to enable new science, new scientists, and new modalities in support of computationally based research. There are frequently significant sociological and organizational changes required in transformation from the existing to the new. OSG leverages its deliverables to the large-scale physics experiment membe...
The Open Science Grid (OSG) provides a distributed facility where the Consortium members provide guaranteed and opportunistic access to shared computing and storage resources. The OSG project[1] is funded by the National Science Foundation and the Department of Energy Scientific Discovery through Advanced Computing program. The OSG project provides...
Background:
Diverse bacterial genomes encode numerous small non-coding RNAs (sRNAs) that regulate myriad biological processes. While bioinformatic algorithms have proven effective in identifying sRNA-encoding loci, the lack of tools and infrastructure with which to execute these computationally demanding algorithms has limited their utilization. G...
[This corrects the article on p. e3197 in vol. 3, PMID: 18787707.].
The BioMagResBank (BMRB: www.bmrb.wisc.edu) is a repository for experimental and derived data gathered from nuclear magnetic resonance (NMR) spectroscopic studies of biological molecules. BMRB is a partner in the Worldwide Protein Data Bank (wwPDB). The BMRB archive consists of four main data depositories: (i) quantitative NMR spectral parameters f...
The grid vision of a single computing utility has yet to materialize: while many grids with thousands of processors each exist, most work in isolation. An important obstacle for the effective and efficient inter-operation of grids is the problem of resource selection. In this paper we propose a solution to this pro...