John Paul Walters

University at Buffalo, The State University of New York, Buffalo, NY, United States

Publications (24) · 3.45 Total Impact

  • ABSTRACT: We investigate the potential of using a graphics processing unit (GPU) for Monte Carlo (MC)-based radiation dose calculations. The percent depth dose (PDD) of photons in a medium with known absorption and scattering coefficients is computed using an MC simulation running on both a standard CPU and a GPU. We demonstrate that the GPU's capability for massively parallel processing provides significant acceleration of the MC calculation and offers a clear advantage for distributed stochastic simulations on a single computer. Harnessing this potential of GPUs will help in the early adoption of MC for routine planning in a clinical environment.
    Journal of Medical Physics 04/2010; 35(2):120-2.
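    A minimal CPU-side sketch of the kind of MC photon transport described above, with NumPy vectorization standing in for the GPU's one-thread-per-photon parallelism; the coefficients, slab geometry, and isotropic scattering model are illustrative assumptions, not values from the paper:
```python
import numpy as np

# Photons step through a homogeneous slab with assumed absorption (MU_A)
# and scattering (MU_S) coefficients; energy deposited per depth bin gives
# the percent depth dose (PDD). On a GPU, each history would be a thread.
rng = np.random.default_rng(0)
MU_A, MU_S = 0.03, 0.15                 # 1/mm, illustrative values only
MU_T = MU_A + MU_S
N, DEPTH, BINS = 100_000, 100.0, 100

dose = np.zeros(BINS)
z = np.zeros(N)                         # photon depths
cos_t = np.ones(N)                      # direction cosines along the beam
w = np.ones(N)                          # statistical weights
alive = np.ones(N, dtype=bool)

while alive.any():
    n = int(alive.sum())
    z[alive] += -np.log(rng.random(n)) / MU_T * cos_t[alive]  # free flight
    dep = w[alive] * (MU_A / MU_T)      # implicit-capture energy deposit
    idx = np.clip((z[alive] / DEPTH * BINS).astype(int), 0, BINS - 1)
    np.add.at(dose, idx, dep)
    w[alive] -= dep
    cos_t[alive] = 2.0 * rng.random(n) - 1.0   # isotropic rescatter
    alive &= (z > 0) & (z < DEPTH) & (w > 1e-4)

pdd = 100.0 * dose / dose.max()         # normalize to percent depth dose
print(pdd[:5])
```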
  • J.P. Walters, V. Chaudhary
    ABSTRACT: As computational clusters increase in size, their mean time to failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while dedicated checkpointing networks and storage systems prove too expensive. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark in tests of up to 256 nodes, showing that checkpointing and replication can be achieved with much lower overhead than current techniques provide. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.
    IEEE Transactions on Parallel and Distributed Systems 08/2009; 20:997-1010. · 1.80 Impact Factor
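    The core idea, local checkpoints plus replicas pushed asynchronously to a partner node instead of central storage, can be sketched with mpi4py; this is an illustration of the scheme, not the authors' LAM/MPI code, and the ring placement, file names, and tags are invented:
```python
from mpi4py import MPI
import pickle

# Each rank checkpoints locally, then ships a replica to a partner rank
# with a nonblocking send, overlapping replication with computation.
# Run with: mpiexec -n 4 python replicate.py
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
partner = (rank + 1) % size             # simple ring placement for replicas

def checkpoint(state, step):
    blob = pickle.dumps((step, state))
    with open(f"ckpt_{rank}_{step}.bin", "wb") as f:    # local copy
        f.write(blob)
    return comm.isend(blob, dest=partner, tag=step)     # async replica push

state = {"x": rank}
for step in range(3):
    state["x"] += 1                     # stand-in for real computation
    req = checkpoint(state, step)
    blob = comm.recv(source=(rank - 1) % size, tag=step)
    with open(f"replica_{(rank - 1) % size}_{step}.bin", "wb") as f:
        f.write(blob)                   # hold the partner's replica locally
    req.wait()                          # replica delivered before moving on
```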
  • J.P. Walters, R. Darole, V. Chaudhary
    ABSTRACT: We present PIO-HMMER, an enhanced version of MPI-HMMER. PIO-HMMER improves on MPI-HMMER's scalability through the use of parallel I/O and a parallel file system. In addition, we describe several enhancements, including a new load-balancing scheme, enhanced post-processing, improved double-buffering support, and asynchronous I/O for returning scores to the master node. Our enhancements to the core HMMER search tools, hmmsearch and hmmpfam, allow for scalability up to 256 nodes, where MPI-HMMER previously did not scale beyond 64 nodes. We show that our performance enhancements allow hmmsearch to achieve between 48x and 221x speedup using 256 nodes, depending on the size of the input HMM and the database. Further, we show that by integrating database caching with PIO-HMMER's hmmpfam tool we can achieve up to 328x speedup using only 256 nodes.
    IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009); 06/2009
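    A sketch of the master/worker load-balancing pattern at play here: workers pull database chunk indices on demand and read their own chunks, which is where parallel I/O enters. The chunk count, tags, and file layout are invented for illustration; this is not PIO-HMMER's actual protocol:
```python
from mpi4py import MPI

# Run with: mpiexec -n 8 python balance.py
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
N_CHUNKS, STOP = 64, -1                          # invented sizes/sentinel

if rank == 0:                                    # master: deal out chunk ids
    status = MPI.Status()
    next_chunk, done = 0, 0
    while done < size - 1:
        comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
        if next_chunk < N_CHUNKS:
            comm.send(next_chunk, dest=status.Get_source(), tag=2)
            next_chunk += 1
        else:
            comm.send(STOP, dest=status.Get_source(), tag=2)
            done += 1
else:                                            # worker: pull until STOP
    while True:
        comm.send(rank, dest=0, tag=1)           # request more work
        chunk = comm.recv(source=0, tag=2)
        if chunk == STOP:
            break
        # Here a real worker would seek to its chunk of the database on a
        # parallel file system, score it, and return hits asynchronously.
```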
  • John Paul Walters, Vipin Chaudhary
    ABSTRACT: Virtualization is a common strategy for improving the utilization of existing computing resources, particularly within data centers. However, its use for high performance computing (HPC) applications is currently limited despite its potential for both improving resource utilization and providing resource guarantees to its users. In this article, we systematically evaluate three major virtual machine implementations for computationally intensive HPC applications using various standard benchmarks. Using VMware Server, Xen, and OpenVZ, we examine the suitability of full virtualization (VMware), paravirtualization (Xen), and operating system-level virtualization (OpenVZ) in terms of network utilization, SMP performance, file system performance, and MPI scalability. We show that the operating system-level virtualization provided by OpenVZ offers the best overall performance, particularly for MPI scalability. With the knowledge gained from our VM evaluation, we extend OpenVZ to include support for checkpointing and fault tolerance for MPI-based virtual server distributed computing.
    The Journal of Supercomputing 01/2009; 50:209-239. · 0.92 Impact Factor
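    Comparisons like this rest on microbenchmarks; an MPI ping-pong of the kind commonly used to expose a virtualization layer's network overhead might look as follows (message size and repetition count are arbitrary choices, not the paper's):
```python
from mpi4py import MPI
import numpy as np

# Round-trip latency between two ranks; run the same binary inside each
# VM type and compare. Run with: mpiexec -n 2 python pingpong.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1 << 16, dtype=np.uint8)     # 64 KiB message
REPS = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    if rank == 0:
        comm.Send(buf, dest=1); comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0); comm.Send(buf, dest=0)
t1 = MPI.Wtime()
if rank == 0:
    print(f"round trip: {(t1 - t0) / REPS * 1e6:.1f} us")
```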
  • ABSTRACT: In this paper we present the results of parallelizing two life sciences applications, Markov random field-based (MRF) liver segmentation and HMMER's Viterbi algorithm, using GPUs. We relate our experiences in porting both applications to the GPU, as well as the techniques and optimizations that are most beneficial. The unique characteristics of both algorithms are demonstrated by implementations on an NVIDIA 8800 GTX Ultra using the CUDA programming environment. We test multiple enhancements in our GPU kernels in order to demonstrate the effectiveness of each strategy. Our optimized MRF kernel achieves over 130x speedup, and our hmmsearch implementation achieves up to 38x speedup. We show that the difference in speedup between MRF and hmmsearch is due primarily to the frequency at which hmmsearch must read from the GPU's DRAM.
    23rd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2009, Rome, Italy, May 23-29, 2009; 01/2009
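    For context, a toy version of the Viterbi dynamic program that an hmmsearch kernel parallelizes; HMMER's Plan7 profile HMM adds match/insert/delete states, so this generic recurrence only shows the row-by-row dependency whose repeated reads drive the DRAM traffic mentioned above:
```python
import numpy as np

def viterbi(log_trans, log_emit, obs):
    """log_trans: (S, S), log_emit: (S, A), obs: sequence of symbol ids."""
    S = log_trans.shape[0]
    v = np.full(S, -np.inf)
    v[0] = 0.0                                    # start in state 0
    for sym in obs:                               # one DP row per residue
        # best predecessor for every state, vectorized across states
        v = np.max(v[:, None] + log_trans, axis=0) + log_emit[:, sym]
    return v.max()                                # best final log score

rng = np.random.default_rng(1)
T = np.log(rng.dirichlet(np.ones(4), size=4))     # random 4-state model
E = np.log(rng.dirichlet(np.ones(20), size=4))    # 20-letter alphabet
print(viterbi(T, E, rng.integers(0, 20, size=50)))
```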
  • ABSTRACT: Virtualization is a common strategy for improving the utilization of existing computing resources, particularly within data centers. However, its use for high performance computing (HPC) applications is currently limited despite its potential for both improving resource utilization and providing resource guarantees to its users. This paper systematically evaluates three VMs for computationally intensive HPC applications using standard benchmarks. Using VMware Server, Xen, and OpenVZ, we examine the suitability of full virtualization, paravirtualization, and operating system-level virtualization in terms of network utilization, SMP performance, file system performance, and MPI scalability. We show that the operating system-level virtualization provided by OpenVZ offers the best overall performance, particularly for MPI scalability.
    22nd International Conference on Advanced Information Networking and Applications, AINA 2008, Ginowan, Okinawa, Japan, March 25-28, 2008; 01/2008
  • ABSTRACT: In contrast with the converging length scales of atomistic simulations and experimental nanoscience, large time-scale discrepancies still remain, due to the time-scale limitations of molecular dynamics. We briefly review two recently developed methods, derived from transition state theory, for accelerating molecular dynamics simulations of infrequent-event processes. These techniques, parallel replica dynamics and hyperdynamics, can reach simulation times several orders of magnitude longer than direct molecular dynamics while retaining full atomistic detail.
    Keywords: infrequent events, accelerated dynamics, hyperdynamics, parallel replica dynamics
    ISCA 21st International Conference on Parallel and Distributed Computing and Communication Systems, PDCCS 2008, September 24-26, 2008, Holiday Inn Downtown-Superdome, New Orleans, Louisiana, USA; 01/2008
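    The statistical trick behind parallel replica dynamics shows up even in a toy model, assuming a memoryless (exponential) escape process: the first transition on any of M replicas is accepted, and the simulation clock advances by the sum of all replica times, so the expected escape time is preserved while per-replica wall-clock time drops by roughly 1/M. The rate and replica count below are invented:
```python
import numpy as np

rng = np.random.default_rng(2)
RATE, M = 1e-3, 16                    # escape rate per unit time; replicas

def serial_escape_time():
    return rng.exponential(1.0 / RATE)

def parrep_escape_time():
    # first event on any replica wins; total accumulated time is M * t_first
    t_first = rng.exponential(1.0 / RATE, size=M).min()
    return M * t_first

serial = np.mean([serial_escape_time() for _ in range(10_000)])
parrep = np.mean([parrep_escape_time() for _ in range(10_000)])
print(f"mean escape time: serial {serial:.0f}, parallel-replica {parrep:.0f}")
# Both estimates agree for a Poisson process, but each parallel-replica
# event costs only ~1/M as much wall-clock time per replica.
```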
  • John Paul Walters, Joseph Landman, Vipin Chaudhary
    04/2007: pages 51-70; ISBN: 9780470191637
  • ABSTRACT: HMMER, based on the profile Hidden Markov Model (HMM), is one of the most widely used sequence database searching tools, allowing researchers to compare HMMs to sequence databases or sequences to HMM databases. Such searches often take many hours and consume a great number of CPU cycles on modern computers. We present a cluster-enabled hardware/software-accelerated implementation of the HMMER search tool hmmsearch. Our results show that combining the parallel efficiency of a cluster with one or more high-speed hardware accelerators (FPGAs) can significantly improve performance for even the most time-consuming searches, often reducing search times from several hours to minutes.
    Journal of VLSI Signal Processing 01/2007; 48:223-238. · 0.73 Impact Factor
  • John Paul Walters, Vipin Chaudhary
    ABSTRACT: As computational clusters increase in size, their mean time to failure reduces. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark in tests of up to 256 nodes, showing that checkpointing and replication can be achieved with much lower overhead than that provided by current techniques.
    High Performance Computing - HiPC 2007, 14th International Conference, Goa, India, December 18-21, 2007, Proceedings; 01/2007
  • John Paul Walters, Vipin Chaudhary
    Proceedings of the ISCA 20th International Conference on Parallel and Distributed Computing Systems, September 24-26, 2007, Las Vegas, Nevada, USA; 01/2007
  • J.P. Walters, B. Qudah, V. Chaudhary
    ABSTRACT: Due to the ever-increasing size of sequence databases, it has become clear that faster techniques must be employed to perform biological sequence analysis in a reasonable amount of time. Exploiting the inherent parallelism between sequences is a common strategy. In this paper we enhance both the fine-grained and coarse-grained parallelism within the HMMER sequence analysis suite. Our strategies are complementary to one another and, where necessary, can be used as drop-in replacements for the strategies already provided within HMMER. We use conventional processors (Intel Pentium IV Xeon) as well as the freely available MPICH parallel programming environment. Our results show that the MPICH implementation greatly outperforms the PVM HMMER implementation, and our SSE2 implementation also lends greater computational power at no cost to the user.
    20th International Conference on Advanced Information Networking and Applications (AINA 2006); 05/2006
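    The fine-grained half of such a speedup comes from evaluating the Viterbi inner loop's add/max chains several lanes at a time. NumPy vectorization stands in for the SSE2 intrinsics in this rough illustration; the sizes are arbitrary:
```python
import numpy as np, timeit

S = 256
rng = np.random.default_rng(3)
v = rng.random(S, dtype=np.float32)       # scores from the previous DP row
t = rng.random((S, S), dtype=np.float32)  # transition scores

def scalar_row():                       # one cell at a time, like plain C
    out = np.empty(S, dtype=np.float32)
    for j in range(S):
        best = np.float32("-inf")
        for i in range(S):
            best = max(best, v[i] + t[i, j])
        out[j] = best
    return out

def vector_row():                       # whole row at once, SIMD-style
    return np.max(v[:, None] + t, axis=0)

assert np.allclose(scalar_row(), vector_row())
print("scalar:", timeit.timeit(scalar_row, number=1))
print("vector:", timeit.timeit(vector_row, number=1))
```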
  • J. Landman, J. Ray, J.P. Walters
    ABSTRACT: HMMer is a widely used tool for protein sequence homology detection, as well as for functional annotation of homologous protein sequences and protein family classification. The HMMer program is based upon a Viterbi algorithm coded in C and is quite time consuming. Significant efforts have been undertaken to accelerate this program using custom special-purpose hardware, as well as more recent attempts to leverage commodity special-purpose hardware. This work reports on several minimally invasive code refactoring efforts independently undertaken by the authors, and their significant impact on the wall-clock execution time of the entire program for various test cases.
    20th International Conference on Advanced Information Networking and Applications (AINA 2006); 05/2006
  • John Paul Walters, Hai Jiang, Vipin Chaudhary
    ABSTRACT: This paper presents a mechanism to run parallel applications in heterogeneous, dynamic environments while maintaining thread synchrony. A heterogeneous software DSM is used to provide synchronization constructs similar to Pthreads, while providing for individual thread mobility. An asymmetric data conversion scheme is adopted to restore thread states among different computers during thread migration. Within this framework we create a mechanism capable of maintaining the distributed state between migrated (and possibly heterogeneous) threads. We show that thread synchrony can be maintained with minimal overhead and minimal burden to the programmer.
    2006 International Conference on Parallel Processing Workshops (ICPP Workshops 2006), 14-18 August 2006, Columbus, Ohio, USA; 01/2006
  • John Paul Walters, Vipin Chaudhary
    ABSTRACT: In its simplest form, checkpointing is the act of saving a program's computation state in a form external to the running program, e.g., to a filesystem. The checkpoint files can then be used to resume computation upon failure of the original process(es), hopefully with minimal loss of computing work. A checkpoint can be taken using a variety of techniques at every level of the system, from special hardware/architectural checkpointing features to modification of the user's source code. This survey discusses the various techniques used in application-level checkpointing, with special attention paid to techniques for checkpointing parallel and distributed applications.
    Distributed Computing and Internet Technology, Third International Conference, ICDCIT 2006, Bhubaneswar, India, December 20-23, 2006, Proceedings; 01/2006
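    In its simplest form, application-level checkpointing as surveyed here reduces to the program serializing its own state and resuming from the last saved copy. A minimal sketch follows; the file name and state layout are application-specific inventions:
```python
import os, pickle, tempfile

CKPT = "app.ckpt"

def save_checkpoint(state):
    # write-then-rename so a crash mid-write never corrupts the checkpoint
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)       # resume after a failure
    return {"i": 0, "total": 0.0}       # fresh start

state = load_checkpoint()
for i in range(state["i"], 1_000):
    state["total"] += i * 0.5           # stand-in for real work
    state["i"] = i + 1
    if state["i"] % 100 == 0:
        save_checkpoint(state)          # periodic checkpoint
print(state["total"])
```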
  • Hai Jiang, Vipin Chaudhary, John Paul Walters
    ABSTRACT: Process/thread migration and checkpointing schemes support load balancing, load sharing, and fault tolerance to improve application performance and system resource usage on workstation clusters. To enable these schemes to work in heterogeneous environments, we have developed an application-level migration and checkpointing package, MigThread, to abstract computation states at the language level for portability. To save and restore such states across different platforms, we propose a novel "receiver makes right" (RMR) data conversion method, called coarse-grain tagged RMR (CGT-RMR), for efficient data marshalling and unmarshalling. Unlike common data representation standards, CGT-RMR does not require programmers to analyze data types, flatten aggregate types, and encode/decode scalar types explicitly within programs. With help from MigThread's type system, CGT-RMR assigns a tag to each data type and converts non-scalar types as a whole. This speeds up the data conversion process and eases the programming task dramatically, especially for the large data chunks common to migration and checkpointing. Armed with this "plug-and-play" style data conversion scheme, MigThread has been ported to work in heterogeneous environments. Microbenchmarks and performance measurements within the SPLASH-2 suite illustrate the efficiency of the data conversion process.
    2003 International Conference on Parallel Processing (ICPP 2003); 11/2003
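    A toy rendition of the "receiver makes right" idea: the sender transmits an aggregate in its native byte order plus a small tag describing the layout, and only the receiver converts, in one pass over the whole block. The tag format below is invented for illustration; CGT-RMR derives its real tags from MigThread's type system:
```python
import struct, sys

def pack_native(values, fmt="ihd"):
    # aggregate sent as one block, native byte order, no sender-side work
    tag = (sys.byteorder + ":" + fmt).encode()
    payload = struct.pack("=" + fmt, *values)
    return struct.pack("B", len(tag)) + tag + payload

def unpack_rmr(blob):
    tag_len = blob[0]
    order, fmt = blob[1:1 + tag_len].decode().split(":")
    endian = "<" if order == "little" else ">"
    # receiver converts the whole aggregate at once, only if orders differ
    return struct.unpack(endian + fmt, blob[1 + tag_len:])

print(unpack_rmr(pack_native((7, 2, 3.5))))   # -> (7, 2, 3.5)
```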
  • ABSTRACT: Computational clusters, the grids that federate them, and the applications that utilize their significant computing potential all continue to grow with advances in hardware technology, cluster management, and grid middleware solutions. As they do, the likelihood that large-scale, long-running grid and cluster applications will have to deal with underlying node unavailability and cluster failure increases as well. The primary weapons against this problem (checkpointing, migration, replication, and effective scheduling) do not currently scale well enough to be effective for the largest, most important grid and cluster applications. Complementary research efforts in upstate New York are beginning to address this issue at a variety of levels, including: (i) low-level mechanisms that will predict individual processor failures by observing and reacting to low-level indicators in their chip state; (ii) scalable cluster-level checkpointing solutions that do not require centralized storage for replicated checkpoints; and (iii) grid-level efforts to differentiate between node unavailability states, to characterize the behavior of nodes, to predict their near-future unavailability, and to make better grid scheduling decisions based on this information and on the characteristics and capabilities of applications.

Publication Stats

181 Citations
3.45 Total Impact Points

Institutions

  • 2009
    • University at Buffalo, The State University of New York
      • Department of Computer Science and Engineering
      Buffalo, NY, United States
  • 2007–2009
    • State University of New York
      New York City, New York, United States
    • University of Sydney
      Sydney, New South Wales, Australia
  • 2003–2007
    • Wayne State University
      • Department of Computer Science
      Detroit, Michigan, United States